1. Create the yangguang project

1.1 Create the project

```bash
scrapy startproject yangguang
cd yangguang
```
1.2 Create the spider file

```bash
scrapy genspider yg wz.sun0769.com
```
1.3 Initialization
```python
start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1/']
```
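For context, this line goes in the spider file that `genspider` created. A minimal sketch of yangguang/spiders/yg.py at this point (the class name and attributes follow from the `scrapy genspider yg wz.sun0769.com` command; the generated template on your Scrapy version may differ slightly):

```python
import scrapy


class YgSpider(scrapy.Spider):
    # name and allowed_domains come from `scrapy genspider yg wz.sun0769.com`
    name = "yg"
    allowed_domains = ["wz.sun0769.com"]
    # replace the generated start URL with the politics list page
    start_urls = ['http://wz.sun0769.com/political/index/politicsNewest?id=1&page=1/']

    def parse(self, response):
        pass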
(Important!!!) Add a download-delay setting in settings.py (the author crawled too fast and got the whole dorm's IP banned by the site...):
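The original snippet for this setting is not reproduced in this copy of the post. As a minimal sketch, Scrapy's standard DOWNLOAD_DELAY setting does the throttling (the 3-second value below is an assumption, not from the original):

```python
# settings.py: wait between requests so the site does not ban your IP
# (3 seconds is an assumed value; choose something conservative)
DOWNLOAD_DELAY = 3
```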
Also set the User-Agent for the request headers (it differs for every site; open the browser's Network panel and pick any request to see it):
```python
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
```
2. items.py
(After analyzing the page) define the item fields in items.py:
```python
import scrapy


class YangguangItem(scrapy.Item):
    status = scrapy.Field()        # processing status of the complaint
    title = scrapy.Field()         # list-page title
    href = scrapy.Field()          # link to the detail page
    publish_date = scrapy.Field()  # publication date
    content_img = scrapy.Field()   # image URLs on the detail page
    content_text = scrapy.Field()  # text content of the detail page
```
Then import the item class in the spider file yg.py:

```python
from yangguang.items import YangguangItem
```
3. Crawling

3.1 Crawl the list-page entries

```python
def parse(self, response):
    url = "http://wz.sun0769.com"
    li_list = response.xpath('//*[@class="title-state-ul"]/li')
    for li in li_list:
        item = YangguangItem()
        item["status"] = li.xpath('./*[@class="state2"]/text()').extract_first()
        item["title"] = li.xpath('./*[@class="state3"]/a/text()').extract_first()
        item["href"] = li.xpath('./*[@class="state3"]/a/@href').extract_first()
        item["publish_date"] = li.xpath('./*[@class="state5 "]/text()').extract_first()
        # the list page gives a relative link; prepend the site root
        item["href"] = url + str(item["href"])
        print(item["href"])
        # request the detail page and pass the partly-filled item along via meta
        yield scrapy.Request(
            str(item["href"]),
            callback=self.parse_detail,
            meta={"item": item}
        )
```
3.2 Crawl the content detail page

```python
def parse_detail(self, response):
    # recover the item that parse() attached to the request
    item = response.meta["item"]
    item["content_text"] = response.xpath('//*[@class="details-box"]/pre/text()').extract()
    item["content_img"] = response.xpath('//*[@class="clear details-img-list Picture-img"]/img/@src').extract()
    print(item["title"])
    # hand the completed item to the pipeline
    yield item
```
3.3 Request the next page at the end of parse

```python
# placed inside parse(), after the for loop
next_url = response.xpath('//*[@class="arrow-page prov_rota"]/@href').extract_first()
next_url = url + next_url
# stop once the page counter in the paging box reads "3"
# (extract_first() is needed so we compare text, not a selector list)
if response.xpath('//*[@class="mr-three paging-box"]/a[4]/text()').extract_first() != "3":
    yield scrapy.Request(
        next_url,
        callback=self.parse
    )
```
3.4 pipelines.py

```python
import re


class YangguangPipeline:
    def process_item(self, item, spider):
        # clean up the scraped image URLs and text before returning the item
        item["content_img"] = self.content_process(item["content_img"])
        item["content_text"] = self.content_process(item["content_text"])
        return item

    def content_process(self, content):
        # strip non-breaking spaces and other whitespace, then drop empty strings
        content = [re.sub(r"\xa0|\s", "", i) for i in content]
        content = [i for i in content if len(i) > 0]
        return content
```
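For the pipeline to actually run it also has to be registered in settings.py. A minimal sketch (the priority value 300 is the usual Scrapy convention, not taken from the original post):

```python
# settings.py: register the pipeline so Scrapy calls process_item on each item
ITEM_PIPELINES = {
    'yangguang.pipelines.YangguangPipeline': 300,
}
```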
4. Results
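The original post shows the console output as a screenshot here. To reproduce the run, a typical command looks like this (the output filename is only an example):

```bash
# run the spider; -o dumps the cleaned items to a JSON file
scrapy crawl yg -o result.json
```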