
A beginner uses Scrapy to scrape a site's MeiNv photo galleries

Author: zhouxinhuagg · Published: 2020-01-07 · 4.08K views

I had long wanted to scrape this site's meinv galleries to my local drive. My first attempt was a single-file crawler built with bs4, requests, and similar libraries, but every run would grab about 8 images and then stall with no error message. Later I learned the Scrapy framework and liked it a lot, so by borrowing from other people's code and writing some of my own, I put together a beginner-level Scrapy project that downloads the images and files them into per-gallery directories. Here it is, for what it's worth.
      OK, on to the main topic.
     1. Since my Windows 10 machine has both Python 2.7 and 3.7 installed, I create the Scrapy project with python3 -m scrapy startproject fa24spider (the skeleton it generates is sketched below).
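
For orientation, this is the standard skeleton that startproject lays down (Scrapy's default layout; every file starts as a stub):

fa24spider/
    scrapy.cfg            # deploy configuration
    fa24spider/           # the project's Python package
        __init__.py
        items.py          # field definitions (step 2)
        middlewares.py
        pipelines.py      # download and move logic (step 5)
        settings.py       # project settings (step 4)
        spiders/          # spider code lives here (step 3)
            __init__.py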
     2. Open the project directory in PyCharm and start with items.py. Declare every field you might need up front; anything you end up not using can simply be deleted or commented out later.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time    : 2020/1/7 11:13
# @Author  : ZekiLee

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
 
class Fa24SpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # gallery title
    pic_title = scrapy.Field()
    # image URL
    pic_url = scrapy.Field()
    # image file name
    pic_name = scrapy.Field()
    # save path
    pic_path = scrapy.Field()
    # referer sent with image requests, to get past the site's anti-hotlinking check
    referer = scrapy.Field()
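
A quick aside: a scrapy.Item behaves like a dict that only accepts its declared fields, which catches typos early. A minimal sanity check, assuming the class above is importable (the values are made up for illustration):

from fa24spider.items import Fa24SpiderItem

item = Fa24SpiderItem()
item["pic_title"] = "demo gallery"  # hypothetical value
item["pic_name"] = "demo.jpg"
print(dict(item))  # {'pic_title': 'demo gallery', 'pic_name': 'demo.jpg'}

# Assigning an undeclared field raises KeyError, e.g.:
# item["pic_titel"] = "oops"  # KeyError: 'Fa24SpiderItem does not support field: pic_titel'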

3. Now for the spider itself, spiders.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time    : 2020/1/7 11:13
# @Author  : ZekiLee
import scrapy
from fa24spider.items import Fa24SpiderItem
 
class SpidersSpider(scrapy.Spider):
    name = 'spiders'
    allowed_domains = ['24fa.top']
    top_url = "https://www.24fa.top"
    start_urls = ['https://www.24fa.top/MeiNv/index.html']
 
    def parse(self, response):
        """
        每页的套图链接
        """
        title_link_list = response.xpath('//td[@align="center"]/a/@href').extract()
        for title_link in title_link_list:
            title_url = title_link.replace("..", self.top_url)
            yield scrapy.Request(url=title_url, callback=self.pic_parse)
 
        # Handle pagination: if there is a next page, extract its URL and yield a Request back to parse
        next_page_link = response.xpath('//div[@class="pager"]//a[@title="后页"]/@href').extract_first("")
        if next_page_link:
            next_page_url = next_page_link.replace("..", self.top_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
 
    def pic_parse(self, response):
        """
        Inside a gallery, collect the image links on the current page
        """
        title = response.xpath('//h1[@class="title2"]/text()').extract_first()
        # The gallery page URL doubles as the referer for the image downloads
        referer = response.url
        pic_url_list = response.xpath('//div[@id="content"]//img/@src').extract()
        for pic_link in pic_url_list:
            # Yield one item per image; reusing a single item and yielding once
            # after the loop would keep only the last image on the page
            item = Fa24SpiderItem()
            item["pic_title"] = title
            item["referer"] = referer
            item["pic_name"] = pic_link[-10:]  # file name = last 10 characters of the link
            item["pic_url"] = pic_link.replace("../..", self.top_url)
            yield item
 
        # The gallery itself is paginated too, so handle its next page the same way
        next_page_link = response.xpath('//a[@title="下一页"]/@href').extract_first("")
        if next_page_link:
            next_page_url = next_page_link.replace("../..", self.top_url)
            yield scrapy.Request(url=next_page_url, callback=self.pic_parse)
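
A side note on the .replace("..", self.top_url) trick: it works here because this site's hrefs always begin with ../ or ../.., but the general-purpose way to absolutize a relative link is urljoin, which resolves it against the current page URL. A small standalone sketch (both URLs below are made up for illustration):

from urllib.parse import urljoin

page_url = "https://www.24fa.top/MeiNv/index.html"  # hypothetical current page
relative_link = "../MeiNv/index2.html"              # hypothetical next-page href

print(urljoin(page_url, relative_link))
# -> https://www.24fa.top/MeiNv/index2.html

Inside a spider callback, response.urljoin(href) does the same resolution against response.url.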

4. Configure settings.py

 4.1 Set the user-agent

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
# Obey robots.txt rules
# Don't obey robots.txt, or else... you know what happens
ROBOTSTXT_OBEY = False
 
IMAGES_URLS_FIELD = "pic_url"
 
# custom save path
IMAGES_STORE = "G:\\Fa24"
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
 
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# download delay; tune it yourself, but going too fast may get you banned by the site
DOWNLOAD_DELAY = 3
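
The boilerplate comment above also points to AutoThrottle. As an optional alternative to a fixed delay, Scrapy can adapt the delay to server latency; a minimal sketch using the standard AutoThrottle settings (the numbers are guesses, not tuned values):

# Optional: adaptive delay instead of a fixed DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3           # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10            # ceiling for the adaptive delay
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per server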

4.2 Register your own pipeline. Change the trailing number as you like: the smaller the number, the earlier the pipeline runs.

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
#    'fa24spider.pipelines.Fa24SpiderPipeline': 300,
    'fa24spider.pipelines.Fa24TopPipeline': 1,
}

5. With settings in place, it's time for the main event: how the download itself works. See pipelines.py; most of it is explained in the comments.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time    : 2020/1/7 11:13
# @Author  : ZekiLee

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings
import scrapy
import os
import shutil
 
 
class Fa24SpiderPipeline(object):
    def process_item(self, item, spider):
        return item
 
 
class Fa24TopPipeline(ImagesPipeline):
    # Read the save path configured in settings
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    # Override ImagesPipeline's methods
    # Issue the image download request
    def get_media_requests(self, item, info):
        image_url = item["pic_url"]
        # Request headers, mainly to get past the anti-hotlinking check
        header = {
            "referer": item["referer"],
            "user-agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
        }
        yield scrapy.Request(image_url, headers=header)

    def item_completed(self, results, item, info):
        # image_path is the list of downloaded image paths, hash-named and
        # stored under the "full" directory, e.g.
        # image_path = ['full/5db315b42dfc54a0d2bd0488c87913dfc25a71ef.jpg']
        image_path = [x["path"] for ok, x in results if ok]
        if not image_path:
            # Download failed; pass the item through unchanged
            return item

        # Per-gallery save directory: the path from settings plus the gallery title
        new_path = os.path.join(self.IMAGES_STORE, item["pic_title"])

        # Create the directory if it does not exist yet
        if not os.path.exists(new_path):
            os.mkdir(new_path)

        # Move the file from the default download location into the gallery directory.
        # old_path is e.g. G:\Fa24\full/5db315b42dfc54a0d2bd0488c87913dfc25a71ef.jpg;
        # basename() strips the leading "full/" and leaves the hash-named file name.
        pic_name = os.path.basename(image_path[0])
        old_path = os.path.join(self.IMAGES_STORE, image_path[0])
        shutil.move(old_path, os.path.join(new_path, pic_name))
        # The hash name is far too long, so rename the file
        os.rename(os.path.join(new_path, pic_name), os.path.join(new_path, item["pic_name"]))
        # Hand the final image path back to the item
        item["pic_url"] = os.path.join(new_path, item["pic_name"])

        return item
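
For reference, the results argument handed to item_completed is a list of (success, info) tuples, one per requested image; on success the info dict carries the source url, the relative storage path, and a checksum. Illustrated with a made-up entry (only the path value below comes from the example in the comments above):

# Hypothetical shape of `results`, for illustration only
results = [
    (True, {
        "url": "https://www.24fa.top/d/xxx.jpg",  # made-up source URL
        "path": "full/5db315b42dfc54a0d2bd0488c87913dfc25a71ef.jpg",
        "checksum": "89078ba10a4eb78a1a2f1b4a4b0de3ef",  # made-up MD5
    }),
]
image_path = [x["path"] for ok, x in results if ok]
print(image_path)  # ['full/5db315b42dfc54a0d2bd0488c87913dfc25a71ef.jpg']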

 6. All done! In the Terminal at the bottom of PyCharm, run
>python3 -m scrapy crawl spiders
    Then... the images come flooding down!!
