PS: in general, if you are not developing as part of a team, you won't really need Scrapy.

First demo

Create a new Scrapy project

scrapy startproject <project name>

Running it generates the Scrapy project's file structure in the current directory.

Directory structure

D:.
│  scrapy.cfg
│
└─newDemo
   │  items.py
   │  middlewares.py
   │  pipelines.py
   │  settings.py
   │  __init__.py    marks the folder as a Python package; usually left empty, or used to import a few things
   │
   └─spiders
         __init__.py

What each file does

scrapy.cfg    deployment config; used when the Scrapy project is deployed to a server

items.py    declare the fields to be scraped: if you scrape a name, declare name; if you scrape a url, declare url; whatever you scrape, define a field for it

middlewares.py    the middlewares; two are generated by default: the spider middleware NewdemoSpiderMiddleware and the downloader middleware NewdemoDownloaderMiddleware
the downloader middleware NewdemoDownloaderMiddleware defines hooks into the crawl flow, e.g. what to do when the spider is created or when an error occurs

pipelines.py    code that runs after the spider has scraped the data; data cleaning and data storage go here

settings.py    configuration for the whole project, e.g. whether to obey robots.txt (ROBOTSTXT_OBEY is True by default and usually needs to be changed to False)

spiders/    the core spider code lives in this package (its __init__.py just marks it as a package); it can contain more than one .py file, one per target site, so one project can crawl several sites
cd into the newDemo folder and run scrapy genspider newnewTest baidu.com, where newnewTest is the spider/file name and the last argument is the target domain

newnewTest.py

import scrapy

# A project can hold multiple spiders; each one is started by its name
# name below is the spider's name
# allowed_domains is the set of domains the spider is allowed to crawl, because a crawl can otherwise wander off to other urls
# start_urls are the urls the crawl starts from
# parse defines how the response data is processed
class NewnewtestSpider(scrapy.Spider):
    name = "newnewTest"
    allowed_domains = ["baidu.com"]
    start_urls = ["https://baidu.com"]

    def parse(self, response):
        self.log(response.body)

Run the Scrapy project

scrapy crawl <spider name>    e.g. scrapy crawl newnewTest for the demo above

A brief look at the Scrapy architecture

The engine is the core; it is built on Twisted, which can be thought of as asynchronous I/O, so there is no synchronous waiting.

Spiders are the crawler code shown above; there can be several of them.

The Scheduler sits behind the core engine and schedules the requests.

The Downloader is the component that actually performs the downloads; the downloader middleware mentioned above hooks into it.

ITEM PIPELINES covers items.py and pipelines.py, which handle the final data cleaning and storage.

Walking through how Scrapy runs a crawl

1. scrapy crawl newnewTest starts the target spider (Spiders).

2. The start_urls defined in newnewTest.py are read; from there execution goes into scrapy.Spider, the base class of class NewnewtestSpider(scrapy.Spider).

3. In the scrapy.Spider source there is a start_requests method that loops over the start_urls list and yields wrapped Request objects. Note that at this point the requests are only wrapped objects; nothing has actually been sent yet.
def start_requests(self) -> Iterable[Request]:
    ...
    for url in self.start_urls:
        yield Request(url, dont_filter=True)

4. The Request objects are handed through the Twisted-based core engine to the Scheduler. The Scheduler maintains queues and feeds the request objects back one by one (by default Scrapy uses a LIFO queue, i.e. depth-first order, though this can be changed in the settings).

5. The Scheduler passes each request through the core engine to the downloader middleware NewdemoDownloaderMiddleware and on to the downloader, which performs the network request and returns the result.

6. The downloader middleware NewdemoDownloaderMiddleware has two key methods: process_request and process_response.
def process_request(self, request, spider): the request parameter is the wrapped Request object from before; you can manipulate it here, e.g. add headers or cookies, set a proxy, and so on.
def process_response(self, request, response, spider): the method that receives the returned data.

7. After process_response, the response is handed back through the core engine to the spider (Spiders).

8. "Handed to the spider" means handed to the parse method of class NewnewtestSpider(scrapy.Spider); working on the returned data, e.g. matching fields, happens inside parse.

9. The processed response data, combined with the ITEM, goes back to the engine. "Combined with the ITEM" simply means that items.py defines an Item object, and that object is used together with parse above to hold the matched fields.

10. The data stored in the ITEM object is handed to the pipeline for storage.

Of course, you could write all of the processing and storage inside parse, but then it would be no different from a plain hand-rolled crawler rather than code organized with software-engineering ideas in mind; a minimal sketch of how steps 8-10 split the work is given below.
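
To make steps 8-10 concrete, here is a minimal sketch (all names are made up for illustration and are not part of the generated project): the spider's parse fills an Item and yields it, and a pipeline then receives it.

import scrapy

# Hypothetical item (would live in items.py)
class DemoItem(scrapy.Item):
    title = scrapy.Field()

# Hypothetical spider (would live under spiders/)
class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://example.com"]

    def parse(self, response):
        item = DemoItem()
        item["title"] = response.css("title::text").get()   # steps 8/9: match a field into the item
        yield item                                           # handed to the engine, then to the item pipelines

# Hypothetical pipeline (would live in pipelines.py and be enabled via ITEM_PIPELINES)
class DemoPipeline:
    def process_item(self, item, spider):
        spider.logger.info("storing item: %s", dict(item))  # step 10: clean / store the item here
        return item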

Also, the middlewares have to be enabled in settings.py by removing the comment markers:

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    "newDemo.middlewares.NewdemoSpiderMiddleware": 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    "newDemo.middlewares.NewdemoDownloaderMiddleware": 543,
#}

The spider middleware (SpiderMiddleware) hooks into the spider itself, at the points where data enters and leaves the spider, whereas the downloader middleware hooks into each individual request. Spider middleware is mostly useful when there are multiple spiders; with only a single spider file you generally won't need it.

The spider middleware method process_start_requests handles the spider's start requests: it receives the Request objects that the spider's start_requests builds from start_urls and can filter or modify them before they move on to the engine and scheduler.

The spider middleware method process_spider_input is called after process_response, i.e. after the request has actually been sent and a response returned, and before parse. Think of it as the moment when the downloader middleware has produced a result but it has not yet been handed to the spider code.

The spider middleware method process_spider_output is called after parse.

So the spider middleware methods fire when data enters and leaves the specific spider code (here newnewTest.py); a logging sketch follows below.
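
To observe this order in practice, a minimal logging-only spider middleware can be used (a sketch; the class name is made up and it would need to be registered under SPIDER_MIDDLEWARES in settings.py):

class LoggingSpiderMiddleware:
    def process_start_requests(self, start_requests, spider):
        # wraps the Request objects yielded by the spider's start_requests
        for request in start_requests:
            spider.logger.info("process_start_requests: %s", request.url)
            yield request

    def process_spider_input(self, response, spider):
        # runs after the downloader middleware's process_response, before parse
        spider.logger.info("process_spider_input: %s", response.url)
        return None

    def process_spider_output(self, response, result, spider):
        # runs after parse, over whatever parse yielded (requests or items)
        for element in result:
            spider.logger.info("process_spider_output: %r", element)
            yield element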

Common Scrapy commands

1. Create a project
scrapy startproject <name>
2. Create a spider
scrapy genspider <name> <domain>

Global commands
1. Run a spider file directly, no project needed, a single spider file is enough
scrapy runspider <spider file>
2. Open the interactive shell
scrapy shell <url to scrape>
This drops you into Scrapy's built-in shell
3. Fetch a url (if the url carries ?key=value GET parameters, quote it, otherwise the shell may swallow part of it)
scrapy fetch <url>
4. Open the url in a browser, rendered as Scrapy sees it
scrapy view <url>
5. Version info
scrapy version

Project commands, must be run inside a project
1. Run a spider
scrapy crawl <spider name, without .py>
2. List the project's spiders
scrapy list
3. Benchmark Scrapy's speed on this machine
scrapy bench

Demo: scraping Douban Books

Create the spider

scrapy genspider douban book.douban.com

If we only scrape the title and the short description, the key points are how pagination is implemented (by generating a new Request object inside parse) and how an item is used for storage.

douban.py

import scrapy
from bs4 import BeautifulSoup
from scrapy.http import Request
from ..items import StudyItem


class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["book.douban.com"]
    start_urls = ["https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=20&type=T"]

    def __init__(self):
        super().__init__()
        self.page = 20

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        titles = soup.select(".subject-item .info h2 a")
        contents = soup.select(".subject-item .info p")
        studyItem = StudyItem()
        for title, content in zip(titles, contents):
            mtile = title.text.replace('\n', '').replace(' ', '')
            mcontent = content.text.replace('\n', '').replace(' ', '')
            studyItem['title'] = mtile
            studyItem['content'] = mcontent
            yield studyItem

        if self.page < 100:
            self.page += 20
            url = f'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start={self.page}&type=T'
            yield Request(url)

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class StudyItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class StudyPipeline:
    def process_item(self, item, spider):
        with open("douban.json", 'a+', encoding='utf8') as f:
            f.write(item['title'] + item['content'] + '\n')
        return item

Scrapy pipelines

process_item in pipelines.py handles the items yielded by the spider.

Inside process_item you can use ItemAdapter to convert the item into a dict and then write it to a file with json.dumps:

from itemadapter import ItemAdapter
import json

class StudyPipeline:
    def process_item(self, item, spider):
        m_item = ItemAdapter(item).asdict()   # convert the Item into a plain dict
        with open("douban.json", 'a+', encoding='utf8') as f:
            f.write(json.dumps(m_item, ensure_ascii=False) + '\n')
        return item

Multiple pipelines can be configured in settings.py:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# The trailing number is the priority, typically in the 0-1000 range; lower numbers run first (here 100 runs before 300)
ITEM_PIPELINES = {
    "study.pipelines.TestPipeline": 100,
    "study.pipelines.StudyPipeline": 300,
}

So several pipelines can be defined, and they process each item in turn:

from itemadapter import ItemAdapter
import json

class StudyPipeline:
    def process_item(self, item, spider):
        # TestPipeline runs first (priority 100) and returns a plain dict,
        # so json.dumps works on what arrives here
        print("StudyPipeline...")
        with open("douban.json", 'a+', encoding='utf8') as f:
            f.write(json.dumps(item, ensure_ascii=False) + '\n')
        return item

class TestPipeline:
    def process_item(self, item, spider):
        print("TestPipeline....")
        m_item = ItemAdapter(item).asdict()
        return m_item

Built-in pipeline methods

A pipeline has four methods: process_item, open_spider, close_spider and from_crawler.

open_spider is called when each spider is opened; as mentioned earlier, a project can contain multiple spiders.

from_crawler is a classmethod, i.e. it is bound to the class itself rather than to an instance; it is the earliest of these to be called.

open_spider and close_spider are typically used for setup and cleanup work.

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import json

class StudyPipeline:
    def __init__(self, name=None):
        self.pipeName = name
        print("StudyPipeline name:", self.pipeName)

    def process_item(self, item, spider):
        print("StudyPipeline...")
        self.file.write(json.dumps(item, ensure_ascii=False) + '\n')
        return item

    def open_spider(self, spider):
        print(f"{spider.name} is open")
        self.file = open("douban.json", 'a+', encoding='utf8')

    def close_spider(self, spider):
        print(f"{spider.name} is close")
        self.file.close()

    @classmethod
    def from_crawler(cls, crawler):
        # crawler here is the whole running crawler (the project)
        # crawler.settings.get reads the PILE_NAME entry from settings.py
        # cls is the pipeline class itself; return cls(...) instantiates it, passing the argument to __init__
        # the value returned here is received by __init__ above and printed
        print("pipeline from_crawler called")
        return cls(
            crawler.settings.get("PILE_NAME")
        )

class TestPipeline:
    def process_item(self, item, spider):
        print("TestPipeline....")
        m_item = ItemAdapter(item).asdict()
        return m_item

Scrapy's signal mechanism

In middlewares.py you can see signals in use. crawler.signals.connect(s.spider_opened, signal=signals.spider_opened) means: when the signals.spider_opened signal fires, call the spider_opened method; s here is the middleware instance itself.

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class NewdemoSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    # ... other methods omitted

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)

Looking at the signals module, there are many signals available:

"""
Scrapy signals

These signals are documented in docs/topics/signals.rst. Please don't add new
signals here without documenting them there.
"""

engine_started = object()
engine_stopped = object()
spider_opened = object()
spider_idle = object()
spider_closed = object()
spider_error = object()
request_scheduled = object()
request_dropped = object()
request_reached_downloader = object()
request_left_downloader = object()
response_received = object()
response_downloaded = object()
headers_received = object()
bytes_received = object()
item_scraped = object()
item_dropped = object()
item_error = object()
feed_slot_closed = object()
feed_exporter_closed = object()

# for backward compatibility
stats_spider_opened = spider_opened
stats_spider_closing = spider_closed
stats_spider_closed = spider_closed

item_passed = item_scraped

request_received = request_scheduled

For example, to know how many times the code executed yield, the signal mechanism can be used in the same way: add a from_crawler method to the spider and connect the signals to handler functions.

import scrapy
from bs4 import BeautifulSoup
from scrapy.http import Request
from ..items import StudyItem
from scrapy import signals

class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["book.douban.com"]
    start_urls = ["https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start=20&type=T"]

    def __init__(self):
        super().__init__()
        self.page = 20
        self.items = 0

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        _cls = cls()
        crawler.signals.connect(_cls.item_number, signals.item_scraped)
        crawler.signals.connect(_cls.spider_close, signals.spider_closed)
        return _cls

    def item_number(self):
        self.items += 1
        print("yield processed one item")

    def spider_close(self):
        if self.items:
            print(f"processed {self.items} items in total")
        else:
            print("no items were processed")

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        titles = soup.select(".subject-item .info h2 a")
        contents = soup.select(".subject-item .info p")
        studyItem = StudyItem()
        for title, content in zip(titles, contents):
            mtile = title.text.replace('\n', '').replace(' ', '')
            mcontent = content.text.replace('\n', '').replace(' ', '')
            studyItem['title'] = mtile
            studyItem['content'] = mcontent
            yield studyItem

        if self.page < 20:
            self.page += 20
            url = f'https://book.douban.com/tag/%E5%B0%8F%E8%AF%B4?start={self.page}&type=T'
            yield Request(url)

A random User-Agent middleware

This belongs in the downloader middleware, since it is applied to every single request.

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals
import random

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class StudySpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)


class StudyDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    my_requests = 0

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        print("StudyDownloaderMiddleware:", request.headers)
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info("Spider opened: %s" % spider.name)


class HeadersDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_request(self, request, spider):
        # pick a random User-Agent for every outgoing request
        uas = ['Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
               'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
               'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
               'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)']
        print("HeadersDownloaderMiddleware:", request.headers)
        request.headers['User-Agent'] = random.choice(uas)
        return None

Note the possible return values of process_request: returning None continues with the rest of the middleware chain; returning a Response object hands it to process_response; returning a Request object sends it back to the scheduler to be requested again.

# - return None: continue processing this request

# - or return a Response object

# - or return a Request object

# - or raise IgnoreRequest: process_exception() methods of installed downloader middleware will be called

You also need to enable the middleware and set its priority in settings.py; a sketch follows below.
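
For example, a settings.py entry along these lines (assuming the study.middlewares module path used in the code above; adjust it to your own project):

# settings.py: enable the custom downloader middleware (the number sets its order among the middlewares)
DOWNLOADER_MIDDLEWARES = {
    "study.middlewares.HeadersDownloaderMiddleware": 543,
}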

Handling process_request return values in middleware

We know process_request has several possible return values. Returning None continues with the remaining middleware processing. Returning a Response object means the request is never actually sent: the middleware supplies the response itself, and it is handed to process_response. Returning a Request object stops the chain, and the returned request goes back to the scheduler to be requested again (passing through the process_request chain once more).

Note that when a Response object is returned, even if the current middleware has no process_response method, the process_response methods of the other installed middlewares still run, until the response finally reaches the spider's parse.

# - return None: continue processing this request

# - or return a Response object

# - or return a Request object

# - or raise IgnoreRequest: process_exception() methods of installed downloader middleware will be called

demo

class HeadersDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_request(self, request, spider):
        # - or return a Response object: handled by process_response
        # - or return a Request object: goes back to the scheduler and through process_request again
        # - or raise IgnoreRequest: handled by the process_exception methods of
        #   installed downloader middleware
        print("Processing request: %s" % request.url)
        uas = ['Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
               'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
               'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
               'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)']
        print("HeadersDownloaderMiddleware:", request.headers)
        request.headers['User-Agent'] = random.choice(uas)
        #return None
        #return scrapy.http.Response(url=request.url, body='my own body'.encode('utf-8'))
        #return scrapy.http.Request(url=request.url, dont_filter=True)
        raise scrapy.exceptions.IgnoreRequest("dropping this request")

Handling process_response return values in middleware

The possible return values: returning a Response hands it on to the process_response methods of the remaining middlewares, until it finally reaches the spider's parse.

# - return a Response object

# - return a Request object

# - or raise IgnoreRequest

demo

class HeadersDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_request(self, request, spider):
        # - or return a Response object: handled by process_response
        # - or return a Request object: goes back through process_request again
        # - or raise IgnoreRequest: handled by the process_exception methods of
        #   installed downloader middleware
        print("Processing request: %s" % request.url)
        uas = ['Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
               'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2',
               'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
               'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)']
        print("HeadersDownloaderMiddleware:", request.headers)
        request.headers['User-Agent'] = random.choice(uas)
        return None
        #return scrapy.http.Response(url=request.url, body='my own body'.encode('utf-8'))
        #return scrapy.http.Request(url=request.url, dont_filter=True)
        #raise scrapy.exceptions.IgnoreRequest("dropping this request")

    def process_response(self, request, response, spider):
        print("HeadersDownloaderMiddleware process_response:", response.body.decode('utf-8'))
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object: passed straight on towards the spider
        # - return a Request object: handed back to be requested again
        # - raise IgnoreRequest: no result is processed
        #return response
        raise scrapy.exceptions.IgnoreRequest("error raised by HeadersDownloaderMiddleware")

    def process_exception(self, request, exception, spider):
        print("HeadersDownloaderMiddleware process_exception:", exception)

The Scrapy Request object

The Request object comes from scrapy.http.Request. A Request can take a callback; if none is given, the default is parse.

import scrapy
from scrapy.http import Request


class ReqSpider(scrapy.Spider):
    name = "req"
    allowed_domains = ["req.com"]
    start_urls = ["http://req.com/"]

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url, dont_filter=True,
                          callback=self.my_parse,
                          meta={"test": "some stored information"},
                          flags=['DEBUG', 'ERROR'],
                          cb_kwargs={"test": "this is test", 'test2': 111})

    def my_parse(self, response, test, test2):
        print("my_parse called")
        print(response.meta['test'])

Sending POST requests with Scrapy

The plain Request object is not recommended here; it is meant for GET, and sending POST with it is awkward.

import scrapy
from scrapy.http import Request, FormRequest, JsonRequest
import json

"""
1. Request: POST built by hand via method/body
2. FormRequest: b"Content-Type", b"application/x-www-form-urlencoded" - for POST bodies whose content-type is x-www-form-urlencoded
3. JsonRequest: "Content-Type", "application/json" - for POST bodies whose content-type is application/json
"""

class HttpostSpider(scrapy.Spider):
    name = "httpost"
    start_urls = ["https://httpbin.org/post"]

    def start_requests(self):
        for url in self.start_urls:
            yield JsonRequest(url, dont_filter=True,
                              method='POST',
                              data={"user": "xjb"})

    def parse(self, response):
        print(response.text)
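
For form-encoded POST bodies, FormRequest is the counterpart; a minimal sketch against the same httpbin.org endpoint (the spider name is made up):

import scrapy
from scrapy.http import FormRequest

class FormPostSpider(scrapy.Spider):
    name = "formpost"
    start_urls = ["https://httpbin.org/post"]

    def start_requests(self):
        for url in self.start_urls:
            # formdata is sent as application/x-www-form-urlencoded;
            # FormRequest switches the method to POST by itself
            yield FormRequest(url, dont_filter=True,
                              formdata={"user": "xjb"})

    def parse(self, response):
        print(response.text)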

Email alerts with Twisted (MailSender)

import scrapy
from scrapy.mail import MailSender

class DoubanSpider(scrapy.Spider):
    name = "douban"
    allowed_domains = ["book.douban.com"]
    start_urls = ["https://book.douban.com"]

    def parse(self, response):
        if response.text:
            # smtphost is the mail provider's SMTP host, e.g. QQ mail's
            # mailfrom is the sender address
            # smtpuser is the mailbox account
            # smtppass is the mailbox authorization code
            mailer = MailSender(smtphost=self.settings.get("MAIL_HOST"),
                                mailfrom=self.settings.get("MAIL_FROM"),
                                smtpuser=self.settings.get("MAIL_USER"),
                                smtppass=self.settings.get("MAIL_PASS"))
            mailer.send(to=["123123213@qq.com"],
                        subject="Some subject",
                        body="Some body",
                        cc=["123123213@qq.com"])
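
The MAIL_* names above are custom settings read by the spider; they would be defined in settings.py, for example (placeholder values, assuming a QQ mailbox):

# settings.py (placeholder values)
MAIL_HOST = "smtp.qq.com"         # SMTP host of the mail provider
MAIL_FROM = "you@qq.com"          # sender address
MAIL_USER = "you@qq.com"          # mailbox account
MAIL_PASS = "authorization-code"  # SMTP authorization code, not the login password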

Pausing and resuming Scrapy

The goal is to pause at some url and, on the next start, not re-crawl the urls that have already been crawled.

To pause, press Ctrl+C once, not twice.

When starting, the -s option can point at a working directory where the crawl state (including already-crawled urls) is kept: -s JOBDIR="tmp". Starting again with the same -s option is supposed to resume from where the crawl stopped, although in my own tests this did not resume properly. The full commands are shown below.
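
For reference, the commands would look like this (using the douban spider from the earlier demo):

scrapy crawl douban -s JOBDIR=tmp    # first run; press Ctrl+C once to pause
scrapy crawl douban -s JOBDIR=tmp    # run again with the same JOBDIR to resume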

To implement it yourself, you compare urls at the point where the requests are issued, i.e. in the spider middleware's process_start_requests:

def process_start_requests(self, start_requests, spider):
    # Called with the start requests of the spider, and works
    # similarly to the process_spider_output() method, except
    # that it doesn't have a response associated.
    # Must return only requests (not items).
    spider.logger.info("spider start requests")
    for r in start_requests:
        with open(spider.settings.get("TMP_FILE"), 'r') as f:
            cc = f.read()
        if r.url in cc:
            print("skipping already-crawled url", r.url)
            continue
        yield r

Spider templates

scrapy genspider -l    (list the available templates)

scrapy genspider -t crawl <name> <domain>    (generate spider code from the crawl template)

The generated spider code looks like this:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BaiduSpider(CrawlSpider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = ["https://baidu.com"]
    # LinkExtractor uses a regular expression to pull out the links you want; matched links can then be handled by the callback
    # follow: whether to crawl deeper, i.e. keep extracting and following links from the pages that were crawled
    # there can be multiple Rules
    # once a link is matched, a request is made for it by default and the response is passed to the callback
    rules = (Rule(LinkExtractor(allow=r"Items/"), callback="parse_item", follow=True),)

    def parse_item(self, response):
        item = {}
        #item["domain_id"] = response.xpath('//input[@id="sid"]/@value').get()
        #item["name"] = response.xpath('//div[@id="name"]').get()
        #item["description"] = response.xpath('//div[@id="description"]').get()
        return item

Multiple Rules usually mean a large-scale crawl, so the concurrency and download delay should be set in settings.py:

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3

Scraping a famous-quotes site with Scrapy

Pagination here is just a change in the listing-page url; on each listing page we then match the detail-page urls and visit them.

Without the template

Spider code

import scrapy
from bs4 import BeautifulSoup
from xjb import items

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['lz13.cn']
    # pagination: the page number runs from 1 to 254
    start_urls = [f'https://www.lz13.cn/lizhi/mingrenmingyan-{i}.html'
                  for i in range(1, 254)]

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        # collect the detail-page urls
        small_url = soup.select('.PostHead span h3 a')
        for url in small_url:
            # yield a Request and hand the response to parse_content; without a callback it would default to parse
            # yielding scrapy.Request issues the request automatically
            yield scrapy.Request(url=url['href'], callback=self.parse_content)

    def parse_content(self, response):
        item = items.XjbItem()
        # extract the data from the detail page
        soup = BeautifulSoup(response.text, 'lxml')
        title = soup.select_one('.PostContent p').text
        content = soup.select('.PostContent p')[1:]
        tmp = ''
        for c in content:
            tmp += c.text

        item['title'] = title.strip()
        item['content'] = tmp
        # yielding the item hands it to the pipeline automatically
        yield item

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy

class XjbItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

class XjbPipeline:
    def process_item(self, item, spider):
        self.writer.write(item['title'] + " " + item['content'] + "\n")
        return item

    def open_spider(self, spider):
        self.writer = open('res.txt', 'a+', encoding='utf8')

    def close_spider(self, spider):
        self.writer.close()

Using the crawl template

If we do the same crawl with the crawl template:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from bs4 import BeautifulSoup
from xjb import items

class Test2Spider(CrawlSpider):
    name = 'test2'
    allowed_domains = ['lz13.cn']
    # a single start url
    start_urls = ['https://www.lz13.cn/lizhi/mingrenmingyan-1.html']

    # Rules: detail pages, pagination, url collection
    rules = (
        # detail pages: no deep crawl needed, call the callback to extract the data
        Rule(LinkExtractor(allow=r'www.lz13.cn/mingrenmingyan/\d+.html'), callback='parse_item'),
        # pagination: deep crawl needed, because after turning a page we still have to find the next pagination urls and request them
        # no callback needed, it only matches urls and issues requests
        # e.g. when a new page like https://www.lz13.cn/lizhi/mingrenmingyan-99.html is matched,
        # by default its response is matched against these two Rules again after it is fetched
        Rule(LinkExtractor(allow=r'www.lz13.cn/lizhi/mingrenmingyan-\d+.html',
                           restrict_xpaths='//div[@class="pager"]/a'), follow=True),
    )

    def parse_item(self, response):
        item = items.XjbItem()

        soup = BeautifulSoup(response.text, 'lxml')
        title = soup.select_one('.PostContent p').text
        content = soup.select('.PostContent p')[1:]
        tmp = ''
        for c in content:
            tmp += c.text

        item['title'] = title.strip()
        item['content'] = tmp
        yield item

Scrapy logging settings

There are five log levels; severity increases from top to bottom in the list below:

DEBUG
INFO
WARNING
ERROR
CRITICAL

The log file path and log level can be set in settings.py; if the level is set to INFO, anything below INFO, such as DEBUG, will not be logged.

LOG_LEVEL='DEBUG'
LOG_FILE='Project.log'
LOG_ENABLED=True
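
Inside a spider (or from a pipeline/middleware via spider.logger) messages can then be emitted at these levels; a minimal sketch:

import scrapy

class LogDemoSpider(scrapy.Spider):
    name = "logdemo"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # with LOG_LEVEL = 'INFO', the debug line below is filtered out of Project.log
        self.logger.debug("debug detail for %s", response.url)
        self.logger.info("parsed %s", response.url)
        self.logger.warning("something looks odd")
        self.logger.error("something went wrong")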