When writing a crawler, the main performance cost lies in IO: with a single process and a single thread, each URL request blocks while waiting for the response, which slows down the whole run.
1. Synchronous execution
import requests

def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']

for url in url_list:
    fetch_async(url)
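The cost of the synchronous loop can be seen without any network at all. The sketch below is a hypothetical stand-in: `time.sleep` substitutes for the network wait inside `requests.get`, so the timing behaviour is the same shape without depending on live hosts.

```python
import time

def fetch_sim(delay):
    # Stand-in for requests.get(url): the sleep simulates the network IO wait.
    time.sleep(delay)
    return delay

start = time.time()
results = [fetch_sim(d) for d in (0.1, 0.2)]
elapsed = time.time() - start
# Sequential execution: total time is roughly the SUM of the waits (~0.3s here).
```

Each request only starts after the previous one has fully finished, which is exactly the slowdown the rest of this article attacks.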
2. Multithreaded execution (the requests run concurrently, so total time is bounded by the slowest URL request)
from concurrent.futures import ThreadPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ThreadPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)
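The same no-network sketch shows the difference threads make. `time.sleep` again stands in for the request wait (sleeping releases the GIL, just as blocking socket IO does), and `Executor.map` is used here because, unlike bare `submit`, it also hands back the results in input order:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_sim(delay):
    # Stand-in for requests.get(url); sleep releases the GIL like socket IO.
    time.sleep(delay)
    return delay

start = time.time()
with ThreadPoolExecutor(5) as pool:
    # map() preserves input order and the with-block joins workers on exit.
    results = list(pool.map(fetch_sim, (0.2, 0.4)))
elapsed = time.time() - start
# Concurrent execution: total time tracks the LONGEST wait (~0.4s), not the sum (0.6s).
```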
3. Multiprocess execution (with enough CPU cores the processes run in parallel, and total time is again bounded by the slowest URL request; note that for pure IO waiting this is usually no faster than threads, since the time is spent waiting either way, and processes mainly pay off when each response also needs CPU-heavy processing)
from concurrent.futures import ProcessPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response


url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ProcessPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)
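One portability caveat about the process version: it runs as-is on Linux (fork start method), but on Windows and recent macOS the default "spawn" start method re-imports the main module in each child, so the worker function must live at module top level and pool creation must sit behind a `__main__` guard. A minimal sketch (again using a hypothetical `time.sleep` stand-in instead of real requests):

```python
import time
from concurrent.futures import ProcessPoolExecutor

def fetch_sim(delay):
    # Must be defined at module top level so child processes can import it.
    time.sleep(delay)
    return delay

def run(delays):
    with ProcessPoolExecutor(2) as pool:
        return list(pool.map(fetch_sim, delays))

if __name__ == '__main__':
    # The guard keeps the 'spawn' start method (Windows / macOS default)
    # from re-executing the pool setup inside every child process.
    print(run([0.1, 0.2]))
```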
4. Multithreading + callback (submission is non-blocking: the main thread can do other work while requests wait on IO, and each callback runs as soon as its request finishes)
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response


def callback(future):
    print(future.result())


url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ThreadPoolExecutor(5)
for url in url_list:
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
pool.shutdown(wait=True)
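One caveat with `add_done_callback`: the callback receives the `Future`, and if the submitted function raised, calling `future.result()` inside the callback re-raises that exception there. A robust callback checks `future.exception()` first. A sketch with a hypothetical `fetch_sim` that fails on purpose for one URL:

```python
from concurrent.futures import ThreadPoolExecutor

results, errors = [], []

def fetch_sim(url):
    if 'bad' in url:
        raise ValueError(url)   # simulate a failed request
    return 'ok: ' + url

def callback(future):
    # Invoked as soon as the future finishes (usually in a worker thread).
    exc = future.exception()
    if exc is not None:
        errors.append(exc)      # future.result() here would re-raise instead
    else:
        results.append(future.result())

with ThreadPoolExecutor(5) as pool:
    for url in ('http://good.example', 'http://bad.example'):
        pool.submit(fetch_sim, url).add_done_callback(callback)
# Leaving the with-block waits for all workers, so both callbacks have run here.
```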
5. Multiprocessing + callback (same non-blocking pattern as above, with processes instead of threads; note the callback itself runs back in the parent process)
from concurrent.futures import ProcessPoolExecutor
import requests


def fetch_async(url):
    response = requests.get(url)
    return response


def callback(future):
    print(future.result())


url_list = ['http://www.github.com', 'http://www.bing.com']
pool = ProcessPoolExecutor(5)
for url in url_list:
    v = pool.submit(fetch_async, url)
    v.add_done_callback(callback)
pool.shutdown(wait=True)
All of the variants above improve request throughput, but threads and processes still sit idle while blocked on IO, which wastes them. That makes asynchronous IO the preferred approach:
1. asyncio example one
import asyncio

# @asyncio.coroutine / yield from is the legacy Python 3.4 coroutine syntax;
# the decorator was removed in Python 3.11 (use async def / await instead).
@asyncio.coroutine
def func1():
    print('before...func1......')
    yield from asyncio.sleep(5)
    print('end...func1......')

tasks = [func1(), func1()]
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
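For reference, the same program in the modern spelling: `async def` / `await` replaces the decorator and `yield from`, and `asyncio.run` replaces the manual `get_event_loop` / `run_until_complete` / `close` dance (the sleep is shortened from the original 5 seconds so the overlap is easy to time):

```python
import asyncio
import time

async def func1():
    print('before...func1......')
    await asyncio.sleep(0.1)   # 5 seconds in the original, shortened here
    print('end...func1......')

async def main():
    # gather schedules both coroutines on the same event loop concurrently.
    await asyncio.gather(func1(), func1())

start = time.time()
asyncio.run(main())            # replaces get_event_loop / run_until_complete / close
elapsed = time.time() - start
# Both sleeps overlap, so elapsed is ~0.1s rather than ~0.2s.
```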
2. asyncio example two
import asyncio


@asyncio.coroutine
def fetch_async(host, url='/'):
    print(host, url)
    reader, writer = yield from asyncio.open_connection(host, 80)

    request_header_content = """GET %s HTTP/1.0\r\nHost: %s\r\n\r\n""" % (url, host,)
    request_header_content = bytes(request_header_content, encoding='utf-8')

    writer.write(request_header_content)
    yield from writer.drain()
    text = yield from reader.read()
    print(host, url, text)
    writer.close()

tasks = [
    fetch_async('www.cnblogs.com', '/wupeiqi/'),
    fetch_async('dig.chouti.com', '/pic/show?nid=4073644713430508&lid=10273091')
]

loop = asyncio.get_event_loop()
results = loop.run_until_complete(asyncio.gather(*tasks))
loop.close()
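The example above needs live hosts (and those particular URLs may no longer respond), but the same streams API can be exercised entirely offline by pointing `open_connection` at a local `asyncio.start_server`. A self-contained sketch in the modern syntax, with a toy server that just echoes the request line back uppercased:

```python
import asyncio

async def handle(reader, writer):
    # Minimal toy server: echo the first request line back, uppercased.
    line = await reader.readline()
    writer.write(line.upper())
    await writer.drain()
    writer.close()

async def fetch(host, port, path):
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(b'GET ' + path.encode() + b' HTTP/1.0\r\n')
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    return reply

async def main():
    server = await asyncio.start_server(handle, '127.0.0.1', 0)
    port = server.sockets[0].getsockname()[1]   # OS-assigned free port
    replies = await asyncio.gather(fetch('127.0.0.1', port, '/a'),
                                   fetch('127.0.0.1', port, '/b'))
    server.close()
    await server.wait_closed()
    return replies

replies = asyncio.run(main())
```

Both connections are driven concurrently by `gather`, just as in the two-host example above.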
3. asyncio + aiohttp
import aiohttp
import asyncio


@asyncio.coroutine
def fetch_async(url):
    print(url)