Python Crawler Beginner Tutorial: Multithreaded Crawling of All IT eBooks (Part 1)

2019-05-23 14:58:37

Tags: Python, crawler beginner tutorial, All eBooks, threads

All IT eBooks multithreaded crawling - Preface

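Every crawler hobbyist has at least a bit of a collecting habit. Whenever we find good pictures, good books, or anything else that can be stored on a computer, we like to crawl it in bulk. Then we leave it there. Yes, we just leave it there... and slowly forget about it...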

All IT eBooks multithreaded crawling - Crawler analysis

Open http://www.allitebooks.com/ and you will find a small, very clean page that looks easy to crawl at first glance.

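Click a book to open its detail page, and the download link is also clearly displayed. That is a little exciting, because clean, ad-free sites like this are rare.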

All IT eBooks multithreaded crawling - Writing the code

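This time I use a new module, requests-html. Its author previously developed requests, which you are probably very familiar with. The threads are coordinated through queue.
Install the requests-html module: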

pip install requests-html

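As for how to use this module, just search for its name in a search engine and you will find plenty of articles. If you have followed the series up to this post, it will be easy for you.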

Let's write the core part.

from requests_html import HTMLSession
from queue import Queue
import requests
import random
import time

import threading
CARWL_EXIT = False
DOWN_EXIT = False

#####
# Other code
####
if __name__ == '__main__':

    page_queue = Queue(5)
    for i in range(1,6):
        page_queue.put(i)  # store the page numbers in page_queue

    # Crawl results
    data_queue = Queue()

    # List of running crawl threads
    thread_crawl = []
    # Start 5 threads at a time
    craw_list = ["Crawl thread 1","Crawl thread 2","Crawl thread 3","Crawl thread 4","Crawl thread 5"]

    for thread_name in craw_list:
        c_thread = ThreadCrawl(thread_name,page_queue,data_queue)
        c_thread.start()
        thread_crawl.append(c_thread)

    # Wait until every page number has been taken; sleep briefly instead of spinning at full CPU
    while not page_queue.empty():
        time.sleep(0.1)

    # page_queue is empty, so tell the crawl threads to leave their loop
    CARWL_EXIT = True
    for thread in thread_crawl:
        thread.join()
        print("Crawl thread finished")

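The code above starts the threads that crawl the book detail pages. I started 5 threads and crawled only 5 pages. If you need more, just change the lines below, keeping the Queue capacity at least as large as the number of pages, because put blocks on a full queue: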

    page_queue = Queue(5)
    for i in range(1,6):
        page_queue.put(i)  # store the page numbers in page_queue
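
To avoid keeping those two numbers in sync by hand, the queue could be built from the page range itself. The following is only a minimal sketch: build_page_queue is a hypothetical helper that does not appear in the original code.

```python
from queue import Queue


def build_page_queue(first_page, last_page):
    """Return a queue holding every page number from first_page to last_page.

    Hypothetical helper, not part of the original code. The capacity is
    derived from the range, so put() never blocks on a full queue.
    """
    page_count = last_page - first_page + 1
    if page_count < 1:
        raise ValueError("last_page must not be smaller than first_page")
    page_queue = Queue(page_count)
    for page in range(first_page, last_page + 1):
        page_queue.put(page)
    return page_queue


if __name__ == "__main__":
    # Example: queue pages 1 to 20 instead of 1 to 5
    page_queue = build_page_queue(1, 20)
    print(page_queue.qsize())
```

With a helper like this, crawling more pages only means changing the two arguments.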

Next, let's finish the ThreadCrawl class.

session = HTMLSession()

# User-Agent strings. Later I plan to host them on a server so they can be fetched remotely. The full list has many entries; look them up in the source code.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20"
]
# Thread class that collects the book download links
class ThreadCrawl(threading.Thread):
    # Constructor
    def __init__(self,thread_name,page_queue,data_queue):

        super(ThreadCrawl,self).__init__()
        self.thread_name = thread_name
        self.page_queue = page_queue
        self.data_queue = data_queue
        self.page_url = "http://www.allitebooks.com/page/{}"   # URL template

    def run(self):
        print(self.thread_name+" started *********")

        while not CARWL_EXIT:
            try:
                page = self.page_queue.get(block=False)
                page_url = self.page_url.format(page)   # build the list page URL
                self.get_list(page_url)   # parse the links on the page

            except Exception as e:
                # queue.Empty (no pages left) or a request error ends this thread
                print(e)
                break


    # Collect all book links on the current list page
    def get_list(self,url):
        try:
            response = session.get(url)
        except Exception as e:
            print(e)
            raise e

        all_link = response.html.find('.entry-title>a') # all book detail links on the page

        for link in all_link:
            self.get_book_url(link.attrs['href'])   # follow the link to the book page

    # Get the download link of a book
    def get_book_url(self,url):
        try:
            response = session.get(url)

        except Exception as e:
            print(e)
            raise e

        download_url = response.html.find('.download-links a', first=True)

        if download_url is not None: # continue only if a download link exists
            link = download_url.attrs['href']
            self.data_queue.put(link)   # store the download address in data_queue for the download step
            print("Crawled {}".format(link))

A key point in the code above is that the book download links are stored in data_queue. They are the basic input for the separate download threads.
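
How a download thread might consume data_queue can be sketched as follows. This is not the ThreadDown class from this tutorial. The names download_worker, fake_download, run_downloads, and the STOP sentinel are all hypothetical, and the download itself is replaced by a stub so the example runs without network access.

```python
import threading
from queue import Queue

STOP = None  # hypothetical sentinel that tells a worker to exit


def fake_download(link):
    """Stand-in for a real download; returns the file name taken from the link."""
    return link.rsplit("/", 1)[-1]


def download_worker(data_queue, results, lock):
    """Take links from data_queue until the STOP sentinel arrives."""
    while True:
        link = data_queue.get()  # blocks until an item is available
        if link is STOP:
            break
        name = fake_download(link)
        with lock:  # protect the shared result list
            results.append(name)


def run_downloads(links, worker_count=4):
    """Feed links to worker_count threads and return the sorted file names."""
    data_queue = Queue()
    results = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=download_worker, args=(data_queue, results, lock))
        for _ in range(worker_count)
    ]
    for worker in workers:
        worker.start()
    for link in links:
        data_queue.put(link)
    for _ in workers:
        data_queue.put(STOP)  # one sentinel per worker
    for worker in workers:
        worker.join()
    return sorted(results)
```

Sending one sentinel per worker lets each thread block on get() and exit cleanly, so no global exit flag or polling loop is needed in this variant.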

Now let's write the class and methods that download the books.

I started 4 threads; the logic is very similar to the code above.

class ThreadDown(threading.Thread):
    def __init__(self,
