Python Web Scraping (Study Notes) - Python


Python Web Scraping (Study Notes), Part 5

2023-07-23 13:45:23

Tags: Python, web scraping, study notes

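Regular expressions (re):

1. Match the pattern against the page text

2. Extract the target content

3. Persist the data

bs4:

1. Instantiate a soup object

2. Load the page text into the soup object

3. Parse with soup to extract the target content

4. Persist the data

xpath:

1. Instantiate an etree object and load the content to be parsed into it

2. Call the etree object's xpath method with an xpath expression to locate tags and capture their content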

Initialization

re.findall(ex,text,re.S)
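The re.S flag in the call above matters whenever the target spans several lines. A minimal sketch, assuming a made-up page fragment and pattern (page_text and ex here are illustrative, not taken from a real site):

```python
import re

# Hypothetical page fragment; each <div> spans several lines on purpose.
page_text = """<div class="thumb">
<img src="/pic/a.jpg" alt="first">
</div>
<div class="thumb">
<img src="/pic/b.jpg" alt="second">
</div>"""

# The non-greedy group captures the src value of each image.
ex = r'<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'

# Without re.S, "." does not match a newline, so the pattern cannot cross lines.
without_flag = re.findall(ex, page_text)

# With re.S (DOTALL), "." also matches newlines.
with_flag = re.findall(ex, page_text, re.S)

print(without_flag)
print(with_flag)
```

With a single capture group, findall returns a list of strings, one per match.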

Initializing from local data (fp is an open file object)

soup=BeautifulSoup(fp,'lxml')

Initializing from web page text

soup=BeautifulSoup(page_text,'lxml')
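The two initialization paths can be compared side by side. This is a sketch under assumptions: the HTML string and the temporary file are invented for the demonstration, and the built-in 'html.parser' is swapped in for 'lxml' so that only bs4 needs to be installed.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical page used for both cases.
html = "<html><head><title>demo</title></head><body><p>hello</p></body></html>"

# Case 1: local data. Write the page to a temporary file, then pass the
# open file object to BeautifulSoup.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(path, "r", encoding="utf-8") as fp:
        soup_from_file = BeautifulSoup(fp, "html.parser")

# Case 2: page text, for example the .text of a requests response.
soup_from_text = BeautifulSoup(html, "html.parser")

print(soup_from_file.title.string)
print(soup_from_text.p.text)
```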

Load the source of a local document into an etree object

etree.parse(filePath)

Load source fetched from the internet into an etree object

etree.HTML(page_text)
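A small sketch of both loading paths, assuming lxml is installed and using an invented page and temporary file. By default etree.parse uses an XML parser, which rejects HTML that is not well formed, so this sketch passes etree.HTMLParser() explicitly; that argument is an addition to the one-argument call in the notes.

```python
import os
import tempfile

from lxml import etree

# Hypothetical page source.
page_text = "<html><body><div class='song'><p>first</p><p>second</p></div></body></html>"

# Source already in memory (for example response.text): use etree.HTML.
tree_from_text = etree.HTML(page_text)

# Source stored on disk: use etree.parse. The HTMLParser makes parsing
# tolerant of real-world HTML.
with tempfile.TemporaryDirectory() as tmp:
    file_path = os.path.join(tmp, "page.html")
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(page_text)
    tree_from_file = etree.parse(file_path, etree.HTMLParser())

# Both objects expose the same xpath method.
texts_from_text = tree_from_text.xpath("//div[@class='song']/p/text()")
texts_from_file = tree_from_file.xpath("//div[@class='song']/p/text()")
print(texts_from_text, texts_from_file)
```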

Methods

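requests.get(url, headers=headers)

requests.post(url, data=param, headers=headers)

Pass headers by keyword: the second positional argument of get is params, and the second positional argument of post is data.

response.text: the response body decoded as text

response.json(): the response body parsed as JSON (a method call)

response.content: the raw response body as bytes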
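To see where each argument ends up without touching the network, a request can be built and prepared but never sent. A sketch, assuming a placeholder URL, a shortened User-Agent, and made-up parameters:

```python
import requests

url = "https://example.com/search"        # hypothetical URL
headers = {"User-Agent": "Mozilla/5.0"}   # shortened UA string for the demo

# Equivalent of requests.get(url, params=..., headers=...), prepared but not sent.
prepared_get = requests.Request(
    "GET", url, params={"q": "python"}, headers=headers
).prepare()

# Equivalent of requests.post(url, data=..., headers=...), prepared but not sent.
prepared_post = requests.Request(
    "POST", url, data={"kw": "dog"}, headers=headers
).prepare()

# What happens if headers land in the params slot, as in requests.get(url, headers):
misplaced = requests.Request("GET", url, params=headers).prepare()

print(prepared_get.url)                    # query string built from params
print(prepared_get.headers["User-Agent"])  # header set as intended
print(prepared_post.body)                  # form-encoded body built from data
print(misplaced.url)                       # the UA leaks into the query string
```

Once a request is actually sent, the response offers the three views listed above: .text, .json() and .content.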

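soup.tagName (for example soup.a, soup.div, soup.p, soup.title) returns the first tag with that name; .text and .string read its text

text returns all the text under the tag, including the text of child tags at every level

string returns the text only when the tag has a single child (a string, or a single nested tag that itself holds one string); otherwise it returns None

Example: soup.title.parent.name

Example: soup.div.div.div.a.text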
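The difference between text and string shows up as soon as a tag has more than one child. A minimal sketch on an invented fragment, using the built-in 'html.parser' instead of 'lxml' so that only bs4 is required:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: <p> has two children (a string and a <b> tag).
html = "<div><p>outer <b>inner</b></p><a href='/x'>link</a></div>"
soup = BeautifulSoup(html, "html.parser")

p_text = soup.p.text        # all text under <p>, nested tags included
p_string = soup.p.string    # None, because <p> has more than one child
b_string = soup.b.string    # <b> has a single string child
parent_name = soup.a.parent.name  # navigate upward from the first <a>

print(repr(p_text), repr(p_string), repr(b_string), parent_name)
```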

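soup.find('tagName'), soup.find_all('tagName')

soup.find('tagName', class_='className')

soup.select('.bookcont > ul > span > a')

select walks through several levels and returns every match

.bookcont is a class name (note the leading dot); ul is a tag name

find_all() returns an empty list when nothing matches; find() returns None when nothing matches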
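Because find() and find_all() fail differently, the calling code has to check for different things. A short sketch on a made-up list, again with 'html.parser' standing in for 'lxml':

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two items of class "item" and one without.
html = '<ul><li class="item">one</li><li class="item">two</li><li>three</li></ul>'
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li", class_="item")      # first match only
all_items = soup.find_all("li", class_="item")   # every match, as a list

missing_one = soup.find("table")       # no <table> in the fragment
missing_all = soup.find_all("table")   # same query through find_all

print(first_item.text, len(all_items), missing_one, missing_all)

# Guard before touching attributes of a find() result.
if missing_one is None:
    print("no table found")
```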

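Find by CSS class name: soup.select(".sister")

Find by walking down through tags (levels may be skipped):

soup.select("body a")

Find the direct children of a tag (levels may not be skipped):

soup.select("head > title")

Read an attribute value: soup.a['href']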
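The selector forms above can be tried on one small page. The HTML below is invented for the illustration (it only borrows the class names from the notes), and 'html.parser' replaces 'lxml' to keep the dependencies to bs4 alone:

```python
from bs4 import BeautifulSoup

# Hypothetical page reusing the class names mentioned in the notes.
html = """
<html><head><title>Story</title></head>
<body>
<div class="bookcont"><ul><span><a class="sister" href="/c1">Chapter 1</a></span></ul></div>
<p><a class="sister" href="/c2">Chapter 2</a></p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select(".sister")                   # class selector
descendants = soup.select("body a")                 # space: any depth below <body>
children = soup.select("body > a")                  # ">": direct children only
chained = soup.select(".bookcont > ul > span > a")  # one level per ">"
title_text = soup.select("head > title")[0].text
first_href = soup.a["href"]                         # attribute of the first <a>

print(len(by_class), len(descendants), len(children), len(chained))
print(title_text, first_href)
```

Neither link is a direct child of body, which is why the "body > a" query comes back empty while "body a" finds both.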

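xpath('xpath expression')

./ starts from the current node

/ (leftmost) starts from the root node

/ (not leftmost) steps down one level, equivalent to > in bs4

// (leftmost) starts matching at any level of the document

// (not leftmost) spans multiple levels, equivalent to the space in bs4

Attribute-based location: //tag[@attrName='attrVal']

Index-based location: indexes start at 1, not 0, for example //tag[@attrName='attrVal']/p[3]

xpath('//div[@class="tang"]//li[5]/a/text()')[0] (xpath returns a list, so [0] takes the first result)

/text() returns the text directly inside the node

//text() returns the text at every level under the node

/@attrName returns the attribute value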
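The expressions above can be checked against a small tree. A sketch assuming lxml is installed; the page content is invented and only reuses the class name tang from the notes:

```python
from lxml import etree

# Hypothetical page.
page_text = """
<html><body>
<div class="tang">
  <ul>
    <li><a href="/p1">one</a></li>
    <li><a href="/p2">two</a></li>
    <li><a href="/p3">three <b>bold</b></a></li>
  </ul>
</div>
</body></html>
"""
tree = etree.HTML(page_text)

# Attribute location plus a 1-based index: the second <li>.
second = tree.xpath('//div[@class="tang"]//li[2]/a/text()')[0]

# /text() keeps only the direct text; //text() also descends into <b>.
direct_text = tree.xpath('//div[@class="tang"]//li[3]/a/text()')
all_text = tree.xpath('//div[@class="tang"]//li[3]/a//text()')

# /@attrName returns attribute values.
hrefs = tree.xpath('//li/a/@href')

# A leading / walks down from the root, one level per step.
divs = tree.xpath('/html/body/div')

# ./ is relative to the node xpath is called on.
first_li = tree.xpath('//li')[0]
relative = first_li.xpath('./a/text()')

print(second, direct_text, all_text, hrefs, len(divs), relative)
```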

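Other

Chinese documentation (Python standard library, which includes re)

https://docs.python.org/zh-cn/3/library/index.html

Chinese documentation (Beautiful Soup 4)

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find

Chinese documentation (lxml)

https://www.w3cschool.cn/lxml/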

Parsing Romance of the Three Kingdoms with bs4

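from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup  # the 'lxml' parser used below requires the lxml package

# Start page: the chapter index of the book
url='https://so.gushiwen.cn/guwen/book_46653FD803893E4F7F702BCF1F7CCE17.aspx'
# Spoof the User-Agent
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 SE 2.X MetaSr 1.0'}
page_text=requests.get(url=url,headers=headers).text
soup=BeautifulSoup(page_text,'lxml')
#print(soup.div.div.div.a.text)  # 古诗文网
#print(soup.div.div.string)  # None
#print(soup.div.div.div.a['href'])
# Each <a> holds a chapter title and the link to that chapter
chapters=soup.select('.bookcont > ul > span > a')
print(chapters)
with open('sanguo.txt','w',encoding='utf-8') as fp:
    for a in chapters:
        # .text is used because .string is None when the tag has several children
        title=a.text
        # urljoin leaves absolute links unchanged and resolves relative ones against the start page
        link=urljoin(url,a['href'])
        page_content=requests.get(url=link,headers=headers).text
        soup2=BeautifulSoup(page_content,'lxml')
        content=soup2.find('div',class_='contson')
        if content is None:  # find() returns None when the div is missing
            continue
        fp.write(title+':\n'+content.text+'\n')
        print(title,'written successfully!')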

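Import the packages

Specify the URL

Spoof the User-Agent

Fetch the start page

Create the soup object and parse the chapter titles and content URLs

Fetch each chapter's page

Create soup2 and parse the chapter content

Persist the data by writing it to a local TXT file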

Regex parsing: scraping pictures of beautiful women from a web page

import os.path
import requests
import re  # import the packages
# Spoof the User-Agent
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 SE 2.X MetaSr 1.0'}
# Folder for storing the images
folder='./m
