Python Web Scraping (Study Notes) - Python

Python Web Scraping (Study Notes) (Part 3)

2023-07-23 13:45:23

Tags: Python, web scraping, study notes

页...')

    # This fragment runs inside the page loop: a failed page is skipped.
    try:
        OnePage(i)
    except Exception:
        continue
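The fragment above skips any page whose download raises an error and moves on to the next one. The loop around it is not shown on this page, so the following is only a minimal sketch of how such a loop might look. The names `one_page` and `crawl` and the simulated failure on page 3 are hypothetical stand-ins, not the author's code.

```python
def one_page(page_number):
    """Hypothetical stand-in for OnePage(i): pretend that page 3 fails."""
    if page_number == 3:
        raise RuntimeError(f"page {page_number} could not be fetched")
    return f"page {page_number} ok"


def crawl(first_page, last_page):
    """Process each page in turn, skipping pages that raise an error."""
    done, failed = [], []
    for i in range(first_page, last_page + 1):
        try:
            one_page(i)
        except Exception as exc:
            # Record the failure instead of silently dropping it, then go on.
            failed.append((i, str(exc)))
            continue
        done.append(i)
    return done, failed


done, failed = crawl(1, 5)
print(done)
print(failed)
```

Catching `Exception` rather than using a bare `except:` keeps Ctrl+C working during a long crawl, and recording the failed page numbers makes it possible to retry them later.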

Fetching the images on a web page with XPath parsing

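import os

# Import packages
import requests
from lxml import etree

# Target URL
url = 'https://pic.netbian.com/4kmeinv/'

# Spoof the User-Agent so the request looks like it comes from a browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36 SE 2.X MetaSr 1.0'}

# Read the full page text
page_text = requests.get(url=url, headers=headers).text

# Fixing garbled Chinese text, method 2: set the encoding before reading .text.
# If this is used, drop the method 1 line inside the loop below.
# response = requests.get(url=url, headers=headers)
# response.encoding = 'gbk'  # must be the page's actual charset; method 1 below assumes GBK
# page_text = response.text

# Initialize the etree object
tree = etree.HTML(page_text)

# Parse out the list of all <li> elements
li_list = tree.xpath('//div[@class="slist"]//li')

# Create the output folder if it does not exist yet
folder = './4kMeinv'
if not os.path.exists(folder):
    os.mkdir(folder)

# Iterate over the list of elements
for li in li_list:
    # Extract the image URL from the element and complete it to a full URL
    src = 'https://pic.netbian.com' + li.xpath('./a/img/@src')[0]
    # Read the image's binary data
    data = requests.get(url=src, headers=headers).content
    # Build the file name from the alt text
    name = li.xpath('./a/img/@alt')[0] + '.jpg'
    # Fixing garbled Chinese text, method 1: undo the wrong decode, then decode as GBK
    name = name.encode('iso-8859-1').decode('gbk')
    path = folder + '/' + name
    # Persist the data to disk
    with open(path, 'wb') as fp:
        fp.write(data)
    print(name, 'saved successfully!')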
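Why does `name.encode('iso-8859-1').decode('gbk')` repair the file name? As far as I understand it, when the response headers carry no charset, `requests` falls back to ISO-8859-1 for `.text`, so the GBK bytes of the page are decoded with the wrong codec. Because ISO-8859-1 maps every byte to exactly one character, that wrong decode can be reversed without loss. The sketch below reproduces the effect offline. The sample string is an arbitrary example, not a name taken from the site.

```python
# A name as the server would send it: GBK-encoded bytes.
original = "高清壁纸"
raw_bytes = original.encode("gbk")

# What you get when those bytes are decoded as ISO-8859-1 by mistake.
garbled = raw_bytes.decode("iso-8859-1")

# Method 1 from the notes: undo the wrong decode, then decode as GBK.
repaired = garbled.encode("iso-8859-1").decode("gbk")

print(repr(garbled))
print(repaired)
```

The same round trip only works when the wrong codec was ISO-8859-1 (or another codec that preserves every byte). If the page had already been decoded with a lossy codec, the original bytes could not be recovered this way.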

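Steps in the script, in order:

Import packages
Specify the URL
Spoof the User-Agent
Read the full page text
Initialize etree
XPath parsing
Second-level XPath parsing
Get the image URL
Download the image data (binary)
Get the file name
Persist the data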
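The two parsing steps can be tried without touching the network. The sketch below runs the same XPath expressions against a small inline HTML string: a first query on the whole tree selects the `<li>` elements, and a second, relative query (starting with `./`) runs on each `<li>`. The sample markup and the helper name `extract_images` are assumptions made for illustration and do not reproduce the real site. `urljoin` is swapped in for the string concatenation used in the notes.

```python
from urllib.parse import urljoin

from lxml import etree

# Hypothetical stand-in for a downloaded listing page (not the real site markup).
SAMPLE_HTML = """
<html><body>
<div class="slist"><ul>
  <li><a href="/tupian/1.html"><img src="/uploads/a.jpg" alt="first"></a></li>
  <li><a href="/tupian/2.html"><img src="/uploads/b.jpg" alt="second"></a></li>
</ul></div>
</body></html>
"""

BASE_URL = "https://pic.netbian.com"


def extract_images(page_text, base_url=BASE_URL):
    """Return (absolute image URL, file name) pairs found in the listing."""
    tree = etree.HTML(page_text)
    items = []
    # First-level parse: every <li> inside the listing container.
    for li in tree.xpath('//div[@class="slist"]//li'):
        # Second-level parse: queries relative to this <li> only.
        src = li.xpath('./a/img/@src')
        alt = li.xpath('./a/img/@alt')
        if not src or not alt:
            continue  # skip entries without an image instead of raising IndexError
        items.append((urljoin(base_url, src[0]), alt[0] + '.jpg'))
    return items


images = extract_images(SAMPLE_HTML)
for image_url, file_name in images:
    print(image_url, file_name)
```

Checking that the XPath result is non-empty before indexing `[0]` avoids an `IndexError` on list items that contain no image.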

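Common HTML tags and their full names

HTML tag	Full English name	Meaning
a	Anchor	Anchor (hyperlink)
abbr	Abbreviation	Abbreviation
acronym	Acronym	Acronym formed from initial letters (obsolete in HTML5)
address	Address	Address (contact information)
dfn	Definition	Defining instance of a term
kbd	Keyboard	Keyboard input (text)
samp	Sample	Sample output (text)
var	Variable	Variable (text)
tt	Teletype	Teletype, monospaced (text; obsolete in HTML5)
code	Code	Source code (text)
pre	Preformatted	Preformatted (text)
blockquote	Block Quotation	Block quotation
cite	Citation	Citation
q	Quotation	Inline quotation
strong	Strong	Strong importance (text)
em	Emphasized	Emphasis (text)
b	Bold	Bold (text)
i	Italic	Italic (text)
big	Big	Larger (text; obsolete in HTML5)
small	Small	Smaller (text)
sup	Superscripted	Superscript (text)
sub	Subscripted	Subscript (text)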
bdo
