Learning Python: the urllib and urllib2 modules - Linux Programming Basics

I. The urllib module

In Python 2, the urllib module can open arbitrary URLs.

1. urlopen() opens a URL and returns a file-like object, which supports the usual file operations.

In [308]: import urllib

In [309]: file=urllib.urlopen('

In [310]: file.readline()

Out[310]: '\xe7\x99\xbe\xe5\xba\xa6\xe4\xb8

The returned object supports read(), readlines(), fileno() and close(), and it also provides info(), getcode() and geturl():

In [337]: file.info()
Out[337]: <httplib.HTTPMessage instance at 0x2394a70>

In [338]: file.getcode()
Out[338]: 200

In [339]: file.geturl()
Out[339]: 'http://www.baidu.com/'

2. urlretrieve() saves the HTML page behind a URL to a file. It returns a tuple of the local filename and the response headers.

In [404]: filename=urllib.urlretrieve('http://www.baidu.com/',filename='/tmp/baidu.html')

In [405]: type (filename)
Out[405]: <type 'tuple'>

In [406]: filename[0]
Out[406]: '/tmp/baidu.html'

In [407]: filename
Out[407]: ('/tmp/baidu.html', <httplib.HTTPMessage instance at 0x23ba878>)

In [408]: filename[1]
Out[408]: <httplib.HTTPMessage instance at 0x23ba878>

3. urlcleanup() clears the cache left behind by urlretrieve().

In [454]: filename=urllib.urlretrieve('http://www.baidu.com/',filename='/tmp/baidu.html')

In [455]: urllib.urlcleanup()

4. urllib.quote() and urllib.quote_plus() percent-encode a URL. By default quote() leaves / unescaped, while quote_plus() escapes it as well and turns spaces into +.

In [483]: urllib.quote('http://www.baidu.com')
Out[483]: 'http%3A//www.baidu.com'

In [484]: urllib.quote_plus('http://www.baidu.com')
Out[484]: 'http%3A%2F%2Fwww.baidu.com'

5. urllib.unquote() and urllib.unquote_plus() decode an encoded URL.

In [514]: urllib.unquote('http%3A//www.baidu.com')
Out[514]: 'http://www.baidu.com'

In [515]: urllib.unquote_plus('http%3A%2F%2Fwww.baidu.com')
Out[515]: 'http://www.baidu.com'

6. urllib.urlencode() encodes key/value pairs into a query string, joining the pairs with &. Combined with urlopen() it can be used for both POST and GET requests. The example below sends a GET request by appending the encoded string after a ?.

In [560]: import urllib

In [561]: params=urllib.urlencode({'spam':1,'eggs':2,'bacon':0})

In [562]: f=urllib.urlopen("http://python.org/query?%s" % params)

In [563]: f.readline()
Out[563]: '<!doctype html>\n'

In [564]: f.readlines()
Out[564]: ['\n', '\n', '\n', '<html class="no-js" lang="en" dir="ltr"> \n', '\n',

II. The urllib2 module

urllib2 adds features that urllib lacks, such as basic authentication, redirect handling and cookies.

https://docs.python.org/2/library/urllib2.html
https://docs.python.org/2/howto/urllib2.html

In [566]: import urllib2

In [567]: f=urllib2.urlopen('http://www.python.org/')

In [568]: print f.read(100)
--------> print(f.read(100))
<!doctype html>

This opens the Python home page and prints the first 100 bytes of the response.
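The file-like behaviour of the object returned by urlopen() can also be tried without network access. The following is only a sketch under a few assumptions that are not in the original: it reads a temporary local file through a file:// URL instead of an HTTP URL, it assumes a POSIX-style absolute path, and the page content is made up. The import fallback lets the same code run on Python 2 (urllib2) and Python 3 (urllib.request).

```python
import os
import tempfile

try:
    from urllib2 import urlopen           # Python 2
except ImportError:
    from urllib.request import urlopen    # Python 3

# Write a small made-up page to a temporary file so no network is needed.
fd, path = tempfile.mkstemp(suffix='.html')
with os.fdopen(fd, 'wb') as handle:
    handle.write(b'<!doctype html>\n<html>\n</html>\n')

# Assumption: path is a POSIX-style absolute path, so 'file://' + path is a valid URL.
response = urlopen('file://' + path)
try:
    first_line = response.readline()   # one line, as with a regular file
    head = response.read(6)            # the next 6 bytes only
    remaining = response.readlines()   # whatever is left, split into lines
finally:
    response.close()
    os.remove(path)

print(first_line)
print(head)
print(remaining)
```

The response is consumed sequentially, so each call picks up where the previous one stopped, just as with an ordinary file opened in binary mode.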
HTTP is built on requests and responses: the client sends a request and the server answers it. urllib2 represents the outgoing request as a Request object, and passing that object to urlopen() returns a response object. The response is file-like, so it can be handled the same way as a file.

In [630]: import urllib2

In [631]: req=urllib2.Request('http://www.baidu.com')

In [632]: response=urllib2.urlopen(req)

In [633]: the_page=response.read()

In [634]: the_page
Out[634]: '<!DOCTYPE html><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><link rel="dns-prefetch" href="//s1.bdstatic.com"/><link rel="dns-prefetch" href="//t1.baidu.com"/><link rel="dns-prefetch" href="//t2.baidu.com"/><link rel="dns-prefetch" href="//t3.baidu.

A common requirement is to send data to a URL with a POST request. Passing the urlencoded data as the second argument to Request() does this.

In [763]: import urllib

In [764]: import urllib2

In [765]: url='http://xxxxxx/login.php'

In [766]: values={'ver' : '1.7.1', 'email' : 'xxxxx', 'password' : 'xxxx', 'mac' : '111111111111'}

In [767]: data=urllib.urlencode(values)

In [768]: req=urllib2.Request(url,data)

In [769]: response=urllib2.urlopen(req)

In [770]: the_page=response.read()

In [771]: the_page

If no data argument is passed to urllib2.Request(), urllib2 sends a GET request. The practical difference between the two is that POST requests often have side effects: they change the state of the system in some way. Data can also be sent with a GET request, by appending the encoded string to the URL after a ?.

In [55]: import urllib2

In [56]: import urllib

In [57]: url='http://xxx/login.php'

In [58]: values={'ver' : 'xxx', 'email' : 'xxx', 'password' : 'xxx', 'mac' : 'xxx'}

In [59]: data=urllib.urlencode(values)

In [60]: full_url=url + '?' + data

In [61]: the_page=urllib2.urlopen(full_url)

In [63]: the_page.read()
Out[63]: '{"result":0,"data":0}'
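The difference between the GET and POST variants can be inspected on the Request objects themselves, before anything is sent. This is a minimal sketch rather than the author's code: the host example.com and the form values are hypothetical placeholders, and a list of pairs is used instead of a dict only so that the field order is predictable. The import fallback keeps it runnable on both Python 2 and Python 3.

```python
try:
    from urllib import urlencode          # Python 2
    from urllib2 import Request
except ImportError:
    from urllib.parse import urlencode    # Python 3
    from urllib.request import Request

url = 'http://example.com/login.php'      # hypothetical endpoint
values = [('ver', '1.7.1'), ('email', 'user@example.com')]  # made-up form fields

data = urlencode(values)                  # 'ver=1.7.1&email=user%40example.com'

# GET: the encoded pairs travel in the URL, after a '?'.
get_request = Request(url + '?' + data)

# POST: the encoded pairs travel in the request body (bytes on Python 3).
post_request = Request(url, data.encode('ascii'))

print(get_request.get_method())           # GET
print(get_request.get_full_url())
print(post_request.get_method())          # POST
```

Nothing here touches the network; passing either object to urlopen() is what would actually send the request.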
By default urllib2 identifies itself to the server as Python-urllib/2.6 (the suffix follows the Python version). To present a different client, add a User-Agent HTTP header to the request.

In [107]: import urllib

In [108]: import urllib2

In [109]: url='http://xxx/login.php'

In [110]: user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

In [111]: values={'ver' : 'xxx', 'email' : 'xxx', 'password' : 'xxx', 'mac' : 'xxxx'}

In [112]: headers={'User-Agent' : user_agent}

In [114]: data=urllib.urlencode(values)

In [115]: req=urllib2.Request(url,data,headers)

In [116]: response=urllib2.urlopen(req)

In [117]: the_page=response.read()

In [118]: the_page

When the given URL cannot be reached, urlopen() raises URLError. When the server is reached but cannot deliver the requested content, urlopen() raises HTTPError.

#!/usr/bin/python
from urllib2 import Request,urlopen,URLError,HTTPError

req=Request('http://10.10.41.42/index.html')
try:
    response=urlopen(req)
except HTTPError as e:
    print 'The server couldn\'t fulfill the request.'
    print 'Error code:',e.code
except URLError as e:
    print 'We failed to reach a server.'
    print 'Reason:',e.reason
else:
    print "Everything is fine"

Note that the HTTPError handler must be written before the URLError handler. HTTPError is a subclass of URLError, so an except URLError clause placed first would catch HTTP errors as well.

Alternatively, catch URLError alone and inspect the attributes of the exception:

#!/usr/bin/python
from urllib2 import Request,urlopen,URLError,HTTPError

req=Request('http://10.10.41.42')
try:
    response=urlopen(req)
except URLError as e:
    # Test 'code' first: only HTTPError carries it, and later Python 2.7
    # releases give HTTPError a 'reason' attribute as well.
    if hasattr(e,'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code:',e.code
    elif hasattr(e,'reason'):
        print 'We failed to reach a server.'
        print 'Reason:',e.reason
else:
    print "Everything is fine"

The hasattr() function tests whether an object has a given attribute.
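The ordering rule for the two handlers can be checked without contacting any server by raising the exceptions by hand. This is an illustrative sketch only: the helper names handlers_in_right_order and handlers_in_wrong_order, the URL and the error messages are invented for the demonstration. The import fallback lets it run on Python 2 and Python 3.

```python
import io

try:
    from urllib2 import HTTPError, URLError        # Python 2
except ImportError:
    from urllib.error import HTTPError, URLError   # Python 3


def handlers_in_right_order(exc):
    """Raise exc and report which handler catches it (HTTPError listed first)."""
    try:
        raise exc
    except HTTPError as e:
        return 'HTTPError %d' % e.code
    except URLError as e:
        return 'URLError %s' % e.reason


def handlers_in_wrong_order(exc):
    """Same as above, but with URLError listed first."""
    try:
        raise exc
    except URLError:
        return 'URLError'
    except HTTPError:
        return 'HTTPError'   # never reached: the clause above already matched


# Hand-built exceptions standing in for what urlopen() would raise.
http_exc = HTTPError('http://example.com/missing', 404, 'Not Found', None, io.BytesIO())
url_exc = URLError('connection refused')

print(issubclass(HTTPError, URLError))      # True
print(handlers_in_right_order(http_exc))    # HTTPError 404
print(handlers_in_right_order(url_exc))     # URLError connection refused
print(handlers_in_wrong_order(http_exc))    # URLError
```

With URLError listed first, the 404 is reported as a generic URL error and the status code is never examined, which is the situation the ordering rule avoids.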