

news 2025/11/22 20:40:23

Python Web Scraping Series 03: requests Library Examples

Contents
    References
    1 Basic use of Requests
        1.1 Installing and using Requests
            1.1.1 Installing Requests
            1.1.2 Introduction to Requests
            1.1.3 The three usual steps of using Requests
            1.1.4 Common methods and attributes of a response
    2 Requests usage examples
        2.1 GET request with parameters and headers
        2.2 POST request with parameters and headers
        2.3 POST request with parameters, a custom User-Agent and a file upload
        2.4 Getting cookies
        2.5 Keeping a session to simulate a login
        2.6 Simulated login to gushiwen.cn
        2.7 The Chaojiying captcha-solving platform
        2.8 Requests combined with lxml
        2.9 Scraping free proxies
        2.10 Setting a proxy IP

References
    https://blog.csdn.net/Faith_Lzt/article/details/124933765
    https://www.bilibili.com/video/BV1Db4y1m7Ho?

1 Basic use of Requests

The official documentation introduces Requests as "an elegant and simple HTTP library for Python, built for human beings". Requests supports Keep-Alive, persistent sessions with cookies, SSL verification, file upload and download, and much more. This section covers installation and basic usage, and tries to get the reader productive through suitable examples. More advanced operations are described on the official site.

1.1 Installing and using Requests

1.1.1 Installing Requests

Install Requests:

    pip install requests==2.27.1 -i https://pypi.tuna.tsinghua.edu.cn/simple

1.1.2 Introduction to Requests

Requests is written in Python and built on urllib3. It is far more convenient than the standard-library urllib, saves a great deal of work, and fully covers HTTP testing needs. This section uses requests for the basic operations of fetching data.

Two core objects

Requests has two core objects, Request and Response. Request sends the request. Response holds everything the server returns and also carries the Request that was sent.

    r = requests.get(url)

In the code above, requests.get(url) builds a Request that asks the server for a resource, and the returned object r is a Response containing that resource. The commonly used attributes of the Response object are demonstrated in the following example and in section 1.1.4.

A Response example:

    # Send a GET request with requests
    import requests

    response = requests.get("https://www.baidu.com")
    # Type of the response
    print("type(response)-->", type(response))
    # HTTP status code
    print("response.status_code-->", response.status_code)
    # The Response body as a string; this comes out garbled
    print(response.text[310:352])
    # Set the encoding on the Response; the output is no longer garbled
    response.encoding = "utf-8"
    print(response.text[310:352])
    # Take the raw bytes of the Response and decode them as UTF-8
    print(response.content.decode("UTF-8")[310:352])

The output is:

    type(response)--> <class 'requests.models.Response'>
    response.status_code--> 200
    <title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é“</title>
    <title>百度一下，你就知道</title></head> <body link
    <title>百度一下，你就知道</title></head> <body link

The output shows that printing response.text directly produces garbled text. Requests falls back to ISO-8859-1 when a text response declares no charset in its Content-Type header, which is what happens here. After setting response.encoding = "utf-8" the text is no longer garbled. Taking response.content and decoding it explicitly also works.

Passing response.text to lxml or bs4, as in the other articles of this series, completes the path from fetching data to parsing it.

Requests has seven commonly used functions: request(), get(), head(), post(), put(), patch() and delete(). requests.request(method, url, **kwargs) is the core one and all the others are built on it. When method is GET it is equivalent to get():

    requests.request("GET", url, **kwargs)
    requests.get(url, **kwargs)

The url parameter is the path to request. **kwargs are optional parameters that control the request; the ones used in this article are params, data, headers, files and proxies.

POST request

    import requests

    post1 = requests.post("http://httpbin.org/post")
    print(post1.text)
    print("*" * 10)

Output:

    {
      "args": {}, "data": "", "files": {}, "form": {},
      "headers": {
        "Accept": "*/*", "Accept-Encoding": "gzip, deflate, br",
        "Content-Length": "0", "Host": "httpbin.org",
        "User-Agent": "python-requests/2.27.1",
        "X-Amzn-Trace-Id": "Root=1-6520bc5a-1890396b6d0ca83b3674482d"
      },
      "json": null, "origin": "120.194.158.180",
      "url": "http://httpbin.org/post"
    }
    **********

PUT request

    import requests

    put1 = requests.put("http://httpbin.org/put")
    print(put1.text)
    print("*" * 10)

Output:

    {
      "args": {}, "data": "", "files": {}, "form": {},
      "headers": {
        "Accept": "*/*", "Accept-Encoding": "gzip, deflate, br",
        "Content-Length": "0", "Host": "httpbin.org",
        "User-Agent": "python-requests/2.27.1",
        "X-Amzn-Trace-Id": "Root=1-6520bc5b-5d05ea907c9d15926a676af3"
      },
      "json": null, "origin": "120.194.158.180",
      "url": "http://httpbin.org/put"
    }
    **********

DELETE request

    import requests

    delete1 = requests.delete("http://httpbin.org/delete")
    print(delete1.text)
    print("*" * 10)

Output:

    {
      "args": {}, "data": "", "files": {}, "form": {},
      "headers": {
        "Accept": "*/*", "Accept-Encoding": "gzip, deflate, br",
        "Content-Length": "0", "Host": "httpbin.org",
        "User-Agent": "python-requests/2.27.1",
        "X-Amzn-Trace-Id": "Root=1-6520bc5b-0863e0b34612239b4ebf0c21"
      },
      "json": null, "origin": "120.194.158.180",
      "url": "http://httpbin.org/delete"
    }
    **********

GET request

    import requests

    get1 = requests.get("http://httpbin.org/get")
    print(get1.text)
    print("*" * 10)

Output:

    {
      "args": {},
      "headers": {
        "Accept": "*/*", "Accept-Encoding": "gzip, deflate, br",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.27.1",
        "X-Amzn-Trace-Id": "Root=1-6520bc5c-7c53173b5cb5d0b936f01fe4"
      },
      "origin": "120.194.158.180",
      "url": "http://httpbin.org/get"
    }
    **********

1.1.3 The three usual steps of using Requests

Step 1: import the module

    import requests

Step 2: send the request

    url = "http://httpbin.org/get"
    r = requests.get(url=url)

http://httpbin.org is a good test site. Its backend is an HTTP Request & Response Service written with Python Flask, mainly intended for testing HTTP libraries. You can send it a request and it returns your request according to fixed rules, or you can visit it directly in a browser. This section calls it directly.

Step 3: print the response

    print("--Request header information--")
    print(r.text)

Output:

    --Request header information--
    {
      "args": {},
      "headers": {
        "Accept": "*/*", "Accept-Encoding": "gzip, deflate, br",
        "Host": "httpbin.org",
        "User-Agent": "python-requests/2.27.1",
        "X-Amzn-Trace-Id": "Root=1-62837c9a-6119e7f618e55b0f0de1d0d2"
      },
      "origin": "120.216.231.238",
      "url": "http://httpbin.org/get"
    }

The output shows that the User-Agent sent with the request identifies the client as the requests library. This is one reason a server may block a scraper.

1.1.4 Common methods and attributes of a response

    import requests

    response = requests.get("http://httpbin.org/get")
    print(response.json(), end="***\n")       # body parsed as JSON, a dict
    print(response.content, end="***\n")      # body as raw binary, bytes
    print(response.text, end="***\n")         # body as a string, str
    print(response.url, end="***\n")          # URL of the request
    print(response.status_code, end="***\n")  # status code of this request
    print(response.reason, end="***\n")       # reason phrase for the status code
    print(response.headers, end="***\n")      # response headers
    print(response.cookies, end="***\n")      # cookie information
    print(response.raw, end="***\n")          # raw response body
    print(response.encoding, end="***\n")     # encoding
    print("*" * 10)

By now you should have a working understanding of the requests library. The rest of the article covers its use through examples.

2 Requests usage examples

The previous section covered the Response object and sending a simple GET request. This section works through the Requests library by example.

2.1 GET request with parameters and headers

    # urllib
    # (1) one type and six methods
    # (2) GET request
    # (3) POST request: Baidu Translate
    # (4) ajax GET request
    # (5) ajax POST request
    # (6) cookie login to Weibo
    # (7) proxies

    # requests
    # (1) one type and six attributes
    # (2) GET request
    # (3) POST request
    # (4) proxies
    # (5) cookies and captchas

    import requests

    url = "https://www.baidu.com/s"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
    }
    data = {"wd": "北京"}

    # url     the resource path to request
    # params  the query parameters
    # kwargs  a dict
    response = requests.get(url=url, params=data, headers=headers)
    content = response.text
    print(content[0:400])
    print(response.content.decode("utf-8")[0:400])

    # Summary
    # (1) pass the parameters with params
    # (2) the parameters do not need urlencode
    # (3) no custom request object is needed
    # (4) the ? at the end of the resource path is optional

2.2 POST request with parameters and headers

    import requests
    import json

    url = "https://fanyi.baidu.com/sug"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
    }
    data = {"kw": "eye"}

    # url     the request address
    # data    the request parameters
    # kwargs  a dict
    response = requests.post(url=url, data=data, headers=headers)
    content = response.text
    print(content)
    obj = json.loads(content)
    print(obj)

    # Summary
    # (1) a POST request needs no manual encoding or decoding
    # (2) the parameters of a POST request go in data
    # (3) no custom request object is needed

2.3 POST request with parameters, a custom User-Agent and a file upload

    import requests

    data = {"city": "beijing"}  # form data
    headers = {
        # custom User-Agent
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) "
                      "Gecko/20100101 Firefox/100.0"
    }
    # Read a local image; the with block closes the file after the upload
    with open("01.gif", "rb") as f:
        files = {"file": f}
        # Send the file, the form data and the headers in one POST request
        response = requests.post("http://httpbin.org/post", files=files,
                                 data=data, headers=headers)
    print(response.text)

The output is:

    {
      "args": {}, "data": "",
      "files": {"file": "data:application/octet-stream;base64,R0lGODl...HEgAA7"},
      "form": {"city": "beijing"},
      "headers": {
        "Accept": "*/*", "Accept-Encoding": "gzip, deflate, br",
        "Content-Length": "1176",
        "Content-Type": "multipart/form-data; boundary=92043b0ae4dec43bbb54919200f6033a",
        "Host": "httpbin.org",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0",
        "X-Amzn-Trace-Id": "Root=1-62848095-0a1878c6321f07ef5ccc9047"
      },
      "json": null, "origin": "120.216.231.238",
      "url": "http://httpbin.org/post"
    }

The output shows that the server received both the file data and the request headers. The upload itself is sent as multipart/form-data; the base64 string in the files field is how httpbin echoes the binary image back.

2.4 Getting cookies

    import requests

    response = requests.get("https://www.zhihu.com")
    print(response.cookies)
    for key, value in response.cookies.items():
        print(key + "=" + value)

The output is:

    <RequestsCookieJar[<Cookie _xsrf=pGKfNvOikSpo8NRnLCbNKfAuLOLWyhLb for .zhihu.com/>]>
    _xsrf=pGKfNvOikSpo8NRnLCbNKfAuLOLWyhLb

Here response.cookies returns the cookie set by the Zhihu home page. For sites that require login or registration, this can replace copying cookies by hand and is a convenient approach.

2.5 Keeping a session to simulate a login

    import requests

    session = requests.Session()  # create a session
    session.get("http://httpbin.org/cookies/set/name/123")  # set the cookie name=123 through the session
    response = session.get("http://httpbin.org/cookies")    # read the cookies through the same session
    print(response.text)  # print the response

Output:

    {"cookies": {"name": "123"}}

The output shows that the second GET request, sent through the same session, sees the cookie set by the first one. As with a browser, the server treats requests made through one session as the same user visiting continuously, which suits scraping scenarios that must stay logged in.

2.6 Simulated login to gushiwen.cn

Reference: https://www.bilibili.com/video/BV1Db4y1m7Ho

    # Log in, then enter the main page.

    # Inspecting the login endpoint shows that logging in needs many parameters:
    # __VIEWSTATE: /m1O5dxmOo7f1qlmvtnyNyhhaUrWNVTs3TMKIsm1lvpIgs0WWWUCQHl5iMrvLlwnsqLUN6Wh1aNpitc4WnOt0So3k6UYdFyqCPI6jWSvC8yBA1Q39I7uuR4NjGo
    # __VIEWSTATEGENERATOR: C93BE1AE
    # from: http://so.gushiwen.cn/user/collect.aspx
    # email: 465XXX@qq.com
    # pwd: action
    # code: PId7
    # denglu: 登录

    # __VIEWSTATE, __VIEWSTATEGENERATOR and code change between requests.

    # Difficulties:
    # (1) __VIEWSTATE and __VIEWSTATEGENERATOR: data that is not visible on the
    #     page is usually in the page source. Both values are there, so we
    #     fetch the source and parse it to get them.
    # (2) the captcha
    import os

    import requests
    from bs4 import BeautifulSoup

    # URL of the login page
    url = "https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
    }

    # Fetch the page source
    response = requests.get(url=url, headers=headers)
    content = response.text

    # Parse the source to get __VIEWSTATE and __VIEWSTATEGENERATOR
    soup = BeautifulSoup(content, "lxml")
    viewstate = soup.select("#__VIEWSTATE")[0].attrs.get("value")
    viewstategenerator = soup.select("#__VIEWSTATEGENERATOR")[0].attrs.get("value")

    # Get the captcha image URL
    code = soup.select("#imgCode")[0].attrs.get("src")
    code_url = "https://so.gushiwen.cn" + code

    # Pitfall: do not download the captcha with a separate, session-less call such as
    #   import urllib.request
    #   urllib.request.urlretrieve(url=code_url, filename="code.jpg")
    # requests.session() returns an object that keeps the following requests in one session.
    session = requests.session()
    # Content behind the captcha URL
    response_code = session.get(code_url)
    # Use the binary data here because we are downloading an image
    content_code = response_code.content
    # "wb" mode writes binary data to the file
    with open("./code.jpg", "wb") as fp:
        print(os.getcwd())
        fp.write(content_code)

    # With the captcha image saved locally, look at it and type its value into
    # the console; that value becomes the code parameter of the login request.
    code_name = input("Enter the captcha: ")

    # Submit the login form
    url_post = "https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx"
    data_post = {
        "__VIEWSTATE": viewstate,
        "__VIEWSTATEGENERATOR": viewstategenerator,
        "from": "http://so.gushiwen.cn/user/collect.aspx",
        "email": "465XXX@qq.com",
        "pwd": "action",
        "code": code_name,
        "denglu": "登录",
    }
    # Post to url_post (the original passed url here and left url_post unused)
    response_post = session.post(url=url_post, headers=headers, data=data_post)
    content_post = response_post.text
    with open("gushiwen.html", "w", encoding="utf-8") as fp:
        fp.write(content_post)

    # Difficulties
    # 1 hidden form fields
    # 2 the captcha

2.7 The Chaojiying captcha-solving platform

Reference: https://www.bilibili.com/video/BV1Db4y1m7Ho?p=89 http://www.chaojiying.com/

    #!/usr/bin/env python
    # coding:utf-8

    import requests
    from hashlib import md5


    class Chaojiying_Client(object):

        def __init__(self, username, password, soft_id):
            self.username = username
            password = password.encode("utf8")
            self.password = md5(password).hexdigest()
            self.soft_id = soft_id
            self.base_params = {
                "user": self.username,
                "pass2": self.password,
                "softid": self.soft_id,
            }
            self.headers = {
                "Connection": "Keep-Alive",
                "User-Agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)",
            }

        def PostPic(self, im, codetype):
            """
            im: image bytes
            codetype: question type, see http://www.chaojiying.com/price.html
            """
            params = {"codetype": codetype}
            params.update(self.base_params)
            files = {"userfile": ("ccc.jpg", im)}
            r = requests.post("http://upload.chaojiying.net/Upload/Processing.php",
                              data=params, files=files, headers=self.headers)
            return r.json()

        def ReportError(self, im_id):
            """
            im_id: image ID of the question being reported as wrong
            """
            params = {"id": im_id}
            params.update(self.base_params)
            r = requests.post("http://upload.chaojiying.net/Upload/ReportError.php",
                              data=params, headers=self.headers)
            return r.json()


    if __name__ == "__main__":
        # Generate a software ID in the user center and use it in place of 96001
        chaojiying = Chaojiying_Client("超级鹰用户名", "超级鹰用户名的密码", "96001")
        # Replace a.jpg with the path of a local image; on Windows // is sometimes needed
        im = open("a.jpg", "rb").read()
        # 1902 is the captcha type, see the price list on the official site
        print(chaojiying.PostPic(im, 1902))

2.8 Requests combined with lxml

This example reads the Baidu News home page with Requests. Open the browser's developer tools with F12 to determine the page structure, then parse the page with lxml to get the news titles and links. As shown in the figure, the hot news items all sit inside the div tag with id pane-news. That div holds several ul tags, each containing some news items. We take the first ul child of the div and then its li tags. Each li contains an a tag whose text is the news title and whose href is the news link. The code is:

    import requests
    from lxml import etree

    response = requests.get("http://news.baidu.com/")  # request Baidu News
    response.encoding = "utf-8"  # set the response encoding
    selector = etree.HTML(response.text)  # hand the response data to etree
    bd_news = selector.xpath("//div[@id='pane-news']/ul[1]/li")  # parse out the news list
    for bd_new in bd_news:
        print(bd_new.xpath(".//a[1]/text()")[0])  # print the news title
        print(bd_new.xpath(".//a[1]/@href")[0])   # print the news link

The output is:

    春耕春管不误时战疫情抢农时要两手抓两手稳
    https://wap.peoXXXXX692598/6565842
    西安博物院发挥“大学校”功能让文物活起来
    http://www.cnrXXXX18_525829518.shtml
    …
    张文宏领衔首个国产新冠药对抗奥密克戎研究结果正式发表
    http://baijiahXXXX8359231533

The output contains both the news titles and the news links. Requests has many more applications; this book does not go further into them and leaves them for the reader to explore.

2.9 Scraping free proxies

    from urllib import request
    import re, time, xlwt

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
    }
    url = "https://www.kuaidaili.com/free/inha/"

    # data-title values of the table cells, in the order they are written to the sheet:
    # IP, port, anonymity, location, type, response speed, last verified
    COLUMNS = ["IP", "PORT", "匿名度", "位置", "类型", "响应速度", "最后验证时间"]


    def auto_index():
        s = 1
        for loc in range(1, 50):
            index = str(loc) + "/"
            time.sleep(2)  # pause for 2 seconds first
            # The returned row number is passed back in on the next iteration
            s = urllib_request(url, index, s)
        worksheet.save("代理ip.xls")


    def urllib_request(url, index, rows):
        # Request the URL and hand the HTML to the parser
        baseurl = url + index
        print(baseurl)
        rq = request.Request(baseurl, headers=headers)  # add the request headers
        resp = request.urlopen(rq)  # send the request
        html = resp.read().decode("utf-8")
        return compile_html(html, rows)


    def compile_html(html, sheetLocation):
        style = re.compile(r"<tbody>\s(.*?)\s</table>", re.S)  # the IP table
        result = re.findall(style, html)[0]  # first match
        # One list of cell values per column
        columns = [re.findall('<td data-title="%s">(.*?)</td>' % title, result, re.S)
                   for title in COLUMNS]
        for i in range(len(columns[0])):  # half-open range
            row = [col[i] for col in columns]
            print(*row)
            print(sheetLocation, i)
            for j, value in enumerate(row):
                proxy_excel.write(sheetLocation, j, value)
            sheetLocation += 1  # move down one row after each record
        return sheetLocation  # next free row


    def write_excel(worksheet):
        # Create the sheet and write the header row
        proxy_excel = worksheet.add_sheet("proxySheet")
        for j, name in enumerate(["ip", "port", "安全性", "地区", "协议", "速度", "更新时间"]):
            proxy_excel.write(0, j, name)
        return proxy_excel


    worksheet = xlwt.Workbook(encoding="utf-8")
    proxy_excel = write_excel(worksheet)
    auto_index()

2.10 Setting a proxy IP

    import requests
    from bs4 import BeautifulSoup

    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36"
    }
    url_p = "60.182.35.230"
    port_p = "8888"
    # Proxy URLs with an explicit scheme
    proxies = {
        "http": "http://" + url_p + ":" + port_p,
        "https": "http://" + url_p + ":" + port_p,
    }
    url = "http://www.httpbin.org/get"
    req = requests.get(url, headers=header, proxies=proxies)
    html = req.text
    soup = BeautifulSoup(html, "lxml")
    print(soup.text)
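The proxies mapping from section 2.10 can be built with a small helper so that the scheme and port are handled in one place. This is a minimal sketch rather than part of the original article: build_proxies is a hypothetical name, and the address is the sample one from section 2.10, which is unlikely to still be live.

```python
def build_proxies(host: str, port: int, scheme: str = "http") -> dict:
    """Return a mapping in the shape that requests accepts through proxies=...

    build_proxies is a hypothetical helper, not a function of the requests library.
    """
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    proxy_url = f"{scheme}://{host}:{port}"
    # The same proxy is used for both http and https targets
    return {"http": proxy_url, "https": proxy_url}


# Sample address from section 2.10 (assumed to be a placeholder)
proxies = build_proxies("60.182.35.230", 8888)
print(proxies)
```

The returned dict can be passed as requests.get(url, proxies=proxies). Validating the port up front turns a typo into an immediate error instead of a connection timeout.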

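The garbled title in section 1.1.2 can be reproduced without any network access. The sketch below uses only the standard library and assumes the page bytes are UTF-8: it decodes them with ISO-8859-1, which is what response.text effectively does when the wrong encoding is in effect. The variable names are illustrative.

```python
title = "百度一下，你就知道"

raw = title.encode("utf-8")         # the bytes as a server would send them
garbled = raw.decode("iso-8859-1")  # wrong codec: every byte becomes one character
restored = raw.decode("utf-8")      # right codec: the original text comes back

print(len(title), len(raw), len(garbled))
print(restored)

# ISO-8859-1 maps each byte to exactly one character, so no data is lost and
# the garbled string can be turned back into the original text.
assert garbled.encode("iso-8859-1").decode("utf-8") == title
```

This is also why setting response.encoding or decoding response.content explicitly both fix the output: the bytes were never damaged, only read with the wrong codec.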