class: center, middle, inverse, title-slide

.title[
# Network Protocols HTTP and HTTPS, and the Requests Package
]
.subtitle[
## 🌱
]
.author[
### 吴燕丰
]
.institute[
### School of Finance, Jiangxi University of Finance and Economics
]
.date[
### 2020/10/30
]

---

### HTTP and HTTPS

HTTP: **H**yper**T**ext **T**ransfer **P**rotocol, the foundation of data communication on the World Wide Web.

HTTPS: **H**yper**T**ext **T**ransfer **P**rotocol **S**ecure, the **secure** version of HTTP.

.pull-left[
![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/83/Internet1.svg/220px-Internet1.svg.png)
]

.pull-right[
![](https://raw.githubusercontent.com/wsgzao/storage-public/master/img/20200510171326.png)
]

.footnote[
Image source: [https://wsgzao.github.io/post/https/](https://wsgzao.github.io/post/https/)
]

---

.center[
<img src='https://miro.medium.com/max/768/1*aSca7t_PcHAdPvv5n4K2NQ.png' style='width: 80%'>
]

---

### HTTP Communication Process

![](http://www.ituring.com.cn/figures/2014/PIC%20HTTP/05.d01z.003.png)

.footnote[
Image source: https://www.ituring.com.cn/book/1229
]

---

#### HTTPS Communication Process

.center[
![](https://www.runoob.com/wp-content/uploads/2017/05/201208201734403507.png)
]

.footnote[
<br>
Image source: [https://www.runoob.com/w3cnote/https-ssl-intro.html](https://www.runoob.com/w3cnote/https-ssl-intro.html)
]

---

### HTTPS Communication Process

![](https://raw.githubusercontent.com/wsgzao/storage-public/master/img/20200510171633.png)

.footnote[
Image source: [https://wsgzao.github.io/post/https/](https://wsgzao.github.io/post/https/)
]

---

### Further Reading on HTTP and HTTPS

.pull-left[
Some good references (all in Chinese):

- [《图解HTTP》](https://www.ituring.com.cn/book/1229) (book, *HTTP Illustrated*)
- [《图解HTTPS》学习博客](https://wsgzao.github.io/post/https/) (study blog on *HTTPS Illustrated*)
- [HTTPS 与 SSL 证书概要](https://www.runoob.com/w3cnote/https-ssl-intro.html) (overview of HTTPS and SSL certificates)
- [《图解HTTPS》学习博客](https://www.jianshu.com/p/f487b940d017) (another study blog on *HTTPS Illustrated*)
]

.pull-right[
<img width=350 src='https://file.ituring.com.cn/ScreenShow/170974f7ce622b38fa9c'>
]

.footnote[
Note: it is fine if this has not fully sunk in yet; a rough understanding is enough.
]

---

### The requests Module

Installation:

1. Open `Anaconda Prompt` (this assumes Anaconda is installed). If you are unsure how, see [Installing third-party modules](http://www.yyschools.com/courses/FinancialData/Presentation/Chapter02_FunctionModule_Python/Chapter02_FunctionModule_Python.html#7).
2. Type the following command and press <kbd>Enter</kbd> to install:

```bash
pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple
```

---

### Example

Request the [Jiangxi University of Finance and Economics home page](http://www.jxufe.edu.cn/) and print its text content:

```python
import requests

# Send a GET request to the university home page.
x = requests.get('http://www.jxufe.edu.cn/')
# Print the response body decoded as text.
print(x.text)
```

---

### Downloading an Annual Report PDF

Example download link: http://news.windin.com/ns/getatt.php?id=124484400&att_id=103006392&code=EC511722C83B

```python
import requests

href = 'http://news.windin.com/ns/getatt.php?id=124484400&att_id=103006392&code=EC511722C83B'
# Fetch the file, following any redirects.
r = requests.get(href, allow_redirects=True)
r.raise_for_status()  # stop early if the server returned an HTTP error status
# Write the raw bytes to disk; the with-block closes the file automatically.
with open('filename.pdf', 'wb') as f:
    f.write(r.content)
r.close()
```

---

### Syntax and Methods

Syntax:

```python
requests.methodname(params)
```

Methods:

|Method|Description|
|------|-----------|
|delete(url, args)|Sends a DELETE request to the specified url|
|get(url, params, args)|Sends a GET request to the specified url|
|head(url, args)|Sends a HEAD request to the specified url|
|patch(url, data, args)|Sends a PATCH request to the specified url|
|post(url, data, json, args)|Sends a POST request to the specified url|
|put(url, data, args)|Sends a PUT request to the specified url|
|request(method, url, args)|Sends a request of the specified method to the specified url|

---

### Requests Learning Resources

[Requests: 让 HTTP 服务人类](https://requests.readthedocs.io/zh_CN/latest/) (the Chinese edition of the Requests documentation, "HTTP for Humans")

---

### The Beautiful Soup Module

A ready-made Python library for extracting data from HTML or XML files. It is a practical choice if you have not yet mastered regular expressions (the `re` module) or have no time to study `re` in depth (although the instructor does recommend learning it).

Install it from Anaconda Prompt; for the steps, see [Installing third-party modules](http://www.yyschools.com/courses/FinancialData/Presentation/Chapter02_FunctionModule_Python/Chapter02_FunctionModule_Python.html#7).

```bash
pip install beautifulsoup4 -i https://pypi.tuna.tsinghua.edu.cn/simple
```

Example:

```python
from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.jxufe.edu.cn')
html = r.text
# Name the parser explicitly; omitting it makes Beautiful Soup guess and emit a warning.
soup = BeautifulSoup(html, 'html.parser')
# Print the parsed document with one tag per line and indentation.
print(soup.prettify())
```

---

### Beautiful Soup Documentation

- [https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/](https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/) (Chinese)
- [https://www.crummy.com/software/BeautifulSoup/bs4/doc/<sup>[*]</sup>](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Python 爬虫利器二之 Beautiful Soup 的用法](https://cuiqingcai.com/1319.html) (Chinese tutorial on using Beautiful Soup)

.footnote[
Note \[*\]: this is the English documentation; it is the recommended read if you also want to practice your English along the way.
]

---

class: reverse, middle, center

### Intentionally Left Blank

### The Story Is Not Over Yet

---

class: middle, center

### A General-Purpose Tool

---

### Selenium

Installing Python bindings for Selenium

```bash
pip install selenium
```

Drivers

Selenium requires a driver to interface with the chosen browser. Firefox, for example, requires geckodriver, which needs to be installed before the examples below can be run. Make sure it is in your **PATH (environment variable)**, e.g., place it in /usr/bin or /usr/local/bin (on UNIX-like systems). For Windows, see [Configuring the PATH environment variable](../../application/env-variable/env-variable.html).

- Chrome: https://chromedriver.chromium.org/downloads
- Edge: https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
- Firefox: https://github.com/mozilla/geckodriver/releases
- Safari: https://webkit.org/blog/6900/webdriver-support-in-safari-10/

.footnote[
Tip: if the page you want to scrape requires a CAPTCHA or an "I am not a robot" check, scraping it becomes very difficult.
]

---

### Example

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()  # or webdriver.Firefox()
browser.get('http://www.sse.com.cn/')
# Select the "Disclosure" (披露) menu item.
elem = browser.find_element(By.CSS_SELECTOR, '#menu_tab > li:nth-child(3) > a')
# Read the HTML inside the selected element and show it.
innerHTML = elem.get_attribute('innerHTML')
print(innerHTML)
browser.quit()
```

Documentation: https://selenium-python.readthedocs.io/

Tutorial: https://www.geeksforgeeks.org/selenium-python-tutorial/

---

### Example: Shanghai Stock Exchange Periodic Reports Page

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('http://www.sse.com.cn/disclosure/listedinfo/regular/')
# Type the stock code into the search box and run the search.
browser.find_element(By.ID, "inputCode").click()
browser.find_element(By.ID, "inputCode").send_keys("601919")
browser.find_element(By.CSS_SELECTOR, ".bi-search").click()
# Locate the results table; the page may need a moment to refresh after the click.
css_selector = "body > div.container.sse_content > div > div.col-lg-9.col-xxl-10 > div > div.sse_colContent.js_regular > div.table-responsive > table"
element = browser.find_element(By.CSS_SELECTOR, css_selector)
table_html = element.get_attribute('innerHTML')
# Save the table markup; the with-block closes the file automatically.
with open('table01.html', 'w', encoding='utf-8') as f:
    f.write(table_html)
browser.quit()
```

---

class: middle, center

### Explore Boldly!

---

### Requests-HTML (Mixed Results)

The `requests` module can fetch a page's source code, but only the static source. The data we need often appears only after the JavaScript in that source has run. For such pages we can go one step further and use the `Requests-HTML` module. Its documentation: [https://docs.python-requests.org/projects/requests-html/en/latest/](https://docs.python-requests.org/projects/requests-html/en/latest/)

First, install it from Anaconda Prompt:

```bash
pip install requests-html
```

When using this library you automatically get:

- **Full JavaScript support!** (still not enough for complex pages!)
- CSS Selectors (a.k.a. jQuery-style, thanks to PyQuery).
- XPath Selectors, for the faint of heart.
- Mocked user-agent (like a real web browser).
- Automatic following of redirects.
- Connection-pooling and cookie persistence.
- The Requests experience you know and love, with magical parsing abilities.
- Async Support

---

### Sample Code

```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://python.org/')
```

Grab a list of all links on the page, as-is (anchors excluded):

```python
r.html.links
```

> {'//docs.python.org/3/tutorial/', ...

Grab a list of all links on the page, in absolute form (anchors excluded):

```python
r.html.absolute_links
```

> {'https://github.com/python/pythondotorg/issues',...

---

### JavaScript Support

Let's grab some text that is rendered by JavaScript:

```python
r = session.get('http://python-requests.org/')
r.html.render()
r.html.search('Python 2 will retire in only {months} months!')['months']
```

> `<time>25</time>`

.footnote[
**Note**: the first time you ever run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once. You may also need to install a few Linux packages to get pyppeteer working.
]
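
---

### Relative vs. Absolute Links: A Sketch

The difference between `r.html.links` and `r.html.absolute_links` in the sample code can be illustrated offline. The sketch below uses only the Python standard library. The `LinkCollector` class, the sample HTML string, and the base URL are hypothetical, made up for illustration; this is not how Requests-HTML is implemented, only a rough picture of what "absolute form" means.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values of <a> tags, skipping in-page anchors."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        # Ignore missing hrefs and pure anchors such as "#top".
        if href and not href.startswith('#'):
            self.links.add(href)


# Made-up sample page and base URL, for illustration only.
sample_html = '''
<a href="/about/">About</a>
<a href="//docs.python.org/3/tutorial/">Tutorial</a>
<a href="https://github.com/python/pythondotorg/issues">Issues</a>
<a href="#top">Back to top</a>
'''
base_url = 'https://python.org/'

collector = LinkCollector()
collector.feed(sample_html)

# Links exactly as written in the page.
links = collector.links
# The same links resolved against the page URL.
absolute_links = {urljoin(base_url, link) for link in links}

print(sorted(links))
print(sorted(absolute_links))
```

Relative and scheme-relative links (`/about/`, `//docs.python.org/...`) are completed from the page's own URL, while links that are already absolute pass through unchanged.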