Network Protocols HTTP and HTTPS, and the Requests Package

class: center, middle, inverse, title-slide

.title[
# Network Protocols HTTP and HTTPS, and the Requests Package
]
.subtitle[
## 🌱
]
.author[
### 吴燕丰
]
.institute[
### School of Finance, Jiangxi University of Finance and Economics
]
.date[
### 2020/10/30
]

---

### HTTP and HTTPS

HTTP: **H**yper**T**ext **T**ransfer **P**rotocol, the foundation of data communication on the World Wide Web.

HTTPS: **H**yper**T**ext **T**ransfer **P**rotocol **S**ecure, the **secure** version of the HTTP protocol.

.pull-left[
![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/83/Internet1.svg/220px-Internet1.svg.png)
]

.pull-right[
![](https://raw.githubusercontent.com/wsgzao/storage-public/master/img/20200510171326.png)
]

.footnote[
Image source: [https://wsgzao.github.io/post/https/](https://wsgzao.github.io/post/https/)
]
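
At the URL level, the difference between the two protocols shows up as the scheme and the default port: 80 for `http` and 443 for `https`. The sketch below uses only the standard library; the helper name `describe_url` is made up for this illustration.

```python
from urllib.parse import urlsplit

# Well-known default ports for the two schemes
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe_url(url):
    """Return (scheme, host, port, encrypted) for an http(s) URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    if scheme not in DEFAULT_PORTS:
        raise ValueError(f"not an HTTP or HTTPS URL: {url!r}")
    # parts.port is None when the URL does not spell out a port
    port = parts.port or DEFAULT_PORTS[scheme]
    return scheme, parts.hostname, port, scheme == "https"


print(describe_url("http://www.jxufe.edu.cn/"))
print(describe_url("https://wsgzao.github.io/post/https/"))
```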

---

.center[
<img src='https://miro.medium.com/max/768/1*aSca7t_PcHAdPvv5n4K2NQ.png' style='width: 80%'>
]

---

### The HTTP Communication Process

![](http://www.ituring.com.cn/figures/2014/PIC%20HTTP/05.d01z.003.png)

.footnote[
Image source: https://www.ituring.com.cn/book/1229
]

---

### The HTTPS Communication Process

.center[
![](https://www.runoob.com/wp-content/uploads/2017/05/201208201734403507.png)
]

.footnote[
<br>
Image source: [https://www.runoob.com/w3cnote/https-ssl-intro.html](https://www.runoob.com/w3cnote/https-ssl-intro.html)
]

---

### The HTTPS Communication Process

![](https://raw.githubusercontent.com/wsgzao/storage-public/master/img/20200510171633.png)

.footnote[
Image source: [https://wsgzao.github.io/post/https/](https://wsgzao.github.io/post/https/)
]

---

### Further Reading on HTTP and HTTPS

.pull-left[
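Some good references:

- [*Illustrated HTTP* (《图解HTTP》)](https://www.ituring.com.cn/book/1229)

- [Study notes on *Illustrated HTTPS* (《图解HTTPS》)](https://wsgzao.github.io/post/https/)

- [An overview of HTTPS and SSL certificates](https://www.runoob.com/w3cnote/https-ssl-intro.html)

- [Study notes on *Illustrated HTTPS* (《图解HTTPS》), on Jianshu](https://www.jianshu.com/p/f487b940d017)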
]

.pull-right[
<img width=350 src='https://file.ituring.com.cn/ScreenShow/170974f7ce622b38fa9c'>
]

.footnote[
Note: If this has not fully sunk in yet, that is fine; a rough understanding is enough.
]

---

### The requests Module

Installation:

1. Open `Anaconda Prompt` (this assumes Anaconda is installed). If you are unsure how, see
[Installing third-party modules](http://www.yyschools.com/courses/FinancialData/Presentation/Chapter02_FunctionModule_Python/Chapter02_FunctionModule_Python.html#7)

2. Type the following command and press <kbd>Enter</kbd> to install:

```bash
pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple
```

---

### Example

Request the [Jiangxi University of Finance and Economics home page](http://www.jxufe.edu.cn/) and print its text content:

```python
import requests

x = requests.get('http://www.jxufe.edu.cn/')

print(x.text)
```
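
Before printing `x.text`, it often helps to check that the request succeeded and that the body was decoded with a sensible encoding, because Chinese pages can come out garbled when the server does not declare a charset. A minimal sketch; the function name `fetch_text` and the 10-second timeout are assumptions for illustration, not part of the original example.

```python
def fetch_text(url, timeout=10):
    """Return the decoded body of `url`, or raise if the request failed."""
    import requests  # third-party package (pip install requests)

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx

    # For text responses without a declared charset, requests falls back
    # to ISO-8859-1; in that case use the encoding detected from the bytes.
    if response.encoding is None or response.encoding.lower() == "iso-8859-1":
        response.encoding = response.apparent_encoding
    return response.text
```

Usage: `print(fetch_text('http://www.jxufe.edu.cn/'))`.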

---
### Downloading an Annual Report PDF

Example download link: http://news.windin.com/ns/getatt.php?id=124484400&att_id=103006392&code=EC511722C83B

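```python
import requests

href = 'http://news.windin.com/ns/getatt.php?id=124484400&att_id=103006392&code=EC511722C83B'

# Fetch the file; the raw PDF bytes are in r.content
r = requests.get(href, allow_redirects=True)

# 'wb' writes raw bytes; the with-block closes the file automatically
with open('filename.pdf', 'wb') as f:
    f.write(r.content)

r.close()
```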
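
A download link like this can also return an HTML error page instead of a PDF, and a large report does not need to be held in memory all at once. The sketch below streams the response to disk and checks that the first bytes carry the PDF signature `%PDF-`. The names `looks_like_pdf` and `download_pdf`, the chunk size, and the timeout are assumptions for illustration.

```python
def looks_like_pdf(first_bytes):
    """PDF files begin with the signature b'%PDF-'."""
    return first_bytes.startswith(b"%PDF-")


def download_pdf(url, path, timeout=30):
    """Stream `url` to `path`; raise if the server did not send a PDF."""
    import requests  # third-party package (pip install requests)

    with requests.get(url, stream=True, timeout=timeout) as r:
        r.raise_for_status()
        chunks = r.iter_content(chunk_size=8192)
        first = next(chunks, b"")
        if not looks_like_pdf(first):
            raise ValueError("response does not start with the PDF signature")
        with open(path, "wb") as f:
            f.write(first)
            for chunk in chunks:
                f.write(chunk)
    return path
```

Usage: `download_pdf(href, 'filename.pdf')`, with `href` as defined in the example above.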

---
### Syntax and Methods

Syntax:

```python
requests.methodname(params)
```
Methods:

|Method|Description|
|------|-----------|
|delete(url, args)|Sends a DELETE request to the specified url|
|get(url, params, args)|Sends a GET request to the specified url|
|head(url, args)|Sends a HEAD request to the specified url|
|patch(url, data, args)|Sends a PATCH request to the specified url|
|post(url, data, json, args)|Sends a POST request to the specified url|
|put(url, data, args)|Sends a PUT request to the specified url|
|request(method, url, args)|Sends a request of the specified method to the specified url|
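
In the syntax line, `methodname` stands for one of the entries in the table, and `requests.request` is the generic form behind all of them. The sketch below wraps it in a hypothetical helper `send`; the helper name and the default 10-second timeout are assumptions for illustration.

```python
def send(method, url, timeout=10, **kwargs):
    """Send any HTTP method through requests.request and return the response."""
    import requests  # third-party package (pip install requests)

    return requests.request(method.upper(), url, timeout=timeout, **kwargs)


# Apart from the added timeout, each pair below sends the same request:
#   requests.get(url, params={"q": "annual report"})
#   send("get", url, params={"q": "annual report"})
#
#   requests.post(url, json={"code": "601919"})
#   send("post", url, json={"code": "601919"})
```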

---

### Learning Resources for Requests

[Requests: HTTP for Humans (Chinese documentation)](https://requests.readthedocs.io/zh_CN/latest/)

---

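### The Beautiful Soup Module

Beautiful Soup is a ready-made Python library for extracting data from HTML or XML files. It is a good choice if you have not yet mastered regular expressions (the `re` module) or do not have time to study `re` in depth (even though your instructor recommends learning it).

Install it from Anaconda Prompt; for the steps, see
[Installing third-party modules](http://www.yyschools.com/courses/FinancialData/Presentation/Chapter02_FunctionModule_Python/Chapter02_FunctionModule_Python.html#7)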

```bash
pip install beautifulsoup4 -i https://pypi.tuna.tsinghua.edu.cn/simple
```

Example:

```python
import requests
from bs4 import BeautifulSoup

r = requests.get('http://www.jxufe.edu.cn')
html = r.text

# Name the parser explicitly; otherwise Beautiful Soup guesses one and warns
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
```
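
Pretty-printing shows the whole page; extracting data usually means picking out specific tags. The sketch below collects the text and absolute URL of every link in a page. The function name `extract_links` is hypothetical, and `html.parser` is chosen only because it ships with Python.

```python
from urllib.parse import urljoin


def extract_links(html, base_url):
    """Return (text, absolute_url) pairs for every <a href=...> in `html`."""
    from bs4 import BeautifulSoup  # third-party (pip install beautifulsoup4)

    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        text = a.get_text(strip=True)
        # Relative hrefs are resolved against the page's own URL
        links.append((text, urljoin(base_url, a["href"])))
    return links
```

Usage: `extract_links(r.text, 'http://www.jxufe.edu.cn/')`, with `r` as in the example above.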

---

### Beautiful Soup Documentation

- [https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/](https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/)

- [https://www.crummy.com/software/BeautifulSoup/bs4/doc/<sup>[*]</sup>](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

- [Python Web-Scraping Tools, Part 2: Using Beautiful Soup](https://cuiqingcai.com/1319.html)

.footnote[
Note \[*\]: If you would like to practice your English at the same time, this English documentation is recommended.
]

---
class: reverse, middle, center

### Intentionally left blank

### The story is not over

---
class: middle, center

### A General-Purpose Tool

---
### Selenium

Installing Python bindings for Selenium

```bash
pip install selenium
```

Drivers

Selenium requires a driver to interface with the chosen browser. Firefox, for example, requires geckodriver, which needs to be installed before the examples below can be run. Make sure it is in your **PATH** environment variable, e.g., place it in /usr/bin or /usr/local/bin (on Unix-like systems).

For Windows, see [Configuring the PATH environment variable](../../application/env-variable/env-variable.html)

- Chrome:	https://chromedriver.chromium.org/downloads
- Edge:	https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
- Firefox:	https://github.com/mozilla/geckodriver/releases
- Safari:	https://webkit.org/blog/6900/webdriver-support-in-safari-10/

.footnote[
Tip: If the page you are scraping requires a CAPTCHA or a not-a-robot check, scraping it becomes very difficult.
]

---
### Example

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()  # or webdriver.Firefox()

browser.get('http://www.sse.com.cn/')

# Select the "Disclosure" menu (third item of #menu_tab)
elem = browser.find_element(By.CSS_SELECTOR,
                            '#menu_tab > li:nth-child(3) > a')

# The HTML inside the selected element
innerHTML = elem.get_attribute('innerHTML')

browser.quit()
```

Documentation: https://selenium-python.readthedocs.io/

Tutorial: https://www.geeksforgeeks.org/selenium-python-tutorial/
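
Unless an implicit wait is configured, `find_element` raises `NoSuchElementException` as soon as the element is missing, which can happen when the page is still loading. A common remedy is an explicit wait. The sketch below is one way to write it; the helper names `menu_selector` and `wait_for` and the 10-second timeout are assumptions for illustration.

```python
def menu_selector(position):
    """CSS selector for the link in the n-th item of #menu_tab (1-based)."""
    if position < 1:
        raise ValueError("position is 1-based")
    return f"#menu_tab > li:nth-child({position}) > a"


def wait_for(browser, css, timeout=10):
    """Block until an element matching `css` is in the DOM, then return it."""
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Raises selenium's TimeoutException if nothing matches within `timeout`
    return WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, css))
    )
```

Usage: `elem = wait_for(browser, menu_selector(3))` in place of the `find_element` call above.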

---
### Example: Shanghai Stock Exchange Periodic Reports Page

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('http://www.sse.com.cn/disclosure/listedinfo/regular/')

# Type the stock code into the search box and click the search icon
code_box = browser.find_element(By.ID, "inputCode")
code_box.click()
code_box.send_keys("601919")
browser.find_element(By.CSS_SELECTOR, ".bi-search").click()

# Locate the results table and read its inner HTML
css_selector = "body > div.container.sse_content > div > div.col-lg-9.col-xxl-10 > div > div.sse_colContent.js_regular > div.table-responsive > table"
element = browser.find_element(By.CSS_SELECTOR, css_selector)
table_html = element.get_attribute('innerHTML')

# Save the table's HTML for later parsing
with open('table01.html', 'w', encoding='utf-8') as f:
    f.write(table_html)

browser.quit()
```
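
The saved `table01.html` still has to be turned into rows of data. The sketch below does this with the standard library's `html.parser`, assuming the table uses plain `tr`/`th`/`td` markup (the real page's structure has not been checked here). The names `TableRows` and `rows_from_table_html` are hypothetical; Beautiful Soup could be used for the same job.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rows_from_table_html(table_html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(table_html)
    parser.close()
    return parser.rows
```

Usage: read `table01.html` with `open(..., encoding='utf-8')` and pass its contents to `rows_from_table_html`.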

---
class: middle, center

### Explore boldly!

---
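### Requests-HTML (mediocre results)

The `requests` module can fetch a page's source code, but what it returns is only the static source. The data we need often appears only after the JavaScript in that source has run. For such pages we can go one step further with the `Requests-HTML` module. Its documentation: [https://docs.python-requests.org/projects/requests-html/en/latest/](https://docs.python-requests.org/projects/requests-html/en/latest/)

First, install it from Anaconda Prompt:

```bash
pip install requests-html
```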

When using this library you automatically get:
- **Full JavaScript support!** (Still not enough for complex pages!)
- CSS Selectors (a.k.a. jQuery-style, thanks to PyQuery).
- XPath Selectors, for the faint of heart.
- Mocked user-agent (like a real web browser).
- Automatic following of redirects.
- Connection-pooling and cookie persistence.
- The Requests experience you know and love, with magical parsing abilities.
- Async Support

---
### Sample Codes

```python
from requests_html import HTMLSession
session = HTMLSession()

r = session.get('https://python.org/')
```

Grab a list of all links on the page, as-is (anchors excluded):

```python
r.html.links
```

> {'//docs.python.org/3/tutorial/', ...

Grab a list of all links on the page, in absolute form (anchors excluded):

```python
r.html.absolute_links
```

> {'https://github.com/python/pythondotorg/issues',...

---
### JavaScript Support

Let’s grab some text that’s rendered by JavaScript:

```python
r = session.get('http://python-requests.org/')
r.html.render()
r.html.search('Python 2 will retire in only {months} months!')['months']
```

> `<time>25</time>`

.footnote[
**Note**, the first time you ever run the render() method, it will download Chromium into your home directory (e.g. ~/.pyppeteer/). This only happens once. You may also need to install a few Linux packages to get pyppeteer working.
]