Processing Financial Statements in PDF Format

class: center, middle, inverse, title-slide

# Processing Financial Statements in PDF Format
## 🌱
### 吴燕丰
### School of Finance, Jiangxi University of Finance and Economics
### 2020/12/03

---

### PyMuPDF

Installation (Anaconda Prompt)

```bash
pip install PyMuPDF
```

**Importing PyMuPDF**

```python
import fitz
print(fitz.__doc__)
```

```python
doc = fitz.open(filename)     # or fitz.Document(filename)
```

---
### Some Document Methods and Attributes

|Method / Attribute|Description|
|------------------|-----------|
|Document.page_count|The number of pages (int)|
|Document.metadata|The metadata (dict)|
|Document.get_toc()|Get the table of contents (list)|
|Document.load_page()|Load a Page|

---
### Working with Outlines

```python
toc = doc.get_toc()
```
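
As a rough sketch of what can be done with the outline: in its default (simple) mode, `get_toc()` returns a list of `[level, title, page]` entries with 1-based page numbers. The helper name `format_toc` and the sample entries below are hypothetical, chosen only for illustration.

```python
def format_toc(toc):
    """Render get_toc()-style entries as indented text lines."""
    lines = []
    for entry in toc:
        # Only the first three items are used: level, title, 1-based page number.
        level, title, page_no = entry[:3]
        lines.append("  " * (level - 1) + f"{title} (p. {page_no})")
    return lines


# Hypothetical sample in the shape returned by doc.get_toc()
sample_toc = [
    [1, "Balance Sheet", 3],
    [2, "Assets", 3],
    [2, "Liabilities", 5],
    [1, "Income Statement", 7],
]

for line in format_toc(sample_toc):
    print(line)
```

With a real document, the same helper could be called as `format_toc(doc.get_toc())`.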

### Working with Pages

```python
page = doc.load_page(pno)  # loads page number 'pno' of the document (0-based)
page = doc[pno]  # the short form

for page in doc:
    pass  # do something with 'page'

# ... or read backwards
for page in reversed(doc):
    pass  # do something with 'page'

# ... or even use 'slicing'
for page in doc.pages(start, stop, step):
    pass  # do something with 'page'
```

---
### Inspecting the Links, Annotations or Form Fields of a Page

```python
# get all links on a page
links = page.get_links()

for link in page.links():
    pass  # do something with 'link'

for annot in page.annots():
    pass  # do something with 'annot'

for field in page.widgets():
    pass  # do something with 'field'
```

---
### Extracting Text and Images

```python
text = page.get_text(opt)
```

Use one of the following strings for `opt` to obtain different formats:

- "**text**": (default) plain text with line breaks. No formatting, no text position details, no images.
- "**blocks**": generates a list of text blocks (= paragraphs).
- "**words**": generates a list of words (strings not containing spaces).
- "**html**": creates a full visual version of the page, including any images. This can be displayed in a web browser.
- "**dict**" / "json": same information level as HTML, but provided as a Python dictionary or a JSON string, respectively. See TextPage.extractDICT() for details of its structure.
- "**rawdict**" / "rawjson": a superset of "dict" / "json".
- "**xhtml**": same text information level as the "text" version, but includes images. Can also be displayed in a web browser.
- "**xml**": contains no images, but full position and font information down to each single character. Use an XML module to interpret it.
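
To illustrate the "words" format: each item is a tuple `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The sketch below regroups such tuples into text lines; the helper name `words_to_lines` and the sample data are hypothetical, and real statements may place amounts in separate blocks.

```python
from collections import defaultdict


def words_to_lines(words):
    """Group get_text("words")-style tuples into lines of text."""
    lines = defaultdict(list)
    for x0, y0, x1, y1, text, block_no, line_no, word_no in words:
        lines[(block_no, line_no)].append((word_no, text))
    # Order lines by (block, line) and words within a line by word number.
    return [
        " ".join(text for _, text in sorted(items))
        for _, items in sorted(lines.items())
    ]


# Hypothetical sample in the shape returned by page.get_text("words")
sample_words = [
    (72.0, 100.0, 110.0, 112.0, "Total", 0, 0, 0),
    (115.0, 100.0, 160.0, 112.0, "assets", 0, 0, 1),
    (400.0, 100.0, 460.0, 112.0, "1,234.56", 0, 0, 2),
    (72.0, 120.0, 110.0, 132.0, "Total", 0, 1, 0),
    (115.0, 120.0, 180.0, 132.0, "liabilities", 0, 1, 1),
]

print(words_to_lines(sample_words))
```

On a real page this would be called as `words_to_lines(page.get_text("words"))`.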

---
### Searching for Text

```python
areas = page.search_for("mupdf")
```
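
`search_for()` returns a list of rectangles, one per hit. One possible use, sketched below, is to widen each hit to the full page width and read the whole line through the `clip` argument of `get_text()`. The helper names `line_band` and `text_on_hit_lines` and the 1-point margin are assumptions made for illustration; the margin may need tuning for a given PDF.

```python
def line_band(hit, page_width, margin=1.0):
    """Widen a hit rectangle (x0, y0, x1, y1) to span the full page width."""
    x0, y0, x1, y1 = hit
    return (0.0, y0 - margin, page_width, y1 + margin)


def text_on_hit_lines(page, needle):
    """Return the text of the line around each hit on a PyMuPDF page."""
    results = []
    for hit in page.search_for(needle):
        band = line_band(hit, page.rect.width)
        results.append(page.get_text("text", clip=band).strip())
    return results


# Hypothetical hit rectangle on a 595-point-wide (A4) page
print(line_band((100.0, 200.0, 150.0, 212.0), 595.0))
```

For a label such as "Total assets", `text_on_hit_lines(page, "Total assets")` would then be expected to return the label together with the figures printed on the same line.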

### Modifying, Creating, Re-arranging and Deleting Pages

- Document.delete_page()
- Document.delete_pages()
- Document.copy_page()
- Document.fullcopy_page()
- Document.move_page()
- Document.select()
- Document.insert_page()
- Document.new_page()

---
### Joining and Splitting PDF Documents

```python
# append complete doc2 to the end of doc1
doc1.insert_pdf(doc2)

doc2 = fitz.open()                 # new empty PDF
doc2.insert_pdf(doc1, to_page=9)   # first 10 pages
doc2.insert_pdf(doc1, from_page=len(doc1) - 10)  # last 10 pages
doc2.save("first-and-last-10.pdf")
```
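
The same `from_page` / `to_page` arguments can be used to split a long report into fixed-size parts. This is a minimal sketch: `chunk_ranges` and `split_pdf` are hypothetical helper names, and the output file naming is an assumption.

```python
def chunk_ranges(page_count, size):
    """Return inclusive, 0-based (from_page, to_page) pairs covering all pages."""
    return [
        (start, min(start + size, page_count) - 1)
        for start in range(0, page_count, size)
    ]


def split_pdf(doc, size, prefix):
    """Write consecutive parts of `doc` to files named '<prefix>-<n>.pdf'."""
    import fitz  # PyMuPDF; imported here so chunk_ranges works without it

    ranges = chunk_ranges(doc.page_count, size)
    for n, (first, last) in enumerate(ranges, start=1):
        part = fitz.open()  # new empty PDF
        part.insert_pdf(doc, from_page=first, to_page=last)
        part.save(f"{prefix}-{n}.pdf")
        part.close()


print(chunk_ranges(25, 10))  # [(0, 9), (10, 19), (20, 24)]
```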

---
### Saving

`Document.save()` always saves the document in its current state, including any changes made since it was opened.

### PyMuPDF Documentation

[https://pymupdf.readthedocs.io/en/latest/tutorial.html](https://pymupdf.readthedocs.io/en/latest/tutorial.html)

---

### pdfplumber (pdf + plumber): a plumber for PDFs

Installation (Anaconda Prompt)

```bash
pip install pdfplumber -i https://pypi.tuna.tsinghua.edu.cn/simple
```

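#### Troubleshooting a failed installation

Some students reported that the installation fails with the following error ("拒绝访问" means "Access is denied"):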

>ERROR: Could not install packages due to an EnvironmentError: [WinError 5] 拒绝访问。：‘c:\\ProgramData\\Anaconda3\\Lib\\site-packages’

Try installing with the `--user` flag instead (Anaconda Prompt):

```bash
pip install pdfplumber --user -i https://pypi.tuna.tsinghua.edu.cn/simple
```

Alternatively, open Anaconda Prompt as administrator and then install; see the screenshot that follows.

---

### Troubleshooting a failed installation (continued)

![](./images/run_as_admin_pip.jpg)

References for these fixes:

- https://blog.csdn.net/allenjsj/article/details/80149551
- https://stackoverflow.com/questions/51912999/could-not-install-packages-due-to-an-environmenterror-winerror-5-access-is-de
- [Adding --user after install](https://blog.csdn.net/u012735708/article/details/83301875?utm_medium=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromBaidu-1.control&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromBaidu-1.control)

---

### Example

```python
import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]  # get the first page
    print(first_page.chars[0])
```

---

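### Common methods

|Method|Description|
|------|-----------|
|extract_text()|Extracts the text of the page, collating all of the page's character objects into a single string|
|extract_words()|Returns all of the words on the page together with their associated information|
|extract_tables()|Extracts the tables on the page|
|to_image()|Returns an instance of the PageImage class, for use in visual debugging|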

---

### Example

```python
import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    page_count = len(pdf.pages)
    print(page_count)  # number of pages
    for page in pdf.pages:
        print('---------- Page [%d] ----------' % page.page_number)
        # get all text on the current page, including text inside tables
        print(page.extract_text())
```
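
Once the text of every page is available, a simple next step is to locate the pages that contain a given statement. The sketch below assumes the page texts were collected as `[page.extract_text() for page in pdf.pages]`; the helper name `pages_mentioning`, the keywords, and the sample texts are hypothetical.

```python
def pages_mentioning(page_texts, keywords):
    """Map each keyword to the 1-based page numbers whose text contains it."""
    found = {keyword: [] for keyword in keywords}
    for number, text in enumerate(page_texts, start=1):
        if not text:  # skip pages with no extractable text (None or "")
            continue
        for keyword in keywords:
            if keyword in text:
                found[keyword].append(number)
    return found


# Hypothetical per-page text, in the shape produced by page.extract_text()
sample_texts = [
    "Annual Report 2019",
    "Consolidated Balance Sheet\nTotal assets 1,234.56",
    None,
    "Consolidated Income Statement\nRevenue 500.00",
]

print(pages_mentioning(sample_texts, ["Balance Sheet", "Income Statement"]))
# {'Balance Sheet': [2], 'Income Statement': [4]}
```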

---

### Example

```python
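import re

import pdfplumber

table_settings = {
    "vertical_strategy": "text",     # infer column boundaries from text alignment
    "horizontal_strategy": "lines",  # use ruled lines as row boundaries
    "intersection_tolerance": 20,    # max distance (in points) for edges to count as intersecting
}

with pdfplumber.open("path/to/file.pdf") as pdf:
    for page in pdf.pages:
        for pdf_table in page.extract_tables(table_settings=table_settings):
            # print(pdf_table)
            for row in pdf_table:
                # strip all whitespace, including line breaks, from each cell
                print([re.sub(r'\s+', '', cell) if cell is not None else None for cell in row])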
```
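
Extracted cells are strings (or `None`), so amounts usually need converting before any calculation. The following is a minimal sketch under stated assumptions: the first row of the table is a header, amounts use commas as thousands separators, and parentheses mark negative values. The helper names `parse_amount` and `table_to_records` and the sample table are hypothetical.

```python
import re


def parse_amount(cell):
    """Convert '1,234.56' or '(1,234.56)' to a float; return None otherwise."""
    if cell is None:
        return None
    text = re.sub(r"\s+", "", cell)
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    try:
        value = float(text.replace(",", ""))
    except ValueError:
        return None  # not a number, e.g. '-' or a text label
    return -value if negative else value


def table_to_records(table):
    """Turn a table (list of rows) into dicts keyed by the header row."""
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


# Hypothetical table in the shape returned by page.extract_tables()
sample_table = [
    ["Item", "2019", "2018"],
    ["Total assets", "1,234.56", "1,100.00"],
    ["Net cash flow", "(50.25)", "-"],
]

for record in table_to_records(sample_table):
    print(record["Item"], parse_amount(record["2019"]), parse_amount(record["2018"]))
```

The resulting list of dictionaries can also be passed to `pandas.DataFrame` if a tabular structure is preferred.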

---

### Documentation

- [https://github.com/jsvine/pdfplumber](https://github.com/jsvine/pdfplumber)

- [https://www.cnblogs.com/xiao-apple36/p/10496707.html](https://www.cnblogs.com/xiao-apple36/p/10496707.html)