class: center, middle, inverse, title-slide

# Processing Financial Statements in PDF Format
## 🌱
### 吴燕丰
### Jiangxi University of Finance and Economics, School of Finance
### 2020/12/03

---

### Installing PyMuPDF (Anaconda Prompt)

```bash
pip install PyMuPDF
```

**Importing PyMuPDF**

```python
import fitz
print(fitz.__doc__)
```

```python
doc = fitz.open(filename)  # or fitz.Document(filename)
```

---

### Some Document Methods and Attributes

|Method / Attribute| Description|
|------------------|-------------|
|Document.page_count| the number of pages (int)|
|Document.metadata| the metadata (dict)|
|Document.get_toc()| get the table of contents (list)|
|Document.load_page()| read a Page|

---

### Working with Outlines

```python
toc = doc.get_toc()
```

### Working with Pages

```python
page = doc.load_page(pno)  # loads page number 'pno' of the document (0-based)
page = doc[pno]            # the short form

for page in doc:
    # do something with 'page'
    pass

# ... or read backwards
for page in reversed(doc):
    # do something with 'page'
    pass

# ... or even use 'slicing'
for page in doc.pages(start, stop, step):
    # do something with 'page'
    pass
```

---

### Inspecting the Links, Annotations or Form Fields of a Page

```python
# get all links on a page
links = page.get_links()

for link in page.links():
    # do something with 'link'
    pass

for annot in page.annots():
    # do something with 'annot'
    pass

for field in page.widgets():
    # do something with 'field'
    pass
```

---

### Extracting Text and Images

```python
text = page.get_text(opt)
```

Use one of the following strings for opt to obtain different formats:

- “**text**”: (default) plain text with line breaks. No formatting, no text position details, no images.
- “**blocks**”: generate a list of text blocks (= paragraphs).
- “**words**”: generate a list of words (strings not containing spaces).
- “**html**”: creates a full visual version of the page including any images. This can be displayed with your internet browser.
- “**dict**” / “json”: same information level as HTML, but provided as a Python dictionary or JSON string, respectively.
  See TextPage.extractDICT() for details of its structure.
- “**rawdict**” / “rawjson”: a super-set of “dict” / “json”.
- “**xhtml**”: text information level as the TEXT version but includes images. Can also be displayed by internet browsers.
- “**xml**”: contains no images, but full position and font information down to each single text character. Use an XML module to interpret.

---

### Searching for Text

```python
areas = page.search_for("mupdf")
```

### Modifying, Creating, Re-arranging and Deleting Pages

- Document.delete_page()
- Document.delete_pages()
- Document.copy_page()
- Document.fullcopy_page()
- Document.move_page()
- Document.select()
- Document.insert_page()
- Document.new_page()

---

### Joining and Splitting PDF Documents

```python
# joining: append complete doc2 to the end of doc1
doc1.insert_pdf(doc2)

# splitting: build a new document from the first and last 10 pages of doc1
doc2 = fitz.open()                               # new empty PDF
doc2.insert_pdf(doc1, to_page=9)                 # first 10 pages
doc2.insert_pdf(doc1, from_page=len(doc1) - 10)  # last 10 pages
doc2.save("first-and-last-10.pdf")
```

---

### Saving

Document.save() always saves the document in its current state.
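
---

### Splitting Without Duplicate Pages

The splitting snippet on the previous slide assumes that `doc1` has at least 20 pages. With fewer pages, the "first 10" and "last 10" ranges overlap, so some pages can end up in the output twice. The sketch below is one way to guard against that. The helper names `first_last_ranges` and `save_first_last` are hypothetical (they are not part of PyMuPDF); only `fitz.open()`, `insert_pdf()`, `save()` and `close()` are library calls.

```python
def first_last_ranges(page_count, n=10):
    """Return inclusive, 0-based (from_page, to_page) ranges for the first
    and last n pages, merged into one range when they would overlap."""
    if page_count <= 0 or n <= 0:
        return []
    if page_count <= 2 * n:
        # The two ranges touch or overlap: copy every page exactly once.
        return [(0, page_count - 1)]
    return [(0, n - 1), (page_count - n, page_count - 1)]


def save_first_last(src_path, dst_path, n=10):
    """Write the first and last n pages of src_path to dst_path."""
    # PyMuPDF is imported here so first_last_ranges() works without it.
    import fitz

    src = fitz.open(src_path)
    dst = fitz.open()  # new empty PDF
    for from_page, to_page in first_last_ranges(len(src), n):
        dst.insert_pdf(src, from_page=from_page, to_page=to_page)
    dst.save(dst_path)
    dst.close()
    src.close()
```

For example, `save_first_last("report.pdf", "first-and-last-10.pdf")` would copy a 15-page report once in full instead of repeating pages 5 to 9 (0-based); the file name `report.pdf` is only a placeholder.

---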
### PyMuPDF Documentation

[https://pymupdf.readthedocs.io/en/latest/tutorial.html](https://pymupdf.readthedocs.io/en/latest/tutorial.html)

---

### pdfplumber (pdf + plumber): a plumber for PDFs

Installation (Anaconda Prompt)

```bash
pip install pdfplumber -i https://pypi.tuna.tsinghua.edu.cn/simple
```

#### Fixing a failed installation

Some students reported that the installation fails with the following error (拒绝访问 means "Access is denied"):

>ERROR: Could not install packages due to an EnvironmentError: [WinError 5] 拒绝访问。:‘c:\\ProgramData\\Anaconda3\\Lib\\site-packages’

Try installing with `--user` instead (Anaconda Prompt):

```bash
pip install pdfplumber --user -i https://pypi.tuna.tsinghua.edu.cn/simple
```

Alternatively, open Anaconda Prompt as administrator and then run the installation, as shown in the screenshots on the next slides.

---

### Fixing a failed installation (continued)

<img src="./images/run_as_admin.jpg" width='90%'>

---

![](./images/run_as_admin_pip.jpg)

References for these fixes:

- https://blog.csdn.net/allenjsj/article/details/80149551
- https://stackoverflow.com/questions/51912999/could-not-install-packages-due-to-an-environmenterror-winerror-5-access-is-de
- [Add --user after install](https://blog.csdn.net/u012735708/article/details/83301875?utm_medium=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromBaidu-1.control&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromBaidu-1.control)

---

### Example

```python
import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    first_page = pdf.pages[0]  # get the first page
    print(first_page.chars[0])
```

---

### Common methods

|Method|Description|
|------|----|
|extract_text()|Extracts the page's text, collating all of the page's character objects into a single string|
|extract_words()|Returns all words together with their associated information|
|extract_tables()|Extracts the tables on the page|
|to_image()|Returns an instance of the PageImage class, for use in visual debugging|

---

### Example

```python
import pdfplumber

with pdfplumber.open("path/to/file.pdf") as pdf:
    page_count = len(pdf.pages)
    print(page_count)  # number of pages
    for page in pdf.pages:
        print('---------- Page [%d] ----------' % page.page_number)
        # get all text on the current page, including text inside tables
        print(page.extract_text())
```

---

### Example

```python
import pdfplumber
import re

with pdfplumber.open("path/to/file.pdf") as pdf:
    page_count = len(pdf.pages)
    print(page_count)  # number of pages
    for page in pdf.pages:
        print('---------- Page [%d] ----------' % page.page_number)
        table_settings = {
            "vertical_strategy": "text",
            "horizontal_strategy": "lines",
            # how close edges must be to count as intersecting when forming cells
            "intersection_tolerance": 20,
        }
        for pdf_table in page.extract_tables(table_settings=table_settings):
            # print(pdf_table)
            for row in pdf_table:
                # strip line breaks and all other whitespace from each cell
                print([re.sub(r'\s+', '', cell) if cell is not None else None for cell in row])
```

---

### Documentation

- [https://github.com/jsvine/pdfplumber](https://github.com/jsvine/pdfplumber)
- [https://www.cnblogs.com/xiao-apple36/p/10496707.html](https://www.cnblogs.com/xiao-apple36/p/10496707.html)
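
---

### Example: Cleaning Extracted Rows

The table example prints each row after stripping whitespace from its cells. For financial statements, a likely next step is turning amount strings into numbers. The sketch below shows one possible way, using only the standard library. The helper names (`clean_cell`, `to_number`, `clean_table`) are hypothetical, the sample row is made up for illustration, and treating parentheses as a negative sign is an assumption about the statement's layout.

```python
import re

# A plain number: optional minus sign, digits, optional decimal part.
_NUMBER = re.compile(r"-?\d+(\.\d+)?")


def clean_cell(cell):
    """Remove all whitespace from a cell; empty cells (None) stay None."""
    return None if cell is None else re.sub(r"\s+", "", cell)


def to_number(text):
    """Parse strings such as '1,234.50' or '(1,234.50)' into a float.
    Returns None for anything that is not a plain number."""
    if not text:
        return None
    # Assumption: parentheses mark a negative amount.
    negative = text.startswith("(") and text.endswith(")")
    digits = text.strip("()").replace(",", "")
    if not _NUMBER.fullmatch(digits):
        return None
    value = float(digits)
    return -value if negative else value


def clean_table(table):
    """Apply clean_cell to every cell of a table (a list of rows)."""
    return [[clean_cell(cell) for cell in row] for row in table]


# Made-up sample row in the shape returned by page.extract_tables()
sample = [["Net\nloss", "(1,234.50)\n", None]]
for row in clean_table(sample):
    print(row, [to_number(cell) for cell in row])
```

Note that `clean_cell` removes every whitespace character, exactly as the table example does. That suits Chinese labels, but it also joins English words ("Net loss" becomes "Netloss"), so labels in other languages may need a gentler rule.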