class: center, middle, inverse, title-slide # Python处理Word、Excel文档 ### 吴燕丰 ### 江西财大,金融学院 ### 2020/12/10,更新于2022-03-31 --- ### python-docx 模块 安装(通过Anaconda Prompt): ```bash pip install python-docx -i https://pypi.tuna.tsinghua.edu.cn/simple ``` 例子 ```python from docx import Document document = Document() document.add_heading('标题') document.add_paragraph('段落内容。。。') document.add_table(rows=2, cols=2) document.add_page_break() document.save('demo.docx') ``` .footnote[ **注意**: - python-docx模块不能处理旧版本的Microsoft Word文档。 - doc 转 docx 参见知乎:https://zhuanlan.zhihu.com/p/64189783 ] --- ### `.doc` 转 `.docx` 批量转换 `.doc` -> `.docx`(先安装pypiwin32模块: `pip install pypiwin32`) ```python import os #用于获取目标文件所在路径 path="path\\to\\doc_files" # 文件夹绝对路径 files=[] for file in os.listdir(path): if file.endswith(".doc"): #排除文件夹内的其它干扰文件,只获取".doc"后缀的word文件 files.append(path+file) files ``` ```python from win32com import client as wc #导入模块 word = wc.Dispatch("Word.Application") # 打开word应用程序 for file in files: doc = word.Documents.Open(file) #打开word文件 doc.SaveAs("{}x".format(file), 12)#另存为后缀为".docx"的文件,其中参数12指docx文件 doc.Close() #关闭原来word文件 word.Quit() print("完成!") ``` --- ### User Guide 方法链接:https://python-docx.readthedocs.io/en/latest/user/documents.html - [快速入门](https://python-docx.readthedocs.io/en/latest/user/quickstart.html) - [处理文档](https://python-docx.readthedocs.io/en/latest/user/documents.html) - [处理文字](https://python-docx.readthedocs.io/en/latest/user/text.html) - [处理节](https://python-docx.readthedocs.io/en/latest/user/sections.html) - [处理页眉页脚](https://python-docx.readthedocs.io/en/latest/user/hdrftr.html) - [API基础](https://python-docx.readthedocs.io/en/latest/user/api-concepts.html) - [理解格式](https://python-docx.readthedocs.io/en/latest/user/styles-understanding.html) - [处理格式](https://python-docx.readthedocs.io/en/latest/user/styles-using.html) - [理解图片及其他图形](https://python-docx.readthedocs.io/en/latest/user/shapes.html) 中文资料: [Python 操作 Word](http://www.ityouknow.com/python/2019/12/31/python-word-105.html) --- class: middle, center, reverse ### Next Part --- ### openpyxl—Python Excel文件处理库 **openpyxl** is a Python library to read/write Excel 2010 xlsx/xlsm/xltx/xltm files. [https://openpyxl.readthedocs.io/en/stable/](https://openpyxl.readthedocs.io/en/stable/) 有许多可以处理Excel表格的第三方库,数据分析中知名的**pandas**库,也可读写Excel文件。 但是,如果要精细化地处理Excel文件,**openpyxl**库也许会是最好的选择。比如,获取Excel表格中的超链接地址。 可以试试这个Excel文件:[半导体行业.xlsx](半导体行业.xlsx)<supscript>*</supscript> 要使用,先安装:(Anaconda Prompt里输入如下代码) ```bash pip install openpyxl ``` .footnote[ *:[示例代码](#10) ] --- ### Sample Code ```python from openpyxl import Workbook wb = Workbook() # grab the active worksheet ws = wb.active # Data can be assigned directly to cells ws['A1'] = 42 # Rows can also be appended ws.append([1, 2, 3]) # Python types will automatically be converted import datetime ws['A2'] = datetime.datetime.now() # Save the file wb.save("sample.xlsx") ``` --- ### Tutorial and Cookbook [https://openpyxl.readthedocs.io/en/stable/tutorial.html](https://openpyxl.readthedocs.io/en/stable/tutorial.html) ```python >>> from openpyxl import Workbook >>> wb = Workbook() ``` --- ### Cookbook [https://openpyxl.readthedocs.io/en/stable/usage.html](https://openpyxl.readthedocs.io/en/stable/usage.html) ```python >>> from openpyxl import Workbook >>> from openpyxl.utils import get_column_letter >>> wb = Workbook() >>> dest_filename = 'empty_book.xlsx' >>> ws1 = wb.active >>> ws1.title = "range names" >>> for row in range(1, 40): ... ws1.append(range(600)) >>> >>> ws2 = wb.create_sheet(title="Pi") >>> ws2['F5'] = 3.14 >>> >>> ws3 = wb.create_sheet(title="Data") >>> for row in range(10, 20): ... for col in range(27, 54): ... _ = ws3.cell(column=col, row=row, value="{0}".format(get_column_letter(col))) >>> print(ws3['AA10'].value) AA >>> wb.save(filename = dest_filename) ``` --- ### 示例代码 ```python import pandas as pd import openpyxl, re xlsx = '半导体行业.xlsx' df = pd.read_excel(xlsx) exf = openpyxl.load_workbook(xlsx) sheet = exf.active; C2 = sheet['C2']; C = sheet['C'] links = [c.value for c in C]; links_1 = links[1:-1] links_2 = ''.join(links_1) sample = '=HYPERLINK("http://news.windin.com/ns/bulletin.php?code=2B41260EA8F3&id=123597160&type=1","方正证券:2020年年度报告")' p = re.compile('"(.*?)","(.*?)"') list_of_tuple = p.findall(links_2) ``` --- ### 示例代码(续) ```python df2 = pd.DataFrame({'link': [t[0] for t in list_of_tuple], 'f_name': [t[1] for t in list_of_tuple]}) df2.sort_values(by=['f_name'],inplace=True,ignore_index=True) df2.to_csv('半导体行业.csv') df2_1 = pd.DataFrame({'link': [t[0] for t in list_of_tuple if ('(' in t[1]) or ('(' in t[1])], 'f_name': [t[1] for t in list_of_tuple if ('(' in t[1]) or ('(' in t[1])]}) df2_1.sort_values(by=['f_name'],inplace=True,ignore_index=True) ``` --- class: middle, center, reverse You know, we must learn many by ourselves! .pull-right[ —Yanfeng Wu 2022-03-31 ]