Python处理Word、Excel文档

class: center, middle, inverse, title-slide

# Python处理Word、Excel文档
### 吴燕丰
### 江西财大，金融学院
### 2020/12/10，更新于2022-03-31

---

### python-docx 模块

安装（通过Anaconda Prompt）：

```bash
pip install python-docx -i https://pypi.tuna.tsinghua.edu.cn/simple
```

例子

```python
from docx import Document

document = Document()

document.add_heading('标题')
document.add_paragraph('段落内容。。。')
document.add_table(rows=2, cols=2)
document.add_page_break()
document.save('demo.docx')
```

.footnote[
**注意**：  
- python-docx模块不能处理旧版本的Microsoft Word文档。  
- doc 转 docx 参见知乎：https://zhuanlan.zhihu.com/p/64189783
]

---

### `.doc` 转 `.docx`

批量转换 `.doc` -> `.docx`（先安装pypiwin32模块： `pip install pypiwin32`）

```python
import os #用于获取目标文件所在路径
path="path\\to\\doc_files" # 文件夹绝对路径
files=[]
for file in os.listdir(path):
    if file.endswith(".doc"): #排除文件夹内的其它干扰文件，只获取".doc"后缀的word文件
        files.append(path+file) 
files
```

```python
from win32com import client as wc #导入模块
word = wc.Dispatch("Word.Application") # 打开word应用程序
for file in files:
    doc = word.Documents.Open(file) #打开word文件
    doc.SaveAs("{}x".format(file), 12)#另存为后缀为".docx"的文件，其中参数12指docx文件
    doc.Close() #关闭原来word文件
word.Quit()
print("完成！")
```

---

### User Guide

方法链接：https://python-docx.readthedocs.io/en/latest/user/documents.html

- [快速入门](https://python-docx.readthedocs.io/en/latest/user/quickstart.html)
- [处理文档](https://python-docx.readthedocs.io/en/latest/user/documents.html)
- [处理文字](https://python-docx.readthedocs.io/en/latest/user/text.html)
- [处理节](https://python-docx.readthedocs.io/en/latest/user/sections.html)
- [处理页眉页脚](https://python-docx.readthedocs.io/en/latest/user/hdrftr.html)
- [API基础](https://python-docx.readthedocs.io/en/latest/user/api-concepts.html)
- [理解格式](https://python-docx.readthedocs.io/en/latest/user/styles-understanding.html)
- [处理格式](https://python-docx.readthedocs.io/en/latest/user/styles-using.html)
- [理解图片及其他图形](https://python-docx.readthedocs.io/en/latest/user/shapes.html)

中文资料:

[Python 操作 Word](http://www.ityouknow.com/python/2019/12/31/python-word-105.html)

---
class: middle, center, reverse

### Next Part

---
### openpyxl&mdash;Python Excel文件处理库

**openpyxl** is a Python library to read/write Excel 2010 xlsx/xlsm/xltx/xltm files.

[https://openpyxl.readthedocs.io/en/stable/](https://openpyxl.readthedocs.io/en/stable/)

有许多可以处理Excel表格的第三方库，数据分析中知名的**pandas**库，也可读写Excel文件。

但是，如果要精细化地处理Excel文件，**openpyxl**库也许会是最好的选择。比如，获取Excel表格中的超链接地址。
可以试试这个Excel文件：[半导体行业.xlsx](半导体行业.xlsx)<supscript>*</supscript>

要使用，先安装：（Anaconda Prompt里输入如下代码）

```bash
pip install openpyxl
```

.footnote[
*：[示例代码](#10)
]
---
### Sample Code

```python
from openpyxl import Workbook
wb = Workbook()

# grab the active worksheet
ws = wb.active

# Data can be assigned directly to cells
ws['A1'] = 42

# Rows can also be appended
ws.append([1, 2, 3])

# Python types will automatically be converted
import datetime
ws['A2'] = datetime.datetime.now()

# Save the file
wb.save("sample.xlsx")
```

---
### Tutorial and Cookbook

[https://openpyxl.readthedocs.io/en/stable/tutorial.html](https://openpyxl.readthedocs.io/en/stable/tutorial.html)

```python
>>> from openpyxl import Workbook
>>> wb = Workbook()
```

---
### Cookbook

[https://openpyxl.readthedocs.io/en/stable/usage.html](https://openpyxl.readthedocs.io/en/stable/usage.html)

```python
>>> from openpyxl import Workbook
>>> from openpyxl.utils import get_column_letter
>>> wb = Workbook()
>>> dest_filename = 'empty_book.xlsx'
>>> ws1 = wb.active
>>> ws1.title = "range names"
>>> for row in range(1, 40):
...     ws1.append(range(600))
>>>
>>> ws2 = wb.create_sheet(title="Pi")
>>> ws2['F5'] = 3.14
>>>
>>> ws3 = wb.create_sheet(title="Data")
>>> for row in range(10, 20):
...     for col in range(27, 54):
...         _ = ws3.cell(column=col, row=row, value="{0}".format(get_column_letter(col)))
>>> print(ws3['AA10'].value)
AA
>>> wb.save(filename = dest_filename)
```
---
### 示例代码

```python
import pandas as pd
import openpyxl, re

xlsx = '半导体行业.xlsx'
df = pd.read_excel(xlsx)

exf = openpyxl.load_workbook(xlsx)
sheet = exf.active; C2 = sheet['C2']; C = sheet['C']

links = [c.value for c in C]; links_1 = links[1:-1]
links_2 = ''.join(links_1)

sample = '=HYPERLINK("http://news.windin.com/ns/bulletin.php?code=2B41260EA8F3&id=123597160&type=1","方正证券:2020年年度报告")'

p = re.compile('"(.*?)","(.*?)"')
list_of_tuple = p.findall(links_2)
```

---
### 示例代码（续）

```python
df2 = pd.DataFrame({'link': [t[0] for t in list_of_tuple],
                    'f_name': [t[1] for t in list_of_tuple]})
df2.sort_values(by=['f_name'],inplace=True,ignore_index=True)
df2.to_csv('半导体行业.csv')

df2_1 = pd.DataFrame({'link': [t[0] for t in list_of_tuple if ('(' in t[1]) or ('（' in t[1])],
                      'f_name': [t[1] for t in list_of_tuple if ('(' in t[1]) or ('（' in t[1])]})
df2_1.sort_values(by=['f_name'],inplace=True,ignore_index=True)
```

---
class: middle, center, reverse

You know, we must learn many by ourselves!

.pull-right[
&mdash;Yanfeng Wu

2022-03-31
]