Regular Expressions

class: center, middle, inverse, title-slide

# Regular Expressions
### 吴燕丰
### School of Finance, Jiangxi University of Finance and Economics
### 2020/09/24, updated 2022/03/19

---

### Python RegEx (Regular Expressions)

A .hlight[RegEx], or .hlight[Regular Expression], is a sequence of characters that forms a search pattern.

.hlight[RegEx] can be used to check if a string contains the specified search pattern.

### &#127792;

Search the string to see if it starts with "顺丰" (SF) and ends with "年度报告" (annual report):

```python
import re

txt = "顺丰公司2021年年度报告"
x = re.search("^(顺丰).*(年度报告)$", txt)
print(x.group())
```

.footnote[Note:
This slide and several that follow are adapted from [Python RegEx (W3Schools tutorial)](https://www.w3schools.com/python/python_regex.asp)
]

---

### RegEx Functions

The ![:color white,#008080](re) module offers a set of functions that allow us to search a string for a match:

|Function|Description|
|:--------:|-----------|
|![:color white,#008080](findall)|Returns a list containing all matches|
|![:color white,#008080](search)|Returns a Match object if there is a match anywhere in the string|
|![:color white,#008080](split)|Returns a list where the string has been split at each match|
|![:color white,#008080](sub)|Replaces one or more matches with a string|
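
As a quick orientation, the sketch below runs all four functions on one sample string. The variable names and the `"_"` replacement are illustrative choices, not part of the W3Schools material.

```python
import re

txt = "The rain in Spain"

all_matches = re.findall(r"ai", txt)   # list of every non-overlapping match
first_match = re.search(r"ai", txt)    # Match object for the first hit (None if no hit)
pieces = re.split(r"\s", txt)          # split at each whitespace character
replaced = re.sub(r"\s", "_", txt)     # replace each whitespace character with "_"

print(all_matches)           # ['ai', 'ai']
print(first_match.start())   # 5
print(pieces)                # ['The', 'rain', 'in', 'Spain']
print(replaced)              # The_rain_in_Spain
```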

---

### The findall() Function

The ![:color white,#008080](findall&#40;&#41;) function returns a list containing all matches.

### &#127792;

Print a list of all matches:

```python
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)
```
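
Two behaviours of `findall()` are easy to miss, so here is a small sketch of them: with no match it returns an empty list rather than `None`, and with one capturing group it returns the group text instead of the whole match. The patterns are illustrative.

```python
import re

txt = "The rain in Spain"

# No match: the result is an empty list, never None.
no_hits = re.findall(r"Portugal", txt)
print(no_hits)        # []

# One capturing group: each list item is the group, not the full match.
before_ain = re.findall(r"(\w)ain", txt)
print(before_ain)     # ['r', 'p']
```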

---

### The search() Function

The ![:color white,#008080](search&#40;&#41;) function searches the string for a match, and returns a Match object if there is a match.

If there is more than one match, only the first occurrence of the match will be returned:

### &#127792;

Search for the first white-space character in the string:

```python
import re

txt = "The rain in Spain"
x = re.search(r"\s", txt)  # raw string: the backslash reaches the regex engine unchanged

print("The first white-space character is located in position:", 
      x.start())
```

---

### The split() Function

The ![:color white,#008080](split&#40;&#41;) function returns a list where the string has been split at each match:

### &#127792;

Split at each white-space character:

```python
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt)  # raw string avoids an invalid-escape warning for \s
print(x)
```

---

### The split() Function (continued)

You can control the number of occurrences by specifying the *maxsplit* parameter:

### &#127792;

Split the string only at the first occurrence:

```python
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)  # pass maxsplit by keyword
print(x)
```

---

### The sub() Function

The ![:color white,#008080](sub&#40;&#41;) function replaces the matches with the text of your choice:

### &#127792;

Replace every white-space character with the number 9:

```python
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)  # raw string avoids an invalid-escape warning for \s
print(x)
```

---
### The sub() Function (continued)

You can control the number of replacements by specifying the *count* parameter:

### &#127792;

Replace the first 2 occurrences:

```python
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)  # pass count by keyword
print(x)
```

---

### Match Object

A ![:color white,#008080](Match Object) is an object containing information about the search and the result.

Note: If there is no match, the value `None` will be returned, instead of the Match Object.

### &#127792;

Do a search that will return a Match Object:

```python
import re

txt = "The rain in Spain"
x = re.search("ai", txt)
print(x)  # prints the Match object
```
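
Because a failed search returns `None`, calling a method such as `.group()` on the result raises `AttributeError`. A minimal sketch of the usual guard (the search word "Portugal" is just an illustration):

```python
import re

txt = "The rain in Spain"
x = re.search(r"Portugal", txt)

# Check for None before touching the Match object.
if x is None:
    result = "no match"
else:
    result = x.group()

print(result)  # no match
```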

---

### Match Object (continued)

The ![:color white,#008080](Match Object) has properties and methods used to retrieve information about the search, and the result:

- ![:color white,#008080](.span&#40;&#41;) returns a tuple containing the start and end positions of the match.

- ![:color white,#008080](.string) returns the string passed into the function.

- ![:color white,#008080](.group&#40;&#41;) returns the part of the string where there was a match.

--
### &#127792;

Print the start and end positions of the first match.
The regular expression looks for any word that starts with an uppercase "S":

```python
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())
```

---
### &#127792;

Print the string passed into the function:

```python
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.string)
```

### &#127792;

Print the part of the string where there was a match.

The regular expression looks for any word that starts with an uppercase "S":

```python
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group())
```
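
When the pattern contains capturing groups, the same Match object also exposes each group. The sketch below splits the earlier pattern into two groups purely for illustration.

```python
import re

txt = "The rain in Spain"
x = re.search(r"\b(S)(\w+)", txt)

print(x.start(), x.end())   # 12 17  (same numbers as x.span())
print(x.group(0))           # Spain  (group 0 is the whole match)
print(x.group(1))           # S
print(x.group(2))           # pain
print(x.groups())           # ('S', 'pain')
```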

---

### Metacharacters

![](./images/metacharacters.png)
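
The table above is an image, so as a text companion here is a small sketch of a few common metacharacters. The selection and the patterns are illustrative and do not reproduce the image.

```python
import re

txt = "The rain in Spain"

dot = re.findall(r"r..n", txt)                    # "." is any one character except a newline
anchored = bool(re.search(r"^The.*Spain$", txt))  # "^" anchors the start, "$" the end
either = re.findall(r"rain|Spain", txt)           # "|" means either alternative
counted = re.findall(r"a.{2}", txt)               # "{2}" means exactly two repetitions

print(dot)       # ['rain']
print(anchored)  # True
print(either)    # ['rain', 'Spain']
print(counted)   # ['ain', 'ain']
```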

---

### Special Sequences

![](./images/specialsequences.png)
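
As a text companion to the image, the sketch below tries three common special sequences (`\d`, `\b`, `\s`). The sample strings are illustrative.

```python
import re

digits = re.findall(r"\d+", "顺丰公司2021年年度报告")     # \d matches a digit
word_end = re.findall(r"ain\b", "The rain in Spain")    # \b marks a word boundary
word_start = re.findall(r"\bain", "The rain in Spain")  # no word starts with "ain"
tokens = re.split(r"\s+", "The  rain\tin Spain")        # \s matches any whitespace

print(digits)      # ['2021']
print(word_end)    # ['ain', 'ain']
print(word_start)  # []
print(tokens)      # ['The', 'rain', 'in', 'Spain']
```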

---

### Sets

![](./images/sets.png)
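
A short sketch of sets in use, again as a text companion to the image; the patterns and sample strings are illustrative.

```python
import re

vowels = re.findall(r"[aeiou]", "Spain")                 # any one of the listed characters
others = re.findall(r"[^aeiou\s]", "in Spain")           # "^" inside a set negates it
years = re.findall(r"[0-9]{4}", "2020/09/24, 2022/03/19")  # a range, repeated four times

print(vowels)  # ['a', 'i']
print(others)  # ['n', 'S', 'p', 'n']
print(years)   # ['2020', '2022']
```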

---
class: center,middle

### Greedy or non-greedy?

### A bit of a dilemma!

---

### Example

```python
import re
 
line = "Cats are smarter than dogs"
# .* matches any run of characters (zero or more), excluding the newline \n
matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)
 
if matchObj:
   print ("matchObj.group() : ", matchObj.group())
   print ("matchObj.group(1) : ", matchObj.group(1))
   print ("matchObj.group(2) : ", matchObj.group(2))
else:
   print ("No match!!")
```

```
## matchObj.group() :  Cats are smarter than dogs
## matchObj.group(1) :  Cats
## matchObj.group(2) :  smarter
```

---

### Example (continued)

Why does `'(.*?)'` match nothing in the following code snippet?

```python
import re
 
line = "Cats are smarter than dogs"
# .* matches any run of characters (zero or more), excluding the newline \n
matchObj = re.match( r'(.*) are (.*?)', line, re.M|re.I)
 
if matchObj:
   print ("matchObj.group() : ", matchObj.group())
   print ("matchObj.group(1) : ", matchObj.group(1))
   print ("matchObj.group(2) : ", matchObj.group(2))
else:
   print ("No match!!")
```

```
## matchObj.group() :  Cats are 
## matchObj.group(1) :  Cats
## matchObj.group(2) :
```

Non-greedy matching

---

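### Non-greedy matching

After `' are '`, `'.*'` can match any of:
- ''  (0 repetitions: nothing at all)
- 's' (1 repetition)
- 'sm' (2 repetitions)
- ...
- 'smarter' (7 repetitions)
- ...
- 'smarter than dogs' (17 repetitions)

`'.*?'` makes `'.*'` match as few repetitions as possible (non-greedy). Nothing follows the group in the pattern, so the fewest that works is 0 repetitions.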
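
The difference can be checked side by side. This sketch reuses the example sentence and drops the `re.M|re.I` flags, which do not affect the result here.

```python
import re

line = "Cats are smarter than dogs"

greedy = re.match(r"(.*) are (.*)", line)   # second group takes as much as possible
lazy = re.match(r"(.*) are (.*?)", line)    # second group takes as little as possible

print(repr(greedy.group(2)))  # 'smarter than dogs'
print(repr(lazy.group(2)))    # ''
```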

---

### Non-greedy matching (continued)

Why does `'(.*?) '` match `'smarter '`?

```python
import re
 
line = "Cats are smarter than dogs"
# .* matches any run of characters (zero or more), excluding the newline \n
matchObj = re.match( r'(.*) are (.*?) ', line, re.M|re.I)
 
if matchObj:
   print ("matchObj.group() : ", matchObj.group())
   print ("matchObj.group(1) : ", matchObj.group(1))
   print ("matchObj.group(2) : ", matchObj.group(2))
else:
   print ("No match!!")
```

```
## matchObj.group() :  Cats are smarter 
## matchObj.group(1) :  Cats
## matchObj.group(2) :  smarter
```

What happens if `line` is set to the following instead?

```python
line = "Cats are  smarter than dogs"
# two spaces between "are" and "smarter"
```

---

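### Non-greedy matching (continued)

**There are two spaces between "are" and "smarter"**

Now the group in `'(.*?) '` matches `''` (0 repetitions): the space that follows the group in the pattern can already match the second space in `line`, so the lazy group stops without consuming anything.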

```python
import re
 
line = "Cats are  smarter than dogs"
# two spaces between "are" and "smarter"
matchObj = re.match( r'(.*) are (.*?) ', line, re.M|re.I)
 
if matchObj:
   print ("matchObj.group() : ", matchObj.group())
   print ("matchObj.group(1) : ", matchObj.group(1))
   print ("matchObj.group(2) : ", matchObj.group(2))
else:
   print ("No match!!")
```

```
## matchObj.group() :  Cats are  
## matchObj.group(1) :  Cats
## matchObj.group(2) :
```
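
One possible way to make the pattern tolerate extra spaces is to replace the single space after `are` with `\s+`. This variant is an assumption added for illustration, not part of the original example.

```python
import re

# \s+ absorbs one or more whitespace characters before the lazy group starts.
pattern = re.compile(r"(.*) are\s+(.*?) ")

one_space = pattern.match("Cats are smarter than dogs")
two_spaces = pattern.match("Cats are  smarter than dogs")

print(one_space.group(2))   # smarter
print(two_spaces.group(2))  # smarter
```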

---

### Greedy matching

`'(.*) '`: greedy matching

```python
import re
 
line = "Cats are smarter than dogs"
# .* matches any run of characters (zero or more), excluding the newline \n
matchObj = re.match( r'(.*) are (.*) ', line, re.M|re.I)
 
if matchObj:
   print ("matchObj.group() : ", matchObj.group())
   print ("matchObj.group(1) : ", matchObj.group(1))
   print ("matchObj.group(2) : ", matchObj.group(2))
else:
   print ("No match!!")
```

```
## matchObj.group() :  Cats are smarter than 
## matchObj.group(1) :  Cats
## matchObj.group(2) :  smarter than
```

`'(.*) '` could stop at either of the candidates below; being greedy, it takes the longest, so `'smarter than '` is selected:

- 'smarter ' (`.*` repeats 7 times)
- 'smarter than ' (`.*` repeats 12 times) &#10003;

---

### Summary: non-greedy `'(.*?)'` vs. `'(.*?) '`

.pull-left[
- `'(.*?)'` is the non-greedy form of ![:color white,#008080]('&#40;.&#42;&#41;'); candidates:
 + `''` (&#10003;)
 + `'s'`
 + `'sm'`
 + `'...'`
 + `'smarter than dogs'`

So it matches `''`.
]

.pull-right[
- `'(.*?) '` is the non-greedy form of ![:color white,#008080]('&#40;.&#42;&#41; '); candidates:
 + `'smarter '` (&#10003;)
 + `'smarter than '`

So it matches `'smarter '`.
]

.footnote[
**Note**: ![:color white,#008080]('&#40;.&#42;&#41; ') has one more (trailing) space than ![:color white,#008080]('&#40;.&#42;&#41;')
]
---

### Another Example: greedy or non-greedy

What is the meaning of non-greedy?

```python
import re

re.findall(r'<.*?>', '<a> b <c>')
```

.pull-left[

### Wrong!

Matches nothing, because `'<.*?>'` only matches `<>`.
]

.pull-right[

### Right!

Matches:

- `<>`
- `<a>`
- `<c>`
- `<ab>`
- `<a...>` as long as `'...'` doesn't contain `'>'`

]
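
The condition "as long as `'...'` doesn't contain `'>'`" can also be written directly as a negated set. The sketch below compares the three spellings; `<[^>]*>` is an alternative added for illustration, not part of the original slide.

```python
import re

text = "<a> b <c>"

lazy = re.findall(r"<.*?>", text)       # non-greedy: stop at the first ">"
negated = re.findall(r"<[^>]*>", text)  # explicit: anything but ">" between the brackets
greedy = re.findall(r"<.*>", text)      # greedy: run on to the last ">"

print(lazy)     # ['<a>', '<c>']
print(negated)  # ['<a>', '<c>']
print(greedy)   # ['<a> b <c>']
```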

---

### Continued

.pull-left[
Non-greedy: three matches:
 + `'<>'` &#10003;
 + `'<a>'` &#10003;
 + `'<c>'` &#10003;

```python
import re

re.findall(r'<.*?>',
 '<> <a> b <c>')
```

```
## ['<>', '<a>', '<c>']
```
]

.pull-right[
Greedy: one match:
 + `'<>'` &#10007;
 + `'<a>'` &#10007;, `'<c>'` &#10007;
 + `'<> <a> b <c>'` &#10003;

```python
import re

re.findall(r'<.*>',
 '<> <a> b <c>')
```

```
## ['<> <a> b <c>']
```
]

---

### Regular Expression Tutorials

[Regular Expression HOWTO:](https://docs.python.org/3/howto/regex.html)

.pull-left[
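- [Introduction](https://docs.python.org/zh-cn/3/howto/regex.html#introduction)
- [Simple Patterns](https://docs.python.org/zh-cn/3/howto/regex.html#simple-patterns)
   + Matching Characters
   + Repeating Things
- [Using Regular Expressions](https://docs.python.org/zh-cn/3/howto/regex.html#using-regular-expressions)
   + Compiling Regular Expressions
   + The Backslash Plague
   + Performing Matches
   + Module-Level Functions
   + Compilation Flags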
]

.pull-right[
- [More Pattern Power](https://docs.python.org/zh-cn/3/howto/regex.html#more-pattern-power)
   + More Metacharacters
   + Grouping
   + Non-capturing and Named Groups
   + Lookahead Assertions
- [Modifying Strings](https://docs.python.org/zh-cn/3/howto/regex.html#modifying-strings)
   + Splitting Strings
   + Search and Replace
- [Common Problems](https://docs.python.org/zh-cn/3/howto/regex.html#common-problems)
]

[Python RegEx](https://www.w3schools.com/python/python_regex.asp)

[https://www.runoob.com/python3/python3-reg-expressions.html](https://www.runoob.com/python3/python3-reg-expressions.html)

[https://www.liujiangblog.com/course/python/74](https://www.liujiangblog.com/course/python/74)

---
class: inverse

### Past Videos

---

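### Application: Annual Report Format Standards for Listed Companies

In recent years, the China Securities Regulatory Commission (CSRC) has revised the annual report format standards four times: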

- [2021 revision](../../application/annual_report_regulation/公开发行证券的公司信息披露内容与格式准则第2号——年度报告的内容与格式（2021年修订）.pdf)

- [2017 revision](../../application/annual_report_regulation/公开发行证券的公司信息披露内容与格式准则第2号——年度报告的内容与格式（2017年修订）.pdf)

- [2016 revision](../../application/annual_report_regulation/公开发行证券的公司信息披露内容与格式准则第2号——年度报告的内容与格式（2016年修订）.pdf)

- [2012 revision](../../application/annual_report_regulation/公开发行证券的公司信息披露内容与格式准则第2号——年度报告的内容与格式（2012年修订）.pdf)

How can Python help us interpret the differences between these revisions?

See [Annual Report Format Standards for Listed Companies](../../application/annual_report_regulation/arr.html)
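
As a small first step, a regular expression can pull the revision year out of each file name listed above. This is only a sketch: the list `titles` is typed in by hand from the links on this slide, and the pattern assumes the full-width parentheses used in those names.

```python
import re

# File names copied from the links above (hypothetical in-memory list).
titles = [
    "公开发行证券的公司信息披露内容与格式准则第2号——年度报告的内容与格式（2021年修订）.pdf",
    "公开发行证券的公司信息披露内容与格式准则第2号——年度报告的内容与格式（2017年修订）.pdf",
    "公开发行证券的公司信息披露内容与格式准则第2号——年度报告的内容与格式（2016年修订）.pdf",
    "公开发行证券的公司信息披露内容与格式准则第2号——年度报告的内容与格式（2012年修订）.pdf",
]

# Capture four digits that sit between "（" and "年修订）".
pattern = re.compile(r"（(\d{4})年修订）")
years = [pattern.search(title).group(1) for title in titles]

print(years)  # ['2021', '2017', '2016', '2012']
```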