Regular Expressions

class: center, middle, inverse, title-slide

# Regular Expressions
## Advanced
### 吴燕丰
### School of Finance, Jiangxi University of Finance and Economics
### 2020-10-30, updated 2022-04-06

---

### Matching Numbers

How do we match a number like `3,000,000`?

```python
import re

re.search(r'\d{1,3}(,\d{3})*', '3,000,000')
```

```
## <re.Match object; span=(0, 9), match='3,000,000'>
```

And what about `3,000,000.01`?

```python
import re

re.search(r'\d{1,3}(,\d{3})*(\.\d+)?', '3,000,000.01')
```

```
## <re.Match object; span=(0, 12), match='3,000,000.01'>
```
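Note that `re.search()` is satisfied with a partial match, so on its own it cannot tell a well-formed number from a malformed one. A minimal sketch, assuming we want to validate a whole string, uses `re.fullmatch()`; the name `NUMBER` is ours and does not come from the slides.

```python
import re

# Hypothetical name for the pattern used above.
NUMBER = r'\d{1,3}(,\d{3})*(\.\d+)?'

# re.search() accepts a partial match: only '400' out of '40000' is matched.
partial = re.search(NUMBER, '40000').group(0)

# re.fullmatch() succeeds only if the whole string matches the pattern.
ok = re.fullmatch(NUMBER, '3,000,000.01') is not None
bad = re.fullmatch(NUMBER, '40000') is not None

print(partial, ok, bad)   # 400 True False
```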

---

### How Do We Match All the Numbers?

`re.findall()`?

```python
import re

re.findall(r'\d{1,3}(,\d{3})*', '3,000,000')
```

```
## [',000']
```

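The match failed! Why?

**Note**: strictly speaking, the match above succeeded; the returned result is simply not what we expected.

The right way is to use the non-capturing version of regular parentheses, `(?:...)`: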

```python
import re

re.findall(r'\d{1,3}(?:,\d{3})*', '3,000,000')
```

```
## ['3,000,000']
```
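If the capturing group has to stay in the pattern, one possible alternative (a sketch, not the only way) is `re.finditer()`. It yields match objects, and `group(0)` is always the whole match, whatever groups the pattern contains.

```python
import re

pattern = r'\d{1,3}(,\d{3})*'   # still a capturing group

# group(0) is the whole match, independent of the groups in the pattern.
whole = [m.group(0) for m in re.finditer(pattern, '3,000,000')]
print(whole)   # ['3,000,000']
```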

---

### Do We Really Understand `re.findall()`?

`re.findall(pattern, string, flags=0)`

- Return all **non-overlapping matches** of pattern in string, as a **list of strings**. The string is scanned left-to-right, and matches are returned in the order found.

- If **one or more groups** are present in the pattern, return a **list of groups**;

- this will be a **list of tuples** if the pattern has **more than one group**.

- Empty matches are included in the result.

```python
import re

re.findall(r'ab', 'abacab')
```

```
## ['ab', 'ab']
```
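The phrase "non-overlapping" matters. As a small illustration (the input string is our own example), `'aba'` occurs twice in `'ababa'`, but the two occurrences share a character, so only the first is returned:

```python
import re

# 'aba' occurs at index 0 and at index 2 of 'ababa'. The occurrences
# overlap, and scanning resumes after the first match, so only one is found.
overlap = re.findall(r'aba', 'ababa')
print(overlap)   # ['aba']
```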

???
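In short

`re.findall()` returns the non-overlapping matches of pattern in string. The string is scanned left to right and the matches are returned in the order found. If the pattern contains one group, a list of group strings is returned; with more than one group, a list of tuples. Empty matches are included in the result. Changed in version 3.7: non-empty matches can now start just after a previous empty match.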

---

### `re.findall()`

`re.findall(pattern, string, flags=0)`

- If the pattern contains **one** group, e.g. `'(a)b'` contains the single group `'(a)'`[1], a list of groups is returned; *matched text outside the group is not returned*.

```python
import re

re.findall(r'(a)b', 'abacab')
```

```
## ['a', 'a']
```

- If the pattern contains **two or more** groups, a list of tuples is returned:

```python
re.findall(r'(a)(b)', 'abacab')
```

```
## [('a', 'b'), ('a', 'b')]
```
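To get the whole match and the inner group at the same time, one option is to wrap the entire pattern in an outer group. A small sketch of that idea:

```python
import re

# Groups are numbered by their opening parenthesis:
# group 1 is '((a)b)' (the whole match), group 2 is '(a)'.
pairs = re.findall(r'((a)b)', 'abacab')
print(pairs)   # [('ab', 'a'), ('ab', 'a')]
```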

.footnote[
\[1\]: a group is written with parentheses `()`
]

---

### `re.findall()`

Combining `(...)` with `{n}`

```python
re.findall(r'(a){1}', 'aaaa')
```

```
## ['a', 'a', 'a', 'a']
```

```python
re.findall(r'(a){2}', 'aaaa')
```

```
## ['a', 'a']
```

.pull-left[
Is this what you expected?
]

.pull-right[
It certainly was not what I expected!
]

---

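### Let Me Explain! (Capturing Groups)

- `re.findall(r'(a){1}', 'aaaa')` matches 4 times:
 + match `'a'` at index 0: `(a)` captured once, so its last (and only) capture, `'a'`, is returned;
 + match `'a'` at index 1: `(a)` captured once, so its last (and only) capture, `'a'`, is returned;
 + match `'a'` at index 2: `(a)` captured once, so its last (and only) capture, `'a'`, is returned;
 + match `'a'` at index 3: `(a)` captured once, so its last (and only) capture, `'a'`, is returned.

- `re.findall(r'(a){2}', 'aaaa')` matches 2 times:
 + match `'aa'` at indices 0-1: `(a)` captured twice, and only the last capture is returned, i.e. the `'a'` at index 1;
 + match `'aa'` at indices 2-3: `(a)` captured twice, and only the last capture is returned, i.e. the `'a'` at index 3.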
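The "last capture" rule can be checked directly on a match object. The sketch below uses `re.search()`, so it inspects only the first match:

```python
import re

m = re.search(r'(a){2}', 'aaaa')

whole = (m.group(0), m.span(0))   # the whole match
last = (m.group(1), m.span(1))    # the last capture of (a)

print(whole)   # ('aa', (0, 2))
print(last)    # ('a', (1, 2))
```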

What if we want to return the whole match `'aa'` instead of only the last capture `'a'`?

See the next slide!

---

### Non-Capturing Groups

The non-capturing version of regular parentheses, `(?:...)`, vs. `(...)`

```python
re.findall(r'(?:a){2}', 'aaaa')
```

```
## ['aa', 'aa']
```

Documentation for `(?:...)`:

> A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

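In short:
> `(?:...)` groups a sub-pattern so that it can be repeated or alternated, but it creates no capture: the substring it matched cannot be retrieved after the match or referenced later in the pattern.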
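A quick way to see that `(?:...)` creates no capture is to look at the groups of a match. A minimal sketch with a toy pattern of our own:

```python
import re

m = re.search(r'(?:a)(b)', 'ab')

whole = m.group(0)     # the non-capturing part still takes part in the match
captured = m.groups()  # only the capturing group is listed
count = re.compile(r'(?:a)(b)').groups  # number of capturing groups

print(whole, captured, count)   # ab ('b',) 1
```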

---

### Matching Numbers

```python
import re

re.findall(r'\d{1,3}(?:,\d{3})*', '3,000,000')
```

```
## ['3,000,000']
```

```python
import re

line = '''
3,000,000
400
4,000
'''
re.findall(r'\d{1,3}(?:,\d{3})*', line)
```

```
## ['3,000,000', '400', '4,000']
```

```python
re.findall(r'\d{1,3}(?:,\d{3})*(?:\.\d+)?', '32,000.00123')
```

```
## ['32,000.00123']
```
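Once the numbers are extracted, they are still strings. A possible follow-up step, assuming plain `float` precision is enough, is to strip the thousands separators and convert. The sample text is hypothetical.

```python
import re

# Hypothetical sample text, for illustration only.
text = 'Revenue: 3,000,000; cost: 32,000.00123; units: 400'
found = re.findall(r'\d{1,3}(?:,\d{3})*(?:\.\d+)?', text)

# Remove the thousands separators before converting to float.
values = [float(s.replace(',', '')) for s in found]

print(found)    # ['3,000,000', '32,000.00123', '400']
print(values)
```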

---

### Please Explain

```python
re.findall(r'(,\d{3}){1}','3,000,000,000')
```

```
## [',000', ',000', ',000']
```

```python
re.findall(r'(,\d{3}){3}','3,000,000,000')
```

```
## [',000']
```
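One way to work out the explanation is to print where each whole match and each last capture sits. A sketch using `re.finditer()`:

```python
import re

s = '3,000,000,000'
rows = []
for n in (1, 3):
    pattern = r'(,\d{3}){%d}' % n
    for m in re.finditer(pattern, s):
        # span(0): the whole match; span(1): the last capture of the group
        rows.append((n, m.span(0), m.group(1), m.span(1)))

for row in rows:
    print(row)
# (1, (1, 5), ',000', (1, 5))
# (1, (5, 9), ',000', (5, 9))
# (1, (9, 13), ',000', (9, 13))
# (3, (1, 13), ',000', (9, 13))
```

With `{3}` there is a single match covering indices 1 to 13, and `re.findall()` reports only the last of its three captures.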

---

### Please Explain (continuing)

**The confusing** `*`, `+`, and `?`

```python
re.findall(r'(,\d{3})*','3,000,000,000')
```

```
## ['', ',000', '']
```

```python
re.findall(r'(,\d{3})+','3,000,000,000')
```

```
## [',000']
```

```python
re.findall(r'(,\d{3})?','3,000,000,000')
```

```
## ['', ',000', ',000', ',000', '']
```

.footnote[
`*`: zero repetitions also count as a successful match, so the empty string `''` is matched by `'(,\d{3})*'` as well.
]
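The empty strings in the `*` result can be located with `re.finditer()`. The sketch below assumes Python 3.7 or later, where an empty match may directly follow a non-empty one.

```python
import re

s = '3,000,000,000'

# For an empty match the group did not participate, so group(1) is None;
# re.findall() reports such a group as ''.
info = [(m.span(0), m.group(1)) for m in re.finditer(r'(,\d{3})*', s)]

print(info)
# [((0, 0), None), ((1, 13), ',000'), ((13, 13), None)]
```

The first empty match sits before the `3`, which the pattern cannot consume, and the second sits at the end of the string.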

---
### Lookahead and Lookbehind

`(?=...)` **lookahead** (look to the right)

Matches if `...` matches next, but **doesn't consume** any of the string. This is called a lookahead assertion. For example, `Isaac (?=Asimov)` will match `'Isaac '` only if it's followed by `'Asimov'`.

`(?!...)` **negative lookahead**

Matches if `...` doesn't match next. This is a negative lookahead assertion. For example, `Isaac (?!Asimov)` will match `'Isaac '` only if it's not followed by `'Asimov'`.

`(?<=...)` **lookbehind** (look to the left)

Matches if the current position in the string is preceded by a match for `...` that ends at the current position. This is called a positive lookbehind assertion.

`(?<!...)` **negative lookbehind**

Matches if the current position in the string is not preceded by a match for `...`. This is called a negative lookbehind assertion.
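The four assertions can be tried out on short strings. The sketch below reuses the `Isaac Asimov` example quoted above, plus an `abcdef` example for the lookbehind cases:

```python
import re

# Positive lookahead: 'Asimov' must follow, but it is not part of the match.
ahead = re.search(r'Isaac (?=Asimov)', 'Isaac Asimov').group(0)

# Negative lookahead: no match here, because 'Asimov' does follow.
not_ahead = re.search(r'Isaac (?!Asimov)', 'Isaac Asimov')

# Positive lookbehind: 'def' matches only when preceded by 'abc'.
behind = re.search(r'(?<=abc)def', 'abcdef').group(0)

# Negative lookbehind: no match here, because 'abc' does precede 'def'.
not_behind = re.search(r'(?<!abc)def', 'abcdef')

print(repr(ahead), not_ahead, repr(behind), not_behind)
# 'Isaac ' None 'def' None
```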

---
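### Lookahead and Lookbehind: Example

Match every number whose thousands are correctly grouped with commas (`','`), including short numbers such as 400, but exclude ungrouped numbers such as 40000.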

```python
import re

line = '''
3,000,000
400
4,000
40000
'''
m = re.findall(r'(?<=\s)\d{1,3}(?:,\d{3})*(?=\s)', line)
print(m)
```

```
## ['3,000,000', '400', '4,000']
```
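One caveat: `(?<=\s)` and `(?=\s)` require an actual whitespace character on each side, so a number at the very start or end of the string is missed. A possible variant, shown here as a sketch on a hypothetical one-line input, uses the negative forms `(?<!\S)` and `(?!\S)`, which are also satisfied at the string boundaries:

```python
import re

# Hypothetical input: no whitespace before the first or after the last number.
text = '3,000,000 400 4,000 40000'

strict = re.findall(r'(?<=\s)\d{1,3}(?:,\d{3})*(?=\s)', text)
relaxed = re.findall(r'(?<!\S)\d{1,3}(?:,\d{3})*(?!\S)', text)

print(strict)    # ['400', '4,000']
print(relaxed)   # ['3,000,000', '400', '4,000']
```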

Read this carefully:
[Regular expression operations](https://docs.python.org/3/library/re.html#module-re)<sup>*</sup>

.footnote[
*: Your teacher means it; otherwise you will miss some useful tricks.
]

---
class: inverse

### Previous Videos

---
class: center, middle

### Acknowledgements

[Stack Overflow: Python regular expressions - re.search() vs re.findall()](https://stackoverflow.com/questions/9000960/python-regular-expressions-re-search-vs-re-findall)

[Stack Overflow: Regular expression to match numbers with or without commas and decimals in text](https://stackoverflow.com/questions/5917082/regular-expression-to-match-numbers-with-or-without-commas-and-decimals-in-text)

---

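### Applications: Common Uses of Regular Expressions

- Matching IP addresses

- Matching email addresses

- Matching dates

- Matching times

- Matching mobile phone numbers

- Matching landline phone numbers

[Common applications of regular expressions](../../application/re_common_applications/re_common_applications.html)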

---
class: middle

.center[
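He loves to read, but does not insist on understanding every detail; whenever something clicks, he is so delighted that he forgets to eat.
]
.pull-right[
Tao Qian (365-427), Jin dynasty
]

>No one knows where the gentleman comes from, nor are his family name and courtesy name known; five willow trees grow beside his house, and so he took his name from them. He is quiet and speaks little, and does not crave fame or profit. He loves to read, but does not insist on understanding every detail; whenever something clicks, he is so delighted that he forgets to eat.

.pull-right[
Tao Qian, Jin dynasty, *Biography of Master Five Willows* (《五柳先生传》)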
]