class: center, middle, inverse, title-slide

# Regular Expressions
## Advanced Topics
### 吴燕丰
### School of Finance, Jiangxi University of Finance and Economics
### 2020/10/30, updated 2022-04-06

---

### Matching Numbers

How do we match a number like `3,000,000`?

```python
import re
re.search(r'\d{1,3}(,\d{3})*', '3,000,000')
```

```
## <re.Match object; span=(0, 9), match='3,000,000'>
```

<br>

What about `3,000,000.01`?

```python
import re
# Raw string, so that \d and \. reach the regex engine unchanged
re.search(r'\d{1,3}(,\d{3})*(\.\d+)?', '3,000,000.01')
```

```
## <re.Match object; span=(0, 12), match='3,000,000.01'>
```

---

### How Do We Match All the Numbers?

`re.findall()`?

```python
import re
re.findall(r'\d{1,3}(,\d{3})*', '3,000,000')
```

```
## [',000']
```

The match failed! <span style="color:red;font-size:32pt">Why?</span>

--

**Note**: Strictly speaking, the match above succeeded; the returned result is just not what we expected.

--

The correct approach uses the non-capturing version of regular parentheses, `(?:...)`:

```python
import re
re.findall(r'\d{1,3}(?:,\d{3})*', '3,000,000')
```

```
## ['3,000,000']
```

---

### Did We Understand `re.findall()` Correctly?

`re.findall(pattern, string, flags=0)`

- Return all **non-overlapping matches** of pattern in string, as a **list of strings**. The string is scanned left-to-right, and matches are returned in the order found.
- If **one or more groups** are present in the pattern, return a **list of groups**;
    - this will be a **list of tuples** if the pattern has **more than one group**.
- Empty matches are included in the result.

<br>

```python
import re
re.findall(r'ab', 'abacab')
```

```
## ['ab', 'ab']
```

???
In short: `re.findall()` returns a list of the non-overlapping matches of pattern in string; the string is scanned left to right and matches are returned in the order found. If the pattern contains one or more groups, a list of groups is returned, which is a list of tuples when the pattern has more than one group. Empty matches are included in the result.<br><br>Changed in version 3.7: Non-empty matches can now start just after a previous empty match.

---

### `re.findall()`

`re.findall(pattern, string, flags=0)`

- If the pattern contains **one** group, e.g. `'(a)b'` contains the single group `'(a)'`<sup>[1]</sup>, a list of groups is returned; *matched text outside the group is not returned*.

```python
import re
re.findall(r'(a)b', 'abacab')
```

```
## ['a', 'a']
```

- If the pattern contains **two or more** groups, a list of tuples is returned:

```python
re.findall(r'(a)(b)', 'abacab')
```

```
## [('a', 'b'), ('a', 'b')]
```

.footnote[
\[1\]: A group is written with parentheses, `()`.
]

---

### `re.findall()`

Combining `(...)` with `{n}`

```python
re.findall(r'(a){1}', 'aaaa')
```

```
## ['a', 'a', 'a', 'a']
```

```python
re.findall(r'(a){2}', 'aaaa')
```

```
## ['a', 'a']
```

.pull-left[
<span style='font-size:24pt'>Is this what you expected?</span>
]

--

.pull-right[
<span style='font-size:24pt'>It is certainly not what I expected!</span>
]

---

### Let Me Explain! (Capturing Groups)

- `re.findall(r'(a){1}', 'aaaa')` matches 4 times:
    + `'a'`: '<span style='color:red'>a</span>aaa'; the last capture of `(a)` is returned, and there is only one capture;
    + `'a'`: 'a<span style='color:red'>a</span>aa'; the last capture of `(a)` is returned, and there is only one capture;
    + `'a'`: 'aa<span style='color:red'>a</span>a'; the last capture of `(a)` is returned, and there is only one capture;
    + `'a'`: 'aaa<span style='color:red'>a</span>'; the last capture of `(a)` is returned, and there is only one capture.

- `re.findall(r'(a){2}', 'aaaa')` matches 2 times:
    + `'aa'`: '<span style='color:red'>aa</span>aa'; the last capture of `(a)` is returned, so the result is 'a', i.e. '<span style='color:red'>a</span><span style='color:red;font-size:32pt'>a</span>aa';
    + `'aa'`: 'aa<span style='color:red'>aa</span>'; the last capture of `(a)` is returned, so the result is 'a', i.e. 'aa<span style='color:red'>a</span><span style='color:red;font-size:32pt'>a</span>'.

What if we want to return '<span style='color:red;font-size:32pt'>aa</span>aa' instead of '<span style='color:red'>a</span><span style='color:red;font-size:32pt'>a</span>aa'?

See the next slide!

---

### Non-Capturing Version (Non-Capturing Groups)

Non-capturing version of regular parentheses: `(?:...)` vs. `(...)`

```python
re.findall(r'(?:a){2}', 'aaaa')
```

```
## ['aa', 'aa']
```

From the documentation for `(?:...)`:

> A non-capturing version of regular parentheses.
> Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

In other words, `(?:...)` groups a sub-pattern without creating a capture group. When the pattern has no other groups, `re.findall()` therefore returns the whole match.

---

### Matching Numbers

```python
import re
re.findall(r'\d{1,3}(?:,\d{3})*', '3,000,000')
```

```
## ['3,000,000']
```

```python
import re
line = '''
3,000,000
400
4,000
'''
re.findall(r'\d{1,3}(?:,\d{3})*', line)
```

```
## ['3,000,000', '400', '4,000']
```

```python
re.findall(r'\d{1,3}(?:,\d{3})*(?:\.\d+)?', '32,000.00123')
```

```
## ['32,000.00123']
```

---

### Please Explain

```python
re.findall(r'(,\d{3}){1}','3,000,000,000')
```

```
## [',000', ',000', ',000']
```

```python
re.findall(r'(,\d{3}){3}','3,000,000,000')
```

```
## [',000']
```

---

### Please Explain (continued)

**The confusing** `*`, `+`, and `?`

```python
re.findall(r'(,\d{3})*','3,000,000,000')
```

```
## ['', ',000', '']
```

```python
re.findall(r'(,\d{3})+','3,000,000,000')
```

```
## [',000']
```

```python
re.findall(r'(,\d{3})?','3,000,000,000')
```

```
## ['', ',000', ',000', ',000', '']
```

.footnote[
`*`: Zero repetitions also count as a successful match, so the empty string `''` is matched by `(,\d{3})*` as well.
]

---

### Lookahead and Lookbehind

'`(?=...)`' **lookahead** (look to the right)

Matches if ... matches next, but **doesn’t consume** any of the string. This is called a lookahead assertion. For example, `Isaac (?=Asimov)` will match `'Isaac '` only if it’s followed by `'Asimov'`.

'`(?!...)`' **negative lookahead**

Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, `Isaac (?!Asimov)` will match `'Isaac '` only if it’s not followed by `'Asimov'`.

'`(?<=...)`' **lookbehind** (look to the left)

Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion.

'`(?<!...)`' **negative lookbehind**

Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion.
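
---

### Lookahead and Lookbehind (sketch)

A minimal sketch of the four assertions defined above, applied to one string. The sample `text` and the variable names are made up for illustration; only `re.search()` and `re.findall()` from the standard library are used.

```python
import re

# Hypothetical sample text, chosen to reuse the documentation's Isaac/Asimov example
text = 'Isaac Asimov, Isaac Newton'

# (?=...) lookahead: 'Isaac ' matches only where 'Asimov' follows,
# and 'Asimov' itself is not consumed (it is not part of the match)
ahead = re.search(r'Isaac (?=Asimov)', text)
print(ahead.group(), ahead.span())          # 'Isaac ' at (0, 6)

# (?!...) negative lookahead: 'Isaac ' matches only where 'Asimov' does NOT follow
not_ahead = re.search(r'Isaac (?!Asimov)', text)
print(not_ahead.span())                     # (14, 20), the one before 'Newton'

# (?<=...) lookbehind: a word counts only if 'Isaac ' ends right before it
behind = re.findall(r'(?<=Isaac )\w+', text)
print(behind)                               # ['Asimov', 'Newton']

# (?<!...) negative lookbehind: capitalized words NOT preceded by 'Isaac '
not_behind = re.findall(r'\b(?<!Isaac )[A-Z]\w+', text)
print(not_behind)                           # ['Isaac', 'Isaac']
```

None of the assertions adds characters to the match: in the first case the span stops at position 6, right before `'Asimov'`.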
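---

### Lookahead and Lookbehind: An Example

Match every comma-separated number, but exclude numbers such as 40000 that lack the separators.

```python
import re
line = '''
3,000,000
400
4,000
40000
'''
m = re.findall(r'(?<=\s)\d{1,3}(?:,\d{3})*(?=\s)', line)
print(m)
```

```
## ['3,000,000', '400', '4,000']
```

Read [Regular expression operations](https://docs.python.org/3/library/re.html#module-re) carefully<sup>*</sup>

.footnote[
\*: Your teacher means it; otherwise you will miss some of the tricks.
]

---
class: inverse

### Past Lecture Video

<video width=100% controls>
<source src="../../video/chapter03-正则表达式进阶1.mp4">
</video>

---
class: center, middle

### Acknowledgements

[Stack Overflow: Python regular expressions - re.search() vs re.findall()](https://stackoverflow.com/questions/9000960/python-regular-expressions-re-search-vs-re-findall)

[Stack Overflow: Regular expression to match numbers with or without commas and decimals in text](https://stackoverflow.com/questions/5917082/regular-expression-to-match-numbers-with-or-without-commas-and-decimals-in-text)

---

### Applications: Common Uses of Regular Expressions

- Matching IP addresses
- Matching email addresses
- Matching dates
- Times
- Matching mobile phone numbers
- Matching landline phone numbers

[Common Uses of Regular Expressions](../../application/re_common_applications/re_common_applications.html)

---
class: middle

.center[
He loves to read but does not insist on understanding every detail; whenever something clicks, he is so delighted that he forgets to eat.
]

.pull-right[
Tao Qian, Jin dynasty (365-427)
]

--

<br><br>

> No one knows where the gentleman comes from, nor are his family name and courtesy name known; five willow trees stand beside his house, and he took his sobriquet from them. Quiet and of few words, he does not covet glory or profit. He loves to read but does not insist on understanding every detail; whenever something clicks, he is so delighted that he forgets to eat.

.pull-right[
Tao Qian (Jin dynasty), *Biography of Master Five Willows*
]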