class: center, middle, inverse, title-slide

# Regular Expressions

## 正则表达式

### 吴燕丰

### School of Finance, Jiangxi University of Finance and Economics

### 2020/09/24, updated 2022/03/19

---

### Python RegEx

A .hlight[RegEx], or .hlight[Regular Expression], is a sequence of characters that forms a search pattern.

A .hlight[RegEx] can be used to check whether a string contains the specified search pattern.

### 🌰 Search the string to see if it starts with "顺丰" (SF Express) and ends with "年度报告" (annual report):

```python
import re

txt = "顺丰公司2021年年度报告"
x = re.search("^(顺丰).*(年度报告)$", txt)
print(x.group())
```

.footnote[Note: this slide and several that follow are adapted from [Python RegEx (W3Schools tutorial)](https://www.w3schools.com/python/python_regex.asp)]

---

### RegEx Functions

The ![:color white,#008080](re) module offers a set of functions that allow us to search a string for a match:

|Function|Description|
|:--------:|-----------|
|![:color white,#008080](findall)|Returns a list containing all matches|
|![:color white,#008080](search) |Returns a Match object if there is a match anywhere in the string|
|![:color white,#008080](split) |Returns a list where the string has been split at each match|
|![:color white,#008080](sub) |Replaces one or many matches with a string|

---

### The findall() Function

The ![:color white,#008080](findall()) function returns a list containing all matches.

### 🌰 Print a list of all matches:

```python
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)
```

---

### The search() Function

The ![:color white,#008080](search()) function searches the string for a match, and returns a Match object if there is a match.
If there is more than one match, only the first occurrence is returned:

### 🌰 Search for the first white-space character in the string:

```python
import re

txt = "The rain in Spain"
x = re.search(r"\s", txt)
print("The first white-space character is located in position:", x.start())
```

---

### The split() Function

The ![:color white,#008080](split()) function returns a list where the string has been split at each match:

### 🌰 Split at each white-space character:

```python
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)
```

---

### The split() Function (continued)

You can control the number of splits by specifying the *maxsplit* parameter:

### 🌰 Split the string only at the first occurrence:

```python
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)
```

---

### The sub() Function

The ![:color white,#008080](sub()) function replaces the matches with the text of your choice:

### 🌰 Replace every white-space character with the number 9:

```python
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)
```

---

### The sub() Function (continued)

You can control the number of replacements by specifying the *count* parameter:

### 🌰 Replace the first 2 occurrences:

```python
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)
```

---

### Match Object

A ![:color white,#008080](Match Object) is an object containing information about the search and the result.

Note: If there is no match, the value `None` is returned instead of a Match Object.

### 🌰 Do a search that will return a Match Object:

```python
import re

txt = "The rain in Spain"
x = re.search("ai", txt)
print(x)  # this prints a Match object
```

---

### Match Object (continued)

The ![:color white,#008080](Match Object) has properties and methods used to retrieve information about the search and the result:

- ![:color white,#008080](.span()) returns a tuple containing the start and end positions of the match.
- ![:color white,#008080](.string) returns the string passed into the function.
- ![:color white,#008080](.group()) returns the part of the string where there was a match.

--

### 🌰 Print the position (start and end) of the first match. The regular expression looks for any word that starts with an upper-case "S":

```python
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())
```

---

### 🌰 Print the string passed into the function:

```python
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.string)
```

--

### 🌰 Print the part of the string where there was a match. The regular expression looks for any word that starts with an upper-case "S":

```python
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group())
```

---

### Metacharacters

![](./images/metacharacters.png)

---

### Special Sequences

![](./images/specialsequences.png)

---

### Sets

![](./images/sets.png)

---

class: center,middle

### Greedy or Non-Greedy?

--

### A bit tricky!

---

### Example

```python
import re

line = "Cats are smarter than dogs"
# .* matches zero or more characters, each being any character except a newline (\n)
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M|re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
```

```
## matchObj.group() :  Cats are smarter than dogs
## matchObj.group(1) :  Cats
## matchObj.group(2) :  smarter
```

---

### Example (continued)

Why does `'(.*?)'` match nothing in the following code snippet?
```python
import re

line = "Cats are smarter than dogs"
# .* matches zero or more characters, each being any character except a newline (\n)
matchObj = re.match(r'(.*) are (.*?)', line, re.M|re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
```

```
## matchObj.group() :  Cats are 
## matchObj.group(1) :  Cats
## matchObj.group(2) :  
```

The answer: non-greedy fashion.

---

### Non-greedy fashion

`'.*'` can match

- '' (0 repetitions, i.e., nothing at all)
- 's' (1 repetition)
- 'sm' (2 repetitions)
- ...
- 'smarter' (7 repetitions)
- ...
- 'smarter than dogs' (17 repetitions)

The `?` in `'.*?'` makes `'.*'` match as few repetitions as possible (non-greedy fashion), which here means 0 repetitions.

---

### Non-greedy fashion (continued)

Why does `'(.*?) '` match `'smarter '`?

```python
import re

line = "Cats are smarter than dogs"
# .* matches zero or more characters, each being any character except a newline (\n)
matchObj = re.match(r'(.*) are (.*?) ', line, re.M|re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
```

```
## matchObj.group() :  Cats are smarter 
## matchObj.group(1) :  Cats
## matchObj.group(2) :  smarter
```

What happens if `line` takes the following value instead?

```python
line = "Cats are  smarter than dogs"  # two spaces between "are" and "smarter"
```

---

### Non-greedy fashion (continued)

**There are two spaces between "are" and "smarter".**

Now `(.*?)` captures `''` (0 repetitions): the trailing space in `'(.*?) '` matches the second space right away, so the group stays empty.

```python
import re

line = "Cats are  smarter than dogs"  # two spaces between "are" and "smarter"
matchObj = re.match(r'(.*) are (.*?) ', line, re.M|re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
```

```
## matchObj.group() :  Cats are  
## matchObj.group(1) :  Cats
## matchObj.group(2) :  
```

---

### Greedy fashion (continued)

`'(.*) '`: greedy fashion

```python
import re

line = "Cats are smarter than dogs"
# .* matches zero or more characters, each being any character except a newline (\n)
matchObj = re.match(r'(.*) are (.*) ', line, re.M|re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
```

```
## matchObj.group() :  Cats are smarter than 
## matchObj.group(1) :  Cats
## matchObj.group(2) :  smarter than
```

`'(.*) '` can match either of the following candidates; being greedy, it selects the longer one, 'smarter than ':

- 'smarter ' (ends at the first space)
- 'smarter than ' (ends at the second space) <span style='color:red;font-size:16pt;'>✓</span>

---

### Summary: non-greedy `'(.*?)'` versus `'(.*?) '`

.pull-left[

- `'(.*?)'` is the non-greedy form of ![:color white,#008080]('(.*)'), whose candidates are:
  + `''` (<span style='color:red;font-size:16pt;'>✓</span>)
  + `'s'`
  + `'sm'`
  + `'...'`
  + `'smarter than dogs'`

So it matches `''`.

]

.pull-right[

- `'(.*?) '` is the non-greedy form of ![:color white,#008080]('(.*) '), whose candidates are:
  + `'smarter '` (<span style='color:red;font-size:16pt;'>✓</span>)
  + `'smarter than '`

<br><br> <br><br>

So it matches `'smarter '`.

]

.footnote[ **Note**: ![:color white,#008080]('(.*) ') ends with a space that ![:color white,#008080]('(.*)') does not have. ]

---

### Another Example: greedy or non-greedy

What does non-greedy mean?

```python
import re

re.findall(r'<.*?>', '<a> b <c>')
```

.pull-left[

### Wrong!

It matches nothing, because `'<.*?>'` only matches `<>`.

]

.pull-right[

### Right!
The pattern can match strings such as:

- `<>`
- `<a>`
- `<b>`
- `<ab>`
- `<a...>` as long as `'...'` doesn't contain `'>'`

]

---

### Continued

.pull-left[

Non-greedy fashion: three matches:

+ `'<>'` <span style='color:red;font-size:16pt;'>✓</span>
+ `'<a>'` <span style='color:red;font-size:16pt;'>✓</span>
+ `'<c>'` <span style='color:red;font-size:16pt;'>✓</span>

```python
import re

re.findall(r'<.*?>', '<> <a> b <c>')
```

```
## ['<>', '<a>', '<c>']
```

]

.pull-right[

Greedy fashion: one match:

+ `'<>'` <span style='color:red;font-size:16pt;'>✗</span>
+ `'<a>'` <span style='color:red;font-size:16pt;'>✗</span>, `'<c>'` <span style='color:red;font-size:16pt;'>✗</span>
+ `'<> <a> b <c>'` <span style='color:red;font-size:16pt;'>✓</span>

```python
import re

re.findall(r'<.*>', '<> <a> b <c>')
```

```
## ['<> <a> b <c>']
```

]

---

### Regular Expression Tutorials

[Regular Expression HOWTO:](https://docs.python.org/3/howto/regex.html)

.pull-left[

- [Introduction](https://docs.python.org/zh-cn/3/howto/regex.html#introduction)
- [Simple Patterns](https://docs.python.org/zh-cn/3/howto/regex.html#simple-patterns)
  + Matching Characters
  + Repeating Things
- [Using Regular Expressions](https://docs.python.org/zh-cn/3/howto/regex.html#using-regular-expressions)
  + Compiling Regular Expressions
  + The Backslash Plague
  + Performing Matches
  + Module-Level Functions
  + Compilation Flags

]

.pull-right[

- [More Pattern Power](https://docs.python.org/zh-cn/3/howto/regex.html#more-pattern-power)
  + More Metacharacters
  + Grouping
  + Non-capturing and Named Groups
  + Lookahead Assertions
- [Modifying Strings](https://docs.python.org/zh-cn/3/howto/regex.html#modifying-strings)
  + Splitting Strings
  + Search and Replace
- [Common Problems](https://docs.python.org/zh-cn/3/howto/regex.html#common-problems)

]

--

[Python RegEx](https://www.w3schools.com/python/python_regex.asp)

[https://www.runoob.com/python3/python3-reg-expressions.html](https://www.runoob.com/python3/python3-reg-expressions.html)

[https://www.liujiangblog.com/course/python/74](https://www.liujiangblog.com/course/python/74)

---

class: inverse

### Past Lecture Video

<video width=800 height=450 controls>
<source src="../../video/chapter03-正则表达式.mp4">
</video>

---

### Application: Annual Report Format Standards for Listed Companies

In recent years, the China Securities Regulatory Commission (CSRC) has revised the annual report format standards four times:

- [2021 revision](../../application/annual_report_regulation/公开发行证券的公司信息披露内容与格式准则第2号——年度报告的内容与格式(2021年修订).pdf)
- [2017 revision](../../application/annual_report_regulation/公开发行证券的公司信息披露内容与格式准则第2号——年度报告的内容与格式(2017年修订).pdf)
- [2016 revision](../../application/annual_report_regulation/公开发行证券的公司信息披露内容与格式准则第2号——年度报告的内容与格式(2016年修订).pdf)
- [2012 revision](../../application/annual_report_regulation/公开发行证券的公司信息披露内容与格式准则第2号——年度报告的内容与格式(2012年修订).pdf)

How can Python help us interpret the differences between revisions?

See [Annual Report Format Standards for Listed Companies](../../application/annual_report_regulation/arr.html)
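
---

### Application: a sketch for comparing revisions

One possible starting point for the question on the previous slide, offered as a minimal sketch and not necessarily the approach taken in the linked page. It assumes the PDF text has already been extracted into plain strings (extraction is not shown) and that each article begins with a heading of the form "第…条" (Article N). The two sample strings are made up for illustration and are not real regulation text; the helper names `split_articles` and `compare_revisions` are hypothetical.

```python
import difflib
import re

# Made-up sample text standing in for two revisions (NOT the real regulation).
OLD_TEXT = "第一条 为规范年度报告的编制。第二条 公司应当披露年度报告。第三条 报告应当真实。"
NEW_TEXT = "第一条 为规范年度报告的编制。第二条 公司应当按时披露年度报告。第三条 报告应当真实。第四条 报告应当简明。"

# Assumption: every article starts with a heading such as "第一条".
ARTICLE_HEADING = re.compile(r"(第[一二三四五六七八九十百零]+条)")


def split_articles(text):
    """Return {heading: body} for every article found in text."""
    # With a capturing group, re.split keeps the headings in the result:
    # [preamble, heading1, body1, heading2, body2, ...]
    parts = ARTICLE_HEADING.split(text)
    return {h: b.strip() for h, b in zip(parts[1::2], parts[2::2])}


def compare_revisions(old_text, new_text):
    """Classify article headings as added, removed, or changed."""
    old, new = split_articles(old_text), split_articles(new_text)
    added = [h for h in new if h not in old]
    removed = [h for h in old if h not in new]
    changed = [h for h in old if h in new and old[h] != new[h]]
    return added, removed, changed


added, removed, changed = compare_revisions(OLD_TEXT, NEW_TEXT)
print("added:", added)
print("removed:", removed)
print("changed:", changed)

# For changed articles, report how similar the two versions are (1.0 = identical).
old_articles, new_articles = split_articles(OLD_TEXT), split_articles(NEW_TEXT)
for heading in changed:
    matcher = difflib.SequenceMatcher(None, old_articles[heading], new_articles[heading])
    print(heading, "similarity:", round(matcher.ratio(), 2))
```

Two limitations of this sketch: a cross-reference such as "第二条" inside an article body would also be treated as a heading, and articles are paired by heading, so renumbering between revisions shows up as changes.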