- RE Module
- Functions
- Flag Constants
- Regular Expression Objects
- Match Objects
- Pattern Syntax
- Special Characters
- Character Classes
- Set
- Group
- Or
- Wildcard
- Quantifiers
- Start / End
import re
prog = re.compile(pattern)
result = prog.search(string)

- `re.compile(pattern, flags=0)` compiles a pattern into a regular expression object and returns it.
- `re.match(pattern, string, flags=0)` matches the pattern at the beginning of the string, and returns a match object or `None` if it does not match.
- `re.fullmatch(pattern, string, flags=0)` matches the pattern against the whole string, and returns a match object or `None` if it does not match.
- `re.search(pattern, string, flags=0)` scans through the string for the first occurrence of the pattern, and returns a match object or `None` if no position matches.
- `re.findall(pattern, string, flags=0)` returns all non-overlapping matches of the pattern in the string, in the order found, as a list of strings or of tuples (when the pattern includes multiple groups).
- `re.finditer(pattern, string, flags=0)` returns an iterator yielding match objects over all non-overlapping matches, in the order found.
- `re.sub(pattern, repl, string, count=0, flags=0)` returns the string obtained by replacing the non-overlapping occurrences of the pattern in the string with the replacement string or the result of the replacement function. A non-negative `count` sets the maximum number of occurrences to replace; the default 0 replaces all of them.
- `re.split(pattern, string, maxsplit=0, flags=0)` splits the string at the occurrences of the pattern and returns the resulting list. A nonzero `maxsplit` limits the number of splits, and the remainder of the string is returned as the final element of the list.
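The functions listed above can be compared side by side on one input. This is a minimal sketch; the sample string and variable names are made up for illustration.

```python
import re

text = "cat bat rat"  # illustrative sample, not from the original article

# match() only succeeds at the beginning of the string
at_start = re.match(r"cat", text)        # match object for 'cat'
not_at_start = re.match(r"bat", text)    # None

# fullmatch() requires the whole string to match
whole = re.fullmatch(r"cat", text)       # None

# search() finds the first occurrence anywhere in the string
first = re.search(r"[br]at", text)       # matches 'bat' at index 4

# findall() returns strings, or tuples when there are several groups
words = re.findall(r"\w+", text)         # ['cat', 'bat', 'rat']
pairs = re.findall(r"(\w)(at)", text)    # [('c', 'at'), ('b', 'at'), ('r', 'at')]

# finditer() yields match objects
starts = [m.start() for m in re.finditer(r"at", text)]  # [1, 5, 9]

# sub() with count, split() with maxsplit
replaced = re.sub(r"at", "ow", text, count=2)   # 'cow bow rat'
parts = re.split(r"\s", text, maxsplit=1)       # ['cat', 'bat rat']
```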
Flag constants can be passed as the `flags` argument (the last argument) of most of the functions above:
| Syntax | Abbr | Description |
|---|---|---|
| `re.IGNORECASE` | `re.I` | Perform case-insensitive matching |
| `re.MULTILINE` | `re.M` | `^` and `$` match at the start and end of each line. By default, `^` matches only at the start of the string and `$` only at the end of the string (or just before a trailing newline) |
| `re.DOTALL` | `re.S` | `.` matches any character at all. By default, it matches anything except a newline |
| `re.VERBOSE` | `re.X` | Allow multiline patterns with comments: whitespace is ignored (except inside a set or when escaped), as is everything from an unescaped `#` to the end of the line |
Values can be combined using bitwise OR (the | operator) when passed to the function.
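As a rough illustration of combining flags, the sketch below uses two sample strings and variable names that are assumptions made for this example.

```python
import re

text = "first line\nSecond LINE"  # illustrative sample

# IGNORECASE | MULTILINE: case is ignored and ^/$ anchor to every line
both = re.findall(r"^\w+ line$", text, re.IGNORECASE | re.MULTILINE)
# ['first line', 'Second LINE']

# With IGNORECASE alone, ^ and $ only anchor to the whole string
only_i = re.findall(r"^\w+ line$", text, re.I)
# []

# VERBOSE: whitespace and # comments inside the pattern are ignored
number = re.compile(r"""
    \d+   # integer part
    \.    # decimal point
    \d*   # optional fractional digits
""", re.VERBOSE)
found = number.search("pi is 3.14").group()   # '3.14'
```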
Compiled regular expression objects support methods:
- `prog.match(string)` matches the pattern at the beginning of the string, and returns a match object or `None` if it does not match.
- `prog.fullmatch(string)` matches the pattern against the whole string, and returns a match object or `None` if it does not match.
- `prog.search(string)` scans through the string for the first occurrence of the pattern, and returns a match object or `None` if no position matches.
- `prog.findall(string)` returns all non-overlapping matches of the pattern in the string, in the order found, as a list of strings or of tuples (when the pattern includes multiple groups).
- `prog.finditer(string)` returns an iterator yielding match objects over all non-overlapping matches, in the order found.
- `prog.sub(repl, string, count=0)` returns the string obtained by replacing the non-overlapping occurrences of the pattern in the string with the replacement string or the result of the replacement function. A non-negative `count` sets the maximum number of occurrences to replace; the default 0 replaces all of them.
- `prog.split(string, maxsplit=0)` splits the string at the occurrences of the pattern and returns the resulting list. A nonzero `maxsplit` limits the number of splits, and the remainder of the string is returned as the final element of the list.
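A compiled object is handy when the same pattern is applied several times. The following sketch reuses the name `prog` from the snippet at the top; the sample strings are made up.

```python
import re

prog = re.compile(r"\d+")  # compile once, reuse below

numbers = prog.findall("a1 b22 c333")         # ['1', '22', '333']
masked = prog.sub("#", "a1 b22 c333", count=2)  # 'a# b# c333'
pieces = prog.split("a1b22c", maxsplit=1)     # ['a', 'b22c']
no_match = prog.match("a1")                   # None: the string does not start with a digit
first = prog.search("a1").group()             # '1'
```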
Match objects always have a boolean value of True. To access their contents, use methods:
- `result.group(1, 2, ..., N)` returns a single string if there is a single argument (the whole matched string if the argument equals 0 or is omitted), and returns a tuple with one item per argument if there are multiple arguments.
- `result.groups()` returns a tuple containing all the subgroups of the match.
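A short sketch of reading a match object follows. The date pattern and sample text are illustrative assumptions; because a failed search returns `None`, results are usually checked before calling methods on them.

```python
import re

result = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released on 2015-04-14.")

whole = result.group()            # '2015-04-14' (same as result.group(0))
year = result.group(1)            # '2015'
year_month = result.group(1, 2)   # ('2015', '04')
all_parts = result.groups()       # ('2015', '04', '14')

# A failed search returns None, which is falsy
miss = re.search(r"\d{4}", "no digits here")
status = "found" if miss else "not found"   # 'not found'
```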
Patterns are usually written in raw string notation (an `r` prefix on the string literal) so that Python does not process backslashes as escape sequences before the regular expression engine sees them.
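The difference is easiest to see with `\b`, which is a backspace character in an ordinary string literal but a word boundary for the regex engine. A minimal sketch, with a made-up sample string:

```python
import re

plain_len = len("\b")    # 1: Python turned \b into a single backspace character
raw_len = len(r"\b")     # 2: a backslash followed by the letter b

text = "cat concat cat"  # illustrative sample

# Raw string: the regex engine receives \b and treats it as a word boundary
raw_hits = re.findall(r"\bcat\b", text)    # ['cat', 'cat']

# Plain string: the pattern contains literal backspace characters, so nothing matches
plain_hits = re.findall("\bcat\b", text)   # []
```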
| Syntax | Description |
|---|---|
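| `\n` | Newline / Linefeed |
| `\r` | Carriage Return |
| `\t` | Horizontal Tab |
| `\v` | Vertical Tab |
| `\f` | Formfeed |
| `\0` | Null |
| `\a` | Bell |
| `\b` | Backspace character (only inside sets) |
| `\c` | Control character (found in some other regex flavors; Python's `re` module does not support it and rejects the pattern) |
| `\\` | Backslash |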
| Syntax | Description | ASCII Set Equivalence |
|---|---|---|
| `\d` | Matches decimal digits | `[0-9]` |
| `\D` | Matches any character which is not a decimal digit | `[^0-9]` |
| `\w` | Matches word characters | `[a-zA-Z0-9_]` |
| `\W` | Matches any character which is not a word character | `[^a-zA-Z0-9_]` |
| `\s` | Matches whitespace characters | `[ \t\n\r\f\v]` |
| `\S` | Matches any character which is not a whitespace character | `[^ \t\n\r\f\v]` |
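The classes in the table can be tried out as follows. This is a sketch with made-up sample text; the last two lines use `re.ASCII`, a flag not listed in the flag table above, to show that for `str` patterns the set equivalences hold exactly only in ASCII mode.

```python
import re

text = "Room 42, floor 3"  # illustrative sample

digits = re.findall(r"\d+", text)      # ['42', '3']
words = re.findall(r"\w+", text)       # ['Room', '42', 'floor', '3']
gaps = re.findall(r"\W+", text)        # [' ', ', ', ' ']
tokens = re.split(r"\s+", "a \t b\nc")  # ['a', 'b', 'c']

# By default \d also matches non-ASCII decimal digits such as U+0663
unicode_digit = re.findall(r"\d", "\u0663")           # ['\u0663']
ascii_only = re.findall(r"\d", "\u0663", re.ASCII)    # []
```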
`[]` is used to indicate a set of characters.
Inside sets:
- Characters can be listed individually, or ranges can be indicated by giving two characters separated by a `-`;
- If the first character of the set is `^`, the set is complemented (it matches any character not listed);
- To match a literal `]`, place it as the first character or precede it with a backslash;
- Other symbols lose their special meanings and do not need to be escaped, but special characters and character classes are still accepted.
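Each of the set rules above can be checked with a one-liner. The inputs below are illustrative assumptions, not examples from the original article.

```python
import re

# Individual characters and ranges
vowels = re.findall(r"[aeiou]", "regular")               # ['e', 'u', 'a']
runs = re.findall(r"[a-c1-3]+", "abc123xyz789cab")       # ['abc123', 'cab']

# Complemented set: anything except vowels and spaces
rest = re.findall(r"[^aeiou ]+", "regular expression")
# ['r', 'g', 'l', 'r', 'xpr', 'ss', 'n']

# A literal ] placed first, or escaped with a backslash
first_pos = re.findall(r"[]x]", "a]x")     # [']', 'x']
escaped = re.findall(r"[\]x]", "a]x")      # [']', 'x']

# Symbols such as . * + are literal inside a set
symbols = re.findall(r"[.*+]", "a.b*c+d")  # ['.', '*', '+']

# Character classes still work inside a set
version = re.findall(r"[\d.]+", "v3.14 beta")  # ['3.14']
```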
() is used to create and capture a group of characters.
| is used to create a regular expression that matches either the expression on its left or the expression on its right.
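Groups and alternation are often used together, because `|` applies to everything on either side of it unless a group limits its scope. A small sketch with made-up inputs:

```python
import re

# Two capturing groups: the name and the domain
m = re.search(r"(\w+)@(\w+)\.com", "mail bob@example.com now")
captured = m.groups()    # ('bob', 'example')

# The group limits the alternation to a single letter
scoped = [x.group() for x in re.finditer(r"gr(a|e)y", "gray grey")]
# ['gray', 'grey']

# With one group in the pattern, findall returns the group, not the whole match
letters = re.findall(r"gr(a|e)y", "gray grey")   # ['a', 'e']

# Without the group, the alternatives are 'gra' and 'ey'
unscoped = re.findall(r"gra|ey", "gray grey")    # ['gra', 'ey']
```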
. is used to match any single character except a newline. To match a newline as well, pass the re.DOTALL flag.
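The effect of the wildcard on newlines can be seen in this short sketch (the sample string is an assumption for illustration):

```python
import re

text = "cat cot c\nt"  # the last candidate has a newline in the middle

default = re.findall(r"c.t", text)               # ['cat', 'cot']
dotall = re.findall(r"c.t", text, re.DOTALL)     # ['cat', 'cot', 'c\nt']
```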
`{}` is used to specify how many times the preceding character, or the preceding group in round brackets, must be repeated.
- A single number indicates an exact number of occurrences;
- Two numbers separated by a `,` indicate a range for the number of occurrences;
- An omitted left bound means a lower bound of zero, and an omitted right bound means no upper bound.
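The three forms of the curly-bracket quantifier can be sketched like this; the sample strings and variable names are made up for the example.

```python
import re

text = "1 22 333 4444"  # illustrative sample

exactly_3 = re.findall(r"\d{3}", text)       # ['333', '444']
two_to_3 = re.findall(r"\d{2,3}", text)      # ['22', '333', '444']
at_least_2 = re.findall(r"\d{2,}", text)     # ['22', '333', '4444']

# Omitted left bound: zero to two occurrences
empty_ok = re.fullmatch(r"a{,2}", "") is not None      # True
too_many = re.fullmatch(r"a{,2}", "aaa") is not None   # False

# The quantifier applies to a whole group in round brackets
grouped = re.fullmatch(r"(ab){2}", "abab") is not None  # True
```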
There are also shorthand quantifiers for the most common cases:
| Syntax | Description | Equivalent Curly Bracket Denotation |
|---|---|---|
| `?` | zero or one occurrence | `{,1}` |
| `*` | zero or more occurrences | `{,}` |
| `+` | one or more occurrences | `{1,}` |
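A quick sketch of the three shorthand quantifiers, using inputs invented for this example:

```python
import re

optional = re.findall(r"colou?r", "color colour")   # ['color', 'colour']
star = re.findall(r"ab*", "a ab abbb")              # ['a', 'ab', 'abbb']
plus = re.findall(r"ab+", "a ab abbb")              # ['ab', 'abbb']
```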
By default, all quantifiers are greedy (they try to match as much text as possible).
Appending `?` to a quantifier makes it non-greedy, so that it matches as little text as possible.
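The contrast between greedy and non-greedy matching shows up clearly on text with repeated delimiters. A minimal sketch with an illustrative sample string:

```python
import re

html = "<b>bold</b> and <i>italic</i>"  # illustrative sample

# Greedy: .+ runs to the last > in the string
greedy = re.findall(r"<.+>", html)
# ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .+? stops at the first > it can
lazy = re.findall(r"<.+?>", html)
# ['<b>', '</b>', '<i>', '</i>']

# The same applies to curly-bracket quantifiers
most = re.search(r"\d{2,4}", "12345").group()     # '1234'
least = re.search(r"\d{2,4}?", "12345").group()   # '12'
```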
| Syntax | Description |
|---|---|
| `^` | Matches the start of the string, and also the start of each line when re.MULTILINE is set |
| `$` | Matches the end of the string or just before a trailing newline, and also the end of each line when re.MULTILINE is set |
| `\A` | Matches only at the start of the string |
| `\Z` | Matches only at the end of the string |
| `\b` | Matches the empty string, but only at the beginning or end of a word (works as a word boundary) |
| `\B` | Matches the empty string, but only when it is not at the beginning or end of a word |
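The anchors differ mainly in how they treat line breaks and word edges. The sketch below uses made-up inputs to compare them.

```python
import re

text = "one\ntwo\n"  # two lines, with a trailing newline

start_default = re.findall(r"^\w+", text)          # ['one']
start_multi = re.findall(r"^\w+", text, re.M)      # ['one', 'two']

# $ also matches just before a trailing newline; \Z only at the very end
dollar = re.findall(r"\w+$", text)     # ['two']
strict = re.findall(r"\w+\Z", text)    # []

# \b requires a word boundary, \B requires its absence
bounded = re.findall(r"\bcat\b", "cat concat cats")   # ['cat']
inside = re.findall(r"\Bcat", "cat concat")           # ['cat'] (the one inside 'concat')
```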
Phone and email
import re
phoneRegex = re.compile(r'\+?(\d{1,3})(-| )([- ()\d]+)')
emailRegex = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}')
# Example string content sourced from the National Science Foundation website
phoneFound = phoneRegex.search('Call the Help Desk at 1-800-381-1532')
print('Country code: ' + phoneFound.group(1))
print('Local number: ' + phoneFound.group(3))

String strip
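import re

def regex_strip(string, chars=None):
    """Regex-based equivalent of str.strip()."""
    if chars:
        # Escape chars so that ], ^, - and \ are taken literally inside the set
        char_set = '[' + re.escape(chars) + ']'
        pattern = r'\A' + char_set + r'*(.*?)' + char_set + r'*\Z'
    else:
        pattern = r'\A\s*(.*?)\s*\Z'
    # DOTALL lets . match newlines inside the string
    return re.fullmatch(pattern, string, re.DOTALL).group(1)

Sweigart, A. (2015). Automate the Boring Stuff With Python: Practical Programming for Total Beginners. San Francisco, CA: No Starch Press.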