Python Regular Expressions

Introduction to REGEX

REGEX is short for regular expressions, which are patterns of characters used to search within a string. In Python, regular expression matching is provided by the built-in re module. The concept applies to simple words, phone numbers, email addresses, or any number of other patterns. For example, if you search for the letter “f” in the sentence “For the love of all that is good, finish the job,” the goal is to find every occurrence of the character “f” in the sentence. This is the most basic application of regular expressions. You can also look for only alphabetic characters in strings that mix letters, numbers, and special characters: in a string that reads “a2435?#@s560” you could choose to match only the letters. You could also search text specifically for phone numbers (###-###-####). The format of a phone number is a very specific pattern of digits and hyphens rather than a single character, and we’ll discuss the general syntax for such patterns next.
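The three searches described above can be sketched with Python's re module. This is a minimal illustration rather than the only way to write them: the set [a-zA-Z] and the shorthand \d (any digit) are introduced here ahead of the syntax discussion, and the text containing phone numbers is made up for the example.

```python
import re

sentence = "For the love of all that is good, finish the job"

# Every lowercase "f" in the sentence; the capital "F" in "For" is a different character.
f_matches = re.findall("f", sentence)
print(f_matches)  # ['f', 'f']

# Only the letters in a string that mixes letters, digits, and special characters.
letters = re.findall("[a-zA-Z]", "a2435?#@s560")
print(letters)  # ['a', 's']

# Phone numbers in the ###-###-#### format (the sample text is hypothetical).
text = "Call 555-123-4567 or 555-987-6543."
phones = re.findall(r"\d{3}-\d{3}-\d{4}", text)
print(phones)  # ['555-123-4567', '555-987-6543']
```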

First, note that regex is case-sensitive by default: the letter “a” and the letter “A” are treated as separate characters. Also, digits are matched one character at a time, since no single character represents anything beyond 0 through 9; a number such as 560 is matched as the three characters “5”, “6”, and “0”. Let’s go through some of the important meta-characters used to write the patterns we need. Patterns are written as ordinary strings, enclosed in quotation marks (“”). So if you’re looking for occurrences of the letter “e”, you can write exactly “e”. If you’re looking for a phrase, a part of a word, or a whole word such as “was”, you can write exactly “was”. In both cases, writing the pattern is no different from entering a regular string.
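As a quick sketch of case sensitivity and literal patterns, the snippet below searches a made-up sentence. The re.IGNORECASE flag is shown only as one way to turn case sensitivity off; it is not covered in the text above.

```python
import re

text = "A cat was asleep"  # hypothetical sample sentence

# By default "a" and "A" are different characters.
lower_a = re.findall("a", text)
upper_a = re.findall("A", text)
print(lower_a)  # ['a', 'a', 'a']
print(upper_a)  # ['A']

# The re.IGNORECASE flag makes the same pattern match both cases.
any_a = re.findall("a", text, re.IGNORECASE)
print(any_a)  # ['A', 'a', 'a', 'a']

# A literal pattern is written like any other string.
e_matches = re.findall("e", text)
found_was = re.search("was", text) is not None
print(e_matches)  # ['e', 'e']
print(found_was)  # True
```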

Using meta-characters to create patterns

Now let’s get into something special: the period (.) represents any single character other than a newline character. Let’s say the pattern you’re looking for is “h.s”: this means any one character, whether a letter, a number, or a special character, can sit between the “h” and the “s”. Next, we have two characters that reference the specific position of a pattern.

  • The caret (^) looks for a pattern that starts the string or text. So if you had the sentence “This looks like a tree” and you look for the pattern “^This”, it will successfully match since “This” is at the beginning. The caret must be the first character of the pattern.
  • On the opposite end of the spectrum, we have the dollar sign ($), which indicates the pattern must be at the end. So taking the previous example, if the pattern is “tree$”, you will get a successful match since the word “tree” ends the string. The dollar sign must always conclude the pattern.
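The period, the caret, and the dollar sign can be tried out as follows. This is a small sketch using re.findall and re.search; the string of “h…s” samples is invented for the illustration.

```python
import re

# The period stands for exactly one character (letter, digit, or symbol).
dot_hits = re.findall("h.s", "his has h9s h#s hs")
print(dot_hits)  # ['his', 'has', 'h9s', 'h#s']

# By default the period does not match a newline character.
newline_hit = re.search("h.s", "h\ns")
print(newline_hit)  # None

sentence = "This looks like a tree"

# The caret anchors the pattern to the start of the string.
starts_with_this = re.search("^This", sentence) is not None
starts_with_tree = re.search("^tree", sentence) is not None

# The dollar sign anchors the pattern to the end of the string.
ends_with_tree = re.search("tree$", sentence) is not None
ends_with_this = re.search("This$", sentence) is not None

print(starts_with_this, starts_with_tree)  # True False
print(ends_with_tree, ends_with_this)      # True False
```

Note that “hs” is not matched by “h.s”, because the period requires one character to be present between the “h” and the “s”.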

The next few meta-characters control how many times part of a pattern may occur in a string, or let you choose between alternatives.

  • The asterisk (*) checks for zero or more occurrences of the character or group that precedes it. This means the match succeeds whether or not that character actually occurs. For example, if we had the pattern “abc*”, then as long as we have a string containing “ab” it will pass. The “c” can occur any number of times or not at all, and the string will still meet the requirements. So the strings “ab”, “abc”, and “abccc” all match the pattern.
  • The plus sign (+) checks for one or more occurrences of the preceding character or group. As long as it is matched at least once, a successful match has been made. No occurrence means that the match was unsuccessful. You can also use braces ({}) and enter between them the specific number of occurrences you are looking for. All of these meta-characters follow the part of the regex they apply to.
  • The vertical bar (|), much like in programming languages, represents “or”. If you had the sentence “I’m departing from Miami at six o’clock” and the regex is “go|departing”, the match would be successful because even though “go” isn’t present, “departing” is.
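The quantifiers and the vertical bar can be compared side by side. The sketch below uses re.fullmatch, which succeeds only when the entire string fits the pattern, so that each quantifier's effect on the trailing “c” is visible; the helper list names are chosen just for this example.

```python
import re

# "abc*": "ab" followed by zero or more "c" characters.
star_results = [bool(re.fullmatch("abc*", s)) for s in ["ab", "abc", "abccc", "a"]]
print(star_results)  # [True, True, True, False]

# "abc+": "ab" followed by at least one "c".
plus_results = [bool(re.fullmatch("abc+", s)) for s in ["ab", "abc", "abccc"]]
print(plus_results)  # [False, True, True]

# "abc{2}": "ab" followed by exactly two "c" characters.
brace_results = [bool(re.fullmatch("abc{2}", s)) for s in ["abc", "abcc", "abccc"]]
print(brace_results)  # [False, True, False]

# "go|departing": either alternative is enough for a match.
sentence = "I'm departing from Miami at six o'clock"
alternative = re.search("go|departing", sentence)
print(alternative.group())  # departing
```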

Sets in REGEX

Next, we’ll discuss sets created by brackets ([]). A set expands the possibilities for making patterns, and represents exactly one character. For example, if you have the pattern “abc”, then that means you’re literally looking for “abc”. However, when the pattern is “[abc]”, you’re looking for occurrences of “a”, “b”, or “c”. Similarly, “0123” means you are literally looking for “0123”. If you have “[0123]”, then you’re looking for occurrences of 0, 1, 2, or 3.
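The difference between a literal pattern and a set can be seen in a short sketch like the one below. The sample strings are invented for the illustration.

```python
import re

# A literal pattern matches the whole sequence "abc".
literal_hits = re.findall("abc", "a1b2c3 abc")
print(literal_hits)  # ['abc']

# A set matches one character at a time: any single "a", "b", or "c".
set_hits = re.findall("[abc]", "a1b2c3d4")
print(set_hits)  # ['a', 'b', 'c']

date = "2024-03-15"  # hypothetical sample string

# "[0123]" picks out each individual 0, 1, 2, or 3.
digit_hits = re.findall("[0123]", date)
print(digit_hits)  # ['2', '0', '2', '0', '3', '1']

# "0123" would require those four digits in a row, which this string lacks.
literal_digits = re.search("0123", date)
print(literal_digits)  # None
```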
