whu-textual-analysis/lectures/programming/introductions/Introduction_Regular_Expressions.py

# -*- coding: utf-8 -*-
"""
INTRODUCTION TO REGULAR EXPRESSIONS

@author: Alexander Hillert, Goethe University Frankfurt
This version: June 3, 2019

What are regular expressions?

Regular expressions allow you to search for general patterns in texts. The
standard string commands like .count("search_term") and .replace("old_word","new_word")
can only count and replace one specific word, respectively. They cannot search
for general patterns like all words that consist of three or more letters.
Assume that you want to identify all numbers in a text or that you search for
the year of birth in bios of corporate executives. In both examples, you need a
search tool that can process broad patterns --> you need regular expressions.
Consider the second example, i.e. you would like to automatically identify
people's year of birth from their bios. You know that the number must have four
digits and that the first two digits must equal 19. Of course, you could
hardcode all possible years (1900, 1901, ..., 1999), but this is unnecessarily
complicated and slows down the program. Therefore, it is better to learn
how to use regex.

Useful online resources:
1. https://regex101.com/
On this webpage, you can enter a text and a regular expression.
The webpage highlights the matches and provides explanations for
every part of the regex pattern.
Caution: click on "Python" in the left menu (the default language is php)!

2. https://docs.python.org/3/library/re.html
The official documentation of regular expressions in Python 3.

"""

# To be able to use regular expressions you need to import the re package first.
import re

# Select the directory where you saved the accompanying txt-file.
directory="C:/Lehre/Textual Analysis/Programming/Files/"


# In this introduction, we use the accompanying txt-file "Text_Introduction_Regular_Expressions.txt".
# Open the file and read its content. The with-statement closes the file
# automatically once the content has been read.
with open(directory+'Text_Introduction_Regular_Expressions.txt','r',encoding='cp1252') as text_file:
    text=text_file.read()

# Let's start with the example from the beginning and search for people's years of birth.
# The standard search command for regular expressions is re.search. It searches
# for the FIRST match of the expression in the text.
# First try
match=re.search("19[0-9]{2}",text)
# This command searches for four digits of which the first is a 1, the second a 9,
# and then there are two further digits which can be any digits.
# [0-9] refers to any digit. Equivalently, you can write \d, which also refers
# to any digit.
# The {2} specifies that there must be exactly two digits.

print(match)
# match contains information on the match:
# span is the position in text where the match starts and ends; here 226 and 230
# furthermore, the matched text is shown. Here, the first match is 1956.
# You can use the positions to print the text before the match, the text after
# the match, and the matched text itself.
start=match.start()
end=match.end()
print("From beginning of the document to the match: \n"+text[:start]+"\n\n")
print("The match itself: \n"+text[start:end]+"\n\n")
print("From end of match to end of document: \n"+text[end:]+"\n\n")

# To access the match, you can also use the command .group(0):
print("Alternative way to access the matched text: \n"+match.group(0)+"\n\n")
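# A minimal, self-contained sketch of the same steps on a made-up sentence.
# demo_bio is a hypothetical example string and not part of the txt-file.
# The pattern is written as a raw string (r"...") so that the backslash in \d
# reaches the regex engine unchanged.
import re

demo_bio="Jane Doe (born 1956) joined the firm in 1987."
demo_match=re.search(r"19\d{2}",demo_bio)
# .span() returns the start and end position of the match as a tuple
print(demo_match.span())
# .group(0) and slicing with the positions return the same text
print(demo_match.group(0))
print(demo_bio[demo_match.start():demo_match.end()])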

# CAUTION
# If no match is found, re.search returns None instead of a match object.
# Example: search for a ten-digit number that starts with 19
match=re.search("19[0-9]{8}",text)
# The command start=match.start() then raises the following error:
# "AttributeError: 'NoneType' object has no attribute 'start'"
# SOLUTION
match=re.search("19[0-9]{8}",text)
if match:
    # match found; .start() is only called if a match exists
    start=match.start()
    print("Match found. Starting at position "+str(start))
else:
    # no match found
    print("No match found")
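# The check for a missing match can also be wrapped in a small helper function.
# This is only a sketch: first_match_or_none is a hypothetical name chosen for
# this illustration and not a function of the re package.
import re

def first_match_or_none(pattern,string):
    """Return the text of the first match, or None if there is no match."""
    found=re.search(pattern,string)
    if found:
        return found.group(0)
    return None

# A four-digit year is found, a ten-digit number is not.
print(first_match_or_none(r"19[0-9]{2}","She was born in 1956."))
print(first_match_or_none(r"19[0-9]{8}","She was born in 1956."))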

r'''
Information on Syntax, Special Characters in Regular Expression

Character       Meaning
[]              Indicates a set of characters
\[              Matches the actual [
\]              Matches the actual ]
^               Negation when placed first inside a set []; the symbols listed
                afterwards are not allowed in the match.
                E.g., [^0-9] will not match any digits but all other symbols.
\d              Any digit, i.e. 0, 1, 2, ..., 9. Equivalent to [0-9]
\n              Linefeed/newline, the start of a new line.
\s              Any whitespace, i.e. a tab, a space.
                CAUTION: \s matches also the newline (\n). This property of \s
                can lead to unintended matches.
                RECOMMENDATION: to match whitespaces only use [ \t], i.e. a space
                and a tab (\t).
\S              Any non-whitespace symbol.
.               Any character (digit, letter, symbol [!,?,%,etc.], spaces) but
                NOT the newline, \n.
\.              Matches the actual dot.
\w              Matches word characters, i.e. [0-9a-zA-Z_]
                The underscore (_) is defined to be a word character.
                (For str patterns in Python 3, \w also matches non-ASCII letters
                such as ä or é unless the re.ASCII flag is set.)
\W              Matches any non-word characters, i.e. [^0-9a-zA-Z_]
|               Or condition (for an example see Example 2 below)
()              Like in math: parentheses indicate which characters of an expression
                belong together. (For an example see Example 2 below.)
\(              Matches the actual (
\)              Matches the actual )

(?i)            Performs the regex case-insensitive. Must be put at the beginning
                of the regex. E.g. re.search("(?i)TeSt",text) will match
                TEST, test, Test, etc.
re.IGNORECASE   Performs the regex case-insensitive. It is passed as an additional
                argument after the text, e.g. re.search("test",text,re.IGNORECASE)
'''
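# A short sketch of some of the special characters listed above. demo_line is
# a made-up string that contains a space, a tab (\t), and a newline (\n).
import re

demo_line="Net_income 2018:\t42\nTotal assets"
# \s matches the newline as well, [ \t] matches only spaces and tabs.
demo_whitespace=re.findall(r"\s",demo_line)
demo_space_tab=re.findall(r"[ \t]",demo_line)
print(demo_whitespace)
print(demo_space_tab)
# \w treats the underscore as a word character, so Net_income stays one word.
demo_words=re.findall(r"\w{1,}",demo_line)
print(demo_words)
# Both notations for case-insensitive matching return the same matches.
demo_inline=re.findall(r"(?i)total",demo_line)
demo_flag=re.findall(r"total",demo_line,re.IGNORECASE)
print(demo_inline==demo_flag)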
# Examples of character sets
# 1. [0-9]: numbers
match=re.search("[0-9]","ABC abc 123")
print(match)
#2. [a-z]: any lower case letter
match=re.search("[a-z]","ABC abc 123")
print(match)
#3. [A-Z]: any upper case letter
match=re.search("[A-Z]","ABC abc 123")
print(match)
#4. [cde]: lower case letters c, d, and e.
match=re.search("[cde]","ABC abc 123")
print(match)
#5. [^A-Zab]: all symbols except capital letters and a and b.
match=re.search("[^A-Zab]","ABC abc 123")
print(match)
# You don't see any character because the match is the first whitespace, i.e. the space before abc.


'''
Quantifiers for regular expression:
n and m refer to non-negative integers (0, 1, 2, ...), where m>n
Quantifier      Meaning
{n}             The preceding pattern must be found EXACTLY n times.
{n,}            The preceding pattern must be found AT LEAST n times.
{,n}            The preceding pattern must be found AT MOST n times.
{n,m}           The preceding pattern must be found AT LEAST n but AT MOST m times.
{n,}?           The ? tells the regex not to be "greedy" (see the greedy vs. non-greedy example below)

There are alternative notations for commonly used quantifiers:
* is equivalent to {0,}, i.e. 0 or more repetitions of the preceding pattern.
+ is equivalent to {1,}, i.e. 1 or more repetitions of the preceding pattern.
? is equivalent to {0,1}, i.e. 0 or 1 repetition of the preceding pattern.
'''
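# A small sketch of how the quantifiers behave, using made-up tokens.
# re.fullmatch only succeeds if the ENTIRE string fits the pattern, which makes
# the effect of each quantifier easy to see.
import re

demo_tokens=["7","42","123","2019"]
demo_results={}
for quantifier in ["{3}","{2,}","{,2}","{2,3}"]:
    demo_pattern="[0-9]"+quantifier
    # keep only the tokens whose number of digits is allowed by the quantifier
    demo_results[quantifier]=[token for token in demo_tokens if re.fullmatch(demo_pattern,token)]
    print(quantifier+" matches: "+str(demo_results[quantifier]))
# + is shorthand for {1,}: both notations accept every token in the list.
demo_same=all(re.fullmatch("[0-9]+",token) and re.fullmatch("[0-9]{1,}",token) for token in demo_tokens)
print(demo_same)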

# re.search() returns only the first match: How to get all matches?
# Alternative 1: use a loop.
text1=text
i=1
match=re.search("19[0-9]{2}",text1)
# Repeat the following commands until no more matches are found.
while match:
    print("This is match number "+str(i)+": "+match.group(0))
    # Check whether there are further matches after the end of the previous match
    end=match.end()
    text1=text1[end:]
    match=re.search("19[0-9]{2}",text1)
    i=i+1

# Alternative 2: use re.findall
# The syntax is identical to re.search
list_of_matches=re.findall("19[0-9]{2}",text)
print(list_of_matches)
# the individual matches can be called by list_of_matches[i], where i ranges
# from zero to the number of matches minus one.
# Remember: the first element of a list has the position 0
for i in range(0,len(list_of_matches)):
    print("This is match number "+str(i+1)+" using the re.findall command: "+list_of_matches[i])
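# A third option, sketched here on a made-up sentence, is re.finditer. It
# returns one match object per match, so you get the positions (like with
# re.search) without having to cut the text manually in a loop.
import re

demo_text="Founded in 1975, listed in 1986, restructured in 1999."
demo_years=[]
demo_spans=[]
for number,found in enumerate(re.finditer(r"19[0-9]{2}",demo_text),start=1):
    demo_years.append(found.group(0))
    demo_spans.append(found.span())
    print("This is match number "+str(number)+": "+found.group(0)+" at positions "+str(found.span()))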


# When you read the text you will observe that there are only six years of birth
# in the text and not eight -> there are two mismatches -> adjust filter to
# get only the years of birth and not all years.
text1=text
i=1
# Check whether the word born appears before the year. The distance between
# born and the year must be at most 15 characters (plus the two white spaces).
match=re.search("born .{,15} 19[0-9]{2}",text1)
while match:
    print("This is match number "+str(i)+": "+match.group(0))
    # Extract the year
    match1=re.search("19[0-9]{2}",match.group(0))
    print("The year of match number "+str(i)+" is: "+match1.group(0))
    # Check whether there are further matches after the end of the previous match
    end=match.end()
    text1=text1[end:]
    match=re.search("born .{,15} 19[0-9]{2}",text1)
    i=i+1
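# Instead of running a second search on the matched text, the year can be put
# in parentheses and read with .group(1), which returns the text matched inside
# the first pair of parentheses. A sketch on a made-up sentence:
import re

demo_bio="Jane Doe (born March 24, 1956) joined the firm in 1987."
demo_match=re.search(r"born .{,15} (19[0-9]{2})",demo_bio)
if demo_match:
    # group(0) is the full match, group(1) only the year
    print("Full match: "+demo_match.group(0))
    print("Year only: "+demo_match.group(1))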


# The quantifiers introduced above are "greedy". For example, if a pattern matches overlapping
# text parts of different length, the regex will return the longest match.
# Example: search for the first sentence in a text. You know that sentences
# end with a period in this example.
text2="This is the first sentence. This is the second sentence. And so on"
# Search for a positive number of occurrences of characters followed by a period.
# Remember that the dot is \. in regex. The . will match any character.
# The patterns are written as raw strings (r"...") so that Python passes the
# backslash to the regex engine unchanged.
match=re.search(r".{1,}\.",text2)
print(match.group(0))
# -> the regex returns the first and second sentence.
# To get the shortest text that fulfils the regex, put a ? after the quantifier.
# This makes the quantifier "non-greedy": the match stops at the first period.
match=re.search(r".{1,}?\.",text2)
print(match.group(0))
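# The same greedy vs. non-greedy behavior, sketched on a made-up sentence with
# quotation marks: the greedy version runs to the LAST quotation mark, the
# non-greedy version stops at the next one.
import re

demo_quote='He said "buy" and she said "sell".'
demo_greedy=re.search(r'".{1,}"',demo_quote).group(0)
demo_non_greedy=re.search(r'".{1,}?"',demo_quote).group(0)
print("Greedy: "+demo_greedy)
print("Non-greedy: "+demo_non_greedy)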

# You will often have situations where there are multiple versions of the same
# pattern. How can you include all of them in one regular expression?
# Example 1: search for the word "losses" in the following sentence:
text3="X Corp's soda division returned significant losses in the last quarter. Losses will be reduced this quarter."
# the first letter of "loss" can be upper or lower case
print("Example 1: Loss and loss")
text4=text3
i=1
# A set of characters [] is matched if at least one of the components of the
# set is found in the text. This works only for a single letter/number/symbol
# but not for sequences of multiple letters/numbers/symbols.
match=re.search("[Ll]oss",text3)
while match:
    end=match.end()
    print("This is match number "+str(i)+": "+match.group(0))
    # Check whether there are further matches after the end of the previous match
    text4=text4[end:]
    match=re.search("[Ll]oss",text4)
    i=i+1

# Alternatively
list_of_matches=re.findall("[Ll]oss",text3)
print("Alternative using re.findall: "+str(list_of_matches))

# In this example, you could also simply perform a case-insensitive match.
print("Case-INsensitive matching using re.IGNORECASE")
text4=text3
i=1
match=re.search("loss",text3,re.IGNORECASE)
while match:
    end=match.end()
    print("This is match number "+str(i)+": "+match.group(0))
    # Check whether there are further matches after the end of the previous match
    text4=text4[end:]
    match=re.search("loss",text4,re.IGNORECASE)
    i=i+1
# Or equivalently
print("Case-INsensitive matching using (?i)")
text4=text3
i=1
match=re.search("(?i)loss",text3)
while match:
    end=match.end()
    print("This is match number "+str(i)+": "+match.group(0))
    # Check whether there are further matches after the end of the previous match
    text4=text4[end:]
    match=re.search("(?i)loss",text4)
    i=i+1


# Example 2: search for the expressions "profits declined" and "profits decreased"
# in the following sentence:
text3="X Corp's profits declined in 2010, while Y Inc.'s profits decreased the year before."
# Here, [] no longer works because we need to match terms consisting of several
# characters and [] matches only one character. -> use the OR-operator |
print("Example 2: profits declined and profits decreased - First try")
text4=text3
i=1
match=re.search("profits declined|decreased",text3)
while match:
    print("This is match number "+str(i)+": "+match.group(0))
    # Check whether there are further matches after the end of the previous match
    end=match.end()
    text4=text4[end:]
    match=re.search("profits declined|decreased",text4)
    i=i+1
# Problem: regex interprets the entire set of characters before the | as one
# alternative, i.e. it searches for "profits declined" or for "decreased" alone.
# Solution: use parentheses to define the boundaries.

print("Example 2: profits declined and profits decreased - Second try")
text4=text3
i=1
match=re.search("profits (declined|decreased)",text3)
while match:
    print("This is match number "+str(i)+": "+match.group(0))
    # Check whether there are further matches after the end of the previous match
    end=match.end()
    text4=text4[end:]
    match=re.search("profits (declined|decreased)",text4)
    i=i+1

# Alternative: does re.findall work?
list_of_matches=re.findall("profits (declined|decreased)",text3)
print(list_of_matches)
# -> Not as intended! If the pattern contains parentheses (), re.findall returns
# only the text matched inside the parentheses, whereas match.group(0) of
# re.search returns the full match.
# One way to get the full matches from re.findall is to write down the full
# text before and after the | and to drop the parentheses.
list_of_matches=re.findall("profits declined|profits decreased",text3)
print(list_of_matches)
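# Another way, sketched here, is to keep the parentheses but start them with
# "?:". This makes them a non-capturing group, so re.findall returns the full
# match again. The sentence is repeated to keep the sketch self-contained.
import re

demo_sentence="X Corp's profits declined in 2010, while Y Inc.'s profits decreased the year before."
# capturing group: only the text inside the parentheses is returned
demo_capturing=re.findall(r"profits (declined|decreased)",demo_sentence)
# non-capturing group: the full match is returned
demo_non_capturing=re.findall(r"profits (?:declined|decreased)",demo_sentence)
print(demo_capturing)
print(demo_non_capturing)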


# More information on the difference between re.search and re.findall
# Example 3: let's search for the numbers in the second part of the txt file
# and compare what the two commands do.
# Get the second part
match=re.search("Here are some numbers:",text)
text4=text[match.end():]
print(text4)
match=re.search(r"[0-9]{1,}([0-9]{3}|,){0,}\.{0,1}[0-9]{0,}",text4)
# What are the individual parts of this pattern?
# [0-9]{1,} There has to be at least one digit.
# ([0-9]{3}|,){0,} The first digit can be followed by combinations of three
# digits and commas (as thousand separator).
# \.{0,1} There can be zero or one period as decimal separator.
# [0-9]{0,} There can be multiple decimal places.

i=1
while match:
    print("This is match number "+str(i)+": "+match.group(0))
    # Check whether there are further matches after the end of the previous match
    end=match.end()
    text4=text4[end:]
    match=re.search(r"[0-9]{1,}([0-9]{3}|,){0,}\.{0,1}[0-9]{0,}",text4)
    i=i+1

# Can we obtain the same result by using re.findall?
match=re.search("Here are some numbers:",text)
text4=text[match.end():]
list_of_matches=re.findall(r"[0-9]{1,}([0-9]{3}|,){0,}\.{0,1}[0-9]{0,}",text4)
print(list_of_matches)
# Does not work! re.findall returns only the content of the parentheses.
# One has to put "?:" at the beginning of the parentheses that capture the
# repetition of the thousands. This turns them into a non-capturing group and
# tells re.findall to return the full match and not subpatterns.
list_of_matches=re.findall(r"[0-9]{1,}(?:[0-9]{3}|,){0,}\.{0,1}[0-9]{0,}",text4)
print(list_of_matches)

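# TAKE AWAY: The output of re.findall does not always match that of re.search.
# If the pattern contains parentheses, re.findall returns the content of the
# parentheses and not the full match. Be careful when using re.findall!!!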
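# If you want to keep the parentheses unchanged, re.finditer is one possible
# alternative: like re.search it returns match objects, so .group(0) is always
# the full match. A sketch on a made-up string:
import re

demo_numbers="Revenue was 1,234.56, costs were 12 and assets 7,000 in total."
demo_pattern=r"[0-9]{1,}(,[0-9]{3}){0,}(\.[0-9]{1,}){0,1}"
# re.findall returns one tuple per match with the content of the two groups
demo_groups=re.findall(demo_pattern,demo_numbers)
print(demo_groups)
# re.finditer gives access to the full match of every number
demo_full=[found.group(0) for found in re.finditer(demo_pattern,demo_numbers)]
print(demo_full)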


# How to delete or substitute parts of texts?
# Alternative 1: identify the beginning and end of the matched text part and
# remove it from the overall text.
# Example delete all numbers in the text
text4=text
print("Original Text:\n"+text4)
match=re.search(r"[0-9]{1,}(,[0-9]{3}){0,}(\.[0-9]{1,}){0,1}",text4)
while match:
    # Remove the match
    text4=text4[:match.start()]+text4[match.end():]
    # Check whether there are further matches in the remaining text
    match=re.search(r"[0-9]{1,}(,[0-9]{3}){0,}(\.[0-9]{1,}){0,1}",text4)
print("Text without numbers using re.search:\n"+text4)

# Alternative 2: use re.sub (sub -> substitute)
# syntax: new_text=re.sub(pattern, replacement, old_text)
# replacement is some string. It is inserted as text and is not interpreted as
# a regular expression; only backslash escapes such as \1 (the text matched by
# the first pair of parentheses) have a special meaning in it.
text4=text
text4=re.sub(r"[0-9]{1,}(,[0-9]{3}){0,}(\.[0-9]{1,}){0,1}","",text4)

print("Text without numbers using re.sub:\n"+text4)
# re.sub is the more efficient way.
# Furthermore, re.sub can not only delete text but also replace text.
# Example
text4=text
text4=re.sub(r"[0-9]{1,}(,[0-9]{3}){0,}(\.[0-9]{1,}){0,1}","NUMBER",text4)
print("Text where numbers are replaced by the word 'NUMBER':\n"+text4)
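# Two further features of re.sub, sketched on a made-up string:
# 1. the replacement can reuse the content of parentheses via \1, \2, ...
# 2. the optional argument count limits the number of replacements.
import re

demo_text="born 1956, CEO since 2000"
# keep the first two digits of every year (group 1) and mask the rest
demo_masked=re.sub(r"(19|20)[0-9]{2}",r"\1XX",demo_text)
print(demo_masked)
# replace only the first year
demo_first_only=re.sub(r"(19|20)[0-9]{2}",r"\1XX",demo_text,count=1)
print(demo_first_only)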


# Make sure you get the right match --> importance of word boundaries.
# When you search for a word it can happen that the word is part of a different
# longer word. For example, searching for "high" would also match "highlight".
# To avoid such mismatches you can either include word boundaries in the search
# (Alternative 1) or split the text first by word boundaries into single words
# and perform standard string search operations afterwards (Alternative 2).
# Alternative 2 does not return the individual matches but tells you for example
# the number of matches
# Example: search for the word "is"
# Alternative 1:
match=re.search("is",text)
print("Searching without word boundaries yields: '"+match.group(0)+\
"' But the surrounding text is: '"+text[match.start()-1:match.end()+1]+"'")
match=re.search(r"\Wis\W",text)
print("Searching with word boundaries yields: '"+match.group(0)+\
"' and the surrounding text is: '"+text[match.start()-1:match.end()+1]+"'")
# You see that the preceding and subsequent word boundaries are also matched
# and saved as the matched term. However, often you want the match to include only
# the actual word without its boundaries.
# Solution: use so called "look ahead" and "look back" conditions.

r'''
Look ahead and look behind/back conditions

Regex requires that the parts of the pattern that are classified as look ahead
or look back/behind are present in the text but does not include them in the match.

Syntax:
positive look ahead:    (?=)      Example: X(?=\W) requires that there is a word
                                           boundary after X
negative look ahead:    (?!)      Example: X(?!\W) requires that there must NOT
                                           be a word boundary after X.
positive look back:    (?<=)      Example: (?<=\W)X requires that there is a word
                                           boundary before X
negative look back:    (?<!)      Example: (?<!\W)X requires that there must NOT
                                           be a word boundary before X.
'''
match=re.search(r"(?<=\W)is(?=\W)",text)
print("Searching with word boundaries as look ahead and look back condition yields: '" #
      +match.group(0)+"' and the surrounding text is: '"+text[match.start()-1:match.end()+1]+"'")

# Does it also work with re.findall?
list_of_matches=re.findall(r"\Wis\W",text)
print("Word boundaries using re.findall: "+str(list_of_matches))
list_of_matches=re.findall(r"(?<=\W)is(?=\W)",text)
print("Word boundaries as look ahead and look back condition using re.findall: "+str(list_of_matches))
print("In total there are "+str(len(list_of_matches))+" matches.")
# --> Yes, the approach also works with re.findall.
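# Note that \W requires an actual non-word character next to the word, so a
# word at the very beginning or end of a text is missed. Python's re package
# also offers \b, a word boundary that matches the position between a word
# character and a non-word character (or the start/end of the text) without
# consuming a character. A sketch on a made-up sentence:
import re

demo_sentence="is this his thesis? It is."
demo_look_around=re.findall(r"(?<=\W)is(?=\W)",demo_sentence)
demo_boundary=re.findall(r"\bis\b",demo_sentence)
# the look-around version misses the "is" at the beginning of the sentence
print("Look ahead/look back: "+str(len(demo_look_around))+" match(es)")
print("Using \\b: "+str(len(demo_boundary))+" match(es)")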

# Alternative 2:
# Use re.split(), which is similar to split() but more powerful.
text_split=re.split(r"\W",text)
print(text_split)
# Problem: there are elements in the list that are not words, e.g. ''. These
# elements are created because there can be a series of non-word characters (\W),
# e.g. ' (' in 'Balmer (born'.
# Solution: treat a series of word boundaries \W as a single split character
text_split=re.split(r"\W{1,}",text)
print(text_split)
# Now, you do not need to include word boundaries and can use standard string
# operations.
number_matches=text_split.count("is")
print("Using standard string operations, we get "+str(number_matches)+" matches.")
# -> same result.
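# Two practical details of the split approach, sketched on a made-up string:
# if the text starts or ends with a non-word character, re.split returns empty
# strings at the edges of the list, and .count() is case-sensitive.
import re

demo_text="(Note: this is it. Is it?)"
demo_split=re.split(r"\W{1,}",demo_text)
print(demo_split)
# drop the empty strings
demo_words=[word for word in demo_split if word]
print(demo_words)
# case-sensitive count vs. count after converting all words to lower case
demo_count=demo_words.count("is")
demo_count_lower=[word.lower() for word in demo_words].count("is")
print("Case-sensitive: "+str(demo_count)+", case-insensitive: "+str(demo_count_lower))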