## Lecture 28 - 03/24

### Regular Expressions (regex)

• In some data, we often want to find occurrences of patterns
• For example in a string representation of amino-acid chain of a protein, we may be looking for a particular sequence of amino-acids
• Big application is parsing variable user input into a standard form for the rest of the program to use (dealing with whitespace/lack thereof, etc)
• Regular Expressions, commonly known as regex, are a powerful tool for this pattern matching
• Regex supported by all major programming languages
• A regex expression is a string that encodes the pattern to be found in a very short hand notation
• Some regex evaluating function will take in the sequence of data and the regex expression, and find/return any matching patterns in the regex expression
• Example:
• Say we have an amino-acid sequence String seq = "GPCGGWCAASCGGPYACGGWAGYHAGWHWAH"
• Let's say we want to find the protein sequence given by the regex expression
String regex = "C.{2,4}C.{3}[LIVMFYWCX].{8}H.{3,5}H"
• This pattern is for an occurence of C, then two to four of any character, then C, then three of any character, than one character of the set LIVMFYWCX, ...
• Let findMatch be our programming language's regex evaluator to find a regex pattern in some data
• Then findMatch(regex, seq) returns "CAASCGGPYACGGWAGYHAGWH"
• Ex 2: regex for a social security number: [0-9]{3}-[0-9]{2}-[0-9]{4}
• Basic regex operators:
• Regex AB* finds occurences of A followed by zero or more copies of B (A, AB, ABB, ABBB)
• Regex (AB)* finds occurences of zero or more copies of AB (ABABABAB, ABAB, AB, null)
• Note: order of operations of regex operators matters!
• Expanded regex operators (these are convenient combinations of the above four):
• More regex examples:
• Regex .*SPB.* finds (RASPBERRY, CRISPBREAD) but not (SUBSPACE, SUBSPECIES)
• Regex [0-9]{3}-[0-9]{2}-[0-9]{4} finds (231-41-5121, 573-57-1821) but not (231415121, 57-3571821)
• Regex [a-z]+@([a-z]+\.)+(edu|com) finds (horse@pizza.com) but not (frank_99@yahoo.com, hug@cs)
• Regex [a-z][a-zA-Z0-9]* finds (x, snake3, percolationGrid) but not (percolation_grid, Planet3)
• Regex for a string containing both lowercase letter and number: (.*[0-9].*[a-z].*)|(.*[a-z].*[0-9].*)
• Note, in third example above, \ is an escape character, allowing us to search for the occurence of .
• Regex are notoriously hard to read/debug, joked as a "write only language"
• Regex very poor at problems involving:
• counting (patterns a and b occur exact same number of times)
• complex structure (palindromes - string same forwards and backwards)
• parsing
• More regex operatprs (look up online for even more):

#### Regex in Java:

• Escape characters: any normal regex expression involving \ becomes \\ (to do with how Java processes Strings)
• String has a .matches(String regex) boolean method that returns whether a string matches the given regex
• Find substring occurrences with Pattern and Matcher, along with .find() and .group(i):

String seq = "jim,cs61b-abe,30117827,xi,cs61b-bqd,15039872";
String regexString = "(cs61b-[a-z]{3}),([0-9]+)";
Pattern pat = Pattern.compile(regexString); // Compile the regex into a Pattern object

Matcher mat = pat.matcher(seq);             // Create a matcher object from Pattern and data
while (mat.find()) {                        // .find() finds the next match in data
// group(i) returns ith group of elements in match
// grouping index decided by parenthesis group from regex elements are part of
// group(0) returns the entire match
System.out.println("Full match: " + mat.group(0)); // "Full match: cs61b-abe,30117827"
", sid: " + mat.group(2)); // "login: cs61b-abe, sid: 30117827"
}

• So how are regex expressions matched under the hood?
• Represent as a graph traversal problem
• We create a special regex graph called an NFA graph (Nondeterministic Finite Automation), with special types of edge connections (epsilon edges and match edges) for different regex operators
• We then run something called NFA traversal to match some data to the NFA graph
• NFA is a major concept in CS theory
• A simpler concept than but related to Turing machines
• This is deeply related to why regex is poor at certain pattern recognition
• Below is an example NFA graph for the regex (A*B|AC)D:
• Regex graph construction: given a regex, construct the graph:
• Start by creating a graph of N vertices
• Vertex 0$0$ is (
• Vertices 1$1$ to N3$N-3$ are regex characters
• Vertex N2$N-2$ is )
• Vertex N1$N-1$ is the goal state
• Add match arrows for every non-special character to the next state
• Add one epsilon arrow for every ( and ) to their next state
• Add two epsilon arrows for every | state: (he | has a corresponding ( and ) state)
• One from the corresponding ( to the state after |
• One from the | to the corresponding )
• Add three epsilon arrows for every *:
• One to state following *
• One to state preceding *
• One from stat preceding * to * (the last two thus create a cycle)
• Regex graph matching: determine if a given string s matches the regex graph:
• Start by setting an int i to 0, and add 0 to some set activeSet
• For example, we have s = "ABCD"; i = 0; activeSet = {0};
• Find all states reachable by epsilon edges/paths from the active set. Add these to the active set
• Eliminate all states that don't match s.charAt(i), then increment i
• Replace every item of the active set by its successor via match edge
• Repeat until i is equal to the string length
• Now find the epsilon reachable states from our active set; if this includes the final/goal state, we have a match
• For a better understanding, see this demo