Latest revision as of 10:25, 1 September 2020

A Finite State Automaton (FSA) is a mathematical model of computation comprising all 4 of the following: 1) a finite number of states, of which exactly one is active at any given time; 2) transition rules to change the active state; 3) an initial state; and 4) one or more final states. We can draw an FSA by representing each state as a circle, the final state as a double circle, the start state as the only state with an incoming arrow, and the transition rules as labeled-edges connecting the states. When labels are assigned to states, they appear inside the circle representing the state.

In this category, FSAs will be limited to parsing strings. That is, determining if a string is valid or not.

Basics

Here is a drawing of an FSA that is used to parse strings consisting of x's and y's:

In the above FSA, there are three states: A, B, and C. The initial state is A; the final state is C. The only way to go from state A to B is by seeing the letter x. Once in state B, there are two transition rules: seeing the letter y will cause the FSA to make C the active state, and seeing an x will keep B as the active state. State C is a final state so if the string being parsed is completed and the FSA is in State C, the input string is said to be accepted by the FSA. In State C, seeing any additional letter y will keep the machine in state C. The FSA above will accept strings composed of one or more x’s followed by one or more y’s (e.g., xy, xxy, xxxyy, xyyy, xxyyyy).

A Regular Expression (RE) is an algebraic representation of an FSA. For example, the regular expression corresponding to the first FSA given above is xx*yy*.

The rules for forming a Regular Expression (RE) are as follows:

1. The null string (λ) is a RE.

2. If the string a is in the input alphabet, then it is a RE.

3. if a and b are both REs, then so are the strings built up using the following rules:

a. CONCATENATION. "ab" (a followed by b).

b. UNION. "aUb" or "a|b" (a or b).

c. CLOSURE. "a*" (a repeated zero or more times). This is known as the Kleene Star.

The order of precedence for Regular Expression operators is: Kleene Star, concatenation, and then union. Similar to standard Algebra, parentheses can be used to group sub-expressions. For example, "dca*b" generates strings dcb, dcab, dcaab, and so on, whereas "d(ca)*b" generates strings db, dcab, dcacab, dcacacab, and so on.

If we have a Regular Expression, then we can mechanically build an FSA to accept the strings which are generated by the Regular Expression. Conversely, if we have an FSA, we can mechanically develop a Regular Expression which will describe the strings which can be parsed by the FSA. For a given FSA or Regular Expression, there are many others which are equivalent to it. A "most simplified" Regular Expression or FSA is not always well defined.

Regular Expression Identities

1. (a) = a*
2. aa* = a*a
3. aa* U λ = a*
4. a(b U c) = ab U ac
5. a(ba)* = (ab)*a
6. (a U b)* = (a* U b)
7. (a U b)* = (ab)*
8. (a U b)* = a(ba)*

RegEx in Practice

Programmers use Regular Expressions (usually referred to as regex) extensively for expressing patterns to search for. All modern programming languages have regular expression libraries.

Unfortunately, the specific syntax rules vary depending on the specific implementation, programming language, or library in use. Interactive websites for testing regexes are a useful resource for learning regexes by experimentation. An excellent online tool is https://regex101.com/. A very nice exposition is Pattern Matching with Regular Expressions from the Automate the Boring Stuff book and online course.

Here are the additional syntax rules that we will use. They are pretty universal across all regex packages.

Pattern	Description
\|	As described above, a vertical bar separates alternatives. For example, gray\|grey can match "gray" or "grey".
*	As described above, the asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.
?	The question mark indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour".
+	The plus sign indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".
.	The wildcard . matches any character. For example, a.b matches any string that contains an "a", then any other character, and then a "b" such as "a7b", "a&b", or "arb", but not "abbb". Therefore, a.*b matches any string that contains an "a" and a "b" with 0 or more characters in between. This includes "ab", "acb", or "a123456789b".
[ ]	A bracket expression matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z].
[^ ]	Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". Likewise, literal characters and ranges can be mixed.
( )	As described above, parentheses define a sub-expression. For example, the pattern H(ä\|ae?)ndel matches "Handel", "Händel", and "Haendel".

Sample Problems

Typical problems in the category will include: translate an FSA to a Regular Expression; simplify a Regular Expression; determine which Regular Expressions or FSAs are equivalent; and determine which strings are accepted by either an FSA or a Regular Expression.

Problem 1

Find a simplified Regular Expression for the following FSA:

Solution:

The expression 01*01 is read directly from the FSA. It is in its most simplified form.

Problem 2

Which of the following strings are accepted by the following Regular Expression "00*1*1U11*0*0" ?

A. 0000001111111

B. 1010101010

C. 1111111

D. 0110

E. 10

Solution:

This Regular Expression parses strings described by the union of 00*1*1 and 11*0*0. The RE 00*1*1 matches strings starting with one or more 0s followed by one or more 1s: 01, 001, 0001111, and so on. The RE 11*0*0 matches strings with one or more 1s followed by one or more 0s: 10, 1110, 1111100, and so on. In other words, strings of the form: 0s followed by some 1s; or 1s followed by some 0s. Choice A and E following this pattern.

Problem 3

Which of the following strings match the regular expression pattern "[A-D]*[a-d]*[0-9]" ?

1. ABCD8

2. abcd5

3. ABcd9

4. AbCd7

5. X

6. abCD7

7. DCCBBBaaaa5

Solution:

The pattern describes strings the start with zero or more uppercase letters A, B, C, or D (in any order), followed by zero or more lowercase letter a, b, c, or d (in any order), followed by a single digit. The strings that are represented by this pattern are 1, 2, 3, and 7.

Problem 4

Which of the following strings match the regular expression pattern "Hi?g+h+[^a-ceiou]" ?

1. Highb

2. HiiighS

3. HigghhhC

4. Hih

5. Hghe

6. Highd

7. HgggggghX

Solution:

The ? indicates 0 or 1 "i"s. The + indicates 1 or more "g"s followed by 1 or more "h"s. The ^ indicates that the last character cannot be lower-case a, b, c, e, i, o, or u. The strings that are represented by this pattern are 3, 6, and 7.

Video Resources

Nice two-part video showing the relationship between FSAs and REs.

	1 - Convert Regular Expression to Finite-State Automaton (Barry Brown)
	2 - Convert Regular Expression to Finite-State Automaton (Barry Brown) This video uses the symbol "+" to mean "1 or more matches of the previous term". For example, "ab+" is the same as "abb". In terms of the Kleene Star, zz = z*z = z+.

ACSL Test Prep - Finite State Automaton & Regular Expressions Explained (Mrs. Gupta)

A talked-over presentation discussing the finite state automatons and regular expressions as needed for the American Computer Science League and its tests.

Difference between revisions of "FSAs and Regular Expressions"

Latest revision as of 10:25, 1 September 2020

Contents

Basics

Regular Expression Identities

RegEx in Practice

Sample Problems

Problem 1

Problem 2

Problem 3

Problem 4

Video Resources

Navigation menu

@@ Line 1: / Line 1: @@
-A Finite State Automation (FSA) has four components: an input alphabet (those letters or strings which are legal inputs); a set of transition rules to advance from state to state (given a current state and an element of the input alphabet, what is the next state); a unique start state; and one or more final states.  We can draw the FSA, as shown below, by representing each state as a circular node; the final state as a double circle; the start state as the only node with an incoming arrow; and the transition rules by the strings on the edges connecting the nodes.  When labels are assigned to states, they appear inside the circle representing the state.
+A Finite State Automaton (FSA) is a mathematical model of computation comprising all 4 of the following: 1) a finite number of ''states'', of which exactly one is ''active'' at any given time; 2) ''transition rules'' to change the active state; 3) an ''initial state''; and 4) one or more ''final states''. We can draw an FSA by representing each state as a circle, the final state as a double circle, the start state as the only state with an incoming arrow, and the transition rules as labeled-edges connecting the states. When labels are assigned to states, they appear inside the circle representing the state.
-[[File:fsa_1.png]]
+In this category, FSAs will be limited to parsing strings. That is, determining if a string is valid or not.
-If there is a path from the start state to a final state by which an input string can be parsed, then the input string is said to be “accepted” by the FSA.  The FSA above will accept strings composed of one or more x’s followed by one or more y’s (e.g., xy, xxy, xxxyy, xyyy).
+= Basics =
-= Definitions =
+Here is a drawing of an FSA that is used to parse strings consisting of x's and y's:
-Just like [[Boolean Algebra]] is a convenient algebraic representation of [[Digital Electronics]] Circuits, a regular expression is an algebraic representation of an FSA.  For example, the regular expression corresponding to the first FSA given above is xx*yy*.  The regular expression for the second FSA is extremely complex!  The following simple FSAs and REs illustrate  the correspondence:
+<center>
+[[File:fsa.svg|250px]]
+</center>
-{| class="wikitable" style="text-align: center"
+In the above FSA, there are three states: A, B, and C. The initial state is A; the final state is C. The only way to go from state A to B is by ''seeing'' the letter x. Once in state B, there are two transition rules:  seeing the letter y will cause the FSA to make C the active state, and seeing an x will keep B as the active state. State C is a final state so if the string being parsed is completed and the FSA is in State C, the input string is said to be ''accepted'' by the FSA. In State C, seeing any additional letter y will keep the machine in state C. The FSA above will accept strings composed of one or more x’s followed by one or more y’s (e.g., xy, xxy, xxxyy, xyyy, xxyyyy).
-|[[File:fsas_2-3.png|512px]]
-|-
+A Regular Expression (RE) is an algebraic representation of an FSA. For example, the regular expression corresponding to the first FSA given above is xx*yy*.
-|ab*c ___________________ (aUb)c or acUbc
-|}
-The rules for forming a regular expression (RE) are as follows:
+The rules for forming a Regular Expression (RE) are as follows:
 :1. The null string (λ) is a RE.
 :2. If the string a is in the input alphabet, then it is a RE.
-:3. if the strings a and b are both REs, then so are the strings built up using the following rules:
+:3. if a and b are both REs, then so are the strings built up using the following rules:
-::a. CONCATENATION.  “ab” (a followed by b).
+::a. CONCATENATION.  "ab" (a followed by b).
-::b. UNION. “aUb” (a or b).
+::b. UNION. "aUb" or "a|b" (a or b).
-::c. CLOSURE. “a*” (a repeated zero or more times).
+::c. CLOSURE. "a*" (a repeated zero or more times). This is known as the Kleene Star.
-If we have a regular expression, then we can mechanically build an FSA to accept the strings which are generated by the regular expression.  Conversely, if we have an FSA, we can mechanically develop a regular expression which will describe the strings which can be parsed by the FSA.  For a given FSA or regular expression, there are many others which are equivalent to it.  A “most simplified” regular expression or FSA is not always well defined.
+The order of precedence for Regular Expression operators is: Kleene Star, concatenation, and then union.
+Similar to standard Algebra, parentheses can be used to group sub-expressions.
+For example, "dca*b" generates strings dcb, dcab, dcaab, and so on, whereas
+"d(ca)*b" generates strings db, dcab, dcacab, dcacacab, and so on.
-Identities for regular expressions appear below.  The order of precedence for regular expression operators is: Kleene Star, concatenation, and then union.  The Kleene Star binds from right to left; concatenation and union both bind from left to right.
+If we have a Regular Expression, then we can mechanically build an FSA to accept the strings which are generated by the Regular Expression.  Conversely, if we have an FSA, we can mechanically develop a Regular Expression which will describe the strings which can be parsed by the FSA.  For a given FSA or Regular Expression, there are many others which are equivalent to it. A "most simplified" Regular Expression or FSA is not always well defined.
-Most programming languages today have a package to handle Regular Expressions in order to do pattern matching algorithms more easily.  Visual BASIC has the LIKE operator that is very easy to use.  We will use these “basic” symbols in our category description.
+= Regular Expression Identities =
-{| class="wikitable" style="text-align: left"|
+{| Class="wikitable"
+|-
+|1.  (a*)*  = a*
+|-
+|2.  aa*    = a*a
 |-
-!Pattern
+|3.  aa* U λ  = a*
-!Description
 |-
-|?
+|4.  a(b U c) = ab U ac
-|Any single character
 |-
-|*
+|5.  a(ba)* = (ab)*a
-|Zero or more of the preceding character
 |-
-|#
+|6.  (a U b)* = (a* U b*)*
-|Any single digit
 |-
-|[ char-char,char,…]
+|7.  (a U b)* = (a*b*)*
-|Any inclusive range or list of characters
 |-
-|{![charlist]}
+|8.  (a U b)* = a*(ba*)*
-|Any character not in a given range or list
 |}
-For example, the following strings either match or don’t match the pattern “[A-M][o-z][!a,e,I,o,u]?#.*”:
+= RegEx in Practice =
+Programmers use Regular Expressions (usually referred to as '''regex''') extensively for
+expressing patterns to search for. All modern programming languages have regular expression libraries.
+Unfortunately, the specific syntax rules vary depending on the specific
+implementation, programming language, or library in use.
+Interactive websites for testing regexes are a useful resource for
+learning regexes by experimentation.
+An excellent online tool is [https://regex101.com/ https://regex101.com/].
+A very nice exposition is [https://automatetheboringstuff.com/2e/chapter7/ Pattern Matching with Regular Expressions]
+from the [https://automatetheboringstuff.com/  Automate the Boring Stuff] book and online course.
+Here are the additional syntax rules that we will use. They are pretty universal across all
+regex packages.
 {| class="wikitable" style="text-align: left"|
 |-
-!YES
+!Pattern
-!NO
+!Description
-!Why not?
 |-
-|App$5.java
+! <nowiki>|</nowiki>
-|Help1.
+|As described above, a vertical bar separates alternatives. For example, gray<nowiki>|</nowiki>grey can match "gray" or "grey".
-|	E is not in the range [o-z].
 |-
-|Dog&2.py
+!*
-|IoU$5
+|As described above, the asterisk indicates zero or more occurrences of the preceding element. For example, ab*c matches "ac", "abc", "abbc", "abbbc", and so on.
-|There is no . at the end.
 |-
-|LzyG3.txt
+!?
-|move6.py
+| The question mark indicates zero or one occurrences of the preceding element. For example, colou?r matches both "color" and "colour".
-|The m is not in the range [A-M].
 |-
-|Hot90.
+! +
-|MaX7A.txt
+| The plus sign indicates one or more occurrences of the preceding element. For example, ab+c matches "abc", "abbc", "abbbc", and so on, but not "ac".
-|The A is not a digit from 0 to 9.
-|}
-Typical problems in the category will include: translate an FSA to a regular expression; simplify a regular expression (possibly to minimize the number of states or transitions); determine which regular expressions are equivalent; and determine which strings are accepted by either an FSA or a regular expression.
-= Basic Identities =
-{| Class="wikitable"
 |-
-|1.  (a*)*  = a*
+! .
-|5.  a(ba)* = (ab)*a
+| The wildcard . matches any character. For example, a.b matches any string that contains an "a", then any other character, and then a "b" such as "a7b", "a&b", or "arb", but not "abbb". Therefore, a.*b matches any string that contains an "a" and a "b" with 0 or more characters in between.  This includes "ab", "acb", or "a123456789b".
 |-
-|2.  aa*    = a*a
+! [ ]
-|6.  (a U b)* = (a* U b*)*
+| A bracket expression matches a single character that is contained within the brackets. For example, [abc] matches "a", "b", or "c". [a-z] specifies a range which matches any lowercase letter from "a" to "z". These forms can be mixed: [abcx-z] matches "a", "b", "c", "x", "y", or "z", as does [a-cx-z].
 |-
-|3.  aa* U λ  = a*
+! [^ ]
-|7.  (a U b)* = (a*b*)*
+|Matches a single character that is not contained within the brackets. For example, [^abc] matches any character other than "a", "b", or "c". [^a-z] matches any single character that is not a lowercase letter from "a" to "z". Likewise, literal characters and ranges can be mixed.
 |-
-|4.  a(b U c) = ab U ac
+!( )
-|8.  (a U b)* = a*(ba*)*
+|
+As described above, parentheses define a sub-expression. For example, the pattern H(ä|ae?)ndel  matches "Handel", "Händel", and "Haendel".
 |}
 = Sample Problems =
-== Sample Problem 1 ==
+Typical problems in the category will include:  translate an FSA to a Regular Expression; simplify a Regular Expression; determine which Regular Expressions or FSAs are equivalent; and determine which strings are accepted by either an FSA or a Regular Expression.
+== Problem 1 ==
-Find a simplified regular expression for the following FSA:
+Find a simplified Regular Expression for the following FSA:
 [[File:fsa_s1.png]]
@@ Line 106: / Line 112: @@
 The expression 01*01 is read directly from the FSA.  It is in its most simplified form.
-== Sample Problem 2 ==
+== Problem 2 ==
+Which of the following strings are accepted by the following Regular Expression "00*1*1U11*0*0" ?
+::A. 0000001111111
+::B. 1010101010
+::C. 1111111
+::D. 0110
+::E. 10
+'''Solution:'''
+This Regular Expression parses strings described by the union of 00*1*1 and 11*0*0.  The RE 00*1*1 matches strings starting with one or more 0s followed by one or more 1s:  01, 001, 0001111, and so on. The RE 11*0*0 matches strings with one or more 1s followed by one or more 0s:  10, 1110, 1111100, and so on.  In other words, strings of the form:  0s followed by some 1s; or 1s followed by some 0s.  Choice A and E following this pattern.
+<!--
+== Problem 3 ==
 Which, if any, of the following Regular Expressions are equivalent?
-::A. (aUb)(ab*)(b*Ua)
+::A. (a U b)(ab*)(b* U a)
-::B. (aab*Ubab*)a
+::B. (aab* U bab*)a
-::C. aab*Ubab*UaabaUbab*a
+::C. aab* U bab* U aaba U bab*a
-::D. aab*Ubab*Uaab*aUbab*a
+::D. aab* U bab* U aab*a U bab*a
-::E. a*Ub*
+::E. a* U b*
 '''Solution:'''
-On first inspection, B can be discarded because it is the only RE whose strings must end in an a.  E can also be discarded since it is the only RE that can accept a null string.  C and D are not equal.  After expanding A, we must compare it to C and D.  It is equal to D, but not to C.  A and D.
+Choice B can be discarded because it is the only RE whose strings '''must''' end with an a.  Choice E can be discarded since it is the only RE that can accept a null string. Choices C and D are not equal.  After expanding choice A, we must compare it to choices C and D.  It is equal to choice D, but not to choice C. The only REs that are equivalent are choices A and D.
+-->
+== Problem 3 ==
+Which of the following strings match the regular expression pattern "[A-D]*[a-d]*[0-9]" ?
-== Sample Problem 3 ==
+::1. ABCD8
+::2. abcd5
+::3. ABcd9
+::4. AbCd7
+::5. X
+::6. abCD7
+::7. DCCBBBaaaa5
-Which of the following strings match the regular expression pattern "?[A-D]*[a-d]*#" ?
+'''Solution:'''
-::A. 8AAAabd4
+The pattern describes strings the start with zero or more uppercase letters A, B, C, or D (in any order), followed
-::B. MM5
+by zero or more lowercase letter a, b, c, or d (in any order), followed by a single digit.
-::C. AAAAAAAA7
+The strings that are represented by this pattern are 1, 2, 3, and 7.
-::D. Abcd9
-::E. &dd88
+== Problem 4 ==
-::F. >BAD0
+Which of the following strings match the regular expression pattern "Hi?g+h+[^a-ceiou]" ?
+::1. Highb
+::2. HiiighS
+::3. HigghhhC
+::4. Hih
+::5. Hghe
+::6. Highd
+::7. HgggggghX
 '''Solution:'''
-The answer is A, C, D, and F.  In B the character M is not in the range A-D or a-d.  In E, there cannot be more than 1 digit at the end of the string.  Remember that an asterisk represents 0 or more of that character or rangle of characters.
+The ? indicates 0 or 1 "i"s. The + indicates 1 or more "g"s followed by 1 or more "h"s.
+The ^ indicates that the last character cannot be lower-case a, b, c, e, i, o, or u.
+The strings that are represented by this pattern are 3, 6, and 7.
-== Sample Problem 4 ==
+<!--
+== Problem 5 ==
-Which of the following strings are accepted by the following regular expression "00*1*1U11*0*0" ?
+Which of the following strings match the regular expression pattern "[A-E|a-e]*(00[01])|([10]11)" ?
-::A. 0000001111111
+::1. DAD001
-::B. 1010101010
+::2. bad000
-::C. 1111111
+::3. aCe0011
-::D. 0110
+::4. AbE111
-::E. 10
+::5. AAAbbC
+::6. aBBBe011
+::7. 001011
 '''Solution:'''
-The answer is A and E.  A uses the left hand pattern and E uses the right hand pattern.  Every string must stat and end in the opposite digit so C and D don’t work.  B doesn’t work because all 1’s and all 0’s must be together.
+The [A-E|a-e]* allows for any of those letters in any order 0 or more times. Therefore, all of the
+choices match at the beginning of the string. The end of the string must match "000", "001", "111",
+or "011". That means that 1, 2, 4, and 6 match.
+-->
 = Video Resources =
@@ Line 157: / Line 205: @@
 | [https://youtu.be/GwsU2LPs85U ''1 - Convert Regular Expression to Finite-State Automaton'' ('''Barry Brown''')]
+|-
 | <youtube width="300" height="180">https://youtu.be/shN_kHBFOUE</youtube>
 | [https://youtu.be/shN_kHBFOUE ''2 - Convert Regular Expression to Finite-State Automaton'' ('''Barry Brown''')]
-This video uses the symbol "+" to mean "1 or more matches of the previous term". For example, "ab+" is the same as "abb*".
+This video uses the symbol "+" to mean "1 or more matches of the previous term". For example, "ab+" is the same as "abb*".  In terms of the Kleene Star, zz* = z*z = z+.
+|}
+{|
+|-
+| <youtube width="300" height="180">https://youtu.be/vI_yv0WuAhk</youtube>
+| [https://youtu.be/vI_yv0WuAhk ''ACSL Test Prep - Finite State Automaton & Regular Expressions Explained'' ('''Mrs. Gupta''')]
+A talked-over presentation discussing the finite state automatons and regular expressions as needed for the American Computer Science League and its tests.
 |}

Difference between revisions of "FSAs and Regular Expressions"

Latest revision as of 10:25, 1 September 2020

Basics

Regular Expression Identities

RegEx in Practice

Sample Problems

Problem 1

Problem 2

Problem 3

Problem 4

Video Resources

Navigation menu

Search