NFA -> DFA with Subset Construction

Need to build a simulation of the NFA

Two key functions

move(si, a) is a set of states reachable from si by a
ε-closure(si) is the set of states reachable from si by ε

The algorithm (sketch):

Start state derived from s0 of the NFA
Take its ε-closure S0 = ε-closure(s0)
For each state S, compute move(S, a) for each a ∈ Σ, and take it’s ε-closure
Iterate until no more states are added

Sounds more complex that it is…

Example

Applying the subset construction:

DFA States	NFA States	a	b	c
s0	{q0}	{q1, q2 q3, q9, q4, q6 }	none	none
s1	{q1,q2,q3,q9,q4,q6}	none	{q5, q8, q3, q6, q4, q9}	{q7, q8, q9, q3, q4, q6}
s2	{q5, q8, q9, q3, q4, q6 }	none	s2	s3
s3	{q7, q8, q9, q3, q4, q6 }	none	s2	s3

Note that any NFA state that contains q9 is an accepting state, since that is the final state in the NFA

The result of subset construction is the following DFA

δ	a	b	c
s0	s1	-	-
s1	-	s2	s3
s2	-	s2	s3
s3	-	s2	s3

Ends up smaller than the NFA
All transitions are deterministic

Automatic Scanner Construction

RE -> NFA (Thompson’s construction)
- Build an NFA for each term
- Combine them with ε moves
NFA -> DFA (subset construction)
- Build the simulation
DFA -> Minimal DFA
- Hopcroft’s Algorithm
DFA -> RE (not really part of scanner construction)
- All pairs, all paths problem
- Union together paths from s0 to a final state

DFA Minimization

How do we know whether two states encode the same information?

Here, q1 and q2 are not equivalent. “w” is a witness that they are not equivalent

Intuition: Two states are equivalent if for all sequences of input symbols “w” they both lead to an accepting state, or both end up in a non-accepting state.

Big Picture

Discover sets of equivalent states
Represent each such set with just one state

Two states are equivalent if and only if:

∀ a ∈ Σ, transitions on a lead to equivalent states (DFA)
if a-transitions to different sets => two states must be in different sets, i.e., cannot be equivalent

A partition P of S

Each state s ∈ S is in exactly one set pi ∈ P
The algorithm iteratively partitions the DFA’s states

Details of the algorithm

Group states into maximal size sets, optimistically
Iteratively subdivide those sets, as needed
States that remain grouped together are equivalent

Initial partition, P0, has two sets: {F} & {Q-F}

Splitting a set (“partitioning a set s by a”)

Assume qa & qb ∈ s, and δ(qa, a) = qx & δ(qb, a) = qy
If qx & qy are not in the same set, i.e., are considered equivalent, then s must be split
- qa has transition on a, qb does not => a splits s

Back to our DFA Minimization example

Limits of Regular Languages

Advantages of Regular Expressions

Simple & powerful notation for specifying patterns
Automatic construction of fast recognizers
Many kinds of syntax can be specified with REs

Example - an expression grammar

Term -> [a-zA-Z]([a-zA-Z] | [0-9])*
Op   -> + | - | * | /
Expr -> ( Term Op )* Term

Of course, this would generate a DFA

If REs are so useful… Why not use them for everything?

Not all languages are regular

RL’s ⊂ CFL’s ⊂ CSL’s

You cannot construct DFAs to recognize these languages

L = {p^k q^k}
L = {wcw^r | w ∈ Σ*}

Neither of these is a regular language But, this is a little subtle. You can construct DFA’s for

Strings with alternating 0’s and 1’s
Strings with an even number of 0’s and 1’s
Strings of bit patterns that represent binary numbers which are divisible by 5 (homework)

What can be so hard?

Poor language design can complicate scanning

Reserved words are important
- if then then then - else; else else = then (PL/I)
Insignificant blanks (Fortran & Algol68)
- do 10 i = 1,25
- do 10 i = 1.25
String constants with special characters (C, C++, Java)
- newline, tab, quote, comment delimiters
Limited identifier “length” (Fortran 66 & PL/I)