Parsing is the process of analyzing and interpreting a sequence of symbols according to the rules of a grammar. Parsing is an essential component of compilers, interpreters, natural language processing, and many other applications.
Parsing is done in different phases of compilation. In this article, we will explain the concept of top down
parsing, which is a parsing strategy where one first looks at the highest level
of the parse tree and works down the parse tree by using the rewriting rules of
a formal grammar. We will also discuss the advantages and disadvantages of top
down parsing, the types of top down parsers, problems of top down parsing and
some examples of top down parsing.
Table of content : Top Down Parsing
- What is top down parsing
- Advantages and disadvantages of top down parsing
- Problems of Top-Down
Parsing in Compiler Construction
- Backtracking and its
solution
- Left recursion
- Ambiguity
What is Top Down Parsing?
Top down parsing is a parsing strategy where one first looks
at the highest level of the parse tree and works down the parse tree by using
the rewriting rules of a formal grammar. Top down parsing is also known as
predictive parsing or recursive descent parsing.
The main idea of top down parsing is to start with the start
symbol of the grammar and try to derive the input string by applying the
productions from left to right. At each step, the parser chooses a production
that matches the next input symbol or predicts what the next input symbol
should be. If the prediction is correct, the parser proceeds to the next
symbol. If the prediction is wrong, the parser backtracks and tries another
production.
For example, consider the following grammar that recognizes
arithmetic expressions:
E -> E + T | T
T -> T * F | F
F -> (E) | id
Parse tree for a*b+c
E is a start symbol and the terminals are *, +, (, ), and id (identifier). The
nonterminals are E (expression), T (term), and F (factor).
Examples of Top Down Parsing
Top down parsing is widely used in various applications, such as:
- Compiler and Interpreter
Top down parsing is widely used to interpret and analyze the source code of programming languages according to their syntax and semantics. For example, to parse its source code, Python uses a recursive descent parser.
- Natural Language Processing
Top down parsing is being used in natural language processing. It is used to the structure and meaning of natural languages, such as English, French, etc. For example, Stanford Parser uses a top down parser to parse natural language sentences into syntactic trees.
- XML Processing
Top down parsing is used to validate and process the
structure and content of XML documents according to their schemas and grammars.
For example, SAX (Simple API for XML) uses a top down parser to parse XML
documents into events.
Advantages and Disadvantages of Top Down Parsing
Top down parsing has some advantages and disadvantages
compared to other parsing strategies.
Advantages:
- Top down parsing is simple and intuitive to implement.
- Top down parsing can handle left recursion and left factoring in grammars.
- Top down parsing can produce meaningful error messages by predicting what symbols are expected next.
Disadvantages:
- Top down parsing may require backtracking, which can be inefficient and time-consuming
- Top down parsing may not work for some grammars that are ambiguous or have common.
- Top down parsing may not produce a leftmost derivation or a rightmost derivation of the input string.
Problems of Top-Down Parsing in Compiler Construction
Top-down
parsing is a technique of syntax analysis that builds a parse tree for an input
string from the root to the leaves, following the grammar productions. It can
also be seen as generating a leftmost derivation for an input string, by
expanding the non-terminals from the start symbol to the terminals.
However, top-down parsing also faces some problems that limit its applicability and efficiency. We will discuss some of the common problems of top-down parsing and how to overcome them
i. Backtrackin
ii. Left Recursio
iii. Ambiguity
Let us discuss these top down problems and their solutions
one by one.
- Backtracking
One
of the main problems of top-down parsing is backtracking. Backtracking occurs
when the parser chooses a production for a non-terminal based on the current
input symbol, but later realizes that it made a wrong choice and has to undo
its previous steps and try another production. This can be very inefficient and
may cause exponential time complexity for some grammars.
For
example, consider the following grammar:
S
-> aAb | bBa
A -> c
B -> c
For
the input string “bcb”, a top-down parser would start from S and try to match
it with b. Since there is only one production for S that starts with b, it
would choose S -> bBa and expand it. Then it would try to match B with c.
Since there is only one production for B that matches c, it would choose B
-> c and expand it. Then it would try to match a with b. Since this does not
match, it would have to backtrack to S and try another production. However,
there is no other production for S that starts with b, so it would report an
error.
Solution of backtracking problem
To avoid backtracking, we can use predictive parsing, which
uses a parsing table and lookahead symbols to determine which production to
choose without guessing. A predictive parser is also known as an LL(1) parser,
which means that it scans the input from left to right, constructs a leftmost
derivation, and uses one symbol of lookahead.
To
construct a predictive parser, we need to compute the FIRST and FOLLOW sets for
each non-terminal in the grammar, and use them to fill the parsing table.
The FIRST set of a non-terminal is the set of terminals that
can appear at the beginning of any string derived from that non-terminal.
The FOLLOW set of a
non-terminal is the set of terminals that can appear immediately after any
string derived from that non-terminal.
For
example, for the grammar above, we have:
FIRST(S)
= {a, b} FIRST(A) = {c} FIRST(B) = {c}
FOLLOW(S)
= {$} FOLLOW(A) = {b} FOLLOW(B) = {a}
Using
these sets, we can fill the parsing table as follows:
Using
this table, we can parse the input string bcb without backtracking as follows:
2. Left Recursion
Another
problem of top-down parsing is left recursion. Left recursion occurs when a
non-terminal derives itself directly or indirectly on its leftmost symbol. For
example, consider the following grammar:
S
-> Sa | b
This
grammar is left recursive because S derives itself on its leftmost symbol in S
-> Sa. Left recursion causes infinite loops in top-down parsing because the
parser keeps expanding the same non-terminal without consuming any input
symbol.
For
example, if we try to parse the input string b using this grammar, we would get
stuck in an infinite loop as follows:
S
-> Sa -> Saa -> Saaa -> …
Left Recursion Elimination
To
eliminate left recursion, we can transform the grammar into an equivalent one
that does not have left recursion. The general method is as follows:
·
For each non-terminal A in
the grammar, do:
o
For each production A ->
Aα, where α is any string of symbols, do:
§ Remove A -> Aα from the grammar.
§ Add a new non-terminal A’ to the grammar.
§ Add A -> βA’ for each production A -> β that does not
start with A.
§ Add A’ -> αA’ | ε to the grammar.
Example1:
A -> Aα|β
Left recursion elimination:
A -> βA’
A’ -> αA’ | ε
Example2:
E->E+T |T
Left recursion elimination:
A=E
α=+T
β=T
Then
E->TE’
E->+TE’| ε
3. Ambiguity
A
third problem of top-down parsing is ambiguity. Ambiguity occurs when a grammar
can generate more than one parse tree for the same input string. This means
that the grammar is not clear about the precedence and associativity of the
operators or the structure of the sentences.
For
example, consider the following grammar:
E
-> E + E | E * E | (E) | id
This
grammar is ambiguous because it can generate different parse trees for the same
input string, depending on how we apply the productions. For example, for the
input string id + id * id, id * id + id, we can generate two different parse
trees:
These
two parse trees have different meanings, because they imply different orders of
evaluation for the operators. The first one means (id + id) * id, while the
second one means id + (id * id).
To
resolve ambiguity, we can modify the grammar to make it unambiguous by
introducing different non-terminals for different levels of precedence and
associativity. For example, we can rewrite the grammar above as follows:
E
-> T + E | T T -> F * T | F F -> (E) | id
This
grammar is unambiguous because it gives higher precedence to * than +, and
makes both operators left-associative. We can parse the input string id + id *
id using this grammar and get a unique parse tree.
In this article, we have discussed in detail about types of compiler construction parsers, i.e. top down parser and bottom up parsers. Then we discussed about problems that occurs in top down parsing and their solution along with suitable examples.