Context-Free Languages | Baeldung on Computer Science

1. Introduction

In this tutorial, we’ll talk about context-free languages. In particular, we’ll cover some techniques for proving that a language is (not) context-free.

2. What’s a Context-Free Language?

Before answering that question, let’s first remember what formal languages and grammars are.

2.1. Languages and Grammars

A formal language is a set of words we form using a non-empty alphabet of symbols. For example, $L_1 = \{a^k b^k \mid k = 0,1,2,\ldots\}$ is a language over the alphabet $\{a, b\}$ .

Every language has its grammar. Informally, a language’s grammar describes how to produce the words of that language. Formally, it’s a tuple $(V, \Sigma, S, P)$ where:

$V$ is the set of all symbols the grammar uses
$\Sigma$ is the alphabet; we call its symbols terminals, while the symbols in $V \setminus \Sigma$ are known as non-terminals or variables
$S$ is a unique variable that starts the derivation of all the words
$P$ is the set of production rules

Production rules derive the words of the language at hand. Therefore, we can think of them as the rules guiding the transformation of the word from $S$ over its transitional forms to its final shape.

For example, these rules generate $\{ a^k b^k \mid k\geq 0 \}$ :

(1)

where $\varepsilon$ is the empty word. Consequently, the derivation of $aaaabbbb$ is:

$S \rightarrow aSb \rightarrow aaSbb \rightarrow aaaSbbb \rightarrow aaaaSbbbb \rightarrow aaaabbbb$

Depending on the form of the production rules, we differentiate between several grammar and language types.

2.2. Context-Free Languages

If the production rules $P$ are of the form:

(2)

we say that the grammar is context-free (CFG). So, a language is context-free if there’s a CFG generating it.

The language $L_1$ from the previous section is context-free. Its grammar is:

$G_1 = \left( \{S,a,b\}, \{a,b\}, S, P \right)$

where $P$ is the set of rules (1). Both rules are of the CFG form (2), so $G_1$ is context-free, and so is the language it generates.

3. How to Prove That a Language Is Context-Free

A language can be stated implicitly as a general pattern of its words (such as $a^kb^k$ for $k \geq 1$ ). Alternatively, we can get an explicit corpus containing the whole language.

3.1. Constructing a Grammar

No matter how our language is stated, we can prove it’s context-free by constructing a CFG grammar that generates its words.

That’s sometimes more easily said than done. If the language isn’t complex, we’ll be able to derive the rules by hand (provided it’s context-free). That was the case in the above example.

But, if the language has a complicated structure, we may struggle with designing the corresponding CFG. And it’s not just the rules that will be hard to come by. We’ll also have to decide which non-terminal symbols to use. In such cases, we can apply the techniques for automated grammar learning. They can handle even the partially stated languages, where only a sample of the words is available.

3.2. Proving That the Grammar Is Correct

Devising a CFG isn’t sufficient. We also have to prove two statements: that the grammar generates the language and that all the words of the language can be generated by the grammar. Proving only the latter shows that the grammar’s language contains the target language, but not that they’re equal.

Let’s first prove that by applying the rules of $G_1$ from Section 2.2. $k$ times, we either get a word $a^{k-1}b^{k-1}$ from $L_1$ or a combination $a^k S b^K$ . The latter isn’t a word because it contains a non-terminal symbol. We’ll prove this hypothesis by induction by the number of applied rules.

In the base case ( $k=1$ ), we can apply $S \rightarrow \varepsilon = a^0 b^0$ or $S \rightarrow a S b = a^1 S b^1$ . Since $\varepsilon \in L_1$ and $a^1 b^1$ is of the form $a^k b^k$ , we’ve covered the induction base.

In the step case, we assume that the induction hypothesis holds for some $k\geq 1$ , so we have $a^{k-1} b^{k-1} \in L_1$ and $a^{k} S b^{k}$ . Does the hypothesis hold for $k+1$ ?

Applying $S \rightarrow \varepsilon$ to $a^{k} S b^{k}$ , we get $a^k b^k = a^{(k+1)-1} b^{(k+1)-1} \in L_1$ . Similarly, applying the other rule, we get $a^{k+1} S b^{k+1}$ .

Thus we showed that $G_1$ generated only the words in $L_1$ .

Proving the converse is easier. A word of $L_1$ is of the form $a^k b^k$ . If we apply $S \rightarrow a S b$ $k$ times and $S \rightarrow \varepsilon$ afterwards, we’ll get $a^k b^k$ for any $k$ .

3.3. Pushdown Automata

A pushdown automaton (PDA) is a finite state machine equivalent to a context-free grammar. For any language generated by a CFG, there’s a PDA that recognizes it and vice versa.

Sometimes, it’s easier to construct an automaton than grammar. Informally, a PDA is a finite state machine that transitions between states while reading a word symbol by symbol from its input and writing symbols onto its stack. If it reads the whole word and ends in an accepting state with an empty stack, we say that it’s accepted the word it read. The language a PDA recognizes consists of the words it accepts.

We won’t cover PDAs in this article, but they’re certainly an alternative to consider.

3.4. The Question of Proof

Failing to construct a CFG or a PDA by hand doesn’t mean that the language isn’t context-free. The absence of proof isn’t the same as the proof of absence. A CFG generating our language may exist, but we couldn’t find it.

If we suspect that we failed at constructing an appropriate grammar because the language isn’t context-free, we can try to prove precisely that: that the language isn’t context-free.

4. How to Prove That a Language Isn’t Context-Free

Ogden’s lemma describes a property of all context-free languages. So, if we show that the language at hand doesn’t have that property, we’ll prove that it isn’t context-free.

4.1. Ogden’s Lemma

Let’s state the lemma. Let $L$ be a context-free language. Then, there exists a natural number $n_L$ such that every word $w \in L$ ( $|w| \geq n_L$ ) in which at least $n_L$ positions (indexed symbols) are marked, can be split into five parts:

(3)

so that the following holds:

$uv$ contains at least one marked position.
$uyv$ contains at most $n_L$ marked positions.
$xu^kyv^kz \in L$ for all $k \geq 0$ .

The lemma shows a regularity in generating the words of a context-free language. Pumping the selected sub-words ( $\boldsymbol{u^k}$ and $\boldsymbol{v^k}$ ) produces the new words of the language. If any word with at least $n_L$ symbols breaks this regularity, the language isn’t context-free.

4.2. The Pumping Lemma for Context-Free Languages

Usually, we apply a special case of Ogden’s lemma, where all the positions are marked. This case is known as the pumping lemma for context-free languages:

For every context-free language $L$ , there exists a natural number $n_L$ such that every word $w \in L$ ( $|w| \geq n_L$ ) can be written as

$w = xuyvz$

so that the following holds:

$|uv| \geq 1$
$|uyv| \leq n_L$
$xu^kyv^kz \in L$ for all $k \geq 0$

4.3. Example

Let’s prove that the language $L_2 = \{a^n b^n c^n \mid n \geq 0\}$ isn’t context-free.

So, we first suppose the opposite, that $\boldsymbol{L_2}$ is context-free. Then, by the pumping lemma, there must exist $n_L$ that fulfills the lemma’s conditions about splitting all the words not shorter than $n_L$ in $L_2$ .

Let’s focus on the word $a^{n_L} b^{n_L} c^{n_L}$ . Applying the lemma, we can write it as $w = xuyvz$ so that we have:

$|uv| \geq 1$
$|uyv| \leq n_L$
$xu^kyv^kz \in L$ for all $k \geq 0$

Let’s analyze the last condition. There are two options.

The sub-word $uyv$ contains only $a$ s. So, pumping $u$ and $v$ increases the number of $a$ s in the word, while the numbers of $b$ s and $c$ s stay the same. Therefore, the new pumped words won’t be in $L_2$ .
Alternatively, $uyv$ can contain both $a$ s and $b$ s. By pumping the sub-words $u$ and $v$ , we increase the number of $a$ s, the number of $b$ s, or both. At the same time, the number of $c$ s stays the same. So, the pumped words don’t belong to $L_2$ in this case too.

That’s a contradiction because we assumed that $\boldsymbol{L_2}$ was context-free. Hence, the assumption was wrong, and $\boldsymbol{L_2}$ cannot be a context-free language.

4.4. The Question of Proof, Revisited

Just as was the case with the proofs by constructing the corresponding CFGs and PDAs, failing to prove that a language isn’t context-free doesn’t mean that it is. If we couldn’t find a word that breaks the pumping property, that doesn’t imply that no such word exists.

That opens the question of what to do if both strategies fail. The best course of action would be to use software tools such as automated learners and theorem provers.

5. Conclusion

In this article, we talked about context-free languages. We can prove that a language is context-free if we construct a context-free grammar that generates it. Alternatively, we can create a pushdown automaton that recognizes the language. On the other hand, we use Ogden’s lemma and the pumping lemma for context-free languages to prove that a language isn’t context-free.

Learn Java Collections

Learn Spring

Learn Maven

View All Courses

Core Concepts

Operating Systems

Neural Networks

Graph Theory

Latex

Full Archive

About Baeldung