I’m actually learning some stuff related to algorithms on sequences. The naive search for a pattern in a long string is of course very slow and comes with a lot of unintelligent compares. The Z-Algorithm improves the searching by preprocessing the pattern.

## Naive searching

A simple search algorithm written in java may look like

This code reliably finds any existence of needle in haystack in $$O(m \cdot n)$$, with $$m=$$ length of needle and $$n=$$ length of haystack. That screams for improvements ;)

## Definitions

The first algorithm that I want to present in this series is called Z-Algorithm. First of all we need some definitions.

Definition 1: In the following we denote $$S[i\dots j]$$ as the substring of $$S$$ beginning at position $$i$$ and ending at position $$j$$. We can also leave one of the limits clear, so that $$S[i\dots]$$ is the substring $$S[i\dots |S|]$$ and $$S[\dots j]$$ means $$S[1\dots j]$$.

Definition 2: $$Z_i(S) := \max \{p | S[i \dots i+p-1] = S[1 \dots p]\}$$ So $$Z_i(S)$$ is the length of the longest prefix of the suffix $$S[i\dots]$$ that is also prefix of $$S$$ itself. To abbreviate $$Z_i(S)$$ is further on mentioned as $$Z_i$$.

Definition 3: The set $$[i,i+Z_i-1]$$ for a $$Z_i > 0$$ is called Z-Box at position $$i$$.

Definition 4: $$V_i := \{[a_j, b_j] | [a_j, b_j] \text{ is Z-Box at } a_j \wedge a_j < i\}$$ $$V_i$$ is the set of limits of all Z-Box’es that start at the left-handed side of $$i$$. Consider $$i<j \Rightarrow V_i \subseteq V_j$$.

Definition 5: $$[l_i,r_i] := \begin{cases} \underset{b_j}{\arg\max} \ [a_j,b_j] \in V_i, & \text{if } V_i \ne \varnothing\\ [0,0] & \text{else}\end{cases}$$ If $$l_i>0$$ and $$r_i>0$$, $$[l_i,r_i]$$ defines the rightest Z-Box that starts before respectively at position $$i$$. Consider $$i<j \Rightarrow r_i\le r_j$$.

## Algorithm

In the following $$i$$ will denote the actual position we are looking for, $$l$$ and $$r$$ describe the current respectively last found of a Z-Box. First of all we set the values $$l$$ and $$r$$ to zero because we haven’t found any Z-Box yet. $$Z_2$$ of our text $$S$$ is according to Definition 2 the length of the longest prefix of $$S[2\dots]$$ that is also prefix of $$S$$ itself. If $$Z_2>0$$ we found a first Z-Box and update the limits to $$l=2$$ and $$r=2+Z_2-1$$.

Now we have to run through the word $$S$$, so $$i=3\dots \|S\|$$ with $$\|S\|$$ defines the length of $$S$$.

Case 1: Let’s assume position $$i$$ is outside of the last found Z-Box or we didn’t find any Z-Box yet ($$i>r$$). We find $$Z_i$$ by comparing the prefixes of $$S$$ and $$S[i\dots]$$. If $$Z_i>0$$ we’ve found a new Z-Box and need to update the limits to $$l=i$$ and $$r=i+Z_i-1$$.

Case 2: If the current position $$i$$ is inside of a current Z-Box ($$i\le r$$) we try to find the equivalent position at the beginning of $$S$$. The position we are searching for is $$k=i-l+1$$ steps off the beginning of $$S$$ (we are $$i-l+1$$ steps behind $$l$$ and $$S[l\dots]$$ has the same prefix as $$S$$). Case 2a: If we don’t break out of the current Z-Box by creating another Z-Box with the length of the box at position $$k$$ ($$Z_k<r-i+1$$, so position $$i+Z_k$$ is not behind position $$r$$), we can simply apply this Z-Box to the current position and $$Z_i=Z_k$$. Case 2b: Otherwise, if we would leave the actual Z-Box ($$i + Z_k>r$$) we have to recheck the prefix conditions of $$S[i\dots]$$ and $$S$$. We know that $$S[i\dots r]$$ equals $$S[1\dots r-i+1]$$, so we only have to find the length of the longest prefix $$p$$ of $$S[r-i+2\dots]$$ that equals the prefix of $$S[r+1\dots]$$. Now we can apply the new Z-Box such that $$Z_i=r-i+1+p$$ and of course we update the Z-Box limits to $$l=i$$ and $$r=i+Z_i-1$$.

If we reached the end of $$S$$ all Z-Boxes are found in $$\Theta(\|S\|)$$.

## Example

Let me demonstrate the algorithm with a small example. Let’s take the word $$S=aabaaab$$. First we start with $$l=0$$ and $$r=0$$ at position 2. $$Z_2$$ is the length of the shared prefix of $$S$$ ($$aabaaab$$) and $$S[2\dots]$$ ($$abaaab$$). Easy to see the prefix is $$a$$ with a length of 1. So $$Z_2=1$$, $$l=2$$ and $$r=2$$. At the beginning of our for-loop the program’s status is:

$$T$$ a $$i$$ $$Z_i$$ $$l$$ $$r$$ a b a a a b 1 2 1 2 2

At the first round in the loop $$i=3$$, so $$i>r$$ because $$r=2$$. So we meet case 1 and have to find the length of the prefix of $$S$$ ($$aabaaab$$) and $$S[3\dots]$$ ($$baaab$$). Of course it’s zero, nothing to do.

$$T$$ b $$i$$ $$Z_i$$ $$l$$ $$r$$ a a a a a b 1 2 3 1 0 2 2 2 2

Next round, we’re at position 4 and again $$i>r$$ (case 1). So we have to compare $$aabaaab$$ and $$aaab$$. The longest prefix of both words is $$aa$$ with a length of 2. So we start a new Z-Box at 4 with a size of 2, so $$l=4$$ and $$r=5$$.

$$T$$ a $$i$$ $$Z_i$$ $$l$$ $$r$$ a a b a a b 1 2 3 4 1 0 2 2 2 4 2 2 5

With $$i=5$$ and $$r=5$$ we reach case 2 for the first time. $$k=i-l+1=2$$ so our similar position at the beginning of $$S$$ is position 2. $$Z_2=1$$ and $$r-i+1=1$$ so we are in case 2b and have to find the shared prefix of $$S[2 ..]$$ ($$abaaab$$) and $$S[6 ..]$$ ($$ab$$). It’s $$ab$$, so $$p=2$$ and $$Z_5=r-i+1+p=3$$. $$l=5$$ and $$r=7$$.

$$T$$ a $$i$$ $$Z_i$$ $$l$$ $$r$$ a a b a a b 1 2 3 4 5 1 0 2 3 2 2 4 5 2 2 5 7

Next round brings us $$i=6<r$$, therefor we’re in case 2. Equivalent position is again $$k=i-l+1=2$$, but now $$Z_2=1<r-i+1=2$$ and we’re in case 2a and can just set $$Z_6=1$$.

$$T$$ a $$i$$ $$Z_i$$ $$l$$ $$r$$ a a b a a b 1 2 3 4 5 6 1 0 2 3 1 2 2 4 5 5 2 2 5 7 7

The last round we have to process is $$i=7<r$$, case 2. Equivalent position is $$k=i-l+1=3$$ and $$Z_3=0<r-i+1=1$$, so case 2a and $$Z_7 = 0$$.

 $$T$$ b $$i$$ $$Z_i$$ $$l$$ $$r$$ a a b a a a 1 2 3 4 5 6 7 1 0 2 3 1 0 2 2 4 5 5 5 2 2 5 7 7 7

That’s it. The Z-Box’es we’ve found are visualized in the image.

## Searching

To search for a pattern $$P \in A^*$$ in a text $$T \in A^*$$ just calculate the Z-Boxes of $$P\T$$ with $$\\notin A$$. These calculations are done in $$\Theta(|T|)$$. For any $$i>|P|$$: If $$Z_i=|P|$$ means $$P\T[i\dots i+|P|-1]$$ is prefix of $$P\T$$, so $$P$$ is found at position $$i-(|P|+1)$$ in $$T$$.

## Code

Of course I’m providing an implementation, see attachment.