Algorithm to find the least lexicographic rotation of a circular string

Background:

A string is a sequence of symbols from a set called an alphabet. If the string is circular, one can start reading it from any position/index. For example, consider the following circular string of length 8:

If we read this circular string starting at 12 o’clock position and moving in clockwise direction, we obtain”abaabbab“. Now, let us rotate this circular string in anti-clockwise direction by one position.

The string now derived from reading after one rotation is “baabbaba“. Similarly, rotating again will give us “aabbabab“; continuing in the similar fashion we get “abbababa“, “bbababaa“, “bababaab“, “ababaabb“, “babaabba“.

In other words, all these eight strings so obtained, represent the rotations of the same circular string. The question now arises: how do we know if two given strings represent the rotations of the same circular string or different ones? Alternatively, how do we compare two circular strings?

The answer lies in the fact that a given circular string will have a unique least lexicographic rotation. Simply, if we arrange the strings representing the rotations in lexicographic order (the order in which they will appear in a dictionary), the one appearing first among all will correspond to the least lexicographic rotation. For instance, here the lexicographic or dictionary order is: “aabbabab“, “abaabbab“, “ababaabb“, “abbababa“, “baabbaba“, “babaabba“, “bababaab“, “bbababaa“. Clearly, “aabbabab” represents the least lexicographic rotation.

Thus, if two different strings result in the same least lexicographic rotation, they essentially represent the rotations of the same circular string.

Motivation:

The motivation to study the problem of computing the least lexicographic rotation (LLR, henceforth) of a string comes from many different application domains.

Genetics: Several bacterial genomic sequences are circular. In many situations, such as indexing a new genome in databases, we need to know whether or not two genomic sequences belong to the same bacteria.
Finger-print identification: Fingerprints can be seen as circular strings at primary level. For verification or searching a given fingerprint against the stored ones, comparison of circular strings is required.
Pattern-matching in geometric data: A polygon can be represented as a circular string of coordinates in the sense that there is no starting or end point in a polygon. Pattern searching in such a database of polygons necessitates comparison of circular strings.

Problem:

A string is given that represents a rotation of a circular string (which can be obtained by joining the right end of the given string with its left end). We are required to find the LLR of the underlying circular string. For example, we are given “baabbaba” (which is a rotation of the circular string in the running example). Let us index the positions of the given string from 1.

We get a rotation if we start from an index, say i, go up to the end, start from 1, and stop at i-1. Note that all the rotations of the underlying circular string can be obtained from the given string (by starting from each index). Here, “aabbabab” starting at index 2 is the LLR.

Further note that LLR of a given string (and thus of the underlying circular string) is unique even though it may start from more than one indices. Take the string “baabaa” for example – the LLR is “aabaab” but it starts from index 2 and index 5. In general, if a string is periodic (i.e. of the form xⁿ where n ≥ 2) then it will have several (decided by n) indices giving the same rotation and thus the same LLR. As a consequence, a non-periodic (that can not be expressed as a power of a shorter string) will have the unique index for its LLR.

Naive Algorithm:

Naively, we can sort the successive rotations to find the LLR. For a string of length n, if we use bucket-sort algorithm, the naive approach will require O(n²) time in the worst case.

Duval’s Algorithm:

In 1983, an efficient algorithm was proposed by Duval that runs in O(n) time and requires constant memory space. Working of the algorithm can be understood easily, by following the description of the steps given below:

Start by comparing (seen as a duel) symbols at the consecutive indices: 1 with 2, 3 with 4, 5 with 6, and so on. While comparing indices, say i and i+1, the winner will be the index having smaller symbol of the two. If both are equal, choose the left one as the winner (i.e. i).
Now compare consecutive indices of the winners chosen in the previous step. New winners will be chosen as follows: Assume we are comparing index i with index j. Compare the substring starting at index i up to index j-1, with the substring of the same length (i.e. j-i) but starting from index j. Note that the substring is taken considering string to be circular (i.e., if we hit the end, we go back to 1). Whichever of the two substrings is smaller, its starting index will be the winner. If both are the same, choose the left index (i.e. i).
Repeat Step 2 until we retreive the ultimate winner; the rotation starting from this winner-index is the LLR.

Example

Iteration 1: From b and a at indices 1 and 2, respectively, a is smaller. Therefore, 2 is the winner. Similarly, 3, 6, and 8 are the winners.

Iteration 2: Now substrings of length 1 (3-2 =1) starting at 2 and 3 are compared. Since a=a, left index i.e. 2 is chosen as the winner.

In comparison (duel) between 6 and 8, substrings of length 2 (8-6=2) are compared. Substring of length 2 starting at position 6 is ab and the one starting at position 8 (considering circular string) is also ab. Again, left index i.e. 6 is chosen as the winner because ab=ab.

Iteration 3: The substrings of length 4 (6-2 =4) starting at positions 2 (aabb) and 6 (abab) are duelled in this step. As aabb < abab, 2 is the clear winner. We stop here as there are no indices left to duel and the final winner has been chosen.

Thus, the LLR is the rotation starting at index 2 i.e. “aabbabab“.

Correctness of the algorithm

Consider that we are comparing two indices, say i and j.

Let the substring from i to j-1 is X and the substring of same length (i.e. j-i) starting from j is Y which ends at index, say l. Let the substring from l and going up to i is Z. The rotations starting from i, j, and l are XYZ, YZX, and ZXY, respectively.

Now there are only the following possibilities:

X < Y
In this case, as X and Y are of the same length, XYZ < YZX => Rotation starting from i is smaller. By choosing i as the winner, we chose the smaller rotation.
X > Y
In this case, as X and Y are of the same length, XYZ > YZX => Rotation starting from j is smaller. By choosing j as the winner, we chose the smaller rotation.
X = Y
Here, we chose the left index i.e. i, but we can not conclusively derive the relation between the two rotations. However, there are only three possibilities:
1. XYZ < YZX
  Rotation starting from i is smaller. By choosing i as the winner, we chose the smaller rotation.
2. XYZ = YZX
  Both the rotations are same. Therefore, choosing i or j effectively will result in the same rotation. Thus, choosing i will not give a wrong result.
3. XYZ > YZX
  This, at least on the face value, appears as if we chose the wrong rotation by choosing i over j, even though the rotation starting at j was smaller than the one starting at i. Let us examine carefully.
  Since X=Y, let us replace Y with X i.e.
  XXZ > XZX => XZ > ZX => XZX > ZXX
  Note that when X = Y, XZX is the rotation starting at j and ZXX is the one starting at l. The above derived relation implies that the rotation starting at l is further smaller than the one beginning at j. Therefore, j was not the index of smallest rotation. Even if we had chosen j over i, it would have been beaten by l at least. In other words, we did not make a wrong decision in dropping j as it was not the right answer because there exists a rotation (the one starting at l) which is smaller than that at j.
  As an example, consider comparison between indices 6 and 8 in iteration 2. Here: i=6, j=8, X=ab, Y=ab, l=2, Z=aabb, the rotation starting at 6 is “ababaabb”, and the rotation starting at 8 is “abaabbab”. even though the rotation at 8 is smaller than the one at 6, we chose 6 as winner. But this did not make any difference because the rotation starting at 2 (“aabbabab”) is further smaller than the one at 8. [Please note that here the rotation at 2 also happen to be the LLR.]

This completes the proof of the correctness of the algorithm.

References:

Following is the paper in which the algorithm was proposed:

Jean Pierre Duval (1983). “Factorizing words over an ordered alphabet”. Journal of Algorithms. Elsevier. 8 (8): 363–381. doi:10.1016/0196-6774(83)90017-2. ISSN 0196-6774.

Epilogue:

Please pardon the verbosity and oversimplification of the algorithm. This post is more pertinent to graduate students. It is not intended for a connoisseur of string-algorithms. 🙂

Cheers!