Posts tagged string

Recently, I came across a programming interview question that I didn't know how to answer. It turns out the solution was to use a suffix array. I had actually been aware of the existence of the data structure for some time, but never bothered to learn about them because they always made my head hurt.

They're actually pretty simple to describe. Consider a string $S$ of length $N.$ Then, $S$ will have $N$ suffixes, $S_0,S_1,\ldots,S_{N-1}$, where if we use $0$-based indexing, $S_i$ is just the suffix of $S$ starting at the $i$th character. Thus, we can just represent each suffix its index which, is just an integer. Sort the suffixes alphabetically. A suffix array is sorted array of suffixes of $S$. Strings can be very long, for example, a DNA sequence. Thus, we represent our suffix array as the array of integers that corresponds to each suffix.

The concept is very simple, but I always got lost on the construction. Of course, we could do a comparison sort, and so, given $N$ suffixes, we have about $O(N\log N)$ comparisons. The problem is that for each comparison, we have to compare $O(N)$ characters, so this naive problem is actually $O(N^2\log N)$ which is unacceptable for large $N.$

Suffix Array Construction

There exists $O(N)$ algorithms of constructing suffix arrays, but they are too unwieldly to code during a contest. The algorithm that I describe is $O\left(N(\log N)^2\right)$. The idea reminds me of a radix sort, where would sort a string character-by-character.

Here, we can do better than character-by-character, though, because a suffix of a suffix is just a further suffix. So, we sort by the first character. Now, say we have sorted all the suffixes by the first $k$ characters. We consider the suffix $i$ as the pair of suffixes $(i,i+k).$ The first $k$ characters of both of those suffixes are sorted, so actually we can easily sort by $2k$ characters by comparing pairs of integers. Thus, we at each step we have double the number of characters being compared, so we run a comparison sort $\lceil \log_2 N \rceil + 1$ times, which gives us a running time of complexity $O\left(N(\log N)^2\right)$.

In the title image, you can see this process in action. I thought it was pretty neat idea, so I created this data visualization, where you can construct your own suffix array from a string. Here's the an implementation in Java, where I use a custom comparator that keeps track of the rank of each suffix.

static class SuffixArray {        
    private class SuffixComparator implements Comparator<Integer> {
        private String S;
        private int N;
        private int[] rankA;
        private int[] rankB; // dummy array to avoid constantly reallocating memory
        private boolean useRankB = false;
        private int sortedCharacters = 0;

        public SuffixComparator(String S) {
            this.S = S;
            this.N = S.length();
            this.rankA = new int[N];
            this.rankB = new int[N];                
        }

        public int compare(Integer a, Integer b) {
            int[] rank = useRankB ? rankB : rankA;
            if (rank[a] != rank[b]) return rank[a] - rank[b];                
            if (sortedCharacters == 0) { // just look at first letter
                return S.charAt(a) - S.charAt(b);                    
            } else {
                // first sortedCharacters are equal
                // substring after sortedCharacters is 
                // S.substring(a + sortedCharacters) and S.substring(b + sortedCharacters)
                // these are just further suffixes
                if (a + sortedCharacters == N) {
                    return -1; // suffix a is only sortedCharacters long
                } else if (b + sortedCharacters == N) {
                    return 1; // suffix b is only sortedCharacters long
                } else { // look at sub-suffixes
                    return rank[a + sortedCharacters] - rank[b + sortedCharacters];
                }                    
            }
        }

        public int getRank(int suffixIndex) {
            return useRankB ? rankB[suffixIndex] : rankA[suffixIndex]; // return position in suffixArray
        }

        public void updateRank(Integer[] suffixArray) {
            int[] newRank = useRankB ? rankA : rankB;
            newRank[suffixArray[0]] = 0;
            for (int i = 1; i < N; ++i) { // loop over position in suffix array
                if (compare(suffixArray[i], suffixArray[i - 1]) > 0) {
                    newRank[suffixArray[i]] = newRank[suffixArray[i - 1]] + 1;
                } else {
                    newRank[suffixArray[i]] = newRank[suffixArray[i - 1]];
                }
            }
            toggleRank();                
        }

        public int incrementSortedCharacters() {
            if (sortedCharacters >= N) return sortedCharacters;
            if (sortedCharacters == 0) {
                ++sortedCharacters;
            } else {
                sortedCharacters *= 2;
            }
            return sortedCharacters;
        }

        private boolean toggleRank() {
            useRankB = !useRankB;
            return useRankB;
        }
    }

    private String S;
    private Integer[] suffixArray; // suffixArray[i] is the ith suffix lexographically
    private int N;
    private int[] rank; // rank[suffixIndex] is the position in the suffixArray

    public SuffixArray(String S) {
        this.S = S;
        this.N = S.length();
        this.suffixArray = new Integer[N];
        for (int i = 0; i < N; ++i) suffixArray[i] = i;            
        SuffixComparator suffixComparator = new SuffixComparator(S);
        do { // each iteration sorts 2^i characters
            Arrays.sort(suffixArray, suffixComparator);
            suffixComparator.updateRank(suffixArray);
        } while (suffixComparator.incrementSortedCharacters() < N);
        this.rank = new int[N];
        for (int i = 0; i < N; ++i) rank[i] = suffixComparator.getRank(i);
    }

    public int getSuffixIndex(int i) {
        return suffixArray[i];
    }

    public String getSuffix(int i) {
        return S.substring(suffixArray[i]);
    }

    public int getRank(int suffixIndex) {
        return rank[suffixIndex];
    }

    public int size() {
        return N;
    }

    public String getString() {
        return this.S;
    }

    @Override
    public String toString() {
        StringBuilder stringBuilder = new StringBuilder();
        if (N == 0) return "empty suffix array";
        stringBuilder.append(getSuffix(0));
        for (int i = 1; i < N; ++i) {
            stringBuilder.append('\n');
            stringBuilder.append(getSuffix(i));                
        }
        return stringBuilder.toString();
    }
}

LCP Arrays

For many applications of a suffix array, an auxiliary data structure called an LCP array is needed. LCP stands for longest common prefix. Let $SA_i$ be the $i$th suffix of the suffix array, so it is the suffix of rank $i$ when the suffixes are sorted alphabetically. The LCP array will also have length $N,$ with entries $LCP_i.$ $LCP_0$ is undefined, while for $i > 0,$ $LCP_i$ is the length of longest common prefix of $SA_i$ and $SA_{i-1}.$

Again, a naive computation would be $O(N^2),$ but we can use the fact that if we remove the first character from a suffix, we have a prefix of another suffix. So, start with the longest suffix, that is, $S$, itself. Compute the longest common prefix with the suffix that occurs before it in the suffix array. That is, suppose we know the longest common prefix of $SA_i$ and $SA_{i-1}$ has length $k > 0.$ Now, remove the first character from $SA_i$ to get $SA_j.$ The prefix of $SA_{j-1}$ will share at least $k - 1$ characters with $SA_j$ because in the very worst case $SA_{j-1}$ will be be $SA_{i-1}$ with the first character removed. We handle the case where $i = 0$ or $k = 0,$ specially. Here is the Java code.

static class LCPArray { // LCP stands for longest common prefix
    private int N;
    private String S;
    private SuffixArray suffixArray;
    private int[] lCP; 
    // lCP[i] is undefined for i = 0, i is position in suffixArray
    // for i > 0, it contains the length of the longest common prefix
    // of S.substring(suffixArray[i]) and S.substring(suffixArray[i-1])

    public LCPArray(String S) {
        this(new SuffixArray(S));            
    }

    public LCPArray(SuffixArray suffixArray) {
        this.suffixArray = suffixArray;
        this.S = suffixArray.getString();
        this.N = suffixArray.size();
        this.lCP = new int[N];
        if (N == 0) return;
        lCP[0] = -1; // undefined for 0
        if (N == 1) return;
        int currentLCP = 0;                            
        for (int suffixIndex = 0; suffixIndex < N; ++suffixIndex) {
            int rank = suffixArray.getRank(suffixIndex);
            if (rank == 0) continue;
            int previousSuffixIndex = suffixArray.getSuffixIndex(suffixArray.getRank(suffixIndex) - 1);
            // loop invariant is that the current suffix and previous suffix have longest common prefix
            // of at least currentLCP
            while (Math.max(suffixIndex, previousSuffixIndex) + currentLCP < N 
                   && S.charAt(suffixIndex + currentLCP) == S.charAt(previousSuffixIndex + currentLCP)) {
                ++currentLCP;             
            } 
            lCP[rank] = currentLCP;
            // we only remove one character off the front each time, so suffix length decreases by at most 1
            if (currentLCP > 0) --currentLCP;
        }
    }        

    public int size() {
        return this.N;
    }

    public int get(int i) {
        return this.lCP[i];
    }        

    public SuffixArray getSuffixArray() {
        return this.suffixArray;
    }
}

Case Study

The problem that required me to learn these two data structures was String Similarity. Here's the problem statement.

For two strings $A$ and $B$, we define the similarity of the strings to be the length of the longest prefix common to both strings. For example, the similarity of strings "abc" and "abd" is 2, while the similarity of strings "aaa" and "aaab" is 3.

Calculate the sum of similarities of a string $S$ with each of it's suffixes.

So, we find $S$ in our suffix array. Say $S = SA_i.$ The two suffixes with longest common prefixes of $S$ will be $SA_{i-1}$ and $SA_{i+1},$ which we've already calculated in $LCP_{i}$ and $LCP_{i+1},$ respectively.

The next two suffixes with longest common prefixes will be $LCP_{i-2}$ and $LCP_{i+2}.$ Now, we have to be careful because we could run into a situation like this. Consider $S = aabbabab.$ Here are the first 3 entries of the suffix array.

aabbabab
ab
abab

The longest common prefix of $S_0$ with $S_2$ is not $LCP_{2},$ but $\min\left(LCP_1,LCP_2\right),$ in general the longest common prefix of $SA_{i}$ with $SA_{j},$ where $i < j$ is $\min\left(LCP_{i+1},LCP_{i+2},\ldots,LCP_j\right).$ Thus, with our two data structures, the solution is quite short.

import java.io.*;
import java.util.*;

public class Solution {

    static class SuffixArray {
        ...
    }

    static class LCPArray { // LCP stands for longest common prefix
        ...
    }

    static long sumSuffixSimilarities(String S) {
        int N = S.length();
        LCPArray lCPArray = new LCPArray(S);
        SuffixArray suffixArray = lCPArray.getSuffixArray();
        long similaritySum = N; // initialize with string itself
        int pivot = suffixArray.getRank(0); // the longest matching prefixes will be adjacent
        int minMatchLength = Integer.MAX_VALUE; // afterwards, the length will be decreasing
        for (int i = pivot; i >= 0 && lCPArray.get(i) > 0; --i) {
            minMatchLength = Math.min(lCPArray.get(i), minMatchLength);
            similaritySum += minMatchLength;
        }
        minMatchLength = Integer.MAX_VALUE;
        for (int i = pivot + 1; i < N && lCPArray.get(i) > 0; ++i) {
            minMatchLength = Math.min(lCPArray.get(i), minMatchLength);
            similaritySum += minMatchLength;
        }
        return similaritySum;
    }    

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        PrintWriter out = new PrintWriter(new BufferedWriter(new OutputStreamWriter(System.out)));
        int T = Integer.parseInt(in.readLine());
        for (int t = 0; t < T; ++t) {
            String S = in.readLine();
            out.println(sumSuffixSimilarities(S));            
        }
        in.close();
        out.close();
    }
}

Tags: algorithm, hackerrank, string

About Me

Posts tagged string

Suffix Arrays and LCP Arrays

Suffix Array Construction

LCP Arrays

Case Study