Infection

URL: https://github.com/ppham27/infection

Infection was an assignment when I was interviewing for a data scientist position at the Khan Academy. The idea behind it is that their product is used in the classroom, so instead of A/B testing users, they are A/B testing groups of users. Thus, we can see students and coaches as directed graph, where we draw an edge from each student to a coach. We call the groups of users that are testing the new feature infected. Thus, we want to optimally assign users to the infection group, hence the name.

For extra credit, I chose to make a visualization. This particular dataset clearly shows why a greedy algorithm that picks the largest group of users would fail. We avoid the largest component and get closer to the limit of 26 by infecting many smaller groups.

I saw this as a knapsack problem and solved it with dynamic programming. Our knapsack is the number of users that we want to infect. Our items are connected components of users and have both size and value equal to the number of users. If we want to infect as many users as possible without going over a limit, this is analogous to filling our knapsack to limit the leftover space in our knapsack. The core of deciding which components is done by selectComponents:

/**
 * Fill a knapsack of size limit with components.
 * @param componentSize a map of components and their sizes
 * @param limit the size of our knapsack
 * @return a set of the components that we select
 */
public static Set<String> selectComponents(Map<String, Integer> componentSize, int limit) {
    class State {
        int score;
        Set<String> componentsUsed;
        public State() {
            this.score = 0; this.componentsUsed = new TreeSet<String>();
        }
        public State(int score, Set<String> componentsUsed) {
            this.score = score;
            this.componentsUsed = new TreeSet<String>();
            this.componentsUsed.addAll(componentsUsed);
        } 
    }        
    // solve knapsack problem with dynamic programming O(limit*numComponents)
    State[] bestState = new State[limit + 1];
    for (int i = 0; i <= limit; ++i) { bestState[i] = new State(); }
    for (Map.Entry<String, Integer> e : componentSize.entrySet()) {
        String id = e.getKey();
        int size = e.getValue();
        for (int s = limit; s >= size; --s) {
            if (bestState[s].score < bestState[s-size].score + size || 
                    (bestState[s].score == bestState[s-size].score + size && 
                    bestState[s].componentsUsed.size() > bestState[s-size].componentsUsed.size() + 1)) {
                // break ties by choosing the set with less components
                bestState[s] = new State(bestState[s-size].score + size, bestState[s-size].componentsUsed);
                bestState[s].componentsUsed.add(id);
            }
        }
    }
    return bestState[limit].componentsUsed;
}

This is the classic dynamic programming solution to the knapsack problem which has runtime $O(N \cdot M)$, where $N$ is the size of our knapsack and $M$ is the number of items we have. It is an exact solution. Typically, the number of users in an A/B test is very large, so this solution probably wouldn't scale. The knapsack problem is NP-hard, so in real life, we would probably settle for a heuristic solution.

The idea behind this algorithm is that given component $L$ has size $s$, we can fit it into any knapsack that has size $S \geq s$. Our invariant is that we know the best way to fill the knapack with the previous $L - 1$ components. We try to do better with the $L$th component. bestState[i] refers to a knapsack of size $i$, so it will have a score less or equal to $i$ and contain the components used to obtain that score. The score is simply the total size of all those items. And so, items in bestState[i] along with item $L$ will fit into a knapsack of size $i + s$. Once the components are selected, we can infect all the users in each selected component with a simple breadth-first search (BFS).

I enjoyed this project mainly because I like algorithms and it shows how algorithms can solve a very concrete and real problem. This was one of the first times that I used Maven, too, for builds and running tests. Overall, it was very pleasant experience, but I don't quite understand the plugin system fully, yet. If anyone is thinking of using this, no I didn't get the job, much to my disappointment. I did make it to the fourth and final round, however.

About Me

Infection