Continuing with the cooking for a 7-year-old series, I made mozzarella sticks tonight. Learning to cook for my brother, I've developed the skill set to be the dad of a picky eater (hint, hint).
I made my own tomato sauce and triple breaded these mozzarella sticks: flour, egg, flour, egg, and bread crumbs. Next, I deep fried them in a wok. I thought that they came out pretty good. They are definitely better than North Penn High School max sticks in any case.
In other news, school hasn't been too busy so far, which has given me some time to get back in the gym. I've also been sleeping a lot, perhaps too much. Sometimes I wonder if I'm running away from life by sleeping. I have actually been somewhat social, though. My cousin had an engagement party last weekend, and I've been spending a lot of time with roommates.
For me, cooking is a hobby that serves many purposes. It's a way to relax, and it can be pure fun to make something new. I also use it to show my love for others, for I've never been good with words. Lately, I've been cooking as a way to challenge myself: I like to attempt to make people's favorite foods. Perhaps, there's a people-pleaser part of me, too, that seeks approval.
In any case, since Philadelphia is still a relatively new city for me, I don't have many people to cook for except my brother, Christopher Pham. For those of you who know him, he eats simple foods that you would expect a 7-year-old child to enjoy. That leaves me cooking decidedly non-Paleo fare. A couple of weeks ago, I made orange sherbet, for instance.
One of his favorite foods is macaroni and cheese. This particular version is extremely creamy, featuring lots of butter, evaporated milk, and a mix of cheddar and gruyère cheese. To add some crispiness, I topped it with toasted bread crumbs. Thankfully, he thought it was pretty good, perhaps slightly better than his usual Kraft variety.
After some more iterations, I've settled on this recipe, The Best Macaroni and Cheese. It's not quite the best when followed literally, though. First, I use a blend of 1/2 cheddar and 1/2 Monterey Jack cheese. Also, the baking time needs to be modified. Proceed with the first 5 minutes according to the recipe. After the second 5 minutes, add the remaining milk and cheese. It's done after another 5 minutes. All in all, the baking time is cut in half.
I was recently doing a problem that involved depth-first search to find components of a graph. For me, the most natural way to write this is recursion. It's pretty simple:
void floodFill(int vertex, int currentComponent, const vector<set<int>> &edgeList,
vector<int> &component, vector<vector<int>> &componentSets) {
component[vertex] = currentComponent;
componentSets.back().push_back(vertex);
for (int i : edgeList[vertex]) {
if (component[i] == -1) {
floodFill(i, currentComponent, edgeList, component, componentSets);
}
}
}
vector<vector<int>> findComponents(const vector<set<int>> &edgeList) {
int N = edgeList.size();
vector<int> component(N, -1);
vector<vector<int>> componentSets;
int currentComponent = 0;
for (int i = 0; i < N; ++i) {
if (component[i] == -1) {
componentSets.emplace_back();
floodFill(i, currentComponent++, edgeList, component, componentSets);
}
}
return componentSets;
}
component and componentSets are shared state variables, so we pass them by reference. It seems simple enough, right? Unfortunately, the stack size is very limited on OS X. Trying to increase the stack size, I get
$ ulimit -s 65536
-bash: ulimit: stack size: cannot modify limit: Operation not permitted
The most that I can increase the stack size to with ulimit is a mere 8192 KB, which only works on a graph with fewer than 15,000 nodes.
Now, the easiest solution is to increase the stack size with gcc like this:
$ g++ -Wl,-stack_size,0x4000000 -stdlib=libc++ -std=c++0x D.cpp
We specify the stack size in bytes, so 0x4000000 is $4 \times 16^6 = 67108864$, which is 64 MB.
I want to cover a better solution that works everywhere, though. We can use an explicit stack for recursion:
vector<vector<int>> findComponents(const vector<set<int>> &edgeList) {
int N = edgeList.size();
vector<int> component(N, -1);
vector<vector<int>> componentSets;
int currentComponent = 0;
for (int i = 0; i < N; ++i) {
if (component[i] == -1) {
componentSets.emplace_back();
stack<int> s;
s.push(i);
component[i] = currentComponent;
componentSets.back().push_back(i);
while (!s.empty()) {
int vertex = s.top(); s.pop();
for (int j : edgeList[vertex]) {
if (component[j] == -1) {
component[j] = currentComponent;
componentSets.back().push_back(j);
s.push(j);
}
}
}
++currentComponent;
}
}
return componentSets;
}
This solution is actually more memory efficient since the implicit stack in the recursive solution contains the function state, which includes 5 arguments and other local variables. According to the online judge, the explicit stack-based solution uses only about two-thirds of the memory of the recursive solution.
However, the astute reader may note that we aren't quite doing the same thing. For example, in the explicit stack version, we assign the component before we push the vertex on the stack, whereas in the recursive version, we assign the component after we retrieve the vertex from the stack. This is necessary to avoid adding a vertex to the stack twice. Consider the undirected graph with the edges $\{(0,1), (0,2), (1,2)\}$.
- First, we push $0$ onto the stack.
- We pop $0$ and check its edges, and push $1$ and $2$ onto the stack.
- Stacks are last in, first out (LIFO), so we pop $2$ off the stack. Now, we check the edges of $2$. That includes an edge to $1$. If we had pushed $1$ onto the stack without yet assigning it to a component, we would push $1$ onto the stack again.
We're also visiting the nodes in a different order. In the recursive solution, we would visit an unassigned vertex as soon as we see it, so we would have $0 \rightarrow 1 \rightarrow 2$, but in this explicit stack version, we get $0 \rightarrow 2 \rightarrow 1$.
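To make the difference in visiting order concrete, here's a tiny standalone sketch (not part of the solution above) that runs the push-based traversal on the triangle graph from the example and prints each vertex as it's popped. It prints 0 2 1, whereas the recursive flood fill would visit 0, 1, 2.

#include <iostream>
#include <set>
#include <stack>
#include <vector>
using namespace std;

int main() {
  // the undirected graph with edges {(0,1), (0,2), (1,2)} from the example above
  vector<set<int>> edgeList{{1, 2}, {0, 2}, {0, 1}};
  vector<int> component(edgeList.size(), -1);
  stack<int> s;
  s.push(0);
  component[0] = 0;
  while (!s.empty()) {
    int vertex = s.top(); s.pop();
    cout << vertex << ' ';    // pop (visit) order: 0 2 1
    for (int j : edgeList[vertex]) {
      if (component[j] == -1) {
        component[j] = 0;     // assign at push time so nothing is pushed twice
        s.push(j);
      }
    }
  }
  cout << endl;
  return 0;
}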
An explicit stack version that better simulates what the recursive solution does is
vector<vector<int>> findComponents(const vector<set<int>> &edgeList) {
int N = edgeList.size();
vector<int> component(N, -1);
vector<vector<int>> componentSets;
int currentComponent = 0;
int iterations = 0;
for (int i = 0; i < N; ++i) {
if (component[i] == -1) {
componentSets.emplace_back();
stack<tuple<int, set<int>::iterator>> s;
s.emplace(i, edgeList[i].begin());
while (!s.empty()) {
// important to assign by reference
tuple<int, set<int>::iterator> &state = s.top();
int vertex = get<0>(state);
set<int>::iterator &edgeIt = get<1>(state);
if (component[vertex] == -1) { // first time visiting
component[vertex] = currentComponent;
componentSets.back().push_back(vertex);
}
// if there is nothing else, pop
bool shouldPop = true;
while (edgeIt != edgeList[vertex].end()) {
int j = *edgeIt; ++edgeIt;
if (component[j] == -1) {
s.emplace(j, edgeList[j].begin());
shouldPop = false; // something new is on stack, do not pop it
break; // immediately put it on the stack, break, and visit next vertex
}
}
if (shouldPop) s.pop();
}
++currentComponent;
}
}
return componentSets;
}
Now, we assign the component after getting the vertex off the stack, and when we see an unassigned vertex, we immediately put it on top of the stack instead of adding all unassigned vertices first. This uses slightly more memory because the stack now has to keep track of which edges have been visited, so the state includes a set<int>::iterator. Despite this explicit solution being more similar to recursion, I'd be hesitant to recommend this code because of how convoluted it is. You need to be very careful with references. Also, in more complicated cases, your state tuple will contain more than two variables, requiring you to keep track of more things and making it more likely that you will make a mistake.
Recently, I ran into a problem that forced me to recall a lot of old knowledge and use it in new ways, President and Roads. The problem is this:
We have $V$ vertices and $E$ edges. $2 \leq V \leq 10^5$, and $1 \leq E \leq 10^5$. Index these vertices $0, 1, \ldots, V - 1$. Edges are directed and weighted, and thus, can be written as $e_i = (a_i,b_i,l_i)$, where an edge goes from vertex $a_i$ to vertex $b_i$ with length $l_i$, where $1 \leq l_i \leq 10^6$. We're given a source $s$ and a target $t$. For each edge $e_i$, we ask:
If we take the shortest path from $s$ to $t$, will we always travel along $e_i$? If not, can we decrease $l_i$ to a positive integer so that we will always travel along $e_i$? If so, how much do we need to decrease $l_i$?
Shortest Path
Obviously, we need to find the shortest path. My first thought was to use Floyd-Warshall, since I quickly realized that just the distances from $s$ wouldn't be enough. Computing the shortest path for all pairs is overkill, however. Let $d(a,b)$ be the distance between two vertices. An edge $e_i = (a_i,b_i,l_i)$ will belong to a shortest path if $$d(s,a_i) + l_i + d(b_i,t)$$ equals the length of the shortest path. Thus, we only need the distances from $s$ and to $t$. We can find both by running Dijkstra's algorithm twice. To find the distances to $t$, we'll need to reverse the edges and pretend $t$ is our source. There's still the problem that in the simplest implementation of Dijkstra's, you need to find the vertex of minimum distance each time by linear search, which leads to an $O(V^2)$ algorithm. With a little bit of work, we can bring it down to $O\left((E + V)\log V\right)$ with a priority queue. Unfortunately, the built-in C++ and Java implementations do not support the decrease key operation. That leaves us with two options:
- The easiest is to reinsert a vertex into the priority queue, and keep track of already visited vertices. If we see the vertex a second time, throw it away.
- Implement our own priority queue.
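As an aside, the first option is easy to sketch with the standard library. Here's a minimal, illustrative version using std::priority_queue with lazy deletion; it assumes a plain adjacency list of (vertex, length) pairs rather than the unordered_map-based edge list used in the actual solution below, and the name dijkstraLazy is just for this sketch.

#include <climits>
#include <functional>
#include <queue>
#include <utility>
#include <vector>
using namespace std;

vector<long long> dijkstraLazy(int s, const vector<vector<pair<int, int>>> &adj) {
  int N = adj.size();
  vector<long long> dist(N, LLONG_MAX/3);
  // min-heap of (distance, vertex) pairs
  priority_queue<pair<long long, int>,
                 vector<pair<long long, int>>,
                 greater<pair<long long, int>>> pq;
  dist[s] = 0;
  pq.emplace(0, s);
  while (!pq.empty()) {
    pair<long long, int> entry = pq.top(); pq.pop();
    long long d = entry.first; int u = entry.second;
    if (d > dist[u]) continue;  // stale entry; u was reinserted with a smaller key
    for (const pair<int, int> &edge : adj[u]) {
      int v = edge.first, length = edge.second;
      if (dist[u] + length < dist[v]) {
        dist[v] = dist[u] + length;
        pq.emplace(dist[v], v); // reinsert instead of decreasing the key
      }
    }
  }
  return dist;
}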
Of course, I chose to implement my own priority queue based on a binary heap with an underlying hash map to support the decrease key operation. If I were really ambitious, I'd implement a Fibonacci heap to achieve $O\left(E + V\log V\right)$ running time. See my C++ implementation below.
I use it in Dijkstra's algorithm here:
vector<pair<long long, unordered_set<int>>> computeShortestPath(int s, const vector<unordered_map<int, pair<int, int>>> &edgeList) {
int N = edgeList.size();
vector<pair<long long, unordered_set<int>>> distance(N, make_pair(LLONG_MAX/3, unordered_set<int>()));
phillypham::priority_queue pq(N);
distance[s].first = 0;
pq.push(s, 0);
while (!pq.empty()) {
int currentVertex = pq.top(); pq.pop();
long long currentDistance = distance[currentVertex].first;
// relax step
for (pair<int, pair<int, int>> entry : edgeList[currentVertex]) {
int nextVertex = entry.first; int length = entry.second.first;
long long newDistance = currentDistance + length;
if (distance[nextVertex].first > newDistance) {
distance[nextVertex].first = newDistance;
distance[nextVertex].second.clear();
distance[nextVertex].second.insert(currentVertex);
if (pq.count(nextVertex)) {
pq.decrease_key(nextVertex, newDistance);
} else {
pq.push(nextVertex, newDistance);
}
} else if (distance[nextVertex].first == newDistance) {
distance[nextVertex].second.insert(currentVertex);
}
}
}
return distance;
}
computeShortestPath returns a vector of pairs, where each pair consists of the distance from $s$ and the possible previous vertices in a shortest path. Thus, I've basically recreated the graph, including only the edges that are involved in some shortest path.
Counting Paths
The next task is to figure out which edges are necessary. There could be multiple shortest paths, so if the shortest path has length $L$, given edge $e_i = (a_i, b_i, l_i)$, the condition $$d(s,a_i) + l_i + d(b_i,t) = L$$ is necessary but not sufficient. First, we need to make sure $e_i$ is distinct as there are no restrictions on duplicate edges. Secondly, every path must pass through $a_i$ and $b_i$. And finally, we need to make sure that $\left|P(b_i)\right| = 1$, where $P(b_i)$ is the set of vertices that come directly before vertex $b_i$ on a shortest path because you could imagine something like this scenario:
The shortest path is of length 4, and we have two options: $s \rightarrow a \rightarrow c \rightarrow b \rightarrow t$ and $s \rightarrow a \rightarrow b \rightarrow t$. Every shortest path goes through $a$ and $b$, but the actual edge between $a$ and $b$ is not necessary to achieve the shortest path. Since we already computed the parents of $b_i$ when running computeShortestPath, it's simple to check that $b_i$ only has one parent.
Now, the only thing left to do is count the number of shortest paths that go through each vertex. For any vertex $v$, let $w$ be the number of ways to reach $v$ from $s$ and $x$ be the number of ways to reach $t$ from $v$. Then, the number of paths that pass through $v$ is $wx$. Now, one could use something like depth-first search to compute $w_j$ for every vertex $j$. We'd start at $s$, take every path, and increment $w_j$ every time that we encounter $j$. This is too slow, though, as we'll visit every node multiple times. If there are $10^9$ paths to $t$, we'll traverse all $10^9$ of them.
We can solve this problem by using a priority queue again, with a similar idea to Dijkstra's. The idea is that for a particular vertex $v$, $$w_v = \sum_{u \in P(v)} w_u,$$ where you may recall that $P(v)$ consists of all the parents of $v$. So, we just need to calculate the number of paths from $s$ to the parents first. Since every edge has positive length, all the parents will have a shorter distance from $s$; thus, we can use a priority queue to get the vertices in increasing order of distance from $s$. Every time we visit a node, we'll update all its children, and we'll never have to visit that node again, so we can compute the number of paths in $O\left((E + V)\log V\right)$ time. See the code:
const int MOD_A = 1000000207;
const int MOD_B = 1000000007;
const int MOD_C = 999999733;
tuple<int, int, int> addPathTuples(tuple<int, int, int> a, tuple<int, int, int> b) {
return make_tuple((get<0>(a) + get<0>(b)) % MOD_A, (get<1>(a) + get<1>(b)) % MOD_B, (get<2>(a) + get<2>(b)) % MOD_C);
}
tuple<int, int, int> multiplyPathTuples(tuple<int, int, int> a, tuple<int, int, int> b) {
long long a0 = get<0>(a), a1 = get<1>(a), a2 = get<2>(a);
return make_tuple((a0*get<0>(b)) % MOD_A, (a1*get<1>(b)) % MOD_B, (a2*get<2>(b)) % MOD_C);
}
vector<tuple<int, int, int>> countPaths(int s,
const vector<pair<long long, unordered_set<int>>> &distances,
const vector<pair<long long, unordered_set<int>>> &children) {
// assume only edges that make shortest paths are included
int N = children.size();
vector<tuple<int, int, int>> pathCounts(N, make_tuple(0, 0, 0)); // store as tuple, basically modular hash function
pathCounts[s] = make_tuple(1, 1, 1);
phillypham::priority_queue pq(N);
pq.push(s, 0);
while (!pq.empty()) {
int currentVertex = pq.top(); pq.pop();
for (int nextVertex : children[currentVertex].second) {
pathCounts[nextVertex] = addPathTuples(pathCounts[currentVertex], pathCounts[nextVertex]);
if (!pq.count(nextVertex)) {
pq.push(nextVertex, distances[nextVertex].first);
}
}
}
return pathCounts;
}
$x$, the number of paths to $t$ from $v$, can be computed in much the same way by reversing the edges and starting at $t$.
Now, you might notice that instead of returning the number of paths, I return a vector of tuple<int, int, int>. This is because the number of paths is huge, exceeding what can fit in an unsigned long long, which caps out at $2^{64} - 1 = 18446744073709551615$. We don't actually care about the exact number of paths, though, just whether the number of paths through a vertex equals the total number of paths. Thus, we can hash the path counts. I chose to hash them by taking the count modulo three large prime numbers close to $10^9$. See the complete solution below.
Hashing and Chinese Remainder Theorem
At one level, doing this seems somewhat wrong. If there is a collision, I'll get the wrong answer, so I decided to do some analysis on the probability of a collision. It occurred to me that hashing is actually just an equivalence relation, where two keys will belong to the same bucket if they are equivalent, so buckets are equivalence classes.
Now, I choose 3 large primes $p_1 = 1000000207$, $p_2 = 1000000007$, and $p_3 = 999999733$, each around $10^9$. We'll run into an error if we get two numbers $x$ and $y$ such that \begin{align*} x &\equiv y \bmod p_1 \\ x &\equiv y \bmod p_2 \\ x &\equiv y \bmod p_3 \end{align*} since $x$ and $y$ will have the same hash. Fix $y$. Let $N = p_1p_2p_3$, $y \equiv a_1 \bmod p_1$, $y \equiv a_2 \bmod p_2$, and $y \equiv a_3 \bmod p_3$. Then, by the Chinese Remainder Theorem, for $x$ to have the same hash, we must have $$ x \equiv a_1\left(\frac{N}{p_1}\right)^{-1}_{p_1}\frac{N}{p_1} + a_2\left(\frac{N}{p_2}\right)^{-1}_{p_2}\frac{N}{p_2} + a_3\left(\frac{N}{p_3}\right)^{-1}_{p_3}\frac{N}{p_3} \bmod N,$$ where $$\left(\frac{N}{p_i}\right)^{-1}_{p_i}$$ is the modular inverse with respect to $p_i$, which we can find by the extended Euclidean algorithm as mentioned here.
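As a side note, here's a small sketch, separate from the actual solution, of how these modular inverses can be computed with the extended Euclidean algorithm. Each $N/p_i$ is the product of the other two primes and fits comfortably in 64 bits, so there's no overflow to worry about; the output can be checked against the values quoted below.

#include <cstdint>
#include <iostream>

// extended Euclidean algorithm: returns gcd(a, b) and sets x, y so that a*x + b*y = gcd(a, b)
int64_t extendedGcd(int64_t a, int64_t b, int64_t &x, int64_t &y) {
  if (b == 0) { x = 1; y = 0; return a; }
  int64_t x1, y1;
  int64_t g = extendedGcd(b, a % b, x1, y1);
  x = y1;
  y = x1 - (a/b)*y1;
  return g;
}

int64_t inverseMod(int64_t a, int64_t m) {
  int64_t x, y;
  extendedGcd(((a % m) + m) % m, m, x, y);
  return ((x % m) + m) % m;
}

int main() {
  const int64_t p1 = 1000000207, p2 = 1000000007, p3 = 999999733;
  // (N/p_i)^{-1} mod p_i, where N/p_i is the product of the other two primes
  std::cout << inverseMod(p2*p3 % p1, p1) << '\n'
            << inverseMod(p1*p3 % p2, p2) << '\n'
            << inverseMod(p1*p2 % p3, p3) << '\n';
  return 0;
}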
We'll have that \begin{align*} \left(\frac{N}{p_1}\right)^{-1}_{p_1} &\equiv 812837721 \bmod p_1 \\ \left(\frac{N}{p_2}\right)^{-1}_{p_2} &\equiv 57354015 \bmod p_2 \\ \left(\frac{N}{p_3}\right)^{-1}_{p_3} &\equiv 129808398 \bmod p_3, \end{align*} so if $a_1 = 10$, $a_2 = 20$, and $a_3 = 30$, then the smallest such $x$ that would be equivalent to $y$ is $x = 169708790167673569857984349$. In general, \begin{align*} x &= 169708790167673569857984349 + Nk \\ &= 169708790167673569857984349 + 999999946999944310999613117k, \end{align*} for some $k \in \mathbb{Z}$, which leaves us with only an approximately $10^{-27}$ chance of being wrong. Thus, for all intents and purposes, our algorithm will never be wrong.
Binary Heap Priority Queue
namespace phillypham {
class priority_queue {
private:
int keysSize;
vector<int> keys;
vector<long long> values;
unordered_map<int, int> keyToIdx;
int parent(int idx) {
return (idx + 1)/2 - 1;
}
int left(int idx) {
return 2*(idx+1) - 1;
}
int right(int idx) {
return 2*(idx+1);
}
void heap_swap(int i, int j) {
if (i != j) {
keyToIdx[keys[j]] = i;
keyToIdx[keys[i]] = j;
swap(values[j], values[i]);
swap(keys[j], keys[i]);
}
}
void max_heapify(int idx) {
int lIdx = left(idx);
int rIdx = right(idx);
int smallestIdx = idx;
if (lIdx < keysSize && values[lIdx] < values[smallestIdx]) {
smallestIdx = lIdx;
}
if (rIdx < keysSize && values[rIdx] < values[smallestIdx]) {
smallestIdx = rIdx;
}
if (smallestIdx != idx) {
heap_swap(smallestIdx, idx);
max_heapify(smallestIdx);
}
}
void min_heapify(int idx) {
while (idx > 0 && values[parent(idx)] > values[idx]) {
heap_swap(parent(idx), idx);
idx = parent(idx);
}
}
public:
priority_queue(int N) {
keysSize = 0;
keys.clear(); keys.reserve(N);
values.clear(); values.reserve(N);
keyToIdx.clear(); keyToIdx.reserve(N);
}
void push(int key, long long value) {
// if (keyToIdx.count(key)) throw logic_error("key " + ::to_string(key) + " already exists");
int idx = keysSize; ++keysSize;
if (keysSize > keys.size()) {
keys.push_back(key);
values.push_back(value);
} else {
keys[idx] = key;
values[idx] = value;
}
keyToIdx[key] = idx;
min_heapify(idx);
}
void increase_key(int key, long long value) {
// if (!keyToIdx.count(key)) throw logic_error("key " + ::to_string(key) + " does not exist");
// if (values[keyToIdx[key]] > value) throw logic_error("value " + ::to_string(value) + " is not an increase");
values[keyToIdx[key]] = value;
max_heapify(keyToIdx[key]);
}
void decrease_key(int key, long long value) {
// if (!keyToIdx.count(key)) throw logic_error("key " + ::to_string(key) + " does not exist");
// if (values[keyToIdx[key]] < value) throw logic_error("value " + ::to_string(value) + " is not a decrease");
values[keyToIdx[key]] = value;
min_heapify(keyToIdx[key]);
}
void pop() {
if (keysSize > 0) {
heap_swap(0, --keysSize);
keyToIdx.erase(keys[keysSize]);
if (keysSize > 0) max_heapify(0);
} else {
throw logic_error("priority queue is empty");
}
}
int top() {
if (keysSize > 0) {
return keys.front();
} else {
throw logic_error("priority queue is empty");
}
}
int size() {
return keysSize;
}
bool empty() {
return keysSize == 0;
}
long long at(int key) {
return values[keyToIdx.at(key)];
}
int count(int key) {
return keyToIdx.count(key);
}
string to_string() {
ostringstream out;
copy(keys.begin(), keys.begin() + keysSize, ostream_iterator<int>(out, " "));
out << '\n';
copy(values.begin(), values.begin() + keysSize, ostream_iterator<int>(out, " "));
return out.str();
}
};
}
Solution
int main(int argc, char *argv[]) {
ios::sync_with_stdio(false); cin.tie(NULL);
int N, M, s, t; // number of nodes, edges, source, and target
cin >> N >> M >> s >> t;
--s; --t; // 0 indexing
vector<tuple<int, int, int>> edges;
vector<unordered_map<int, pair<int, int>>> edgeList(N);
vector<unordered_map<int, pair<int, int>>> reverseEdgeList(N);
for (int m = 0; m < M; ++m) { // read in edges
int a, b, l;
cin >> a >> b >> l;
--a; --b;
edges.emplace_back(a, b, l);
if (!edgeList[a].count(b) || edgeList[a][b].first > l) {
edgeList[a][b] = make_pair(l, 1);
reverseEdgeList[b][a] = make_pair(l, 1);
} else if (edgeList[a][b].first == l) {
++edgeList[a][b].second;
++reverseEdgeList[b][a].second;
}
}
vector<pair<long long, unordered_set<int>>> distanceFromSource = computeShortestPath(s, edgeList);
vector<pair<long long, unordered_set<int>>> distanceFromTarget = computeShortestPath(t, reverseEdgeList);
vector<tuple<int, int, int>> pathCounts = countPaths(s, distanceFromSource, distanceFromTarget);
vector<tuple<int, int, int>> backPathCounts = countPaths(t, distanceFromTarget, distanceFromSource);
for (int i = 0; i < N; ++i) pathCounts[i] = multiplyPathTuples(pathCounts[i], backPathCounts[i]);
long long shortestDistance = distanceFromSource[t].first;
tuple<int, int, int> shortestPaths = pathCounts[s];
for (tuple<int, int, int> edge : edges) {
long long pathDistance = distanceFromSource[get<0>(edge)].first + get<2>(edge) + distanceFromTarget[get<1>(edge)].first;
if (pathDistance == shortestDistance && // path is shortest
pathCounts[get<0>(edge)] == shortestPaths && // every path goes through from node
pathCounts[get<1>(edge)] == shortestPaths && // every path goes through to node
distanceFromSource[get<1>(edge)].second.size() == 1 && // only paths come from the from node
edgeList[get<0>(edge)][get<1>(edge)].second == 1) {
cout << "YES" << '\n';
} else if (pathDistance - get<2>(edge) + 1 < shortestDistance) {
cout << "CAN" << ' ' << (pathDistance - shortestDistance) + 1 << '\n';
} else {
// which will happen if the city is unconnected since the edge won't be big enough to overcome the big distance
cout << "NO" << '\n';
}
}
cout << flush;
return 0;
}
One of my greatest achievements at Duke was getting an A+ in COMPSCI 100E (I think it is 201, nowadays?). It's a bit sad how little I have to be proud of, but my only other A+ came from an intro physics course unless you count study abroad. I'd also like to make a shoutout to my fellow math major Robert Won, who took COMPSCI 100E with me.
I did all the extra credit and every single Algorithmic Problem-solving Testing (APT) for full credit. I remember the hardest one being APT CountPaths. I remember neglecting all my other coursework and spending several days on this problem, which apparently isn't supposed to be that hard. I recently encountered a variation of this problem again, here, so I figured that I would spend some time writing up a solution. My original solution that I wrote back at Duke is a whopping 257 lines of Java. My current solution is fewer than 100 lines of C++, including comments.
Beginner Version
In the most basic version of the problem, we're trying to get from one corner to the other corner.
Going from point A to B, only down and right moves are permitted. If the grid has $R$ rows and $C$ columns, we need to make $R-1$ down moves and $C-1$ right moves. The order doesn't matter, and any down move is indistinguishable from another down move. Thus, the number of paths is $$\boxed{\frac{(R + C - 2)!}{(R-1)!(C-1)!} = {{R + C - 2}\choose{R-1}}}.$$
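As a quick sanity check, here's a small sketch (not from the original problem) that evaluates this formula directly; it only works for small grids where the count fits in 64 bits, since there's no modular arithmetic yet.

#include <iostream>
using namespace std;

// number of monotone lattice paths in an R x C grid: C(R + C - 2, R - 1)
long long countLatticePaths(int R, int C) {
  long long result = 1;
  int n = R + C - 2, k = R - 1;
  for (int i = 1; i <= k; ++i) {
    result = result*(n - k + i)/i; // stays an exact integer at every step
  }
  return result;
}

int main() {
  cout << countLatticePaths(3, 3) << '\n'; // 6: the 2 downs and 2 rights can be interleaved 6 ways
  return 0;
}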
Intermediate Version
In another version, there are some squares that we are not allowed to visit, and the grid is not too big, say less than a billion squares.
Here, we are not allowed to visit black squares, $x_1$, $x_2$, $x_3$, and $x_4$. This version admits a dynamic programming solution that is $O(RC)$ in time complexity and $O\left(\max(R,C)\right)$ in memory since we only need to keep track of two rows or columns at a time. Let $P(r, c)$ be the number of paths to square $(r,c)$. We have that $$ P(r,c) = \begin{cases} 1, &\text{if $r = 0$ and $c=0$} \\ 0, &\text{if $(r,c)$ is a black square} \\ P(r-1,c) + P(r,c-1), &\text{if $0 \leq r < R$ and $0 \leq c < C$ and $(r,c) \neq (0,0)$} \\ 0, &\text{otherwise}. \end{cases} $$
We simply loop through rows and columns in order with our loop invariant being that at square $(r,c)$, we know the number of paths to squares $(r^\prime, c^\prime)$, where either $r^\prime < r$ or $r^\prime = r$ and $c^\prime < c$. See the code in C++ below with some variations since the problem asks for number of paths modulo 1,000,000,007.
#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;
const int MOD = 1000000007;
int main(int argc, char *argv[]) {
int h, w, n;
cin >> h >> w >> n;
vector<vector<int>> grid(2, vector<int>(w, 0));
vector<pair<int, int>> blackSquares;
for (int i = 0; i < n; ++i) {
int r, c;
cin >> r >> c;
blackSquares.emplace_back(r - 1, c - 1);
}
sort(blackSquares.begin(), blackSquares.end());
grid.front()[0] = 1; // guaranteed that 0,0 is white
auto blackIt = blackSquares.begin();
for (int c = 1; c < w; ++c) {
if (blackIt != blackSquares.end() && blackIt -> first == 0 && blackIt -> second == c) {
grid.front()[c] = -1;
++blackIt;
} else if (grid.front()[c-1] != -1) {
grid.front()[c] = grid.front()[c-1]; // propagate the count (0 or 1) from the left
}
}
for (int r = 1; r < h; ++r) {
if (blackIt != blackSquares.end() && blackIt -> first == r && blackIt -> second == 0) {
grid.back()[0] = -1;
++blackIt;
} else if (grid.front()[0] != -1) {
grid.back()[0] = grid.front()[0]; // propagate the count (0 or 1) from the row above
} else {
grid.back()[0] = 0;
}
for (int c = 1; c < w; ++c) {
if (blackIt != blackSquares.end() && blackIt -> first == r && blackIt -> second == c) {
grid.back()[c] = -1;
++blackIt;
} else {
grid.back()[c] = 0;
if (grid.front()[c] != -1) grid.back()[c] += grid.front()[c];
if (grid.back()[c-1] != -1) grid.back()[c] += grid.back()[c-1];
grid.back()[c] %= MOD;
}
}
grid.front().swap(grid.back());
}
cout << grid.front()[w-1] << endl;
return 0;
}
Advanced Version
Now, if we have a very large grid, the above algorithm will take too long. If $1 \leq R,C \leq 10^5$, we could have as many as 10 billion squares. Assume that the number of black squares $N \ll RC$ ($N$ much less than $RC$). Then, we can write an algorithm that runs in $O\left(N^2\log\left(\max(R,C)\right)\right)$.
First, sort the black squares. Given $x = (r_1, c_1)$ and $y = (r_2, c_2)$, let $x < y$ if either $r_1 < r_2$ or $r_1 = r_2$ and $c_1 < c_2$. Let $x_1,x_2,\ldots,x_N$ be the black squares, where $x_i = (r_i,c_i)$. For each $x_i$, associate the set $$S_i = \left\{x_j = (r_j, c_j) : r_j \leq r_i~\text{and}~c_j \leq c_i~\text{and}~x_j \neq x_i\right\}.$$
Recalling our previous diagram,
$S_1 = \emptyset$, $S_2 = S_3 = \{ x_1 \}$, $S_4 = \{x_1,x_3\}$, and $S_B = \{x_1, x_2, x_3, x_4\}$. Let us loop through these points in order and calculate the number of paths to $x_i$ avoiding all the squares in $S_i$. It's clear that for all $x_j \in S_i$, $j < i$, so we have our loop invariant: at $i$, for all $j < i$, we know the number of paths to $x_j$ that do not pass through any square in $S_j$. Assuming this, let us calculate the number of paths to $x_i$.
If $S_i$ is empty, the problem reduces to the beginner version and $$P(x_i) = P(r_i,c_i) = {{r_i + c_i - 2}\choose{r_i-1}}.$$ If $S_i$ is not empty, we need to subtract any paths that go through any $x_j \in S_i$, so if $i = 4$, we want to subtract paths like these red ones:
Thus, there are 3 types of paths we need to subtract: paths that go through $x_1$ and not $x_3$, paths that go through $x_3$ and not $x_1$, and paths that go through both $x_1$ and $x_3$.
It might be helpful to think about this as a Venn diagram:
We see that the set of these paths is the same as any path that goes through $x_1$ and paths that go through $x_3$ but not $x_1$. The number of paths that go through $x_1$ is $$P(x_1)\cdot{(r_4 - r_1) + (c_4 - c_1)\choose{r_4 - r_1}},$$ and since $x_1 \in S_3$, the number of paths that go through $x_3$ but not $x_1$ is $$P(x_3)\cdot{(r_4 - r_3) + (c_4 - c_3)\choose{r_4 - r_3}}.$$
Generalizing this, we have that $$P(x_i) = {{r_i + c_i - 2}\choose{r_i-1}} - \sum_{x_j \in S_i}P(x_j)\cdot{{(r_i-r_j) + (c_i - c_j)}\choose{r_i-r_j}},$$ which leads to the $N^2$ factor in the time complexity. Treating B as a black square, we can calculate the number of paths to B that avoid $x_1, x_2,\ldots,x_N$.
Now, the problem is further complicated by the fact that we need to return the number of paths modulo 1,000,000,007. Computing the binomial coefficient involves division, so we need to be able to find the modular multiplicative inverse. We can do this with the Euclidean algorithm, which gives us the $\log\left(\max(R, C)\right)$ factor in the time complexity.
Since $m = 10^9 + 7$ is prime, given any $n < m$, $\gcd(n,m) = 1$, so there exists $w,x \in \mathbb{Z}$ such that $wn + xm = 1$. We can find $w$ and $x$ by setting up a linear system, $$ \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} n \\ m \end{pmatrix} = \begin{pmatrix} n \\ m \end{pmatrix}, $$ which can be rewritten as an augmented matrix, $$ \left(\begin{array}{cc|c} 1 & 0 & n\\ 0 & 1 & m \end{array}\right). $$ You can perform the Euclidean algorithm by doing elementary row operations until a $1$ appears in the rightmost column. The first and second column of that row is $w$ and $x$, respectively. Since the system remains consistent with $(n,m)^\intercal$ as a solution, we have $$wn + xm = 1 \Rightarrow wn = 1 - xm \equiv 1 \bmod m,$$ so the modular multiplicative inverse of $n$ is $w$.
If you memoize this step and precalculate all the modular multiplicative inverses up to $\max(R,C)$, you can actually achieve a $O(N^2)$ algorithm, but this is unnecessary for this problem. Here's the code below.
#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;
const long long MOD = 1000000007LL;
const int FACTORIALS_LENGTH = 199999;
long long factorials[FACTORIALS_LENGTH];
long long invertMod(int n, int m) {
/*
* find the modular inverse with
* extended euclidean algorithm
* [w x][a] = [a]
* [y z][b] = [b]
*/
n %= m;
long w = 1;
long x = 0;
long y = 0;
long z = 1;
long a = n;
long b = m;
while (a != 0 && b != 0) {
if (a <= b) {
int q = b / a;
int r = b % a;
b = r;
z -= q*x; z %= m;
y -= q*w; y %= m;
} else if (a > b) {
int q = a / b;
int r = a % b;
a = r;
x -= q*z; x %= m;
w -= q*y; w %= m;
}
}
int inverseMod = a != 0 ? w : y;
if (inverseMod < 0) inverseMod += m;
return inverseMod;
}
long long choose(int n, int r) {
long long res = (factorials[n]*invertMod(factorials[r], MOD)) % MOD;
res *= invertMod(factorials[n-r], MOD);
return res % MOD;
}
int main(int argc, char *argv[]) {
factorials[0] = 1;
for (int n = 1; n < FACTORIALS_LENGTH; ++n) factorials[n] = (factorials[n-1]*n) % MOD;
int h, w, n;
cin >> h >> w >> n;
vector<pair<int, int>> blackSquares;
vector<long long> blackSquarePaths;
for (int i = 0; i < n; ++i) {
int r, c;
cin >> r >> c;
blackSquares.emplace_back(r - 1, c - 1);
blackSquarePaths.push_back(0);
}
// just treat the end point as a black square
blackSquares.emplace_back(h - 1, w - 1);
blackSquarePaths.push_back(0);
sort(blackSquares.begin(), blackSquares.end());
for (int i = 0; i < blackSquares.size(); ++i) {
// start by considering all paths
blackSquarePaths[i] = choose(blackSquares[i].first + blackSquares[i].second, blackSquares[i].first);
for (int j = 0; j < i; ++j) {
// subtract unallowed paths, we already know row is smaller because we sorted already
if (blackSquares[j].second <= blackSquares[i].second) {
/*
* A
* B
* C
* if we're trying to get to C, we need to subtract all the paths that go through, A, B, or both
* we hit A first and subtract paths that go through A and both A and B
* we hit B and subtract all paths that go through B but not A
*
* loop invariant is that we know how many paths to get to a
* black square by avoiding other black squares
*/
int rDiff = blackSquares[i].first - blackSquares[j].first;
int cDiff = blackSquares[i].second - blackSquares[j].second;
blackSquarePaths[i] -= (blackSquarePaths[j]*choose(rDiff + cDiff, rDiff)) % MOD;
blackSquarePaths[i] %= MOD;
if (blackSquarePaths[i] < 0) blackSquarePaths[i] += MOD;
}
}
}
cout << blackSquarePaths.back() << endl;
return 0;
}
Duke APT CountPath Version
Now the version that I did at Duke has a few key differences. First, you start in the lower left and end up in the upper right. More importantly, you need to calculate the paths that go through 1 black square, 2 black squares, 3 black squares, etc. We need to do this up to paths that go through all $N$ black squares. Also, the grid is much smaller, and the black squares must be visited in the order given.
Since the grid is much smaller, a dynamic programming solution works where the state is four variables $(r,c,k,l)$, where $k$ is the index of the last black square visited and $l$ is the number of black squares on that path. In general, if $T$ is the set of indices of black squares that are to the lower-left of $k$ and appear earlier in the order given, we have that $$ P(r,c,k,l) = \begin{cases} P(r-1,c,k,l) + P(r,c-1,k,l), &\text{if $(r,c)$ is not a black square} \\ \sum_{k^\prime \in T} P(r-1,c,k^\prime,l-1) + P(r,c-1,k^\prime,l-1), &\text{if $(r,c)$ is a black square.} \end{cases} $$ This is $O(RCN^2)$ in both time and space. Here is the Java code.
import java.util.Arrays;
public class CountPaths {
private static final int MOD = 1000007;
public int[] difPaths(int r, int c, int[] fieldrow, int[] fieldcol) {
// row, col, last special field, num of previous special fields,
// need to add 1 to third index, in case there are 0 special fields
int[][][][] memo = new int[r+1][c+1][fieldrow.length+1][fieldrow.length+1];
int[][] isSpecial = new int[r+1][c+1];
for (int i = 0; i <= r; ++i) { Arrays.fill(isSpecial[i], -1); };
for (int i = 0; i < fieldrow.length; ++i) { isSpecial[fieldrow[i]][fieldcol[i]] = i; }
// 00 is special, marks no special fields yet
memo[1][0][0][0] = 1; // so that memo[1][1] gets 1 path
for (int i = 1; i <= r; ++i) {
for (int j = 1; j <= c; ++j) {
int kLimit; int lStart;
// handle 00 case where no special fields have been reached yet
if (isSpecial[i][j] == -1) {
// l = 0 case, no special field came before it
memo[i][j][0][0] += memo[i-1][j][0][0] + memo[i][j-1][0][0];
memo[i][j][0][0] %= MOD;
kLimit = fieldrow.length; lStart = 1;
} else {
// since this field is special min length is 1
memo[i][j][isSpecial[i][j]][1] += memo[i-1][j][0][0] + memo[i][j-1][0][0];
memo[i][j][isSpecial[i][j]][1] %= MOD;
kLimit = isSpecial[i][j]; lStart = 2;
}
for (int l = lStart; l <= fieldrow.length; ++l) {
// only check special fields that come before
for (int k = 0; k < kLimit; ++k) {
// this point is only reachable from special field if it is up and to the right
if (i >= fieldrow[k] && j >= fieldcol[k]) {
if (isSpecial[i][j] == -1) {
// when the field is not special, length is the same
memo[i][j][k][l] += memo[i-1][j][k][l] + memo[i][j-1][k][l];
memo[i][j][k][l] %= MOD;
} else if (isSpecial[i][j] > -1) {
// when the field is special add 1 to the length
memo[i][j][kLimit][l] += memo[i-1][j][k][l-1] + memo[i][j-1][k][l-1];
memo[i][j][kLimit][l] %= MOD;
}
}
}
}
}
}
int[] paths = new int[fieldrow.length+1];
for (int k = 0; k < fieldrow.length; ++k) {
for (int l = 1; l < fieldrow.length + 1; ++l) {
paths[l] += memo[r][c][k][l];
paths[l] %= MOD;
}
}
paths[0] += memo[r][c][0][0]; paths[0] %= MOD;
return paths;
}
}
Using ideas from the advanced version, we can significantly reduce the memory needed. Basically, we represent each black square as a node and construct a directed graph, represented by an adjacency matrix, where the edges are weighted by the number of paths from one black square to another that avoid all other black squares. This takes $O(N^4)$ time and $O(N^2)$ space. Now, let the nodes be numbered $1,2,\ldots,N$. We can calculate the paths that go through any number of black squares with some dynamic programming. Let $G(j,k)$ be the number of paths from black square $j$ to black square $k$ that don't go through any other black squares. Let $Q(k, l)$ be the number of paths of length $l$ where $k$ is the last black square. We have that $$Q(k,l) = \sum_{j < k} Q(j, l-1)\cdot G(j,k), $$ keeping in mind that the nodes are sorted, and $G(j,k) = 0$ if the black squares are in the wrong order. Here's the C++ code below.
#include <algorithm>
#include <cassert>
#include <iostream>
#include <sstream>
#include <vector>
using namespace std;
class CountPaths {
private:
const int MOD = 1000007;
vector<vector<int>> choose;
vector<vector<int>> initializeChoose(int N) {
choose = vector<vector<int>>();
choose.push_back(vector<int>{1}); // choose[0]
for (int n = 1; n < N; ++n) {
choose.push_back(vector<int>(n+1));
choose[n][n] = choose[n][0] = 1;
for (int r = 1; r < n; ++r) {
// out of the previous n - 1 things choose r things
// and out of the previous n - 1 things choose r - 1 things and choose the nth thing
choose[n][r] = choose[n-1][r] + choose[n-1][r-1];
choose[n][r] %= MOD;
}
}
return choose;
}
int computePaths(int R, int C, vector<pair<int, int>> specialFields) {
// specialFields should already be sorted in this case
specialFields.emplace_back(R, C);
int N = specialFields.size();
vector<int> paths = vector<int>(N);
for (int i = 0; i < N; ++i) {
paths[i] = choose[specialFields[i].first + specialFields[i].second][specialFields[i].first];
for (int j = 0; j < i; ++j) {
int rDiff = specialFields[i].first - specialFields[j].first;
int cDiff = specialFields[i].second - specialFields[j].second;
assert(rDiff >= 0);
if (cDiff >= 0) {
paths[i] -= (((long long) paths[j])*choose[rDiff + cDiff][rDiff]) % MOD;
paths[i] %= MOD;
if (paths[i] < 0) paths[i] += MOD;
}
}
}
return paths.back();
}
vector<tuple<int, int, int>> makeSpecialFields(int r, int c, vector<int> fieldrow, vector<int> fieldcol) {
vector<tuple<int, int, int>> specialFields;
bool originIsSpecial = false;
bool endIsSpecial = false;
int N = fieldrow.size();
for (int i = 0; i < N; ++i) {
if (fieldrow[i] == 1 && fieldcol[i] == 1) originIsSpecial = true;
if (fieldrow[i] == r && fieldcol[i] == c) endIsSpecial = true;
specialFields.emplace_back(fieldrow[i] - 1, fieldcol[i] - 1, i);
}
if (!originIsSpecial) specialFields.emplace_back(0, 0, -1);
if (!endIsSpecial) specialFields.emplace_back(r - 1, c - 1, N);
sort(specialFields.begin(), specialFields.end(),
[](tuple<int, int, int> a, tuple<int, int, int> b) {
int ra = get<0>(a);
int rb = get<0>(b);
int ca = get<1>(a);
int cb = get<1>(b);
return ra < rb || (ra == rb && ca < cb) || (ra == rb && ca == cb && get<2>(a) < get<2>(b));
});
return specialFields;
}
public:
vector<int> difPaths(int r, int c, vector<int> fieldrow, vector<int> fieldcol) {
initializeChoose(r + c - 1);
// make a directed graph of paths between without going through any other fields
// tuple is (row, col, idx)
vector<tuple<int, int, int>> specialFields = makeSpecialFields(r, c, fieldrow, fieldcol);
int N = specialFields.size();
// graph[i][j] is the number of paths going from i to j without going through any other special fields
vector<vector<int>> graph(N, vector<int>(N, 0));
for (int i = 0; i < N - 1; ++i) {
for (int j = i + 1; j < N; ++j) {
int rDiff = get<0>(specialFields[j]) - get<0>(specialFields[i]);
int cDiff = get<1>(specialFields[j]) - get<1>(specialFields[i]);
vector<pair<int, int>> intermediateFields;
assert(rDiff >= 0); // since fields are sorted
if (cDiff >= 0 && get<2>(specialFields[i]) < get<2>(specialFields[j])) {
for (int k = i + 1; k < j; ++k) {
assert(get<0>(specialFields[i]) <= get<0>(specialFields[k]) && get<0>(specialFields[k]) <= get<0>(specialFields[j])); // since fields are sorted
if (get<1>(specialFields[i]) <= get<1>(specialFields[k]) && get<1>(specialFields[k]) <= get<1>(specialFields[j])) {
intermediateFields.emplace_back(get<0>(specialFields[k]) - get<0>(specialFields[i]),
get<1>(specialFields[k]) - get<1>(specialFields[i]));
}
}
graph[i][j] = computePaths(rDiff, cDiff, intermediateFields);
}
}
}
// first index refers to last special index (index in specialFields)
// second index refers to length of path
vector<vector<int>> pathsMemo(N, vector<int>(N + 1, 0));
pathsMemo[0][1] = 1; // counting the origin as a special field, we have 1 path going through the origin that goes through 1 special field (the origin)
for (int l = 2; l <= N; ++l) {
for (int k = 0; k < N; ++k) {
for (int j = 0; j < k; ++j) {
pathsMemo[k][l] += ((long long) pathsMemo[j][l-1]*graph[j][k]) % MOD;
pathsMemo[k][l] %= MOD;
}
}
}
return vector<int>(pathsMemo[N-1].end() - fieldrow.size() - 1, pathsMemo[N-1].end());
}
};
Back when I lived in Boston, I was an avid CrossFitter. For a variety of reasons, mainly financial, I no longer go to a CrossFit box, but I'm still interested in the sport. Being a bit of a data nerd, I've always been curious about what makes an elite CrossFitter and how much height and weight play a role. To satisfy my curiosity, I scraped data on the top 2,000 athletes from CrossFit Games, and created a pivot chart in D3.js, where you can compare statistics on workouts and lifts by different groups of athletes.
Play around with the data yourself at 2015 CrossFit Open Pivot Chart. Be careful. The data may not be that reliable. If there are a lot of outliers, it may be better to use a robust statistic like median instead of mean (in particular, Sprint 400m and Run 5k workouts seem to have this problem). If you don't choose your groups wisely, you may fall into Simpson's Paradox by excluding important data. For example, from the chart below, one may conclude that back squat strength decreases with age.
But now, when we consider gender, we have an entirely different story:
Unsurprisingly, women back squat less than men do. Back squat strength remains stable with age for women, and if anything, back squat strength actually increases slightly with age for men. Whoa, what's going on here? Check this out:
Notice that the 3 rightmost female bars (red) are taller than the 3 rightmost male bars (blue). On the other hand, the 3 leftmost female bars are shorter than the 3 leftmost male bars. Thus, it seems that women age better than men in the CrossFit world. This implies that women make up a larger share of the older age groups, so the average back squat of those groups appears lower, when in reality there is simply a greater relative number of women in them. You can reach the same conclusion from the title picture, where the bars are stacked.
Height and Weight
There definitely seems to be a prototypical build for an elite CrossFit athlete. For men it's about 5'10" and 200 lb, with a lot of athletes just over 6 feet, too.
For women, most athletes seem to be about 5'6" and 145 lb, which happen to be my dimensions. Smaller female athletes that barely break 5 feet are pretty well-represented, too. I was somewhat surprised at the lack of taller women.
Some open workouts, like 1A, which was a one-rep max clean and jerk, favored larger athletes:
Other open workouts, like 4, which was an ascending rep ladder of cleans and handstand push-ups, favored smaller athletes:
You can check the other workouts yourself, but overall, this year's open seemed fair with regard to athlete size.
Anyway, feel free to play with the data yourself here. Let me know if you find anything cool and if you have any suggestions for improving usability.
One popular model for classification is nearest neighbors. It's broadly applicable and unbiased as it makes no assumptions about the generating distribution of the data. Suppose we have $N$ observations. Each observation can be written as $(\mathbf{x}_i,y_i)$, where $1 \leq i \leq N$ and $\mathbf{x}_i = (x_{i1},x_{i2},\ldots,x_{ip})$, so we have $p$ features, and $y_i$ is the class to which $\mathbf{x}_i$ belongs. Let $y_i \in C$, where $C$ is a set of possible classes. If we were given a new observation $\mathbf{x}$, we would find the $k$ closest $(\mathbf{x}_i, y_i)$, say $(\mathbf{x}_{j_1}, y_{j_1}), (\mathbf{x}_{j_2}, y_{j_2}),\ldots, (\mathbf{x}_{j_k}, y_{j_k})$. To classify $\mathbf{x}$, we would simply take a majority vote among these $k$ closest points. While simple and intuitive, as we will see, nearest neighbors runs into problems when $p$ is large.
Consider this problem:
Consider $N$ data points uniformly distributed in a $p$-dimensional unit ball centered at the origin. Find the median distance from the origin of the closest data point among the $N$ points.
Let the median distance be $d(p, N)$. First, to keep things simple, consider a single data point, so $N = 1$. The volume of a $p$-dimensional ball of radius $r$ is proportional to $r^p$, so $V(r) = Kr^p$. Let $d$ be the distance of the point, so $P(d \leq d(p,1)) = 0.5$. Viewing probability as volume, we imagine a smaller ball inside the larger ball, so \begin{align*} \frac{1}{2} &= P(d \leq d(p, 1)) \\ &= \frac{V(d(p,1))}{V(1)} \\ &= d(p,1)^p \\ &\Rightarrow d(p,1) = \left(\frac{1}{2}\right)^{1/p}, \end{align*} and in general, $P(d \leq t) = t^p$, where $0 \leq t \leq 1$. For example, when $p = 1$, we have $P(d \leq t) = t$, and when $p = 2$, $P(d \leq t) = t^2$.
Now, consider the case when we have $N$ data points, $x_1, x_2, \ldots, x_N$. The distance of the closest point is $$d = \min\left(\Vert x_1 \Vert, \Vert x_2 \Vert, \ldots, \Vert x_N \Vert\right).$$ Thus, we'll have \begin{align*} \frac{1}{2} &= P(d \leq d(p,N)) \\ &= P(d > d(p,N)),~\text{since $P(d \leq d(p,N)) + P(d > d(p,N)) = 1$} \\ &= P\left(\Vert x_1\Vert > d(p,N)\right)P\left(\Vert x_2\Vert > d(p,N)\right) \cdots P\left(\Vert x_N\Vert > d(p,N)\right) \\ &= \prod_{i=1}^N \left(1 - P\left(\Vert x_i \Vert \leq d(p,N)\right)\right) \\ &= \left(1 - d(p,N)^p\right)^N,~\text{since $x_i$ are i.i.d. and $P(\Vert x_i\Vert \leq t) = t^p$}. \end{align*} And so, \begin{align*} \frac{1}{2} &= \left(1 - d(p,N)^p\right)^N \\ \Rightarrow 1 - d(p,N)^p &= \left(\frac{1}{2}\right)^{1/N} \\ \Rightarrow d(p,N)^p &= 1 - \left(\frac{1}{2}\right)^{1/N}. \end{align*} Finally, we obtain $$\boxed{d(p,N) = \left(1 - \left(\frac{1}{2}\right)^{1/N}\right)^{1/p}}.$$
So, what does this equation tell us? As the dimension $p$ increases, the distance goes to 1, so all points become far away from the origin. But as $N$ increases the distance goes to 0, so if we collect enough data, there will be a point closest to the origin. But note that as $p$ increases, we need an exponential increase in $N$ to maintain the same distance.
Let's relate this to the nearest neighbor method. To make a good prediction on $\mathbf{x}$, perhaps, we need a training set point that is within distance 0.1 from $\mathbf{x}$. We would need 7 data points for there to be a greater than 50% chance of such a point existing if $p = 1$. See how $N$ increases as $p$ increases.
| p | N |
|---|---|
| 1 | 7 |
| 2 | 69 |
| 3 | 693 |
| 4 | 6932 |
| 5 | 69315 |
| 10 | 6,931,471,232 |
| 15 | $\scriptsize{6.937016 \times 10^{14}}$ |
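These values can be recomputed from the boxed formula by solving $d(p, N) \leq 0.1$ for $N$; here's a small sketch. The last rows are limited by double precision, so treat them as approximate.

#include <cmath>
#include <cstdio>

int main() {
  // smallest N with d(p, N) <= 0.1, i.e., N >= log(1/2)/log(1 - 0.1^p)
  int dims[] = {1, 2, 3, 4, 5, 10, 15};
  for (int p : dims) {
    double N = std::ceil(std::log(0.5)/std::log(1.0 - std::pow(0.1, p)));
    std::printf("p = %2d, N = %.0f\n", p, N);
  }
  return 0;
}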
The increase in data needed for there to be a high probability that there is a point close to $\mathbf{x}$ is exponential. We have just illustrated the curse of dimensionality. So, in high dimensions, other methods must be used.
One of my seemingly incongruous hobbies is reading classics, particularly Victorian era novels. Though I'd like to think that if you really take the time to get to know me, it's not that strange, as I can be a bit of a hopeless romantic. In this lonelier chapter of my life, I have found much solace in books.
I've just finished Far From the Madding Crowd by Thomas Hardy. It was a rather difficult read with all the pastoral vocabulary and allusions from the Bible and Greek mythology. My perseverance rewarded me with Hardy's poetical English and an enchanting story.
In the story, Bathsheba Everdene is courted by three different suitors in three different ways. First, there is Gabriel Oak, who shows a steadfast love. Second, there is Mr. Boldwood, who, being a well-established farmer, himself, offers the most practical marriage, both in terms of money and social status. Thirdly and finally, there is Sergeant Troy, who displays a most passionate love. He flirts so well that I found myself being charmed by his wit and eloquence.
I won't ruin the surprise of whom she picks, but I do find it interesting how little modern love has changed. Today, given online dating, women hold the power of choice even more so than back then. The so-called friend zone was alive and well back then, too, appearing in Vanity Fair with Captain Dobbin and Amelia and in A Tale of Two Cities with Sydney Carton and Lucie. Men are still losing their minds over women. One very forward-thinking quote I found was
The good-fellowship — camaraderie, usually occurring through similarity of pursuits, is unfortunately seldom super-added to love between the sexes, because they associate not in their labours, but in their pleasures merely.
It does seem that this ideal of love is more a reality in today's world now that women are part of our workforce. This love is claimed to be "strong as death." Perhaps, I'll start looking out for this "similarity of pursuits."
A classic programming problem is to find all the primes up to a certain number. This problem admits a classic solution, the sieve of Eratosthenes. Here it is in Java.
/**
* @param upper bound exclusive
* @return a list of primes strictly less than upper
*/
public static Deque<Integer> findPrimes(int upper) {
Deque<Integer> primes = new ArrayDeque<Integer>();
if (upper <= 2) return primes;
primes.add(2);
boolean[] isPrime = new boolean[(upper-2)/2]; // index 0 is 3
Arrays.fill(isPrime, true);
for (int p = 3; p < upper; p += 2) {
if (isPrime[p/2 - 1]) {
primes.add(p);
// only need to start from p^2 since we already checked p*m, where m < p
for (long q = ((long) p)*((long) p); q < upper; q += 2*p) {
isPrime[((int) q)/2 - 1] = false;
}
}
}
return primes;
}
The problem is with the isPrime array. This solution is $O(n)$ in space and $O(n \log \log n)$ in computation. We may only be interested in finding large primes, from, say, 999,900,000 to 1,000,000,000 as in the problem PRIME1. It doesn't make sense to check numbers less than 999,900,000 or allocate space for them.
Hence, we use a segmented sieve. The underlying idea is that to check if a number $P$ is prime by trial division, we only need to check that it is not divisible by any prime numbers $q \leq \sqrt{P}$. Thus, if we want to find all the primes between $m$ and $n$, we first generate all the primes that are less than or equal to $\sqrt{n}$ with the traditional sieve. Let $S$ be the set of those primes.
Then, let $L$ be some constant number. We work in segments $[m, m + L)$, $[m + L, m + 2L)$, $\ldots$, $[m + kL, n + 1)$. In each of these segments, we identify of all the multiples of the primes found in $S$ and mark them as not prime. Now, we only need $O(\max(|S|, L))$ space, and computation is $$O\left(|S| \cdot \frac{n-m}{L} + (n-m)\right),$$ and we can set $L$ to be as large or small as we want.
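For a rough sense of scale (the numbers here are approximate): with $m = 999{,}900{,}000$, $n = 10^9$, and $L = 10^4$, we have $|S| = \pi\left(\sqrt{10^9}\right) \approx 3{,}400$ sieving primes and $(n - m)/L = 10$ segments, so the work is on the order of $3.4 \times 10^4$ offset computations plus about $10^5$ cell updates, while the sieve itself occupies only $10^4$ entries instead of the hundreds of millions a full sieve up to $10^9$ would need.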
By the prime number theorem, $|S|$ is not typically very large. Asymptotically, $$|S| = \pi(\sqrt{n}) \sim \frac{\sqrt{n}}{\log \sqrt{n}}.$$ For $L$, we have a tradeoff. If we have large $L$, we may need a lot of space. If we have $L$ too small, our sieve is very small and may not contain many multiples of the primes in $S$, which results in wasted computation. Here is the code with some tweaks to avoid even numbers.
/**
* Find primes in range
* @param lower bound, inclusive
* @param upper bound exclusive
* @param sieveSize space to use
* @return list of primes in range
*/
public static Deque<Integer> findPrimes(int lower, int upper, int sieveSize) {
if (lower >= upper) throw new IllegalArgumentException("lower must be less than upper");
int sievingPrimesUpper = (int) Math.sqrt(upper);
if (lower <= sievingPrimesUpper || sievingPrimesUpper <= 2) {
Deque<Integer> primes = findPrimes(upper);
if (!primes.isEmpty()) while (primes.peekFirst() < lower) primes.removeFirst();
return primes;
}
if (sieveSize < 5) sieveSize = 10;
Deque<Integer> primes = new ArrayDeque<Integer>();
if (lower <= 2) primes.add(2);
Deque<Integer> sievingPrimes = findPrimes(sievingPrimesUpper + 1);
sievingPrimes.removeFirst(); // get rid of 2
while (!sievingPrimes.isEmpty() &&
sievingPrimes.getLast()*sievingPrimes.getLast() >= upper) sievingPrimes.removeLast();
if (lower % 2 == 0) lower += 1; // make lower odd
boolean[] isPrime = new boolean[sieveSize]; // isPrime[i] refers to lower + 2*i
/**
* Find first odd multiple for each sieving prime. lower + 2*nextMultipleOffset[i]
* will be the first odd multiple of sievingPrimes[i] that is greater than or
* equal to lower.
*/
int[] nextMultipleOffset = new int[sievingPrimes.size()];
int idx = 0;
for (int p : sievingPrimes) {
int nextMultiple = lower - (lower % p); // make it a multiple of p
if (nextMultiple < lower) nextMultiple += p; // make it at least lower
if (nextMultiple % 2 == 0) nextMultiple += p; // make sure it's odd
nextMultipleOffset[idx++] = (nextMultiple - lower)/2;
}
while (lower < upper) {
Arrays.fill(isPrime, true);
idx = 0;
for (int p : sievingPrimes) {
int offset = nextMultipleOffset[idx];
for (int j = offset; j < sieveSize; j += p) isPrime[j] = false;
/**
* We want (lower + 2*sieveSize + 2*(nextMultipleOffset[idx] + k)) % p == 0
* and (lower + 2*sieveSize + 2*(nextMultipleOffset[idx] + k)) % 2 == 1,
* where k is the correction term. Second equation is always true.
* First reduces to 2*(sieveSize + k) % p == 0 ==> (sieveSize + k) % p == 0
* since 2 must be invertible in the field F_p. Thus, we have that
* k % p = (-sieveSize) % p. Then, we make sure that the offset is nonnegative.
*/
nextMultipleOffset[idx] = (nextMultipleOffset[idx] - sieveSize) % p;
if (nextMultipleOffset[idx] < 0) nextMultipleOffset[idx] += p;
++idx;
}
for (int i = 0; i < sieveSize; ++i) {
int newPrime = lower + i*2;
if (newPrime >= upper) break;
if (isPrime[i]) primes.add(newPrime);
}
lower += 2*sieveSize;
}
return primes;
}
Perhaps the funnest part about this blog was writing the commenting system. One may want to comment on a comment. I found a way to support this behaviour with recursion.
In the layout, using Jade mixins, we have something like
mixin commentView(comment)
- var children = comments.filter(function(child) { return comment.id === child.commentId; })
.wmd-preview!= comment.bodyHtml
.children
.inner-children
each child in children
+commentView(child)
We need the inner-children div to create the smooth transitions when you expand the sub-comments. Apparently, CSS transitions only work when the height is specified, so you can calculate the new height of children with inner-children.
We can also expand comments recursively with JavaScript:
function expandParent(node) {
if (node.classList.contains('comments')) return; // we've reached the top level
if (node.classList.contains('children')) {
node.parentNode.querySelector('.reply-expander').classList.add('expanded');
node.classList.add('expanded');
}
expandParent(node.parentNode);
}
I've left some sample comments below to show off the features. Apparently, transitions are only smooth in Chrome and Internet Explorer. Leave a comment below and try it out!