One of the features of Snapstream Searcher is matrix search. One can search for a whole list of terms over a date range and receive a matrix of co-occurrences, where a co-occurrence is defined as two terms being mentioned in the same program within a specified number of characters.
One way to visualize such data is as a graph. Each country is a node. Between each pair of nodes, we place an edge which is weighted according to the strength of their relationship. We'd suspect that countries that frequently co-occur will form clusters.
Spring Embedding
To accomplish this clustering, I used spring embedding. Suppose we have $N$ nodes, labeled from $1$ to $N$. Between nodes $i$ and $j$, we place a string of length $l_{ij}$ with spring constant $k_{ij}$. Recall that Hooke's law states that the force needed to stretch or compress the spring to a length $d$ is $F(d) = k_{ij}(d - l_{ij})$, which implies that spring has potential energy $$ E(d) = \frac{1}{2}k_{ij}(d-l_{ij})^2. $$ Suppose each node $i$ has position $(x_i,y_i)$ and node $j$ has position $(x_j,y_j)$. Define the distance between two nodes $$ d(i,j) = \sqrt{(x_j-x_i)^2 + (y_j-y_i)^2}. $$ The total energy of the system is then \begin{equation} E = \frac{1}{2}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}k_{ij}\left(d(i,j) - l_{ij}\right)^2 \end{equation} If we call $l_{ij}$ the ideal length of the spring, deviations from the ideal length lead to higher energy. We want to choose $(x_i, y_i)$ for all $i = 1,2,\ldots,N$ such that the total energy is minimized. We can do this with a steepest gradient descent method. Much to my surprise, I actually used something from my Numerical Analysis class.
Implementation Issues
To actually implement this algorithm a couple of issues must be addressed:
- computing the $l_{ij}$ as a function of occurrences and co-occurrences, and
- normalizing $l_{ij}$ so the nodes fit on the canvas.
For the first issue, if there are a lot of co-occurrences, we want the nodes to be more closely linked. But nodes that are mentioned frequently like USA would have the most co-occurrences with every other node by chance since it appears most frequently in general. To address this issue some normalization is done. Let $S$ be the total number of occurrences of all search terms. Let $R_i$ be the number of occurrences of term $i$ and $C_j$ the number of occurrences of term $j$. Finally, let $M_{ij}$ be the number of co-occurrences of terms $i$ and $j$. We define \begin{equation} A_{ij} = \frac{\left(M_{ij}/S\right)}{\left(R_i/S\right)\left(C_j/S\right)} = \frac{SM_{ij}}{R_iC_j}, \end{equation} which you can intuitively think of as the proportion of co-occurrences we actaully observed over the number of co-occurrences that we would expect if the occurrences were independent of each other.
Now, since more co-occurrences than expected should lead to a smaller ideal distance, we define \begin{equation} l_{ij} = \frac{1}{c + A_{ij}}, \end{equation} where we chose $c = 0.01$.
Note that $l_{ij} \in (0,1/c)$ which is between $0$ and $10$ in the case that $c = 0.01.$ To plot this we need to translate this distance into pixels. We don't want the minimum distance to be too small because than the nodes will be on top of each other. Nor do we want the max distance to be too big since the nodes will fall off the canvas. We apply an affine transformation from $l_{ij}$ to $[L, U].$ Let $l^* = \max_{i,j}l_{ij}$ and $l_* = \min_{i,j}l_{ij},$ and define \begin{equation} l_{ij}^\prime = L + \frac{l_{ij} - l_*}{l^* - l_*}(U - L). \end{equation} I simply chose to fix $L = 80.$ To choose $U$, I used a technique that I learned from programming contests. We want to vary $U$ until all the nodes fit in the canvas. Running the spring embedding algorithm is very expensive however, so we can't try all possible values of $U$. Thus, we do a binary search and find the largest $U$ such that all the nodes fit. This solves the second issue.
You can this algorithm implemented at Country Relationships. Play around with the graph, and let me know what you think!
Analysis
If you look at Country Relationships, right away you see a cluster of Middle Eastern countries with Russia, France, Belgium, and Israel as bridge to the United States. China also places a central role and is closely related to Vietnam and Japan.
If you change the time period to December 2015 and click Spring Embed
again to re-cluster, inexplicably the Philippines and Colombia are strongly related. Recall that Steve Harvey mixed up the winners of Miss Universe 2015.
In January 2016, North Korea appears, for it claimed to test a hydrogen bomb. Brazil grows larger with as the fear of the Zika virus takes grip.
In February 2016, Cuba becomes larger due to President Obama announcing that he'll visit. Brazil also gets even larger as Zika fears intensify.
In March 2016, Belgium becomes very large due to the terrorist attack. Of course, Ireland makes its debut thanks to St. Patrick's Day.
In April 2016, Ecuador appears due to the earthquake. It so happens that Japan had earthquakes, too. News programs often group earthquake reporting together, so Ecuador and Japan appear to be closely related.
Try making your own graph by doing a matrix search here!
New Comment
Comments
No comments have been posted yet. You can be the first!