Posts tagged data viz


One of the features of Snapstream Searcher is matrix search. One can search for a whole list of terms over a date range and receive a matrix of co-occurrences, where a co-occurrence is defined as two terms being mentioned in the same program within a specified number of characters.

One way to visualize such data is as a graph. Here, the search terms are countries, so each country is a node. Between each pair of nodes, we place an edge which is weighted according to the strength of their relationship. We'd suspect that countries that frequently co-occur will form clusters.

Spring Embedding

To accomplish this clustering, I used spring embedding. Suppose we have $N$ nodes, labeled from $1$ to $N$. Between nodes $i$ and $j$, we place a spring of length $l_{ij}$ with spring constant $k_{ij}$. Recall that Hooke's law states that the force needed to stretch or compress the spring to a length $d$ is $F(d) = k_{ij}(d - l_{ij})$, which implies that the spring has potential energy $$ E(d) = \frac{1}{2}k_{ij}(d-l_{ij})^2. $$ Suppose node $i$ has position $(x_i,y_i)$ and node $j$ has position $(x_j,y_j)$. Define the distance between the two nodes as $$ d(i,j) = \sqrt{(x_j-x_i)^2 + (y_j-y_i)^2}. $$ The total energy of the system is then \begin{equation} E = \frac{1}{2}\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}k_{ij}\left(d(i,j) - l_{ij}\right)^2. \end{equation} If we call $l_{ij}$ the ideal length of the spring, deviations from the ideal length lead to higher energy. We want to choose $(x_i, y_i)$ for all $i = 1,2,\ldots,N$ such that the total energy is minimized. We can do this with the method of steepest descent (gradient descent). Much to my surprise, I actually used something from my Numerical Analysis class.
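To make the minimization concrete, here is a minimal Python sketch of gradient descent on the total energy above. It is not the code behind the visualization: the function names, the fixed step size `rate`, and the fixed iteration count `steps` are all assumptions chosen for illustration. The gradient follows from differentiating the energy, which gives $\partial E/\partial x_i = \sum_{j \neq i} k_{ij}\left(d(i,j) - l_{ij}\right)(x_i - x_j)/d(i,j)$, and similarly for $y_i$.

```python
import math

def energy(pos, k, l):
    """Total spring energy: sum over i < j of (1/2) k_ij (d(i,j) - l_ij)^2."""
    n = len(pos)
    total = 0.0
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = math.hypot(pos[j][0] - pos[i][0], pos[j][1] - pos[i][1])
            total += 0.5 * k[i][j] * (d - l[i][j]) ** 2
    return total

def gradient(pos, k, l):
    """Partial derivatives of the energy with respect to each (x_i, y_i)."""
    n = len(pos)
    grad = [[0.0, 0.0] for _ in range(n)]
    for i in range(n - 1):
        for j in range(i + 1, n):
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            d = math.hypot(dx, dy)
            if d == 0.0:
                continue  # coincident nodes: direction is undefined, skip the pair
            coeff = k[i][j] * (d - l[i][j]) / d
            grad[i][0] += coeff * dx
            grad[i][1] += coeff * dy
            grad[j][0] -= coeff * dx
            grad[j][1] -= coeff * dy
    return grad

def spring_embed(k, l, start, steps=2000, rate=0.05):
    """Plain gradient descent with a fixed step size (an assumed, untuned choice)."""
    pos = [list(p) for p in start]  # copy so the caller's list is not modified
    for _ in range(steps):
        grad = gradient(pos, k, l)
        for i in range(len(pos)):
            pos[i][0] -= rate * grad[i][0]
            pos[i][1] -= rate * grad[i][1]
    return pos

# Three nodes joined by unit springs should settle into an equilateral triangle.
k = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
l = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
final = spring_embed(k, l, [[0.0, 0.0], [2.0, 0.0], [0.5, 1.5]])
```

A fixed step size is the simplest choice. A line search or a decaying step would likely be more robust on larger graphs.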

Implementation Issues

To actually implement this algorithm, a couple of issues must be addressed:

  1. computing the $l_{ij}$ as a function of occurrences and co-occurrences, and
  2. normalizing $l_{ij}$ so the nodes fit on the canvas.

For the first issue, if there are a lot of co-occurrences, we want the nodes to be more closely linked. But a term that is mentioned frequently, like USA, would have many co-occurrences with every other term by chance alone, simply because it appears so often. To address this issue, some normalization is done. Let $S$ be the total number of occurrences of all search terms. Let $R_i$ be the number of occurrences of term $i$ and $C_j$ the number of occurrences of term $j$. Finally, let $M_{ij}$ be the number of co-occurrences of terms $i$ and $j$. We define \begin{equation} A_{ij} = \frac{\left(M_{ij}/S\right)}{\left(R_i/S\right)\left(C_j/S\right)} = \frac{SM_{ij}}{R_iC_j}, \end{equation} which you can intuitively think of as the ratio of the co-occurrences we actually observed to the number of co-occurrences that we would expect if the occurrences were independent of each other.

Now, since more co-occurrences than expected should lead to a smaller ideal distance, we define \begin{equation} l_{ij} = \frac{1}{c + A_{ij}}, \end{equation} where we chose $c = 0.01$.
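As a rough illustration of the two definitions above, here is a small Python sketch that turns counts into ideal lengths. The function name `ideal_lengths`, the list `occ`, and the sample counts are hypothetical. Since $R_i$ and $C_j$ are both plain occurrence counts, the sketch assumes they can be read from a single list and that $S$ is the sum of that list.

```python
def ideal_lengths(M, occ, c=0.01):
    """Return l[i][j] = 1 / (c + A_ij), where A_ij = S * M_ij / (R_i * C_j).

    M   -- symmetric matrix of co-occurrence counts
    occ -- occurrence count of each term (assumed to supply both R_i and C_j)
    """
    S = sum(occ)  # total occurrences of all search terms
    n = len(occ)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no spring from a node to itself
            A = S * M[i][j] / (occ[i] * occ[j])
            l[i][j] = 1.0 / (c + A)
    return l

# Made-up counts for three terms, for illustration only.
occ = [100, 50, 50]
M = [[0, 25, 10],
     [25, 0, 0],
     [10, 0, 0]]
lengths = ideal_lengths(M, occ)
```

With these made-up counts, terms 0 and 1 co-occur exactly as often as independence would predict ($A_{01} = 1$), so their ideal length is $1/1.01$. Terms 1 and 2 never co-occur, so their ideal length takes the maximum value $1/c = 100$.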

Note that $l_{ij} \in (0,1/c]$, which is between $0$ and $100$ in the case that $c = 0.01.$ To plot this, we need to translate this distance into pixels. We don't want the minimum distance to be too small because then the nodes will be on top of each other. Nor do we want the maximum distance to be too big since the nodes will fall off the canvas. We apply an affine transformation from $l_{ij}$ to $[L, U].$ Let $l^* = \max_{i,j}l_{ij}$ and $l_* = \min_{i,j}l_{ij},$ and define \begin{equation} l_{ij}^\prime = L + \frac{l_{ij} - l_*}{l^* - l_*}(U - L). \end{equation} I simply chose to fix $L = 80.$ To choose $U$, I used a technique that I learned from programming contests. We want to vary $U$ until all the nodes fit in the canvas. Running the spring embedding algorithm is very expensive, however, so we can't try all possible values of $U$. Instead, assuming that a larger $U$ only spreads the nodes farther apart, we do a binary search and find the largest $U$ such that all the nodes fit. This solves the second issue.
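The rescaling and the search for $U$ might look something like the following Python sketch. The names `rescale` and `largest_upper_bound`, the tolerance `tol`, and the predicate `fits` are all hypothetical. In practice `fits(U)` would rescale the lengths, run the spring embedding, and check that every node lies on the canvas. Here a cheap stand-in predicate is used so the example runs on its own. The search is only valid under the assumption that `fits` is monotone, that is, once some $U$ is too large, every larger $U$ is too.

```python
def rescale(l, L, U):
    """Affinely map the off-diagonal entries of l onto [L, U]."""
    n = len(l)
    vals = [l[i][j] for i in range(n) for j in range(i + 1, n)]
    lo, hi = min(vals), max(vals)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if hi == lo:
                out[i][j] = float(L)  # all lengths equal: the map is degenerate
            else:
                out[i][j] = L + (l[i][j] - lo) / (hi - lo) * (U - L)
    return out

def largest_upper_bound(fits, lo, hi, tol=1.0):
    """Binary search for the largest U in [lo, hi] with fits(U) true.

    Assumes fits is monotone: true up to some threshold, false beyond it.
    """
    if not fits(lo):
        raise ValueError("layout does not fit even at the smallest U")
    if fits(hi):
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fits(mid):
            lo = mid  # mid works, so the answer is at least mid
        else:
            hi = mid  # mid is too big, so the answer is below mid
    return lo

# Stand-in predicate: pretend layouts fit whenever U is at most 300.5 pixels.
best_U = largest_upper_bound(lambda U: U <= 300.5, 80, 1000)
scaled = rescale([[0, 1, 3], [1, 0, 2], [3, 2, 0]], 80, 280)
```

Each call to `fits` costs one full spring embedding, so the binary search needs only about $\log_2\left((\text{hi} - \text{lo})/\text{tol}\right)$ embeddings rather than one per candidate pixel value.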

You can see this algorithm implemented at Country Relationships. Play around with the graph, and let me know what you think!

Analysis

If you look at Country Relationships, right away you see a cluster of Middle Eastern countries with Russia, France, Belgium, and Israel as a bridge to the United States. China also plays a central role and is closely related to Vietnam and Japan.

If you change the time period to December 2015 and click Spring Embed again to re-cluster, the Philippines and Colombia are strongly related, which seems inexplicable until you recall that Steve Harvey mixed up the winners of Miss Universe 2015.

In January 2016, North Korea appears because it claimed to have tested a hydrogen bomb. Brazil grows larger as fear of the Zika virus takes hold.

In February 2016, Cuba becomes larger due to President Obama announcing that he'll visit. Brazil also gets even larger as Zika fears intensify.

In March 2016, Belgium becomes very large due to the terrorist attack. Of course, Ireland makes its debut thanks to St. Patrick's Day.

In April 2016, Ecuador appears due to the earthquake. It so happens that Japan had earthquakes, too. News programs often group earthquake reporting together, so Ecuador and Japan appear to be closely related.

Try making your own graph by doing a matrix search here!



Happy New Year, everyone! As a way to start off the year, I thought that it would be interesting to write about something that has evolved a lot over the past year: the Republican field of presidential candidates.

At Penn, I've been working on Snapstream Searcher, which searches through closed captioning television scripts. I decided to see how the number of times each candidate is mentioned on TV has changed over the year. Check out the chart and do your own analysis here.

As you can see in the title picture, Donald Trump has surged in popularity since he announced his candidacy in June. In general, every candidate experiences a surge in mentions upon announcing his or her candidacy. Usually the surge is not sustained, though.

Many candidates lost popularity over the course of 2015. Jeb Bush lost quite a bit of ground, but perhaps no one has suffered as much as Chris Christie.

Other candidates like Ben Carson are passing fads, with a bump from October to November before fading away:

One cool feature I added is the ability to zoom. The D3 brush calculates the coordinates of the selection, and then I update the scales and axes. The overflow is hidden with SVG clipping. To illustrate the usefulness of this feature, we can focus in on the September debate. Here, we see Carly Fiorina's bump in popularity due to her strong debate performance.

Another cool feature that I added is the ability to see actual data points. If you hold the Control key and hover over a point, you can see a tooltip.

Play around with it here, and let me know what you think!



Back when I lived in Boston, I was an avid CrossFitter. For a variety of reasons, mainly financial, I no longer go to a CrossFit box, but I'm still interested in the sport. Being a bit of a data nerd, I've always been curious about what makes an elite CrossFitter and how much height and weight play a role. To satisfy my curiosity, I scraped data on the top 2,000 athletes from CrossFit Games, and created a pivot chart in D3.js, where you can compare statistics on workouts and lifts by different groups of athletes.

Play around with the data yourself at 2015 CrossFit Open Pivot Chart. Be careful. The data may not be that reliable. If there are a lot of outliers, it may be better to use a robust statistic like median instead of mean (in particular, Sprint 400m and Run 5k workouts seem to have this problem). If you don't choose your groups wisely, you may fall into Simpson's Paradox by excluding important data. For example, from the chart below, one may conclude that back squat strength decreases with age.

Mean Back Squat by Age Group

But now, when we consider gender, we have an entirely different story:

Mean Back Squat by Gender and Age Group

Unsurprisingly, women back squat less than men do. Back squat strength remains stable with age for women, and if anything, back squat strength actually increases slightly with age for men. Whoa, what's going on here? Check this out:

Gender and Age of Top CrossFit Athletes

Notice that the 3 rightmost female bars (red) are taller than the 3 rightmost male bars (blue). On the other hand, the 3 leftmost female bars are shorter than the 3 leftmost male bars. Thus, it seems that women age better than men in the CrossFit world. This means that there are more women than men in the older age groups, so the average back squat of those groups appears to be lower, when in reality, there is simply a greater proportion of women in those groups. You can reach the same conclusion from the title picture, where the bars are stacked.
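To see how the gender mix alone can drag down a pooled average, here is a small Python sketch with made-up numbers. The counts and mean back squats below are hypothetical and are not taken from the CrossFit data. They are chosen only so that women stay flat, men improve slightly, and the older group contains proportionally more women.

```python
def pooled_mean(groups):
    """Count-weighted mean of a list of (count, mean) pairs."""
    total = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total

# Hypothetical (count, mean back squat in lb) per gender; not real data.
younger = {"men": (800, 400.0), "women": (200, 250.0)}
older = {"men": (300, 410.0), "women": (700, 250.0)}

younger_mean = pooled_mean(list(younger.values()))  # (800*400 + 200*250) / 1000
older_mean = pooled_mean(list(older.values()))      # (300*410 + 700*250) / 1000

print(f"younger pooled mean: {younger_mean:.0f} lb")
print(f"older pooled mean:   {older_mean:.0f} lb")
```

In this toy example the pooled mean falls from 370 lb to 298 lb, even though the men's mean rose from 400 to 410 lb and the women's mean did not change. The entire drop comes from the shift in group composition.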

Height and Weight

There definitely seems to be a prototypical build for an elite CrossFit athlete. For men it's about 5'10" and 200 lb, with a lot of athletes just over 6 feet, too.

Male Height and Weight

For women, most athletes seem to be about 5'6" and 145 lb, which happen to be my dimensions. Smaller female athletes that barely break 5 feet are pretty well-represented, too. I was somewhat surprised at the lack of taller women.

Female Height and Weight

Some open workouts like 1A, which was a one-rep max clean and jerk, favored larger athletes:

Open Workout 1A by Height (One-rep Max Clean & Jerk)

Other open workouts like 4, which was an ascending rep ladder of cleans and handstand push-ups, favored smaller athletes:

Open Workout 4 by Height (Cleans and Handstand Push-ups)

You can check the other workouts yourself, but overall, this year's open seemed fair with regard to athlete size.

Anyway, feel free to play with the data yourself here. Let me know if you find anything cool and if you have any suggestions for improving usability.