Interactive Data Visualization with D3

URL: https://gist.github.com/ppham27

By Philip Pham

Nim: The Game

I wrote this game primarily to supplement my blog post Sequential Game of Perfect Information: Nim and More. It was also a good way to learn some of the nuances of D3 v4, and the brush in particular was tricky to work out. The game ended up being a great example of how one can use the enter and exit methods in D3 to reflect changes in the UI, too.

Country Relationships

This data visualization is part of my Snapstream Searcher project, which searches through closed-captioning television scripts. Countries are considered related if they frequently co-occur, that is, they are mentioned within 300 characters of each other. Some normalization is done to account for the fact that certain countries are mentioned more frequently than others; without it, those countries would appear related to other countries by chance.
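
As a rough illustration of the co-occurrence rule, here is a minimal sketch in plain JavaScript. It is not the Snapstream Searcher code: the function names `findMentions` and `cooccurrenceCounts` are made up for this example, distance is assumed to be measured between the starting offsets of the two names, and the normalization step is left out because its details are not described here.

```javascript
// Hypothetical helper: character offsets of every occurrence of `name` in `text`.
function findMentions(text, name) {
  const positions = [];
  let idx = text.indexOf(name);
  while (idx !== -1) {
    positions.push(idx);
    idx = text.indexOf(name, idx + 1);
  }
  return positions;
}

// For each pair of countries, count the pairs of mentions whose starting
// offsets are at most `window` characters apart (assumed meaning of
// "within 300 characters"). No normalization is applied.
function cooccurrenceCounts(text, countries, window = 300) {
  const mentions = countries.map(c => findMentions(text, c));
  const counts = {};
  for (let i = 0; i < countries.length; ++i) {
    for (let j = i + 1; j < countries.length; ++j) {
      let n = 0;
      for (const p of mentions[i]) {
        for (const q of mentions[j]) {
          if (Math.abs(p - q) <= window) ++n;
        }
      }
      counts[`${countries[i]}|${countries[j]}`] = n;
    }
  }
  return counts;
}
```

Raw counts like these favor countries that are mentioned often, which is why some normalization is needed before treating the counts as relationship strengths.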

To cluster the countries, I used a spring embedding algorithm, which places a spring between each pair of countries. If two countries are strongly related, the spring will be shorter. If the two countries are not related at all, the spring will be very long. The layout is calculated by running a steepest descent algorithm to minimize the potential energy of these springs. It's possible that the algorithm gets stuck in a local minimum, so you may want to reset or randomize the layout and rerun the spring embedding algorithm by clicking on the buttons.
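
The idea can be sketched in a few lines of plain JavaScript. This is an illustration under stated assumptions, not the code behind the visualization: it assumes each spring has energy (d - L)^2 / 2, where d is the current distance between two nodes and L is the spring's rest length, and the names `springEnergy`, `springEmbed`, `stepSize`, and `iterations` are invented for the example.

```javascript
// Total potential energy, assuming each spring contributes (d - L)^2 / 2.
function springEnergy(pos, restLength) {
  let energy = 0;
  for (let i = 0; i < pos.length; ++i) {
    for (let j = i + 1; j < pos.length; ++j) {
      const d = Math.hypot(pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]);
      energy += 0.5 * (d - restLength[i][j]) ** 2;
    }
  }
  return energy;
}

// Plain gradient descent with a fixed step size.
// pos: array of [x, y]; restLength: symmetric matrix of rest lengths.
function springEmbed(pos, restLength, stepSize = 0.05, iterations = 500) {
  let cur = pos.map(p => p.slice());
  for (let it = 0; it < iterations; ++it) {
    const grad = cur.map(() => [0, 0]);
    for (let i = 0; i < cur.length; ++i) {
      for (let j = i + 1; j < cur.length; ++j) {
        const dx = cur[i][0] - cur[j][0];
        const dy = cur[i][1] - cur[j][1];
        const d = Math.hypot(dx, dy) || 1e-9; // avoid dividing by zero
        const f = (d - restLength[i][j]) / d;  // stretched: f > 0, compressed: f < 0
        grad[i][0] += f * dx; grad[i][1] += f * dy;
        grad[j][0] -= f * dx; grad[j][1] -= f * dy;
      }
    }
    // Move every node a small step against the gradient.
    cur = cur.map((p, i) => [p[0] - stepSize * grad[i][0], p[1] - stepSize * grad[i][1]]);
  }
  return cur;
}
```

Because the result depends on the starting positions, calling `springEmbed` again from a randomized layout is the code-level equivalent of the reset and randomize buttons.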

You can see how these relationships change over time by selecting a time period. Each time corresponds to one month of data. Countries that newly appear in the top 25 will be colored in red. The size of each node corresponds to the absolute number of occurrences of that country.

A more in-depth discussion can be found on my blog, Country Relationships and Spring Embedding.

2015 CrossFit Open Pivot Chart

This is a pivot chart in D3.js to visualize data from the top 1,000 male and female athletes in the 2015 CrossFit Open. You can view statistics on workouts and lifts by groups of athletes based on attributes like height, weight, age, region, or 2015 CrossFit Open rank. Using the controls on the right side, you can modify what data is displayed. The control panel is scrollable. Scroll down to access the Metric and Filters. Other controls, such as the Zoom options and the Order of the groups, are hidden in dropdown menus.

Groups determine how many bars there are. There will be a bar for each possible combination from the groups selected. Value and Metric determine the height of the bars. If Sum is selected as a metric, bars can be stacked. Filters exclude data if you find there are too many groups or you want to focus on a particular subset of athletes. For instance, you may only want to look at female athletes or athletes ranked in the top 100. The group Order can be changed by dragging and dropping the Order list. If you exchange Gender and Age Group, the bars will be ordered by Age Group with female and male side-by-side. Mouse over the bars to see a tooltip with more detailed information about the group, like the number of athletes in the group.
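
To make the roles of Groups, Value, Metric, and Filters concrete, here is a small sketch of the aggregation a pivot chart like this performs. It is not the chart's actual code: the function name `pivot`, the field names (`gender`, `ageGroup`, `deadlift`), and the sample rows in the usage comment are invented for illustration, and only two metrics are shown.

```javascript
// Group rows by every combination of the selected group keys, then
// reduce the chosen value with the chosen metric ('sum' or 'mean').
// The order of `groupKeys` determines how the bars are labeled and ordered.
function pivot(rows, groupKeys, valueKey, metric = 'sum') {
  const groups = new Map();
  for (const row of rows) {
    const key = groupKeys.map(k => row[k]).join(' / ');
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(row[valueKey]);
  }
  const bars = {};
  for (const [key, values] of groups) {
    const sum = values.reduce((a, b) => a + b, 0);
    bars[key] = metric === 'mean' ? sum / values.length : sum;
  }
  return bars;
}

// Usage with made-up data: a filter is just a predicate applied first.
// const rows = [
//   { gender: 'F', ageGroup: '18-24', deadlift: 300 },
//   { gender: 'M', ageGroup: '18-24', deadlift: 450 },
// ];
// pivot(rows.filter(r => r.gender === 'F'), ['ageGroup', 'gender'], 'deadlift', 'mean');
```

Each key of the returned object corresponds to one bar, and swapping the order of the group keys changes which attribute the bars are primarily ordered by.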

For zooming, you may find that you want to zoom along only one axis. Expand the zoom menu by clicking on Zoom and check or uncheck the appropriate boxes. You can reset the zoom by clicking the Reset Zoom button.

Here's a blog post showing some examples of how this chart can be used: http://www.phillypham.com/CrossFit%20and%20Pivot%20Charts. Data was obtained from http://games.crossfit.com/. I have no idea how reliable it is. In particular, a lot of the Sprint 400m and Run 5k data seem inaccurate.

Infection Visualization

This is the visualization that accompanies Infection. You can find the code at GitHub. This dataset consists of 6 components. 5 of the components have 5 nodes, and the big component has 23 nodes. The number of infections has been limited to 26. Press the infect button repeatedly to see how to optimally infect the nodes. We want all nodes in a component to share the same state. As you can see, it is possible to infect 25 nodes with this restriction. This solution was found through dynamic programming. Note that a greedy algorithm would have failed here since it would have selected the largest component, which has 23 nodes. After infecting this component, we would not be able to infect any other components because that would exceed 26 infections.
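
The contrast between the two approaches can be reproduced with a short sketch. This is one possible dynamic program (a subset-sum over component sizes), written for illustration and not necessarily the one used in the actual solution; the function names `maxInfected` and `greedyInfected` are made up.

```javascript
// Subset-sum DP: choice[c] holds the indices of components whose sizes add
// up to exactly c infections, or null if c is unreachable.
function maxInfected(sizes, limit) {
  const choice = new Array(limit + 1).fill(null);
  choice[0] = [];
  sizes.forEach((size, idx) => {
    // Iterate downward so each component is used at most once.
    for (let c = limit; c >= size; --c) {
      if (choice[c] === null && choice[c - size] !== null) {
        choice[c] = choice[c - size].concat(idx);
      }
    }
  });
  for (let c = limit; c >= 0; --c) {
    if (choice[c] !== null) return { total: c, components: choice[c] };
  }
}

// Greedy baseline: take the largest components first while they still fit.
function greedyInfected(sizes, limit) {
  let total = 0;
  for (const size of sizes.slice().sort((a, b) => b - a)) {
    if (total + size <= limit) total += size;
  }
  return total;
}

const sizes = [23, 5, 5, 5, 5, 5];           // component sizes from this dataset
console.log(maxInfected(sizes, 26).total);   // 25: the five small components
console.log(greedyInfected(sizes, 26));      // 23: only the big component fits
```

The DP runs in time proportional to the number of components times the infection limit, which is tiny for this dataset.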

Some interesting features of this visualization that can be reused:

  • the use of SVG markers for arrowheads
  • nodes are draggable, with a node's position being fixed on dragstart
  • the infection count is interpolated, so it counts up smoothly

Suffix Array

This visualization shows how a suffix array can be constructed in O(N log(N) log(N)) time. At each step, we sort the suffixes using their first 2^i characters, starting at i = 0. More concretely, assume that the suffixes have already been sorted by their first 2^i characters. Take any suffix and remove its first 2^i characters. The remaining string is also a suffix, whose rank by its first 2^i characters is already known. So, we can sort by the first 2^(i+1) characters by comparing a pair of ranks, which is just a pair of integers. In this way, we avoid the O(N) cost of comparing two suffixes character by character. Each sort takes O(N log(N)) time, and since we need to sort log(N) times, the total running time is O(N log(N) log(N)).

The comparison sort is an implementation of quicksort, where partitions are selected with the median-of-three rule.
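
The doubling scheme can be written compactly in plain JavaScript. This is a minimal sketch, not the code behind the visualization: it uses the built-in `Array.prototype.sort` instead of a hand-written median-of-three quicksort, uses character codes as the initial ranks (the i = 0 step), and the function name `suffixArray` is made up.

```javascript
// Returns the starting indices of the suffixes of `s` in sorted order.
function suffixArray(s) {
  const n = s.length;
  const sa = Array.from({ length: n }, (_, i) => i);
  // Ranks by the first character: the character codes themselves.
  let rank = Array.from({ length: n }, (_, i) => s.charCodeAt(i));
  for (let k = 1; ; k *= 2) {
    // Compare suffixes by (rank of first k chars, rank of the next k chars).
    // A suffix that runs out of characters gets -1 and sorts first.
    const second = i => (i + k < n ? rank[i + k] : -1);
    const cmp = (a, b) =>
      rank[a] !== rank[b] ? rank[a] - rank[b] : second(a) - second(b);
    sa.sort(cmp);
    // Re-rank: suffixes with equal pairs share a rank.
    const next = new Array(n).fill(0);
    for (let j = 1; j < n; ++j) {
      next[sa[j]] = next[sa[j - 1]] + (cmp(sa[j - 1], sa[j]) < 0 ? 1 : 0);
    }
    rank = next;
    // Stop once every suffix has a distinct rank or the window covers the string.
    if (n === 0 || rank[sa[n - 1]] === n - 1 || k >= n) break;
  }
  return sa;
}

console.log(suffixArray("banana")); // [5, 3, 1, 0, 4, 2]
```

Each comparison inside the sort looks at two integers per suffix, which is what keeps a single round at O(N log(N)).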

2016 Republican Candidates on TV in 2015

This is a D3.js visualization of the number of mentions 2016 Republican candidates received on television in 2015.

It's part of a larger effort to build a tool to search and visualize closed-captioning television scripts, Snapstream Reader.

The variable definitions are

  • Total Matches: The total number of times the candidate's name appeared on a given day.
  • Programs: The number of television programs in which the candidate's name appeared on a given day.
  • Contexts: The number of television programs excluding commercials and reruns in which the candidate's name appeared on a given day.

Moving average takes the average of a moving window of 7 days to smooth out the lines.
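
A 7-day moving average can be sketched as follows. How the chart treats the first few days is not stated, so this example assumes a trailing window that simply averages over however many days are available at the start; the function name `movingAverage` is made up.

```javascript
// Trailing moving average: out[i] is the mean of the last `windowSize`
// values up to and including day i (fewer at the start of the series).
function movingAverage(values, windowSize = 7) {
  const out = [];
  let sum = 0;
  for (let i = 0; i < values.length; ++i) {
    sum += values[i];
    if (i >= windowSize) sum -= values[i - windowSize]; // drop the day leaving the window
    out.push(sum / Math.min(i + 1, windowSize));
  }
  return out;
}

console.log(movingAverage([1, 2, 3, 4], 2)); // [1, 1.5, 2.5, 3.5]
```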

Percentage normalizes each day so all the data points sum up to 100. This allows for comparisons of relative popularity.
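
As an illustration of the percentage view, the sketch below rescales each day so that the values across candidates sum to 100. The function name `toPercentages`, the input shape (one array of daily counts per candidate), and the choice to report 0 for days with no mentions at all are assumptions made for this example.

```javascript
// series: object mapping a candidate's name to an array of daily counts,
// with all arrays the same length. Returns the same shape in percent.
function toPercentages(series) {
  const names = Object.keys(series);
  const days = names.length ? series[names[0]].length : 0;
  const out = {};
  names.forEach(name => { out[name] = []; });
  for (let d = 0; d < days; ++d) {
    const total = names.reduce((acc, name) => acc + series[name][d], 0);
    names.forEach(name => {
      // Assumed convention: a day with no mentions at all maps to 0 for everyone.
      out[name].push(total === 0 ? 0 : (100 * series[name][d]) / total);
    });
  }
  return out;
}

console.log(toPercentages({ A: [1, 0], B: [3, 0] })); // { A: [25, 0], B: [75, 0] }
```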

Some neat interactive features:

  • I used the D3 brush and SVG clipping to allow for select and zoom.
  • If you hold the control key and hover over a data point, a tooltip will appear.
  • Hovering over a name in the legend will highlight the candidate's line.

See a more detailed exposition on my blog, PhillyPham.