Perils of Balance Testing and $p$-Hacking in Experimental Design

URL: https://github.com/ppham27/cheating-linear-models-simulations

By Philip Pham

These experiments test the hypothesis that it is extremely easy to select a subset of covariates to obtain a desired level of significance. This work was part of my master's thesis and a publication in The American Statistician, The Perils of Balance Testing in Experimental Design: Messy Analyses of Clean Data.

My primary contribution is a dynamic programming algorithm that approximates the optimal $p$-value obtainable by selecting a subset of $k$ covariates. In this setting, random covariates are generated with no correlation with the response. If there are $T$ covariates, it's infeasible to test all $2^T$ possible subsets, so I used a greedy dynamic programming approximation. Let $S_{t,k}$ be the best subset of size $k$ found after seeing the first $t$ covariates. Upon generating the $(t + 1)$th covariate, $x_{t + 1}$, we let

\begin{equation} S_{t + 1, k} = \textrm{arg min}_{S \in \left\{S_{t, k},\, S_{t, k - 1} \cup \{x_{t + 1}\}\right\}} \textrm{test}\left(S\right), \end{equation} where the output of $\textrm{test}$ is the $p$-value.
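A minimal sketch of this recursion in C++ with Armadillo and Boost follows. Here `test` is assumed to be the overall F-test of a regression of the response on the selected covariates plus an intercept; the actual test statistic in the paper may differ, and the names `test` and `best_subset` are illustrative, not the repository's API.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

#include <armadillo>
#include <boost/math/distributions/fisher_f.hpp>

// p-value of the overall F-test when regressing y on the columns of X
// indexed by `subset`, with an intercept. An assumed stand-in for the
// paper's test function.
double test(const arma::mat& X, const arma::vec& y,
            const std::vector<arma::uword>& subset) {
  const arma::uword n = y.n_elem, k = subset.size();
  arma::mat design(n, k + 1, arma::fill::ones);  // Column 0 is the intercept.
  for (arma::uword j = 0; j < k; ++j) design.col(j + 1) = X.col(subset[j]);
  const arma::vec beta = arma::solve(design, y);  // Least-squares fit.
  const double rss = arma::accu(arma::square(y - design * beta));
  const double tss = arma::accu(arma::square(y - arma::mean(y)));
  const double f = ((tss - rss) / k) / (rss / (n - k - 1));
  const boost::math::fisher_f dist(k, n - k - 1);
  return 1.0 - boost::math::cdf(dist, f);  // Upper-tail p-value.
}

// Greedy dynamic program: states[k] holds (p-value, subset) for the best
// subset of size k seen so far. When covariate t arrives, the best size-k
// subset either stays S_{t,k} or becomes S_{t,k-1} with x_{t+1} appended.
std::pair<double, std::vector<arma::uword>> best_subset(
    const arma::mat& X, const arma::vec& y, arma::uword K) {
  std::vector<std::pair<double, std::vector<arma::uword>>> states(
      K + 1, {1.0, {}});
  for (arma::uword t = 0; t < X.n_cols; ++t) {
    // Descend in k so states[k - 1] still refers to step t, not t + 1.
    for (arma::uword k = std::min(K, t + 1); k >= 1; --k) {
      std::vector<arma::uword> candidate = states[k - 1].second;
      candidate.push_back(t);
      const double p = test(X, y, candidate);
      if (p < states[k].first) states[k] = {p, std::move(candidate)};
    }
  }
  return states[K];
}
```

Each new covariate updates at most $K$ states, so the scan fits $O(TK)$ models rather than the $2^T$ an exhaustive search would require.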

It was quite interesting to learn how to do linear algebra with Armadillo and statistics with Boost, both C++ libraries. My previous experience writing mathematical and statistical code had all been in R and Python.