Archive of Stanford MS Statistics Projects

This repo holds the projects I worked on for my M.S. in Statistics at Stanford.

Generating Robustness: Exploring Various Ways to Adapt Question Answering to New Domains

Class: CS 224N - Natural Language Processing with Deep Learning

Code: https://github.com/qqlabs/cs224n-project

Question Answering (QA), the task of answering a question correctly given a passage, is one of the most promising areas in NLP. However, state-of-the-art QA models tend to overfit to their training data and generalize poorly to new domains, so they require additional training on domain-specific datasets to adapt. In this project, we aim to design a QA system that is robust to domain shift and performs well on out-of-domain (OOD) few-shot data.

We implement a variety of techniques that boost the robustness of a QA model trained with domain adversarial learning and evaluated on out-of-domain data, yielding a 16% increase in F1 score on the development set and a 10% increase on the test set. We find that the following innovations boost model performance: 1) finetuning the model on augmented out-of-domain data, 2) aggregating Wikipedia-type datasets during adversarial training to simplify the domain discriminator’s task, and 3) supplementing the training data with synthetic QA pairs generated with round-trip consistency. We also ensemble the best-performing models on each dataset and find that ensembling yields further performance increases.
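
As a rough illustration of the round-trip consistency idea in item 3, the sketch below keeps a synthetic QA pair only when a QA model, given the generated question, recovers the answer the pair was generated from. The names `roundtrip_filter`, `answer_fn`, and `normalize` are hypothetical and are not taken from the project code; the exact-match comparison is also an assumption, and a real pipeline might use a softer criterion such as F1 overlap.

```python
from typing import Callable, Iterable, List, Tuple

# (context, question, answer); a hypothetical layout for synthetic pairs.
QAPair = Tuple[str, str, str]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def roundtrip_filter(
    candidates: Iterable[QAPair],
    answer_fn: Callable[[str, str], str],
) -> List[QAPair]:
    """Keep only pairs whose answer is recovered by the QA model.

    `answer_fn(context, question)` stands in for any trained QA model;
    it is an assumption of this sketch, not an API from the project.
    """
    kept = []
    for context, question, answer in candidates:
        predicted = answer_fn(context, question)
        if normalize(predicted) == normalize(answer):
            kept.append((context, question, answer))
    return kept
```

Any callable that maps a context and a question to an answer string can be passed as `answer_fn`, which keeps the filter independent of the particular QA model.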

Enhancing Short Term Air Quality Predictions with Location & Meteorological Data

Link to Report

Class: STATS 207 - Time Series Analysis

Code (in Python): Google Colab

We sought to build a 3-day-ahead prediction of the air quality index (AQI) for Santa Clara County. We approached the problem with increasingly complex models (ARMA, VARIMA, LSTM) and measured the performance gain from each with a sliding-window cross-validation strategy. We specifically included AQI and meteorological features from surrounding counties and found that performance improved when relevant features were used.
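
A minimal sketch of sliding-window cross-validation for a forecasting task is shown below. The window sizes, the `persistence` baseline, and all function names are assumptions made for illustration; they are not the settings or models used in the report.

```python
def sliding_window_splits(n_obs, train_size, horizon, step=1):
    """Yield (train_indices, test_indices) for a fixed-width sliding window.

    Each training window is followed immediately by a test window of
    `horizon` observations, and the pair slides forward by `step`.
    """
    start = 0
    while start + train_size + horizon <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + horizon)
        yield train, test
        start += step


def windowed_mae(series, forecaster, train_size, horizon, step=1):
    """Mean absolute error of `forecaster` averaged over all windows."""
    errors = []
    for train_idx, test_idx in sliding_window_splits(
        len(series), train_size, horizon, step
    ):
        history = [series[i] for i in train_idx]
        forecast = forecaster(history, horizon)
        errors.extend(abs(series[i] - f) for i, f in zip(test_idx, forecast))
    if not errors:
        raise ValueError("series is too short for the requested windows")
    return sum(errors) / len(errors)


def persistence(history, horizon):
    """Hypothetical baseline: repeat the last observed value."""
    return [history[-1]] * horizon
```

Because every model is scored on the same sequence of windows, the errors of a simple baseline and of a more complex model can be compared directly.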

Our presentation slides can be found here.

Impact of COVID Misinformation on Vaccination - A Reanalysis Identifying and Addressing Covariate Imbalances

Link to Report

Class: STATS 209 - Causal Inference

Code (in R): Google Colab

We reanalyzed a randomized controlled trial (Original Paper, Original Github) that exposed participants to COVID misinformation and measured its impact on vaccination intent. We evaluated the study’s randomization and showed that it is significantly imbalanced (p-value < 0.0001) using a Monte Carlo simulation of the Mahalanobis distance between Treatment and Control. We then reduced the bias of the estimates by applying matching estimators and performing regression adjustment using Lin’s Estimator. We also explored heterogeneous treatment effects and provided some intuitive insights.
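
One way such a balance check could be sketched is given below. The analysis itself was done in R; this Python version is only an illustration, and it assumes the Monte Carlo step re-draws the treatment assignment by shuffling labels and compares each simulated Mahalanobis distance with the observed one. All function names are hypothetical.

```python
import random


def mean_vector(rows):
    """Column means of a list of equal-length covariate rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows):
    """Sample covariance matrix (n - 1 denominator) of the covariates."""
    n, mu = len(rows), mean_vector(rows)
    k = len(mu)
    return [
        [sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
         for j in range(k)]
        for i in range(k)
    ]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mahalanobis(X, treated):
    """Mahalanobis distance between treatment and control covariate means."""
    t_rows = [x for x, t in zip(X, treated) if t]
    c_rows = [x for x, t in zip(X, treated) if not t]
    diff = [a - b for a, b in zip(mean_vector(t_rows), mean_vector(c_rows))]
    scale = 1 / len(t_rows) + 1 / len(c_rows)
    cov = [[v * scale for v in row] for row in covariance(X)]
    return sum(d * z for d, z in zip(diff, solve(cov, diff)))


def randomization_p_value(X, treated, n_draws=2000, seed=0):
    """Share of shuffled assignments at least as imbalanced as the observed one."""
    rng = random.Random(seed)
    observed = mahalanobis(X, treated)
    labels = list(treated)
    exceed = 0
    for _ in range(n_draws):
        rng.shuffle(labels)  # keeps group sizes fixed, as in a completely randomized design
        if mahalanobis(X, labels) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_draws + 1)
```

A small p-value here means that few simulated randomizations are as imbalanced as the observed one. The `+ 1` in the numerator and denominator keeps the estimate strictly positive, so the smallest reportable value depends on `n_draws`.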

Experiment - Can I Pay2Win for First Person Shooting Games?

Link to Report

Class: STATS 263 - Design of Experiments

Code (in R): Google Colab

We wanted to understand which factors can improve an average person’s skill in shooting games, and whether investing in better equipment (gaming mouse, high-refresh-rate monitor) makes a significant difference. In addition, we wanted to see whether a stimulant like coffee could further enhance a player’s performance.

We structured our experiment as a combination of a strip-plot and stepped-wedge design. There were many logistical details that we considered during the design, ranging from how to parallelize the runs to whether we should serve hot or cold coffee.

This project focused on the design of the experiment and the data collection process. A proper experimental design drastically simplified what we needed to analyze.

Stock Price Predictions

Link to Report

Class: STATS 202 - Data Mining and Analysis

Code (in R): Google Colab

We were tasked with making a 9-day forecast at 5-second granularity for 9 anonymized stock tickers. After exploring baseline and ARIMA models, we ultimately structured the problem as a direct forecasting problem and used the forecast-ml package to train and make predictions. Note that this class did not actually cover time series analysis, so we had to convert the problem into a structure we were familiar with.
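
To make the direct forecasting setup concrete, the sketch below turns a series into a supervised-learning table: each row holds the most recent `n_lags` values, and the target is the value `horizon` steps after the last lag. A separate table (and model) is built for each horizon. This is a plain-Python illustration with hypothetical names, written in Python rather than the project's R; it does not reproduce the package's API.

```python
def make_direct_dataset(series, n_lags, horizon):
    """Build (features, targets) for one forecast horizon.

    features[k] holds `n_lags` consecutive values of the series and
    targets[k] is the value `horizon` steps after the last of them.
    """
    features, targets = [], []
    for t in range(n_lags, len(series) - horizon + 1):
        features.append(list(series[t - n_lags:t]))
        targets.append(series[t + horizon - 1])
    return features, targets


def make_datasets_for_horizons(series, n_lags, horizons):
    """One supervised dataset per horizon, keyed by the horizon length."""
    return {h: make_direct_dataset(series, n_lags, h) for h in horizons}
```

Once the data is in this shape, any standard regression method can be fit per horizon, which is what lets the problem be handled without dedicated time series models.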
