Principal Component Analysis (PCA):
Now we seek to understand the degree of correlation among the 90 input features (80 search features and 10 dynamics features). For this, we compute a 90 × 90 covariance matrix in which each entry is the covariance between a pair of features. The covariance between two feature columns $x$ and $y$, denoted here by $Q$, is defined and computed according to the following equation:

$$Q(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Next, we calculate the eigenvalues and eigenvectors of the covariance matrix to obtain the Principal Components (PCs): linear combinations of the input features constructed to be mutually uncorrelated. Since we have 90 input features, we obtain 90 principal components. However, our model does not take all 90 components as input, as many of them have low predictive power and could lead our models to overfit. Instead, we choose the minimum number of components that explain at least 95% of the variance. After PCA, we found that roughly 95% of the variance was explained by 10 components, as shown in the graph.
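As a minimal sketch of this selection step, scikit-learn's `PCA` accepts a fractional `n_components` and keeps the minimum number of components whose explained variance reaches that threshold; the placeholder data and matrix shape below are illustrative assumptions, not values from our dataset:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder for the (n_samples, 90) feature matrix:
# 80 search features followed by 10 dynamics features.
X = np.random.rand(500, 90)

# Standardize so no single feature dominates the covariance structure.
X_std = StandardScaler().fit_transform(X)

# The 90 x 90 covariance matrix described in the text.
cov_matrix = np.cov(X_std, rowvar=False)

# A float in (0, 1) tells scikit-learn to keep the minimum number of
# components whose cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)

print(f"Components kept: {pca.n_components_}")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.3f}")
```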
As is the case with most behavioral data, there is a non-trivial amount of statistical noise inherent in the data. To combat this, we apply the recursive exponential smoothing algorithm to each trends data column, transforming it into one with less noise:

$$F_t = \alpha y_t + (1 - \alpha) F_{t-1}$$

Here $F_t$ is the transformed (smoothed) value, $y_t$ is the pre-transformation value, and $\alpha$ is an experimentally determined smoothing constant.
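A minimal sketch of this recursion, assuming the column is held in a pandas `Series`; the value α = 0.3 is a placeholder rather than the experimentally determined constant:

```python
import numpy as np
import pandas as pd

def exponential_smoothing(y: pd.Series, alpha: float) -> pd.Series:
    """Apply F_t = alpha * y_t + (1 - alpha) * F_{t-1}, with F_0 = y_0."""
    smoothed = np.empty(len(y))
    smoothed[0] = y.iloc[0]
    for t in range(1, len(y)):
        smoothed[t] = alpha * y.iloc[t] + (1 - alpha) * smoothed[t - 1]
    return pd.Series(smoothed, index=y.index)

# Illustrative noisy column; alpha = 0.3 is a placeholder value.
noisy = pd.Series(np.sin(np.linspace(0, 6, 100)) + np.random.normal(0, 0.2, 100))
smooth = exponential_smoothing(noisy, alpha=0.3)

# pandas implements the same recursion via exponentially weighted means.
smooth_pd = noisy.ewm(alpha=0.3, adjust=False).mean()
assert np.allclose(smooth, smooth_pd)
```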
Imputation and Data Shifting:
To overcome the problem of missing data, we use a process known as multiple imputation. Specifically, each missing value is estimated by running a least-squares regression for that datapoint a fixed number of times and averaging the results; this average is then used in place of the missing value. Finally, to align the dataset with our objective of predicting outbreaks 4 days in advance, we shift the case counts by 4 days.
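A minimal sketch of both steps, assuming a pandas DataFrame with placeholder feature columns and a `cases` column; the text does not name a library, so scikit-learn's `IterativeImputer` (a regression-based imputer with posterior sampling) stands in here for the regress-and-average procedure:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Illustrative frame: five feature columns with random gaps, plus a
# 'cases' column (names and values are placeholders, not our data).
rng = np.random.default_rng(0)
X = rng.random((200, 5))
X[rng.random(X.shape) < 0.05] = np.nan  # knock out ~5% of entries
df = pd.DataFrame(X, columns=[f"f{i}" for i in range(5)])
df["cases"] = rng.poisson(50, 200)

# Regression-based multiple imputation: run the imputer several times
# with posterior sampling and average the results, mirroring the
# fixed-runs-then-average step described in the text.
feature_cols = [c for c in df.columns if c != "cases"]
runs = [
    IterativeImputer(sample_posterior=True, random_state=seed)
    .fit_transform(df[feature_cols])
    for seed in range(5)  # a fixed number of imputation runs
]
df[feature_cols] = np.mean(runs, axis=0)

# Shift the case counts so each row's features align with the cases
# observed 4 days later, giving a 4-day-ahead prediction target.
df["target_cases"] = df["cases"].shift(-4)
df = df.dropna(subset=["target_cases"])  # last 4 rows have no target
```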