Outbreak Detection from Search Data

Development of an algorithm for the early detection of food-borne illnesses using internet search data: a Generalized Additive Model (GAM) and an Artificial Neural Network (ANN) model to predict outbreaks.


Food-borne illnesses are a major problem worldwide. In the US alone, 1 in 6 people suffers from a food-borne illness each year, and these illnesses cause 420,000 deaths worldwide and 3,000 deaths in the US. Further, food-borne illnesses cost the US economy close to 78 billion dollars.



To help prevent the spread of these outbreaks, local, state, national and multinational organizations have created protocols that can dramatically slow down an outbreak. However, as you can see in the epi-curve, the effectiveness of these protocols depends directly on how quickly the existence of an outbreak can be determined. Detection often comes late, leading to massive losses. Thus, determining the existence of an outbreak early can greatly decrease losses and help curb its spread.


In this research, we propose a model for the early warning of food-borne outbreaks by analyzing Internet search data, specifically from Google. We hypothesize that during an outbreak, people affected by it will search for terms related to the disease, causing a spike in search volume for those terms. Since individuals are likely to search for these disease-related terms before seeking medical attention, a model that can identify these spikes can identify outbreaks much more quickly than existing systems. Thus, this research focuses on developing mathematical and computational models to recognize the spikes that correspond to food-borne outbreaks.


Using data obtained from the CDC, we looked for symptoms that are not only common but also pertain mostly to food-borne illness. From this analysis, we chose the following 5 symptoms as symptom search terms in our model: nausea, vomiting, stomach cramps, stomach flu, and diarrhea. We also incorporate the names of the 5 most common food-borne illnesses (as per the CDC [7]) as search terms in our model. Currently, these are Norovirus, Salmonella, Clostridium perfringens, Campylobacter and Staphylococcus aureus. The Google search data is obtained from Google Trends [8] and the Twitter data is obtained from the Twitter Streaming API. The outbreak data were obtained from the CDC's National Outbreak Reporting System (NORS) [10]. As we have Google Trends and Twitter data from 2013 to 2018, only those years are taken into consideration in developing our model. Before developing the models, we have to preprocess the data. In this research, we use feature selection, feature optimization using PCA, normalization, imputation, data shifting and smoothing. We first discuss feature selection.

Feature Selection:

We are interested in the change in search volume, so instead of taking the “absolute” value of each search term as input, we take the difference between its values on consecutive days. The model uses search data for the past 5 days, which gives us 4 differences, as you can see here. Thus, each search term contributes 4 features. Because we have 10 search terms as mentioned earlier, 5 for symptoms and 5 for the illnesses themselves, we have 10 * 4 = 40 features. And because we incorporate data from both Google and Twitter, we have 40 * 2 = 80 search features overall. Our models also employ population dynamics and historical food-borne illness features, as the particular locale can greatly influence the number of cases. Specifically, our models take into account the average number of food-borne illnesses per kilometer^2 relative to population (i.e. cases/day/population density) and the average cases per day in the last 5 days.
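The differencing step can be sketched as follows. This is a minimal illustration and not the project's actual code: the function name difference_features and the toy daily values are assumptions made for the example.

```python
import numpy as np

def difference_features(series, window=5):
    """For every run of `window` consecutive days, return the
    window - 1 day-to-day differences (4 differences for 5 days)."""
    series = np.asarray(series, dtype=float)
    diffs = np.diff(series)          # diffs[i] = series[i + 1] - series[i]
    n_diffs = window - 1
    rows = [diffs[i:i + n_diffs] for i in range(len(diffs) - n_diffs + 1)]
    return np.array(rows)

# Hypothetical daily search volumes for a single term (6 days).
daily = [10, 12, 15, 11, 18, 20]
features = difference_features(daily)
print(features)  # one row of 4 differences per 5-day window
```

Repeating this for each of the 10 search terms and both data sources would yield the 80 search features described above.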

Data Preprocessing:

However, this approach leads to 80 features from search alone and even more from population dynamics. So many features can cause the model to pick up on noise rather than the actual patterns in the dataset, and they make it computationally expensive to train. To solve these problems, we use Principal Component Analysis (PCA) to determine the optimal number of features needed to predict the number of food-borne illness cases 4 days after the model is run.


In the first step of PCA, we perform z-score standardization to transform each column of the input dataset to a mean of 0 and a standard deviation of 1. This is done because (1) we do not want to penalize a particular feature solely due to its magnitude and (2) PCA is highly sensitive to the variance of the input variables. Specifically, the standardization is z_i = (x_i - μ) / σ, where z_i represents the ith standardized value, x_i represents the ith actual data point of the column, μ represents the average of the column and σ represents the standard deviation of the column.
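A minimal sketch of this standardization, applied column by column, is given below. The function name zscore, the use of the population standard deviation, and the guard for constant columns are assumptions of the example rather than details stated in this write-up.

```python
import numpy as np

def zscore(X):
    """Standardize each column to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)        # population standard deviation (assumption)
    sigma[sigma == 0] = 1.0      # leave constant columns at 0 instead of dividing by 0
    return (X - mu) / sigma

# Two hypothetical feature columns on very different scales.
Z = zscore([[1, 10], [2, 20], [3, 30]])
print(Z)
```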

Principal Component Analysis (PCA):

Now we seek to understand the degree of correlation between the 90 input features (80 search features and 10 dynamics features). For this, we compute a 90 by 90 covariance matrix in which each feature is compared against every other feature. The covariance between two features, classically represented by the letter Q, measures how the two features vary together around their means. Next, we calculate the eigenvalues and eigenvectors of the covariance matrix to compute the Principal Components (PCs). PCs are linear combinations of the inputs that are constructed to be uncorrelated with one another. As we have 90 input features, we have 90 principal components. However, our model does not take all 90 principal components as input, because many of them have low predictive power and can lead our models to overfit. We choose the minimum number of components that explain at least 95% of the variance. After PCA, it was found that ~95% of the variance was explained by 10 components, as shown in the graph.
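The PCA steps above can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions: the function name pca_95 and the synthetic data are made up for the example, and np.cov computes the sample covariance (dividing by N - 1), which may differ from the exact formula used in the project.

```python
import numpy as np

def pca_95(X, threshold=0.95):
    """Project X onto the fewest principal components whose
    cumulative explained variance reaches `threshold`."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # center each feature
    Q = np.cov(Xc, rowvar=False)               # feature-by-feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Q)       # eigh: Q is symmetric
    order = np.argsort(eigvals)[::-1]          # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    k = min(int(np.searchsorted(explained, threshold)) + 1, len(eigvals))
    return Xc @ eigvecs[:, :k], explained[:k]

# Synthetic data: 6 correlated features driven by 2 underlying signals.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 4))
               + 0.01 * rng.normal(size=(200, 4))])
scores, explained = pca_95(X)
print(scores.shape, explained)
```

In the project, the same procedure applied to the 90 standardized features is what reduces the input to 10 components.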


As is the case with most behavioral data, there is a non-trivial amount of noise inherent in the data. To combat this problem, we use the recursive exponential smoothing algorithm, F_t = α y_t + (1 - α) F_(t-1), to transform each trends data column into one with less statistical noise. Here F_t is the transformed (smoothed) value at time t, α is an experimentally determined smoothing constant and y_t is the pre-transformed value.
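The smoothing recursion can be sketched in a few lines. The function name exponential_smoothing, the choice to seed the recursion with the first observation, and the value α = 0.5 are assumptions of this example; the write-up only states that α was determined experimentally.

```python
def exponential_smoothing(y, alpha):
    """Apply F_t = alpha * y_t + (1 - alpha) * F_(t-1) to a sequence."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if len(y) == 0:
        return []
    smoothed = [float(y[0])]     # seed with the first observation (assumption)
    for value in y[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

print(exponential_smoothing([10, 20, 30], alpha=0.5))  # [10.0, 15.0, 22.5]
```

A smaller α smooths more aggressively but reacts more slowly to a genuine spike, which is the trade-off to keep in mind when tuning it.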

Imputation and Data Shifting:

To overcome the problem of missing data, we used a process known as multiple imputation. Specifically, each missing value is estimated by running least-squares regression for that data point a fixed number of times and averaging the results, and this average is used as the missing value. To align the dataset with our objective of predicting an outbreak 4 days in advance, we shift the dates of the case data by 4 days.
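Both steps can be sketched as below. This is a simplified illustration and not the project's implementation: the function names, the use of bootstrap resampling to vary each regression, and the number of rounds are all assumptions. The imputation helper also assumes that only the target cell is missing in its row.

```python
import numpy as np

def impute_by_regression(X, row, col, n_rounds=20, seed=0):
    """Estimate X[row, col] by averaging least-squares predictions,
    each fitted on a bootstrap resample of the complete rows."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    others = [j for j in range(X.shape[1]) if j != col]
    complete = ~np.isnan(X).any(axis=1)
    A = np.column_stack([np.ones(complete.sum()), X[complete][:, others]])
    b = X[complete, col]
    query = np.concatenate([[1.0], X[row, others]])
    estimates = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(b), size=len(b))   # resample with replacement
        coef, *_ = np.linalg.lstsq(A[idx], b[idx], rcond=None)
        estimates.append(query @ coef)
    return float(np.mean(estimates))

def shift_target(features, cases, lead=4):
    """Pair the features of day t with the case count of day t + lead."""
    if lead <= 0:
        raise ValueError("lead must be positive")
    features, cases = np.asarray(features), np.asarray(cases)
    return features[:-lead], cases[lead:]

# Toy column pair where the second column is 2 * first + 1, with one gap.
x = np.arange(30, dtype=float)
y = 2 * x + 1
y[10] = np.nan
estimate = impute_by_regression(np.column_stack([x, y]), row=10, col=1)
print(estimate)
```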

Mathematical Methodology:

We develop two models to predict the number of cases of food-borne illness 4 days into the future: one using an Artificial Neural Network and another using a Generalized Additive Model.

Artificial Neural Network (ANN):

We first develop an Artificial Neural Network (ANN) to predict the expected number of food-borne illness cases from Google and Twitter search data, along with population dynamics data and the frequency of food-borne illness in the area. An Artificial Neural Network is an algorithm that takes some input in the first layer, performs computations on the weighted inputs, applies activation functions in the hidden layers, and outputs a desired result in the output layer. The connections between nodes and layers generate pattern recognition capabilities similar to those generated by neurons and synapses in human brains.

Hyperparameter Computation:

Hyperparameters are the parameters that define the neural network itself, such as the number of layers and the number of nodes in each layer. They were chosen using a process known as random search, which consists of running experiments in which hyperparameter values are sampled from a normal distribution and keeping the choice that maximizes accuracy on the cross-validation set. After random search, a neural network architecture consisting of 3 hidden layers with 4 nodes each was found to be optimal. An ANN is trained by computing the weights that minimize a given cost function. In this research, we use the log loss cost function and the BFGS optimization algorithm.
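The random search loop can be sketched as follows. Everything specific here is an assumption made for illustration: the function names, the means and spreads of the normal distributions, and the toy scoring function, which stands in for training a network and measuring its cross-validation accuracy.

```python
import random

def random_search(score_fn, n_trials=25, seed=0):
    """Sample hyperparameters from normal distributions and keep
    the candidate with the highest cross-validation score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            # Rounded normal draws, clipped so each value is at least 1.
            "hidden_layers": max(1, round(rng.gauss(3, 1))),
            "nodes_per_layer": max(1, round(rng.gauss(4, 2))),
        }
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    """Hypothetical stand-in for 'train the ANN, return validation accuracy'."""
    return -(abs(params["hidden_layers"] - 3) + abs(params["nodes_per_layer"] - 4))

best, score = random_search(toy_score)
print(best, score)
```

In practice, score_fn would train the network on the training split and evaluate it on the cross-validation split.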

To train and test the model, we randomly shuffle the dataset consisting of the 10 principal components and the food-borne case data, then use the first 60% for training, the next 20% for cross validation (used to compute hyperparameter values) and the remaining 20% for testing the model itself. The Artificial Neural Network is developed using the Python programming language and the TensorFlow machine learning library. After running it through 25 iterations, an r^2 value of 0.97 was found.
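The split and the r^2 score can be sketched as below. The function names and the random seed are assumptions of the example, and the synthetic arrays merely stand in for the 10 principal components and the case counts.

```python
import numpy as np

def split_60_20_20(X, y, seed=0):
    """Shuffle the rows, then split into train / cross-validation / test."""
    X, y = np.asarray(X), np.asarray(y)
    order = np.random.default_rng(seed).permutation(len(y))
    X, y = X[order], y[order]
    a, b = int(0.6 * len(y)), int(0.8 * len(y))
    return (X[:a], y[:a]), (X[a:b], y[a:b]), (X[b:], y[b:])

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Placeholder data: 100 rows of 10 components and a case count per row.
X = np.arange(1000).reshape(100, 10)
y = np.arange(100)
train, val, test = split_60_20_20(X, y)
print(len(train[1]), len(val[1]), len(test[1]))
```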

Generalized Additive Models (GAM):

While a Neural Network is often very powerful, one significant disadvantage of this approach is that a neural network is often very hard to interpret. That is, if a stakeholder asks exactly how the neural network determines the number of cases 4 days in the future, it would be very difficult to answer. To that end, we also develop a Generalized Additive Model (GAM), a model known for its interpretability. A GAM is an extension of classic statistical techniques (e.g. polynomial regression) in that it is represented by a linear combination of terms. However, it differs from traditional techniques because it does not restrict itself to one family of functions (e.g. polynomials). Specifically, a GAM is represented as a linear combination of terms b_k(x), where each b_k(x) is a term from some family of functions.
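The idea of a linear combination of terms b_k(x) drawn from more than one family of functions can be sketched as below. This is a deliberately simplified, single-predictor illustration fitted by ordinary least squares. The choice of basis (a constant, x, x^2 and sin x) and the function names are assumptions, and a full GAM would typically use penalized smooth terms instead.

```python
import numpy as np

def basis(x):
    """Evaluate the basis terms b_k(x): constant, x, x^2 and sin(x)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, x ** 2, np.sin(x)])

def fit_additive(x, y):
    """Least-squares coefficients of the linear combination of basis terms."""
    coef, *_ = np.linalg.lstsq(basis(x), np.asarray(y, dtype=float), rcond=None)
    return coef

def predict(coef, x):
    return basis(x) @ coef

# Synthetic target mixing a polynomial term and a periodic term.
x = np.linspace(0, 6, 50)
y = 1 + 0.5 * x + 2 * np.sin(x)
coef = fit_additive(x, y)
print(np.round(coef, 3))
```

Because each coefficient is tied to one named term, the contribution of each term to the prediction can be read off directly, which is the interpretability advantage described above.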


How do the models reduce risk?

Our model could predict the risk 4 days before an outbreak happens, giving public health agencies the most valuable time in an outbreak to detect the cause and eliminate it. Most cases occur at the peak, and the model successfully flattens the curve. Based on a simulation, this chart shows the number of cases with and without the model, with red representing cases without the model and blue representing cases with the model. Those precious 4 days yield a reduction of as much as 65% in the total number of cases. Because active contact tracing has a significant cost, the resulting cost saving is as much as 27%.

Public Policy Recommendation:

Public health agencies go through a 5-step process for any food-borne outbreak.

We recommend using this model at the first step of the public health surveillance system, that is, at the “Detect” step. These two timelines depict an outbreak with and without the model. With the model, intervention or elimination of the cause can be achieved almost 8 days earlier.


The model brings down the cost of an outbreak by 27%. With the annual cost of food-borne illness in the US estimated at $77.7 billion, the model would save an estimated $20.84 billion and potentially save more than 1,900 American lives. We strongly recommend using the model in the surveillance of food-borne illness.

Partner: Shreyas Kar