The reports provided by AIDO are summaries of reports from other organizations. Because of this, our data can only be as accurate as the reports they are based on. Therefore, an assumption of AIDO is that the reports provided by health organizations are relatively complete and accurate. Every outbreak listed in AIDO provides references to the original outbreak reports.
The AIDO similarity function provides a weighted mean comparing user input to a library of historic disease outbreaks, based on properties identified by literature and analysts as important to disease progression. The point estimate placed by the similarity algorithm in AIDO is a tool to contextualize user data and should not be treated as a forecast.
Our similarity algorithm relies on the user’s ability to complete the form associated with their outbreak. When possible, we provide data sources from which additional information can be collected. However, if a user cannot provide information for a property, the associated similarity score will necessarily be lower (see the "How are empty values handled when computing the similarity score?" FAQ entry for more information). The more information a user can provide, the more accurate the algorithm will be.
Because the AIDO similarity function compares user data to a library of historic outbreaks, similarity scores rely heavily on the types and quality of the data in the libraries. We strive to provide libraries that are richly diverse and from reports with good quality data. However, the number of outbreaks per library varies substantially between diseases based on data availability.
When a user submits the form to match an outbreak to their situation, we compute a score for each outbreak in our library. This score represents how similar, on a scale from 0 to 100, the outbreak is to the user's situation. This allows the user to understand their situation in the context of historical outbreaks.
Scores are generated using a simple weighted sum,
$$s = \sum_{i=1}^{K} w_i m_i,$$
such that $\sum_{i=1}^{K} w_i = 1$ and $0 \le m_i \le 1$, which ensure that $0 \le s \le 1$.
Here, $s$ is the outbreak's similarity score, $K$ is the number of properties considered, $w_i$ is the weight of property $i$, and $m_i$ is the outbreak's match score for property $i$ (i.e., how well the outbreak's value for property $i$ matches the user's value, provided in the query form).
Note that while the equation above returns a score, s, between 0 and 1, we display scores as percentages (i.e., we display s · 100).
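As a minimal sketch of the weighted sum described above, the following Python function computes a score from per-property weights and match scores. The function name, the property names, and the numeric values are hypothetical and chosen only for illustration; they are not AIDO's actual weights.

```python
from typing import Mapping


def similarity_score(weights: Mapping[str, float],
                     match_scores: Mapping[str, float]) -> float:
    """Return the weighted sum of match scores for one outbreak (0 to 1)."""
    # The weights are assumed to sum to 1 so that the score stays in [0, 1].
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("property weights must sum to 1")
    score = 0.0
    for prop, weight in weights.items():
        match = match_scores[prop]
        if not 0.0 <= match <= 1.0:
            raise ValueError(f"match score for {prop!r} must be in [0, 1]")
        score += weight * match
    return score


# Hypothetical weights and match scores, for illustration only.
weights = {"case_count": 0.40, "time": 0.35, "location": 0.25}
match_scores = {"case_count": 0.9, "time": 0.8, "location": 0.5}

s = similarity_score(weights, match_scores)
print(f"{s * 100:.1f}%")  # displayed as a percentage: 76.5%
```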
Selecting disease properties and tuning the property weights is done in several steps:
During development of the matching algorithm, outbreaks are first classified by size and duration. Other properties are then selected based on their ability to sort outbreaks by case count and/or duration. As a result, the "Case count" and "Time" properties will always receive the highest weights.
For more information on the weights various properties take on, including the maximum possible score for an outbreak, please view the "How was this outbreak scored?" table below each outbreak chart.
When a user sorts results by similarity score, a point estimate is shown on each resulting graph. This point estimate is drawn based on the user's input relative to the outbreak start date in each disease curve. The point estimate appears as a circle on top of the graph:
Suppose the user inputs 50 cases of dengue between 2015-05-05 and 2015-05-19. In other words, there were 50 cumulative cases over a two-week period of time.
In the above example, the outbreak began September 25, 2012. We compute the date two weeks after this initial date; this is October 9. This date is used for the point estimate's X coordinate. The Y coordinate is drawn relative to the right Y axis representing the cumulative case count and is simply the number of cases the user provides; in this case, the Y coordinate is 50.
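The coordinate calculation in this example can be sketched in a few lines of Python. The function name and signature are assumptions made for illustration; only the dates and the case count come from the example above.

```python
from datetime import date
from typing import Tuple


def point_estimate(user_start: date, user_end: date, user_cases: int,
                   outbreak_start: date) -> Tuple[date, int]:
    """Place the user's data point on a historical outbreak curve."""
    # X: shift the historical start date by the duration the user reported.
    elapsed = user_end - user_start
    x = outbreak_start + elapsed
    # Y: the user's cumulative case count, read against the right Y axis.
    y = user_cases
    return x, y


x, y = point_estimate(date(2015, 5, 5), date(2015, 5, 19), 50,
                      date(2012, 9, 25))
print(x, y)  # 2012-10-09 50
```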
The short-term forecast is generated using a simplified variation on an algorithm called the method of analogs. The method of analogs is a simple forecasting method that relies on a large library of historical information. It has applications in fields such as meteorology, climatology, and epidemiology.
When a user sorts results by similarity score, a short-term forecast graph is displayed at the top of the sorted outbreak results. This graph presents a simple custom forecast of cumulative disease incidence based on user input and our library of historical outbreak curves. An example forecast is shown below.
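To give a sense of how an analog-based forecast can work, here is a minimal Python sketch. It is one possible simplification and should not be read as AIDO's actual implementation: the function name, the choice of the top `k` analogs, the use of similarity scores as averaging weights, and all sample numbers are assumptions. Each historical curve is assumed to be a list of cumulative case counts indexed by days since the outbreak started.

```python
from typing import List, Optional, Sequence, Tuple

Analog = Tuple[float, List[float]]  # (similarity score, cumulative curve)


def analog_forecast(user_cases: float, elapsed_days: int, horizon_days: int,
                    analogs: Sequence[Analog], top_k: int = 3) -> Optional[float]:
    """Project the user's cumulative cases using the most similar outbreaks."""
    # Keep only the top_k most similar historical outbreaks.
    ranked = sorted(analogs, key=lambda a: a[0], reverse=True)[:top_k]
    target = elapsed_days + horizon_days
    weighted_growth = 0.0
    total_weight = 0.0
    for similarity, curve in ranked:
        # Skip analogs that are too short or have no cases yet at this point.
        if target >= len(curve) or curve[elapsed_days] <= 0:
            continue
        growth = curve[target] / curve[elapsed_days]
        weighted_growth += similarity * growth
        total_weight += similarity
    if total_weight == 0:
        return None  # no usable analogs
    # Apply the similarity-weighted mean growth to the user's current count.
    return user_cases * weighted_growth / total_weight


# Hypothetical library entries, for illustration only.
library = [
    (0.9, [10, 20, 40, 80]),
    (0.6, [5, 10, 15, 20]),
    (0.2, [1, 1, 1, 1]),
]
forecast = analog_forecast(user_cases=50, elapsed_days=1, horizon_days=2,
                           analogs=library, top_k=2)
print(round(forecast))
```

In this sketch, outbreaks with higher similarity scores pull the projection more strongly toward their own growth pattern, and the function returns `None` when no historical curve covers the requested horizon.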
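There are two possible types of empty values: the user may leave a field in the query form blank, or an outbreak in the library may have no recorded value for a property.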
In both situations, empty values are handled in AIDO's weighting algorithm by reducing the outbreak's maximum possible score by the weight of the property that has the empty value. That is, the weight of the property, $w_i$, is still factored into the weighted sum equation, but the match score for the property, $m_i$, is automatically assigned 0.
We decrease the maximum possible score so that the user recognizes that missing data will decrease AIDO's ability to match outbreaks.
If the user answers all questions, the maximum possible score will be 100%. Suppose, however, that the user leaves the location field blank, and suppose that the location property's weight is 0.25. This means that the maximum possible score that any outbreak can have is 75%.
Because there are two possible types of empty values, outbreaks may have differing maximum scores during a search. Using the above example, if the user leaves the location field blank, the maximum score any outbreak may have is 75%. However, suppose that one outbreak's strain is unknown; furthermore, suppose that the strain property's weight is 0.12. Because the user left the location field blank and the outbreak's strain is unknown, the maximum possible score for that particular outbreak will be 63%; the rest of the outbreaks that do have the strain property will have a maximum possible score of 75%.
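The arithmetic in this example can be checked with a short Python sketch. The function name and the "case_count" and "time" weights are hypothetical, chosen so that all weights sum to 1; only the location weight (0.25) and the strain weight (0.12) come from the example above. `None` stands in for an empty value.

```python
from typing import Mapping, Optional


def max_possible_score(weights: Mapping[str, float],
                       user_values: Mapping[str, Optional[object]],
                       outbreak_values: Mapping[str, Optional[object]]) -> float:
    """Sum the weights of properties that both the user and the outbreak provide."""
    total = 0.0
    for prop, weight in weights.items():
        # A property with an empty value on either side gets a match score
        # of 0, so its weight cannot contribute to the maximum score.
        if user_values.get(prop) is None or outbreak_values.get(prop) is None:
            continue
        total += weight
    return total


# Hypothetical weights except for location (0.25) and strain (0.12).
weights = {"case_count": 0.35, "time": 0.28, "location": 0.25, "strain": 0.12}
user = {"case_count": 50, "time": 14, "location": None, "strain": "DENV-2"}

outbreak_known_strain = {"case_count": 80, "time": 21,
                         "location": "Brazil", "strain": "DENV-1"}
outbreak_unknown_strain = {"case_count": 60, "time": 10,
                           "location": "Peru", "strain": None}

print(f"{max_possible_score(weights, user, outbreak_known_strain):.0%}")    # 75%
print(f"{max_possible_score(weights, user, outbreak_unknown_strain):.0%}")  # 63%
```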
For more information on the weights various properties take on, including the maximum possible score for an outbreak, please view the "How was this outbreak scored?" table below each outbreak chart.
This tool was named SWAP (Surveillance Window Application) until mid-2016. We decided to change the name to AIDO (Analytics for Investigation of Disease Outbreaks) for several reasons:
We pronounce AIDO as "I do" or "I dough".
We conducted two tests to evaluate the algorithm for each of the diseases:
A document showing user input information for test outbreaks for all of the AIDO diseases along with the evaluation results is provided here.
AIDO's analytical capabilities require a certain amount of data to be able to function reliably. The similarity score feature must be disabled for diseases that do not meet the data requirements.
If the "sort by similarity score" option is not available, it is probably due to an insufficient number of outbreaks. If you know of an outbreak that isn't listed in our library, please feel free to contact us to let us know.
Given the breadth and depth of AIDO's outbreak library, a user might wonder if their outbreak is anomalous compared to historical outbreaks. The "Anomaly detection" section is aimed at allowing the user to answer this question.
When viewing search results, outbreak properties (e.g., average daily cases, vaccination percentage, HDI) are broken up into two groups: 1) discrete and 2) continuous. Discrete properties are those for which there are a finite number of choices; for example, a "yes or no" question would be a discrete property. A continuous property is one for which a raw numerical value is provided; for example, the outbreak's total case count is a continuous property.
Charts showing the distribution of property values in AIDO are shown at the top of the search results page under the "Anomaly detection" tab. As discussed below, continuous property values are presented as box plots, and discrete property values are presented as pie charts. In both cases, a drop-down menu below the chart allows the user to select the property they wish to visualize.
To visualize the distribution of values for a continuous property, a box plot is shown. If the user is sorting results by "Similarity score", the user's value, if provided, will be overlaid on the distribution of values in AIDO's library. The following screenshot shows a sample continuous property, total cases:
Here, the user's value is displayed in orange, and the values for all of the outbreaks in AIDO are displayed in blue, along with a box plot showing the median, 1st and 3rd quartiles, and lower and upper fence values. In this instance, the user's total case count value is clearly an outlier in the context of the historical outbreaks present in AIDO's library, which may be a cause for concern.
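For readers who want to reproduce this kind of check on their own data, the following Python sketch flags a value that falls outside the box plot fences. It assumes the conventional Tukey rule (fences at 1.5 times the interquartile range beyond the quartiles) and the "inclusive" quartile method; AIDO's charts may compute quartiles and fences differently. The function names and the sample case counts are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def tukey_fences(values: Sequence[float]) -> Tuple[float, float]:
    """Return (lower, upper) fences at 1.5 * IQR beyond the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(user_value: float, values: Sequence[float]) -> bool:
    """True if the user's value lies outside the fences of the library values."""
    lower, upper = tukey_fences(values)
    return user_value < lower or user_value > upper


# Hypothetical total case counts for outbreaks in a library.
library_case_counts = [12, 30, 45, 60, 80, 120, 150, 200, 260]

print(tukey_fences(library_case_counts))      # (-112.5, 307.5)
print(is_outlier(5000, library_case_counts))  # True
print(is_outlier(100, library_case_counts))   # False
```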
To visualize the distribution of values for a discrete property, a pie chart is shown. If the user is sorting results by "Similarity score", the user's value, if provided, will be highlighted by darkening the border on the relevant pie slice. The following screenshot shows a sample discrete property, anthrax type:
Here, the user's value was "inhalational", so that pie slice is highlighted. In this instance, inhalational anthrax accounts for only 7.14% of the anthrax outbreaks present in AIDO's library, so this may be a cause for concern.
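The share shown on a pie slice is the fraction of library outbreaks that take a given value. A minimal Python sketch follows; the function name is an assumption, and the counts are hypothetical, chosen only because 1 outbreak out of 14 happens to give 7.14%. They are not AIDO's actual library counts.

```python
from collections import Counter
from typing import Hashable, Iterable


def value_share(user_value: Hashable, library_values: Iterable[Hashable]) -> float:
    """Return the fraction of library outbreaks sharing the user's value."""
    counts = Counter(library_values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has no values for this property")
    return counts[user_value] / total


# Hypothetical anthrax-type values for 14 outbreaks (illustration only).
anthrax_types = (["cutaneous"] * 9
                 + ["gastrointestinal"] * 4
                 + ["inhalational"] * 1)

share = value_share("inhalational", anthrax_types)
print(f"{share:.2%}")  # 7.14%
```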