- What is the scope of AIDO?
- How can I filter out outbreaks I know are irrelevant?
- How are the data in AIDO collected?
- What are the assumptions, caveats, and limiting factors?
- Can I get access to the raw outbreak time series in AIDO?
- How are outbreak similarity scores computed?
- How is the point estimate computed?
- How is the method of analogs display computed?
- How are empty values handled when computing the similarity score?
- A data source link is broken. How can I access the original data source?
- Why was the name of this tool changed from SWAP to AIDO?
- How were AIDO's algorithms evaluated?
- What are the BRD links used for?

The goal of AIDO is to enable the analysis of unfolding outbreaks in the context of historical outbreaks for known infectious diseases. The power of AIDO is its large library of historical outbreaks. As a result, AIDO is *not* equipped for handling emerging infectious diseases.

The search interface allows users to filter out irrelevant outbreaks (e.g., it is possible to *only* consider person-to-person norovirus outbreaks that occurred in the United States). To do this, expand the "Restrict search" section at the bottom of each disease's search form.

A dedicated team of biologists and epidemiologists collects data for AIDO. They predominantly use official reports and peer-reviewed literature to find and collect disease outbreaks. When outbreaks are only available as epidemic curves, they are digitized using WebPlotDigitizer. If you would like to alert us about a new outbreak or provide additional information about an outbreak we currently recognize, please use our feedback form.

We assume that data provided in outbreak reports are relatively complete and accurate.

The AIDO similarity function provides a weighted mean comparing user input to a library of historic disease outbreaks, based on properties identified by literature and analysts as important to disease progression. The point estimate placed by the similarity algorithm in AIDO is a tool to contextualize user data and should not be treated as a forecast.

Our similarity algorithm relies on the user’s ability to complete the form associated with their outbreak. When possible we provide data sources from which additional information can be collected. However, if a user cannot provide property information, the associated similarity score will necessarily be lower (see the "How are empty values handled when computing the similarity score?" FAQ entry for more information). The more information a user can provide, the more accurate the algorithm will be.

Because the AIDO similarity function compares user data to a library of historic outbreaks, similarity scores rely heavily on the types and quality of the data in the libraries. We strive to provide libraries that are richly diverse and from reports with good quality data. However, the number of outbreaks per library varies substantially between diseases based on data availability.

We offer a publicly available read-only RESTful API for accessing our raw data. This API gives access to diseases, locations, and outbreak time series. Visit this link for more information.

When a user submits the form to match an outbreak to their situation, we compute a score for each outbreak in our library. This score represents how similar, on a scale from 0 to 100, the outbreak is to the user's situation. This allows the user to understand their situation in the context of historical outbreaks.

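Scores are generated using a simple weighted sum,

$$s = \sum_{i=1}^{K} w_i \, p_i,$$

such that $0 \le p_i \le 1$ and $\sum_{i=1}^{K} w_i = 1$ (with every $w_i \ge 0$). These constraints ensure that $0 \le s \le 1$.

Here, *s* is the outbreak's similarity score, *K* is the number of properties considered, $p_i$ is property *i*'s match score, and $w_i$ is property *i*'s weight.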

Note that while the equation above returns a score, *s*, between 0 and 1, we display scores as percentages (i.e., we display *s* · 100).
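As an illustration, the weighted sum described above (each property's weight multiplied by its match score, then summed) can be sketched in a few lines of Python. The function name and the three example weights and match scores are hypothetical and are not taken from any AIDO disease library.

```python
def similarity_score(weights, match_scores):
    """Weighted sum of per-property match scores; returns a value in [0, 1]."""
    if len(weights) != len(match_scores):
        raise ValueError("need exactly one match score per weight")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if any(not 0.0 <= p <= 1.0 for p in match_scores):
        raise ValueError("match scores must lie in [0, 1]")
    return sum(w * p for w, p in zip(weights, match_scores))


# Hypothetical example with K = 3 properties.
weights = [0.5, 0.3, 0.2]
match_scores = [1.0, 0.5, 0.0]

s = similarity_score(weights, match_scores)
print(f"{s * 100:.0f}%")  # displayed as a percentage: 65%
```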

Selecting disease properties and tuning the property weights is done in several steps:

- **Analysts build an outbreak library for the disease.** The library contains time series, location data, source references, and any notable factors that influenced the outbreak's progression. The size of the library will depend on a variety of factors (e.g., how pervasive the disease is, the quality and availability of the reported data, whether the disease requires mandatory reporting).
- **Properties that influence outbreak progression are determined by the analyst.** Some examples of these properties include: location, total population at risk, strain, socioeconomic factors, and contamination source.
- **Property importance is ranked based on a sensitivity analysis.** The ranking orders properties by how influential they are when distinguishing outbreaks.
- **Property weights are assigned from the rankings using the rank-sum method.** For example, a weight of 0.25 assigned to the location property means that 25% of each outbreak's total score is determined by the proximity of the location in the user's query to the outbreak's location.
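To make the last step concrete, here is a minimal sketch of rank-sum weighting in its commonly used form, where the property ranked *r* out of *K* receives weight (K − r + 1) / (1 + 2 + … + K). This FAQ does not spell out the exact formula AIDO applies, so treat the formula as an assumption; the ranking below is hypothetical.

```python
def rank_sum_weights(ranked_properties):
    """Map properties, ordered from most to least important, to rank-sum weights."""
    k = len(ranked_properties)
    total = k * (k + 1) / 2  # 1 + 2 + ... + K
    return {
        name: (k - rank + 1) / total
        for rank, name in enumerate(ranked_properties, start=1)
    }


# Hypothetical ranking produced by a sensitivity analysis.
ranking = ["location", "population at risk", "strain", "contamination source"]
weights = rank_sum_weights(ranking)
print(weights)
# {'location': 0.4, 'population at risk': 0.3, 'strain': 0.2, 'contamination source': 0.1}
```

Weights produced this way are positive and sum to 1, which is what the weighted-sum score requires.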

For more information on the weights various properties take on, including the maximum possible score for an outbreak, please view the "How was this outbreak scored?" table below each outbreak chart.

When a user sorts results by similarity score, a point estimate is shown on each resulting graph. This point estimate is drawn based on the user's input relative to the outbreak start date in each disease curve. The point estimate appears as a circle on top of the graph:

Suppose the user inputs 50 cases of dengue between 2015-05-05 and 2015-05-19. In other words, there were 50 cumulative cases over a two-week period of time.

In the above example, the outbreak began September 26, 2012. We compute the date two weeks after this initial date; this is October 10. This date is used for the point estimate's X coordinate. The Y coordinate is drawn relative to the right Y axis representing the cumulative case count and is simply the number of cases the user provides; in this case, the Y coordinate is 50.
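The coordinate calculation in this example amounts to simple date arithmetic. The sketch below reproduces it; the function and parameter names are illustrative only.

```python
from datetime import date


def point_estimate(first_report, last_report, case_count, outbreak_start):
    """Return the (x, y) position of the point estimate on a historical curve."""
    elapsed = last_report - first_report  # length of the user's reporting window
    x = outbreak_start + elapsed          # same offset from the historical start date
    y = case_count                        # plotted against the cumulative-case axis
    return x, y


x, y = point_estimate(
    first_report=date(2015, 5, 5),
    last_report=date(2015, 5, 19),
    case_count=50,
    outbreak_start=date(2012, 9, 26),
)
print(x, y)  # 2012-10-10 50
```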

The method of analogs is a simple forecasting method that relies on a large library of historical information. It has applications in fields such as meteorology, climatology, and epidemiology.

When a user sorts results by similarity score, a method of analogs graph is displayed at the top of the sorted outbreak results. This graph presents a simple custom forecast of cumulative disease incidence based on user input and our library of historical outbreak curves. An example forecast is shown below.

It should be noted that, even though the graph will frequently show several years' worth of data, the method of analogs algorithm is only meant for short term forecasting.

- Group outbreaks by disease.
- Line *cumulative* case count curves up in time and group by time unit:
  - For each time unit, we have a list of case counts. Compute the mean and standard deviation to fit a normal distribution. For example, if the case counts at a certain time point are [14, 23, 56, 19, 12], then μ = 24.8 and σ = 16.07.
- Using this normal distribution, we can compute the median, 50% prediction interval, and 90% prediction interval for *each time unit*. Case counts below zero don't make sense, so we institute a lower bound of zero. The normal distribution for the example case count values above produces median = 24.8, 50% PI = (13.96, 35.64), and 90% PI = (-1.63, 51.23) → (0, 51.23).
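The worked numbers above can be reproduced with Python's standard library. This is a sketch of the calculation for a single time unit, not AIDO's actual code. Note that σ = 16.07 corresponds to the population standard deviation (`pstdev`), and the function name is hypothetical.

```python
from statistics import NormalDist, fmean, pstdev


def prediction_summary(case_counts):
    """Fit a normal distribution to one time unit's case counts and summarize it."""
    mu = fmean(case_counts)
    sigma = pstdev(case_counts)  # population standard deviation; must be > 0
    dist = NormalDist(mu, sigma)

    def interval(coverage):
        tail = (1 - coverage) / 2
        lower = max(0.0, dist.inv_cdf(tail))  # negative case counts are clipped to zero
        upper = dist.inv_cdf(1 - tail)
        return lower, upper

    return {"median": dist.median, "pi50": interval(0.50), "pi90": interval(0.90)}


summary = prediction_summary([14, 23, 56, 19, 12])
print(summary)
```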

- To customize the forecast to the user, each case count is weighted in proportion to its outbreak similarity score; thus, case counts in outbreaks that are scored higher weigh more than case counts in outbreaks that are scored lower. To achieve this, we compute weighted mean and standard deviation values, which are then used as the normal distribution's parameters.
- We currently require at least 10 data points at each time point. Once we have fewer than 10 data points, the forecast stops.
- If necessary, we will use B-spline interpolation to handle time series interval granularity issues. For example, if we have monthly and weekly data, B-spline interpolation will be used to fill in the gaps in the monthly data so that it can be used alongside the weekly data. We interpolate at the finest resolution present in each outbreak library.
- Despite the fact that cumulative case count curves are used, there actually may be a visible drop in some forecasts. This is because time series have different lengths, and once the end of a time series is reached, the cumulative case counts may drop significantly.
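The similarity weighting and the 10-point minimum described in these notes could look roughly like the sketch below. The exact weighted standard deviation AIDO uses is not stated here; this version assumes the weighted population form, and the function name and sample data are hypothetical.

```python
from math import sqrt

MIN_POINTS = 10  # the forecast stops once fewer data points remain at a time point


def weighted_normal_params(case_counts, scores):
    """Return (mean, std) weighted by similarity score, or None if the forecast stops."""
    if len(case_counts) != len(scores):
        raise ValueError("need exactly one similarity score per case count")
    total = sum(scores)
    if len(case_counts) < MIN_POINTS or total == 0:
        return None
    mean = sum(s * c for s, c in zip(scores, case_counts)) / total
    # Assumed form: weighted population variance around the weighted mean.
    variance = sum(s * (c - mean) ** 2 for s, c in zip(scores, case_counts)) / total
    return mean, sqrt(variance)


# Ten hypothetical outbreaks at one time point; the last five are scored higher.
counts = [10] * 5 + [20] * 5
scores = [1] * 5 + [3] * 5
mean, std = weighted_normal_params(counts, scores)
print(mean, round(std, 2))  # 17.5 4.33
```

With equal scores this reduces to the ordinary mean and population standard deviation, so higher-scored outbreaks pull the forecast toward their own case counts.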

- C. Viboud, P.-Y. Boëlle, F. Carrat, A.-J. Valleron, and A. Flahault, "Prediction of the Spread of Influenza Epidemics by the Method of Analogues," American Journal of Epidemiology, vol. 158, no. 10, pp. 996–1006, 2003. Link.
- A. S. Mandel', "Method of Analogs in Prediction of Short Time Series: An Expert-statistical Approach," Automation and Remote Control, vol. 65, no. 4, pp. 634–641, 2004. Link.
- E. F. Vasechkina and V. D. Yarin, "Prediction of time series by the method of analogs," Physical Oceanography, vol. 17, no. 4, pp. 242–251, 2007. Link.

There are two possible types of empty values:

- **The user may choose to leave certain questions blank.** With the exception of the "Case count", "First case report", and "Last case report" fields (marked as mandatory with *), all questions on all outbreak match forms are optional, so users can skip any question they cannot answer. For example, the measles match form asks users to provide the percentage of population vaccinated in the country *and* in the affected region. It is straightforward to determine the country's vaccine percentage using the provided data source link, but it may not be possible for the region. AIDO allows users to ignore such questions by leaving the answer blank (in this case, leaving the dropdown box selection on "---------").
- **We may not have enough information to answer that question for a particular outbreak.** For example, perhaps the strain of a specific small novel influenza outbreak was not determined or was not reported in the available literature.

In both situations, empty values are handled in AIDO's weighting algorithm by reducing the outbreak's maximum possible score by the weight of the property that has the empty value. That is, the weight of the property, $w_i$, is still factored into the weighted sum equation, but the match score for the property, $p_i$, is set to 0.

The reason we decrease the maximum possible score is so that the user recognizes the fact that missing data will decrease AIDO's ability to match outbreaks.

If the user answers all questions, the maximum possible score will be 100%. Suppose, however, that the user leaves the location field blank, and suppose that the location property's weight is 0.25. This means that the *maximum possible score* that *any* outbreak can have is 75%.

Because there are two types of possible empty values, outbreaks may have differing maximum possible scores during a search. Using the above example, if the user leaves the location field blank, the maximum score *any* outbreak may have is 75%. However, suppose that one outbreak's strain is unknown; furthermore, suppose that the strain property's weight is 0.12. Because the user left the location field blank and the outbreak's strain is unknown, the maximum possible score for *that particular outbreak* will be 63%; the rest of the outbreaks that *do* have the strain property will have a maximum possible score of 75%.
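The arithmetic in this example can be sketched as follows. An empty value is represented by `None`, the function name is hypothetical, and the "other properties" entry simply lumps together the remaining weight (0.63) so that the weights sum to 1.

```python
def max_possible_score(weights, user_answers, outbreak_values):
    """Sum the weights of properties that both the user and the outbreak provide."""
    return sum(
        weight
        for name, weight in weights.items()
        if user_answers.get(name) is not None and outbreak_values.get(name) is not None
    )


# Hypothetical weights matching the example above.
weights = {"location": 0.25, "strain": 0.12, "other properties": 0.63}

# The user leaves the location field blank.
user = {"location": None, "strain": "H3N2", "other properties": "answered"}

outbreak_with_strain = {"location": "US", "strain": "H3N2", "other properties": "known"}
outbreak_without_strain = {"location": "US", "strain": None, "other properties": "known"}

max_with = max_possible_score(weights, user, outbreak_with_strain)
max_without = max_possible_score(weights, user, outbreak_without_strain)
print(f"{max_with:.0%} {max_without:.0%}")  # 75% 63%
```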

For more information on the weights various properties take on, including the maximum possible score for an outbreak, please view the "How was this outbreak scored?" table below each outbreak chart.

Unfortunately, links may die, making it difficult to access the original data source using our link. Perhaps the website is down for maintenance, or perhaps the link has changed or was deleted for some reason. Internet Archive's Wayback Machine attempts to archive snapshots of every website on the internet and can be useful for accessing content from broken links. To use it, simply visit the Wayback Machine and copy/paste the URL of the broken link. Additionally, we invite you to use our feedback form to tell us about encountered broken links so that we can fix them in our database.
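If you prefer to build the lookup yourself, the following sketch constructs two Wayback Machine addresses for a broken link: one that redirects to the most recent archived snapshot, and one that queries the Internet Archive's availability API. It only builds the URLs and makes no network request; the function name and the example link are hypothetical.

```python
from urllib.parse import quote


def wayback_urls(broken_url):
    """Build Wayback Machine URLs for a link that no longer resolves."""
    return {
        # Opening this address redirects to the most recent archived snapshot, if any.
        "latest_snapshot": "https://web.archive.org/web/" + broken_url,
        # Returns JSON describing the closest available snapshot.
        "availability_api": "https://archive.org/wayback/available?url="
        + quote(broken_url, safe=""),
    }


urls = wayback_urls("http://example.org/outbreak-report.pdf")
print(urls["latest_snapshot"])
print(urls["availability_api"])
```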

This tool was named SWAP (Surveillance Window Application) until mid-2016. We changed the name to AIDO (Analytics for Investigation of Disease Outbreaks) for several reasons:

- The scope of the tool has expanded beyond the "surveillance window" concept and offers more functionality/features, such as outbreak matching, investigation for a causative agent, and short-term forecasting.
- Having the word "app" or "application" in the name was confusing for some.
- AIDO better describes the goal and use of this tool.

We pronounce AIDO as "I do" or "I dough".

We conducted two tests to evaluate the algorithm for each of the diseases:

- **Test 1:** Properties of an outbreak from the library were used as input, and AIDO was expected to return results with match scores of 95% or higher. This allowed us to correct possible errors in property values, time steps, and/or weight calculations.
- **Test 2:** Data from new outbreaks that were not included in the library were used as input. The top matches from AIDO results were then evaluated for their ability to match the expected case count and duration of the test outbreak. For this test, we collected outbreaks with complete epidemiological information as well as ongoing outbreaks with limited information. The matching scores were also evaluated to recommend a reasonable cutoff (i.e., a similarity score threshold below which matches may not be relevant). Our analyses showed that a lower limit of 65-70% provides a relevant match between an unfolding situation and a historical outbreak.

A document showing user input information for test outbreaks for all of the AIDO diseases along with the evaluation results is provided here.

The Biosurveillance Resource Directory (BRD) is a tool to facilitate obtaining detailed disease surveillance information. It contains information on worldwide disease surveillance systems, epidemiological models, and a thorough disease ontology. Many outbreaks in AIDO include links to BRD resources that can provide surveillance system context, describe epidemiological models used during the outbreak, or point the user to epidemiological data. The BRD is available at https://brd.bsvgateway.org/.