Real-time outbreak analysis: Ebola as a case study - part 1
Introduction
This practical (in three parts) simulates the early assessment and reconstruction of an Ebola Virus Disease (EVD) outbreak. It introduces various aspects of analysis of the early stage of an outbreak, including case fatality ratio (CFR), epicurves (part 1), growth rate estimation, contact tracing data, delays, and estimates of transmissibility (part 2), as well as transmission chain reconstruction using outbreaker2 (part 3).
Note: This practical is derived from earlier practicals called "Ebola simulation part 1: early outbreak assessment" and "Ebola simulation part 2: outbreak reconstruction".
Learning outcomes
By the end of this practical, you should be able to:
Load and clean outbreak data in R (part 1)
Estimate the case fatality ratio (CFR) (part 1)
Compute and plot incidence from a linelist (part 1)
Estimate & interpret the growth rate & doubling time of the epidemic (part 2)
Estimate the serial interval from data on pairs of infector / infected individuals (part 2)
Estimate & interpret the reproduction number of the epidemic (part 2)
Forecast short-term future incidence (part 2)
Reconstruct who infected whom using epidemiological and genetic data (part 3)
A novel EVD outbreak in a fictional country in West Africa
A new EVD outbreak has been notified in a fictional country in West Africa. The Ministry of Health is in charge of coordinating the outbreak response, and has contracted you as a consultant in epidemic analysis to inform the response in real time.
Required packages
The following packages, available on CRAN or GitHub, are needed for this analysis. Uncomment and run the lines for any packages you do not already have:
# install.packages("remotes")
# install.packages("readxl")
# install.packages("outbreaks")
# install.packages("incidence")
# remotes::install_github("reconhub/epicontacts@ttree")
# install.packages("distcrete")
# install.packages("epitrix")
# remotes::install_github("annecori/EpiEstim")
# remotes::install_github("reconhub/projections")
# install.packages("ggplot2")
# install.packages("magrittr")
# install.packages("binom")
# install.packages("ape")
# install.packages("outbreaker2")
# install.packages("here")
Once the packages are installed, you may need to open a new R session. Then load the libraries as follows:
library(readxl)
library(outbreaks)
library(incidence)
library(epicontacts)
library(distcrete)
library(epitrix)
library(EpiEstim)
library(projections)
library(ggplot2)
library(magrittr)
library(binom)
library(ape)
library(outbreaker2)
library(here)
Early data (reading data into R)
You have been given the following linelist and contact data:
linelist_20140701.xlsx: a linelist containing case information up to the 1st July 2014; and
contact_20140701.xlsx: a list of contacts reported between cases up to the 1st July 2014. “infector” indicates a potential source of infection, and “case_id” the recipient of the contact.
To read the data into R, download these files and use the function read_excel() from the readxl package (read_xlsx() is the .xlsx-specific equivalent). Each import will create a data table stored as a tibble object.
- Call the first one linelist, and
- the second one contacts.
For instance, your first command line could look like:
linelist <- read_excel(here("data/linelist_20140701.xlsx"), na = c("", "NA"))
Take some time to look at the data and structure here.
- Are the data and format similar to linelists that you have seen in the past?
- If you were part of the outbreak investigation team, what other information might you want to collect?
## [1] 169 11
## # A tibble: 6 x 11
## case_id generation date_of_infecti… date_of_onset date_of_hospita…
## <chr> <dbl> <chr> <chr> <chr>
## 1 d1fafd 0 <NA> 2014-04-07 2014-04-17
## 2 53371b 1 2014-04-09 2014-04-15 2014-04-20
## 3 f5c3d8 1 2014-04-18 2014-04-21 2014-04-25
## 4 6c286a 2 <NA> 2014-04-27 2014-04-27
## 5 0f58c4 2 2014-04-22 2014-04-26 2014-04-29
## 6 49731d 0 2014-03-19 2014-04-25 2014-05-02
## # … with 6 more variables: date_of_outcome <chr>, outcome <chr>,
## # gender <chr>, hospital <chr>, lon <dbl>, lat <dbl>
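The dimensions and preview above are the kind of output you get from dim() and head(). As a minimal sketch, the lines below run the same inspection on a small stand-in data frame built from a few rows and columns of the preview (the object name toy_linelist is hypothetical; in the practical you would call these functions on linelist):

```r
# Stand-in for the imported linelist: three rows and four columns from the preview
toy_linelist <- data.frame(
  case_id = c("d1fafd", "53371b", "f5c3d8"),
  generation = c(0, 1, 1),
  date_of_infection = c(NA, "2014-04-09", "2014-04-18"),
  date_of_onset = c("2014-04-07", "2014-04-15", "2014-04-21"),
  stringsAsFactors = FALSE
)

dim(toy_linelist)              # number of rows and columns
head(toy_linelist)             # first rows of the table
str(toy_linelist)              # column types: dates are still text at this stage
colSums(is.na(toy_linelist))   # number of missing entries per column
```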
Note that for further analyses, you will need to make sure that all dates are stored correctly as Date objects. You can do this by using the function as.Date, for example:
linelist$date_of_onset <- as.Date(linelist$date_of_onset, format = "%Y-%m-%d")
The formatted data should then look like:
## # A tibble: 6 x 11
## case_id generation date_of_infecti… date_of_onset date_of_hospita…
## <chr> <dbl> <date> <date> <date>
## 1 d1fafd 0 NA 2014-04-07 2014-04-17
## 2 53371b 1 2014-04-09 2014-04-15 2014-04-20
## 3 f5c3d8 1 2014-04-18 2014-04-21 2014-04-25
## 4 6c286a 2 NA 2014-04-27 2014-04-27
## 5 0f58c4 2 2014-04-22 2014-04-26 2014-04-29
## 6 49731d 0 2014-03-19 2014-04-25 2014-05-02
## # … with 6 more variables: date_of_outcome <date>, outcome <chr>,
## # gender <chr>, hospital <chr>, lon <dbl>, lat <dbl>
## # A tibble: 6 x 3
## infector case_id source
## <chr> <chr> <chr>
## 1 d1fafd 53371b other
## 2 f5c3d8 0f58c4 other
## 3 0f58c4 881bd4 other
## 4 f5c3d8 d58402 other
## 5 20b688 d8a13d funeral
## 6 2ae019 a3c8b8 other
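The as.Date() conversion needs to be applied to every date column, not only date_of_onset. One possible way to do this in a single step, sketched here on a small stand-in data frame (toy_linelist is a hypothetical name, and selecting columns by the "date_" prefix is an assumption based on the column names shown above):

```r
# Stand-in linelist with dates stored as text (values taken from the preview)
toy_linelist <- data.frame(
  case_id = c("d1fafd", "53371b", "f5c3d8"),
  date_of_infection = c(NA, "2014-04-09", "2014-04-18"),
  date_of_onset = c("2014-04-07", "2014-04-15", "2014-04-21"),
  date_of_hospitalisation = c("2014-04-17", "2014-04-20", "2014-04-25"),
  stringsAsFactors = FALSE
)

# Find every column whose name starts with "date_" and convert it to Date
date_cols <- grep("^date_", names(toy_linelist), value = TRUE)
toy_linelist[date_cols] <- lapply(toy_linelist[date_cols], as.Date,
                                  format = "%Y-%m-%d")

sapply(toy_linelist, class)   # check the class of each column
```

Once the columns are Date objects, differences such as date_of_onset - date_of_infection are computed in days, which is what the cleaning step relies on.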
Data cleaning and descriptive analysis
Look more closely at the data contained in this linelist.
- What do you notice?
## # A tibble: 6 x 11
## case_id generation date_of_infecti… date_of_onset date_of_hospita…
## <chr> <dbl> <date> <date> <date>
## 1 d1fafd 0 NA 2014-04-07 2014-04-17
## 2 53371b 1 2014-04-09 2014-04-15 2014-04-20
## 3 f5c3d8 1 2014-04-18 2014-04-21 2014-04-25
## 4 6c286a 2 NA 2014-04-27 2014-04-27
## 5 0f58c4 2 2014-04-22 2014-04-26 2014-04-29
## 6 49731d 0 2014-03-19 2014-04-25 2014-05-02
## # … with 6 more variables: date_of_outcome <date>, outcome <chr>,
## # gender <chr>, hospital <chr>, lon <dbl>, lat <dbl>
## [1] "case_id" "generation"
## [3] "date_of_infection" "date_of_onset"
## [5] "date_of_hospitalisation" "date_of_outcome"
## [7] "outcome" "gender"
## [9] "hospital" "lon"
## [11] "lat"
You may notice that there are missing entries.
An important step in analysis is to identify any mistakes in data entry.
Although it can be difficult to assess errors in hospital names, we
would expect the date of infection to always be before the date of
symptom onset.
Clean this dataset to remove any entries with negative or 0 day incubation periods.
## identify mistakes in data entry (negative or 0 day incubation period)
mistakes <- which(linelist$date_of_onset <= linelist$date_of_infection)
mistakes
linelist[mistakes, ]
## [1] 46 63 110
## # A tibble: 3 x 11
## case_id generation date_of_infecti… date_of_onset date_of_hospita…
## <chr> <dbl> <date> <date> <date>
## 1 3f1aaf 4 2014-05-18 2014-05-18 2014-05-25
## 2 ce9c02 5 2014-05-27 2014-05-27 2014-05-29
## 3 7.0000… 6 2014-06-10 2014-06-10 2014-06-16
## # … with 6 more variables: date_of_outcome <date>, outcome <chr>,
## # gender <chr>, hospital <chr>, lon <dbl>, lat <dbl>
Save your “cleaned” linelist as a new object: linelist_clean
linelist_clean <- linelist[-mistakes, ]
What other negative dates or mistakes might you want to check if you had the full dataset?
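As one possible answer, the same kind of ordering check can be reused for other pairs of dates, such as onset and hospitalisation. The sketch below uses a toy data frame whose rows are copied from the previews above; out_of_order() is a hypothetical helper written for this illustration, not a function from any package:

```r
# Toy linelist: rows 1-2 come from the linelist preview, row 3 is one of the flagged entries
toy <- data.frame(
  case_id = c("53371b", "f5c3d8", "3f1aaf"),
  date_of_infection = as.Date(c("2014-04-09", "2014-04-18", "2014-05-18")),
  date_of_onset = as.Date(c("2014-04-15", "2014-04-21", "2014-05-18")),
  date_of_hospitalisation = as.Date(c("2014-04-20", "2014-04-25", "2014-05-25"))
)

# Hypothetical helper: row indices where `later` falls before `earlier`
# (or on the same day when strict = TRUE); which() drops rows with a missing date
out_of_order <- function(earlier, later, strict = FALSE) {
  if (strict) which(later <= earlier) else which(later < earlier)
}

# incubation period of 0 days or less
out_of_order(toy$date_of_infection, toy$date_of_onset, strict = TRUE)
# hospitalised before symptom onset
out_of_order(toy$date_of_onset, toy$date_of_hospitalisation)
```

One caveat when removing flagged rows: if no row is flagged, indexing with an empty vector of negative indices (for example x[-integer(0), ]) returns zero rows rather than the full table, so it is worth checking length(mistakes) before subsetting.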
Calculating the case fatality ratio (CFR)
Here is the number of cases by outcome status. How would you calculate the CFR from this?
table(linelist_clean$outcome, useNA = "ifany")
##
## Death Recover <NA>
## 60 43 63
Think about what to do with cases whose outcome is NA.
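## number of deaths, of cases with a known outcome, and of all cases
n_dead <- sum(linelist_clean$outcome %in% "Death")
n_known_outcome <- sum(linelist_clean$outcome %in% c("Death", "Recover"))
n_all <- nrow(linelist_clean)
## CFR among cases with a known outcome
cfr <- n_dead / n_known_outcome
## naive version: treats cases with unknown outcome as survivors
cfr_wrong <- n_dead / n_all
## exact binomial confidence intervals
cfr_with_CI <- binom.confint(n_dead, n_known_outcome, method = "exact")
cfr_wrong_with_CI <- binom.confint(n_dead, n_all, method = "exact")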
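To see why the choice of denominator matters, the arithmetic can be worked through directly from the counts in the table above (60 deaths, 43 recoveries, 63 unknown outcomes). This sketch uses base R's binom.test() in place of binom.confint() so that it runs without extra packages; both give an exact (Clopper-Pearson) interval:

```r
# Counts taken from the outcome table
n_dead <- 60
n_recovered <- 43
n_unknown <- 63

n_known_outcome <- n_dead + n_recovered   # 103 cases with a known outcome
n_all <- n_known_outcome + n_unknown      # 166 cases in total

cfr <- n_dead / n_known_outcome           # about 0.58
cfr_wrong <- n_dead / n_all               # about 0.36

# exact binomial 95% confidence interval for the CFR
cfr_ci <- binom.test(n_dead, n_known_outcome)$conf.int
round(c(estimate = cfr, lower = cfr_ci[1], upper = cfr_ci[2]), 2)
```

Dividing by all cases implicitly counts every case with an unknown outcome as a survivor, which is why cfr_wrong is so much lower. Many of those cases are likely still in hospital, and some may yet die.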
Looking at incidence curves
The first question we want to answer is simply: how bad is it? The first step of the analysis is descriptive: we want to draw an epidemic curve, or epicurve. This visualises the incidence over time by date of symptom onset.
Using the package incidence, compute the daily incidence from linelist_clean based on the dates of symptom onset. Store the result in an object called i_daily; the result should look like:
i_daily
## <incidence object>
## [166 cases from days 2014-04-07 to 2014-06-29]
##
## $counts: matrix with 84 rows and 1 columns
## $n: 166 cases in total
## $dates: 84 dates marking the left-side of bins
## $interval: 1 day
## $timespan: 84 days
## $cumulative: FALSE
plot(i_daily, border = "black")
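An object like the one above can be obtained by passing a vector of dates to incidence(). A minimal sketch on a handful of onset dates copied from the linelist preview (onset and i_toy are hypothetical names; in the practical the input would be linelist_clean$date_of_onset):

```r
library(incidence)

# Stand-in for linelist_clean$date_of_onset
onset <- as.Date(c("2014-04-07", "2014-04-15", "2014-04-21",
                   "2014-04-27", "2014-04-26", "2014-04-25"))

i_toy <- incidence(onset)        # the default interval is 1 day
i_toy                            # prints a summary like the one shown above
plot(i_toy, border = "black")    # one bar per day
```

Days without any onset are kept as zero counts, so the object covers every day between the first and last onset date.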
You might notice that the incidence dates i_daily$dates stop on the last date for which we have data on symptom onset (29th June 2014). However, close inspection of the linelist shows that the last date in the linelist (of any entry) is in fact a bit later (1st July 2014). You can use the argument last_date in the incidence function to change this.
## <incidence object>
## [166 cases from days 2014-04-07 to 2014-07-01]
##
## $counts: matrix with 86 rows and 1 columns
## $n: 166 cases in total
## $dates: 86 dates marking the left-side of bins
## $interval: 1 day
## $timespan: 86 days
## $cumulative: FALSE
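The effect of last_date can be checked on a small example. The sketch below reuses a few onset dates from the linelist preview and takes the latest hospitalisation date in that preview (2nd May 2014) as the last date on record; onset, last_report and i_toy_padded are hypothetical names:

```r
library(incidence)

# Stand-in for linelist_clean$date_of_onset
onset <- as.Date(c("2014-04-07", "2014-04-15", "2014-04-21",
                   "2014-04-27", "2014-04-26", "2014-04-25"))
# Stand-in for the latest date of any entry in the linelist
last_report <- as.Date("2014-05-02")

i_toy_padded <- incidence(onset, last_date = last_report)

range(i_toy_padded$dates)    # bins now extend to last_report
sum(i_toy_padded$counts)     # the total number of cases is unchanged
```

The extra days are added as zero counts. This matters in a real-time analysis because days with no reported onsets at the end of the series are data too, although recent days may also be incomplete because of reporting delays.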
Another issue is that it may be hard to interpret trends when looking at daily incidence, so also compute and plot the weekly incidence i_weekly, as follows:
i_weekly <- incidence(linelist_clean$date_of_onset, interval = 7,
last_date = as.Date(max(linelist_clean$date_of_hospitalisation, na.rm = TRUE)))
i_weekly
## <incidence object>
## [166 cases from days 2014-04-07 to 2014-06-30]
## [166 cases from ISO weeks 2014-W15 to 2014-W27]
##
## $counts: matrix with 13 rows and 1 columns
## $n: 166 cases in total
## $dates: 13 dates marking the left-side of bins
## $interval: 7 days
## $timespan: 85 days
## $cumulative: FALSE
plot(i_weekly, border = "black")
Save data and outputs
This is the end of part 1 of the practical. Before going on to part 2, you’ll need to save the following objects:
dir.create(here("data/clean"), showWarnings = FALSE) # create clean data directory (no warning if it already exists)
saveRDS(i_daily, here("data/clean/i_daily.rds"))
saveRDS(i_weekly, here("data/clean/i_weekly.rds"))
saveRDS(linelist, here("data/clean/linelist.rds"))
saveRDS(linelist_clean, here("data/clean/linelist_clean.rds"))
saveRDS(contacts, here("data/clean/contacts.rds"))
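Objects saved with saveRDS() are read back with readRDS(), which restores them with their classes intact. A minimal round-trip sketch, using a temporary directory as a stand-in for here("data/clean") and a hypothetical toy data frame:

```r
# Temporary output directory (stand-in for the project's data/clean folder)
out_dir <- file.path(tempdir(), "clean")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Toy object with a Date column, using values from the linelist preview
toy <- data.frame(
  case_id = c("d1fafd", "53371b"),
  date_of_onset = as.Date(c("2014-04-07", "2014-04-15"))
)

saveRDS(toy, file.path(out_dir, "toy.rds"))
toy_reloaded <- readRDS(file.path(out_dir, "toy.rds"))

all.equal(toy, toy_reloaded)          # compare the saved and reloaded objects
class(toy_reloaded$date_of_onset)     # the Date class is preserved
```

Unlike a CSV export, this means the date columns will not need to be converted again at the start of part 2.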