library(HDSinRdata)
library(tidyverse)
library(patchwork)
library(gt)
library(gtsummary)
library(usmap)
9 Case Study: Exploring Early COVID-19 Data
In this chapter, we demonstrate a short exploratory analysis as a case study. This case study focuses on COVID-19 cases and deaths during 2020 using the covidcases
and mobility
datasets from the HDSinRdata package. A new package that is used in this case study is usmap package (Di Lorenzo 2024), which allows us to easily create spatial plots of the United States.
9.1 Pre-processing
We start by cleaning and merging our data. The covidcases
data contains weekly confirmed COVID-19 cases and deaths at the state and county level in 2020. As the data description notes, some of these values may be negative due to data discrepancies in the cumulative counts data. The mobility
data contains daily mobility statistics by state.
# Read in data
data(covidcases)
data(mobility)
First, we look at the columns in our data. We convert the date columns in the mobility data to be recognized as a date using the as.Date()
function. The covidcases
data has the week number of 2020. We create a similar column for the mobility data.
# Convert to date format and find week
$date <- as.Date(mobility$date, formula = "%Y-%M-%D")
mobility$week <- week(mobility$date) mobility
This allows us to summarize the mobility for a state across each week.
# Find average mobility for week
<- mobility %>%
mobility_week group_by(state, week) %>%
summarize(m50 = mean(m50, na.rm=TRUE), .groups = "drop")
head(mobility_week)
#> # A tibble: 6 × 3
#> state week m50
#> <chr> <dbl> <dbl>
#> 1 Alabama 9 13.2
#> 2 Alabama 10 14.6
#> 3 Alabama 11 13.4
#> 4 Alabama 12 8.98
#> 5 Alabama 13 7.81
#> 6 Alabama 14 6.73
For both of our datasets, we want to check whether each state was observed across all dates and how the state’s name is represented. For the mobility data, our data is at the state level so we can use the table()
function.
# Find number of dates recorded for each state
table(mobility_week$state)
#>
#> Alabama Alaska Arizona Arkansas
#> 27 27 27 27
#> California Colorado Connecticut Delaware
#> 27 27 27 27
#> Florida Georgia Hawaii Idaho
#> 27 27 27 27
#> Illinois Indiana Iowa Kansas
#> 27 27 27 27
#> Kentucky Louisiana Maine Maryland
#> 27 27 27 27
#> Massachusetts Michigan Minnesota Mississippi
#> 27 27 27 27
#> Missouri Montana Nebraska Nevada
#> 27 27 27 27
#> New Hampshire New Jersey New Mexico New York
#> 27 27 27 27
#> North Carolina North Dakota Ohio Oklahoma
#> 27 27 27 27
#> Oregon Pennsylvania Rhode Island South Carolina
#> 27 27 27 27
#> South Dakota Tennessee Texas Utah
#> 27 27 27 27
#> Vermont Virginia Washington Washington, D.C.
#> 27 27 27 27
#> West Virginia Wisconsin Wyoming
#> 27 27 27
For the covidcases
data, our data is at the county level. We need to summarize the data instead. In this case, some states were observed for fewer weeks than others.
# Find state names and number of weeks recorded for each state
unique(covidcases$state)
#> [1] "California" "Florida"
#> [3] "New Hampshire" "Washington"
#> [5] "Massachusetts" "Arizona"
#> [7] "Texas" "Georgia"
#> [9] "New York" "Wisconsin"
#> [11] "Oregon" "North Carolina"
#> [13] "Nebraska" "Illinois"
#> [15] "Utah" "Indiana"
#> [17] "Tennessee" "Pennsylvania"
#> [19] "Michigan" "Oklahoma"
#> [21] "Kentucky" "Connecticut"
#> [23] "Colorado" "Virginia"
#> [25] "Nevada" "South Dakota"
#> [27] "Minnesota" "Ohio"
#> [29] "Vermont" "New Jersey"
#> [31] "Maryland" "Iowa"
#> [33] "Missouri" "South Carolina"
#> [35] "Hawaii" "District of Columbia"
#> [37] "Louisiana" "Kansas"
#> [39] "Maine" "Arkansas"
#> [41] "Idaho" "Alabama"
#> [43] "Montana" "Mississippi"
#> [45] "North Dakota" "New Mexico"
#> [47] "Alaska" "Wyoming"
#> [49] "Delaware" "West Virginia"
#> [51] "Rhode Island"
<- covidcases %>%
num_wks group_by(state) %>%
summarize(num_weeks = n_distinct(week), .groups = "drop")
summary(num_wks)
#> state num_weeks
#> Length:51 Min. :23.0
#> Class :character 1st Qu.:25.5
#> Mode :character Median :26.0
#> Mean :26.0
#> 3rd Qu.:27.0
#> Max. :27.0
Note that D.C. is written differently for each data source. We update this name in the mobility data.
$state[mobility_week$state == "Washington, D.C."] <-
mobility_week"District of Columbia"
After checking the formatting of the state
and week
columns, we can now merge our data together. In this case, we want to add the mobility data to the case data and use a left_join()
.
# Join cases and mobility data
<- left_join(covidcases, mobility_week, by = c("state", "week")) covid
Next, we want to get some simple information about the continuous variables in our data. We observe two key points. First, we can see the negative values the data description warned us about, and second, there is no missing data.
summary(covid[, c("weekly_cases", "weekly_deaths", "m50")])
#> weekly_cases weekly_deaths m50
#> Min. : -188 Min. :-511 Min. : 0.0
#> 1st Qu.: 2 1st Qu.: 0 1st Qu.: 5.0
#> Median : 9 Median : 0 Median : 7.7
#> Mean : 87 Mean : 3 Mean : 7.7
#> 3rd Qu.: 39 3rd Qu.: 1 3rd Qu.: 9.9
#> Max. :35134 Max. :5226 Max. :49.4
These negative numbers are clear data discrepancies. When showing the distribution of cases in our exploratory analysis we may choose to either code these as 0 or NA. We decide to recode these negative values as NA.
# Set negative counts to NA
$weekly_cases <- replace(covid$weekly_cases,
covidwhich(covid$weekly_cases < 0),
NA)
$weekly_deaths <- replace(covid$weekly_deaths,
covidwhich(covid$weekly_deaths < 0),
NA)
As the last step in our pre-processing, we add in the state abbreviation and region for each state using the state.name
and state.region
vectors available in R. We code D.C. to be in the same region as Maryland and Virginia.
# Add region and abbreviation and remove county
<- data.frame(state = c(state.name,
region_key "District of Columbia"),
state_abb = c(state.abb, "DC"),
region = c(as.character(state.region),
"South"))
<- covid %>%
covid left_join(region_key, by = c("state"))
head(covid)
#> # A tibble: 6 × 8
#> state county week weekly_cases weekly_deaths m50 state_abb region
#> <chr> <chr> <dbl> <int> <int> <dbl> <chr> <chr>
#> 1 Califo… Marin 9 1 0 7.50 CA West
#> 2 Califo… Orange 9 3 0 7.50 CA West
#> 3 Florida Manat… 9 1 0 9.68 FL South
#> 4 Califo… Napa 9 1 0 7.50 CA West
#> 5 New Ha… Graft… 9 2 0 7.67 NH North…
#> 6 Washin… Spoka… 9 4 0 4.01 WA West
9.2 Mobility and Cases Over Time
Now that our data are merged and cleaned, we start exploring mobility and cases by region. The following summary table shows that these measures did differ by region overall.
%>%
covid select(c("region", "m50", "weekly_cases", "weekly_deaths")) %>%
tbl_summary(by = "region", missing = "no") %>%
as_gt()
Characteristic | North Central, N = 22,6531 | Northeast, N = 5,1651 | South, N = 32,2761 | West, N = 9,4361 |
---|---|---|---|---|
m50 | 7.4 (5.4, 8.8) | 4.2 (1.6, 5.2) | 9.7 (7.1, 11.3) | 4.3 (2.9, 7.0) |
weekly_cases | 5 (1, 22) | 15 (3, 91) | 13 (3, 50) | 5 (1, 38) |
weekly_deaths | 0 (0, 0) | 0 (0, 4) | 0 (0, 1) | 0 (0, 1) |
1 Median (IQR) |
We then plot mobility over time both for the whole country and by region. Across the country we see a similar pattern in how mobility fluctuated but that certain regions had overall higher mobility than others.
# Average mobility in the US over time - overall
<- covid %>%
pmob1 select(c(region, state, week, m50)) %>%
distinct() %>%
group_by(week) %>%
summarize(avg_m50 = mean(m50, na.rm=TRUE), .groups="drop") %>%
ggplot() +
geom_line(aes(x = week, y = avg_m50)) +
labs(x = "Week in 2020", y = "Average Mobility",
title = "Average Mobility in the US") +
theme_bw()
# Average mobility in the US over time - by region
<- covid %>%
pmob2 select(c(region, state, week, m50)) %>%
distinct() %>%
group_by(region, week) %>%
summarize(avg_m50 = mean(m50, na.rm=TRUE), .groups="drop") %>%
ggplot() +
geom_line(aes(x = week, y = avg_m50, color = region)) +
labs(x = "Week in 2020", y = "Average Mobility",
title = "Average Mobility by Region in the US",
color = "Region") +
theme_bw() +
theme(legend.position = "bottom")
+pmob2 pmob1
We then look at cases and deaths by region. A limitation of these data are that we do not have population counts which would allow us to standardize these numbers. However, we do use a secondary y-axis using the sec_axis()
function within scale_y_continuous()
to allow us to plot deaths and cases together. In this case, the secondary axis is scaled by 1/10th of the primary axis.
# Change in number cases over time, per region
%>%
covid filter(!is.na(region)) %>%
group_by(region, week) %>%
summarize(weekly_cases = sum(weekly_cases, na.rm = TRUE),
weekly_deaths = sum(weekly_deaths, na.rm = TRUE),
.groups = "drop") %>%
ggplot() +
geom_line(aes(x = week, y = weekly_cases, color = region,
linetype = "Cases")) +
geom_line(aes(x = week, y = weekly_deaths*10, color = region,
linetype = "Deaths")) +
scale_y_continuous(name = "Average Weekly Cases",
sec.axis = sec_axis(~./10,
name = "Average Weekly Deaths"))+
scale_linetype(name = "Measure") +
labs(x = "Week in 2020", color = "Region",
title = "Weekly Cases Over Time by Region") +
theme_bw()
To look at how mobility and cases are related, we look at a scatter plot of mobility and cases in California.
<- covid %>% filter(state == "California")
covid_ca ggplot(covid_ca)+
geom_point(aes(x = weekly_cases, y = m50), na.rm = TRUE) +
labs(x = "Weekly Cases", y = "Average Mobility")
This motivates us to look at the correlation between these two columns by state. We plot this using the plot_usmap()
function from the usmap package. Interestingly, we observe different relationships throughout the country, but none of the correlations are particularly strong.
# Calculate and plot correlation between cases and mobility, y state
<- covid %>%
covid_cor group_by(state) %>%
summarize(correlation = cor(weekly_cases, m50,
use = "complete.obs"))
plot_usmap(regions = "states", data = covid_cor,
values = "correlation") +
scale_fill_gradient2(low = "darkblue", high = "darkred",
mid="white", name = "Correlation") +
labs(title = "Correlation Between Cases and Mobility") +
theme(legend.position = "right")
Last, we look at how the total cases and deaths are related to each other. This shows that the Northeast suffered more deaths per case overall, which may be related to the lower mobility and negative correlation between mobility and cases observed earlier.
# Relationship between cases and deaths summarized
%>%
covid group_by(region, state_abb) %>%
summarize(total_cases = sum(weekly_cases, na.rm = TRUE),
total_deaths = sum(weekly_deaths, na.rm = TRUE),
.groups = "drop") %>%
ggplot() +
geom_label(aes(x = total_cases, y = total_deaths, color = region,
label = state_abb), size = 1.5) +
labs(x = "Total Cases", y = "Total Deaths", color = "Region") +
theme_bw()