library(HDSinRdata)
library(tidyverse)
library(gt)
library(gtsummary)
6 Case Study: Cleaning Tuberculosis Screening Data
In this chapter, we put some of our R skills together in a case study. This case study focuses on data cleaning and pre-processing. We use the tb_diagnosis_raw
data from the HDSinRdata package. This data contains information on 1,634 patients in rural South Africa who presented at a health clinic with tuberculosis-related symptoms and were tested for tuberculosis (TB) using Xpert MTB/RIF. Our goal is to clean this data to reflect the pre-processing described in Baik et al. (2020). This paper uses this data to derive a simple risk score model for screening patients for treatment while awaiting Xpert results. We use the tidyverse packages as well as the summary tables from gtsummary.
To begin, read in the data and review the description of the original columns. Some things to note in the data documentation are the ways unknown, missing, or refused values are coded as well as how some of the columns are related to each other.
# Read in data
data("tb_diagnosis_raw")
# Inspect variable descriptions
# ?tb_diagnosis_raw
To start, we select variables needed for our analysis. In particular, we drop columns related to the participation in the survey and about seeking care. Since some of these variables contain long or vague names, we also rename most of the variables.
# Select variables and rename
<- tb_diagnosis_raw %>%
tb_df select(c(xpert_status_fac, age_group, sex, hiv_status_fac,
other_conditions_fac___1, other_conditions_fac___3,
other_conditions_fac___88, other_conditions_fac___99,
symp_fac___1, symp_fac___2, symp_fac___3, symp_fac___4,
symp_fac___99, length_symp_unit_fac, length_symp_days_fac,
length_symp_wk_fac, length_symp_mnt_fac, length_symp_yr_fac,%>%
smk_fac, dx_tb_past_fac, educ_fac)) rename(tb = xpert_status_fac, hiv_pos = hiv_status_fac,
cough = symp_fac___1, fever = symp_fac___2,
weight_loss = symp_fac___3, night_sweats = symp_fac___4,
symptoms_missing = symp_fac___99,
ever_smoke = smk_fac,
past_tb = dx_tb_past_fac, education = educ_fac)
We then use a summary table to understand the initial distributions of the variables observed. This also highlights where we have missing or unknown data.
tbl_summary(tb_df) %>%
as_gt()
Characteristic | N = 1,6341 |
---|---|
tb | |
1 | 765 (47%) |
2 | 869 (53%) |
age_group | |
[15,25) | 240 (15%) |
[25,35) | 333 (20%) |
[35,45) | 385 (24%) |
[45,55) | 343 (21%) |
[55,99) | 333 (20%) |
sex | |
1 | 830 (51%) |
2 | 804 (49%) |
hiv_pos | |
1 | 632 (39%) |
2 | 815 (50%) |
77 | 139 (8.5%) |
88 | 48 (2.9%) |
other_conditions_fac___1 | 895 (55%) |
other_conditions_fac___3 | 52 (3.2%) |
other_conditions_fac___88 | 11 (0.7%) |
other_conditions_fac___99 | 30 (1.8%) |
cough | 1,279 (78%) |
fever | 479 (29%) |
weight_loss | 534 (33%) |
night_sweats | 579 (35%) |
symptoms_missing | 22 (1.3%) |
length_symp_unit_fac | |
1 | 207 (14%) |
2 | 603 (39%) |
3 | 538 (35%) |
4 | 83 (5.4%) |
77 | 98 (6.4%) |
Unknown | 105 |
length_symp_days_fac | 3 (3, 4) |
Unknown | 1,427 |
length_symp_wk_fac | |
1 | 183 (30%) |
2 | 237 (39%) |
3 | 147 (24%) |
4 | 15 (2.5%) |
5 | 5 (0.8%) |
6 | 13 (2.2%) |
7 | 3 (0.5%) |
Unknown | 1,031 |
length_symp_mnt_fac | 2 (1, 3) |
Unknown | 1,096 |
length_symp_yr_fac | 2 (1, 4) |
Unknown | 1,551 |
ever_smoke | |
1 | 294 (18%) |
2 | 252 (15%) |
3 | 1,072 (66%) |
99 | 16 (1.0%) |
past_tb | |
1 | 255 (16%) |
2 | 1,354 (83%) |
77 | 25 (1.5%) |
education | 10 (7, 12) |
1 n (%); Median (IQR) |
One observation from the table is that the coding of variables is inconsistent, with some using 0/1 and others using 1/2. We want to standardize how these variables are represented. To start, we update our tb
column. Additionally, we create a column male
from the previous column sex
to make the reference level clear. We can then drop the sex
column.
# Re-code binary variables to 0/1 instead of 1/2
$tb <- case_when(tb_df$tb == 1 ~ "TB Positive",
tb_df$tb == 2 ~ "TB Negative")
tb_df
$male <- case_when(tb_df$sex == 1 ~ 1,
tb_df$sex == 2 ~ 0)
tb_df<- tb_df %>% select(-c(sex)) tb_df
Diabetes is another variable that should be coded this way. In the raw data, several columns correspond to this question about other medical conditions. Therefore, we need to use the columns other_conditions_fac___88
and other_conditions_fac___99
to check whether the participant did not answer the question when interpreting the 0/1 value for diabetes.
# Re-code diabetes to check if missing
$diabetes <- case_when(tb_df$other_conditions_fac___3 == 1 ~ 1,
tb_df$other_conditions_fac___1 == 1 ~ 0,
tb_df$other_conditions_fac___88 == 1 ~ NA,
tb_df$other_conditions_fac___99 == 1 ~ NA,
tb_dfTRUE ~ 0)
<- tb_df %>% select(-c(other_conditions_fac___1,
tb_df
other_conditions_fac___3,
other_conditions_fac___88,
other_conditions_fac___99))table(tb_df$diabetes)
#>
#> 0 1
#> 1541 52
Next, we similarly code our variables about HIV status, smoking, and whether the patient has ever been diagnosed with tuberculosis before. For these variables, if the patient answered that they did not know if their HIV status or had tested positive for TB, we code these as 0 to be consistent with the paper.
# Re-code variables with missing or refused values
$hiv_pos <- case_when((tb_df$hiv_pos == 1) ~ 1,
tb_df$hiv_pos %in% c(2,77) ~ 0,
tb_df$hiv_pos == 88 ~ NA)
tb_df
$ever_smoke <- case_when(tb_df$ever_smoke %in% c(1,2) ~ 1,
tb_df$ever_smoke == 3 ~ 0,
tb_df$ever_smoke == 99 ~ NA)
tb_df
$past_tb <- case_when(tb_df$past_tb == 1 ~ 1,
tb_df$past_tb %in% c(2,77) ~ 0) tb_df
The next variable we clean is education
. First, we need to code NA values correctly. We can then observe the distribution of years of education.
# Code NA values and look at education distribution
$education[tb_df$education == 99] <- NA
tb_dfhist(tb_df$education, xlab = "Years of Education",
main = "Histogram of Education Years")
For our purposes, we want to represent education as whether a person has a high school education or less.
# Categorize education to HS and above
$hs_less <- case_when(tb_df$education <= 12 ~ 1,
tb_df$education > 12 ~ 0,
tb_dfTRUE ~ NA)
<- tb_df %>% select(-c(education)) tb_df
There are several variables in the data related to how long a person has experienced symptoms. In the following code, we can see that the unit of the symptoms, recorded in length_symp_unit_fac
, determines which other column is entered. For example, if length_symp_unit_fac == 1
, then the only column without an NA value is length_symp_days_fc
.
%>%
tb_df group_by(length_symp_unit_fac) %>%
summarize(missing_days = sum(is.na(length_symp_days_fac))/n(),
missing_wks = sum(is.na(length_symp_wk_fac))/n(),
missing_mnt = sum(is.na(length_symp_mnt_fac))/n(),
missing_yr = sum(is.na(length_symp_yr_fac))/n())
#> # A tibble: 6 × 5
#> length_symp_unit_fac missing_days missing_wks missing_mnt missing_yr
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 1 1 1
#> 2 2 1 0 1 1
#> 3 3 1 1 0 1
#> 4 4 1 1 1 0
#> 5 77 1 1 1 1
#> 6 NA 1 1 1 1
Additionally, these measurements are positive integer values.
min(tb_df$length_symp_days_fac, na.rm = TRUE)
#> [1] 1
is.integer(tb_df$length_symp_days_fac)
#> [1] TRUE
This allows us to create a new variable that represents whether or not someone has had symptoms for more than two weeks. In our case_when()
function call, we first check whether the duration is missing before checking for the cases when symptoms would be less than two weeks.
# Categorize number of weeks experiencing symptoms
<- tb_df %>%
tb_df mutate(two_weeks = case_when((length_symp_unit_fac == 77 |
is.na(length_symp_unit_fac)) ~ NA,
== 1 &
(length_symp_unit_fac <= 14) ~ 0,
length_symp_days_fac == 2 &
(length_symp_unit_fac <= 2) ~ 0,
length_symp_wk_fac TRUE ~ 1))
<- tb_df %>%
tb_df select(-c(length_symp_wk_fac, length_symp_days_fac,
length_symp_mnt_fac, length_symp_yr_fac, length_symp_unit_fac))
Last, we update our symptom variables to have a summary column num_symptoms
that represents the total number of classic TB symptoms rather than keeping track of individual symptoms. We also exclude anyone who does not have any TB symptoms.
# Count total number of symptoms
$num_symptoms <- tb_df$fever + tb_df$weight_loss + tb_df$cough +
tb_df$night_sweats
tb_df$num_symptoms[tb_df$symptoms_missing == 1] <- NA
tb_df<- tb_df %>% select(-c(night_sweats, weight_loss, cough, fever,
tb_df
symptoms_missing))
# Exclude observations with no TB symptoms
<- tb_df %>%
tb_df filter(num_symptoms != 0)
table(tb_df$num_symptoms)
#>
#> 1 2 3 4
#> 600 344 265 196
Last, we convert all variables to factors.
# Convert all variables to factors
<- lapply(tb_df, function(x){return(as.factor(x))}) tb_df[]
Our final data is summarized in the following table. The add_overall()
function includes the overall summary statistics in addition to our stratified summaries. Our summary table looks similar to the one in the paper. However, it looks like we have a few more observations included. Additionally, our education variable shows a lower percentage of observations with post-high school education and positive HIV status.
tbl_summary(tb_df, by = "tb",
label = list(
tb= "Tuberculosis",
age_group = "Age Group",
hiv_pos = "HIV Positive",
ever_smoke = "Ever Smoked",
past_tb = "Past TB",
male = "Male",
hs_less = "High School or Less Educ",
two_weeks = "Two Weeks Symptoms",
diabetes = "Diabetes",
num_symptoms = "Number of Symptoms"
%>%
)) add_overall() %>%
as_gt()
Characteristic | Overall, N = 1,4051 | TB Negative, N = 7041 | TB Positive, N = 7011 |
---|---|---|---|
Age Group | |||
[15,25) | 205 (15%) | 121 (17%) | 84 (12%) |
[25,35) | 286 (20%) | 120 (17%) | 166 (24%) |
[35,45) | 338 (24%) | 136 (19%) | 202 (29%) |
[45,55) | 305 (22%) | 158 (22%) | 147 (21%) |
[55,99) | 271 (19%) | 169 (24%) | 102 (15%) |
HIV Positive | |||
0 | 808 (59%) | 503 (73%) | 305 (45%) |
1 | 556 (41%) | 186 (27%) | 370 (55%) |
Unknown | 41 | 15 | 26 |
Ever Smoked | |||
0 | 899 (64%) | 483 (69%) | 416 (60%) |
1 | 496 (36%) | 213 (31%) | 283 (40%) |
Unknown | 10 | 8 | 2 |
Past TB | |||
0 | 1,186 (84%) | 613 (87%) | 573 (82%) |
1 | 219 (16%) | 91 (13%) | 128 (18%) |
Male | |||
0 | 669 (48%) | 394 (56%) | 275 (39%) |
1 | 736 (52%) | 310 (44%) | 426 (61%) |
Diabetes | |||
0 | 1,326 (97%) | 658 (97%) | 668 (96%) |
1 | 47 (3.4%) | 22 (3.2%) | 25 (3.6%) |
Unknown | 32 | 24 | 8 |
High School or Less Educ | |||
0 | 119 (8.5%) | 73 (10%) | 46 (6.6%) |
1 | 1,276 (91%) | 625 (90%) | 651 (93%) |
Unknown | 10 | 6 | 4 |
Two Weeks Symptoms | |||
0 | 592 (44%) | 386 (57%) | 206 (31%) |
1 | 760 (56%) | 294 (43%) | 466 (69%) |
Unknown | 53 | 24 | 29 |
Number of Symptoms | |||
1 | 600 (43%) | 426 (61%) | 174 (25%) |
2 | 344 (24%) | 181 (26%) | 163 (23%) |
3 | 265 (19%) | 67 (9.5%) | 198 (28%) |
4 | 196 (14%) | 30 (4.3%) | 166 (24%) |
1 n (%) |