6 Case Study: Cleaning Tuberculosis Screening Data

In this chapter, we put some of our R skills together in a case study. This case study focuses on data cleaning and pre-processing. We use the tb_diagnosis_raw data from the HDSinRdata package. This data contains information on 1,634 patients in rural South Africa who presented at a health clinic with tuberculosis-related symptoms and were tested for tuberculosis (TB) using Xpert MTB/RIF. Our goal is to clean this data to reflect the pre-processing described in Baik et al. (2020). This paper uses this data to derive a simple risk score model for screening patients for treatment while awaiting Xpert results. We use the tidyverse packages as well as the summary tables from gtsummary.

library(HDSinRdata)
library(tidyverse)
library(gt)
library(gtsummary)

To begin, read in the data and review the description of the original columns. Some things to note in the data documentation are the ways unknown, missing, or refused values are coded as well as how some of the columns are related to each other.

# Read in data
data("tb_diagnosis_raw")

# Inspect variable descriptions
# ?tb_diagnosis_raw

To start, we select variables needed for our analysis. In particular, we drop columns related to the participation in the survey and about seeking care. Since some of these variables contain long or vague names, we also rename most of the variables.

# Select variables and rename
tb_df <- tb_diagnosis_raw %>% 
  select(c(xpert_status_fac, age_group, sex, hiv_status_fac,
           other_conditions_fac___1, other_conditions_fac___3,
           other_conditions_fac___88, other_conditions_fac___99,
           symp_fac___1, symp_fac___2, symp_fac___3, symp_fac___4, 
           symp_fac___99, length_symp_unit_fac, length_symp_days_fac,
           length_symp_wk_fac, length_symp_mnt_fac, length_symp_yr_fac,
           smk_fac, dx_tb_past_fac, educ_fac)) %>%
    rename(tb = xpert_status_fac, hiv_pos = hiv_status_fac,
           cough = symp_fac___1, fever = symp_fac___2, 
           weight_loss = symp_fac___3, night_sweats = symp_fac___4, 
           symptoms_missing = symp_fac___99,
           ever_smoke = smk_fac, 
           past_tb = dx_tb_past_fac, education = educ_fac)

We then use a summary table to understand the initial distributions of the variables observed. This also highlights where we have missing or unknown data.

tbl_summary(tb_df) %>%
  as_gt()

Characteristic	N = 1,634¹
tb
1	765 (47%)
2	869 (53%)
age_group
[15,25)	240 (15%)
[25,35)	333 (20%)
[35,45)	385 (24%)
[45,55)	343 (21%)
[55,99)	333 (20%)
sex
1	830 (51%)
2	804 (49%)
hiv_pos
1	632 (39%)
2	815 (50%)
77	139 (8.5%)
88	48 (2.9%)
other_conditions_fac___1	895 (55%)
other_conditions_fac___3	52 (3.2%)
other_conditions_fac___88	11 (0.7%)
other_conditions_fac___99	30 (1.8%)
cough	1,279 (78%)
fever	479 (29%)
weight_loss	534 (33%)
night_sweats	579 (35%)
symptoms_missing	22 (1.3%)
length_symp_unit_fac
1	207 (14%)
2	603 (39%)
3	538 (35%)
4	83 (5.4%)
77	98 (6.4%)
Unknown	105
length_symp_days_fac	3 (3, 4)
Unknown	1,427
length_symp_wk_fac
1	183 (30%)
2	237 (39%)
3	147 (24%)
4	15 (2.5%)
5	5 (0.8%)
6	13 (2.2%)
7	3 (0.5%)
Unknown	1,031
length_symp_mnt_fac	2 (1, 3)
Unknown	1,096
length_symp_yr_fac	2 (1, 4)
Unknown	1,551
ever_smoke
1	294 (18%)
2	252 (15%)
3	1,072 (66%)
99	16 (1.0%)
past_tb
1	255 (16%)
2	1,354 (83%)
77	25 (1.5%)
education	10 (7, 12)
¹ n (%); Median (IQR)

One observation from the table is that the coding of variables is inconsistent, with some using 0/1 and others using 1/2. We want to standardize how these variables are represented. To start, we update our tb column. Additionally, we create a column male from the previous column sex to make the reference level clear. We can then drop the sex column.

# Re-code binary variables to 0/1 instead of 1/2
tb_df$tb <- case_when(tb_df$tb == 1 ~ "TB Positive", 
                      tb_df$tb == 2 ~ "TB Negative")

tb_df$male <- case_when(tb_df$sex == 1 ~ 1, 
                        tb_df$sex == 2 ~ 0)
tb_df <- tb_df %>% select(-c(sex))

Diabetes is another variable that should be coded this way. In the raw data, several columns correspond to this question about other medical conditions. Therefore, we need to use the columns other_conditions_fac___88 and other_conditions_fac___99 to check whether the participant did not answer the question when interpreting the 0/1 value for diabetes.

# Re-code diabetes to check if missing
tb_df$diabetes <- case_when(tb_df$other_conditions_fac___3 == 1 ~ 1,
                            tb_df$other_conditions_fac___1 == 1 ~ 0,
                            tb_df$other_conditions_fac___88 == 1 ~ NA,
                            tb_df$other_conditions_fac___99 == 1 ~ NA,
                            TRUE ~ 0)
tb_df <- tb_df %>% select(-c(other_conditions_fac___1,
                             other_conditions_fac___3,
                             other_conditions_fac___88,
                             other_conditions_fac___99))
table(tb_df$diabetes)
#> 
#>    0    1 
#> 1541   52

Next, we similarly code our variables about HIV status, smoking, and whether the patient has ever been diagnosed with tuberculosis before. For these variables, if the patient answered that they did not know their HIV status or if they had tested positive for TB, we code these as 0 to be consistent with the paper.

# Re-code variables with missing or refused values
tb_df$hiv_pos <- case_when((tb_df$hiv_pos == 1) ~ 1, 
                           tb_df$hiv_pos %in% c(2,77) ~ 0,
                           tb_df$hiv_pos == 88 ~ NA)

tb_df$ever_smoke <- case_when(tb_df$ever_smoke %in% c(1,2) ~ 1,
                              tb_df$ever_smoke == 3 ~ 0,
                              tb_df$ever_smoke == 99 ~ NA)

tb_df$past_tb <- case_when(tb_df$past_tb == 1 ~ 1, 
                           tb_df$past_tb %in% c(2,77) ~ 0)

The next variable we clean is education. First, we need to code NA values correctly. We can then observe the distribution of years of education.

# Code NA values and look at education distribution
tb_df$education[tb_df$education == 99] <- NA
hist(tb_df$education, xlab = "Years of Education", 
     main = "Histogram of Education Years")

For our purposes, we want to represent education as whether a person has a high school education or less.

# Categorize education to HS and above
tb_df$hs_less <- case_when(tb_df$education <= 12 ~ 1,
                           tb_df$education > 12 ~ 0,
                           TRUE ~ NA)
tb_df <- tb_df %>% select(-c(education))

There are several variables in the data related to how long a person has experienced symptoms. In the following code, we can see that the unit of the symptoms, recorded in length_symp_unit_fac, determines which other column is entered. For example, if length_symp_unit_fac == 1, then the only column without an NA value is length_symp_days_fc.

tb_df %>%
  group_by(length_symp_unit_fac) %>%
  summarize(missing_days = sum(is.na(length_symp_days_fac))/n(),
            missing_wks = sum(is.na(length_symp_wk_fac))/n(),
            missing_mnt = sum(is.na(length_symp_mnt_fac))/n(),
            missing_yr = sum(is.na(length_symp_yr_fac))/n())
#> # A tibble: 6 × 5
#>   length_symp_unit_fac missing_days missing_wks missing_mnt missing_yr
#>                  <int>        <dbl>       <dbl>       <dbl>      <dbl>
#> 1                    1            0           1           1          1
#> 2                    2            1           0           1          1
#> 3                    3            1           1           0          1
#> 4                    4            1           1           1          0
#> 5                   77            1           1           1          1
#> 6                   NA            1           1           1          1

Additionally, these measurements are positive integer values.

min(tb_df$length_symp_days_fac, na.rm = TRUE)
#> [1] 1
is.integer(tb_df$length_symp_days_fac)
#> [1] TRUE

This allows us to create a new variable that represents whether or not someone has had symptoms for more than two weeks. In our case_when() function call, we first check whether the duration is missing before checking for the cases when symptoms would be less than two weeks.

# Categorize number of weeks experiencing symptoms
tb_df <- tb_df %>%
  mutate(two_weeks = case_when((length_symp_unit_fac == 77 |
                      is.na(length_symp_unit_fac)) ~ NA,
                   (length_symp_unit_fac == 1 & 
                      length_symp_days_fac <= 14) ~ 0,
                   (length_symp_unit_fac == 2 &
                      length_symp_wk_fac <= 2) ~ 0,
                   TRUE ~ 1))

tb_df <- tb_df %>% 
  select(-c(length_symp_wk_fac, length_symp_days_fac, 
            length_symp_mnt_fac, length_symp_yr_fac, 
            length_symp_unit_fac))

Last, we update our symptom variables to have a summary column num_symptoms that represents the total number of classic TB symptoms rather than keeping track of individual symptoms. We also exclude anyone who does not have any TB symptoms.

# Count total number of symptoms
tb_df$num_symptoms <- tb_df$fever + tb_df$weight_loss + tb_df$cough + 
  tb_df$night_sweats
tb_df$num_symptoms[tb_df$symptoms_missing == 1] <- NA
tb_df <- tb_df %>% select(-c(night_sweats, weight_loss, cough, fever,
                             symptoms_missing))

# Exclude observations with no TB symptoms 
tb_df <- tb_df %>%
  filter(num_symptoms != 0)

table(tb_df$num_symptoms)
#> 
#>   1   2   3   4 
#> 600 344 265 196

Last, we convert all variables to factors.

# Convert all variables to factors 
tb_df[] <- lapply(tb_df, function(x){return(as.factor(x))})

Our final data is summarized in the following table. The add_overall() function includes the overall summary statistics in addition to our stratified summaries. Our summary table looks similar to the one in the paper. However, it looks like we have a few more observations included. Additionally, our education variable shows a lower percentage of observations with post-high school education and positive HIV status.

tbl_summary(tb_df, by = "tb",
            label = list(
              tb= "Tuberculosis",
              age_group = "Age Group",
              hiv_pos = "HIV Positive",
              ever_smoke = "Ever Smoked",
              past_tb = "Past TB",
              male = "Male",
              hs_less = "High School or Less Educ.",
              two_weeks = "Two Weeks Symptoms",
              diabetes = "Diabetes",
              num_symptoms = "Number of Symptoms"
            )) %>%
  add_overall() %>%
  as_gt() %>%
  cols_width(everything() ~ "55pt")

Characteristic	Overall, N = 1,405¹	TB Negative, N = 704¹	TB Positive, N = 701¹
Age Group
[15,25)	205 (15%)	121 (17%)	84 (12%)
[25,35)	286 (20%)	120 (17%)	166 (24%)
[35,45)	338 (24%)	136 (19%)	202 (29%)
[45,55)	305 (22%)	158 (22%)	147 (21%)
[55,99)	271 (19%)	169 (24%)	102 (15%)
HIV Positive
0	808 (59%)	503 (73%)	305 (45%)
1	556 (41%)	186 (27%)	370 (55%)
Unknown	41	15	26
Ever Smoked
0	899 (64%)	483 (69%)	416 (60%)
1	496 (36%)	213 (31%)	283 (40%)
Unknown	10	8	2
Past TB
0	1,186 (84%)	613 (87%)	573 (82%)
1	219 (16%)	91 (13%)	128 (18%)
Male
0	669 (48%)	394 (56%)	275 (39%)
1	736 (52%)	310 (44%)	426 (61%)
Diabetes
0	1,326 (97%)	658 (97%)	668 (96%)
1	47 (3.4%)	22 (3.2%)	25 (3.6%)
Unknown	32	24	8
High School or Less Educ.
0	119 (8.5%)	73 (10%)	46 (6.6%)
1	1,276 (91%)	625 (90%)	651 (93%)
Unknown	10	6	4
Two Weeks Symptoms
0	592 (44%)	386 (57%)	206 (31%)
1	760 (56%)	294 (43%)	466 (69%)
Unknown	53	24	29
Number of Symptoms
1	600 (43%)	426 (61%)	174 (25%)
2	344 (24%)	181 (26%)	163 (23%)
3	265 (19%)	67 (9.5%)	198 (28%)
4	196 (14%)	30 (4.3%)	166 (24%)
¹ n (%)