library(tidyverse)
library(HDSinRdata)
21 Expanding your R Skills
Throughout this book, we have covered some popular packages as well as many of the specific functions from these packages. However, it would be impossible to cover all of the packages, functions, and options that are available in R. In this chapter, we talk about how to use new packages, interpret error messages, debug code, and follow good programming practices to help you take your R programming to the next level.
21.1 Reading Documentation for New Packages
As you start to apply the tools from this book to your own work or in new settings, you may need to install and use new packages or encounter some unexpected errors. Practicing reading package documentation and responding to error messages helps you expand your R skills beyond the topics covered here. We demonstrate these skills using the stringr package (Wickham 2022), which is a package that is part of the tidyverse and has several functions for dealing with text data.
Every package published on CRAN has a CRAN web page. This page links to a reference manual that documents the functions and data available in the package. Most often, the page also includes useful vignettes that give examples of how to use the functions in the package. The page also tells you the requirements for using the package, who the authors are, and when the package was last updated. For example, take a look at the CRAN page for stringr and read the vignette “Introduction to stringr”.
We use the stringr package to demonstrate cleaning up text related to a PubMed search query for a systematic review. An example search query is given in the following code chunk and is taken from Gue et al. (2021). We can assume that the search query is either fully contained in parentheses or is a sequence of parenthetical phrases connected with AND or OR. Our goal is to extract the search query from the text, along with all the individual search terms used in the query, but we have to get there in a series of steps.
sample_str <- " A systematic search will be performed in PubMed,
Embase, and the Cochrane Library, using the following search query:
('out-of-hospital cardiac arrest' OR 'OHCA') AND ('MIRACLE 2' OR
'OHCA' OR 'CAHP' OR 'C-GRAPH' OR 'SOFA' OR 'APACHE' OR 'SAPS’ OR
’SWAP’ OR ’TTM’)."
The first thing we want to do with the text is clean up the white space by removing any trailing, leading, or repeated spaces. In our example, the string starts with a leading space and there are also multiple spaces right before the search query. Searching for “white space” in the stringr reference manual, we find the str_trim() and str_squish() functions. Read the documentation for these two functions. You should find that str_squish() is the function we are looking for and that it takes a single argument.
sample_str <- str_squish(sample_str)
sample_str
#> [1] "A systematic search will be performed in PubMed, Embase, and the Cochrane Library, using the following search query: ('out-of-hospital cardiac arrest' OR 'OHCA') AND ('MIRACLE 2' OR 'OHCA' OR 'CAHP' OR 'C-GRAPH' OR 'SOFA' OR 'APACHE' OR 'SAPS’ OR ’SWAP’ OR ’TTM’)."
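To see why str_squish() rather than str_trim() fits here, it can help to run both on a tiny string. The following is a minimal sketch; the string `messy` is made up for illustration and is not part of the original example.

```r
library(stringr)

# A small made-up string with leading, trailing, and repeated spaces
messy <- "  search   query:  ('OHCA')  "

# str_trim() removes leading and trailing white space only
trimmed <- str_trim(messy)

# str_squish() also collapses repeated white space inside the string
squished <- str_squish(messy)

trimmed
squished
```

Comparing the two results shows that only str_squish() deals with the repeated spaces in the middle of the string.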
21.2 Trying Simple Examples
The premise of testing a function on a single string is a good example of starting with a simple case. Rather than applying your function to your full dataset right away, you want to first make sure that you understand how it works on a simple example for which you can anticipate what the outcome should look like. Our next task is to split the text into words and store this as a character vector. Read the documentation to determine why we use the str_split_1() function. We then double check that the returned result is indeed a vector and print the result.
sample_str_words <- str_split_1(sample_str, " ")
class(sample_str_words)
#> [1] "character"
sample_str_words
#> [1] "A" "systematic" "search"
#> [4] "will" "be" "performed"
#> [7] "in" "PubMed," "Embase,"
#> [10] "and" "the" "Cochrane"
#> [13] "Library," "using" "the"
#> [16] "following" "search" "query:"
#> [19] "('out-of-hospital" "cardiac" "arrest'"
#> [22] "OR" "'OHCA')" "AND"
#> [25] "('MIRACLE" "2'" "OR"
#> [28] "'OHCA'" "OR" "'CAHP'"
#> [31] "OR" "'C-GRAPH'" "OR"
#> [34] "'SOFA'" "OR" "'APACHE'"
#> [37] "OR" "'SAPS’" "OR"
#> [40] "’SWAP’" "OR" "’TTM’)."
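One way to check your reading of the documentation is to compare str_split() and str_split_1() on a short input. This sketch assumes a version of stringr that includes str_split_1() (1.5.0 or later, to the best of our knowledge); the input string is made up for illustration.

```r
library(stringr)

# str_split() returns a list with one element per input string
words_list <- str_split("a systematic search", " ")

# str_split_1() takes a single string and returns a character vector
words_vec <- str_split_1("a systematic search", " ")

class(words_list)
class(words_vec)
length(words_vec)
```

Because we only have one string, the plain character vector from str_split_1() is easier to work with than a list of length one.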
We now want to identify words in this vector that have opening and/or closing parentheses. The function grepl() takes in a character vector x and a pattern to search for. It returns a logical vector indicating whether or not each element of x has a match for that pattern.
grepl(sample_str_words, ")")
#> Warning in grepl(sample_str_words, ")"): argument 'pattern' has length
#> > 1 and only the first element will be used
#> [1] FALSE
That didn’t match what we expected. We expected multiple TRUE/FALSE values in the output, one for each word. Let’s read the documentation again.
21.3 Deciphering Error Messages and Warnings
The previous warning message gives us a good clue for what went wrong. It says that the inputted pattern has length > 1. However, the pattern we intended to give is a single character. In fact, we specified the arguments in the wrong order: grepl() expects pattern first and x second. Let’s try again. This time we name the arguments x and pattern explicitly.
grepl(x = sample_str_words, pattern = ")")
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [23] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
That fixed it. However, it won’t work if we change the pattern to an opening parenthesis. Try it out for yourself to see this. The error message says that it is looking for a closing parenthesis. In this case, the documentation does not help us. Let’s try searching “stringr find start parentheses” using an online search engine. Our search results indicate that we may need to use backslashes to tell R to read the parenthesis literally rather than as a special character used in a regular expression (a technique often referred to as “escaping” a character). Investigating the reason for an error, including using online material, is an important skill for a programmer to have.
grepl(x = sample_str_words, pattern = "\\(")
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
#> [23] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
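Escaping is not the only way to match a literal parenthesis. As a hedged sketch of two alternatives, the code below uses the fixed = TRUE argument of grepl() and the fixed() helper from stringr with str_detect(). The short vector `words` is made up for illustration.

```r
library(stringr)

words <- c("('MIRACLE", "2'", "OR", "'OHCA')")

# Option 1: escape the parenthesis in the regular expression
escaped <- grepl(pattern = "\\(", x = words)

# Option 2: treat the pattern as a literal string, not a regular expression
literal <- grepl(pattern = "(", x = words, fixed = TRUE)

# Option 3: the stringr equivalent using fixed()
detected <- str_detect(words, fixed("("))

escaped
identical(escaped, literal)
identical(escaped, detected)
```

All three approaches should flag only the first word, which is the only one containing an opening parenthesis.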
When a function doesn’t return what we expect it to, it is a good idea to first test whether the arguments we gave it match what we expect, then re-read the documentation, and then look for other resources for help. For example, we could check that sample_str_words is indeed a character vector, then re-read the stringr documentation, and then search our problem.
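The first of these checks can be sketched in a few lines of base R. The vector `words` below is a made-up stand-in for sample_str_words.

```r
words <- c("('OHCA')", "AND", "('SOFA'")

# Check the type and size of the input before blaming the function
is.character(words)
length(words)
str(words)

# Naming the arguments removes any doubt about their order
has_close <- grepl(pattern = "\\)", x = words)

# The result should have one TRUE/FALSE value per input element
length(has_close) == length(words)
has_close
```

If the length check fails, that is a strong hint that an argument was passed in the wrong position or has an unexpected type.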
21.3.1 Debugging Code
The following code is supposed to extract the search query from the text as well as find the individual search terms used in the query. However, the code is incorrect. You can try out the two test strings given to see why the code output is wrong. Practice reading through the code to understand what it is trying to do. The comments are there to help explain the steps, but you may also want to print intermediate output to figure out what each step is doing.
sample_strA <- " A systematic search will be performed in PubMed,
Embase, and the Cochrane Library, using the following search query:
('out-of-hospital cardiac arrest' OR 'OHCA') AND ('MIRACLE 2' OR
'OHCA' OR 'CAHP' OR 'C-GRAPH' OR 'SOFA' OR 'APACHE' OR 'SAPS’ OR
’SWAP’ OR ’TTM’)."
sample_strB <- "Searches will be conducted in MEDLINE via PubMed, Web
of Science, Scopus and Embase. The following search strategy will be
used:(child OR infant OR preschool child OR preschool children OR
preschooler OR pre-school child OR pre-school children OR pre school
child OR pre school children OR pre-schooler OR pre schooler OR
children OR adolescent OR adolescents)AND(attention deficit disorder
with hyperactivity OR ADHD OR attention deficit disorder OR ADD OR
hyperkinetic disorder OR minimal brain disorder) Submitted "
sample_str <- sample_strB
# separate parentheses, remove extra white space, and split into words
sample_str <- str_replace(sample_str, "\\)", " \\) ")
sample_str <- str_replace(sample_str, "\\(", " \\( ")
sample_str <- str_squish(sample_str)
sample_str_words <- str_split_1(sample_str, " ")
# find indices with parentheses
end_ps <- grepl(x = sample_str_words, pattern = "\\)")
start_ps <- grepl(x = sample_str_words, pattern = "\\(")
# find words between first and last parentheses
search_query <- sample_str_words[which(end_ps)[1]:which(start_ps)[1]]
search_query <- paste(search_query, collapse=" ")
search_query
#> [1] ") adolescents OR adolescent OR children OR schooler pre OR pre-schooler OR children school pre OR child school pre OR children pre-school OR child pre-school OR preschooler OR children preschool OR child preschool OR infant OR child ("
# find search terms
search_terms <- str_replace_all(search_query, "\\)", "")
search_terms <- str_replace_all(search_query, "\\(", "")
sample_terms <- str_squish(search_query)
search_terms <- str_split_1(search_terms, " AND | OR ")
search_terms
#> [1] ") adolescents" "adolescent" "children"
#> [4] "schooler pre" "pre-schooler" "children school pre"
#> [7] "child school pre" "children pre-school" "child pre-school"
#> [10] "preschooler" "children preschool" "child preschool"
#> [13] "infant" "child "
21.3.2 Video Solution
The video solution shows how we can test and fix the previous code by using some simple debugging principles.
21.4 General Programming Tips
As you write more complex code and functions, you want to focus on practicing good programming principles. This helps when you need to share or update your code or when you inevitably run into errors or unexpected behavior. The following are some general programming tips and how they relate to communication and debugging.
Consistent Naming. Use consistent and informative names for your objects and functions. For example, you can see that within this text we have only used lowercase letters, underscores (_), and occasionally numbers in our names. These names should also be informative and unique. This makes it easier to check for typos or duplicate names when debugging. When debugging, check that you haven’t used the same name for different objects or different names for the same object. You can do this by using the ls() function to find all current objects or by checking your environment pane.
# Recommended
ages <- c(65, 33, 27, 88)
age_mean <- mean(ages)
# Not recommended
x <- c(65, 33, 27, 88)
x1 <- mean(ages)
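As a small sketch of the name check described above, ls() lists the objects in the current environment and exists() tests for a single name. The misspelled name `ages_mean` is hypothetical and is used only to show what a typo looks like.

```r
ages <- c(65, 33, 27, 88)
age_mean <- mean(ages)

# List the objects defined in the current environment
ls()

# exists() checks a single name, which helps catch typos
exists("age_mean")
exists("ages_mean")  # a deliberately misspelled, hypothetical name
```

If a name you expect is missing from ls(), or a name you never meant to create appears, that is often the source of the bug.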
Make Your Code Readable. Readable code requires several elements of communication. As with writing an essay, we need to break our code into digestible and structured pieces. First, code should be broken into blocks, using white space to separate steps, and should use correct levels of indentation (one extra level of indentation for each new loop, if/else statement, or function). This means that closing curly braces should be on their own line indicating the end of the block. This makes it easy to check that all parentheses (), brackets [], and curly braces {} match. Additionally, you should use line breaks to avoid going over 80 characters on a single line of code. RStudio has an option to reformat or re-indent your code under the Code tab.
Besides the structure of your code, helpful comments and function documentation are key for making your code readable. A good rule of thumb is to write comments for yourself a year from now: you might remember the project goal but you won’t remember what x represents. You likely do not need a comment for every line of code, but you might need comments to explain the overall goal of a code block or to clarify lines that aren’t self-explanatory.
In the following code, we have not used indentation. This makes it hard to see the structure of the code, such as which lines of code are in the loop or if statement. The function does not have any roxygen documentation. However, we have added too many comments, and the comments simply repeat the code. Last, we named the function unique(), which is already a function in R.
# Not recommended
unique <- function(x){
y <- c() # results
if (length(x) == 0){ # check length 0
return(NULL)} # return NULL
for(i in length(x)){ # loop through x
if(!(x[i] %in% y)){ # check if x[i] in y
y <- c(y, x[i]) }} # add x[i] to y
return(y) # return y
}
We rewrite the function addressing these comments. The end result is much easier to read.
#' Find unique elements of a vector
#'
#' @param x vector
#' @return new vector with duplicates removed
own_unique <- function(x){
  # check for empty vector
  if (length(x) == 0){
    return(NULL)
  }
  # otherwise y will be unique values
  y <- c()
  for(i in 1:length(x)){
    # if value of x is not in y, we add it
    if(!(x[i] %in% y)){
      y <- c(y, x[i])
    }
  }
  return(y)
}
Don’t Repeat Yourself. Repeating code increases the likelihood of errors. Additionally, it makes it hard to update our code later on. When we find ourselves repeating code, we should use a function. If you find yourself repeating constants you should define these values as an object. The subsequent code uses a single line of code to convert categorical variables to factors and stores a vector of which columns are categorical.
dry_df <- data.frame(age = ages,
                     tb = c(1, 0, 0, 0),
                     heart_rate = c(60, 82, 76, 72),
                     gender = c("Female", "Male", "Nonbinary", "Female"))
# convert factor variables
cat_vars <- c("tb", "gender")
dry_df[cat_vars] <- lapply(dry_df[cat_vars], factor)
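The same idea applies when a calculation, rather than a type conversion, is repeated. The following sketch wraps a repeated formula in a function and stores the column names in one object. The data frame `df`, the helper `standardize()`, and the vector `num_vars` are all hypothetical names introduced for this illustration.

```r
# Hypothetical data, similar in spirit to the data frame above
df <- data.frame(heart_rate = c(60, 82, 76, 72),
                 age = c(65, 33, 27, 88))

# Repetitive version (commented out): the same formula typed twice
# df$heart_rate_std <- (df$heart_rate - mean(df$heart_rate)) / sd(df$heart_rate)
# df$age_std <- (df$age - mean(df$age)) / sd(df$age)

# Write the formula once as a function
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

# Store the columns to transform in one place
num_vars <- c("heart_rate", "age")
df[paste0(num_vars, "_std")] <- lapply(df[num_vars], standardize)

df
```

If the formula needs to change later, there is now only one place to update it, and adding a column only requires extending `num_vars`.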
Practice Reading Documentation. Whenever you are using a new function, you should read the documentation first. When debugging, you should check that the input arguments to a function match what is expected and check the examples. Reading these examples also helps when writing your own documentation so you can better understand how to communicate to your audience.
Start Simple, Build Up. If we write a large amount of code at once and then it fails to work, it’s hard to understand what went wrong. Instead, we should build up our code or functions in small steps and check the result after each step. When it comes to testing code, a good mantra is test early and test often. So, try to avoid writing too much code before running it and checking that the results match what you expect. If you do end up writing a big chunk of code, you can localize your error by checking the values of objects at different points.
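As one possible illustration of testing early and often, the sketch below checks the own_unique() function from this section in three small steps. The function is repeated so that the example runs on its own; the test inputs are made up, and comparing against base R's unique() is our choice of reference rather than a required approach.

```r
# own_unique() is repeated here so the example runs on its own
own_unique <- function(x) {
  if (length(x) == 0) {
    return(NULL)
  }
  y <- c()
  for (i in 1:length(x)) {
    if (!(x[i] %in% y)) {
      y <- c(y, x[i])
    }
  }
  return(y)
}

# Step 1: a tiny input where we can work out the answer by hand
small_result <- own_unique(c(1, 2, 2, 3, 1))
small_result

# Step 2: an edge case (empty input)
empty_result <- own_unique(character(0))
is.null(empty_result)

# Step 3: compare against a trusted reference, base R's unique()
test_vec <- c("b", "a", "b", "c", "a")
matches_base <- identical(own_unique(test_vec), unique(test_vec))
matches_base
```

Note that the edge case reveals a design difference: own_unique() returns NULL for an empty input, whereas unique() returns an empty vector of the same type. Small checks like this surface such differences before the function is used on real data.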
Get Comfortable Asking for Help. In software engineering, it’s a known tip to have a rubber ducky (or other adorable object) at your desk to talk through your code. Having to verbalize and explain your approach can be really helpful for debugging. R’s error messages can sometimes hint at what the error might stem from, but they are not always direct. Searching for error messages you don’t understand might give you a better understanding of the problem.
21.5 Exercises
These exercises focus on reading function documentation and debugging.
Suppose we want to replace the words “Thank you” in the following string with the word “Thanks”. Why does the following code fail? How can we correct it?
string <- "Congratulations on finishing the book! 
Thank you for reading it."
str_sub(string, c(35, 42)) <- "Thanks"
string
#> [1] "Congratulations on finishing the bThanks"
#> [2] "Congratulations on finishing the book! \nTThanks"
The subsequent code uses the NHANESsample data from the HDSinRdata package. The goal of the code is to plot the worst diastolic blood pressure reading against the worst systolic blood pressure reading for each patient, colored by hypertension status. However, the code currently generates an error message. What is wrong with the code? There are four errors for you to identify and fix.
data(NHANESsample)
nhanes_df <- NHANESsample %>%
  mutate(worst_DBP = max(DBP1, DBP2, DBP3, DBP4),
         worst_SBP = max(SBP1, SBP2, SBP3, SBP4))
ggplot() %>%
  geom_point(data = nhanes_df,
             aes(x = worst_SBP, y = worst_DBP),
             color = HYP)
The following code uses the breastcancer data from the HDSinRdata package. The goal is to create a logistic regression model for whether or not the diagnosis is benign or malignant and then to create a calibration plot for the model, following the code from Chapter 14. Debug and fix the code. Hint: there are three separate errors.
data(breastcancer)
model <- glm(diagnosis ~ smoothness_worst + symmetry_mean +
               texture_se + radius_mean,
             data = breastcancer, family = binomial)
pred_probs <- predict(model)
num_cuts <- 10
calib_data <- data.frame(prob = pred_probs,
                         bin = cut(pred_probs, breaks = num_cuts),
                         class = mod_start$y)
calib_data <- calib_data %>%
  group_by(bin) %>%
  summarize(observed = sum(class)/n(),
            expected = sum(prob)/n(),
            se = sqrt(observed * (1 - observed) / n()))
calib_data
ggplot(calib_data) +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  geom_errorbar(aes(x = expected, ymin = observed - 1.96 * se,
                    ymax = observed + 1.96 * se),
                colour = "black", width = .01)+
  geom_point(aes(x = expected, y = observed)) +
  labs(x="Expected Proportion", y="Observed Proportion") +
  theme_minimal()