02-basics.Rmd

# R basics {#basics}

Learning a programming language is like learning any language. 
If you’re learning French, for example, you could just dive in and start gesticulating to people in central Paris. 
However, it’s worth taking the time to understand the structure and key words of the language first. 
The same applies to data science: it will help if you first understand a little about the syntax and grammar of the language that we will use to in relation to the ‘data’ (the statistical programming language R) before diving into using it for road safety research. 
This section, parts of which were originally developed for a [2 day course](https://github.com/ITSLeeds/TDS/releases/download/0.2/2day.pdf) on data science, may seem tedious for people who just want to crack on and load-in data.^[
Thanks to Malcolm Morgan who contributed content that has been included in this section.
]
However, working through the examples below is recommended for most people unless you’re already an experienced R user, although even experienced R users may learn something about the language’s syntax and fundamental features.

The first step is to **start RStudio**, e.g. if you are on Windows, this can be achieved by tapping **Start** and searching for RStudio. You should see an **R console** pane like that which is displayed in Figure \@ref(fig:console).

```{r console, echo=FALSE, fig.cap='The R console pane in RStudio that appears when you first open RStudio. After typing install.packages(stats), you should see the package name auto-complete. Pressing `Enter` at this point will trigger the autocompletion of the command `install.packages("stats19").'}
knitr::include_graphics("figures/console.png")
```

If you saw something like that which is shown in Figure \@ref(fig:console) congratulations!
You are ready to start running R code by entering commands into the console.

## Creating and removing R objects

R can be understood as a giant calculator.
If you feed the console arithmetic tasks, it will solve them precisely and instantly.
Try typing the following examples (note that `pi` is an inbuilt object) into the R console in RStudio and pressing `Enter` to make R run the code. (`Note:` The output of the code, when shown in this manual, is preceeded by '##'.)

```{r}
2 + 19
pi^(19 + 2) / exp(2 + 19) 
```

Use the same approach to find the square route of 361 (answer not shown):

```{r, eval=FALSE}
sqrt(361)
```

This is all well and good, providing a chance to see how R works with numbers and to get some practice with typing commands into the console. 
However, the code chunks above do not make use of a key benefit of R: it is *object oriented*, meaning it stores values and complex objects, such as data frames representing road casualties, and processes them *in memory* (meaning that R is both fast and memory hungry when working with large datasets). If you are more familiar with Excel, a data frame may be thought of as fulfilling the purpose of a single worksheet containing a set of data.

The two most common ways of creating objects are using `<-` 'arrow' or `=` 'equals' assignment.
These symbols are *assignment operators* because they assign contents, such as numbers, to named objects.^[
We use equals assignment for speed of typing and to avoid ambiguity in commands such as `x<-1` vs `x< -1`, although `<-` has historically been (and still is) more commonly used in many fields.
]
Let's reproduce the calculations above using objects.
This makes the final command more concise:

```{r}
x = 2
y = 19
z = x + y
pi^z / exp(z)
```

The previous code chunk created and stored three objects called `x`, `y` and `z` and showed how these objects can themselves be used to create additional objects.
Why `x`, `y` and `z`?
Lack of imagination!

You can call R objects anything you like, provided they do not start with numbers or contain reserved symbols such as `+` and `-`.
You can use various stylistic conventions when naming your R objects, including `camelCase` and `dot.case`
[@baath_state_2012].
We advocate using `snake_case`, a style that avoids upper case characters to ease typing and uses the underscore symbol (`_`) to clearly demarcate spaces between words.

The objects created in the previous code chunk have now served their purpose, which is to demonstrate basic object creation in R. 
So, based on the wise saying that tidying up is the most important part of a job, we will now remove these objects:

```{r}
rm(x, y, z)
```

What just happened?
We removed the objects using the R function `rm()`, which stands for 'remove'.
A function is an instruction or set of instructions for R to do something with what we give to that function.
What we give to the function are known as arguments.
Each function has set of arguments we can potentially give to it.  

Technically speaking, we *passed the objects to arguments in the `rm()` function call*.
In plain English, things that go inside the curved brackets that follow a function name are the arguments.
The `rm()` function removes the objects that it is passed (most functions modify objects).
A 'nuclear option' for cleaning your workspace is to run the following command, the meaning of which you will learn in the next section. (Can you guess?)

```{r}
rm(list = ls())
```

Next exercise: create objects that more closely approximate road casualty data by typing and running the following lines of code in the console:

```{r}
casualty_type = c("pedestrian", "cyclist", "cat")
casualty_age = seq(from = 20, to = 60, by = 20)
```

## Object types: vectors and data frames

The final stage in the previous section involved creating two objects with sensible names in our R session.^[
The R session can be restarted by closing RStudio and then reopening it, or by pressing `Ctl+Shift+F10`.
Restarting R in this way will make R 'forget' the objects, unless you ask it to save the objects.
In most cases we advocate *not* saving R objects in your workspace but saving the specific objects you need, something that is covered in Section \@ref(saving-r-objects).
]
After running the previous code chunk the `casualty_*` objects are in the workspace (technically, the 'global environment').
You should be able to see the object names in the Environment tab in the top right of RStudio.
Objects can also be listed with the `ls()` command as follows:

```{r}
ls()
```

The previous command executed the function `ls()` with no arguments.
This helps explain the command `rm(list = ls())`, which removed all objects in the global environment in the previous section.
<!-- The command basically means 'remove everything', **used with caution**. -->
This also makes the wider point that functions can accept arguments (in this case the `list` argument of the `rm()` function) that are themselves function calls.

Two key functions for getting the measure of R objects are `class()` and `length()`.

```{r}
class(casualty_type)
class(casualty_age)
```

The class of the `casualty_type` and `casualty_type` objects are `character` (meaning text) and `numeric` (meaning numbers), respectively.
The objects are *vectors*, a sequence of values of the same type.
Next challenge: guess their length and check your guess was correct by running the following commands (results not shown):

```{r, eval=FALSE}
length(casualty_type)
length(casualty_age)
```

To convert a series of vectors into a data frame with rows and columns (similar to an Excel worksheet), we will use the `data.frame()` function.
Create a data frame containing the two casualty vectors as follows:

```{r}
crashes = data.frame(casualty_type, casualty_age)
```

We can see the contents of the new `crashes` object by entering the following line of code.
This prints its contents (results not shown, you need to run the command on your own computer to see the result):

```{r, eval=FALSE}
crashes
```

We can get a handle of data frame objects such as `crashes` as follows:

```{r}
class(crashes)
nrow(crashes)
ncol(crashes)
```

The results of the previous commands tell us that the dataset has 3 rows and 2 columns.
We will use larger datasets (with thousands of rows and tens of columns) in later sections, but for now it's good to 'start small' to understand the basics of R.

## Subsetting by index or column name

As we saw above, the most basic type of R object is a *vector*, which is a sequence of values of the same type such as the numbers in the object `casualty_age`. 
In the earlier examples, `x`, `y` and `z` were all *numeric vectors* with a length of 1; `casualty_type` is a *character vector* (because it contains letters that cannot be added) of length 3; and `casualty_age` is a *numeric vector* of length 3.

Subsetting means 'extracting' only part of a vector or other object, so that only the parts of most interest are returned to us.
Subsets of vectors can be returned by providing numbers representing the positions (index) of the elements within the vector (e.g. '2' representing selection of the 2^nd^ element) or with logical (`TRUE` or `FALSE`) values associated with the element.
These methods are demonstrated below, to return the 2nd element of the `casualty_age` object is returned:

```{r}
casualty_age
casualty_age[2]
casualty_age[c(FALSE, TRUE, FALSE)]
```

Two dimensional objects such as matrices and data frames can be subset by rows and columns.
Subsetting in base R is achieved by using square brackets `[]` after the name of an object.
**To practice subsetting, run the following commands to index and column name and verify that you get the same results to those that are shown below.**

```{r, eval=FALSE}
casualty_age[2:3] # second and third casualty_age
crashes[c(1, 2), ] # first and second row of crashes
crashes[c(1, 2), 1] # first and second row of crashes, first column
crashes$casualty_type # returns just one column
```

The final command used the dollar symbol (`$`) to subset a column.
We can use the same symbol to create a new column as follows:

```{r}
vehicle_type = c("car", "bus", "tank")
crashes$vehicle_type = vehicle_type
ncol(crashes)
```

Notice that the dataset now has three columns after we added one to the right of the previous one.
Note also that this would involve copying and pasting cells in Excel, but in R it is instant and happens as fast as you can type the command.
To confirm that what we think has happened has indeed happened, print out the object again to see its contents:

```{r}
crashes
```

In Section \@ref(data) we will use `filter()` and `select()` functions to subset rows and columns.
Before we get there, it is worth practicing subsetting using the square brackets to consolidate your understanding of how base R works with vector objects such as `vehicle_type` and data frames such as `crashes`.
If you can answer the following questions, congratulations,  you are ready to move on.
If not, it's worth doing some extra reading and practice on the topic of subsetting in base R.


```{r, eval=FALSE, echo=FALSE}
crashes[, c(1, 3)] # first and third column of crashes by positional numbers
crashes[c(2), c(3)]
crashes[c(2), c(2, 3)]
class(crashes[, c(1, 3)])
class(crashes[c(2), c(3)])
```

<!-- ## Exercises {-} -->
<!-- Todo: should these be sections? -->

**Exercises**

1. Use the `$` operator to print the `vehicle_type` column of `crashes`.
1. Subset the crashes with the `[,]` syntax so that only the first and third columns of `crashes` are returned.
1. Return the 2^nd^ row and the 3^rd^ column of the `crashes` dataset. 
1. Return the 2^nd^ row and the columns 2:3 of the `crashes` dataset. 
1. **Bonus**: what is the `class()` of the objects created by each of the previous exercises? 

## Subsetting by values

It is also possible to subset objects by the values of their elements.
This works because the `[` operator accepts logical vectors returned by queries such as 'Is it less than 3?' (`x < 3` in R) and 'Was it light?' (`crashes$dark == FALSE`), as demonstrated below:

```{r, eval=FALSE}
x[c(TRUE, FALSE, TRUE, FALSE, TRUE)] # 1st, 3rd, and 5th element in x
x[x == 5] # only when x == 5 (notice the use of double equals)
x[x < 3] # less than 3
x[x < 3] = 0 # assign specific elements
casualty_age[casualty_age %% 6 == 0] # just the ages that are a multiple of 6
crashes[crashes$dark == FALSE, ] # just crashes that occured when it wasnt dark
```

**Exercises**

1. Subset the `casualty_age` object using the inequality (`<`) so that only elements less than 50 are returned.
1. Subset the `crashes` data frame so that only tanks are returned using the `==` operator.
1. **Bonus**: assign the age of all tanks to 61.

```{r, eval=FALSE, echo=FALSE}
casualty_age[casualty_age < 50] # the  casualty_age less than 50
crashes[crashes$vehicle_type == "tank", ] # rows where the name is tank
crashes$casualty_age[crashes$vehicle_type == "tank"] = 61
```

## Dealing with NAs and recoding

R objects can have a value of NA. NA is how R represents missing data.

```{r, eval=FALSE}
z = c(4, 5, NA, 7)
```

NA values are common in real-world data but can cause trouble. For example:

```{r, eval=FALSE}
sum(z) # result is NA
```

Some functions can be told to ignore NA values.

```{r, eval=FALSE}
sum(z, na.rm = TRUE) # result is equal to 4 + 5 + 7
```

You can find NAs using the `is.na()` function, and then remove them:

```{r, eval=FALSE}
is.na(z)
z_no_na = z[!is.na(z)] # note the use of the not operator !
sum(z)
```

If you remove records with NAs, be warned: the average of a value excluding NAs may not be representative.

## Changing class

Sometimes you may want to change the class of an object.
This is called class coercion, and can be done with functions such as `as.logical()`, `as.numeric()` and `as.matrix()`.

**Exercises**

1. Coerce the `vehicle_type` column of `crashes` to the class `character`.
1. Coerce the `crashes` object into a matrix. What happened to the values?
1. **Bonus:** What is the difference between the output of `summary()` on `character` and `factor` variables?

```{r, echo=FALSE, eval=FALSE}
crashes$vehicle_type = as.character(crashes$vehicle_type)
as.matrix(crashes)
```

## Recoding values

Often it is useful to 'recode' values.
In the raw STATS19 files, for example, -1 means NA.
There are many ways to recode values in R, the simplest and most mature of which is the use of 'factors', which are whole numbers representing characters.  Factors are commonly used to manage categorical variables such as sex, ethnicity or, in road traffic research, vehicle type or casualty injury severity.   

```{r}
z = c(1, 2, -1, 1, 3)
l = c(NA, "a", "b", "c") # labels in ascending order
z_factor = factor(z, labels = l) # factor z using labels l
z_character = as.character(z_factor) # convert factors to characters
z_character
```

**Exercises**

1. Recode `z` to Slight, Serious and Fatal for 1:3 respectively.
1. **Bonus:** read the help file at `?dplyr::case_when` and try to recode the values using this function.

## Saving R objects

You can also save individual R objects as `.Rds` files.
The `.Rds` format is the data format for R, meaning that any R object can be saved as an `Rds` file,  equivalent to saving an Excel spreadsheet as a `.xlsx` file.
The following command saves the `crashes` dataset into a compressed file called `crashes.Rds`:

```{r}
saveRDS(crashes, "crashes.Rds")
```

Try reading in the data just saved, and checking that the new object is the same as `crashes`, as follows:

```{r}
crashes2 = readRDS("crashes.Rds")
identical(crashes, crashes2)
```

R also supports many other formats, including CSV files, which can be created and imported with the functions `readr::read_csv()` and `readr::write_csv()` (see also the [readr](https://readr.tidyverse.org/) package).

```{r readr-write, eval=FALSE}
readr::write_csv(crashes, "crashes.csv") # uses the write_csv function from the readr package
crashes3 = readr::read_csv("crashes.csv")
identical(crashes3, crashes) 
```

Notice that `crashes3` and `crashes` are not identical. What has changed? Hint: read the help page associated with `?readr::write_csv`.

## Now you are ready to use RStudio

**Bonus: reproduce the following plot by typing the following code into the console.**

<!-- Todo: find a better place for this comment: -->
<!-- Note: Any characters on a line of R script following the '#' symbol are not read as part of the computer code. It is used for writing plain English comments on the code to help naviage and understand what you have written. It is also used to 'comment-out' lines of code if you do not want them to run. See section 3.6 for more details. -->

```{r}
eyes = c(2.3, 4, 3.7, 4)
eyes = matrix(eyes, ncol = 2, byrow = T)
mouth = c(2, 2, 2.5, 1.3, 3, 1, 3.5, 1.3, 4, 2)
mouth = matrix(mouth, ncol = 2, byrow = T)
plot(eyes, type = "p", main = "RRR!", cex = 2, xlim = c(1, 5), ylim = c(0, 5))
lines(mouth, type = "l", col = "red")
```


```{r smile, out.width="30%", fig.align="center", echo=FALSE, eval=FALSE}
# pdf("figures/smile.pdf")
# dev.off()
knitr::include_graphics("figures/smile.png")
```