07-space.Rmd

# Spatial data {#space}

From displaying simple point data to examining collision density along routes or between areas, the geographic property of STATS19 data is one of its most useful attributes.
Mapping is a hugely useful and powerful aspect of R and has many applications in road safety, both in understanding geographic trends and presenting insight to colleagues.
This aspect of R is covered in detail in the book [Geocomputation With R](https://geocompr.robinlovelace.net/index.html) [@lovelace_geocomputation_2019].
By mapping collision data in R, you can add layers containing other geographic datasets to further understand the reasons for certain trends. 
This can lead to new opportunities for intervention and collaboration with other parties who may have mutually compatible solutions for reaching their goals.
We will use the following packages in this section:

```{r, message=FALSE}
library(sf)        # load the sf package for working with spatial data
library(tidyverse) # load the tidyverse as before
```

## sf objects

All road crashes happen somewhere and, in the UK at least, all collisions recorded by the police are given geographic coordinates. These can help in prioritising interventions to save lives by focusing on ‘crash hotspots.’
R has strong geographic data capabilities with the `sf` package, providing a generic class for spatial vector data. Points, lines and polygons are represented in `sf` as objects in a special 'geometry column', typically called 'geom' or 'geometry', extending the data frame class we've already seen in `crashes`, created in Section \@ref(tibbles) (repeated here to consolidate data frame creation):

```{r}
crashes = tibble(
  casualty_type = c("pedestrian", "cyclist", "cat"),
  casualty_age = seq(from = 20, to = 60, by = 20),
  vehicle_type = c("car", "bus", "tank"),
  dark = c(TRUE, FALSE, TRUE)
)
```

Create an `sf` data frame called `crashes_sf` that expands the `crashes` data frame to include a geometry column based on the `crashes` longitude and latitude data as follows:

```{r crashes-sf, fig.height=2, fig.width=3, message=FALSE, warning=FALSE}
crashes_sf = crashes # create copy of crashes dataset
crashes_sf$longitude = c(-1.3, -1.2, -1.1)
crashes_sf$latitude = c(50.7, 50.7, 50.68)
crashes_sf = st_as_sf(crashes_sf, coords = c("longitude", "latitude"), crs = 4326) # st_as_sf converts longitude and latitude coordinates into spatial objects using a specified Coordinate Reference System (see 7.6)
# plot(crashes_sf[1:4]) # basic plot
# mapview::mapview(crashes_sf) # for interactive map
```

**Exercises:**

1. Plot only the geometry column of `crashes_sf` (**Hint:** the solution may contain `$geometry`). If the result is like that in Figure \@ref(fig:crashes-sf-ex), congratulations, it worked!
1. Plot `crashes_sf`, only showing the age variable.
1. Plot the 2^nd^ and 3^rd^ crashes, showing which happened in the dark.
1. **Bonus:** How far apart are the points? (**Hint:** `sf` functions begin with `st_`)
1. **Bonus:** Near which settlement did the tank runover the cat?

```{r crashes-sf-ex, echo=FALSE, out.width="30%", fig.show='hold', fig.cap="The `crashes_sf` dataset shown in map form with the function `plot()`."}
plot(crashes_sf$geometry)
plot(crashes_sf["casualty_age"])
plot(crashes_sf[2:3, "dark"])
# st_distance(crashes_sf)
# Bembridge

# # updload geographic crash data
# write_sf(crashes_sf, "crashes_sf.geojson")
# piggyback::pb_upload("crashes_sf.geojson")
```

## Reading and writing spatial data

You can read and write spatial data with `read_sf()` and `write_sf()`, as shown below (see `?read_sf`):

```{r, eval=FALSE}
write_sf(zones, "zones.geojson") # saves the spatial dataset called zones as geojson file type
write_sf(zones, "zmapinfo", driver = "MapInfo file") # saves the dataset as a MapInfo file
read_sf("zmapinfo") # reads-in mapinfo file
```

See [Chapter 6](https://geocompr.robinlovelace.net/read-write.html) of *Geocomputation with R* for further information [@lovelace_geocomputation_2019].

## sf polygons

`sf` objects can also represent administrative zones.
This is illustrated below with reference to `zones`, a spatial object representing the Isle of Wight, that we will download using the `pct` package (**Note:** the `[1:9]` appended to the function selects only the first 9 columns).

```{r}
zones = pct::get_pct_zones("isle-of-wight")[1:9]
```
**Exercises:**

1. What is the class of the `zones` object?
2. What are its column names?
3. Print its first 2 rows and columns 6:8 (the result is below).

```{r, echo=FALSE}
# class(zones)
# names(zones)
zones[1:2, c(1, 5, 6, 7, 8)]
```

## Spatial subsetting and sf plotting

Like index and value subsetting, spatial subsetting can be done with the `[]` notation.
We can identify the crashes (`crashes_sf`) that occur in the Isle of Wight (`zones`) by subsetting as follows (i.e. subset `zones` by whether it contains data in `crashes_sf`):

```{r, message=FALSE}
zones_containing_crashes = zones[crashes_sf, ]
```

To plot a new layer on top of an existing `sf` plot, use the `add = TRUE` argument, e.g. as follows:

```{r, eval=FALSE}
plot(zones$geometry) # plot just the geometry of one layer
plot(zones_containing_crashes$geometry, col = "grey", add = TRUE)
```

Remember to plot only the `geometry` column of objects to avoid multiple maps.
Colours can be set with the `col` argument.

**Exercises:**

1. Plot the geometry of the zones, with the zones containing crashes overlaid on top in red (see Figure \@ref(fig:sp-ex)).
2. Plot the zone containing the 2^nd^ crash in blue (see Figure \@ref(fig:sp-ex)).
3. **Bonus:** Plot all zones that intersect with a zone containing crashes, with the actual crash points plotted in black (see Figure \@ref(fig:sp-ex)).

```{r sp-ex, echo=FALSE, out.width="33%", fig.show='hold', message=FALSE, warning=FALSE, fig.cap="Illustration of the results of spatial subsetting."}
plot(zones$geometry)
plot(zones_containing_crashes$geometry, col = "red", add = TRUE)
plot(zones$geometry)
plot(zones[crashes_sf[2, ], ], col = "blue", add = TRUE)
plot(zones$geometry)
plot(zones[zones_containing_crashes, ], col = "yellow", add = TRUE)
plot(crashes_sf$geometry, pch = 20, add = TRUE)
```

## Geographic joins

Geographic joins involve assigning values from one object to a new column in another, based on the geographic relationship between them.
With `sf` objects, the data from the `crashes_sf` dataset is joined onto the 'target' `zones` dataset, to create a new object called `zones_joined`:

```{r, message=FALSE}
zones_joined = st_join(zones[1], crashes_sf)
```

The above code takes the `geo_code` column data from `zones`, matches it to the `geometry` column in `crashes_sf` and then joins it to the crashes that have occurred in those geo_codes.  The matched, joined `geo_code` is a new column in the `zone_joined` dataset. We now know the administrative geo_code in which each crash occured. 

**Exercises:**

1. Plot the `casualty_age` variable of the new `zones_joined` object (see Figure \@ref(fig:joinf), to verify the result).
2. How many zones are returned in the previous command? 
3. Select only the `geo_code` column from the `zones` and the `dark` column from `crashes_sf` and use the `left = FALSE` argument to return only zones in which crashes occurred. Plot the result. (**Hint:** it should look like the map shown in Figure \@ref(fig:joinf))

See [Chapter 4](https://geocompr.robinlovelace.net/spatial-operations.html#spatial-joining) of *Geocomputation with R* [@lovelace_geocomputation_2019] for further information on geographic joins.

```{r joinf, echo=FALSE, out.width="40%", fig.show='hold', message=FALSE, fig.cap="Illustration of geographic joins."}
plot(zones_joined["casualty_age"])
zjd = st_join(zones[1], crashes_sf["dark"], left = FALSE)
plot(zjd)
```

## Coordinate Reference Systems

A Coordinate Reference Systems (CRS) is used for plotting data on maps.
There are many systems in use but they can generally be classified into two groups; 'projected' and 'geographic'.
A projected system, such as Eastings/Northings, plots locations onto a flat 2D projection of the Earth's surface.
A geographic system, such as Longitude/Latitude, refers to locations on the 3D surface of the globe.
Distance and direction calculations work differently between geographic and projected CRSs, so it is often necessary to convert from one to another.
Fortunately, R makes this very easy, and every CRS has its own unique reference number. For example, 27700 for the British National Grid system. 

CRSs define how two-dimensional points (such as longitude and latitude) are actually represented in the real world.  A CRS value is needed to interpret and give true meaning to coordinates. You can get and set CRSs with the command `st_crs()`.
Transform CRSs with the command `st_transform()`, as demonstrated in the code chunk below, which converts the 'lon/lat' geographic CRS of `crashes_sf` into the projected CRS of the British National Grid:

```{r crs1}
crashes_osgb = st_transform(crashes_sf, 27700)
```

<!-- From Will -->
<!-- When you define the geometry in a data frame, it creates a multipoint vector. -->
<!-- This is a single column containing coordinate pairs for point data, or multiple coordinate pairs for line vectors or polygons. Sometimes it may be necessary to convert from one CRS to another without creating a multipoint vector. -->
<!-- The following code is one way to convert Eastings/Northings (CRS 27700) to Latitude/Longitude (CRS 4326) without creating a multipoint vector. -->

```{r, eval=FALSE, echo=FALSE}
# data frame name is Crashes, containing location data in separate columns for Eastings and Northings
latlongcoords = (cbind(Crashes$Easting , Crashes$Northing))
latlong = SpatialPointsDataFrame(latlongcoords,data = Crashes ,proj4string = CRS("+init=epsg:27700"))
latlong_SP = spTransform(latlong,CRS("+init=epsg:4326"))
Crashes_latlong = as.data.table(latlong_SP)
```

**Exercises:**

1. Try to subset the zones with the `crashes_osgb`. What does the error message say?
2. Create `zones_osgb` by transforming the `zones` object.
3. **Bonus:** Use `st_crs()` to find out the units measurement of the British National Grid.

For more information on CRSs see [Chapter 6](https://geocompr.robinlovelace.net/reproj-geo-data.html) of *Geocompuation with R* [@lovelace_geocomputation_2019].

## Buffers

Buffers are polygons surrounding geometries, usually with fixed distance.
For example, in road safety research a 30m buffer can be created around crash locations to identify crashes that happened in close proximity to a particular junction or road segment.  
<!-- Currently buffer operations in R only work on objects with projected CRSs. -->

**Exercises:**

1. Find out and read the help page of `sf`'s buffer function.
2. Create an object called `crashes_1km_buffer` representing the area within 1 km of the `crashes_osgb` object and plot the result using the command: `plot(crashes_1km_buffer)`. As a fun bonus, try: `mapview::mapview(crashes_1km_buffer)`.
3. **Bonus:** Try creating buffers on the geographic version of the `crashes_sf` object. What happens?

```{r, eval=FALSE, echo=FALSE}
crashes_1km_buffer = sf::st_buffer(crashes_osgb, 1000)
plot(crashes_1km_buffer)
mapview::mapview(crashes_1km_buffer)
```

## Attribute operations on sf objects

We can do non-spatial operations on `sf` objects because they are `data.frame`s. 
Try the following attribute operations on the `zones` data:

```{r, eval=TRUE}
# load example dataset if it doesn't already exist
zones = pct::get_pct_zones("isle-of-wight")
sel = zones$all > 3000  # create a subsetting object
zones_large = zones[sel, ] # subset areas with a population over 3,000
zones_2 = zones[zones$geo_name == "Isle of Wight 002",] # subset based on 'equality' query
zones_first_and_third_column = zones[c(1, 3)]
zones_just_all = zones["all"]
```

**Exercises:**

1. Practice the subsetting techniques you have learned on the `sf data.frame` object `zones`:
     * Create an object called `zones_small`, which contains only regions with less than 3000 people in the `all` column.
     * Create a selection object called `sel_high_car` which is `TRUE` for regions with above median numbers of people who travel by car and `FALSE` otherwise.
     * Create an object called `zones_foot` which contains only the foot attribute from `zones`.
     * **Bonus 1:** plot `zones_foot` using the function `plot` to show where walking is a popular mode of travel to work.
     * **Bonus 2:** building on your answers to previous questions, use `filter()` from the `dplyr` package to subset small regions where car use is high. 
1. **Bonus:** What is the population density of each region (**Hint:** you may need to use the functions `st_area()`, `as.numeric()` and use the 'all' column)?
1. **Bonus:** Which zone has the highest percentage of people who cycle?

```{r, echo=FALSE, eval=FALSE}
# 1. Practice subsetting techniques you have learned on the `sf data.frame` object `zones`:
#      1. Create an object called `zones_small` which contains only regions with less than 3000 people in the `all` column
# in base R
zones_small = zones[zones$all < 3000, ]
# with dplyr
zones_small = zones %>% 
  filter(all < 3000)
#      1. Create a selection object called `sel_high_car` which is `TRUE` for regions with above median numbers of people who travel by car and `FALSE` otherwise
median_car = median(zones$car_driver)
sel_high_car = zones$car_driver > median_car 
#      1. How many regions have the number '1' in the column 'geo_name'? What percentage of the regions in the Isle of Wight is this?
sel_region_name_contains_1 = grepl("1", x = zones$geo_name)
sum(sel_region_name_contains_1) / nrow(zones)
#      1. Create an object called `zones_foot` which contains only the foot attribute from `zones`
# using base R
zones_foot = zones["foot"]
# dplyr
zones_foot = zones %>% 
  select(foot)
#      1. Bonus: plot the result to show where walking is a popular mode of travel to work
plot(zones_foot)
#      1. Bonus: bulding on your answers to previous questions, use `filter()` from the `dplyr` package to subset small regions where high car use is high
zones_small_car_high = zones %>% 
  filter(all < 3000, car_driver > median_car)
# 1. Bonus: What is the population density of each region (hint: you may need to use the functions `st_area()`, `as.numeric()` and use the 'all' column)?
zones$area_km2 = as.numeric(st_area(zones)) /1000000
zones$population_density = zones$all / zones$area_km2
plot(zones["population_density"])
# in dplyr
zones_density = zones %>% 
  mutate(area_km2 = as.numeric(st_area(geometry)) / 1000000) %>% 
  mutate(population_density = all / area_km2)
plot(zones_density %>% select(population_density))
# 1. Bonus: Which zone has the highest percentage who cycle?
zones %>% 
  mutate(pcycle = bicycle / all) %>% 
  top_n(n = 1, wt = pcycle)
# 1. Bonus: Find the proportion of people who drive to work (`car_driver`) in areas in which more than 500 people walk to work
zones %>% 
  group_by(foot > 500) %>% 
  summarise(mean_car = sum(car_driver) / sum(all) )
```

<!-- ## Matching roads to crashes -->

<!-- I think you forgot something here. For example we could introduce `st_nearest_feature`? Or counting using `st_within` and `st_buffer`.  -->

## Mapping road crash data

So far we have used the `plot()` function to make maps.
That's fine for basic visualisation, but for publication-quality maps we recommend using `tmap`. See Chapter 8 of *Geocomputation with R* [@lovelace_geocomputation_2019] for further explanation and alternatives.
After installation, load the package as follows:

```{r}
library(tmap)
tmap_mode("plot") # this sets the tmap mode to plotting as opposed to interactive
```

**Exercises:**

1. Create the plots of the `zones` object using `plot()` and `tm_shape() + tm_polygons()` functions (see Figure \@ref(fig:plot3)).
2. Create an interactive version of the `tmap` plot by setting `tmap_mode("view")` and re-running the plotting commands.
3. Add an additional layer to the interactive map showing the location of crashes, using marker and dot symbols.
4. **Bonus:** Change the default basemap (**Hint:** you may need to search in the package documentation or online for the solution).

```{r plot3, fig.show='hold', out.width="50%", echo=FALSE, fig.cap="Illustration of the plot and tmap approaches for creating maps."}
plot(zones[c("all", "bicycle")])
tm_shape(zones) + 
  tm_polygons(c("all", "bicycle"))
# tmap_mode("view")
# m = tm_shape(zones_joined) + 
#   tm_polygons(c("casualty_type")) +
#   tm_scale_bar()
# m
# knitr::include_graphics("tmap-zones-interactive.png")
```

## Analysing point data

Based on the saying, "Don't run before you can walk," we've learned the vital foundations of R before tackling a real dataset.
Temporal and spatial attributes are key to road crash data, hence the emphasis on `lubridate` and `sf`.
Visualisation is central to understanding data and influencing policy, which is where `tmap` comes in.
With these solid foundations, plus knowledge of how to ask for help (we recommend reading R's internal help, asking colleagues, searching the internet and creating new questions or comments on online forums such as StackOverflow or GitHub and we suggest you follow this order of resources to get help), you are ready to test the methods on some real data.

Before doing so, take a read of the `stats19` vignette, which can be launched as follows:

```{r, eval=FALSE}
vignette(package = "stats19") # view all vignettes available on stats19
vignette("stats19") # view the introductory vignette
```

This should now be sufficient to tackle the following exercises:

1. Download and plot all crashes reported in Great Britain in 2018. (**Hint:** see [the stats19 vignette](https://docs.ropensci.org/stats19/articles/stats19.html))
2. Find the function in the `stats19` package that converts a `data.frame` object into an `sf` data frame. Use this function to convert the road crashes into an `sf` object, called `crashes_sf`, for example.
3. Filter crashes that happened in the Isle of Wight based on attribute data. (**Hint:** the relevant column contains the word `local`)
4. Filter crashes happened in the Isle of Wight using geographic subsetting. (**Hint:** remember `st_crs()`?)
5. **Bonus:** Which type of subsetting yielded more results and why? 
6. **Bonus:** How many crashes happened in each zone?
7. Create a new column called `month` in the crash data using the function `lubridate::month()` and the `date` column.
8. Create an object called `a_zones_may` representing all the crashes that happened in the Isle of Wight in the month of May.
9. **Bonus:** Calculate the average (`mean`) speed limit associated with each crash that happened in May across the zones of the Isle of Wight (the result is shown in the map).


```{r, echo=FALSE, results='hide', message=FALSE, eval=FALSE}
library(stats19)
library(dplyr)
library(sf)
a = get_stats19(2018, "ac")
asf = format_sf(a)
a_zones = asf %>% 
  filter(local_authority_district == "Isle of Wight")
nrow(a_zones)
zones = pct::get_pct_zones(region = "isle-of-wight")
zones_osbg = st_transform(zones, 27700)
a_zones_sf = a_zones[zones_osbg, ]
nrow(a_zones_sf)
# mapview::mapview(zones) +
#   mapview::mapview(a_zones)
class(a$date)
class(a$time)
a_zones$month = lubridate::month(a_zones$date)
a_zones_may = a_zones %>% 
  filter(month == 5)
a_agg = aggregate(a_zones_may["speed_limit"], zones_osbg, mean)
plot(a_agg)
class(a$date)
```

Other mapping functions that can be powerful presentation tools are also explained in [Geocomputation With R](https://geocompr.robinlovelace.net/index.html) [@lovelace_geocomputation_2019].
These include leaflet maps, which have interactive properties and can be shared as .html files, and shiny apps that give even more interactivity as well as the possibility to embed on websites. 

## Analysing crash data on road networks

Road network data can be accessed from a range of sources, including OpenStreetMap (OSM) and Ordnance Survey.
We will use some OSM data from the Isle of Wight, which can be loaded as follows:

```{r}
u = "https://github.com/ropensci/stats19/releases/download/1.1.0/roads_key.Rds" 
roads_wgs = readRDS(url(u))
roads = roads_wgs %>% st_transform(crs = 27700)
```

You should already have road crashes for the Isle of Wight from the previous stage.
If not, load crash data as follows:

```{r}
u = "https://github.com/ropensci/stats19/releases/download/1.1.0/car_accidents_2017_iow.Rds"
crashes_iow = readRDS(url(u))
```

1. Plot the roads with the crashes overlaid.
2. Create a buffer around the roads with a distance of 200m.
3. How many crashes fall outside the buffered roads?
3. **Bonus:** Use the `aggregate()` function to identify how many crashes happened per segment and plot the result with `tmap` and plot the crashes that happened outside the road buffers on top, as shown in Figure \@ref(fig:iowbuff) (hint: see `?aggregate.sf` and take a read of Section [4.2.5](https://geocompr.robinlovelace.net/spatial-operations.html#spatial-aggr) of *Geocomputation with R* [@lovelace_geocomputation_2019]). Try undertaking the same steps for a different region of your choice. An example for Essex is shown in the code chunk, which results in Figure \@ref(fig:ex), (thanks to Will Cubbin).^[
The code chunk downloads and filters road data to show only main roads, and saves as an .Rds file for later use in your R project.
The geometry of the `roads.Rds` file from this example is a one dimensional line representing the centre of the road.
Some functions require the road to be treated as 2D polygon with a width property.
You can use the `st_buffer()` function to turn this vector into a polygon.
]

```{r iowbuff, message=FALSE, echo=FALSE, out.width="49%", fig.show='hold', message=FALSE, fig.cap="Maps of the Isle of Wight."}
plot(roads$geometry)
plot(crashes_iow["accident_severity"], add = TRUE)
roads_buffer = st_buffer(roads, 200, endCapStyle = "FLAT")
crashes_outside_roads = crashes_iow[roads_buffer, , op = sf::st_disjoint]
roads_agg = aggregate(crashes_iow[1], by = roads_buffer, FUN = length)
# plot(roads_agg, border = NA, main = "")
names(roads_agg)[1] = "N. Crashes"
tmap_mode("plot")
tm_shape(roads_agg) + tm_fill("N. Crashes") +
  tm_shape(crashes_outside_roads) + tm_dots(col = "blue") +
  tm_layout(frame = FALSE)
```


```{r, eval=FALSE}
remotes::install_github("itsleeds/osmextract") # install github package for osm data
library(osmextract)
region_name = "essex" # "essex" can be changed to another area name as required
osm_data = oe_get(region_name, extra_tags = c("maxspeed", "ref"))
table(osm_data$highway)
#filter osm_data to show only major roads
roads = osm_data %>%
  filter(str_detect(highway, pattern = "moto|prim|seco|tert|trunk"))
# transform geometry and save
roads = st_transform(roads, 27700) #converts to projected BNG system for later use
# plot(roads$geometry) # basic plot of roads
tm_shape(roads) + tm_lines("maxspeed", showNA = T, lwd = 2)
saveRDS(roads, file = "roads.Rds") # Saves road dataset for future use
```

```{r, echo=FALSE, eval=FALSE}
piggyback::pb_upload("roads.Rds")
piggyback::pb_download_url("roads.Rds")
# "https://github.com/ITSLeeds/rrsrr/releases/download/v0/roads.Rds"
```

```{r ex, echo=FALSE, fig.cap="Roads in Essex downloaded with the code shown above."}
ur = "https://github.com/ITSLeeds/rrsrr/releases/download/v0/roads.Rds"
roads = readRDS(url(ur))
tm_shape(roads) + tm_lines("maxspeed", showNA = T, lwd = 2)
```

## Bonus exercises {-}

Identify a region and zonal units of interest from http://geoportal.statistics.gov.uk/ or from the object `police_boundaries` in the `stats19` package.

1. Read them into R as an `sf` object.
2. Create a map showing the number of crashes in each zone.
3. Identify the average speed limit associated with crashes in each zone.
4. Identify an interesting question you can ask to the data and use exploratory data analysis to find answers.
5. Check another [related project](https://github.com/agila5/leeds_seminar) for further information on smoothing techniques of counts on a linear network. 
<!-- 1. Take a look at the code in [the file iow_example.R in the inst directory in the stats19 repo](https://github.com/ropensci/stats19/blob/master/inst/iow_example.R). Run it to create smoothed estimates of crash frequency on the road network (see [code in the  GitHub repo](https://github.com/agila5/leeds_seminar/blob/master/examples.R) for further information on these preliminary methods). -->


```{r final-plot, echo=FALSE, out.width="100%"}
# knitr::include_graphics("final-figure.png")
```