---
output: github_document
bibliography: vignettes/references.bib
---
<!-- badges: start -->
<!-- [![Travis build status](https://travis-ci.org/ropensci/stats19.svg?branch=master)](https://travis-ci.org/ropensci/stats19) -->
[![](http://www.r-pkg.org/badges/version/stats19)](https://www.r-pkg.org/pkg/stats19)
[![R-CMD-check](https://github.com/ropensci/stats19/workflows/R-CMD-check/badge.svg)](https://github.com/ropensci/stats19/actions)
[![CRAN RStudio mirror downloads](https://cranlogs.r-pkg.org/badges/grand-total/stats19)](https://www.r-pkg.org/pkg/stats19)
[![Life cycle](https://img.shields.io/badge/lifecycle-stable-brightgreen.svg)](https://lifecycle.r-lib.org/articles/stages.html)
[![](https://badges.ropensci.org/266_status.svg)](https://github.com/ropensci/software-review/issues/266)
[![DOI](https://joss.theoj.org/papers/10.21105/joss.01181/status.svg)](https://doi.org/10.21105/joss.01181)
![codecov](https://codecov.io/gh/ropensci/stats19/branch/master/graph/badge.svg)
<!-- badges: end -->
<!-- [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.2540781.svg)](https://doi.org/10.5281/zenodo.2540781) -->
<!-- [![Gitter chat](https://badges.gitter.im/ITSLeeds/stats19.png)](https://gitter.im/stats19/Lobby?source=orgpage) -->
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# stats19 <a href='https://docs.ropensci.org/stats19/'><img src='https://raw.githubusercontent.com/ropensci/stats19/master/man/figures/logo.png' align="right" height=215/></a>
**stats19** provides functions for downloading and formatting road crash data.
Specifically, it enables access to the UK's official road traffic casualty database, [STATS19](https://www.data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-accidents-safety-data). (The name comes from the form used by the police to record car crashes and other incidents resulting in casualties on the roads.)
The raw data is provided as a series of `.csv` files that contain integers and which are stored in dozens of `.zip` files.
Finding, reading-in and formatting the data for research can be a time-consuming process subject to human error.
**stats19** speeds up these vital but boring and error-prone stages of the research process with a single function: `get_stats19()`.
By allowing public access to properly labelled road crash data, **stats19** aims to make road safety research more reproducible and accessible.
For transparency and modularity, each stage can be undertaken separately, as documented in the [stats19 vignette](https://itsleeds.github.io/stats19/articles/stats19.html).
The package has been peer reviewed, is stable, and has been published in the Journal of Open Source Software [@lovelace_stats19_2019].
Please tell people about the package, link to it and cite it if you use it in your work.
## Installation
Install and load the latest development version from GitHub with:
```{r eval=FALSE}
remotes::install_github("ropensci/stats19")
```
```{r attach}
library(stats19)
```
You can install the released version of stats19 from [CRAN](https://cran.r-project.org/package=stats19) with:
```r
install.packages("stats19")
```
## get_stats19()
`get_stats19()` requires `year` and `type` parameters, mirroring the provision of STATS19 data files, which are categorised by year (from 1979 onward) and type (with separate tables for crashes, casualties and vehicles, as outlined below).
The following command, for example, gets crash data from 2022 (**note**: we follow the "crash not accident" campaign of [RoadPeace](https://www.roadpeace.org/working-for-change/crash-not-accident/) in naming crashes, although the DfT refers to the relevant tables as 'accidents' data):
```{r}
crashes = get_stats19(year = 2022, type = "collision")
```
What just happened?
For the `year` 2022 we read-in crash-level (`type = "collision"`) data on all road crashes recorded by the police across Great Britain.
The dataset contains `r ncol(crashes)` columns (variables) for `r format(nrow(crashes), big.mark = ",")` crashes.
We were not asked to confirm the download: the `ask` argument of `get_stats19()` controls whether you are prompted before a file is downloaded.
The contents of this dataset, and other datasets provided by **stats19**, are outlined below and described in more detail in the [stats19 vignette](https://itsleeds.github.io/stats19/articles/stats19.html).
We will see below how the function also works to get the corresponding casualty and vehicle datasets for 2022.
The package also allows STATS19 files to be downloaded and read-in separately, giving more control over what you download and subsequently read-in with `read_collisions()`, `read_casualties()` and `read_vehicles()`, as described in the vignette.
## Data download
Data files can be downloaded without reading them in using the function `dl_stats19()`.
If there are multiple matches, you will be asked to choose from a range of options.
Providing just the year, for example, will result in the following options:
```{r dl2022-all, eval=FALSE}
dl_stats19(year = 2022, data_dir = tempdir())
```
```
Multiple matches. Which do you want to download?
1: dft-road-casualty-statistics-casualty-2022.csv
2: dft-road-casualty-statistics-vehicle-2022.csv
3: dft-road-casualty-statistics-collision-2022.csv
Selection:
Enter an item from the menu, or 0 to exit
```
## Using the data
STATS19 data consists of 3 main tables:
- Collisions, the main table which contains information on the crash time, location and other variables (`r ncol(accidents_sample)` columns in total)
- Casualties, containing data on people hurt or killed in each crash (`r ncol(casualties_sample)` columns in total)
- Vehicles, containing data on vehicles involved in or causing each crash (`r ncol(vehicles_sample)` columns in total)
The contents of each are outlined below.
### Crash data
Crash data was downloaded and read-in using the function `get_stats19()`, as described above.
```{r read2022-raw-format}
nrow(crashes)
ncol(crashes)
```
Some of the key variables in this dataset include:
```{r crashes2022-columns}
key_column_names = grepl(pattern = "severity|speed|pedestrian|light_conditions", x = names(crashes))
crashes[key_column_names]
```
For the full list of columns, run `names(crashes)` or see the [vignette](https://github.com/ropensci/stats19/blob/master/vignettes/stats19.Rmd).
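The column selection above relies on `grepl()` returning a logical vector with one element per column name. A minimal sketch of that behaviour, using a made-up vector of names (`toy_names` is hypothetical and only resembles the real column names):

```r
# Hypothetical column names, chosen to resemble those in the collision table
toy_names = c("accident_index", "accident_severity", "speed_limit", "date", "light_conditions")

# TRUE where a name contains any of the alternatives separated by "|"
is_key = grepl(pattern = "severity|speed|pedestrian|light_conditions", x = toy_names)

# Keep only the matching names
toy_names[is_key]
```

Alternatives that match nothing (here `pedestrian`) are simply ignored, so the same pattern can be reused across years even if some columns are absent.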
<!-- This means `crashes` is much more usable than `crashes_raw`, as shown below, which shows three records and some key variables in the messy and clean datasets: -->
### Casualties data
As with `crashes`, casualty data for 2022 can be downloaded, read-in and formatted as follows:
```{r 2022-cas}
casualties = get_stats19(year = 2022, type = "casualty", ask = FALSE, format = TRUE)
nrow(casualties)
ncol(casualties)
```
The results show that there were `r format(nrow(casualties), big.mark=",")` casualties reported by the police in the STATS19 dataset in 2022, and `r ncol(casualties)` columns (variables).
Values for a sample of these columns are shown below:
```{r 2022-cas-columns}
casualties[c(4, 5, 6, 14)]
```
The full list of column names in the `casualties` dataset is:
```{r 2022-cas-columns-all}
names(casualties)
```
### Vehicles data
Data for vehicles involved in crashes in 2022 can be downloaded, read-in and formatted as follows:
```{r dl2022-vehicles}
vehicles = get_stats19(year = 2022, type = "vehicle", ask = FALSE, format = TRUE)
nrow(vehicles)
ncol(vehicles)
```
The results show that there were `r format(nrow(vehicles), big.mark=",")` vehicles involved in crashes reported by the police in the STATS19 dataset in 2022, with `r ncol(vehicles)` columns (variables).
Values for a sample of these columns are shown below:
```{r 2022-veh-columns}
vehicles[c(3, 14:16)]
```
The full list of column names in the `vehicles` dataset is:
```{r 2022-veh-columns-all}
names(vehicles)
```
## Creating geographic crash data
An important feature of STATS19 data is that the collision table contains geographic coordinates.
These are provided at ~10m resolution in the UK's official coordinate reference system (the Ordnance Survey National Grid, EPSG code 27700).
**stats19** converts the non-geographic tables created by `format_collisions()` into the geographic data form of the [`sf` package](https://cran.r-project.org/package=sf) with the function `format_sf()` as follows:
```{r format-crashes-sf}
crashes_sf = format_sf(crashes)
```
The note arises because `NA` values are not permitted in `sf` coordinates, and so rows containing no coordinates are automatically removed.
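The effect of removing rows without coordinates can be illustrated on toy data. The sketch below uses base R only and is not the internal implementation of `format_sf()`; the data frame `toy` and its column names and values are made up for illustration:

```r
# Toy crash table: the second record has no coordinates (hypothetical values)
toy = data.frame(
  id = c("a", "b", "c"),
  easting = c(430000, NA, 428500),
  northing = c(433500, NA, 434200)
)

# Keep only rows where both coordinates are present
has_coords = complete.cases(toy[c("easting", "northing")])
toy_with_coords = toy[has_coords, ]

# Number of rows removed because of missing coordinates
n_removed = sum(!has_coords)
nrow(toy_with_coords)
n_removed
```

Comparing `nrow(crashes)` with `nrow(crashes_sf)` gives the equivalent count for the real data.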
Having the data in a standard geographic form allows various geographic operations to be performed on it.
The following code chunk, for example, returns all crashes within the boundary of West Yorkshire (which is contained in the object [`police_boundaries`](https://itsleeds.github.io/stats19/reference/police_boundaries.html), an `sf` data frame containing all police jurisdictions in England and Wales).
```{r crashes-leeds}
library(sf)
library(dplyr)
wy = filter(police_boundaries, pfa16nm == "West Yorkshire")
crashes_wy = crashes_sf[wy, ]
nrow(crashes_sf)
nrow(crashes_wy)
```
This subsetting has selected the `r format(nrow(crashes_wy), big.mark = ",")`
crashes which occurred within West Yorkshire in 2022.
## Joining tables
The three main tables we have just read-in can be joined by shared key variables.
This is demonstrated in the code chunk below, which subsets all casualties that took place in West Yorkshire, and counts the number of casualties by casualty type for each crash:
```{r table-join, message = FALSE}
sel = casualties$accident_index %in% crashes_wy$accident_index
casualties_wy = casualties[sel, ]
names(casualties_wy)
cas_types = casualties_wy %>%
select(accident_index, casualty_type) %>%
mutate(n = 1) %>%
group_by(accident_index, casualty_type) %>%
summarise(n = sum(n)) %>%
tidyr::spread(casualty_type, n, fill = 0)
cas_types$Total = rowSums(cas_types[-1])
cj = left_join(crashes_wy, cas_types, by = "accident_index")
```
What just happened? We found the subset of casualties that took place in West Yorkshire with reference to the `accident_index` variable.
Then we used functions from the **tidyverse** package **dplyr** (and `spread()` from **tidyr**) to create a dataset with a column for each casualty type.
We then joined the updated casualty data onto the `crashes_wy` dataset.
The result is a spatial (`sf`) data frame of crashes in West Yorkshire, with columns counting how many road users of different types were hurt.
The original and joined data look like this:
```{r table-join-examples}
crashes_wy %>%
select(accident_index, accident_severity) %>%
st_drop_geometry()
cas_types[1:2, c("accident_index", "Cyclist")]
cj[1:2, c(1, 5, 34)] %>% st_drop_geometry()
```
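The count-then-join pattern used above can also be tried on a few made-up records. The following is a minimal base R sketch, assuming hypothetical toy tables (`toy_crashes`, `toy_casualties`) whose column names mimic the real ones; it swaps the **dplyr** and **tidyr** calls for `table()` and `merge()`:

```r
# Toy tables with made-up values; column names mimic those used above
toy_crashes = data.frame(
  accident_index = c("c1", "c2", "c3"),
  speed_limit = c(30, 70, 30)
)
toy_casualties = data.frame(
  accident_index = c("c1", "c1", "c2", "c2", "c2"),
  casualty_type = c("Cyclist", "Pedestrian", "Car occupant", "Car occupant", "Cyclist")
)

# Count casualties of each type per crash: one row per crash, one column per type
counts = table(toy_casualties$accident_index, toy_casualties$casualty_type)
toy_cas_types = as.data.frame.matrix(counts)
toy_cas_types$Total = rowSums(toy_cas_types)
toy_cas_types$accident_index = rownames(toy_cas_types)

# Left join: keep every crash, including those with no matching casualty rows
toy_cj = merge(toy_crashes, toy_cas_types, by = "accident_index", all.x = TRUE)
toy_cj
```

Crashes with no matching casualty rows (`c3` here) get `NA` in the count columns, as a left join leaves unmatched rows unfilled.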
## Mapping crashes
The join operation added a geometry column to the casualty data, enabling it to be mapped (for more advanced maps, see the [vignette](https://itsleeds.github.io/stats19/articles/stats19.html)):
```{r}
cex = cj$Total / 3
plot(cj["speed_limit"], cex = cex)
```
The spatial distribution of crashes in West Yorkshire clearly relates to the region's geography.
Crashes tend to happen on busy motorways (which have a high speed limit of 70 miles per hour, as shown in the map above) and in city centres, those of Leeds and Bradford in particular.
The severity of crashes and the number of people hurt in them (proportional to circle width in the map above) are related to the speed limit.
STATS19 data can be used as the basis of road safety research.
The map below, for example, shows the results of an academic paper on the social, spatial and temporal distribution of bike crashes in West Yorkshire, which estimated the number of crashes per billion km cycled based on commuter cycling as a proxy for cycling levels overall (more sophisticated measures of cycling levels are now possible thanks to new data sources) [@lovelace_who_2016]:
```{r, echo=FALSE}
knitr::include_graphics("https://ars.els-cdn.com/content/image/1-s2.0-S136984781500039X-gr9.jpg")
```
## Time series analysis
We can also explore seasonal trends in crashes by aggregating crashes by day of the year:
```{r crash-date-plot}
library(ggplot2)
head(cj$date)
class(cj$date)
crashes_dates = cj %>%
st_set_geometry(NULL) %>%
group_by(date) %>%
summarise(
walking = sum(Pedestrian),
cycling = sum(Cyclist),
passenger = sum(`Car occupant`)
) %>%
tidyr::gather(mode, casualties, -date)
ggplot(crashes_dates, aes(date, casualties)) +
geom_smooth(aes(colour = mode), method = "loess") +
ylab("Casualties per day")
```
Different types of crashes also tend to happen at different times of day.
This is illustrated in the plot below, which shows the times of day when people who were travelling by different modes were most commonly injured.
```{r crash-time-plot}
library(stringr)
crash_times = cj %>%
st_set_geometry(NULL) %>%
group_by(hour = as.numeric(str_sub(time, 1, 2))) %>%
summarise(
walking = sum(Pedestrian),
cycling = sum(Cyclist),
passenger = sum(`Car occupant`)
) %>%
tidyr::gather(mode, casualties, -hour)
ggplot(crash_times, aes(hour, casualties)) +
geom_line(aes(colour = mode))
```
Note that cycling casualties show distinct morning and afternoon peaks [see @lovelace_who_2016 for more on this].
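The hourly aggregation behind this plot can be checked by hand on a few records. A minimal base R sketch, assuming a hypothetical data frame `toy` with times stored as `"HH:MM"` text and made-up casualty counts:

```r
# Toy records with made-up values: crash time as "HH:MM" text and cyclist casualty counts
toy = data.frame(
  time = c("08:15", "08:40", "17:05", "17:30", "17:55", "23:10"),
  Cyclist = c(1, 2, 1, 0, 1, 0)
)

# The first two characters of the time string give the hour
toy$hour = as.numeric(substr(toy$time, 1, 2))

# Sum cyclist casualties within each hour
by_hour = aggregate(Cyclist ~ hour, data = toy, FUN = sum)
by_hour
```

Hours with no recorded crashes do not appear in the result, which is worth bearing in mind when reading line plots of sparse data.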
## Usage in research and policy contexts
Examples of how the package has been used for policy making include:
- Use of the package in a web app created by the library service of the UK Parliament. See commonslibrary.parliament.uk^[Go to the following URL: commonslibrary.parliament.uk/constituency-data-traffic-accidents], screenshots of which from December 2019 are shown below, for details.
![](https://user-images.githubusercontent.com/1825120/70164249-bf730080-16b8-11ea-96d8-ec92c0b5cc69.png)
- Use of methods taught in the [stats19-training](https://docs.ropensci.org/stats19/articles/stats19-training.html) vignette by road safety analysts at Essex Highways and the Safer Essex Roads Partnership ([SERP](https://saferessexroads.org/)) to inform the deployment of proactive front-line police enforcement in the region (credit: Will Cubbin).
- Mention of road crash data analysis based on the package in an [article](https://www.theguardian.com/cities/2019/oct/07/a-deadly-problem-should-we-ban-suvs-from-our-cities) on urban SUVs.
The question of how vehicle size and type relates to road safety is an important area of future research.
A starting point for researching this topic can be found in the [`stats19-vehicles`](https://docs.ropensci.org/stats19/articles/stats19-vehicles.html) vignette, representing a possible next step in terms of how the data can be used.
## Next steps
There is much important research that needs to be done to help make the transport systems in many cities safer.
Even if you're not working with UK data, we hope that the data provided by **stats19** can help safety researchers develop new methods to better understand the reasons why people are needlessly hurt and killed on the roads.
The next step is to gain a deeper understanding of **stats19** and the data it provides.
Then it's time to pose interesting research questions, some of which could provide an evidence base in support of policies that save lives.
For more on these next steps, see the package's introductory [vignette](https://itsleeds.github.io/stats19/articles/stats19.html).
## Further information
The **stats19** package builds on previous work, including:
- code in the [bikeR](https://github.com/Robinlovelace/bikeR) repo underlying an academic paper on collisions involving cyclists
- functions in [**stplanr**](https://docs.ropensci.org/stplanr/) for downloading Stats19 data
- updated functions related to the [CyIPT](https://github.com/cyipt/stats19) project
[![ropensci_footer](https://ropensci.org/public_images/ropensci_footer.png)](https://ropensci.org)
## References