-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathspatially_weighted_avg.Rmd
444 lines (330 loc) · 29.6 KB
/
spatially_weighted_avg.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
---
title: "Spatially weighted averages in R with sf"
author: "Markus Konrad"
date: "7/01/2021"
output:
# pdf_document: default
html_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
dev = 'png',
fig.path='figures/',
fig.width = 8,
fig.height = 6
)
```
# Introduction
Spatial joins allow to augment one spatial dataset with information from another spatial dataset by linking overlapping features. In this post I will provide an example showing how to augment a dataset containing school locations with socioeconomic data of their surrounding statistical region using R and the package [*sf*](https://cran.r-project.org/web/packages/sf/index.html) (Pebesma 2018). This approach has the drawback that the surrounding statistical region doesn't reflect the actual catchment area of the school. I will present an alternative approach where the overlaps of the schools' catchment areas with the statistical regions allow to calculate the weighted average of the socioeconomic statistics. If we have no data about the actual catchment areas of the schools, we may resort to approximating these areas as circular regions or as Voronoi regions around schools.
# Data
For this example, I'd like to compare the percentage of children whose parents obtain social welfare in the neighborhood regions around public and private primary schools in Berlin. This blog post concentrates on how to join the point samples (the schools) with the surrounding statistical regions and calculate a spatially weighted average the welfare rate, so I will present only a few descriptive results in the end.
We will work with several datasets: The first spatial dataset contains the shape of the statistical regions in Berlin, the second dataset contains the socioeconomic data for these regions, the third and fourth datasets contain the locations and other attributes of public and private primary schools in Berlin, respectively.
All data and the code are available in the [GitHub repository](https://github.com/WZBSocialScienceCenter/spatially_weighted_avg). We will use the [*sf* package](https://cran.r-project.org/web/packages/sf/index.html) for working with spatial data in R, [*dplyr*](https://dplyr.tidyverse.org/) for data management and [*ggplot2*](https://ggplot2.tidyverse.org/) for a few more advanced visualizations, i.e. when base `plot()` is not sufficient.
```{r, warning=FALSE, message=FALSE}
library(sf)
library(dplyr)
library(ggplot2)
```
## Socioeconomic data for statistical regions
We will at first load a dataset with the most granular official statistical regions for Berlin, called [*Planungsräume* (planning areas)](https://www.stadtentwicklung.berlin.de/planen/basisdaten_stadtentwicklung/lor/). We select the area ID and name as spatial attributes. The result is a spatial dataframe (a *simple feature (sf)* collection).
```{r}
bln_plan <- read_sf('data/berlin_plr.shp') %>%
mutate(areaid = as.integer(SCHLUESSEL)) %>% # transform character SCHLUESSEL to numeric area ID
select(areaid, name = PLR_NAME)
head(bln_plan)
```
When printing this dataframe, the header reveals another important information: The coordinate reference system (CRS) of this dataset is [ETRS89 / UTM zone 33N](http://epsg.io/25833). We will later need to make sure that the coordinates of the school locations and the coordinates of the planning areas use the same coordinate system.
This data can be joined with socioeconomic information provided from official sources. Luckily, [Helbig/Salomo 2021](https://shiny2.wzb.eu/konrad/salomo_helbig_dashboard/) compiled these information for some cities in Germany (available for [download](https://shiny2.wzb.eu/konrad/salomo_helbig_dashboard/download/Helbig,%20Salomo%20-%20Sozialraeumliche%20Ungleichheiten%20-%20Daten%20Stand%202021-01-26.xlsx)) among which is data for Berlin from 2020. I've created an excerpt with percentages of residents receiving social welfare (`welfare`) and percentage of children under 15 years whose parents receive social welfare (`welfare_chld`):
```{r}
bln_welfare <- read.csv('data/berlin_welfare.csv', stringsAsFactors = FALSE)
head(bln_welfare)
```
We can use the area ID for augmenting the planning areas with the welfare statistics. We're joining a spatial with an ordinary dataframe, so we can use dplyr's `inner_join`. Before that we can check that for each planning area we have welfare statistics information and vice versa: ^[Note that when joining spatial and ordinary dataframes, the order of arguments in the join function matters. If you have a spatial dataframe on the "left side" (`x` argument), the result will be a spatial dataframe. If you have an ordinary dataframe on the left side, the result will be an ordinary dataframe, i.e. the merged dataset loses its "spatial nature" and spatial operations won't work with it any more (unless you convert it back to a spatial dataframe again with [`st_as_sf`](https://r-spatial.github.io/sf/reference/st_as_sf.html)).]
```{r}
setequal(bln_plan$areaid, bln_welfare$areaid)
```
```{r}
bln <- inner_join(bln_plan, bln_welfare, by = 'areaid') %>%
select(-name)
head(bln)
```
A quick plot confirms that it is similar to the one from the [dashboard of the Helbig/Salomo study](https://shiny2.wzb.eu/konrad/salomo_helbig_dashboard/?_state_id_=49379e5c63e0e14c).^[I prefer using the base `plot` function for quick exploration of spatial data and usually only turn to ggplot2 for more advanced or "publication ready" plots. The help page for `plot.sf` provides some information about the arguments of this plotting function used for sf objects.]
```{r 01cmap}
plot(bln['welfare_chld'])
```
The median percentage of children whose parents receive social welfare is ~20% with an interquartile range of about 29%. The following shows the distribution of this welfare rate:
```{r 02welfarehist}
hist(bln$welfare_chld,
main = 'Histogram of percentage of children under 15 years\nwhose parents receive social welfare',
xlab = '')
```
## Public and private primary schools
The [Berlin geodata catalog "FIS Broker"](https://stadtentwicklung.berlin.de/geoinformation/fis-broker/) provides the [locations of public schools in Berlin](https://fbinter.stadt-berlin.de/fb/index.jsp?loginkey=zoomStart&mapId=schulen@senstadt&bbox=362719,5798847,423243,5838687).^[The catalog is a bit clumsy to use, but actually works quite well: You search for the data, get an URL to the [WFS](https://en.wikipedia.org/wiki/Web_Feature_Service) endpoint from the data's metainformation panel and use that URL to obtain the data e.g. via a WFS layer in [QGIS](https://qgis.org/).] I obtained the data and converted it to GeoJSON, which we can now load. We'll only retain primary schools and add a variable denoting that these are public schools. We also see that the CRS of the school locations matches the CRS of the Berlin statistical regions data.
```{r}
pubschools <- read_sf('data/berlin_pubschools.geojson') %>%
filter(SCHULART == 'Grundschule') %>%
select(name = NAME) %>%
mutate(ownership = 'pub', .before = 1)
head(pubschools)
```
Now to the private schools' locations. [Marcel Helbig](https://wzb.eu/de/personen/marcel-helbig), [Rita Nikolai](https://www.erziehungswissenschaften.hu-berlin.de/de/institut/abteilungen/didaktik/As%20Kol/nikolai) and me collected data on school locations in East Germany from 1992 to 2015 in order to analyze the [development of the network of schools in East Germany and which role private schools play in it](https://bibliothek.wzb.eu/wzbrief-bildung/WZBriefBildung382018_helbig_konrad_nikolai.pdf) (Helbig/Konrad/Nikolai 2018). Besides creating an [interactive map](https://schulenkarte.wzb.eu/), we also [published the data](https://schulenkarte.wzb.eu/#daten) and are planning an update with newer data (until 2020) from which will we now use an excerpt. This dataset provides school locations from 2019 as [longitude/latitude WGS84 coordinates](https://gisgeography.com/wgs84-world-geodetic-system/) which we can load and convert into a spatial dataset using `st_as_sf`. We also transform these locations to the ETRS89 CRS used in all prior spatial datasets.
```{r}
privschools <- read.csv('data/grundschulen_berlin_2019.csv', stringsAsFactors = FALSE) %>%
filter(traeger == 'priv') %>%
select(ownership = traeger, name, lng, lat) %>%
st_as_sf(coords = c('lng', 'lat'), crs = 4326) %>% # EPSG 4326 is WGS84 lat/long coord.
st_transform(crs = st_crs(pubschools)) # transform to same CRS as publ. schools
head(privschools)
```
The variable `ownership` encodes whether a given facility is a public ("pub") or private ("priv") primary school. We can now append the public and private primary schools datasets to form a single `schools` dataset. The public school data comes from 2020 and the private school data from 2019, but this shouldn't be an issue because the number of public and private schools has been quite stable in recent years.
```{r}
schools <- bind_rows(pubschools, privschools) %>%
mutate(schoolid = 1:nrow(.), .before = 1)
head(schools)
```
In our dataset we now have 361 public and 71 private primary schools in Berlin.
# Public / private primary schools and poverty by statistical region
Both datasets use the same coordinate system now, so we can plot the school locations on top of the planning areas. I will use ggplot2 this time to make a choropleth map of the `welfare_chld` variable and overlay that with the public and private primary school locations.
```{r 03cmap_schools}
ggplot() +
geom_sf(aes(fill = welfare_chld), color = 'black', data = bln) +
geom_sf(aes(color = ownership), size = 1, alpha = 0.75, data = schools) +
scale_fill_binned(type = 'viridis', guide = guide_bins(title = '% Welfare')) +
scale_color_manual(values = c('pub' = '#c767cb', 'priv' = '#cdc566'),
labels = c('public school', 'private school'),
guide = guide_legend(title = '')) +
coord_sf(datum = NA) + # disable graticule
labs(title = "Public / private primary schools and poverty",
subtitle = "Choropleth map of percentage of children whose parents obtain social welfare.\nDots represent primary schools.") +
theme_minimal()
```
From the figure alone, it's probably hard to assess whether there's a pattern in the distribution of private and public schools regarding areas with higher welfare rate in the city. In order to compare the social welfare statistics of regions around private schools with those around public schools, we can join the schools' data with the socioeconomic information of the planning areas they're located in. This can be done with a [spatial join](https://geocompr.robinlovelace.net/spatial-operations.html#spatial-joining) using [`st_join`](https://r-spatial.github.io/sf/reference/st_join.html). By default, this function joins the spatial features of the first argument with features of the second argument **when they intersect** -- in our case this means a school is linked with the planning area it's located in. Note that the order of arguments matters here and that the spatial geometry of the first argument is retained in the resulting dataset.
```{r}
schools_plan <- st_join(schools, bln)
head(schools_plan)
```
We can see that the schools' data was linked with the data from the planning areas. We should also check whether there's a school that was not located in any planning area (this may for example happen when a school is very close to the Berlin-Brandenburg border):
```{r}
sum(is.na(schools_plan$areaid))
```
All schools were linked with their planning area, so we can now compare the percentage of children whose parents obtain social welfare between public and private primary schools:
```{r 04violin1}
ggplot(schools_plan) +
geom_violin(aes(x = ownership, y = welfare_chld), draw_quantiles = c(0.5)) +
geom_jitter(aes(x = ownership, y = welfare_chld), alpha = 0.25) +
scale_x_discrete(labels = c('pub' = 'public primary schools', 'priv' = 'private primary schools')) +
labs(title = 'Percentage of children whose parents obtain social welfare', x = '', y = '% welfare')
```
Our descriptive results indicate that the median percentage of children whose parents obtain social welfare is around six percent higher in the statistical regions around public schools than around private schools: ^[I'm using `st_drop_geometry` here, because otherwise a [spatial aggregation](https://geocompr.robinlovelace.net/spatial-operations.html#spatial-aggr) would be performed which takes much longer to compute and is not necessary here.]
```{r}
st_drop_geometry(schools_plan) %>%
group_by(ownership) %>%
summarise(median_welfare_chld = median(welfare_chld))
```
This is an interesting descriptive result and we may continue with our spatial analysis from here. However, our current approach doesn't consider the catchment area of a school correctly: Children from nearby planning areas will most likely visit a school, but at the moment we only consider the one planning area in which a school is located. As an example, let's zoom to school #388 "Evangelische Schule Berlin Buch" in the north of Berlin. As you can see, only considering the planning area in which this school is located omits the higher welfare rates in nearby areas:
```{r 05catchment, fig.width=9, fig.height=9}
ggplot(bln) +
geom_sf(aes(fill = welfare_chld), color = 'black') +
geom_sf(data = filter(schools, schoolid == 388), size = 3, color = 'red') +
scale_fill_binned(type = 'viridis', guide = guide_bins(title = '% Welfare')) +
coord_sf(datum = st_crs(bln), xlim = c(395e3, 401e3), ylim = c(583e4, 5837e3)) +
labs(title = 'School #388 "Evangelische Schule Berlin Buch" and\nsurrounding planning areas') +
theme_minimal()
```
# Spatial weighting with official school catchment areas
In Berlin, parents can send their children to a primary school that is within the [official school catchment area of their home address *("Grundschuleinzugsbereiche")*](https://www.berlin.de/sen/bildung/schule/bildungswege/grundschule/anmeldung/).^[There's much debate about this and parents can try to register their children in a primary school outside their home catchment area but this comes with juristic obstacles, so we can assume for now that most will stay within their area.] Luckily, there's spatial data for these catchment areas again available in the [Berlin geodata catalog](https://fbinter.stadt-berlin.de/fb/index.jsp?loginkey=zoomStart&mapId=schulen@senstadt&bbox=362719,5798847,423243,5838687). I again converted the obtained data from the Berlin geodata catalog to GeoJSON, which we can load now. I also generate a catchment area ID `catchid` which however has nothing to do with the school ID from `schools` dataset.
```{r}
schoolareas <- read_sf('data/berlin_ezb.geojson') %>%
select(-BSN, -BEREICH) %>%
mutate(catchid = 1:nrow(.), .before = 1)
head(schoolareas)
```
Let's generate a plot that overlays the planning areas, catchment areas and school locations. We can see that the catchment areas differ from the planning areas:
```{r 06catchment_overlay}
plot(bln$geometry)
plot(schoolareas$geometry, col = '#80000055', border = 'white', add = TRUE)
plot(schools$geometry, col = '#0000AA77', cex = 0.5, pch = 19, add = TRUE)
```
The goal is now to calculate the weighted average of the welfare rate for a given school by taking into account all planning areas that the school's catchment area intersects with. The weights will be determined by the intersection area between the catchment area and the planning areas. I will first do this with a single school only to illustrate how it works. This school will be #269 "Müggelheimer Schule" located in the south east of Berlin:
```{r}
(exampleschool <- schools[schools$schoolid == 269,])
```
```{r 07catchmentexample}
plot(bln$geometry)
plot(schoolareas$geometry, col = '#80000055', border = 'white', add = TRUE)
plot(exampleschool$geometry, col = 'red', pch = 19, add = TRUE)
```
First, we need the catchment area of that school. We can again apply `st_join` for this in order to get the catchment area that intersects with the school. Note that the catchment areas should be the first argument in the `st_join` function since we want to retain the catchment areas' geometries in the resulting dataset. We also use an inner join instead of a left join by setting `left = FALSE` so that the result set only contains the single catchment area that intersects with the school.
```{r}
(example_catchment_area <- st_join(schoolareas, exampleschool, left = FALSE))
```
The next step is to get the intersections between the planning areas and catchment areas, i.e. to clip the planning areas according to the school's catchment area. We do this with the help of [`st_intersection`](https://r-spatial.github.io/sf/reference/geos_binary_ops.html), which calculates the intersection between spatial objects. The result is a spatial dataframe of seven planning regions that overlap with the school's catchment area:
```{r, warning=FALSE}
(example_plr <- st_intersection(bln, example_catchment_area))
```
We can put that a little bit into perspective again and display it on the Berlin planning areas map overlayed with the schools' catchment areas. Here we can see that our example school's catchment area mainly intersects with only two planning areas. The other five intersections listed above are only tiny overlaps from surrounding planning areas, as we can also confirm next by computing their surface areas.
```{r 08catchmentexample2}
ggplot() +
geom_sf(color = 'black', data = bln) +
geom_sf(fill = NA, color = 'red', linetype = 'dotted', data = schoolareas) +
geom_sf(aes(fill = welfare_chld), color = 'black', data = example_plr) +
geom_sf(fill = NA, color = 'red', data = example_catchment_area) +
geom_sf(color = 'red', size = 3, data = exampleschool) +
scale_fill_continuous(type = 'viridis', guide = guide_colorbar(title = '% Welfare')) +
coord_sf(datum = NA) +
labs(title = "Berlin statistical regions and school catchment areas",
subtitle = "Highlighted school #270 with surrounding catchment area and planning areas intersection.") +
theme_minimal()
```
All that is left now for our example school is to take the weighted average of the welfare rate. The weights are the area of the planning area intersections so that planning areas with larger overlap in the catchment area have a higher influence on the overall average. The following shows the planning area intersections along with their area as calculated via [`st_area`](https://r-spatial.github.io/sf/reference/geos_measures.html). We can see that the welfare rate of ~5% in Müggelheim will have the largest weight, followed by the ~17% rate in Kietzer Feld/Nachtheide:
```{r}
cbind(example_plr[c('areaname', 'welfare_chld')], area = st_area(example_plr)) %>%
mutate(weight = as.numeric(area / sum(area))) %>%
arrange(desc(weight))
```
We pass these area measurements to `weighted.mean` (stripping the m² unit via `as.numeric` since `weighted.mean` can't handle it) and obtain a weighted average welfare rate of ~7% which is quite a bit higher than the ~4.5% we get when using the former approach (linking the school with its planning area "Müggelheim"):
```{r}
weighted.mean(example_plr$welfare_chld, as.numeric(st_area(example_plr)))
```
```{r}
# former approach: linking the school with its planning area
schools_plan[schools_plan$schoolid == 269, ]$welfare_chld
```
We'll next perform these calculations for all schools. First, we link each school with its catchment area using `st_join` as before:
```{r}
schools_catch <- st_join(schoolareas, schools, left = FALSE)
head(schools_catch)
```
We confirm that there can be several schools in the same catchment area:
```{r 09catchment_nschools}
st_drop_geometry(schools_catch) %>%
count(catchid) %>%
pull(n) %>%
hist(main = 'Number of schools per catchment area', breaks = 1:6 - 0.5, xlab = '')
```
Next we calculate the planning area intersections, their areas and weighted average of the welfare rate for each school's catchment area using `sapply`. We define a function `spat_weighted_mean` for this, which we can later reuse. This computation takes some seconds to complete and in the end adds the weighted average of the welfare rate as `welfare_chld` variable to the schools' catchment area dataset:
```{r, warning=FALSE}
spat_weighted_mean <- function(catch) {
# the catchment area polygon "catch" loses the CRS during sapply -> set it here again
catch <- st_sfc(catch, crs = st_crs(bln))
areas <- st_intersection(bln, catch)
weighted.mean(areas$welfare_chld, as.numeric(st_area(areas)))
}
schools_catch$welfare_chld <- sapply(schools_catch$geometry, spat_weighted_mean)
select(schools_catch, catchid, schoolid, ownership, name, welfare_chld) %>% head()
```
```{r 10catchment_welfare}
plot(distinct(schools_catch['welfare_chld']),
main = 'Weighted average of percentage of children\nwhose parents receive social welfare per school catchment area',
cex.main = 0.75)
```
Note that blank areas in the above figure represent catchment areas in which no primary school was located -- this may be a flaw in the official data (12 such areas in total).
We again compare public and private schools, this time with our revised calculations:
```{r 11violin2}
ggplot(schools_catch) +
geom_violin(aes(x = ownership, y = welfare_chld), draw_quantiles = c(0.5)) +
geom_jitter(aes(x = ownership, y = welfare_chld), alpha = 0.25) +
scale_x_discrete(labels = c('pub' = 'public primary schools', 'priv' = 'private primary schools')) +
labs(title = 'Percentage of children whose parents obtain social welfare', x = '', y = '% welfare')
```
The median percentage of children whose parents obtain social welfare is still higher for public schools, but the difference is now five instead of six percent.
```{r}
st_drop_geometry(schools_catch) %>%
group_by(ownership) %>%
summarise(median_welfare_chld = median(welfare_chld))
```
Our updated approach led to a difference that is only a bit smaller. The difference is not so large because of the very small catchment areas for the many schools in the inner city that result in a weighted average of the welfare rate that is very close to the rate of the schools' planning areas. For other data, where catchment areas are bigger than the statistical regions (like in the example school in the south east of Berlin), you can expect a larger difference between the two approaches.
# Approximating catchment areas as circular regions or Voronoi regions around schools
So far, we've assumed that private primary schools have the same catchment area as their nearby public schools, since there are no official catchment areas for private primary schools and parents can choose more freely which school they send their children to when they prefer a private school. So if we have no spatial data about the catchment areas of private primary schools, what can we do?
One possibility would be to construct a circle around each private school which represents the catchment area for a certain radius. This can be done via [`st_buffer`](https://r-spatial.github.io/sf/reference/geos_unary.html). However, it's hard to justify a certain value for that radius and the radius for such a catchment area should probably vary depending on where the school is located (smaller catchment areas in inner city schools than for schools in the outskirts).
Another approach relies on Voronoi regions. They partition the space between given points so that the Voronoi region around each point covers an area of minimal distance to that origin point. In other words: the Voronoi region around a school is the area in which all households are located that are closest to that school. It is reasonable to assume that most parents choose among the closest private schools to their home. This means approximating the catchment area of private primary schools as Voronoi regions may be a good option, while still using the official public primary school catchment areas only for the public schools.
Voronoi regions can be generated with [`st_voronoi`](https://r-spatial.github.io/sf/reference/geos_unary.html), which accepts the points as `MULTIPOINT` geometry object. The second argument is an envelope polygon for which we'll use the the Berlin borders. The resulting object is a `GEOMETRYCOLLECTION` geometry object which we pass on to [`st_collection_extract`](https://r-spatial.github.io/sf/reference/st_collection_extract.html) and [`st_sfc`](https://r-spatial.github.io/sf/reference/sfc.html) in order to transform this to a *geometry set* object that has the same CRS as our other spatial data (ETRS89).
Let's generate the Voronoi regions around all private primary schools:
```{r}
bln_outline <- st_union(bln$geometry) # Berlin borders
privschools <- filter(schools, ownership == 'priv')
pubschools <- filter(schools, ownership == 'pub')
(priv_voro <- st_coordinates(privschools$geometry) %>%
st_multipoint() %>%
st_voronoi(bln_outline) %>%
st_collection_extract() %>%
st_sfc(crs = st_crs(bln)))
```
We can now plot the generated regions along with the school locations:
```{r 12voro1}
plot(bln_outline)
plot(privschools$geometry, col = 'blue', pch = 19, add = TRUE)
plot(priv_voro, border = 'red', col = NA, add = TRUE)
```
We can see that the Voronoi regions extend beyond the borders of Berlin so we should take the intersection between the Voronoi regions and the Berlin border in order to clip these regions:
```{r 13voro2}
priv_voro <- st_intersection(priv_voro, bln_outline)
plot(bln_outline)
plot(privschools$geometry, col = 'blue', pch = 19, add = TRUE)
plot(priv_voro, border = 'red', col = NA, add = TRUE)
```
Let's overlay the planning areas with the private schools' Voronoi regions to see how they differ.
```{r 13voro3}
plot(bln$geometry)
plot(priv_voro, col = '#80000077', border = 'white', add = TRUE)
plot(privschools$geometry, col = 'blue', pch = 19, cex = 0.5, add = TRUE)
```
Once we have circular regions or Voronoi regions around the private schools, the rest of the calculations are similar to those with the official catchment areas. We link the private schools with their approximated catchment areas and apply the `spat_weighted_mean` function that we've defined before.
Linking the private schools with their Voronoi regions:
```{r}
privschools_voro <- st_as_sf(priv_voro) %>%
mutate(voroid = 1:nrow(.)) %>%
st_join(privschools, left = FALSE) %>%
rename(geometry = x)
head(privschools_voro)
```
Checking that there's really only one private school per Voronoi region:
```{r}
st_drop_geometry(privschools_voro) %>%
count(voroid) %>%
pull(n) %>%
all(. == 1) %>%
stopifnot()
```
Generating the weighted averages and plotting the corresponding choropleth map for the private schools:
```{r 14voro_choro, warning=FALSE}
privschools_voro$welfare_chld <- sapply(privschools_voro$geometry, spat_weighted_mean)
plot(privschools_voro['welfare_chld'],
main = 'Weighted average of percentage of children\nwhose parents receive social welfare per private school catchment area approx. as Voronoi regions',
cex.main = 0.75)
```
Linking the public schools with their catchment areas:
```{r}
pubschools_catch <- st_join(schoolareas, pubschools, left = FALSE)
head(pubschools_catch)
```
Generating the weighted averages and plotting the corresponding choropleth map for the public schools:
```{r 15catchment_pub, warning=FALSE}
pubschools_catch$welfare_chld <- sapply(pubschools_catch$geometry, spat_weighted_mean)
plot(distinct(pubschools_catch['welfare_chld']),
main = 'Weighted average of percentage of children\nwhose parents receive social welfare per public school catchment area',
cex.main = 0.75)
```
Combining the results from the public and private schools:
```{r 16violin3}
pubpriv <- bind_rows(select(pubschools_catch, ownership, welfare_chld),
select(privschools_voro, ownership, welfare_chld)) %>%
st_drop_geometry()
ggplot(pubpriv) +
geom_violin(aes(x = ownership, y = welfare_chld), draw_quantiles = c(0.5)) +
geom_jitter(aes(x = ownership, y = welfare_chld), alpha = 0.25) +
scale_x_discrete(labels = c('pub' = 'public primary schools', 'priv' = 'private primary schools')) +
labs(title = 'Percentage of children whose parents obtain social welfare', x = '', y = '% welfare')
```
```{r}
group_by(pubpriv, ownership) %>%
summarise(median_welfare_chld = median(welfare_chld))
```
# Conclusion
The descriptive results suggest that private schools in Berlin may tend to be located in areas with lower rates of children whose parents obtain social welfare as compared to public schools. Further [spatial analysis](https://mgimond.github.io/Spatial/hypothesis-testing.html) could be done to test this hypothesis.
We have seen how we can calculate a weighted average for some variable of interest for a catchment area around sample points, when this variable of interest was measured for regions that overlap with that catchment area. In the best case scenario, you know the geometry of the catchment areas. Otherwise you may need to approximate them, for example as circular regions around the points or as Voronoi regions. Additionally, you may consider [nearest-feature-joins or travel time isochrones](https://nhsrcommunity.com/blog/using-sf-to-calculate-catchment-areas/). Which option is more appropriate depends on your use-case.
# References
- *Helbig/Konrad/Nikolai 2018*: [Helbig, M., Konrad, M., & Nikolai, R. (2018). Die Schulinfrastruktur in Ostdeutschland: Ein multimedialer Zugang zur Analyse der Veränderungen von Schulstandorten.](https://www.ssoar.info/ssoar/bitstream/handle/document/66713/ssoar-2018-helbig_et_al-Die_Schulinfrastruktur_in_Ostdeutschland_ein.pdf?sequence=1&isAllowed=y)
- *Helbig/Salomo 2021:* [Helbig, M., & Salomon, K. (2021). Eine Stadt-getrennte Welten? Sozialräumliche Ungleichheiten für Kinder in sieben deutschen Großstädten (No. 25). Schriften zu Wirtschaft und Soziales.](https://www.econstor.eu/handle/10419/234476)
- *Pebesma 2018:* [Pebesma, E. J. (2018). Simple features for R: standardized support for spatial vector data. R J., 10(1), 439.](http://pebesma.staff.ifgi.de/RJwrapper.pdf)