Some police forces used alternative grid references for eastings and northings ~1979-1981 and 1986 #101

cmcaine opened this issue Aug 7, 2019 · 22 comments
help wanted Extra attention is needed



cmcaine commented Aug 7, 2019

E.g. Hounslow isn't really in the sea to the west of Glasgow.


The London Boroughs and some other geographical areas did this. If we can find out what system they were using we could fix this.

1986 looks like a CRS issue too, but I don't really know what's going on.

We can also observe a lot of other errors with the early geocoding in this sequence of images:


Source code:

library(dplyr) # %>% and mutate()
library(sf)    # st_as_sf()
library(tmap)  # qtm(), tm_facets(), tmap_animation()

BNG = 27700 # assumed value: EPSG code for the British National Grid

# smaller_s19 is stats19 1979-2004, serious and fatal only, drop NAs on coordinates
year_maps = smaller_s19 %>%
  mutate(year = lubridate::year(date)) %>%
  st_as_sf(coords = c("location_easting_osgr", "location_northing_osgr"), crs = BNG) %>%
  qtm() + tm_facets(along = "year")

# And copy the image files out of /tmp before this finishes ;)
tmap_animation(year_maps, "stats19-osgr-locations-over-time.mpg")

Well found @cmcaine. I recall that, at the request of @mem48, we added a warning saying that locations may not be accurate before 2005. It would be amazing if we could rectify the issues in the code. I think that analysing the crashes with clearly errant points could lead to a solution. One question: do the errors also affect the longitude and latitude columns? I rarely use data before 2005, but it's clearly very important, so I (and I imagine other users of the data) am very grateful to you for raising this issue.

Contributor Author

cmcaine commented Aug 9, 2019

There's no longitude or latitude data at all until 1999, by which time the eastings and northings look accurate anyway.

Some number of these will be transcription errors, but others (London) certainly look like systematic use of an alternative (or truncated) grid reference system, so I think there is a chance of fixing those.

Another challenge is that the data for all accidents has fewer obviously missing areas:


It seems unlikely that these places really had no serious or fatal collisions in a year. Perhaps police forces used a different system for recording more serious crashes in those areas?


mem48 commented Aug 9, 2019

We had this problem with the cyipt project, some of the older data uses less precise grid references and lots of data ended up in the sea.

The British National Grid has not changed since 1936, and I'm not aware of any regional grids in the UK. So I suspect there is some ad hoc truncation, where police have left out the initial few digits because they would always be the same in their area of interest.
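To make the truncation idea concrete, here is a minimal base R sketch of what undoing it could look like if a force really had dropped the leading 100 km digit. `restore_truncated()` and its arguments are hypothetical names introduced for illustration, not part of the stats19 package, and the example coordinates are made up.

```r
# Hypothetical helper: restore eastings/northings recorded without their
# leading 100 km digit, given the 100 km square assumed for the force.
restore_truncated = function(easting, northing, e100km, n100km) {
  # Only values below 100,000 can be missing a leading 100 km digit
  truncated = easting < 1e5 & northing < 1e5
  data.frame(
    easting  = ifelse(truncated, e100km * 1e5 + easting, easting),
    northing = ifelse(truncated, n100km * 1e5 + northing, northing)
  )
}

# Square TQ starts at easting 500,000 and northing 100,000.
# First point is truncated, second is already a full grid reference.
restore_truncated(c(12000, 530000), c(76000, 180000), e100km = 5, n100km = 1)
```

This only helps if the truncation is consistent within a force, which is exactly what would need checking against the data.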


mem48 commented Aug 9, 2019

@cmcaine I've had a deeper dive and this does not seem to be a simple scaling problem. I can't figure out how the coordinates are supposed to map to the BNG. I suggest making an enquiry with the DfT; they may have some historical context that we are missing.

Contributor Author

cmcaine commented Aug 9, 2019

Thanks for looking at it, Malcolm.

Sent to DfT:


As described and illustrated in this GitHub issue [1], I am trying to work out how some of the eastings and northings values provided for collisions in the 1979-2004 STATS19 dataset should be interpreted.

In particular, many locations given for collisions between 1979 and 1982 and in 1986 do not appear to be valid British National Grid references.

From the maps I have generated, it looks like the metropolitan police was using a different or truncated reference system from 1979 to 1981.

I would appreciate your help in identifying what scheme was used and any other detailed information about how to interpret the eastings and northings for these early years of the dataset.

I also note that some areas, including much of Scotland, Wales and the North West appear to have reported "Slight" accidents/collisions to STATS19 but not "Serious" or "Fatal" collisions. This is also illustrated in the linked issue.

I would appreciate your help in identifying: 1) if this omission is documented; 2) if data on serious and fatal collisions in these locations is available elsewhere.

Colin Caine
PhD Student, School of Geography, University of Leeds


Just picking up the thread on this after getting back from holiday yesterday. I suspect there are some systematic errors that can be fixed, and likely some random errors that cannot. I think asking the DfT is a good plan (have you heard anything @cmcaine? can follow up if not) and, if they don't know either, would suggest a collaborative project aimed at doing an even deeper dive than @mem48 did to identify dodgy coordinates (please share analysis code you used for this if you have it).

Quantifying (e.g. range, standard deviation) and plotting differences between expected (based on recent data) and recorded Easting and Northing distributions for each force/year combination in which dodgy coordinates are found should help at least identify the region/years in which there is a pattern to the error.
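A rough base R sketch of that comparison, assuming a data frame with `police_force`, `year`, `location_easting_osgr` and `location_northing_osgr` columns. The toy values and the `coord_shift()` helper are invented for illustration, and the median is swapped in as one simple summary; range or standard deviation could be computed the same way.

```r
# Toy data standing in for the accidents table (values are made up)
crashes = data.frame(
  police_force = c("A", "A", "A", "A", "B", "B", "B", "B"),
  year = c(1979, 1979, 2010, 2010, 1979, 1979, 2010, 2010),
  location_easting_osgr = c(12000, 18000, 512000, 518000,
                            430000, 436000, 431000, 435000),
  location_northing_osgr = c(76000, 80000, 176000, 180000,
                             433000, 435000, 433000, 435000)
)

# Hypothetical helper: median easting/northing per force and year, compared
# with the same force's median in reference (recent, trusted) years.
coord_shift = function(d, reference_years) {
  f = cbind(location_easting_osgr, location_northing_osgr) ~ police_force
  med = aggregate(update(f, . ~ . + year), data = d, FUN = median)
  ref = aggregate(f, data = d[d$year %in% reference_years, ], FUN = median)
  out = merge(med, ref, by = "police_force", suffixes = c("", "_ref"))
  out$easting_shift = out$location_easting_osgr - out$location_easting_osgr_ref
  out$northing_shift = out$location_northing_osgr - out$location_northing_osgr_ref
  out[order(out$police_force, out$year), ]
}

shifts = coord_shift(crashes, reference_years = 2010)
shifts[, c("police_force", "year", "easting_shift", "northing_shift")]
```

Force/year combinations with a large, consistent shift would be candidates for a systematic error; a large spread with no consistent shift would point towards random errors.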

Contributor Author

cmcaine commented Aug 17, 2019 via email


mem48 commented Aug 19, 2019

I couldn't find any pattern in the London data. It wasn't even roughly in the shape of London, so I think we are going to need expert help.

Contributor Author

cmcaine commented Aug 19, 2019

I think it should be discoverable, though. 1981 was only 38 years ago. I'm sure the Met Police or MOPAC could reach out to some retired officers for us if they felt like it.

I've sent a similar email to the ONS as well and attached a sample CSV of the eastings and northings.

I attach here a zip of a CSV of all of the eastings and northings for 1979-1981 in case anyone wants to get the data without using R (mostly for the convenience of our external friends). Each observation includes the easting and northing, local authority district name, road class and road number.

The exact text of the email sent to the ONS is:

To: [email protected]
Subject: Help interpreting unusual grid references in DfT STATS19 data


As described and illustrated in this GitHub issue [1], I am trying to work out how some of the eastings and northings values provided for collisions in the 1979-2004 STATS19 dataset should be interpreted.

In particular, many locations given for collisions between 1979 and 1982 and in 1986 do not appear to be valid British National Grid references.

From the maps I have generated, it looks like the metropolitan police was using a different or truncated reference system from 1979 to 1981.

I would appreciate your help in identifying what scheme was used and any other detailed information about how to interpret the eastings and northings for these early years of the dataset.

I attach a zipped CSV sample of 50k (out of ~750k) observations from the stats19 dataset 1979-1981 for your convenience. A zipped csv containing all the rows is linked from the github issue. Each observation includes the easting and northing, local authority district name, road class and road number.

Colin Caine
PhD Student, School of Geography, University of Leeds

Contributor Author

cmcaine commented Aug 19, 2019

Perhaps the coordinates just assume that they're on the OS map for their particular area of London.

If there were enough different maps for different areas of London, then the plotted points would not form a scaled but still recognisable outline of London.


mem48 commented Aug 20, 2019

That is possible. There may also have been a conversion error that scrambled the data. For example, BNG coordinates can be stored like this: TQ1234. If these had been converted to numbers incorrectly, they may have become garbled.
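For reference, a minimal base R sketch of converting a letter-prefixed grid reference such as TQ1234 into full numeric eastings and northings, using the standard two-letter 100 km square scheme (the letter I is skipped). `os_gridref_to_en()` is a hypothetical helper written for this illustration, not a stats19 function.

```r
# Hypothetical helper: letter-prefixed OS grid reference -> easting/northing
# (south-west corner of the referenced square, in metres)
os_gridref_to_en = function(ref) {
  ref = toupper(gsub(" ", "", ref))
  stopifnot(grepl("^[A-HJ-Z]{2}[0-9]*$", ref))
  digits = substr(ref, 3, nchar(ref))
  stopifnot(nchar(digits) %% 2 == 0, nchar(digits) <= 10)

  # Letter positions with A = 0, skipping I, which the grid does not use
  l1 = utf8ToInt(substr(ref, 1, 1)) - utf8ToInt("A")
  l2 = utf8ToInt(substr(ref, 2, 2)) - utf8ToInt("A")
  if (l1 > 7) l1 = l1 - 1
  if (l2 > 7) l2 = l2 - 1

  # Index of the 100 km square east and north of the grid origin
  e100km = ((l1 - 2) %% 5) * 5 + (l2 %% 5)
  n100km = (19 - (l1 %/% 5) * 5) - (l2 %/% 5)

  # Split the digits in half and right-pad each half to metres (5 digits)
  half = nchar(digits) / 2
  pad = function(x) as.numeric(substr(paste0(x, "00000"), 1, 5))
  c(easting  = e100km * 1e5 + pad(substr(digits, 1, half)),
    northing = n100km * 1e5 + pad(substr(digits, half + 1, 2 * half)))
}

os_gridref_to_en("TQ1234") # easting 512000, northing 134000
```

A conversion that dropped the letters and kept only the digits would place the same point at easting 12000, northing 34000, hundreds of kilometres from London, which is the kind of garbling described above.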


mem48 commented Aug 20, 2019

I plotted all the crashes on the A4 in 1979.


There are some random points, but there is clearly a road across the map. The interesting bit is the bottom right, which suggests more than one coordinate system is in use.

@Robinlovelace Robinlovelace added the help wanted Extra attention is needed label Dec 5, 2019

layik commented Jul 1, 2020

Just catching up with this. I think one technical issue is that `format_sf` also fails on the output of `get_stats19` for those years. I can take this to a new ticket if I am right:

Reprex to show that all London accidents return empty using `format_sf`:
dd = "~/code/saferactive/ignored/"
acc7904 = stats19::get_stats19(1979, data_dir = dd)
#> No files of that type found for that year.
#> This will download 240 MB+ (1.8 GB unzipped).
#> Coordinates and other variables may be unreliable in these datasets.
#> See and
#> Files identified:
#> Data already exists in data_dir, not downloading
#> Data saved at ~/code/saferactive/ignored//Stats19-Data1979-2004/Vehicles7904.csv~/code/saferactive/ignored//Stats19-Data1979-2004/Road-Accident-Safety-Data-Guide-1979-2004.xls~/code/saferactive/ignored//Stats19-Data1979-2004/Casualty7904.csv~/code/saferactive/ignored//Stats19-Data1979-2004/Accidents7904.csv
#> No files of that type found for that year.
#> This will download 240 MB+ (1.8 GB unzipped).
#> Coordinates and other variables may be unreliable in these datasets.
#> See and
#> Reading in:
#> /home/layik/code/saferactive/ignored//Stats19-Data1979-2004/Accidents7904.csv
#> date and time columns present, creating formatted datetime column
# acc7904 = stats19::format_sf(acc7904, lonlat = TRUE)
l = acc7904[acc7904$local_authority_district == "London", ]
#> [1] 638746
l = stats19::format_sf(l, lonlat = TRUE)
#> 638746 rows removed with no coordinates
#> Warning in min(cc[[1]], na.rm = TRUE): no non-missing arguments to min;
#> returning Inf
#> Warning in min(cc[[2]], na.rm = TRUE): no non-missing arguments to min;
#> returning Inf
#> Warning in max(cc[[1]], na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
#> Warning in max(cc[[2]], na.rm = TRUE): no non-missing arguments to max;
#> returning -Inf
nrow(l) == 0 # TRUE
#> [1] TRUE

Created on 2020-07-01 by the reprex package (v0.3.0)

In terms of a systematic error or help with this ticket, I will also send an email to our tech contacts in DfT and cc @Robinlovelace and invite them to contribute here if possible.


layik commented Jul 1, 2020

I sent an email to our tech contacts in DfT too @cmcaine and @mem48. Will update this ticket if I dig anything out. Great data analysis/insights.


wengraf commented Nov 2, 2020

Just come to this - just terrible geo-coding, and no error checking at the time, not an alternative CRS. Unless you can match to main road name and work from that, just learn to live with it. I'd be wary of the idea of "correction" too....


I think different conventions were used in different forces. Confident there are ways to improve on the assumption of 'bog standard' 27700 (e.g. by dividing coords by 10) for some places, but not a priority!
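One way to test a convention like that is to try a few candidate divisors and see which one puts the most points inside the area they should be in. A minimal base R sketch follows; the bounding box is only an approximate extent for Greater London, and `share_inside()`, `compare_divisors()` and the sample coordinates are assumptions made for illustration.

```r
# Approximate Greater London extent in EPSG:27700 (assumed, for illustration)
london_bbox = c(xmin = 503000, xmax = 562000, ymin = 155000, ymax = 201000)

# Share of points inside the bounding box after dividing by `divisor`
share_inside = function(easting, northing, bbox, divisor = 1) {
  e = easting / divisor
  n = northing / divisor
  inside = e >= bbox[["xmin"]] & e <= bbox[["xmax"]] &
    n >= bbox[["ymin"]] & n <= bbox[["ymax"]]
  mean(inside, na.rm = TRUE)
}

# Compare several candidate divisors at once
compare_divisors = function(easting, northing, bbox, divisors = c(1, 10, 100)) {
  shares = vapply(divisors,
                  function(d) share_inside(easting, northing, bbox, d),
                  numeric(1))
  names(shares) = divisors
  shares
}

# Made-up coordinates recorded with one digit too many
compare_divisors(c(5130000, 5300000), c(1750000, 1800000), london_bbox)
```

A divisor that scores close to 1 for a given force and year would be worth a closer look; low scores for every divisor would suggest the problem is not a simple rescaling.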


wengraf commented Nov 3, 2020

A lot would have been found on an old A-Z, then roughly guessed on a paper Landranger, with only the vaguest idea about eastings and northings. Stats19 has duff fields, at points in time, that’s something people have to just come to terms with.


wengraf commented Nov 3, 2020

Some will have been filled with meaningless numbers, like 0,0, just so it passed the check for a filled in field.
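Records like that can at least be flagged before mapping. A minimal base R sketch, assuming the only checks wanted are missing values, 0,0 placeholders, and points outside the 700 km by 1300 km extent of the British National Grid; `flag_suspect_coords()` is a hypothetical helper, not a stats19 function.

```r
# Hypothetical helper: label each coordinate pair with a simple quality flag
flag_suspect_coords = function(easting, northing) {
  missing = is.na(easting) | is.na(northing)
  placeholder = !missing & easting == 0 & northing == 0
  outside = !missing & !placeholder &
    (easting < 0 | easting > 700000 | northing < 0 | northing > 1300000)
  ifelse(missing, "missing",
         ifelse(placeholder, "placeholder",
                ifelse(outside, "outside_grid", "plausible")))
}

flag_suspect_coords(
  easting  = c(NA, 0, 198460, 900000),
  northing = c(NA, 0, 894000, 100000)
)
# "missing" "placeholder" "plausible" "outside_grid"
```

"Plausible" here only means the point falls within the grid extent; it says nothing about whether the point is on land or in the right police force area.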


Good point Ivo.


layik commented Dec 3, 2020

Can we also close this? As we cannot offer any useful solutions to the issue. Use of road names etc are all outside the main issue. I say we close it.


Robinlovelace commented Dec 3, 2020

I think we can close this. We've raised the issue and even give users a message telling them to watch out. Good suggestion, thanks @layik.

stats19::get_stats19(year = 1979)
#> No files of that type found for that year.
#> This will download 240 MB+ (1.8 GB unzipped).
#> Coordinates and other variables may be unreliable in these datasets.
#> See and
#> Files identified:
#> Data already exists in data_dir, not downloading
#> Data saved at ~/stats19-data/Stats19-Data1979-2004/Vehicles7904.csv~/stats19-data/Stats19-Data1979-2004/Road-Accident-Safety-Data-Guide-1979-2004.xls~/stats19-data/Stats19-Data1979-2004/Casualty7904.csv~/stats19-data/Stats19-Data1979-2004/Accidents7904.csv
#> No files of that type found for that year.
#> This will download 240 MB+ (1.8 GB unzipped).
#> Coordinates and other variables may be unreliable in these datasets.
#> See and
#> Reading in:
#> /home/robin/stats19-data/Stats19-Data1979-2004/Accidents7904.csv
#> date and time columns present, creating formatted datetime column
#> # A tibble: 6,224,198 x 33
#>    accident_index location_eastin… location_northi… longitude latitude
#>    <chr>                     <int>            <int>     <dbl>    <dbl>
#>  1 197901A11AD14                NA               NA        NA       NA
#>  2 197901A1BAW34            198460           894000        NA       NA
#>  3 197901A1BFD77            406380           307000        NA       NA
#>  4 197901A1BGC20            281680           440000        NA       NA
#>  5 197901A1BGF95            153960           795000        NA       NA
#>  6 197901A1CBC96            300370           146000        NA       NA
#>  7 197901A1DAK71            143370           951000        NA       NA
#>  8 197901A1DAP95            471960           845000        NA       NA
#>  9 197901A1EAC32            323880           632000        NA       NA
#> 10 197901A1FBK75            136380           245000        NA       NA
#> # … with 6,224,188 more rows, and 28 more variables: police_force <chr>,
#> #   accident_severity <chr>, number_of_vehicles <int>,
#> #   number_of_casualties <int>, date <date>, day_of_week <chr>, time <chr>,
#> #   local_authority_district <chr>, local_authority_highway <chr>,
#> #   first_road_class <chr>, first_road_number <int>, road_type <chr>,
#> #   speed_limit <int>, junction_detail <chr>, junction_control <chr>,
#> #   second_road_class <chr>, second_road_number <int>,
#> #   pedestrian_crossing_human_control <chr>,
#> #   pedestrian_crossing_physical_facilities <chr>, light_conditions <chr>,
#> #   weather_conditions <chr>, road_surface_conditions <chr>,
#> #   special_conditions_at_site <chr>, carriageway_hazards <chr>,
#> #   urban_or_rural_area <chr>,
#> #   did_police_officer_attend_scene_of_accident <int>,
#> #   lsoa_of_accident_location <chr>, datetime <dttm>

Created on 2020-12-03 by the reprex package (v0.3.0)


wengraf commented Nov 6, 2023

I've looked back at my earlier comments, and perhaps age and fatherhood have mellowed me since. Is there any mileage to be had with reverse geocoding and LA polygon/road/secondary road/junction type etc.? Perhaps it is an interesting undergrad or MSc project?

@Robinlovelace Robinlovelace reopened this Nov 6, 2023

5 participants