Data upload + massage #4

Closed
joekarl opened this issue Dec 23, 2014 · 11 comments

@joekarl
Contributor

joekarl commented Dec 23, 2014

** more to come **

@mkchandler
Member

Any ideas on best language to use for this? I would think we would need one with some good libraries for converting to geojson and parsing kml.

The source files on data.okc.gov are zipped kml files, so I assume the flow will be something like:

kmz --> kml --> geojson

At some point in the process, we will need to have a "cleaner" that goes through and modifies the name of certain polygons and markers (terminology correct?). Some initial transformations that I see based on the data:

  1. Convert names to proper case (right now all uppercase)
  2. Convert "ES" to "Elementary School", "MS" to "Middle School", etc. (this is how they look to be named in the data set, but we will want a more user friendly display)
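The two transformations above could be sketched roughly as follows. This is only an illustration, not the project's cleaner: the function name `clean_name` and the sample school names are hypothetical, and the abbreviation table holds only the two expansions mentioned in this comment.

```python
# Abbreviations named in the thread; others could be added as they turn up.
ABBREVIATIONS = {
    "ES": "Elementary School",
    "MS": "Middle School",
}


def clean_name(raw_name):
    """Convert an all-uppercase name to proper case and expand abbreviations."""
    words = []
    for word in raw_name.split():
        if word in ABBREVIATIONS:
            # Whole-word match only, so "MS" inside a longer word is untouched.
            words.append(ABBREVIATIONS[word])
        else:
            words.append(word.capitalize())
    return " ".join(words)


# Hypothetical input, not taken from the real data set.
print(clean_name("ADAMS ES"))
```

Matching on whole words keeps the expansion from firing inside longer names, though names with apostrophes or hyphens would likely need extra handling.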

In the future, more items could be added to this process to pull in other data about schools, school districts, or whatever points of interest we see fit.

cc: @joekarl @DevinClark @jagthedrummer @jvrousseau @makenova

@joekarl
Contributor Author

joekarl commented Dec 24, 2014

So kmz == zip file and inside is the kml file so no worries on that.
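Since a kmz is just a zip archive, pulling the kml out needs nothing beyond the standard library. A minimal sketch, assuming the archive contains at least one `.kml` entry encoded as UTF-8 (the helper name `read_kml_from_kmz` is made up for illustration):

```python
import io
import zipfile


def read_kml_from_kmz(kmz_bytes):
    """Return the text of the first .kml entry found in a KMZ archive."""
    with zipfile.ZipFile(io.BytesIO(kmz_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".kml"):
                # Assumes UTF-8; a real file may declare another encoding.
                return archive.read(name).decode("utf-8")
    raise ValueError("no .kml entry found in archive")
```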

A couple of options for consideration:

  • Grab the kmz file of the data -> pull out kml -> Use ogr2ogr to convert the kml to geojson -> load up the geojson in memory -> clean it -> dump to final geojson -> upload
  • Grab the CSV file of the data -> run through every line cleaning as we go -> gen geojson -> dump to disk -> upload

Option 1 requires extra moving parts (notably ogr2ogr), while option 2 looks like it can deal with streaming data better (i.e. larger data).
Both can be put into a single script (i.e. a single-file ruby / python / node / whatever script).
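For a rough idea of what option 2 might look like as a single script, here is a sketch that reads the CSV row by row and builds GeoJSON point features. The column names (`name`, `lat`, `lon`) are assumptions, not the actual headers from data.okc.gov, and this only covers point data such as schools, not district polygons.

```python
import csv
import io
import json


def csv_to_geojson(csv_file, name_field="name", lat_field="lat", lon_field="lon"):
    """Build a GeoJSON FeatureCollection of points from CSV rows."""
    features = []
    # DictReader yields one row at a time, so the input is never fully loaded.
    for row in csv.DictReader(csv_file):
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "Point",
                # GeoJSON orders coordinates as [longitude, latitude].
                "coordinates": [float(row[lon_field]), float(row[lat_field])],
            },
            "properties": {"name": row[name_field].strip()},
        })
    return {"type": "FeatureCollection", "features": features}


# Hypothetical sample row, for illustration only.
sample = io.StringIO("name,lat,lon\nADAMS ES,35.5,-97.5\n")
print(json.dumps(csv_to_geojson(sample), indent=2))
```

The input is consumed one row at a time, but the features are still collected in a list before being dumped; truly large data would need the output written incrementally as well.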

@joekarl
Contributor Author

joekarl commented Dec 24, 2014

I say that and there's probably a good way to do streaming handling of kml as well, just easier to write a CSV parser than a kml parser.

@joekarl
Contributor Author

joekarl commented Dec 24, 2014

@adamveld12 you basically have a good chunk of the CSV parsing done right? (see #1)

@jvrousseau

KML to geojson converters are out there (https://github.com/mapbox/togeojson, for example), but I know that we would still have to do a good amount of cleanup on the generated geojson. I think CSV parsing would give us a bit more flexibility to massage the data during the conversion rather than after.

@adamveld12

@joekarl Yeah, I put the parser code in a gist and you can get it here.

@joekarl
Contributor Author

joekarl commented Dec 24, 2014

@adamveld12 nice thanks

@joekarl
Contributor Author

joekarl commented Dec 29, 2014

So the massage part of the data looks pretty straightforward: just convert the CSV to geojson and (maybe) convert the coordinate system if needed.
Next question is where do we want to store said data?
The options we'd been talking about were putting the data into GH pages for serving or putting the data onto S3.

GH pages - Will allow us free hosting of the data, but will make updating the data a pain (either manually updating the repo, or scripts to update the repo programmatically, neither one fun)

S3 - Have to pay for S3 (though this is super super minuscule cost), but can update data in place programmatically

Both are ultimately accessible via URL so frontend won't care where the data actually lives (just needs a URL)

Just me, but I would prefer S3 as I trust their hosting a bit more than GH pages. Also, can you do SSL for custom domains with GH pages? //cc @gorsuch

@gorsuch

gorsuch commented Dec 30, 2014

> Also can you do SSL for custom domains with GH pages?

Naw. Unfortunately SSL on GitHub pages can only be accomplished via *.github.io domains.

If you want to use SSL + custom domains on S3, I believe you'd have to bring cloudfront into the mix w/ an SNI cert. That's pretty straightforward.

@joekarl
Contributor Author

joekarl commented Dec 31, 2014

Initial version of data ingest is working (see merged PR #7). It basically takes the CSV input, cleans up the school names, and dumps out geojson for the schools and school districts.

There are new issues for cleaning up the data even more to make it a lot smaller (see #8 and #9) as well as adding in school to school district mapping (#10).

@mkchandler
Member

Closing this issue since this part is complete
