
To explore data, you need data in a form you can use, and that means engaging in the necessary evils of data munging: code to turn the data you have into the data you’d like to use. Whether it’s public or commercial data (<<data_commons>>), data from a legacy database, or unruly human-entered data, you will need to transform it into a format you can use and understand.

Your goal is to take data from the source domain (the shape, structure and cognitive model of the raw data) to the target domain (the shape, structure and cognitive model you will work with). It helps to separate that journey into distinct stages:

  • Figure out the data model you’d like to target and build a lightweight class to represent it (sketched after this list).

  • Extract: the syntactic transformation from raw blobs to passive structured records

  • Transform: the semantic transformation of structured data in the source domain to active models in the target domain

  • Normalize:

    • clean up errors and missing fields

    • augment with data from other tables

  • Canonicalize:

    • choose exemplars and mountweazels

    • pull summary statistics

    • sort by the most likely join key

    • pull out your best guess as to a subuniverse (later)

  • Land: the data in its long-term home(s)

(later):

  • fix your idea of subuniverse
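To make that first step concrete, here is a minimal sketch in plain Ruby. The class and field names are hypothetical, loosely modeled on a Wikipedia pageviews record; treat it as an illustration of "lightweight", not a prescription:

[source,ruby]
----
# A lightweight target-domain model: just enough structure to name the
# fields we intend to work with. Field names here are hypothetical.
Pageview = Struct.new(:page_title, :views, :bytes_sent, :hour) do
  # Minimal validation -- enough to catch gross extraction errors early.
  def valid?
    !page_title.to_s.empty? && views.to_i >= 0
  end
end

pv = Pageview.new('Main_Page', 42, 105_000, '2008-12-03T04:00:00Z')
pv.valid?  # => true
----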

DL TODO: fix style (document pig style); align the terminology above with wikipedia/pageviews_extract_a.rb.

Extract, Transform, Canonicalize

Semistructured Data Workflow
  • Raw data, as it comes from the source. This could be unstructured (lines of text) or semistructured (JSON or XML blobs).

  • Source domain models: passive structured records whose shape follows the raw data as closely as is reasonable.

  • Target domain models: active models with the shape, structure and cognitive model you will work with.

  • Transformer: the code that carries source domain models over into target domain models (sketched below).
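As a sketch of how those pieces fit together — assuming a hypothetical tab-separated raw format, and reusing the Pageview class from the earlier sketch — the extractor builds passive source records and the transformer carries them into the target domain:

[source,ruby]
----
# Source domain model: a passive record mirroring the raw line's fields
# one-for-one; every value is still a string, exactly as extracted.
RawPageview = Struct.new(:project, :title, :views, :bytes)

# Extract: purely syntactic. Split the line; interpret nothing.
def extract(line)
  RawPageview.new(*line.chomp.split("\t"))
end

# Transform: semantic. Coerce types and map onto the target model.
def transform(raw)
  Pageview.new(raw.title, Integer(raw.views), Integer(raw.bytes), nil)
end

transform(extract("en\tMain_Page\t42\t105000"))
# => #<struct Pageview page_title="Main_Page", views=42, bytes_sent=105000, hour=nil>
----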

You’ll be tempted to move code too far to the right — to put transformation code into your extractors. Resist the urge. At the beginning of the project you’re thinking about the details of extracting the data, and possibly still puzzling out the shape of the target domain models. Make the source domain models match the raw data as closely as reasonable, doing only the minimum effective work to make the data active.

Separate the model’s properties (fundamental, intrinsic data), its metadata (details of processing), and its derived variables (values computed from the properties).
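One way to keep the three visibly apart — again a sketch with hypothetical fields, not a prescription:

[source,ruby]
----
class Article
  # Properties: fundamental, intrinsic data.
  attr_accessor :title, :text
  # Metadata: details of processing, not of the domain itself.
  attr_accessor :source_file, :extracted_at

  def initialize(title, text, source_file: nil, extracted_at: nil)
    @title        = title
    @text         = text
    @source_file  = source_file
    @extracted_at = extracted_at
  end

  # Derived variable: computed from the properties, cached on first use.
  def word_count
    @word_count ||= text.split.size
  end
end
----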

In an enterprise context, this process is "ETL" — extraction, transformation and loading. With data at scale, rather than centralizing access in a single data store, you’ll more often syndicate data along a documented provenance chain. So we’ll amend this to "Extract, Transform and Land".

Solitary, poor, nasty, brutish, and short

Thomas Hobbes wrote that the natural life of man is "solitary, poor, nasty, brutish, and short" — and so should your data munging scripts be:

  • Solitary: write discrete, single-concern scripts. It’s much easier to validate the data when your transformations are simple to reason about.

  • Poor: spend wisely. It’s especially important to optimize for programmer time, not cluster efficiency: you will probably spend more in opportunity cost to write the code than you ever will running it. Even if it will be a production ETL [1] job, get it running well and then make it run quickly. The hard parts are not where you expect them to be.

  • Nasty: as you’ll see, the real work in data munging is the special cases and domain incompatibilities. Ugly transformations lead to necessarily ugly code, but even so there are ways to handle it. There’s also a certain type of brittleness that is good — a script that quietly introduces corrupt data is far worse than one that fails loudly on input it doesn’t expect.

  • Brutish: be effective, not elegant. Fight the ever-strong temptation to automate the un-automatable. Here are some street-fighting techniques we’ll use below:

    • Hand-curating a 200-row join table is the kind of thing you learn computer programming to avoid, but it’s often the smart move.

    • Faced with a few bad records that don’t fit any particular pattern, you can write a script to recognize and repair them; or, if it’s of modest size, just hand-fix them and save a diff. You’ve documented the change concisely and given future-you a chance to just re-apply the patch.

    • To be fully compliant, most XML libraries take on complexity that inflates the work of handling simple XML structures. It’s often more sensible to treat simple XML as plain text (see the sketch after this list).

  • Short: use a high-level language. You’ll invest much more in using the data than in writing the munging code. Two weeks later, when you discover that 2% of the documents used an alternate text encoding (or whatever card Loki had dealt), you’ll be glad for brief, readable scripts and a rich utility ecosystem.
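For instance, here is a sketch of the plain-text approach to simple XML. It assumes rigidly machine-generated, one-element-per-line input — where a regexp is safe; anything less predictable deserves a real parser:

[source,ruby]
----
# Assuming input like:  <page id="42"><title>Main Page</title></page>
# one record per line, machine-generated, never hand-edited.
PAGE_RE = %r{<page id="(\d+)"><title>([^<]*)</title></page>}

def parse_page(line)
  m = PAGE_RE.match(line) or return nil   # skip lines that don't match
  { id: Integer(m[1]), title: m[2] }
end

parse_page('<page id="42"><title>Main Page</title></page>')
# => {:id=>42, :title=>"Main Page"}
----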

FIXME: the analogy will line up better with the source if I can make the point that 'your data munging scripts are to civilize data from the state of nature'. FIXME: I like the stuff above; need to chop out some of the over-purple prose from the rest of the chapter so the technical parts don’t feel too ranty.

Canonical Data

  • sort the data along its most-likely join field (sometimes, but not always, its primary key). This often enables a merge join (TODO: ref), with tremendous speedup; a sketch follows this list.

  • choose exemplars or add mountweazels. Choose a few familiar records, and put their full contents close at hand to use for testing and provenance. You may also wish to add a mountweazel, a purposefully bogus record. Why? First, so you have something that fully exercises the data pipeline — all the fields are present, text holds non-keyboard and escape characters, and so forth. Second, for production smoke testing: it’s a record you can send through the full data pipeline, including writing into the database, without concern that you will clobber an actual value. Lastly, since exemplars are real records, they may change over time; you can hold your mountweazel’s fields constant. Make sure it’s unmissably bogus and unlikely to collide: "John Doe" is a terrible name for a mountweazel, if there’s any way user-contributed data could find its way into your database. The best-chosen bogus names appeal to the worst parts of a 5th-grader’s sense of humor.

  • sample a coherent subuniverse. Ensure this includes, or go back and add, your exemplar records.
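As a sketch of why that sort pays off — a toy merge join over two streams already sorted on the join key, assuming (for simplicity) that each key appears at most once per side:

[source,ruby]
----
# Toy merge join: both enumerators must already be sorted on the join
# key, and each key appears at most once per side. Because the streams
# are sorted, we advance two cursors in lockstep and never hold more
# than one row from each side in memory.
def merge_join(left, right)
  a, b = left.next, right.next
  # Kernel#loop exits cleanly when either stream's .next runs dry.
  loop do
    case a[0] <=> b[0]
    when -1 then a = left.next    # left key is behind: advance left
    when  1 then b = right.next   # right key is behind: advance right
    else                          # keys match: emit the joined rows
      yield a[1], b[1]
      a, b = left.next, right.next
    end
  end
rescue StopIteration              # an input was empty to begin with
end

pages = [['a', { title: 'a' }], ['b', { title: 'b' }]].each
views = [['b', { views: 42 }], ['c', { views: 7 }]].each
merge_join(pages, views) { |pg, vw| puts pg.merge(vw).inspect }
# {:title=>"b", :views=>42}
----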


1. ETL = Extraction, Transformation and Loading — what you call data munging when it runs often and well enough to deserve an enterprise-y name