First exploration Pt I
Simple Transform: Chimpanzee and Elephant are hired to translate the works of Shakespeare into every language; you’ll take over the task of translating text to Pig Latin. This is an "embarrassingly parallel" problem, so we can learn the mechanics of launching a job and gain a coarse understanding of the HDFS without having to think too hard.
Chimpanzee and Elephant start a business
Pig Latin translation
Your first job: test at commandline
Run it on cluster
Input Splits
Why Hadoop I: Simple Parallelism
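To make the "embarrassingly parallel" point concrete, here is a minimal sketch of a Pig Latin translator written as a per-line transform. It is an illustration in Python, not the book's own script; the function names and the simplified rule (move the leading consonants to the end and add "ay", or add "way" to a word that starts with a vowel) are assumptions made for this example.

```python
import re

VOWELS = "aeiouAEIOU"
WORD_RE = re.compile(r"[A-Za-z]+")

def pig_latin_word(word):
    """Translate one alphabetic word with a simplified Pig Latin rule."""
    if word[0] in VOWELS:
        return word + "way"
    # Move the leading consonant cluster to the end, then append "ay".
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[i:] + word[:i] + "ay"
    return word + "ay"  # word has no vowels at all

def pig_latin_line(line):
    """Translate every word in a line, leaving punctuation and spacing alone."""
    return WORD_RE.sub(lambda m: pig_latin_word(m.group(0)), line)

def translate(lines):
    """Translate an iterable of lines; each line is handled independently."""
    for line in lines:
        yield pig_latin_line(line)
```

Because each line is translated without reference to any other line, any number of workers can each take a slice of the input and never need to communicate. Feeding `translate` the lines of standard input would turn it into a filter that can be tested at the command line before running it on a cluster.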
Transform-Pivot Job
Elves pt1
Simple Join
Elves pt2
Partition key + sort key
First Exploration: Regional Flavor pt II
articles → wordbags
wordbag+geolocation join (wukong)
wordbag+geolocation join (pig)
statistics on corpus
wordbag for each geotiles
PMI for each geotile
MI for geotile
The Toolset
toolset overview
pig vs hive vs impala
hbase & elasticsearch (not accumulo or cassandra)
launching jobs
seeing the data
seeing the logs
simple debugging
globbing, and caveat about shell vs. hdfs globs
overview of wukong
installing it (pointer to internet)
classes you inherit from
options, launching
overview of pig
options, launching
Filesystem Mojo
wu-rm -r
wu-rm -r --skip_trash
filenames, wu style
s3n, s3hdfs, hdfs, file (note: 'hdfs:///~' should translate to 'hdfs:///.')
(the default time-based one in http://docs.oracle.com/javase/6/docs/api/java/util/UUID.html)
sugared jobs (wu-identity, wu-grep, wu-wc, wu-bzip, wu-gzip, wu-snappify, wu-digest (md5/sha1/etc))
Event Streams
Parsing logs and using regular expressions
Histograms and time series of pageviews
Geolocate visitors based on IP
Sessionizing a log
(Ab)Using Hadoop to stress-test your web server
(DL paste list here)
(see pagerank in section on graphs)
Text Processing: We’ll show how to combine powerful existing libraries with Hadoop to do effective text handling and Natural Language Processing:
grep’ing etc for simple matches
wordbags using Lucene
Indexing documents
Pointwise Mutual Information
Minhashing to combat a massive feature space
How to cheat with Bloom filters
K-means Clustering (mini-batch)
(?maybe?) TF-IDF
(?maybe?) Document clustering with SVD
(?maybe?) SVD as Principal Component Analysis
(?maybe?) Topic extraction using (to be determined)
Averages, Percentiles, and Normalization
sum, average, standard deviation, etc (airline_flights)
Percentiles / Median
exact percentiles / median
approximate percentiles / median
fit a curve to the CDF
construct a histogram (tie back to server logs)
"Average value frequency"
Sampling responsibly: it’s harder and more important than you think
Statistical aggregates and the danger of large numbers
normalize data by mapping to percentile, by mapping to Z-score
consistent sampling
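One common reading of "consistent sampling" is to decide whether to keep a record by hashing its key rather than by drawing a random number, so that the same key is always kept or always dropped. The sketch below assumes that reading; the function names and the choice of MD5 as the hash are illustrative assumptions, not taken from the book.

```python
import hashlib

def sample_bucket(key):
    """Map a record key to a stable integer in the range [0, 2**64)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def keep_record(key, fraction):
    """Return True if the record with this key belongs to the sample.

    The decision depends only on the key and the fraction, so every
    process makes the same choice for the same key.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return sample_bucket(key) < int(fraction * 2**64)
```

Two properties follow from this design: a key kept at a 10% rate is also kept at any higher rate, and two datasets sampled on the same key still join correctly because both keep exactly the same keys.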
Time Series
Anomaly detection
Wikipedia Pageviews
windowing and rolling statistics
(?maybe?) correlation of joint timeseries
(?even mayber?) similar Wikipedia pages based on pageview time series
Spatial join (find all UFO sightings near Airports)
mechanics of handling geo data
Statistics on grid cells
quadkeys and grid coordinate system
(map Wikipedia)
k-means clustering to produce readable summaries
partial quad keys for "area" data
voronoi cells to do "nearby"-ness
(use weather station locations to calculate voronoi polygons)
(cells that have a piece of border, or the largest grid cell that has no border on it)
Using polymaps to see results
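As a sketch of the "quadkeys and grid coordinate system" item above: the widely used web-map tile scheme divides the world into a 2^zoom by 2^zoom grid and names each tile by interleaving the bits of its column and row into a base-4 string. The code below assumes that scheme (Web Mercator projection, quadkey digits as in the Bing Maps tile system); the function names are made up for this example.

```python
import math

MAX_LAT = 85.05112878  # latitude limit of the Web Mercator projection

def lnglat_to_tile(lng, lat, zoom):
    """Return the (column, row) of the grid cell containing a point."""
    lat = max(min(lat, MAX_LAT), -MAX_LAT)
    n = 1 << zoom  # number of tiles along each axis
    x = (lng + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + sin_lat) / (1 - sin_lat)) / (4 * math.pi)
    # Clamp so points on the far edge stay inside the grid.
    tile_x = min(n - 1, max(0, int(x * n)))
    tile_y = min(n - 1, max(0, int(y * n)))
    return tile_x, tile_y

def tile_to_quadkey(tile_x, tile_y, zoom):
    """Interleave the bits of column and row into a base-4 string."""
    digits = []
    for i in range(zoom, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)

def lnglat_to_quadkey(lng, lat, zoom):
    """Quadkey of the tile containing a longitude/latitude point."""
    return tile_to_quadkey(*lnglat_to_tile(lng, lat, zoom), zoom)
```

The useful property for "partial quad keys" is that a tile's quadkey is a prefix of the quadkeys of every tile inside it, so truncating a key moves to a coarser cell and sorting by key keeps nearby cells close together.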
Pointwise mutual information
total sort
ruby -ne
grep, cut, seq, (reference back to
) -
wc, sha1sum, md5sum, nl
wu-box, head, tail, less, split
uniq, sort, join,
sort| uniq -c
bzip2, gzcat
commandline workflow tips
> /dev/null 2>&1
loops (see if they can be kept agnostic between zsh and bash)
nohup, disown, bg and
advanced hadoop filesystem (chmod, setrep, fsck)
Data munging (Semi-structured data): The dirty art of data munging. It’s a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We’ll show you street-fighting tactics that lessen the time and pain. Along the way, we’ll prepare the datasets to be used throughout the book.
Wikipedia Articles: Every English-language article (12 million) from Wikipedia.
Wikipedia Pageviews: Hour-by-hour counts of pageviews for every Wikipedia article since 2007.
US Commercial Airline Flights: every commercial airline flight since 1987
Hourly Weather Data: a century of weather reports, with hourly global coverage since the 1950s.
"Star Wars Kid" weblogs: a large collection of Apache webserver logs from a popular internet site (Andy Baio’s waxy.org).
Interlude I: Data Models, Data Formats, Data Management:
How to design your data models
How to serialize their contents (orig, scratch, prod)
How to organize your scripts and your data
Graph — some better-motivated subset of:
Adjacency List / Edge List conversion
Undirecting a graph, Min-degree undirected graph
Breadth-First Search
subuniverse extraction
(?maybe?) Pagerank on server logs?
(?maybe?) identify strong links
Minimum Spanning Tree
clustering coefficient
Community Extraction: Use the page-to-page links in Wikipedia to identify similar documents
Pagerank (centrality): Reconstruct pageview paths from web logs, and use them to identify important pages
Machine Learning without Grad School
weather & flight delays for prediction
Naive Bayes
Logistic Regression ("SGD")
Random Forest
(?maybe?) Collaborative Filtering
(?or maybe?) SVD on documents (eg authorship)
where to go from here
don’t get fancy
better features
unreasonable effectiveness
partition the data, recombine the models
pointers for the person who is going to get fancy anyway
Interlude II: Best Practices and Pedantic Points of Style
Pedantic Points of Style
Best Practices
How to Think: there are several design patterns for how to pivot your data, like Message Passing (objects send records to meet together); Set Operations (group, distinct, union, etc); Graph Operations (breadth-first search). Taken as a whole, they’re equivalent; with some experience under your belt it’s worth learning how to fluidly shift among these different models.
Why Hadoop
robots are cheap, people are important
Hadoop Native Java API
Advanced Pig
Advanced operators:
map-side join, merge join, skew joins
Basic UDF
why algebraic UDFs are awesome and how to be algebraic
Custom Loaders
Wonderdog: a LoadFunc / StoreFunc for elasticsearch
Performance efficiency and tunables
Data Modeling for HBase-style Database
Hadoop Internals
What happens when a job is launched
A shallow dive into the HDFS
Hadoop Tuning
Tuning for the Wise and Lazy
Tuning for the Brave and Foolish
The USE Method for understanding performance and diagnosing problems
Overview of Datasets and Scripts
Wikipedia (corpus, pagelinks, pageviews, dbpedia, geolocations)
Airline Flights
UFO Sightings
Global Hourly Weather
Waxy.org "Star Wars Kid" Weblogs
Regular Expressions
Sizes of the Universe
Hadoop Tuning & Configuration Variables