Skip to content

Latest commit

 

History

History
229 lines (201 loc) · 8.34 KB

00b-topics.asciidoc

File metadata and controls

229 lines (201 loc) · 8.34 KB

Topics

  1. First exploration Pt I

    • motivation

    • walkthrough

    • reflection

  2. Simple Transform: Chimpanzee and Elephant are hired to translate the works of Shakespeare to every language; you’ll take over the task of translating text to Pig Latin. This is an "embarrasingly parallel" problem, so we can learn the mechanics of launching a job and a coarse understanding of the HDFS without having to think too hard.

    • Chimpanzee and Elephant start a business

    • Pig Latin translation

    • Your first job: test at commandline

    • Run it on cluster

    • Input Splits

    • Why Hadoop I: Simple Parallelism

  3. Transform-Pivot Job

    • Locality

    • Elves pt1

    • Simple Join

    • Elves pt2

    • Partition key + sort key

  4. First Exploration: Regional Flavor pt II

    • articles → wordbags

    • wordbag+geolocation join (wukong)

    • wordbag+geolocation join (pig)

    • statistics on corpus

    • wordbag for each geotiles

    • PMI for each geotile

    • MI for geotile

    • (visualize)

  5. The Toolset

    • toolset overview

    • pig vs hive vs impala

    • hbase & elasticsearch (not accumulo or cassandra)

    • launching jobs

    • seeing the data

    • seeing the logs

    • simple debugging

    • wu-ps, wu-kill

    • globbing, and caveat about shell vs. hdfs globs

    • overview of wukong

    • installing it (pointer to internet)

    • classes you inherit from

    • options, launching

    • overview of pig

    • options, launching

    • operations

    • functions

  6. Filesystem Mojo

    • wu-dump

    • wu-lign

    • wu-ls, wu-du

    • wu-cp, wu-mv, wu-put, wu-get, wu-mkdir

    • wu-rm, wu-rm -r, wu-rm -r --skip_trash

    • filenames, wu style

    • s3n, s3hdfs, hdfs, file (note: 'hdfs:///~' should translate to 'hdfs:///.')

    • templating: {user}, {pid}, {uuid}; {date}, {time}, {tod}, {epoch}, {yr}, {mo}, {day}, {hr}, {min}, {sec}; {run_env}, {project})

    • (the default time-based one in http://docs.oracle.com/javase/6/docs/api/java/util/UUID.html)

    • wu-distcp

    • sugared jobs (wu-identity, wu-grep, wu-wc, wu-bzip, wu-gzip, wu-snappify, wu-digest (md5/sha1/etc))

  7. Event Streams

    • Parsing logs and using regular expressions

    • Histograms and time series of pageviews

    • Geolocate visitors based on IP

    • Sessionizing a log

    • (Ab)Using Hadoop to stress-test your web server

    • (DL paste list here)

    • (see pagerank in section on graphs)

  8. Text Processing: We’ll show how to combine powerful existing libraries with hadoop to do effective text handling and Natural Language Processing:

    • grep’ing etc for simple matches

    • wordbags using Lucene

    • Indexing documents

    • Pointwise Mutual Information

    • Minhashing to combat a massive feature space

    • How to cheat with Bloom filters

    • K-means Clustering (mini-batch)

    • (?maybe?) TF-IDF

    • (?maybe?) Document clustering with SVD

    • (?maybe?) SVD as Principal Component Analysis

    • (?maybe?) Topic extraction using (to be determined)

  9. Statistics

    • Averages, Percentiles, and Normalization

    • sum, average, standard deviation, etc (airline_flights)

    • Percentiles / Median

    • exact percentiles / median

    • approximate percentiles / median

    • fit a curve to the CDF;

    • construct a histogram (tie back to server logs)

    • "Average value frequency"

    • Sampling responsibly: it’s harder and more important than you think

    • Statistical aggregates and the danger of large numbers

    • normalize data by mapping to percentile, by mapping to Z-score

    • sampling

    • consistent sampling

    • distributions

  10. Time Series

    • Anomaly detection

    • Wikipedia Pageviews

    • windowing and rolling statistics

    • (?maybe?) correlation of joint timeseries

    • (?even mayber?) similar wikipedia pages based on pageview time series

  11. Geographic

    • Spatial join (find all UFO sightings near Airports)

    • mechanics of handling geo data

    • Statistics on grid cells

    • quadkeys and grid coordinate system

    • d3 — map wikipedia

    • k-means clustering to produce readable summaries

    • partial quad keys for "area" data

    • voronoi cells to do "nearby"-ness

    • Scripts:

    • calculate_voronoi_cells — use weather station locations to calculate voronoi polygons

    • voronoi_grid_assignment — cells that have a piece of border, or the largest grid cell that has no border on it

    • Using polymaps to see results

    • Clustering

    • Pointwise mutual information

  12. cat herding

    • total sort

    • transformations

    • ruby -ne

    • grep, cut, seq, (reference back to wu-lign)

    • wc, sha1sum, md5sum, nl

    • pivots

    • wu-box, head, tail, less, split

    • uniq, sort, join, sort| uniq -c

    • bzip2, gzcat

    • commandline workflow tips

    • > /dev/null 2>&1

    • for loops (see if you can get agnostic btwn zsh & bash)

    • nohup, disown, bg and &

    • time

    • advanced hadoop filesystem (chmod, setrep, fsck)

  13. Data munging (Semi-structured data): The dirty art of data munging. It’s a sad fact, but too often the bulk of time spent on a data exploration is just getting the data ready. We’ll show you street-fighting tactics that lessen the time and pain. Along the way, we’ll prepare the datasets to be used throughout the book.

    • Wikipedia Articles: Every English-language article (12 million) from Wikipedia.

    • Wikipedia Pageviews: Hour-by-hour counts of pageviews for every Wikipedia article since 2007.

    • US Commercial Airline Flights: every commercial airline flight since 1987

    • Hourly Weather Data: a century of weather reports, with hourly global coverage since the 1950s.

    • "Star Wars Kid" weblogs: large collection of apache webserver logs from a popular internet site (Andy Baio’s waxy.org).

  14. Interlude I: Data Models, Data Formats, Data Management:

    • How to design your data models

    • How to serialize their contents (orig, scratch, prod)

    • How to organize your scripts and your data

  15. Graph — some better-motivated subset of:

    • Adjacency List / Edge List conversion

    • Undirecting a graph, Min-degree undirected graph

    • Breadth-First Search

    • subuniverse extraction

    • (?maybe?) Pagerank on server logs?

    • (?maybe?) identify strong links

    • Minimum Spanning Tree

    • clustering coefficient

    • Community Extraction: Use the page-to-page links in Wikipedia to identify similar documents

    • Pagerank (centrality): Reconstruct pageview paths from web logs, and use them to identify important pages

    • (bubble)

  16. Machine Learning without Grad School

    • weather & flight delays for prediction

    • Naive Bayes

    • Logistic Regression ("SGD")

    • Random Forest

    • (?maybe?) Collaborative Filtering

    • (?or maybe?) SVD on documents (eg authorship)

    • where to go from here

    • don’t get fancy

    • better features

    • unreasonable effectiveness

    • partition the data, recombine the models

    • pointers for the person who is going to get fancy anyway

  17. Interlude II: Best Practices and Pedantic Points of style

    • Pedantic Points of Style

    • Best Practices

    • How to Think: there are several design patterns for how to pivot your data, like Message Passing (objects send records to meet together); Set Operations (group, distinct, union, etc); Graph Operations (breadth-first search). Taken as a whole, they’re equivalent; with some experience under your belt it’s worth learning how to fluidly shift among these different models.

    • Why Hadoop

    • robots are cheap, people are important

  18. Hadoop Native Java API

    • don’t

  19. Advanced Pig

    • Advanced operators:

    • map-side join, merge join, skew joins

    • Basic UDF

    • why algebraic UDFs are awesome and how to be algebraic

    • Custom Loaders

    • Wonderdog: a LoadFunc / StoreFunc for elasticsearch

    • Performance efficiency and tunables

  20. Data Modeling for HBase-style Database

  21. Hadoop Internals

    • What happens when a job is launched

    • A shallow dive into the HDFS

  22. Hadoop Tuning

    • Tuning for the Wise and Lazy

    • Tuning for the Brave and Foolish

    • The USE Method for understanding performance and diagnosing problems

  23. Overview of Datasets and Scripts

    • Datasets

    • Wikipedia (corpus, pagelinks, pageviews, dbpedia, geolocations)

    • Airline Flights

    • UFO Sightings

    • Global Hourly Weather

    • Waxy.org "Star Wars Kid" Weblogs

    • Scripts

  24. Cheatsheets:

    • Regular Expressions

    • Sizes of the Universe

    • Hadoop Tuning & Configuration Variables

  25. Appendix