Your cluster

Designed for a "write and debug it on your laptop, run it in the cloud"

You will need a real hadoop cluster running in distributed mode on real data for many of the exercises and to truly grok what is happening.

Goal is that you can do anything with a 5-machine cluster of m1.large machines - cost is ~ $2.00/hr.

(→ Instructions for launching using ironfan)

Programs

Why not Hive? The appealing thing about Hive is that it feels a lot like SQL. The dismal thing about Hive is that it feels a lot like SQL. Similarly, the wonderful thing about Pig is that its operations more closely mirror the underlying map-reduce setup, making it easier to reason about the performance of your tasks; this however means more brain-bendy at the outset for a traditional DBA. Lastly, Hive organizes your data — useful for a multi-analyst setup - but it’s a pain when using a polyglot toolset. Ultimately, Hive is better for an Enterprise Data Warehouse experience, or if you’re already a SQL expert. but all else equal, for exploratory analysis and Data science, you’re better off with Pig.

Ruby & Wukong

xx

parsing:
- HomeRun rubygem — large increase in date/time handling
- Crack rubygem — parse XML simply
- Oj rubygem — parse JSON quickly
algorithms:
- Algorithms rubygem
- Priority Queue
- TSort (in the ruby stdlib) for Topological Sort.
hashing and encoding:
- murmur hash by
- stdlib’s Digest::
- stdlib’s Base64
matrices:
- stdlib’s Matrix class

Pig

xx

Wukong

Narrative Method Structure

Gather input
Perform work
Deliver results
Handle failure

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

05a-tools.asciidoc

05a-tools.asciidoc

Your cluster

Programs

Ruby & Wukong

Pig

Wukong

Files

05a-tools.asciidoc

Latest commit

History

05a-tools.asciidoc

File metadata and controls

Your cluster

Programs

Ruby & Wukong

Pig

Wukong