Designed for a "write and debug it on your laptop, run it in the cloud"
You will need a real hadoop cluster running in distributed mode on real data for many of the exercises and to truly grok what is happening.
Goal is that you can do anything with
a 5-machine cluster of m1.large
machines - cost is ~ $2.00/hr.
(→ Instructions for launching using ironfan)
Why not Hive? The appealing thing about Hive is that it feels a lot like SQL. The dismal thing about Hive is that it feels a lot like SQL. Similarly, the wonderful thing about Pig is that its operations more closely mirror the underlying map-reduce setup, making it easier to reason about the performance of your tasks; this however means more brain-bendy at the outset for a traditional DBA. Lastly, Hive organizes your data — useful for a multi-analyst setup - but it’s a pain when using a polyglot toolset. Ultimately, Hive is better for an Enterprise Data Warehouse experience, or if you’re already a SQL expert. but all else equal, for exploratory analysis and Data science, you’re better off with Pig.
xx
-
parsing:
-
HomeRun rubygem — large increase in date/time handling
-
Crack rubygem — parse XML simply
-
Oj rubygem — parse JSON quickly
-
-
algorithms:
-
TSort (in the ruby stdlib) for Topological Sort.
-
hashing and encoding:
-
murmur hash by
-
stdlib’s
Digest::
-
stdlib’s
Base64
-
-
matrices:
-
stdlib’s
Matrix
class
-