Skip to content
schacon edited this page Sep 12, 2010 · 17 revisions

The Munger::Data component is for dealing with tables of data – specifically, arrays of hashes or collections of hash like objects (ActiveRecords, for example). You feed all the raw data into it and transform and amend and “munge” it using the helpful methods so you have all the data points that you will eventually need for your report, completely disregarding layout or anything else – this tool is about getting the data you need.

Loading Data

The first thing that you want to do is initialize your Munger::Data object and load your data into it. For these examples, I will use data for an online advertising system that displays an ad for blocks of time on their site.

result = AdAirings.find(:all)
data = Munger::Data.load_data(result)

You can also call it thusly:

data = Munger::Data.new(:data => result)

Now you have a Munger::Data object. Let’s see what we can do with it.

Column Manipulation

You can easily add new columns of data or transform existing columns of data. For example, let’s say that this was the raw data:

|airtime | airdate    | clicks | advert | 
------------------------------------------
|15      | 2008-01-01 | 301    | spot 1 | 
|30      | 2008-01-02 | 199    | spot 1 | 
|30      | 2008-01-03 | 234    | spot 1 | 
|15      | 2008-01-04 | 342    | spot 1 | 
|30      | 2008-01-01 | 172    | spot 2 | 
|15      | 2008-01-02 | 217    | spot 2 | 
|90      | 2008-01-03 | 1023   | spot 2 | 
|30      | 2008-01-04 | 321    | spot 2 | 
|60      | 2008-01-01 | 512    | spot 3 | 
|30      | 2008-01-02 | 813    | spot 3 | 
|15      | 2008-01-03 | 333    | spot 3 |

and we wanted to know what the click per second rate was. We can run this to add that column:

data.add_column('rate') { |row| (row.clicks / row.airtime) }
|airdate    | rate | clicks | airtime | advert | 
-------------------------------------------------
|2008-01-01 | 20   | 301    | 15      | spot 1 | 
|2008-01-02 | 6    | 199    | 30      | spot 1 | 
|2008-01-03 | 7    | 234    | 30      | spot 1 | 
|2008-01-04 | 22   | 342    | 15      | spot 1 | 
|2008-01-01 | 5    | 172    | 30      | spot 2 | 
|2008-01-02 | 14   | 217    | 15      | spot 2 | 
|2008-01-03 | 11   | 1023   | 90      | spot 2 | 
|2008-01-04 | 10   | 321    | 30      | spot 2 | 
|2008-01-01 | 8    | 512    | 60      | spot 3 | 
|2008-01-02 | 27   | 813    | 30      | spot 3 | 
|2008-01-03 | 22   | 333    | 15      | spot 3 |

You can also do this to overwrite column data, and you can do it on more than one column at a time. Lets say we want to capitalize the names of the adverts and calculate the rate:

data.add_columns([:advert, :rate]) do |row| 
  rate = (row.clicks / row.airtime)
  [row.advert.capitalize, rate]
end

Now you have this :


|airdate | advert | rate | clicks | airtime |
-———————————————————————
|2008-01-01 | Spot 1 | 20 | 301 | 15 |
|2008-01-02 | Spot 1 | 6 | 199 | 30 |
|2008-01-03 | Spot 1 | 7 | 234 | 30 |
|2008-01-04 | Spot 1 | 22 | 342 | 15 |
|2008-01-01 | Spot 2 | 5 | 172 | 30 |
|2008-01-02 | Spot 2 | 14 | 217 | 15 |
|2008-01-03 | Spot 2 | 11 | 1023 | 90 |
|2008-01-04 | Spot 2 | 10 | 321 | 30 |
|2008-01-01 | Spot 3 | 8 | 512 | 60 |
|2008-01-02 | Spot 3 | 27 | 813 | 30 |
|2008-01-03 | Spot 3 | 22 | 333 | 15 |

Notice that the ‘advert’ field was overwritten with a new value.

Row Filtering

Let’s say we only want to report on data where the click rate is more than 10/second. We can filter our dataset down easily. (Many times this is easier to do in SQL, if that’s where the data is coming from, but sometimes the calculations are complex…)

data.filter_rows { |row| row.rate > 10 }

So now we have this:

|advert | rate | airdate    | clicks | airtime | 
-------------------------------------------------
|Spot 1 | 20   | 2008-01-01 | 301    | 15      | 
|Spot 1 | 22   | 2008-01-04 | 342    | 15      | 
|Spot 2 | 14   | 2008-01-02 | 217    | 15      | 
|Spot 2 | 11   | 2008-01-03 | 1023   | 90      | 
|Spot 3 | 27   | 2008-01-02 | 813    | 30      | 
|Spot 3 | 22   | 2008-01-03 | 333    | 15      | 

Pivoting

Let’s say we want to see a table of spot clicks by airdate in a pivot table. If we take the original dataset and run this:

new_columns = data.pivot('airdate', 'advert', 'clicks')

we get ‘data’ modified to be the following (and new_columns contains the names of the columns that were added):

|advert | 2008-01-01 | 2008-01-02 | 2008-01-03 | 2008-01-04 | 
--------------------------------------------------------------
|Spot 1 | 301        | 199        | 234        | 342        | 
|Spot 2 | 172        | 217        | 1023       | 321        | 
|Spot 3 | 512        | 813        | 333        |            | 

Or, you can see an average or a count in the data field instead:

new_columns = data.pivot('airtime', 'advert', 'rate', :average)
|advert | 60 | 90 | 30 | 15 | 
------------------------------
|Spot 1 |    |    | 6  | 21 | 
|Spot 2 |    | 11 | 7  | 14 | 
|Spot 3 | 8  |    | 27 | 22 | 

So I can see that the average click rate for Spot 3 was highest on the 30 second spot, where for Spot 1 and 2 it was the 15 second one.

Examining

There are also some convenience functions you can use to inspect your Data object at any time. ‘size’ will give you the number of rows you have, and ‘columns’ will give you all the active column names.

See MungerReport for how to turn this munged Data object into an actual report.

Clone this wiki locally