We can use the user logs to assemble a picture of how users navigate the site — 'sessionizing' their pageviews. Marketing and e-commerce sites have a great deal of interest in optimizing their "conversion funnel", the sequence of pages that visitors follow before filling out a contact form, or buying those shoes, or whatever it is the site exists to serve. Visitor sessions are also useful for defining groups of related pages, in a manner far more robust than what simple page-to-page links would define. A recommendation algorithm using those relations would for example help an e-commerce site recommend teflon paste to a visitor buying plumbing fittings, or help a news site recommend an article about Marilyn Monroe to a visitor who has just finished reading an article about John F Kennedy. Many commercial web analytics tools don’t offer a view into user sessions — assembling them is extremely challenging for a traditional datastore. It’s a piece of cake for Hadoop, as you’re about to see.
////This spot could be an effective place to say more about "Locality" and taking the reader deeper into thinking about that concept in context. Amy////
NOTE:[Take a moment and think about the locality: what feature(s) do we need to group on? What additional feature(s) should we sort with?]
spit out [ip, date_hr, visit_time, path]
.
link:code/serverlogs/old/logline-03-breadcrumbs-full.rb[role=include]
You might ask why we don’t partition directly on say both visitor_id
and date (or other time bucket). Partitioning by date would break the locality of any visitor session that crossed midnight: some of the requests would be in one day, the rest would be in the next day.
run it in map mode:
link:code/serverlogs/old/logline-02-histograms-01-mapper.log[role=include]
link:code/serverlogs/old/logline-03-breadcrumbs-02-mapper.log[role=include]
group on user
link:code/serverlogs/old/logline-03-breadcrumbs-03-reducer.log[role=include]
We use the secondary sort so that each visit is in strict order of time within a session.
You might ask why that is necessary — surely each mapper reads the lines in order? Yes, but you have no control over what order the mappers run, or where their input begins and ends.
This script will accumulate multiple visits of a page.
TODO: say more about the secondary sort. ////This may sound wacky, but please try it out: use the JFK/MM exmaple again, here. Tie this all together more, the concepts, using those memorable people. I can explain this live, too. Amy////