What can you do with the sessionized logs? Well, each row lists a visitor-session on the left and a bunch of pages on the right. We’ve been thinking about that as a table, but it’s also a graph; in fact, a bunch of graphs! The serverlogs_affinity_graph describes an affinity graph, but we can build a simpler graph that just connects pages to pages by counting the number of sessions in which both pages of a pair were visited. Every time a person requests the /archive/2003/04/03/typo_pop.shtml page and the /archive/2003/04/29/star_war.shtml page in the same visit, that’s one point towards their similarity. The chapter on [graph_processing] has lots of fun things to do with a graph like this, so for now we’ll just lay the groundwork by computing the page-page similarity graph defined by visitor sessions.
link:code/serverlogs/old/logline-04-page_page_edges-full.rb[role=include]
link:code/serverlogs/old/logline-04-page_page_edges-03-reducer.log[role=include]
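To make the pair counting concrete, here is a minimal in-memory sketch in plain Ruby. It is not the book's map/reduce script: the `sessions` hash, its session ids, and the `page_page_edges` helper are all hypothetical names chosen for illustration, and the sketch assumes each session's page list fits in memory.

```ruby
# Hypothetical input: session id => pages requested during that session.
sessions = {
  "visitor_a-1" => ["/archive/2003/04/03/typo_pop.shtml",
                    "/archive/2003/04/29/star_war.shtml",
                    "/index.html"],
  "visitor_b-1" => ["/archive/2003/04/03/typo_pop.shtml",
                    "/archive/2003/04/29/star_war.shtml"],
  "visitor_c-1" => ["/index.html"],
}

# Count, for every unordered pair of pages, the number of sessions
# in which both pages appear.
def page_page_edges(sessions)
  counts = Hash.new(0)
  sessions.each_value do |pages|
    # uniq: a page requested twice in one session still counts once.
    # sort: (a, b) and (b, a) become the same edge key.
    pages.uniq.sort.combination(2) { |a, b| counts[[a, b]] += 1 }
  end
  counts
end

edges = page_page_edges(sessions)
edges.each { |(a, b), n| puts [a, b, n].join("\t") }
```

Note that a session with n distinct pages emits n(n-1)/2 pairs, so a crawler-style session that touches thousands of pages would dominate the output unless it is filtered or down-weighted first.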
You can also think of the sessionized logs as an affinity graph pairing visitor sessions with pages. The Netflix prize motivated a lot of work to help us understand affinity graphs (in that case, a pairing of Netflix users with movies). Affinity graph-based recommendation engines simultaneously group similar users with similar movies (or similar sessions with similar pages).

Imagine the following device. Set up two long straight lengths of wire, with beads that can slide along them. Beads on the left wire represent visitor sessions; beads on the right wire represent pages. These are magic beads, in that they can slide through each other, and they can clamp themselves to the wire. They are also slightly magnetic, so with no other tension they would not clump together but would instead arrange themselves at separated intervals along the wire. Clamp all the beads in place for a moment and tie a small elastic string between each session bead and the bead for each page in that session. (These elastic strings also magically don’t interfere with each other.) To combat the crawler-robot effect, choose tighter strings when there are few pages in the session, and weaker strings when there are lots of pages in the session.

Once you’ve finished stringing this up, unclamp one of the session beads. It will snap to a position opposite the middle of all the pages it is tied to. If you now unclamp each of those page beads, they’ll move to sit opposite that first session bead. As you continue to unclamp all the beads, you’ll find that they organize into clumps along the wire: when a bunch of sessions link to a common set of pages, their mutual forces combine to drag them opposite each other. That’s the intuitive view; there are proper mathematical treatments, of course, for this kind of co-clustering.
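The bead picture can be played out numerically. The sketch below is a toy, not a real co-clustering algorithm. It makes several assumptions that are not in the text: a bead's position is a single number along its wire, a string's strength is 1 divided by the number of pages in its session, and an unclamped bead settles at the strength-weighted average of the beads it is tied to. The session ids, page names, and the `relax` and `weighted_mean` helpers are hypothetical.

```ruby
# Hypothetical sessions; "bot" mimics a crawler that touches every page.
sessions = {
  "s1"  => %w[/a /b],
  "s2"  => %w[/a /b],
  "s3"  => %w[/c /d],
  "s4"  => %w[/c /d],
  "bot" => %w[/a /b /c /d],
}

# Page beads start clamped at evenly spaced points, deliberately interleaved
# so that related pages are not yet next to each other.
page_pos = { "/a" => 0.0, "/c" => 1.0, "/b" => 2.0, "/d" => 3.0 }

# pairs is a list of [position, weight]; returns the weighted average position.
def weighted_mean(pairs)
  total = pairs.sum { |_, w| w }
  pairs.sum { |x, w| x * w } / total
end

# One round: unclamp every session bead, then unclamp every page bead.
def relax(sessions, page_pos)
  # Assumed string strength: fewer pages in the session means a tighter string.
  strength = sessions.transform_values { |pages| 1.0 / pages.size }

  session_pos = sessions.to_h do |id, pages|
    [id, weighted_mean(pages.map { |p| [page_pos[p], strength[id]] })]
  end

  new_page_pos = page_pos.keys.to_h do |page|
    tied = sessions.select { |_, pages| pages.include?(page) }.keys
    [page, weighted_mean(tied.map { |id| [session_pos[id], strength[id]] })]
  end

  [session_pos, new_page_pos]
end

session_pos, page_pos = relax(sessions, page_pos)
page_pos.each { |page, x| puts format("%s\t%.2f", page, x) }
```

After one round, /a and /b land on the same spot, as do /c and /d, while the weakly tied crawler session nudges both clumps only slightly toward the middle. The sketch leaves out the magnetic repulsion between beads, so repeating `relax` many times on a connected graph would pull every bead toward a single point; the proper treatments add some form of normalization to prevent that collapse.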
TODO: figure showing bipartite session-page graph