Get scraped HTML on s3 #1

Open
rufuspollock opened this issue Nov 25, 2012 · 9 comments

Comments

@rufuspollock
Member

@callumlocke issue to record work on getting the scraped data onto S3.

  • I'd suggest we upload what you already have, even if it's incomplete (after running extract.py, of course!).
  • Also, we could try a further download (I guess we may want to use the query option and filter in some way, e.g. by timestamp?).
@ghost assigned callumlocke Nov 25, 2012
@rufuspollock
Member Author

@callumlocke have you made any further progress on getting the data from Mongo to S3? I'd love to run the scrapers on this!

@callumlocke
Contributor

Sorry for the delay.

I ran the extraction yesterday and have left the s3cmd running since then.

Here's a little excerpt from my terminal just now:

cache/dumps/141276-2009/summary.html -> s3://files.opented.org/scraped/141276-2009/summary.html  [55525 of 445444]
 43572 of 43572   100% in    0s    91.11 kB/s  done
cache/dumps/141276-2011/summary.html -> s3://files.opented.org/scraped/141276-2011/summary.html  [55526 of 445444]
 36161 of 36161   100% in    0s    69.07 kB/s  done
cache/dumps/141277-2007/summary.html -> s3://files.opented.org/scraped/141277-2007/summary.html  [55527 of 445444]
 36816 of 36816   100% in    0s    72.13 kB/s  done
cache/dumps/141277-2009/summary.html -> s3://files.opented.org/scraped/141277-2009/summary.html  [55528 of 445444]
 41581 of 41581   100% in    0s    81.36 kB/s  done
cache/dumps/141277-2011/summary.html -> s3://files.opented.org/scraped/141277-2011/summary.html  [55529 of 445444]
 40449 of 40449   100% in    0s    82.14 kB/s  done
cache/dumps/141278-2007/summary.html -> s3://files.opented.org/scraped/141278-2007/summary.html  [55530 of 445444]
 65378 of 65378   100% in    0s   103.93 kB/s  done
cache/dumps/141278-2009/summary.html -> s3://files.opented.org/scraped/141278-2009/summary.html  [55531 of 445444]
 36068 of 36068   100% in    0s    63.38 kB/s  done
cache/dumps/141278-2011/summary.html -> s3://files.opented.org/scraped/141278-2011/summary.html  [55532 of 445444]
 38794 of 38794   100% in    0s    67.20 kB/s  done
cache/dumps/141279-2007/summary.html -> s3://files.opented.org/scraped/141279-2007/summary.html  [55533 of 445444]
 37059 of 37059   100% in    0s    74.79 kB/s  done
cache/dumps/141279-2009/summary.html -> s3://files.opented.org/scraped/141279-2009/summary.html  [55534 of 445444]
 36100 of 36100   100% in    0s    76.28 kB/s  done
cache/dumps/14128-2007/summary.html -> s3://files.opented.org/scraped/14128-2007/summary.html  [55535 of 445444]
 36893 of 36893   100% in    0s    76.71 kB/s  done

It's only uploaded about 12% of the files so far, and that's after running for most of yesterday, and then again all day today (my computer was disconnected from the internet overnight, but the s3cmd just carried on by itself as soon as I arrived at work this morning).

I'm happy to leave this running until it's done – it's not inconveniencing me – but it looks like it will take a while!

@rufuspollock
Member Author

@callumlocke how is this doing? Have we got most stuff copied over yet?

I note that http://files.opented.org.s3.amazonaws.com/scraped/index.json still only has 143 entries - could you resync the index.json?

In fact, I'm thinking we should add index.json to the GitHub repo itself (just in cache/).

@rufuspollock
Member Author

@callumlocke just a heads-up that the "official" repo (and hence the issues) has moved to https://github.com/datasets/opented.

@rufuspollock
Member Author

@callumlocke any further progress? I'd like to start running the scrapers here.

@callumlocke
Contributor

I've sync'd the latest index.json to S3 (http://files.opented.org.s3.amazonaws.com/scraped/index.json) but I don't seem to have commit access on the new repo, so if you can fix that then I'll commit a copy of the file to the repo too.

I've tried syncing to S3 (several times) with this command:

s3cmd sync --acl-public --skip-existing --exclude '.DS_Store' cache/dumps/ s3://files.opented.org/scraped/

But it finishes without printing any output in my terminal. While it's running, my CPU monitor shows that s3cmd's Python process uses the CPU for a few seconds, then stops for a few seconds, then uses it again, and so on. This goes on for several minutes. Eventually it finishes, with absolutely no terminal output. (I've also tried omitting --skip-existing and --exclude '.DS_Store', but still nothing.)

There appear to be 66,673 items in the scraped directory on S3 (found this out with s3cmd ls s3://files.opented.org/scraped/ | wc -l), compared to 449,877 in my local cache/dumps/, so it's definitely not already all sync'd.

Maybe s3cmd can't handle syncing a directory with so many files in it (although I would have thought it should be able to...?)
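
One thing I haven't tried yet: a dry run with verbose output might at least show what sync thinks it still needs to upload (just a sketch – I'm assuming this s3cmd version has the usual --dry-run/--verbose flags):

# list what sync would transfer, without actually uploading anything
s3cmd sync --dry-run --verbose --acl-public --exclude '.DS_Store' cache/dumps/ s3://files.opented.org/scraped/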

Can you think of any alternative way to get the files onto S3 from my machine?

@rufuspollock
Member Author

I guess we can leave the index.json just on S3 for the moment, rather than in the repo (how big is it?)

Re s3cmd syncing - hmmm. I'm wondering if you could push a more limited number of files through at a time (e.g. going year by year or similar). I also wonder if there is any info out there on how s3cmd copes with this many files ...
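
Something like the following might work as a rough sketch – assuming the year suffix in the directory names (e.g. 141276-2009) can be matched with s3cmd's --exclude/--include globs, and adjusting the year list to whatever actually appears in cache/dumps/:

# exclude everything, then re-include one year's directories per pass,
# so each sync run only has to consider a fraction of the files
for year in 2007 2009 2011; do
    s3cmd sync --acl-public --skip-existing --exclude '*' --include "*-${year}/*" cache/dumps/ s3://files.opented.org/scraped/
done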

@rufuspollock
Member Author

@callumlocke any further update here? Just having the index.json up would be enough for the present (obviously all the files would be perfect -- we may need to start splitting into subdirectories ...)

@callumlocke
Contributor

Hi @rgrp, sorry for the delay. I did upload the index.json to S3. It's 16.6MB.
http://files.opented.org.s3.amazonaws.com/scraped/index.json
