Get scraped HTML on s3 #1

Open
rufuspollock opened this issue Nov 25, 2012 · 9 comments

Comments

@rufuspollock
Member

@callumlocke issue to record work on getting the scraped data onto S3.

  • I'd suggest we upload what you already have, even if it's incomplete (after running extract.py, of course!).
  • Also, we could try a further download (I guess we may want to use the query option and filter in some way, e.g. by timestamp?).
@ghost assigned callumlocke Nov 25, 2012
@rufuspollock
Member Author

@callumlocke have you made any further progress on getting the data from Mongo to S3? I'd love to run the scrapers on this!

@callumlocke
Contributor

Sorry for the delay.

I ran the extraction yesterday and have left the s3cmd running since then.

Here's a little excerpt from my terminal just now:

cache/dumps/141276-2009/summary.html -> s3://files.opented.org/scraped/141276-2009/summary.html  [55525 of 445444]
 43572 of 43572   100% in    0s    91.11 kB/s  done
cache/dumps/141276-2011/summary.html -> s3://files.opented.org/scraped/141276-2011/summary.html  [55526 of 445444]
 36161 of 36161   100% in    0s    69.07 kB/s  done
cache/dumps/141277-2007/summary.html -> s3://files.opented.org/scraped/141277-2007/summary.html  [55527 of 445444]
 36816 of 36816   100% in    0s    72.13 kB/s  done
cache/dumps/141277-2009/summary.html -> s3://files.opented.org/scraped/141277-2009/summary.html  [55528 of 445444]
 41581 of 41581   100% in    0s    81.36 kB/s  done
cache/dumps/141277-2011/summary.html -> s3://files.opented.org/scraped/141277-2011/summary.html  [55529 of 445444]
 40449 of 40449   100% in    0s    82.14 kB/s  done
cache/dumps/141278-2007/summary.html -> s3://files.opented.org/scraped/141278-2007/summary.html  [55530 of 445444]
 65378 of 65378   100% in    0s   103.93 kB/s  done
cache/dumps/141278-2009/summary.html -> s3://files.opented.org/scraped/141278-2009/summary.html  [55531 of 445444]
 36068 of 36068   100% in    0s    63.38 kB/s  done
cache/dumps/141278-2011/summary.html -> s3://files.opented.org/scraped/141278-2011/summary.html  [55532 of 445444]
 38794 of 38794   100% in    0s    67.20 kB/s  done
cache/dumps/141279-2007/summary.html -> s3://files.opented.org/scraped/141279-2007/summary.html  [55533 of 445444]
 37059 of 37059   100% in    0s    74.79 kB/s  done
cache/dumps/141279-2009/summary.html -> s3://files.opented.org/scraped/141279-2009/summary.html  [55534 of 445444]
 36100 of 36100   100% in    0s    76.28 kB/s  done
cache/dumps/14128-2007/summary.html -> s3://files.opented.org/scraped/14128-2007/summary.html  [55535 of 445444]
 36893 of 36893   100% in    0s    76.71 kB/s  done

It's only uploaded about 12% of the files so far, and that's after running for most of yesterday, and then again all day today (my computer was disconnected from the internet overnight, but the s3cmd just carried on by itself as soon as I arrived at work this morning).

I'm happy to leave this running until it's done – it's not inconveniencing me – but it looks like it will take a while!

@rufuspollock
Member Author

@callumlocke how is this doing? Have we got most stuff copied over yet?

I note that http://files.opented.org.s3.amazonaws.com/scraped/index.json still only has 143 entries - could you resync the index.json?

In fact, I'm thinking we should add index.json to the GitHub repo itself (just in cache/).

@rufuspollock
Member Author

@callumlocke just a heads-up that the "official" repo (and hence the issues) has moved to https://github.com/datasets/opented.

@rufuspollock
Member Author

@callumlocke any further progress? I'd like to start running the scrapers here.

@callumlocke
Contributor

I've sync'd the latest index.json to S3 (http://files.opented.org.s3.amazonaws.com/scraped/index.json) but I don't seem to have commit access on the new repo, so if you can fix that then I'll commit a copy of the file to the repo too.

I've tried syncing to S3 (several times) with this command:

s3cmd sync --acl-public --skip-existing --exclude '.DS_Store' cache/dumps/ s3://files.opented.org/scraped/

But it finishes without printing any output in my terminal. While it's running, my CPU monitor shows that s3cmd's Python process uses the CPU for a few seconds, then stops for a few seconds, then uses it again, and so on. This goes on for several minutes. Eventually it finishes, with absolutely no terminal output. (I've also tried omitting --skip-existing and --exclude '.DS_Store', but still nothing.)

There appear to be 66,673 items in the scraped directory on S3 (found this out with s3cmd ls s3://files.opented.org/scraped/ | wc -l), compared to 449,877 in my local cache/dumps/, so it's definitely not already all sync'd.

Maybe s3cmd can't handle syncing a directory with so many files in it (although I would have thought it should be able to...?)
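
One thing I haven't tried yet: a dry run with verbose output might at least show what sync thinks it still needs to upload (just a sketch – I'm assuming this s3cmd version has the usual --dry-run/--verbose flags):

# list what sync would transfer, without actually uploading anything
s3cmd sync --dry-run --verbose --acl-public --exclude '.DS_Store' cache/dumps/ s3://files.opented.org/scraped/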

Can you think of any alternative way to get the files onto S3 from my machine?

@rufuspollock
Member Author

I guess we can leave the index.json just on S3 for the moment, rather than in the repo (how big is it?)

Re s3cmd syncing - hmmm. I'm wondering if you could push a more limited number of files through at a time (e.g. going year by year or similar). I also wonder if there is any info out there on how s3cmd copes with this many files ...
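
Something like the following might work as a rough sketch – assuming the year suffix in the directory names (e.g. 141276-2009) can be matched with s3cmd's --exclude/--include globs, and adjusting the year list to whatever actually appears in cache/dumps/:

# exclude everything, then re-include one year's directories per pass,
# so each sync run only has to consider a fraction of the files
for year in 2007 2009 2011; do
    s3cmd sync --acl-public --skip-existing --exclude '*' --include "*-${year}/*" cache/dumps/ s3://files.opented.org/scraped/
done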

@rufuspollock
Member Author

@callumlocke any further update here? Just having the index.json up would be enough for the present (obviously all the files would be perfect -- we may need to start splitting into subdirectories ...)

@callumlocke
Contributor

Hi @rgrp, sorry for the delay. I did upload the index.json to S3. It's 16.6MB.
http://files.opented.org.s3.amazonaws.com/scraped/index.json
