Get scraped HTML on s3 #1
@callumlocke have you made any further progress on getting stuff from mongo to s3? I'd love to run the scrapers on this!
Sorry for the delay. I ran the extraction yesterday and have left the s3cmd running since then. Here's a little excerpt from my terminal just now:
It's only uploaded about 12% of the files so far, and that's after running for most of yesterday, and then again all day today (my computer was disconnected from the internet overnight, but the s3cmd just carried on by itself as soon as I arrived at work this morning). I'm happy to leave this running until it's done – it's not inconveniencing me – but it looks like it will take a while!
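For a rough sense of scale, a back-of-the-envelope estimate from the figures above (about 12% uploaded after roughly two days of running) puts the full sync at well over two weeks. The numbers below are only illustrative:

```python
# Rough ETA from the progress reported above (illustrative only).
fraction_done = 0.12   # "about 12% of the files so far"
days_elapsed = 2.0     # most of yesterday plus all of today (approximate)

total_days = days_elapsed / fraction_done
remaining_days = total_days - days_elapsed
print(f"Estimated total: {total_days:.1f} days, remaining: {remaining_days:.1f} days")
# Estimated total: 16.7 days, remaining: 14.7 days
```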
@callumlocke how is this doing? Have we got most stuff copied over yet? I note that http://files.opented.org.s3.amazonaws.com/scraped/index.json still only has 143 entries - could you resync the index.json? In fact I'm thinking we should actually add index.json to the github repo (just in cache/)
@callumlocke just a heads up that the "official" repo (and hence issues) have moved here (viz https://github.com/datasets/opented) |
@callumlocke any further progress? I'd like to start running the scrapers here.
I've sync'd the latest index.json. I've tried syncing to S3 (several times) with this command:

But it finishes without printing any output in my terminal. While it's running, my CPU monitor shows that s3cmd is running a Python process, which uses the CPU for a few seconds, then stops for a few seconds, then uses it again, and so on. This goes on for several minutes. Eventually it finishes, and there's absolutely no terminal output. (I've also tried omitting …)

There appear to be 66,673 items in the directory. Maybe s3cmd can't handle syncing a directory with so many files in it (although I would have thought it should be able to...?) Can you think of any alternative way to get the files onto S3 from my machine?
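One alternative to s3cmd, hinted at by the question above, would be to script the upload directly. A minimal sketch using boto3 is below; the bucket name and `scraped/` prefix are inferred from the files.opented.org URL quoted earlier in the thread, while the local directory path and the choice of boto3 itself are assumptions, not what was actually used:

```python
# Sketch: upload every file in a local directory to S3, one object at a time.
# Assumptions: boto3 is installed and AWS credentials are configured;
# LOCAL_DIR is a stand-in for wherever the scraped HTML actually lives.
import os
import boto3

LOCAL_DIR = "cache/scraped"       # hypothetical local path
BUCKET = "files.opented.org"      # inferred from the URL quoted in this thread
PREFIX = "scraped/"

s3 = boto3.client("s3")

for name in sorted(os.listdir(LOCAL_DIR)):
    path = os.path.join(LOCAL_DIR, name)
    if not os.path.isfile(path):
        continue
    s3.upload_file(path, BUCKET, PREFIX + name)
    print("uploaded", name)
```

Uploading file by file like this is slower per object than a bulk sync, but it gives visible progress and can be resumed by skipping keys that already exist.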
I guess we can leave the index.json in the s3 repo for the moment (how big is it?) Re s3cmd syncing - hmmm. I'm wondering if you could pipe a more limited number through somehow (e.g. going by year or similar). I also wonder if there is any info on how s3cmd works with lots of files ...
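One way to read the suggestion above is to break the 66,673 files into smaller batches and push each batch separately. A rough sketch of that idea is below; it assumes (hypothetically) that each filename begins with a year, which may not match the real naming scheme:

```python
# Sketch: group filenames into batches by a leading year so that no single
# sync has to cope with all ~66,000 files at once.
import os
from collections import defaultdict

LOCAL_DIR = "cache/scraped"   # hypothetical local path

batches = defaultdict(list)
for name in os.listdir(LOCAL_DIR):
    year = name[:4] if name[:4].isdigit() else "other"
    batches[year].append(name)

for year, names in sorted(batches.items()):
    print(year, len(names), "files")
    # each batch could then be uploaded with s3cmd, boto3, etc.
```

With s3cmd itself, the `--exclude`/`--include` pattern options could likely be used to achieve a similar year-by-year split.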
@callumlocke any further update here? Just having the index.json up would be enough for the present (obviously all the files would be perfect -- we may need to start splitting into subdirectories ...)
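If splitting into subdirectories does become necessary, one common approach is to shard files by a short prefix of their name so that no single directory holds tens of thousands of entries. A rough sketch, with the paths purely illustrative:

```python
# Sketch: move files into subdirectories keyed on the first two characters
# of the filename, e.g. cache/scraped/ab/abcdef.html.
import os
import shutil

LOCAL_DIR = "cache/scraped"   # hypothetical local path

for name in os.listdir(LOCAL_DIR):
    src = os.path.join(LOCAL_DIR, name)
    if not os.path.isfile(src):
        continue
    dest_dir = os.path.join(LOCAL_DIR, name[:2])
    os.makedirs(dest_dir, exist_ok=True)
    shutil.move(src, os.path.join(dest_dir, name))
```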
Hi @rgrp, sorry for the delay. I did upload the index.json to S3. It's 16.6MB. |
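Since index.json is publicly readable at the URL quoted earlier in the thread, its presence and reported size can be checked without S3 credentials. A small sketch using the requests library (the library choice is an assumption, not something mentioned in the thread):

```python
# Sketch: confirm index.json is on S3 and check its reported size.
import requests

url = "http://files.opented.org.s3.amazonaws.com/scraped/index.json"
resp = requests.head(url)
size_mb = int(resp.headers["Content-Length"]) / 1_000_000
print(resp.status_code, f"{size_mb:.1f} MB")
```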
@callumlocke issue to record work on getting the scraped data to s3.