Reduce memory usage for loading DwC-A file #37

Open
adityajain07 opened this issue Aug 13, 2024 · 2 comments

@adityajain07
Contributor

Suggestion by the IDT team:

You could use a "streaming" approach: read lines from the CSV gradually with an iterator and hand them to pool.imap_unordered() (https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap_unordered) as they come. At no point do you need all of this data in memory.
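
A minimal sketch of that streaming pattern, assuming a hypothetical `occurrences.csv` whose image URLs live in an `identifier` column (the filename, delimiter, column name, worker body, and pool size of 4 are all placeholders to adapt to the actual DwC-A export):

```python
import csv
import os
import urllib.request
from multiprocessing import Pool

def download_image(url):
    """Fetch one URL and write it to disk; returns the URL on success."""
    target = os.path.join("images", os.path.basename(url.split("?")[0]))
    try:
        urllib.request.urlretrieve(url, target)
        return url
    except OSError:
        return None

def iter_urls(path):
    """Yield URLs one row at a time; the full CSV is never held in memory."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row["identifier"]

if __name__ == "__main__":
    os.makedirs("images", exist_ok=True)
    with Pool(processes=4) as pool:
        # imap_unordered pulls rows from the generator lazily and yields
        # results as workers finish, so peak memory stays small and bounded.
        for _ in pool.imap_unordered(download_image, iter_urls("occurrences.csv")):
            pass
```

Because imap_unordered consumes the iterator on demand, memory use is bounded by the pool's working set rather than by the size of the CSV.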

Another suggestion on not asking for multiple CPUs:

A download job is network I/O-bound: it is limited entirely by the Internet connection to the outside world. That is so slow that even one CPU core is massively more than enough for your needs. You are asking for 64, which means 63.5+ of those cores are wasted. Furthermore, with the streaming I/O approach suggested above, you should not need more than 10 GB of RAM, so the request for 300 GB is also >95% waste. You simply do not need to load all of these URLs into RAM, still less shove them into pandas.
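
For contrast, the non-streaming pattern presumably looks something like the following (hypothetical file and column names again); pandas materializes every row at once, which is what inflates the memory request:

```python
import pandas as pd

# Anti-pattern sketch: the whole CSV becomes an in-memory DataFrame,
# and the URL column is then copied into a second in-memory list.
df = pd.read_csv("occurrences.csv")
urls = df["identifier"].tolist()
```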

@adityajain07 adityajain07 self-assigned this Aug 13, 2024
@adityajain07
Contributor Author

Another comment:

- CPUs != processes != parallel != faster.
- You have to know where your probable bottleneck is. In almost all certainty, you will not get more than about 100 MB/s down through our internet pipe. Our filesystems are much faster than that, so they are not the bottleneck. One CPU core can handle a download program and the disk I/O that 100 MB/s of through-traffic generates.
- Downloading images one-by-one is not automatically bad. It might be bad if there is dead time between downloads (from writing out the file, selecting the next URL, or any other reason). It may therefore be worthwhile to have a handful of downloads going in parallel to saturate the network connection; see the sketch after this list.
- A "handful", here, is determined by that dead time, which is a property of your download code's (in)efficiency. It is not determined principally by CPU core count; in fact, it is almost completely independent of it, and it is definitely not 64. It is extremely likely that 4-8 parallel downloads on 1 CPU core will saturate the download bandwidth entirely. One CPU core can handle almost any number of processes, as long as those processes are mostly sleeping while waiting on I/O and using negligible CPU, which is likely the case here.
- If you have tied the number of downloads to the number of cores, that is a mistake. Remove that tie.
- It really has nothing to do with CPU or GPU usage efficiency: a download job is principally about moving data, and just about any single CPU core ought to be adequate for all but the highest-performance downloads on the highest-performance networks and networked filesystems.
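
As a concrete version of the "handful of downloads on one core" point, here is a minimal sketch using threads rather than one process per CPU; threads blocked on network I/O cost almost no CPU, so a single core time-slices them easily. The URL list and the 6-worker count are illustrative assumptions, to be tuned against the observed dead time:

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URLS = [
    "https://example.org/a.jpg",  # placeholders; in practice, stream these
    "https://example.org/b.jpg",  # from the CSV as in the earlier sketch
]

def fetch(url):
    """Blocking download of one URL; the thread sleeps while bytes arrive."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return url, len(resp.read())

# 6 workers is an assumed starting point inside the suggested 4-8 range;
# tune it against the measured dead time, not against the core count.
with ThreadPoolExecutor(max_workers=6) as pool:
    for url, size in pool.map(fetch, URLS):
        print(url, size)
```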

@adityajain07
Contributor Author

Re-emphasizing: having more than one download process might help you reach the maximum bandwidth of probably 100 MB/s, but adding more processes than the minimum required will only slow down every other process's download. Furthermore, since these processes will spend most of their time sleeping while waiting for data to come in or go out, they will barely use the CPU, which can then be time-sliced between all of them. That is why you only need one CPU core, might only need a small handful of download processes, and definitely do not need 64 cores.
