slow dataset loading #1559
Hi, I combined both cases and ran each to its end: not just loading the beginning of the datasets, but actually running the whole process. I also deleted the default Hugging Face cache before each run:

```python
from datasets import load_dataset as hf_load_dataset
from time import time
from unitxt.api import load_dataset
import shutil
path = "/home/dafna/.cache/huggingface"
shutil.rmtree(path)
t0 = time()
ds = hf_load_dataset("PrimeQA/clapnq_passages")
print(f"hf ds: {ds}")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")
print()
shutil.rmtree(path)
t0 = time()
ds = load_dataset('card=cards.rag.documents.clap_nq.en')
print(f"unitxt ds: {ds}")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")
```
The printout is: (output omitted)
I then removed all IterableDatasets from the Unitxt loaders and from the end of api.load_dataset (the updated snippets of api.load_dataset and LoaderHF.load_iterables() are omitted here), and self.load_dataset, in turn, has every .to_iterable_dataset() call removed. Now both loads match. Unitxt floats like a butterfly, stings like a bee: (printout omitted)
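For context, a minimal sketch (not the actual Unitxt loader code) of the two dataset flavors being discussed, using the dataset from this thread:

```python
from datasets import load_dataset

# Map-style Dataset: fully materialized, backed by Arrow files on disk.
ds = load_dataset("PrimeQA/clapnq_passages", split="train")

# Streaming-style wrapper: rows are produced lazily, one Python dict at a time.
iterable_ds = ds.to_iterable_dataset()

# Iterating the map-style Dataset reads Arrow batches in bulk, while the
# IterableDataset re-yields examples one by one; that per-example overhead
# is what the change above removes.
rows = list(ds)                 # fast: bulk Arrow reads
rows_again = list(iterable_ds)  # slower: per-example generator
```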
Thanks! Nice find. I guess the default should be the standard Dataset, and only for very large datasets would the iterator variant be better, since it doesn't assume that the full dataset is downloaded to disk / read into memory. Perhaps this can be indicated in the dataset card using some flag.
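A hypothetical sketch of what such a flag could look like on a card's loader (the streaming parameter name here is illustrative, not a committed API):

```python
from unitxt.loaders import LoadHF

# Hypothetical: a per-card opt-in to the IterableDataset variant,
# reserved for datasets too large to materialize on disk / in memory.
loader = LoadHF(
    path="PrimeQA/clapnq_passages",
    streaming=True,  # illustrative flag; the default would stay the standard Dataset
)
```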
@dafnapension notice the diff in these prints: HF: (output omitted)
Unitxt: (output omitted)
Any idea why the 2nd print is much longer? 100 secs vs. 8.5 secs?
Hi @assaftibm, the first printout occurs after streaming all the ~180K instances through list(), i.e., just iterating over them.
Hi @dafnapension, I was talking only about one print, `print(f"from after load_dataset to after list of ds = {t2-t1}")`, which reports the duration of `instances = list(ds["train"])` (and I'm comparing it between the HF case and the Unitxt case). I'll benchmark on my Mac.
By the way, this line: `print(f"unitxt ds: {ds}")` prints the full dataset to the console. (output omitted)
@dafnapension this is the output on my computer: (output omitted)
Three conclusions: (list omitted)
@elronbandel / @yoavkatz can you benchmark on your computer? Just make sure you are on the right branch. I'm using this version of @dafnapension's code:

```python
from datasets import load_dataset as hf_load_dataset
from time import time
import os
from uuid import uuid4
from unitxt.api import load_dataset
import shutil
path = os.path.join(f"cache/{uuid4()}")
# path = "/home/dafna/.cache/huggingface"
# shutil.rmtree(path)
t0 = time()
ds = hf_load_dataset("PrimeQA/clapnq_passages", cache_dir=path)
print(f"hf ds: {ds}")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")
print()
# shutil.rmtree(path)
t0 = time()
ds = load_dataset('card=cards.rag.documents.clap_nq.en')
print(f"unitxt ds: {len(ds)}")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")
```
Hi @assaftibm, first, note that these lines:

```python
path = "/home/dafna/.cache/huggingface"
shutil.rmtree(path)
```

delete the default Hugging Face cache between runs. Second, the first printout was produced before I removed the .to_iterable_dataset at the end of unitxt's api.load_dataset.

Third, in both cases (your HF and Unitxt) I added list(ds["train"]) after the load. Both HF's and Unitxt's load_dataset give you an object you can iterate over (more precisely, a dictionary with a single entry, "train", that maps to a generator you can iterate over), and I also wanted to measure the time that iteration takes: to make sure all instances arrived from HF to my laptop and went through the process. For your HF experiment, the "process" is just listing them, verifying that all arrived rather than holding a generator that would fetch them only when triggered; for Unitxt, it is verifying that they go through the whole card you specified. This is why I added list(ds["train"]) to both. But again, please bear in mind that what you iterate over is different, although both are called ds and both are obtained from load_dataset (two different load_dataset functions: HF's and Unitxt's). And yes, the branch I closed, ...
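A tiny illustration (a hypothetical stand-in, not the benchmark itself) of why the list() call is needed to force the work:

```python
from time import time

def lazy_rows(n):
    # Stand-in for a lazy pipeline: each row is produced only on demand.
    for i in range(n):
        yield {"id": i}

t0 = time()
gen = lazy_rows(180_000)  # returns instantly; nothing has been produced yet
t1 = time()
rows = list(gen)          # only now is every row actually generated
t2 = time()
print(f"create: {t1 - t0:.4f}s, consume: {t2 - t1:.4f}s")  # first ~0, second is the real cost
```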
Thanks for the explanation! I think we understand each other :-) Can you put a link to the new PR that @elronbandel started? Thanks
Hi @assaftibm, yes, the latter: #1564
Slightly updated version of the test code, using the unitxt.settings.disable_hf_datasets_cache setting shown below:

```python
import os.path
import shutil
from time import time
from datasets import load_dataset as hf_load_dataset
from unitxt.api import load_dataset
import unitxt
path = "hf_cache"
# path = "/home/dafna/.cache/huggingface"
if os.path.exists(path):
shutil.rmtree(path)
t0 = time()
ds = hf_load_dataset("PrimeQA/clapnq_passages", cache_dir=path)
print(f"hf ds: {ds}")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")
print()
if os.path.exists(path):
shutil.rmtree(path)
# disable the cache for hf datasets
unitxt.settings.disable_hf_datasets_cache=True
t0 = time()
ds = load_dataset('card=cards.rag.documents.clap_nq.en')
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")
```
Hi @assaftibm, for Unitxt, depending on a setting (see unitxt/src/unitxt/settings_utils.py, line 152 in bc65c5c), the HF loader may or may not use the default file-system cache that HF uses (HF uses it if you do not pass a non-None value for the cache_dir parameter when calling its load_dataset; see line 234 in bc65c5c). So, if the setting happens to be False, your test script plays in favor of Unitxt by allowing it to use the file-system cache when loading from HF, whereas 'your part' does wipe out the potential cache_dir before invoking load_dataset.
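A minimal sketch of the caching asymmetry being described (the disable_hf_datasets_cache name comes from the script above; exact Unitxt internals may differ):

```python
import unitxt
from datasets import load_dataset as hf_load_dataset

# HF side: with no cache_dir, load_dataset reuses ~/.cache/huggingface,
# so a warm second run is much faster than a cold download.
ds_warm = hf_load_dataset("PrimeQA/clapnq_passages")                    # default cache
ds_cold = hf_load_dataset("PrimeQA/clapnq_passages", cache_dir="fresh")  # isolated cache

# Unitxt side: setting this to True makes Unitxt's HF loader bypass the
# default file-system cache, giving a fair cold-start comparison.
unitxt.settings.disable_hf_datasets_cache = True
```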
And I now realize that I misled you: both PRs by @elronbandel aim at solving your important observation: #1562 and #1564
It seems that loading a dataset from HF using Unitxt is much slower than doing it with the datasets package. Compare this: (snippet omitted)
To: (snippet omitted)
The Unitxt version takes about 5x longer. In both cases a fresh new copy is downloaded.
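The two snippets were not captured here; judging from the scripts earlier in the thread, the comparison was presumably along these lines:

```python
# Presumed comparison (reconstructed from the scripts above, not the original snippets):
from datasets import load_dataset as hf_load_dataset
from unitxt.api import load_dataset

ds_hf = hf_load_dataset("PrimeQA/clapnq_passages")           # the datasets package
ds_ut = load_dataset("card=cards.rag.documents.clap_nq.en")  # the Unitxt path
```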