
Allow setting a seed for DataCollatorForLanguageModeling #36357

Open
capemox opened this issue Feb 23, 2025 · 2 comments
Labels: Feature request

Comments


capemox commented Feb 23, 2025

Feature request

The DataCollatorForLanguageModeling class supports training on an MLM (masked language modeling) objective by randomly masking or replacing certain tokens; models such as BERT and RoBERTa are trained this way. It would be great if the user could set a seed, ensuring repeatability when generating masked batches.

Motivation

This would ensure that repeatable batches of data are generated, which is critical for model reproducibility. Right now there is a form of repeatability with transformers.set_seed(), but one can instead make use of generators (PyTorch, TensorFlow, NumPy) to seed the data collator without globally setting the seed for each framework. The main benefit is that the MLM masking would not be influenced by random-number usage elsewhere in the code, which is good practice: given the same dataset and seed, the masking would be consistent irrespective of the rest of your training script. See this blog post for more details.
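For illustration (not from the original issue), this is roughly what framework-local generators look like in each backend, as opposed to global seeding; how the collator would consume them is the open design question:

```python
# Illustration only: framework-local RNGs, as opposed to global seeding via
# transformers.set_seed(), which touches every random call in the script.
import numpy as np
import tensorflow as tf
import torch

torch_gen = torch.Generator().manual_seed(42)  # accepted by torch.bernoulli/randint via generator=
np_rng = np.random.default_rng(42)             # NumPy's Generator API
tf_gen = tf.random.Generator.from_seed(42)     # TensorFlow's stateful generator
```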

Your contribution

I can submit a PR for this. I have experience with TF, PyTorch and NumPy and would love to contribute. I have taken a look at the code and can add a seed argument that enables the use of generators for repeatability. If not specified, the code would fall back to its previous behavior, including respecting transformers.set_seed().
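As a rough, hypothetical sketch of what the masking side could look like (the seed argument and generator attribute do not exist in DataCollatorForLanguageModeling today, and the masking logic below is heavily simplified):

```python
# Hypothetical sketch only -- `seed` is the proposed argument, not an existing
# DataCollatorForLanguageModeling parameter, and the masking is simplified.
from typing import Optional, Tuple
import torch

class SeededMLMCollatorSketch:
    def __init__(self, mlm_probability: float = 0.15, seed: Optional[int] = None):
        self.mlm_probability = mlm_probability
        # With a seed, keep a private generator; without one, fall back to the
        # global RNG (i.e. whatever transformers.set_seed() configured).
        self.generator = torch.Generator().manual_seed(seed) if seed is not None else None

    def mask_tokens(self, inputs: torch.Tensor, mask_token_id: int) -> Tuple[torch.Tensor, torch.Tensor]:
        labels = inputs.clone()
        probability_matrix = torch.full(labels.shape, self.mlm_probability)
        # torch.bernoulli accepts an explicit generator, so the masking no
        # longer depends on (or disturbs) global RNG state.
        masked_indices = torch.bernoulli(probability_matrix, generator=self.generator).bool()
        labels[~masked_indices] = -100           # compute loss only on masked tokens
        inputs[masked_indices] = mask_token_id   # simplified: always replace with [MASK]
        return inputs, labels
```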

capemox added the Feature request label on Feb 23, 2025
Rocketknight1 (Member) commented

Hi @capemox, yes, this seems like a useful PR! Keeping an internal seed for the collator makes sense, though you might have to be careful: the collator is often called via a dataset/dataloader, so you may need to think about thread/process safety when there are multiple workers.

capemox (Author) commented Feb 25, 2025

Hey @Rocketknight1, thanks for the heads up about process safety! I spent the better part of the day doing some digging, and here's what I found:

  • PyTorch's DataLoader class allows the user to pass a torch.Generator object with a seed. When shuffle=True, this generator determines the order of the rows retrieved from the Dataset object, i.e. the index values passed to its __getitem__ method.
  • The DataLoader also lets one set num_workers as an additional argument: this is where it gets a little complex.
    • For num_workers=0, which is the default, the DataLoader uses only the main process to load data. This means the collate_fn runs within the main Python process itself, so without explicitly setting the seed via my proposed PR or transformers.set_seed(), the function is not deterministic, even with a generator (with the same seed) passed to the DataLoader.
    • For num_workers>0, the DataLoader spins up that many processes for loading the data. But here's the cool part: the DataLoader explicitly calls torch.manual_seed() at the beginning of each worker process. The seed is base_seed + worker_id, where base_seed is shared by all workers and drawn from the generator, and worker_id is simply the index of the worker process. Since the collate_fn is called within each of these subprocesses, the masking by the collate function is deterministic in this case just by passing a generator with the same seed (a toy sketch of both scenarios follows after this list). See here for the code in the worker process that sets the seeds, and here for where the workers are spawned.
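A minimal sketch of the two scenarios described above, using a toy dataset and collate function as stand-ins (not transformers code):

```python
# Toy sketch of the two DataLoader scenarios above; the dataset and collate
# function are stand-ins for a tokenized dataset and the MLM collator.
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        return torch.full((4,), idx, dtype=torch.long)

def masking_collate(batch):
    # Draws from the *global* RNG, like the current collator.
    stacked = torch.stack(batch)
    mask = torch.bernoulli(torch.full(stacked.shape, 0.15)).bool()
    return stacked, mask

# num_workers=0: the generator fixes the shuffle order, but masking_collate
# runs in the main process and uses the unseeded global RNG, so the masks
# change between runs unless a global seed is set.
loader_main = DataLoader(ToyDataset(), batch_size=2, shuffle=True,
                         generator=torch.Generator().manual_seed(42),
                         collate_fn=masking_collate, num_workers=0)

# num_workers>0: each worker process is seeded with base_seed + worker_id,
# where base_seed is drawn from the generator, so both the shuffle order and
# the masks are reproducible across runs.
loader_workers = DataLoader(ToyDataset(), batch_size=2, shuffle=True,
                            generator=torch.Generator().manual_seed(42),
                            collate_fn=masking_collate, num_workers=2)
```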

My proposal

I think explicitly passing a seed for generators in DataCollatorForLanguageModeling is useful in two ways:

  • In the num_workers=0 scenario, passing a seed into the collator would make the masking deterministic without impacting the rest of the script via transformers.set_seed().
  • In the num_workers>0 scenario, the same seed is used by default for both shuffling the dataset and masking. This PR would decouple the two: we could shuffle the dataset in the same way (using the generator passed to the DataLoader) while changing how we mask it (using the seed passed to DataCollatorForLanguageModeling). A hypothetical usage sketch follows after this list.
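A hypothetical usage sketch of that decoupling (again, the seed argument is the proposal, not an existing parameter):

```python
# Hypothetical usage -- `seed=` on DataCollatorForLanguageModeling is the
# proposed argument and does not exist yet.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
examples = [tokenizer(t) for t in ["hello world", "the quick brown fox"]]

# Masking reproducibility would come from the collator's own seed...
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15,
    seed=123,  # proposed argument (hypothetical)
)

# ...while the shuffle order stays controlled by the DataLoader's generator.
loader = DataLoader(examples, batch_size=2, shuffle=True,
                    generator=torch.Generator().manual_seed(42),
                    collate_fn=collator, num_workers=2)
```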

I have a couple of ideas on how this could be implemented, which we can discuss further.
