
Allow setting a seed for DataCollatorForLanguageModeling #36357

Open
capemox opened this issue Feb 23, 2025 · 2 comments
Labels: Feature request

Comments


capemox commented Feb 23, 2025

Feature request

The DataCollatorForLanguageModeling class supports training on an MLM (masked language modeling) objective by randomly masking or replacing certain tokens; models such as BERT and RoBERTa are trained this way. It would be great if the user could set a seed, ensuring repeatability when generating masked batches.

Motivation

This would ensure that repeatable batches of data are generated, which is critical for model reproducibility. Right now there is a form of repeatability with transformers.set_seed(), but one can instead make use of generators (PyTorch, TensorFlow, NumPy) to seed the data collator without globally setting the seed for each framework. The main benefit is that the MLM masking would not be influenced by random-number usage elsewhere in the code, which is good practice: given the same dataset and seed, the masking would be consistent irrespective of the rest of your training script. See this blog post for more details.
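For illustration (not from the original issue), this is roughly what framework-local generators look like in each backend, as opposed to global seeding; how the collator would consume them is the open design question:

```python
# Illustration only: framework-local RNGs, as opposed to global seeding via
# transformers.set_seed(), which touches every random call in the script.
import numpy as np
import tensorflow as tf
import torch

torch_gen = torch.Generator().manual_seed(42)  # accepted by torch.bernoulli/randint via generator=
np_rng = np.random.default_rng(42)             # NumPy's Generator API
tf_gen = tf.random.Generator.from_seed(42)     # TensorFlow's stateful generator
```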

Your contribution

I can submit a PR for this. I have experience with TF, PyTorch and NumPy and would love to contribute. I have taken a look at the code and can add a seed argument that enables the use of generators for repeatability. If not specified, the code would fall back to its previous behavior, including respecting transformers.set_seed().
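As a rough, hypothetical sketch of what the masking side could look like (the seed argument and generator attribute do not exist in DataCollatorForLanguageModeling today, and the masking logic below is heavily simplified):

```python
# Hypothetical sketch only -- `seed` is the proposed argument, not an existing
# DataCollatorForLanguageModeling parameter, and the masking is simplified.
from typing import Optional, Tuple
import torch

class SeededMLMCollatorSketch:
    def __init__(self, mlm_probability: float = 0.15, seed: Optional[int] = None):
        self.mlm_probability = mlm_probability
        # With a seed, keep a private generator; without one, fall back to the
        # global RNG (i.e. whatever transformers.set_seed() configured).
        self.generator = torch.Generator().manual_seed(seed) if seed is not None else None

    def mask_tokens(self, inputs: torch.Tensor, mask_token_id: int) -> Tuple[torch.Tensor, torch.Tensor]:
        labels = inputs.clone()
        probability_matrix = torch.full(labels.shape, self.mlm_probability)
        # torch.bernoulli accepts an explicit generator, so the masking no
        # longer depends on (or disturbs) global RNG state.
        masked_indices = torch.bernoulli(probability_matrix, generator=self.generator).bool()
        labels[~masked_indices] = -100           # compute loss only on masked tokens
        inputs[masked_indices] = mask_token_id   # simplified: always replace with [MASK]
        return inputs, labels
```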

capemox added the Feature request label on Feb 23, 2025
Rocketknight1 (Member) commented

Hi @capemox, yes, this seems like a useful PR! Keeping an internal seed for the collator makes sense, though you might have to be careful: the collator is often called via a dataset/dataloader, so you may need to think about thread/process safety when there are multiple workers.

capemox (Author) commented Feb 25, 2025

Hey @Rocketknight1, thanks for the heads up about process safety! I spent the better part of the day doing some digging, and here's what I found:

  • PyTorch's DataLoader class allows the user to pass a torch.Generator object with a seed. When shuffle=True, this generator determines the order of the rows retrieved from the Dataset object, i.e. the index values passed to its __getitem__ method.
  • The DataLoader also lets one set num_workers as an additional argument: this is where it gets a little complex.
    • For num_workers=0, which is the default, the DataLoader uses only the main process to load data. This means the collate_fn runs within the main Python process itself, so without explicitly setting the seed via my proposed PR or transformers.set_seed(), the function is not deterministic, even with a generator (with the same seed) passed to the DataLoader.
    • For num_workers>0, the DataLoader spins up that many processes for loading the data. But here's the cool part: the DataLoader explicitly calls torch.manual_seed() at the beginning of each worker process. The seed is base_seed + worker_id, where base_seed is shared by all workers and drawn from the generator, and worker_id is simply the index of the worker process. Since the collate_fn is called within each of these subprocesses, the masking by the collate function is deterministic in this case just by passing a generator with the same seed (a toy sketch of both scenarios follows after this list). See here for the code in the worker process that sets the seeds, and here for where the workers are spawned.
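A minimal sketch of the two scenarios described above, using a toy dataset and collate function as stand-ins (not transformers code):

```python
# Toy sketch of the two DataLoader scenarios above; the dataset and collate
# function are stand-ins for a tokenized dataset and the MLM collator.
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __len__(self):
        return 8
    def __getitem__(self, idx):
        return torch.full((4,), idx, dtype=torch.long)

def masking_collate(batch):
    # Draws from the *global* RNG, like the current collator.
    stacked = torch.stack(batch)
    mask = torch.bernoulli(torch.full(stacked.shape, 0.15)).bool()
    return stacked, mask

# num_workers=0: the generator fixes the shuffle order, but masking_collate
# runs in the main process and uses the unseeded global RNG, so the masks
# change between runs unless a global seed is set.
loader_main = DataLoader(ToyDataset(), batch_size=2, shuffle=True,
                         generator=torch.Generator().manual_seed(42),
                         collate_fn=masking_collate, num_workers=0)

# num_workers>0: each worker process is seeded with base_seed + worker_id,
# where base_seed is drawn from the generator, so both the shuffle order and
# the masks are reproducible across runs.
loader_workers = DataLoader(ToyDataset(), batch_size=2, shuffle=True,
                            generator=torch.Generator().manual_seed(42),
                            collate_fn=masking_collate, num_workers=2)
```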

My proposal

I think explicitly passing a seed for generators in DataCollatorForLanguageModeling is useful in two ways:

  • In the num_workers=0 scenario, passing a seed into the collator would make the masking deterministic without impacting the rest of the script via transformers.set_seed().
  • In the num_workers>0 scenario, the same seed is used by default for both shuffling the dataset and masking. This PR would decouple the two: we could shuffle the dataset in the same way (using the generator passed to the DataLoader) while changing how we mask it (using the seed passed to DataCollatorForLanguageModeling). A hypothetical usage sketch follows after this list.
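A hypothetical usage sketch of that decoupling (again, the seed argument is the proposal, not an existing parameter):

```python
# Hypothetical usage -- `seed=` on DataCollatorForLanguageModeling is the
# proposed argument and does not exist yet.
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
examples = [tokenizer(t) for t in ["hello world", "the quick brown fox"]]

# Masking reproducibility would come from the collator's own seed...
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15,
    seed=123,  # proposed argument (hypothetical)
)

# ...while the shuffle order stays controlled by the DataLoader's generator.
loader = DataLoader(examples, batch_size=2, shuffle=True,
                    generator=torch.Generator().manual_seed(42),
                    collate_fn=collator, num_workers=2)
```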

I have a couple of ideas on how this could be implemented, which we can discuss further.
