Allow setting a seed for DataCollatorForLanguageModeling #36357
Hi @capemox, yes, this seems like a useful PR! Keeping an internal seed for the collator makes sense, though you might have to be careful as the collator is often called via a dataset/dataloader, so you might have to think about thread/process safety when there are multiple workers.
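To make the worker-safety concern concrete, here is a minimal sketch (the class and its names are hypothetical, not part of transformers): each DataLoader worker runs in its own process with its own copy of the collator, so an internal generator would need to be created and seeded per worker to stay reproducible without every worker drawing identical mask patterns.

```python
import torch
from torch.utils.data import get_worker_info

class SeededCollator:
    """Hypothetical collator holding an internal torch.Generator."""

    def __init__(self, seed: int):
        self.seed = seed
        self.generator = None  # created lazily, once per worker process

    def _lazy_generator(self):
        if self.generator is None:
            info = get_worker_info()  # None in the main process
            worker_id = info.id if info is not None else 0
            # Offset the seed per worker: each worker stays reproducible,
            # but workers do not all draw identical mask patterns.
            self.generator = torch.Generator().manual_seed(self.seed + worker_id)
        return self.generator

    def __call__(self, examples):
        batch = torch.stack(examples)
        # A seeded random draw of the kind MLM masking performs:
        probs = torch.full(batch.shape, 0.15)
        mask = torch.bernoulli(probs, generator=self._lazy_generator()).bool()
        return {"input_ids": batch, "mask": mask}
```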
Hey @Rocketknight1, thanks for the heads up about process safety! I spent the better part of the day doing some digging, and here's what I found:

My proposal

I think passing a seed explicitly for generators in each framework (PyTorch, TensorFlow, NumPy) is the way to go.
I have a couple of ideas on how this could be implemented, which we can discuss further.
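As a rough sketch of one such idea (the `seed` argument and the wiring below are assumptions for discussion, not the actual transformers implementation), the collator could hold one explicitly seeded generator per framework and fall back to global RNG state when no seed is given:

```python
import numpy as np
import torch

class SeededCollatorSketch:
    """Sketch only: one explicitly seeded generator per framework."""

    def __init__(self, seed=None):
        if seed is not None:
            # Local generators: reproducible masking that is
            # independent of any global RNG state.
            self.torch_generator = torch.Generator().manual_seed(seed)
            self.numpy_generator = np.random.default_rng(seed)
            # TensorFlow would analogously use
            # tf.random.Generator.from_seed(seed).
        else:
            # No seed: leave generators unset so masking falls back to
            # the frameworks' global RNG state (previous behavior,
            # i.e. whatever transformers.set_seed() configured).
            self.torch_generator = None
            self.numpy_generator = None
```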
Feature request

The `DataCollatorForLanguageModeling` class allows training for an MLM (masked language modeling) task, which randomly masks or replaces certain tokens. Models such as BERT and RoBERTa are trained in this manner. It would be great if the user could set a seed, ensuring repeatability in generating masked batches.
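From the user's side, the requested behavior might look like the following (the `seed` keyword is the feature being proposed here, not an existing parameter, so this is illustrative only):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# `seed` is the argument proposed in this issue, not (yet) a real
# parameter of DataCollatorForLanguageModeling:
collator_a = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, seed=42)
collator_b = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, seed=42)

features = [tokenizer("A seeded collator should mask deterministically.")]
# With the proposed change, both collators would produce identical masks:
# collator_a(features)["input_ids"] == collator_b(features)["input_ids"]
```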
Motivation

This would ensure the generation of repeatable batches of data, which is critical for model reproducibility. Right now, there is a form of repeatability with `transformers.set_seed()`, but one can instead use generators (PyTorch, TensorFlow, NumPy) to seed the data collator without globally setting the seed for each framework. The main benefit is that the MLM masking probabilities would not be influenced by code outside the collator, which is good practice. Given the same dataset and seed, the masking would then happen consistently irrespective of the rest of your training script. See this blog post for more details.
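The distinction between global and local seeding can be made concrete with plain PyTorch (the framework calls below are real; only their use inside the collator is the proposal):

```python
import torch
from transformers import set_seed

# Global approach: set_seed(42) reseeds Python, NumPy, and torch
# globally, so any unrelated RNG call elsewhere in the script shifts
# the random stream the collator sees.
set_seed(42)
_ = torch.rand(1)            # unrelated draw somewhere in the script...
global_draw = torch.rand(3)  # ...changes what the collator would get

# Local approach: a dedicated generator isolates the collator's
# randomness from everything else in the training script.
gen = torch.Generator().manual_seed(42)
local_draw = torch.rand(3, generator=gen)  # unaffected by other code
```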
Your contribution

I can submit a PR for this. I have experience with TF, PyTorch, and NumPy, and would love to contribute. I have taken a look at the code and can add a `seed` argument that enables the use of generators for repeatability. If not specified, the code would fall back to its previous behavior, including using `transformers.set_seed()`.