Questions for reproducing/comparing with SpokenWOZ baselines #122
Thank you for being so interested in SpokenWOZ.
|
Thanks for the quick response. For the evaluation, I'm trying to stay as comparable to your baselines as possible, so switching to another evaluation script might introduce changes that prevent this. |
I have now finished the training with the scripts you provided.
Looking at these results, there is still quite a gap to the results reported in the paper (I'll also try to get results on the test set as soon as possible). What am I missing?
|
Since our paper does not report results on dev, we encourage you to try to reproduce our results on test. However, it is worth noting that the dev and test sets differ in distribution as well as in the number of dialogues (500 vs. 1000), so part of this difference is to be expected while the current metrics are all relatively low, e.g., the JGA differs by only 1%.
|
Thanks for providing the base-model. I think I'm getting close to the original numbers now. Unfortunately, when I now analysed the outputs and scores from the training run with your scripts, I found two additional issues:
As these issues make it hard to compare to the numbers from the paper, I now decided to take your advice from earlier and do the evaluation using the MultiWOZ_Evaluation script and rescore the baseline that I get from your training to still be comparable. |
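For reference, here is a minimal sketch of how the standalone MultiWOZ_Evaluation tool is typically driven (the mwzeval import path and the prediction format follow that project's README; the predictions.json file name is just an assumption):

```python
# Minimal sketch: rescoring generated responses with the standalone
# MultiWOZ_Evaluation package (pip install mwzeval). Predictions are a dict
# mapping lowercased dialogue ids to lists of turns with a "response" field;
# verify the exact format against the version of the tool you use.
import json

from mwzeval.metrics import Evaluator

with open("predictions.json") as f:  # hypothetical file holding model outputs
    my_predictions = json.load(f)

e = Evaluator(bleu=True, success=True, richness=False)
results = e.evaluate(my_predictions)
print(results)
```

Rescoring both the baseline outputs and any new model outputs through the same script keeps the comparison consistent even if the absolute numbers shift.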
For different tasks, as in previous works, we used different checkpoints to report the results. |
Hey, thanks for your continuing commitment to the project! I just want to emphasise again that in order for results to remain comparable in the future, a standardised evaluation framework is needed. So it would be great if you could evaluate the new "fixed results" with the MultiWOZ_Evaluation tool as well. One more thing: I saw that you're skipping all |
Sorry for the delay. REFERENCE is the type of requestable slot used to compute the SUCCESS score in MultiWOZ, and it indicates booking availability. In SpokenWOZ, we assume all of the user's bookings are completed successfully (we asked the SYSTEM to make successful bookings during data collection), so we don't need to consider the reference slot in the success computation. |
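To make the difference concrete, here is a rough sketch of that success check (the slot names and data structures are illustrative only, not the actual evaluation code):

```python
# Illustrative sketch: "success" here checks whether every slot the user
# requested was provided by the system, with the MultiWOZ-style "reference"
# slot dropped because bookings are assumed to always succeed in SpokenWOZ.
def dialogue_success(requested_slots, provided_slots, skip_reference=True):
    requested = set(requested_slots)
    if skip_reference:
        requested.discard("reference")
    return requested.issubset(set(provided_slots))

# Hypothetical example: the user asked for address, phone and a booking
# reference; the system only returned the address and phone.
print(dialogue_success({"address", "phone", "reference"}, {"address", "phone"}))  # True
```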
@ArneNx |
Check the scripts at this link: spokenwoz/Finetuning/space_baseline/space-3/scripts/train. Meanwhile, the scripts in the space-3 folder are used for training the text-only baselines. |
@S1s-Z @ArneNx I would really appreciate any input from you as well. |
Also the training script |
The text data in SpokenWOZ is organized in the same way as MultiWOZ, so we kept the comments from the original SPACE code (due to our laziness, we didn't reorganize the SPACE code, which caused your misunderstanding). In our released code, we have already modified the slots and domains in the SPACE code used for training on SpokenWOZ, even though the comments still refer to MultiWOZ. Therefore, you don't need to do any additional data preprocessing. |
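As a quick sanity check that the text data really does follow the MultiWOZ layout, something like the sketch below can be used (the data.json file name and the goal/log/metadata field names follow the usual MultiWOZ conventions and are assumptions about the SpokenWOZ release):

```python
# Sketch: inspect one SpokenWOZ dialogue, assuming the MultiWOZ-style layout
# (a JSON dict keyed by dialogue id, where each dialogue has a "goal" and a
# "log" of alternating user/system turns, with belief states in "metadata").
import json

with open("data.json") as f:  # path is an assumption; adjust to the release
    data = json.load(f)

dial_id, dial = next(iter(data.items()))
print(dial_id, list(dial.keys()))
for turn in dial["log"][:2]:
    print(turn["text"])
```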
@S1s-Z Thanks for your comments. I was able to train the Space-3 model on SpokenWOZ. Could you please answer the following questions for my further experiments?
|
Thanks for your interest.
|
Thanks for the quick response.
|
You mean you used /e2e/infer_space.sh or /policy/infer_space.sh and got a 0% JGA? That is because /e2e/infer_space.sh and /policy/infer_space.sh are designed to compute the INFORM, SUCCESS, and BLEU results for the response generation task, not the JGA result for the DST task. You can check parameters such as USE_TRUE_PREV_BSPN and USE_TRUE_PREV_ASPN in the three scripts. |
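For clarity, joint goal accuracy (JGA) counts a turn as correct only if the complete predicted belief state matches the gold one; a rough sketch follows (representing each state as a set of (domain, slot, value) triples is an assumption made for illustration, not the repository's exact format):

```python
# Sketch of joint goal accuracy (JGA): a turn counts as correct only if the
# full predicted belief state exactly equals the gold belief state.
def joint_goal_accuracy(gold_states, pred_states):
    assert len(gold_states) == len(pred_states)
    correct = sum(
        1
        for gold, pred in zip(gold_states, pred_states)
        if set(gold) == set(pred)  # exact match over (domain, slot, value) triples
    )
    return correct / len(gold_states)

gold = [{("hotel", "area", "north")},
        {("hotel", "area", "north"), ("hotel", "stars", "4")}]
pred = [{("hotel", "area", "north")},
        {("hotel", "area", "north")}]
print(joint_goal_accuracy(gold, pred))  # 0.5
```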
No, I am not using those scripts. And by the way, where can I find the explanation for parameters like |
Hi @S1s-Z, |
Sorry for the delay; we checked the code before we uploaded it and didn't have any problems at the time. But you can refer to our other baselines, for example SPACE-word, and the corresponding model can reproduce the results mentioned above. I will try to reproduce the results based on SPACE in the next few days to check whether there are any mistakes. In the meantime, did you try to use the original SPACE code (https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/space-3) to reproduce the results? You may modify the slots and domains to reproduce the results based on the original SPACE code. |
@S1s-Z Thanks for your response. I will look into the original code of SPACE. |
@S1s-Z I would also be grateful if you could point out the parameter settings in train_space.sh for the DST and dialogue generation tasks so that I can reproduce the results on my side.
Hello,
I am currently trying to evaluate models that I trained on SpokenWOZ in order to compare to the baselines you reported in the paper.
Doing this, I'm currently running into some issues:
- Where can I find an explanation for parameters such as same_eval_as_cambridge and use_true_domain_for_ctr_eval?
- When I train space-word myself, the training runs for a few iterations and then crashes because the npy-file for SNG1724 is missing. I see that the dialog exists in the original data, but it is not preprocessed correctly for some reason. Do you have an explanation for this?
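If it helps with debugging that last point, here is a small sketch for finding which dialogues from the original data never made it into the preprocessed cache (the data.json path, the processed/ directory, and the per-dialogue .npy naming are assumptions for illustration, not the repository's actual layout):

```python
# Sketch: compare dialogue ids in the raw data against the preprocessed
# .npy cache to spot dialogues (like SNG1724) that were silently dropped.
import json
from pathlib import Path

with open("data.json") as f:  # raw SpokenWOZ dialogues (path assumed)
    raw_ids = {d_id.replace(".json", "") for d_id in json.load(f)}

cache_dir = Path("processed")  # hypothetical preprocessing output directory
cached_ids = {p.stem for p in cache_dir.glob("*.npy")}

missing = sorted(raw_ids - cached_ids)
print(f"{len(missing)} dialogues missing from the cache, e.g. {missing[:5]}")
```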