ERROR: No document breaks were found in the input file! These are necessary to allow the script to ensure that random NextSentences are not sampled from the same document. Please add blank lines to indicate breaks between documents in your input file. If your dataset does not contain multiple documents, blank lines can be inserted at any natural boundary, such as the ends of chapters, sections or paragraphs. #24

Open
ChhXiitaa opened this issue Nov 6, 2022 · 2 comments

@ChhXiitaa

Thank you for open-sourcing this project.
I would like to know how to process my own dataset into N-gram.txt.

@GuiminChen
Collaborator

GuiminChen commented Nov 6, 2022 via email

@ChhXiitaa ChhXiitaa reopened this Nov 6, 2022
@shizhediao

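Hello,
Thanks for your interest in our work. There are several ways to build an n-gram dictionary. One option is the PMI method used in this paper (Section 3.1), whose code is also open source: https://aclanthology.org/2021.acl-long.259.pdf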
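
For readers who want a starting point, here is a minimal sketch of one PMI-based way to collect n-grams. It is not the paper's released code and not this repository's preprocessing script. It assumes the corpus is already tokenized into lists of tokens (words, or characters for Chinese), and it treats adjacent tokens whose pointwise mutual information reaches a threshold as belonging to the same n-gram. The function names, the default values, and the tab-separated output layout are all hypothetical; check the format that this repository actually expects for N-gram.txt before using the output.

```python
import math
from collections import Counter


def extract_ngrams_by_pmi(sentences, pmi_threshold=2.0, min_count=2, max_len=5):
    """Collect n-grams by cutting each sentence wherever adjacent-token PMI is low.

    `sentences` must be a list of token lists (it is iterated twice).
    All parameter names and defaults are illustrative assumptions.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())

    def pmi(left, right):
        # PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ), natural logarithm.
        # Only called on pairs observed in the corpus, so counts are nonzero.
        p_pair = bigrams[(left, right)] / total_bi
        p_left = unigrams[left] / total_uni
        p_right = unigrams[right] / total_uni
        return math.log(p_pair / (p_left * p_right))

    counts = Counter()

    def flush(segment):
        # Keep only multi-token segments that are not too long.
        if 2 <= len(segment) <= max_len:
            counts[" ".join(segment)] += 1

    for tokens in sentences:
        if not tokens:
            continue
        segment = [tokens[0]]
        for left, right in zip(tokens, tokens[1:]):
            if pmi(left, right) >= pmi_threshold:
                segment.append(right)  # strong association: extend the n-gram
            else:
                flush(segment)         # weak association: cut here
                segment = [right]
        flush(segment)

    # Drop rare n-grams.
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


def write_ngram_file(ngrams, path="N-gram.txt"):
    """Write one 'ngram<TAB>count' line per entry (hypothetical layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for ngram, count in sorted(ngrams.items(), key=lambda kv: -kv[1]):
            f.write(f"{ngram}\t{count}\n")
```

On a very small corpus PMI values are inflated for rare tokens, so the threshold and `min_count` usually need tuning on the real data. For Chinese text joined without spaces, replace `" ".join(segment)` with `"".join(segment)`.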
