feat: Flag words wrongly split where the parts each happen to also be words #440
Comments
This is something that has to be curated by us (as it used to be with the old [...]). I'll set up a system for this. In the meantime, could you create a list of some word pairs that can always be merged?
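For illustration only, here is a minimal sketch of how a curated always-merge list could drive such a check. It is written in Python for brevity and is not Harper's implementation. The name `ALWAYS_MERGE`, the function `flag_split_words`, and the three pairs in the table are hypothetical placeholders, not the list maintained in this issue.

```python
import re

# Hypothetical always-merge table: illustrative pairs only,
# not the curated list discussed in this issue.
ALWAYS_MERGE = {
    ("my", "self"): "myself",
    ("it", "self"): "itself",
    ("through", "out"): "throughout",
}

WORD = re.compile(r"[A-Za-z]+")


def flag_split_words(text):
    """Return (start, end, suggestion) for each adjacent pair in ALWAYS_MERGE."""
    tokens = list(WORD.finditer(text))
    hits = []
    for left, right in zip(tokens, tokens[1:]):
        # Only consider pairs separated by whitespace alone,
        # so "my. Self" or "through-out" are not flagged.
        gap = text[left.end():right.start()]
        if not gap.isspace():
            continue
        key = (left.group().lower(), right.group().lower())
        merged = ALWAYS_MERGE.get(key)
        if merged is not None:
            hits.append((left.start(), right.end(), merged))
    return hits


if __name__ == "__main__":
    sample = "I did it my self through out the day."
    for start, end, suggestion in flag_split_words(sample):
        print(f"{sample[start:end]!r} -> {suggestion!r}")
```

A plain lookup like this only suits pairs that are wrong in every context. Pairs that are sometimes legitimate would need surrounding context, and the sketch ignores capitalization of the suggestion.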
Exactly. These kinds of checks are probably hard to implement, but they are the kind that are needed, especially if you want a product that can fix the problems that spellcheckers and simplistic grammar checkers leave untouched. Also note that LLMs can already fix these with what looks to me like 99% accuracy. Someone will probably make an LLM-based grammar checker sooner or later, but those will be slow. What Harper has is speed. If Harper had both speed and the ability to flag the hard stuff, it would win.
Those two are related to each other, but are the opposite of what this issue is about. Those join phrases into words; this is about splitting words into phrases. Wrongly joining has been common since even before spellcheckers. Wrongly splitting is much more recent.
I just came back to add a new one I found, and I think the easiest way is to use the growing list in the original post above as the always list. Any others like [...] We could also build a dictionary of multi-word terms from Wiktionary and either just use it to find such cases in a subset of repos or something like that, or run it as a separate extension, a separate tool, a separate slower pass, etc. Wiktionary publishes its data twice a month in a format that is >90% parseable. It includes multi-word terms, including inflected forms. There are various projects around that parse it in different ways, but I haven't watched that space for years. A lot of possibilities anyway. I just had an idea I'll post separately...
There's a whole bunch of words in this class and I've been noticing this kind of mistake more and more lately. I'll try to come back and add more such words as I spot them in the wild. For now I'll start with one:
My browser actually catches this one:
![Image](https://private-user-images.githubusercontent.com/533619/404855031-63ba5569-c452-413f-ae18-c4798d6121c4.png)
I'll maintain a list here, either as I remember them, or spot them in the wild: