feat: Flag words wrongly split where the parts each happen to also be words #440

Open
hippietrail opened this issue Jan 20, 2025 · 5 comments · May be fixed by #608

@hippietrail
Contributor

hippietrail commented Jan 20, 2025

There's a whole bunch of words in this class and I've been noticing this kind of mistake more and more lately. I'll try to come back and add more such words as I spot them in the wild. For now I'll start with one:

  • ... and you some how have to implement the body of the goto statement → ... and you somehow have to implement the body of the goto statement

My browser actually catches this one:
[Image]

I'll maintain a list here, either as I remember them, or spot them in the wild:

  • in tact → intact
  • every where → everywhere
  • surely a misleading thumb nail → surely a misleading thumbnail
  • As Brit, you're not going to up set me → As Brit, you're not going to upset me
@elijah-potter
Collaborator

This is something that has to be curated by us (as it used to be with the old Matcher rule). For example, there fore can always be condensed, but any way cannot.

I'll set up a system for this. In the meantime, could you create a list of some word pairs that can always be merged?
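
A minimal sketch of what such a curated "always merge" check could look like. This is an illustration only, not Harper's actual API: the `ALWAYS_MERGE` table and the `find_split_words` function are hypothetical names, and the entries are taken from the examples in this thread. `any way` is deliberately left out because it is sometimes correct.

```python
import re

# Hypothetical curated table: pairs from this thread that are treated as
# always wrong when written as two words. "any way" is intentionally absent.
ALWAYS_MERGE = {
    ("some", "how"): "somehow",
    ("in", "tact"): "intact",
    ("every", "where"): "everywhere",
    ("thumb", "nail"): "thumbnail",
    ("there", "fore"): "therefore",
}

WORD = re.compile(r"[A-Za-z']+")


def find_split_words(text):
    """Return (start, end, suggestion) for each curated pair found in text."""
    tokens = list(WORD.finditer(text))
    hits = []
    for first, second in zip(tokens, tokens[1:]):
        # Only flag pairs separated by whitespace alone, so that
        # "in. Tact" across a sentence boundary is not reported.
        gap = text[first.end():second.start()]
        if not gap.isspace():
            continue
        key = (first.group().lower(), second.group().lower())
        merged = ALWAYS_MERGE.get(key)
        if merged is not None:
            hits.append((first.start(), second.end(), merged))
    return hits
```

Each hit carries the character span to replace and the suggested merged word, which is roughly the information a lint needs to offer a fix.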

@elijah-potter
Collaborator

I just saw #403 and #410, which seem related.

@hippietrail
Contributor Author

> This is something that has to be curated by us (as it used to be with the old Matcher rule). For example, there fore can always be condensed, but any way cannot.

Exactly. These kinds of checks are probably hard to implement. But they are the kind that are needed. Especially if you want a product that can fix the problems that spellcheckers alone and simplistic grammar checkers just leave untouched.

Also note that LLMs can already fix these with what looks like 99% accuracy to me. Someone will probably make an LLM-based grammar checker sooner or later. But those will be slow. What harper has is speed. If harper had both speed and the ability to flag the hard stuff, it would win.

@hippietrail
Contributor Author

> I just saw #403 and #410, which seem related.

Those two are related to each other, but they are the opposite of what this issue is about. Those are about phrases wrongly joined into single words; this one is about words wrongly split into phrases. Wrong joining has been common since even before spellcheckers. Wrong splitting is much more recent.

@hippietrail
Contributor Author

> I'll set up a system for this. In the meantime, could you create a list of some word pairs that can always be merged?

I just came back to add a new one I found and I think the easiest way is to use the growing list in the original post above as the always list.

Any others like any way will surely each need their own separate lint. When there's lots of them we might see some repeated patterns where we can cover multiple in a single lint.

We could also build a dictionary of multi-word terms from Wiktionary and use it to find such cases, for instance in a subset of repos. It could also run as a separate extension, a separate tool, or a separate slower pass. Wiktionary publishes its data twice a month in a format that is >90% parseable, and it covers multi-word terms, including inflected forms. Various projects parse it in different ways, but I haven't watched that space for years. There are a lot of possibilities anyway. I just had an idea I'll post separately...
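
As a rough sketch of how candidates for the curated list could be generated from any word list (the function name `split_candidates` and the tiny sample vocabulary are made up for illustration; nothing here parses real Wiktionary data): find every word that can be cut into two parts that are themselves words.

```python
def split_candidates(words, min_part=2):
    """Return (left, right, joined) for words whose two halves are also words."""
    vocab = {w.lower() for w in words}
    out = []
    for word in sorted(vocab):
        # Try every cut point that leaves at least min_part letters per side.
        for i in range(min_part, len(word) - min_part + 1):
            left, right = word[:i], word[i:]
            if left in vocab and right in vocab:
                out.append((left, right, word))
    return out


# Hypothetical sample vocabulary built from the examples in this thread.
sample = [
    "somehow", "some", "how",
    "intact", "in", "tact",
    "anyway", "any", "way",
    "upset", "up", "set",
]
candidates = split_candidates(sample)
```

The output would still need human curation: `any way` shows up as a candidate next to `some how`, even though only the latter can always be merged.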

@hippietrail changed the title from "Flag words wrongly split where the parts each happen to also be words" to "feat: Flag words wrongly split where the parts each happen to also be words" on Jan 31, 2025
@elijah-potter linked a pull request on Feb 6, 2025 that will close this issue