feat: Flag words wrongly split where the parts each happen to also be words #440

hippietrail · 2025-01-20T11:18:50Z

There's a whole bunch of words in this class and I've been noticing this kind of mistake more and more lately. I'll try to come back and add more such words as I spot them in the wild. For now I'll start with one:

... and you some how have to implement the body of the goto statement → ... and you somehow have to implement the body of the goto statement

My browser actually catches this one:

I'll maintain a list here, either as I remember them, or spot them in the wild:

in tact → intact
every where → everywhere
surely a misleading thumb nail → surely a misleading thumbnail
As Brit, you're not going to up set me → As Brit, you're not going to upset me

elijah-potter · 2025-01-21T17:31:34Z

This is something that has to be curated by us (as it used to be with the old Matcher rule). For example, there fore can always be condensed, but any way cannot.

I'll set up a system for this. In the meantime, could you create a list of some word pairs that can always be merged?
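
For illustration, here is a minimal sketch of what a curated "always merge" table could look like. It is written in Python rather than Harper's Rust and does not use any Harper API; the names `ALWAYS_MERGE` and `find_split_compounds` are hypothetical. The entries are taken from the list in this issue, and `any way` is deliberately left out because it is sometimes correct as two words.

```python
import re

# Hypothetical curated table: split form -> closed compound.
# Entries come from the examples in this issue. "any way" is intentionally
# absent because it cannot always be merged.
ALWAYS_MERGE = {
    ("some", "how"): "somehow",
    ("in", "tact"): "intact",
    ("every", "where"): "everywhere",
    ("thumb", "nail"): "thumbnail",
    ("up", "set"): "upset",
    ("there", "fore"): "therefore",
}

WORD = re.compile(r"[A-Za-z]+")


def find_split_compounds(text):
    """Return (start, end, suggestion) for each adjacent word pair in the table."""
    words = list(WORD.finditer(text))
    hits = []
    for first, second in zip(words, words[1:]):
        # Only consider pairs separated by whitespace alone, so that
        # "in, tact" or "in-tact" are not treated as a split compound.
        if not text[first.end():second.start()].isspace():
            continue
        key = (first.group().lower(), second.group().lower())
        if key in ALWAYS_MERGE:
            hits.append((first.start(), second.end(), ALWAYS_MERGE[key]))
    return hits


if __name__ == "__main__":
    sample = "you some how have to keep it in tact"
    for start, end, suggestion in find_split_compounds(sample):
        print(f"{sample[start:end]!r} -> {suggestion!r}")
```

The lookup is a single pass over adjacent word pairs, so it stays cheap. The suggestion is always lowercase in this sketch; a real lint would presumably need to preserve the capitalization of the original text.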

elijah-potter · 2025-01-21T21:57:18Z

I just saw #403 and #410, which seem related.

hippietrail · 2025-01-22T06:41:00Z

> This is something that has to be curated by us (as it used to be with the old Matcher rule). For example, there fore can always be condensed, but any way cannot.

Exactly. These kinds of checks are probably hard to implement, but they are the kind that are needed, especially if you want a product that can fix the problems that spellcheckers and simplistic grammar checkers leave untouched.

Also note that LLMs can already fix these with what looks to me like 99% accuracy. Someone will probably make an LLM-based grammar checker sooner or later, but those will be slow. What Harper has is speed. If Harper had both speed and the ability to flag the hard stuff, it would win.

hippietrail · 2025-01-24T09:55:51Z

> I just saw #403 and #410, which seem related.

Those two are related to each other, but are the opposite of what this issue is about. Those are about phrases wrongly joined into words. This one is about words wrongly split into phrases. Wrong joining has been common since even before spellcheckers. Wrong splitting is much more recent.

hippietrail · 2025-01-30T03:52:33Z

> I'll set up a system for this. In the meantime, could you create a list of some word pairs that can always be merged?

I just came back to add a new one I found, and I think the easiest way is to use the growing list in the original post above as the "always" list.

Any others like any way will surely each need their own separate lint. Once there are lots of them, we might see some repeated patterns that let us cover several in a single lint.

We could also build a dictionary of multi-word terms from Wiktionary and use it to find such cases in a subset of repos or something like that, or run it as a separate extension, a separate tool, a separate slower pass, etc. Wiktionary publishes its data twice a month in a format that is >90% parseable. It includes multi-word terms, including inflected forms. There are various projects around that parse it in different ways, but I haven't watched that space for years. A lot of possibilities anyway. I just had an idea I'll post separately...
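
As a rough sketch of what such a slower discovery pass might look like: scan a corpus for adjacent word pairs where both parts and their concatenation are all known words, and count them for later human review. Everything here is an assumption for illustration. `candidate_split_compounds`, `lexicon`, and `known_phrases` are hypothetical names, the tiny sets stand in for data that would have to come from somewhere like a Wiktionary dump, and no particular dump format is assumed.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z]+")


def candidate_split_compounds(text, lexicon, known_phrases=frozenset()):
    """Count adjacent word pairs whose parts and concatenation are all in `lexicon`.

    Pairs listed in `known_phrases` (legitimate multi-word terms) are skipped.
    Punctuation between the two words is ignored in this sketch, so some
    candidates may span a sentence boundary.
    """
    words = [m.group().lower() for m in WORD.finditer(text)]
    counts = Counter()
    for first, second in zip(words, words[1:]):
        if (first, second) in known_phrases:
            continue
        joined = first + second
        if first in lexicon and second in lexicon and joined in lexicon:
            counts[(first, second, joined)] += 1
    return counts


if __name__ == "__main__":
    # Tiny hypothetical stand-ins for a real word list and phrase list.
    lexicon = {"any", "way", "anyway", "thumb", "nail", "thumbnail"}
    text = "surely a misleading thumb nail, any way you look at it"
    for (first, second, joined), n in candidate_split_compounds(text, lexicon).items():
        print(f"{first} {second} -> {joined} (seen {n}x)")
```

The output is only a list of candidates. Pairs like any way would show up too unless they are in `known_phrases`, so the results would still need to be curated by hand before anything goes into an always-merge list.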

elijah-potter added the enhancement, harper-core, and linting labels Jan 21, 2025

hippietrail changed the title ~~Flag words wrongly split where the parts each happen to also be words~~ feat: Flag words wrongly split where the parts each happen to also be words Jan 31, 2025

elijah-potter added a commit that referenced this issue Feb 6, 2025

feat(core): add more closed compounds from #440

faebc9a

elijah-potter linked a pull request Feb 6, 2025 that will close this issue

Closed Compound Matcher Conversions #608

Open
