Is it possible to do a word-by-word diffing? #2
Comments
TL;DR Word-by-word diffs are not currently supported - PRs welcome. To be precise, the current diffs are not strictly char-by-char, as adjacent changes are coalesced using diff_cleanupSemantic, which should be appropriate in most cases. While I'm not planning to implement word-by-word diffs any time soon, I'd definitely welcome a PR that adds this as an option. I can think of 2 ways to achieve it:
@gkubisa Thanks! I spent some time looking into the references you listed. Option #2 looks promising. The only concern seems to be that there is no consistent definition of a "word" in mixed-language text. The word-to-char func is not in diff-match-patch by default as mentioned here. However, I suppose this diff library will not require an absolutely accurate word count, especially when the feature is optional. So one thing I could propose is to count words the way word-count does for CJK languages and build our own word-to-char func that can be used to modify I'm interested in raising a PR. Would like to hear your thoughts first, thanks!
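The word-to-char idea discussed above can be sketched roughly as follows: map every distinct token to a single placeholder character, run an ordinary character diff on the encoded strings, then decode the result back into words. This is only an illustration under assumptions, not the library's implementation. `difflib.SequenceMatcher` from the Python standard library stands in for diff-match-patch's `diff_main`, and `simple_tokenize` and `word_diff` are hypothetical names.

```python
import re
from difflib import SequenceMatcher

def simple_tokenize(text):
    # Hypothetical tokenizer: runs of word characters, runs of whitespace,
    # or single punctuation characters. Joining the tokens restores the text.
    return re.findall(r"\w+|\s+|[^\w\s]", text)

def word_diff(old, new, tokenizer=simple_tokenize):
    """Return a list of (op, text) pairs, where op is -1 (delete),
    0 (equal) or 1 (insert), computed on whole tokens."""
    codes = {}

    def encode(tokens):
        # Give each distinct token one character. The placeholders start at
        # U+F0000 (Supplementary Private Use Area-A), an assumption made here
        # to stay clear of characters that occur in ordinary text.
        return "".join(
            codes.setdefault(tok, chr(0xF0000 + len(codes))) for tok in tokens
        )

    old_enc = encode(tokenizer(old))
    new_enc = encode(tokenizer(new))
    decode = {char: tok for tok, char in codes.items()}

    def restore(encoded):
        return "".join(decode[char] for char in encoded)

    result = []
    # Stand-in for a character diff; autojunk is disabled so that frequent
    # tokens such as spaces are not ignored on long inputs.
    matcher = SequenceMatcher(None, old_enc, new_enc, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            result.append((0, restore(old_enc[i1:i2])))
            continue
        if i2 > i1:
            result.append((-1, restore(old_enc[i1:i2])))
        if j2 > j1:
            result.append((1, restore(new_enc[j1:j2])))
    return result

print(word_diff("I changed some text", "I changed same text"))
```

Because the tokenizer is a parameter, the definition of a "word" can be swapped without touching the diff step.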
Good point about the ambiguity of the "word" definition. Taking this into account, I think a good approach could be to support a new option like Regarding the word-diff itself, given that the algorithm could be provided as an option (see above) and that it could be useful also without As for the details of the word-diff algorithm, IMHO it would be great to support all languages, for example by combining word-count with a regex for all letters in all languages. I'd be ok with some inaccuracies in word boundary detection.
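One possible reading of "combining word-count with a regex for all letters" is a tokenizer that treats each CJK character as its own word and groups letters of other scripts into runs. The sketch below assumes that reading. The CJK ranges are a non-exhaustive guess, and `tokenize_mixed` is a hypothetical name rather than part of any library mentioned in this thread.

```python
import re

# Assumed (non-exhaustive) CJK ranges: Hiragana and Katakana,
# CJK Unified Ideographs Extension A, CJK Unified Ideographs, Hangul syllables.
CJK = "\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af"

TOKEN_RE = re.compile(
    rf"[{CJK}]"          # a single CJK character counts as one word
    rf"|[^\W{CJK}]+"     # a run of non-CJK word characters (any script)
    r"|\s+"              # a run of whitespace
    r"|[^\w\s]"          # any other single character (punctuation, symbols)
)

def tokenize_mixed(text):
    # Every character matches exactly one alternative, so joining the
    # tokens gives back the original text.
    return TOKEN_RE.findall(text)

print(tokenize_mixed("I love 北京 city"))
```

Some inaccuracy remains, in line with the comment above: Python's `\w` does not match combining marks, so scripts that rely on them may be split more finely than a reader would expect.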
Sounds great. As long as textDiff is an option, I believe we can already be unblocked by implementing our own diff logic. A word-diff library is definitely useful. I'll definitely consider it after I experiment with the solution in our product and get enough feedback, given it's unclear whether industry needs can be aligned despite the ambiguity of the "word" definition.
@ipip2005 the PS. Since |
Sure, I will try. Thanks for making the changes.
First of all, just to be clear, by "very specific adjustments" I meant what this function does.
You will likely need the same, or similar, adjustments for the word-diff.
Initially I wanted to do exactly that - allow overriding of the diff_main function only and keep all the post-processing. When I started testing that though, it didn't work properly even with the simplest diff function, which simply reported that all old content was deleted and all new content was inserted. Then I realised that that post-processing would likely be broken in a similar way for word-diffs too. Additionally, it would limit the flexibility of the |
Looks good so far! Make sure you also test adding and removing paragraphs, list items, table rows, table columns, etc. You might find that you need cleanUpNodeMarkers. I have not seen the implementation, so maybe it won't be a problem; however, consider that node markers are represented as characters in the U+E000–U+F8FF range, see https://en.wikipedia.org/wiki/Private_Use_Areas.
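To illustrate the point about node markers, a word tokenizer could keep every character in the U+E000–U+F8FF range as a separate token, so that a marker is never merged into a neighbouring word. This is a minimal sketch that assumes only what the comment states, namely that each marker is a single character in that range. How the library assigns markers is not shown here, and `tokenize_with_markers` is a hypothetical name.

```python
import re

# Basic Multilingual Plane Private Use Area, as referenced in the comment.
PUA_START, PUA_END = 0xE000, 0xF8FF

def is_node_marker(ch):
    # True if the character falls in the assumed node-marker range.
    return PUA_START <= ord(ch) <= PUA_END

TOKEN_RE = re.compile(
    r"[\ue000-\uf8ff]"        # each marker is always its own token
    r"|[^\s\ue000-\uf8ff]+"   # a run of non-space, non-marker characters
    r"|\s+"                   # a run of whitespace
)

def tokenize_with_markers(text):
    # Joining the tokens restores the original text.
    return TOKEN_RE.findall(text)

tokens = tokenize_with_markers("\ue000Hello world\ue001")
print([("MARKER" if is_node_marker(t[0]) else t) for t in tokens])
```

With markers isolated like this, a structural change such as an added paragraph shows up as an inserted or deleted marker token rather than as a change to the adjacent word.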
@ipip2005 Sorry to dig up an old thread, but how did you go with dogfooding your word-diff function? I'd be very interested if you could share a copy!
Hi,
First, I want to thank you for the effort you put into this library. It fits our use case perfectly and we're considering using it in our UI.
While trying it, we found that the results are not very intuitive when comparing real-world paragraphs.
For example, in the following cases, we expected word-by-word diffing instead of character-by-character:
I changed "some" to "same"
I changed from "I love accessibility" to "I have accountability"