Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is it possible to do a word-by-word diffing? #2

Open
ipip2005 opened this issue Sep 18, 2019 · 10 comments
Open

Is it possible to do a word-by-word diffing? #2

ipip2005 opened this issue Sep 18, 2019 · 10 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@ipip2005
Copy link
Contributor

ipip2005 commented Sep 18, 2019

Hi,
First I want to appreciate your effort of making this library. It fits our use case perfectly and we're considering using this library on our UI.
When trying it, we found something that is not quite straightforward in the real world paragraph comparison.
For example, in following cases, we are expecting word-by-word diffing instead of character-by-character:
I changed "some" to "same"
image

I changed from "I love accessibility" to "I have accountability"
image

@gkubisa
Copy link
Contributor

gkubisa commented Sep 23, 2019

TL;DR Word-by-word diffs are not currently supported - PRs welcome.

To be precise, the current diffs are not strictly char-by-char, as adjacent changes are coalesced using diff_cleanupSemantic, which should be appropriate in most cases.

While I'm not planning to implemented word-by-word diffs any time soon, I'd definitely welcome a PR that adds this as an option. I can think of 2 ways to achieve it:

  1. Modify diffText to optionally skip diff_cleanupSemantic and call a different, new function, which would convert the char-by-char diff to a word-by-word diff. Unfortunately, diff-match-patch does not provide it.
  2. Modify diffText to optionally generate word-by-word diffs using the technique described in Line or Word Diffs.

@gkubisa gkubisa added enhancement New feature or request help wanted Extra attention is needed labels Sep 23, 2019
@ipip2005
Copy link
Contributor Author

ipip2005 commented Sep 25, 2019

@gkubisa Thanks! I spent sometime looking into the references you listed. The #2 looks promising. The only concerns seems to be the inconsistent understanding towards "word" in a mixed language scenario. The word-to-char func is not in diff-match-patch by default as mentioned here.

However, I suppose this diff library will not require an absolute accurate word count especially when it's optional. So one thing I could propose is use the way how word-count counts the words in respect of CJK languages and build our own word-to-char func that can be used to modify diffText to do word diff

I'm interested in raising a PR. Would like to hear your thoughts first, thanks!

@gkubisa
Copy link
Contributor

gkubisa commented Sep 26, 2019

Good point about the ambiguity of the "word" definition. Taking this into the account, I think a good approach could be to support a new option like textDiff(text1: string, text2: string): Diff[], which would default to diff_main from diff-match-patch. This way users would have full control over the diff algorithm, so that they could use their favourite word-diff implementation, case insensitive diff, or whatever else they want. I can implement that option myself soon, if it's going to work for you - just let me know.

Regarding the word-diff itself, given that the algorithm could be provided as an option (see above) and that it could be useful also without visual-dom-diff (for plain-text diffs), I think it might be best to implement it in a separate GitHub repo and publish it as a separate node module. Once ready, I could link to it from the visual-dom-diff's README.md. Does it sound good to you?

As for the details of the word-diff algorithm, IMHO it would be great to support all languages, for example by combining word-count with a regex for all letters in all languages. I'd be ok with some inaccuracies in word boundary detection.

@ipip2005
Copy link
Contributor Author

Sounds great. As long as textDiff is an option, I believe we can already be unblocked by implementing our own diff logic.

A word-diff library is definitely useful. I'll definitely consider it after I experiment the solution in our product to get enough feedback given it's unclear whether the industrial needs can be aligned on the ambiguity of the "word" definition.

@gkubisa gkubisa added help wanted Extra attention is needed and removed help wanted Extra attention is needed labels Sep 27, 2019
@gkubisa
Copy link
Contributor

gkubisa commented Sep 27, 2019

@ipip2005 the diffText option is available in version 0.7.0 and should provide enough flexibility to implement word-by-word diffs. Let me know, if you come across any issues.

PS. Since visual-dom-diff requires some very specific adjustments to what diff-match-patch produces by default (in order to improve the DOM diff quality), I think the best place for your "word" diff function might be in visual-dom-diff after all, assuming that you decide to open source it. Please open a PR, if you'd like to add that function.

@ipip2005
Copy link
Contributor Author

Sure, I will try. Thanks for making the changes.
One question regarding the "very specific adjustments": are they specified to character diff or they are actually them same with word diff? If it's the later, is it possible to only make diff_main function optional instead the whole diff function that includes the adjustments?

@gkubisa
Copy link
Contributor

gkubisa commented Sep 30, 2019

First of all, just to be clear, by "very specific adjustments" I meant what this function does.

Are they specified to character diff or they are actually them same with word diff?

You will likely need the same, or similar, adjustments for the word-diff.

If it's the later, is it possible to only make diff_main function optional instead the whole diff function that includes the adjustments?

Initially I wanted to do exactly that - allow overriding of the diff_main function only and keep all the post-processing. When I started testing that though, it didn't work properly even with the simplest diff function, which simply reported that all old content was deleted and all new content was inserted. Then I realised that that post-processing would likely be broken in a similar way for word-diffs too. Additionally, it would limit the flexibility of the diffText option, as it would not give full control over the text diff. Based on all of the above, I decided that the most sensible and flexible approach was to let the diffText option override the entire plain-text diff function, including all pre and post-processing.

@ipip2005
Copy link
Contributor Author

ipip2005 commented Oct 1, 2019

Thanks for the details. I think I've made some progress yesterday. It's kind of working in my test environment
image

And you're right the adjustments and clean-up should be modified in order to achieve word diffing. Just for example, diff_cleanupMerge and diff_cleanupSemantic will try to move the common string in the insertion and deletion to an equality nearby which breaks down word diffing. Actually I removed all the code of post-processing, and only used the raw diff as an output in order to get the results above.

I'm not sure if it has potential perf concerns, but I would treat as a good start point. My plan is dogfooding the customized word-diff in our product and gathering feedback, then I'll send a PR here. The time range for dogfooding should be 1-2 weeks. Please let me know if you have any concerns.

@gkubisa
Copy link
Contributor

gkubisa commented Oct 2, 2019

Looks good so far!

Make sure you also test adding and removing paragraphs, list items, table rows, table columns, etc - you might find that you need cleanUpNodeMarkers.

I have not seen the implementation, so maybe it won't be a problem, however, consider that node markers are represented as characters in the U+E000–U+F8FF range, see https://en.wikipedia.org/wiki/Private_Use_Areas.

@tobyzerner
Copy link

@ipip2005 Sorry to dig up an old thread, but how did you go with dogfooding your word-diff function? I'd be very interested if you could share a copy!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants