Inconsistent results in heuristic_split #3

cs-wangchong · 2021-09-23T07:05:31Z

Fix a bug

The bug is that Ronin may split the same identifier in different ways, depending on the iteration order of the terms in the set common_terms_with_numbers.

Reproduction

I added md5sum to the set common_terms_with_numbers and then ran ronin.split("md5sum") several times.
The splitting results were sometimes ["md5sum"] and sometimes ["md5", "sum"].

Reason & Solution

I checked the code and found that the heuristic_split function in simple_splitters.py relies on the regular expression _exceptions_re.
_exceptions_re is generated from common_terms_with_numbers without taking the order of the terms in the set into account.
This means that if "md5" comes before "md5sum" in _exceptions_re, the split result is ["md5", "sum"]; if "md5sum" comes before "md5", the split result is ["md5sum"].
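The order dependence can be shown in isolation, without Ronin. The sketch below builds the same two-term alternation in both orders; the names short_first, long_first, and first_match are made up for this illustration and are not part of Ronin. Python's re module tries alternatives from left to right and stops at the first one that matches. In CPython, string hashes are randomized per process by default, which is a plausible reason the set order (and so the pattern) changes between runs.

```python
import re

# The same two terms, joined in the two possible orders.
# (Lists are used here so the order is fixed; a set would not guarantee it.)
short_first = re.compile(r'(' + '|'.join(['md5', 'md5sum']) + ')', re.I)
long_first = re.compile(r'(' + '|'.join(['md5sum', 'md5']) + ')', re.I)


def first_match(pattern, text):
    """Return the text of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None


# Alternation is ordered: the leftmost alternative that matches wins.
print(first_match(short_first, 'md5sum'))  # md5
print(first_match(long_first, 'md5sum'))   # md5sum
```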

Solution: Sort the terms by length, longest first, when generating _exceptions_re.

_exceptions_re = re.compile(r'(' + '|'.join(sorted(common_terms_with_numbers, key=len, reverse=True)) + ')', re.I)
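For anyone who wants to try the idea outside the library, here is a minimal standalone sketch. The helper name build_exceptions_re and the sample terms are hypothetical. It also makes two choices that go beyond the one-line fix above and are assumptions on my part: ties between equal-length terms are broken alphabetically so the pattern text is identical on every run, and each term is passed through re.escape, which would change behavior if any term is meant to contain regex syntax.

```python
import re


def build_exceptions_re(terms):
    """Hypothetical helper: build a longest-first alternation from a set.

    Sorting by (-length, term) means a longer term always precedes any
    shorter term that is its prefix, and the result does not depend on
    the iteration order of the input set.
    """
    ordered = sorted(terms, key=lambda t: (-len(t), t))
    # re.escape is an assumption: it treats every term as literal text.
    return re.compile('(' + '|'.join(re.escape(t) for t in ordered) + ')', re.I)


terms = {'md5', 'md5sum', 'sha256'}  # sample terms, for illustration only
pattern = build_exceptions_re(terms)

print(pattern.pattern)                   # (md5sum|sha256|md5)
print(pattern.match('md5sum').group(0))  # md5sum
```

With this ordering, "md5sum" is always tried before "md5", so the match no longer depends on how the set happens to be iterated.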


mhucka · 2022-07-17T03:59:52Z

Thank you for this, and my apologies for taking so long to reply. I think your solution (sorting by length) sounds like a good idea. I want to run some tests first but it does sound like this will be an improvement.

mhucka added the enhancement label Jul 17, 2022
