This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

Inconsistent results in heuristic_split #3

Open
cs-wangchong opened this issue Sep 23, 2021 · 1 comment

Fix a bug

The bug is that Ronin may split the same identifier differently from run to run, depending on the order in which terms are iterated in the set common_terms_with_numbers.

Reproduction

I added "md5sum" to the set common_terms_with_numbers and then ran ronin.split("md5sum") several times.
The result was sometimes ["md5sum"] and sometimes ["md5", "sum"].

Reason & Solution

I checked the code and found that the heuristic_split function in simple_splitters.py relies on the regular expression _exceptions_re.
_exceptions_re is generated from common_terms_with_numbers without taking the order of terms in the set into account.
Python's re module tries the alternatives of a pattern from left to right and takes the first one that matches, so if "md5" comes before "md5sum" in _exceptions_re, the split result is ["md5", "sum"]; if "md5sum" comes before "md5", the split result is ["md5sum"].
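The ordering effect can be seen in isolation with a minimal sketch. It does not use Ronin itself; first_match is a hypothetical helper, and the pattern is only assumed to be built the same way as _exceptions_re (terms joined with "|" inside one group).

```python
import re

def first_match(terms, identifier):
    """Return the exception term matched at the start of identifier, if any."""
    # Join the terms in the order given, as a plain alternation.
    pattern = re.compile("(" + "|".join(terms) + ")", re.I)
    m = pattern.match(identifier)
    return m.group(1) if m else None

# Same terms, different order: the leftmost alternative that matches wins.
short_first = first_match(["md5", "md5sum"], "md5sum")
long_first = first_match(["md5sum", "md5"], "md5sum")

print(short_first)
print(long_first)
```

Because a set of strings has no guaranteed iteration order across interpreter runs, either ordering can end up in the compiled pattern.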

Solution: Sort the terms by length, longest first, when generating _exceptions_re.

_exceptions_re = re.compile(r'(' + '|'.join(sorted(common_terms_with_numbers, key=lambda term: len(term), reverse=True)) + ')', re.I)
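A quick way to check that length-based sorting removes the order dependence is sketched below. build_exceptions_re and sample_terms are hypothetical names for illustration, and "sha1" is just an example term, not necessarily one from Ronin's set. The re.escape call and the alphabetical tie-breaker are additions that are not part of the proposed one-liner; they are assumptions made here so that the generated pattern is fully deterministic.

```python
import re
from itertools import permutations

def build_exceptions_re(terms):
    """Build a case-insensitive alternation with the longest terms first."""
    # Longest first; ties broken alphabetically so the pattern text is stable.
    ordered = sorted(terms, key=lambda term: (-len(term), term))
    # re.escape guards against terms containing regex metacharacters.
    return re.compile("(" + "|".join(re.escape(t) for t in ordered) + ")", re.I)

sample_terms = ["md5", "md5sum", "sha1"]

# Try every input ordering and collect the distinct outcomes.
patterns = {build_exceptions_re(p).pattern for p in permutations(sample_terms)}
matches = {
    build_exceptions_re(p).match("md5sum").group(1)
    for p in permutations(sample_terms)
}

print(patterns)
print(matches)
```

Every input ordering should yield the same pattern text, and "md5sum" should always be matched as a whole term.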

mhucka commented Jul 17, 2022

Thank you for this, and my apologies for taking so long to reply. I think your solution (sorting by length) sounds like a good idea. I want to run some tests first but it does sound like this will be an improvement.
