This repository has been archived by the owner on Jul 22, 2024. It is now read-only.
The bug is that Ronin may split the same identifier inconsistently, depending on the order of the terms in the common_terms_with_numbers set.
Reproduction
I added md5sum to the common_terms_with_numbers set and then ran ronin.split("md5sum") several times.
The splitting results were sometimes ["md5sum"] and sometimes ["md5", "sum"].
Reason & Solution
I checked the code and found that the heuristic_split function in simple_splitters.py relies on the regular expression _exceptions_re.
The _exceptions_re is generated from common_terms_with_numbers without considering term order in the set.
This means that if "md5" comes before "md5sum" in _exceptions_re, the split result is ["md5", "sum"]; if "md5sum" comes before "md5", the split result is ["md5sum"].
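The order dependence can be reproduced with Python's re module alone. This is a minimal sketch that assumes _exceptions_re is a plain alternation of the terms; the helper first_match is hypothetical and is not part of Ronin.

```python
import re

def first_match(terms, text):
    # Hypothetical helper: join the terms into an alternation in the
    # given order. The real _exceptions_re may be built differently.
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    m = pattern.match(text)
    return m.group(0) if m else None

# Alternatives are tried left to right, so the first one that matches wins.
print(first_match(["md5", "md5sum"], "md5sum"))   # md5
print(first_match(["md5sum", "md5"], "md5sum"))   # md5sum
```

Since a Python set of strings has no guaranteed iteration order, either ordering can occur.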
Solution: Sort the terms by length, longest first, when generating _exceptions_re, so that a longer term such as "md5sum" is always tried before its prefix "md5".
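A sketch of the proposed fix, assuming the pattern is a simple alternation of the terms. The function name build_exceptions_re and the extra term "sha256" are made up for illustration; this is not Ronin's actual code.

```python
import re

def build_exceptions_re(terms):
    # Longest terms first, so "md5sum" is tried before its prefix "md5".
    # Ties are broken alphabetically to make the pattern fully deterministic.
    ordered = sorted(terms, key=lambda t: (-len(t), t))
    return re.compile("|".join(re.escape(t) for t in ordered))

# Stand-in for common_terms_with_numbers (illustrative contents only).
terms = {"md5", "md5sum", "sha256"}
pattern = build_exceptions_re(terms)

print(pattern.pattern)                    # md5sum|sha256|md5
print(pattern.match("md5sum").group(0))   # md5sum
```

With this ordering the generated pattern no longer depends on the set's iteration order, so the same input always yields the same match.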
Thank you for this, and my apologies for taking so long to reply. I think your solution (sorting by length) sounds like a good idea. I want to run some tests first but it does sound like this will be an improvement.
Fix a bug