-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pack-objects: Create an alternative name hash algorithm (recreated) #1823
base: master
Are you sure you want to change the base?
Changes from 1 commit
68b4127
7ee1845
a4f3d12
a507be2
a0247a6
06a9506
bab2ac3
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name) | |
return hash; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Jonathan Tan wrote (reply to this): "Jonathan Tan via GitGitGadget" <[email protected]> writes:
> diff --git a/pack-objects.h b/pack-objects.h
> index b9898a4e64b..15be8368d21 100644
> --- a/pack-objects.h
> +++ b/pack-objects.h
> @@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name)
> return hash;
> }
>
> +static inline uint32_t pack_name_hash_v2(const char *name)
> +{
> + uint32_t hash = 0, base = 0, c;
> +
> + if (!name)
> + return 0;
> +
> + while ((c = *name++)) {
> + if (isspace(c))
> + continue;
> + if (c == '/') {
> + base = (base >> 6) ^ hash;
> + hash = 0;
> + } else {
> + /*
> + * 'c' is only a single byte. Reverse it and move
> + * it to the top of the hash, moving the rest to
> + * less-significant bits.
> + */
> + c = (c & 0xF0) >> 4 | (c & 0x0F) << 4;
> + c = (c & 0xCC) >> 2 | (c & 0x33) << 2;
> + c = (c & 0xAA) >> 1 | (c & 0x55) << 1;
> + hash = (hash >> 2) + (c << 24);
> + }
> + }
> + return (base >> 6) ^ hash;
> +}
This works because `c` is masked before any arithmetic is performed on
it, but the cast from potentially signed char to uint32_t still makes
me nervous - if char is signed, it behaves as if it was first cast to
int32_t and only then to uint32_t, as you can see from running the code
below:
#include <stdio.h>
int main() {
signed char c = 128;
unsigned int u = c;
printf("hello %u\n", u);
return 0;
}
I would declare `c` as uint8_t or unsigned char. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Junio C Hamano wrote (reply to this): Jonathan Tan <[email protected]> writes:
> "Jonathan Tan via GitGitGadget" <[email protected]> writes:
>> diff --git a/pack-objects.h b/pack-objects.h
>> index b9898a4e64b..15be8368d21 100644
>> --- a/pack-objects.h
>> +++ b/pack-objects.h
>> @@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name)
>> return hash;
>> }
>>
>> +static inline uint32_t pack_name_hash_v2(const char *name)
>> +{
>> + uint32_t hash = 0, base = 0, c;
>> + ...
>> + while ((c = *name++)) {
>> + ...
>> + /*
>> + * 'c' is only a single byte. Reverse it and move
>> + * it to the top of the hash, moving the rest to
>> + * less-significant bits.
>> + */
>> + c = (c & 0xF0) >> 4 | (c & 0x0F) << 4;
>> + ...
> This works because `c` is masked before any arithmetic is performed on
> it, but the cast from potentially signed char to uint32_t still makes
> me nervous - if char is signed, it behaves as if it was first cast to
> int32_t and only then to uint32_t, ...
> I would declare `c` as uint8_t or unsigned char.
I think you meant the type of "name", and your worry is that *name
may pick up a negative integer from there when the platform char is
signed? Here we are dealing with a pathname that has often UTF-8
encoded non-ASCII letters with 8th bit set, and I agree with you
that being explicitly unsigned would certainly help reduce the
cognitive load.
Thanks.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On the Git mailing list, Taylor Blau wrote (reply to this): On Fri, Dec 20, 2024 at 05:19:47PM +0000, Jonathan Tan via GitGitGadget wrote:
> The first change is to be more careful about paths using non-ASCII
> characters. With these characters in mind, reverse the bits in the byte
> as the least-significant bits have the highest entropy and we want to
> maximize their influence. This is done with some bit manipulation that
> swaps the two halves, then the quarters within those halves, and then
> the bits within those quarters.
Makes sense, and seems quite reasonable.
> The second change is to perform hash composition operations at every
> level of the path. This is done by storing a 'base' hash value that
> contains the hash of the parent directory. When reaching a directory
> boundary, we XOR the current level's name-hash value with a downshift of
> the previous level's hash. This perturbation intends to create low-bit
> distinctions for paths with the same final 16 bytes but distinct parent
> directory structures.
Very clever, I love this idea.
Thanks,
Taylor |
||
} | ||
|
||
static inline uint32_t pack_name_hash_v2(const unsigned char *name) | ||
{ | ||
uint32_t hash = 0, base = 0, c; | ||
|
||
if (!name) | ||
return 0; | ||
|
||
while ((c = *name++)) { | ||
if (isspace(c)) | ||
continue; | ||
if (c == '/') { | ||
base = (base >> 6) ^ hash; | ||
hash = 0; | ||
} else { | ||
/* | ||
* 'c' is only a single byte. Reverse it and move | ||
* it to the top of the hash, moving the rest to | ||
* less-significant bits. | ||
*/ | ||
c = (c & 0xF0) >> 4 | (c & 0x0F) << 4; | ||
c = (c & 0xCC) >> 2 | (c & 0x33) << 2; | ||
c = (c & 0xAA) >> 1 | (c & 0x55) << 1; | ||
hash = (hash >> 2) + (c << 24); | ||
} | ||
} | ||
return (base >> 6) ^ hash; | ||
} | ||
|
||
static inline enum object_type oe_type(const struct object_entry *e) | ||
{ | ||
return e->type_valid ? e->type_ : OBJ_BAD; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, karthik nayak wrote (reply to this):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, Junio C Hamano wrote (reply to this):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the Git mailing list, karthik nayak wrote (reply to this):