pack-objects: Create an alternative name hash algorithm (recreated) #1823

gitgitgadget · 2024-12-04T20:09:34Z

On the Git mailing list, karthik nayak wrote (reply to this):

"Jonathan Tan via GitGitGadget" <[email protected]> writes: [snip] > diff --git a/pack-objects.h b/pack-objects.h > index b9898a4e64b..15be8368d21 100644 > --- a/pack-objects.h > +++ b/pack-objects.h > @@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name) > return hash; > } > > +static inline uint32_t pack_name_hash_v2(const char *name) > +{ > + uint32_t hash = 0, base = 0, c; > + > + if (!name) > + return 0; > + > + while ((c = *name++)) { > + if (isspace(c)) > + continue; > + if (c == '/') { > + base = (base >> 6) ^ hash; Just two questions about the implementation for my own knowledge. 1. Why use '6' here? I couldn't understand the reasoning for the specific value. 2. Generally in hashing algorithms the XOR is used to ensure that the output distribution is uniform which reduces collisions. Here, as you noted, we're more finding values for sorting rather than hashing in the traditional sense. So why use an XOR? > + hash = 0; > + } else { > + /* > + * 'c' is only a single byte. Reverse it and move > + * it to the top of the hash, moving the rest to > + * less-significant bits. > + */ > + c = (c & 0xF0) >> 4 | (c & 0x0F) << 4; > + c = (c & 0xCC) >> 2 | (c & 0x33) << 2; > + c = (c & 0xAA) >> 1 | (c & 0x55) << 1; > + hash = (hash >> 2) + (c << 24); > + } > + } > + return (base >> 6) ^ hash; > +} > + > static inline enum object_type oe_type(const struct object_entry *e) > { > return e->type_valid ? e->type_ : OBJ_BAD; > -- > gitgitgadget

On the Git mailing list, Junio C Hamano wrote (reply to this):

karthik nayak <[email protected]> writes: > 2. Generally in hashing algorithms the XOR is used to ensure that the > output distribution is uniform which reduces collisions. Here, as you > noted, we're more finding values for sorting rather than hashing in the > traditional sense. So why use an XOR? I am not Jonathan, but since the mixing-of-bits is done with XOR in the original that Linus and I wrote, I think the question applies to our version as well. We prefer not to lose entropy from input bytes, so we do not want to use OR or AND, which tend to paint everything with 1 or 0 as you mix in bits from more bytes. Anyway the question sounds like "generally when you take tests you write with a pencil, but right now you are merely taking notes and not taking any tests. why are you writing with a pencil?" There are multiple occasions that a pencil is a great fit as writing equipment.

On the Git mailing list, karthik nayak wrote (reply to this):

Junio C Hamano <[email protected]> writes: > karthik nayak <[email protected]> writes: > >> 2. Generally in hashing algorithms the XOR is used to ensure that the >> output distribution is uniform which reduces collisions. Here, as you >> noted, we're more finding values for sorting rather than hashing in the >> traditional sense. So why use an XOR? > > I am not Jonathan, but since the mixing-of-bits is done with XOR in > the original that Linus and I wrote, I think the question applies to > our version as well. We prefer not to lose entropy from input > bytes, so we do not want to use OR or AND, which tend to paint > everything with 1 or 0 as you mix in bits from more bytes. > Thanks for the answering. My reasoning at the start was more of my thoughts on why XOR could be used here and your explanation makes it clearer. > Anyway the question sounds like "generally when you take tests you > write with a pencil, but right now you are merely taking notes and > not taking any tests. why are you writing with a pencil?" There are > multiple occasions that a pencil is a great fit as writing equipment. I'd say my questions was more of "I know a pencil is used when taking tests because of XYZ reasons. On similar lines, why is a pencil used here?" :)

gitgitgadget · 2024-12-09T23:17:59Z

On the Git mailing list, Jonathan Tan wrote (reply to this):

"Jonathan Tan via GitGitGadget" <[email protected]> writes: > diff --git a/pack-objects.h b/pack-objects.h > index b9898a4e64b..15be8368d21 100644 > --- a/pack-objects.h > +++ b/pack-objects.h > @@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name) > return hash; > } > > +static inline uint32_t pack_name_hash_v2(const char *name) > +{ > + uint32_t hash = 0, base = 0, c; > + > + if (!name) > + return 0; > + > + while ((c = *name++)) { > + if (isspace(c)) > + continue; > + if (c == '/') { > + base = (base >> 6) ^ hash; > + hash = 0; > + } else { > + /* > + * 'c' is only a single byte. Reverse it and move > + * it to the top of the hash, moving the rest to > + * less-significant bits. > + */ > + c = (c & 0xF0) >> 4 | (c & 0x0F) << 4; > + c = (c & 0xCC) >> 2 | (c & 0x33) << 2; > + c = (c & 0xAA) >> 1 | (c & 0x55) << 1; > + hash = (hash >> 2) + (c << 24); > + } > + } > + return (base >> 6) ^ hash; > +} This works because `c` is masked before any arithmetic is performed on it, but the cast from potentially signed char to uint32_t still makes me nervous - if char is signed, it behaves as if it was first cast to int32_t and only then to uint32_t, as you can see from running the code below: #include <stdio.h> int main() { signed char c = 128; unsigned int u = c; printf("hello %u\n", u); return 0; } I would declare `c` as uint8_t or unsigned char.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jonathan Tan <[email protected]> writes: > "Jonathan Tan via GitGitGadget" <[email protected]> writes: >> diff --git a/pack-objects.h b/pack-objects.h >> index b9898a4e64b..15be8368d21 100644 >> --- a/pack-objects.h >> +++ b/pack-objects.h >> @@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name) >> return hash; >> } >> >> +static inline uint32_t pack_name_hash_v2(const char *name) >> +{ >> + uint32_t hash = 0, base = 0, c; >> + ... >> + while ((c = *name++)) { >> + ... >> + /* >> + * 'c' is only a single byte. Reverse it and move >> + * it to the top of the hash, moving the rest to >> + * less-significant bits. >> + */ >> + c = (c & 0xF0) >> 4 | (c & 0x0F) << 4; >> + ... > This works because `c` is masked before any arithmetic is performed on > it, but the cast from potentially signed char to uint32_t still makes > me nervous - if char is signed, it behaves as if it was first cast to > int32_t and only then to uint32_t, ... > I would declare `c` as uint8_t or unsigned char. I think you meant the type of "name", and your worry is that *name may pick up a negative integer from there when the platform char is signed? Here we are dealing with a pathname that has often UTF-8 encoded non-ASCII letters with 8th bit set, and I agree with you that being explicitly unsigned would certainly help reduce the cognitive load. Thanks.

gitgitgadget · 2025-01-22T22:11:42Z

On the Git mailing list, Taylor Blau wrote (reply to this):

On Fri, Dec 20, 2024 at 05:19:47PM +0000, Jonathan Tan via GitGitGadget wrote: > The first change is to be more careful about paths using non-ASCII > characters. With these characters in mind, reverse the bits in the byte > as the least-significant bits have the highest entropy and we want to > maximize their influence. This is done with some bit manipulation that > swaps the two halves, then the quarters within those halves, and then > the bits within those quarters. Makes sense, and seems quite reasonable. > The second change is to perform hash composition operations at every > level of the path. This is done by storing a 'base' hash value that > contains the hash of the parent directory. When reaching a directory > boundary, we XOR the current level's name-hash value with a downshift of > the previous level's hash. This perturbation intends to create low-bit > distinctions for paths with the same final 16 bytes but distinct parent > directory structures. Very clever, I love this idea. Thanks, Taylor

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pack-objects: Create an alternative name hash algorithm (recreated) #1823

Diff view

Diff view

There are no files selected for viewing

gitgitgadget bot Dec 4, 2024

gitgitgadget bot Dec 4, 2024

gitgitgadget bot Dec 5, 2024

gitgitgadget bot Dec 9, 2024

gitgitgadget bot Dec 10, 2024

gitgitgadget bot Jan 22, 2025

-Original file line number
+Diff line change
@@ Expand Up / @@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name) @@
     	return hash;
     }
+    static inline uint32_t pack_name_hash_v2(const unsigned char *name)
+    {
+    	uint32_t hash = 0, base = 0, c;
+    	if (!name)
+    		return 0;
+    	while ((c = *name++)) {
+    		if (isspace(c))
+    			continue;
+    		if (c == '/') {
+    			base = (base >> 6) ^ hash;
+    			hash = 0;
+    		} else {
+    			/*
+    			 * 'c' is only a single byte. Reverse it and move
+    			 * it to the top of the hash, moving the rest to
+    			 * less-significant bits.
+    			 */
+    			c = (c & 0xF0) >> 4 | (c & 0x0F) << 4;
+    			c = (c & 0xCC) >> 2 | (c & 0x33) << 2;
+    			c = (c & 0xAA) >> 1 | (c & 0x55) << 1;
+    			hash = (hash >> 2) + (c << 24);
+    		}
+    	}
+    	return (base >> 6) ^ hash;
+    }
     static inline enum object_type oe_type(const struct object_entry *e)
     {
     	return e->type_valid ? e->type_ : OBJ_BAD;
@@ Expand Down @@

pack-objects: Create an alternative name hash algorithm (recreated) #1823

Are you sure you want to change the base?

pack-objects: Create an alternative name hash algorithm (recreated) #1823

Diff view

Diff view

There are no files selected for viewing

gitgitgadget bot Dec 4, 2024

Choose a reason for hiding this comment

gitgitgadget bot Dec 4, 2024

Choose a reason for hiding this comment

gitgitgadget bot Dec 5, 2024

Choose a reason for hiding this comment

gitgitgadget bot Dec 9, 2024

Choose a reason for hiding this comment

gitgitgadget bot Dec 10, 2024

Choose a reason for hiding this comment

gitgitgadget bot Jan 22, 2025

Choose a reason for hiding this comment