Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pack-objects: Create an alternative name hash algorithm (recreated) #1823

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
28 changes: 28 additions & 0 deletions pack-objects.h
Original file line number Diff line number Diff line change
Expand Up @@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name)
return hash;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, karthik nayak wrote (reply to this):

"Jonathan Tan via GitGitGadget" <[email protected]> writes:

[snip]

> diff --git a/pack-objects.h b/pack-objects.h
> index b9898a4e64b..15be8368d21 100644
> --- a/pack-objects.h
> +++ b/pack-objects.h
> @@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name)
>  	return hash;
>  }
>
> +static inline uint32_t pack_name_hash_v2(const char *name)
> +{
> +	uint32_t hash = 0, base = 0, c;
> +
> +	if (!name)
> +		return 0;
> +
> +	while ((c = *name++)) {
> +		if (isspace(c))
> +			continue;
> +		if (c == '/') {
> +			base = (base >> 6) ^ hash;

Just two questions about the implementation for my own knowledge.
1. Why use '6' here? I couldn't understand the reasoning for the
specific value.
2. Generally in hashing algorithms the XOR is used to ensure that the
output distribution is uniform which reduces collisions. Here, as you
noted, we're more finding values for sorting rather than hashing in the
traditional sense. So why use an XOR?

> +			hash = 0;
> +		} else {
> +			/*
> +			 * 'c' is only a single byte. Reverse it and move
> +			 * it to the top of the hash, moving the rest to
> +			 * less-significant bits.
> +			 */
> +			c = (c & 0xF0) >> 4 | (c & 0x0F) << 4;
> +			c = (c & 0xCC) >> 2 | (c & 0x33) << 2;
> +			c = (c & 0xAA) >> 1 | (c & 0x55) << 1;
> +			hash = (hash >> 2) + (c << 24);
> +		}
> +	}
> +	return (base >> 6) ^ hash;
> +}
> +
>  static inline enum object_type oe_type(const struct object_entry *e)
>  {
>  	return e->type_valid ? e->type_ : OBJ_BAD;
> --
> gitgitgadget

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

karthik nayak <[email protected]> writes:

> 2. Generally in hashing algorithms the XOR is used to ensure that the
> output distribution is uniform which reduces collisions. Here, as you
> noted, we're more finding values for sorting rather than hashing in the
> traditional sense. So why use an XOR?

I am not Jonathan, but since the mixing-of-bits is done with XOR in
the original that Linus and I wrote, I think the question applies to
our version as well.  We prefer not to lose entropy from input
bytes, so we do not want to use OR or AND, which tend to paint
everything with 1 or 0 as you mix in bits from more bytes.

Anyway the question sounds like "generally when you take tests you
write with a pencil, but right now you are merely taking notes and
not taking any tests. why are you writing with a pencil?"  There are
multiple occasions that a pencil is a great fit as writing equipment.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, karthik nayak wrote (reply to this):

Junio C Hamano <[email protected]> writes:

> karthik nayak <[email protected]> writes:
>
>> 2. Generally in hashing algorithms the XOR is used to ensure that the
>> output distribution is uniform which reduces collisions. Here, as you
>> noted, we're more finding values for sorting rather than hashing in the
>> traditional sense. So why use an XOR?
>
> I am not Jonathan, but since the mixing-of-bits is done with XOR in
> the original that Linus and I wrote, I think the question applies to
> our version as well.  We prefer not to lose entropy from input
> bytes, so we do not want to use OR or AND, which tend to paint
> everything with 1 or 0 as you mix in bits from more bytes.
>

Thanks for the answering. My reasoning at the start was more of my
thoughts on why XOR could be used here and your explanation makes it
clearer.

> Anyway the question sounds like "generally when you take tests you
> write with a pencil, but right now you are merely taking notes and
> not taking any tests. why are you writing with a pencil?"  There are
> multiple occasions that a pencil is a great fit as writing equipment.

I'd say my questions was more of "I know a pencil is used when taking
tests because of XYZ reasons. On similar lines, why is a pencil used
here?" :)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Jonathan Tan wrote (reply to this):

"Jonathan Tan via GitGitGadget" <[email protected]> writes:
> diff --git a/pack-objects.h b/pack-objects.h
> index b9898a4e64b..15be8368d21 100644
> --- a/pack-objects.h
> +++ b/pack-objects.h
> @@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name)
>  	return hash;
>  }
>  
> +static inline uint32_t pack_name_hash_v2(const char *name)
> +{
> +	uint32_t hash = 0, base = 0, c;
> +
> +	if (!name)
> +		return 0;
> +
> +	while ((c = *name++)) {
> +		if (isspace(c))
> +			continue;
> +		if (c == '/') {
> +			base = (base >> 6) ^ hash;
> +			hash = 0;
> +		} else {
> +			/*
> +			 * 'c' is only a single byte. Reverse it and move
> +			 * it to the top of the hash, moving the rest to
> +			 * less-significant bits.
> +			 */
> +			c = (c & 0xF0) >> 4 | (c & 0x0F) << 4;
> +			c = (c & 0xCC) >> 2 | (c & 0x33) << 2;
> +			c = (c & 0xAA) >> 1 | (c & 0x55) << 1;
> +			hash = (hash >> 2) + (c << 24);
> +		}
> +	}
> +	return (base >> 6) ^ hash;
> +}

This works because `c` is masked before any arithmetic is performed on
it, but the cast from potentially signed char to uint32_t still makes
me nervous - if char is signed, it behaves as if it was first cast to
int32_t and only then to uint32_t, as you can see from running the code
below:

    #include <stdio.h>
    int main() {
        signed char c = 128;
        unsigned int u = c;
        printf("hello %u\n", u);
        return 0;
    }

I would declare `c` as uint8_t or unsigned char.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Junio C Hamano wrote (reply to this):

Jonathan Tan <[email protected]> writes:

> "Jonathan Tan via GitGitGadget" <[email protected]> writes:
>> diff --git a/pack-objects.h b/pack-objects.h
>> index b9898a4e64b..15be8368d21 100644
>> --- a/pack-objects.h
>> +++ b/pack-objects.h
>> @@ -207,6 +207,34 @@ static inline uint32_t pack_name_hash(const char *name)
>>  	return hash;
>>  }
>>  
>> +static inline uint32_t pack_name_hash_v2(const char *name)
>> +{
>> +	uint32_t hash = 0, base = 0, c;
>> + ...
>> +	while ((c = *name++)) {
>> + ...
>> +			/*
>> +			 * 'c' is only a single byte. Reverse it and move
>> +			 * it to the top of the hash, moving the rest to
>> +			 * less-significant bits.
>> +			 */
>> +			c = (c & 0xF0) >> 4 | (c & 0x0F) << 4;
>> + ...
> This works because `c` is masked before any arithmetic is performed on
> it, but the cast from potentially signed char to uint32_t still makes
> me nervous - if char is signed, it behaves as if it was first cast to
> int32_t and only then to uint32_t, ...
> I would declare `c` as uint8_t or unsigned char.

I think you meant the type of "name", and your worry is that *name
may pick up a negative integer from there when the platform char is
signed?  Here we are dealing with a pathname that has often UTF-8
encoded non-ASCII letters with 8th bit set, and I agree with you
that being explicitly unsigned would certainly help reduce the
cognitive load.

Thanks.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the Git mailing list, Taylor Blau wrote (reply to this):

On Fri, Dec 20, 2024 at 05:19:47PM +0000, Jonathan Tan via GitGitGadget wrote:
> The first change is to be more careful about paths using non-ASCII
> characters. With these characters in mind, reverse the bits in the byte
> as the least-significant bits have the highest entropy and we want to
> maximize their influence. This is done with some bit manipulation that
> swaps the two halves, then the quarters within those halves, and then
> the bits within those quarters.

Makes sense, and seems quite reasonable.

> The second change is to perform hash composition operations at every
> level of the path. This is done by storing a 'base' hash value that
> contains the hash of the parent directory. When reaching a directory
> boundary, we XOR the current level's name-hash value with a downshift of
> the previous level's hash. This perturbation intends to create low-bit
> distinctions for paths with the same final 16 bytes but distinct parent
> directory structures.

Very clever, I love this idea.

Thanks,
Taylor

}

static inline uint32_t pack_name_hash_v2(const unsigned char *name)
{
uint32_t hash = 0, base = 0, c;

if (!name)
return 0;

while ((c = *name++)) {
if (isspace(c))
continue;
if (c == '/') {
base = (base >> 6) ^ hash;
hash = 0;
} else {
/*
* 'c' is only a single byte. Reverse it and move
* it to the top of the hash, moving the rest to
* less-significant bits.
*/
c = (c & 0xF0) >> 4 | (c & 0x0F) << 4;
c = (c & 0xCC) >> 2 | (c & 0x33) << 2;
c = (c & 0xAA) >> 1 | (c & 0x55) << 1;
hash = (hash >> 2) + (c << 24);
}
}
return (base >> 6) ^ hash;
}

static inline enum object_type oe_type(const struct object_entry *e)
{
return e->type_valid ? e->type_ : OBJ_BAD;
Expand Down