Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perf improvement and refactoring #17

Open
zlatko-michailov opened this issue Apr 23, 2020 · 8 comments
Open

Perf improvement and refactoring #17

zlatko-michailov opened this issue Apr 23, 2020 · 8 comments

Comments

@zlatko-michailov
Copy link
Contributor

The use of std::string kills perf in many ways - every construction and modification may perform a heap allocation. That may also throw an exception, and ultimately terminate the process.

I hacked up a change that uses a char [BLOCK_BYTES] instead of a std::string, and I consistently get a 25% reduction in time for the slow test. To enter the new implementation, I introduced a new update() overload.

This new implementation must be a base, and the existing public methods should call into it. Then I noticed a few things in the code that I prefer to fix as I go:

  • Freely hanging methods that takes references (to instance members). I want to make those private methods of the class.
  • The final() method is not cohesive - it does 3 things: it finishes the work, produces a result (as a std::string), and resets the sate. I would like to split all those 3 steps in separate methods.

This line of thinking is leading me to a new class sha1 or even better - put it in a namespace - sha1::hash. We can keep the existing class SHA1 with the public API it has today. It will simply maintain a member of sha1::hash, and will delegate all the work to it.

To summarize it, I want to have an efficient low-level class, so that clients who care about perf and reliability, can use it directly. We can keep the existing, wrapper around for convenience and back-compat purposes.

Please let me know what you think about this.

@vog
Copy link
Owner

vog commented Apr 24, 2020

I'm all for improving the interface, and providing (ideally zero-cost) wrappers in the SHA1 class.

Introducing a sha1 namespace is fine. I'm not even sure if sha1::hash needs to be a class, maybe plain functions and an abstract type "sha1::hash" would be more appropriate and easier to use? Of course, RAII might be a reason to stick with classes, but on the other hand, the intermediate state of SHA-1 calculations has a fixed size and can be allocated directly on the stack, I guess.

By the way, I wonder why are you willing to put so much effort into this library.

I'm aware that SHA-1 is still needed to support legacy systems (including badly-designed systems like Git), but I wonder why newly written code would ever need SHA-1. I'm very curious about your use-case.

Also, what is the advantage of this library over the many other hashing libraries (which are usually more optimized to start with)? Is it our (non-)license?

But, please don't get me wrong. Even though I wonder if it's worth it, I appreciate your effort and I'm willing to assist you.

@zlatko-michailov
Copy link
Contributor Author

Thanks for considering my proposal! I will gladly answer al of your questions.

Re: Namespace, class, and freely hanging methods
When multiple headers are included, there is a chance for collisions. That chance is the highest for freely-hanging methods. Assuming no sane person would include two different implementations of SHA-1, putting all these functions in a sha1 class would prevent collisions.
I prefer namespaces over classes, because namespaces look nicer when we start adding more classes, constants, and other declarations. (You can technically do the same with a class. It just looks uglier to me.)

Re: Why I do this
My current hobby project is https://github.com/zlatko-michailov/abc. I want it to be suitable for IoT devices where CPU and memory are very limited. That's why size, reliability, and efficiency are very important.
I don't want to depend on compiled lib's, because they may not exist for some strange hardware. That's why my library and all of its dependencies must be header-only.

This particular implementation satisfies most of my requirements - non-GPL, small and simple, and now header-only. :) I have to improve its efficiency and reliability, but that is easily doable.

Re: Why SHA-1
I am not trying to implement anything of a cryptographic significance.

I've implemented a C++ wrapper around the POSIX socket C API. The next step is to implement a WebSocket server, so that a user can connect to a C++ program from a web browser. The WebSocket handshake requires SHA-1 and base64.

Re: What options I have
I prefer that we share the code, so that I can easily adopt improvements from other contributors.

I do notice, however, that our coding styles differ, and coming to an agreement on such a major refactoring is likely to leave one of us unhappy. Given that the core computation code is 20+ years old, I don't expect any major flaws in it yet to be discovered. Thus, I don't exclude the option of copying that core code in to my own project and moving on.

Now that I've laid out my thoughts, the latter option seems easier, unless you convince me the other way. :)

Same question to you: How do you use this library? Has any of the people who have forked or downloaded it, shared with you how they use it?

@vog
Copy link
Owner

vog commented Jan 28, 2021

I'm still very interested in your code, especially if you are willing to share it as public domain, and to add some lines for backwards compatibility.

I don't mind if the implementation changes too much, as long as it is still easy to see that the code is safe. The implementation already changed quite a lot from the original implementation in C.

(Maybe it would even make sense to go back to the original C implementation and to add C++ wrappers on top of those? On the other hand, this would miss some of the non-C++-specific improvements that we have accumulated so far.)

Regarding your question: I don't have any insights about how this library is used. I assume that it's also just keeping legacy interfaces working, which applies to my use case as well. Given your target platform, it makes of course sense to try to improve performance.

@zlatko-michailov
Copy link
Contributor Author

It's great to hear back!

I've been working on an all-round http+json endpoint. And recently, I started exploring a whole new area - virtual memory. I expect for that to keep me busy for quite a few more months. WebSocket, for which I need SHA-1, keeps getting pushed back, though it's still on the roadmap.

I have lost some context on my refactoring plan. We should resurrect this discussion once WebSocket bubbles up to the top of my priority list.

Meanwhile, I have a question about "public domain". Does it have a concrete definition, or does it mean "no restrictions whatsoever"? I want to make sure it is the same or less restrictive than the MIT license, which I have adopted as one of the least restrictive licenses.

@vog
Copy link
Owner

vog commented Jan 29, 2021

WebSocket, for which I need SHA-1, keeps getting pushed back, though it's still on the roadmap.

I'd like to mention that I had much better experiences with HTTP(S) long polling than with WebSockets, at least with regard to stability (especially in the presence of networking issues). On the other hand, of course I don't want to discourage you on improving the SHA1 code. ;-)

I have lost some context on my refactoring plan. We should resurrect this discussion once WebSocket bubbles up to the top of my priority list.

Agreed.

Meanwhile, I have a question about "public domain". Does it have a concrete definition, or does it mean "no restrictions whatsoever"? I want to make sure it is the same or less restrictive than the MIT license, which I have adopted as one of the least restrictive licenses.

"Public domain" has a strict meaning: https://en.wikipedia.org/wiki/Public_domain

It is by definition less restrictive than any license you could think of, because it describes the absence of the need for a license.

In some countries, e.g. many European countries such as Germany, the situation is slightly more complicated, but in the end boils down to the least amount of restrictions possible under law.

@zlatko-michailov
Copy link
Contributor Author

Thanks for clarifying the "public domain". That makes sense.

You are right about WebSocket. I don't know how much longer I will keep it on the roadmap. Let me finish the things that are on my mind, and if nothing new comes up, I'll try to find some time to refactor the SHA code. We are still talking months though.

@vog
Copy link
Owner

vog commented Jan 31, 2021

No problem, just come back if you have something.

Also, just in case you finally decide to abandon this, feel free to drop a note as well.

@matu3ba
Copy link

matu3ba commented Dec 6, 2024

To summarize it, I want to have an efficient low-level class, so that clients who care about perf and reliability, can use it directly. We can keep the existing, wrapper around for convenience and back-compat purposes.

perf/no alloc + checks

For perf and reliability you either have to use string_view or ptrs into buffer, be it string or anything else.
C++20 has for this https://en.cppreference.com/w/cpp/container/span and C23 has bound-check safety and _NonNull which unfortunately did not make it and represents similar semantics as Zig but much more verbose https://clang.llvm.org/docs/BoundsSafety.html

If you are ok with C abi, then you can roll your own slice struct similar to what the clang C23 bound checks do.

comptime

As far as I understand it, C++ offers only ways to comptime create string- or other literals and compound objects of those things via templates, but not filling runtime-usable stuff via consteval fn and alike.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants