Keeping bloat under control #280
The fact that the containers are complex objects much larger than a register does not have anything to do with inlining their member functions (such as insert()).
Usually, the compiler is smart enough to decide whether the arguments of an inlined function are treated as references or passed around as values, regardless of what the function signature says; this is one of the benefits of inlining. Consider what happens when no optimizations are applied: a const& parameter forces its argument to live in memory so that its address exists, whereas after inlining the same argument can stay in registers.
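A minimal sketch of the effect (the names are illustrative; the point is only what the optimizer may do once the call boundary disappears):

```cpp
#include <cstddef>

// A small aggregate that comfortably fits in registers.
struct key { std::size_t a, b; };

// The signature says "by reference", but once the call is inlined the
// compiler is free to keep k entirely in registers; only a real
// out-of-line call forces k onto the stack so that &k exists.
inline std::size_t hash_key(const key& k)
{
    return k.a * 31 + k.b;
}

std::size_t caller(std::size_t x, std::size_t y)
{
    key k{x, y};        // with optimizations on, k never touches memory
    return hash_key(k); // at -O0 it is spilled so its address can be taken
}
```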
The key is not ever constructed twice from args, although an examination of unordered/include/boost/unordered/detail/foa/table.hpp (lines 406 to 422 at c1317cb) may suggest otherwise:
In the first case, an object
This is in principle doable, but:
unordered/include/boost/unordered/detail/foa/core.hpp, lines 2340 to 2360 at c1317cb
Exactly - that function arguments, including the implicit *this, are passed and used by reference (forcing stack bouncing) instead of by value is, in my opinion and experience, one of the main sources of function call overhead and thus an impetus for forced inlining (other things being equal, e.g. the function body still being visible to the compiler for IPO). The other source is those parts of a function's internal results that could be memoized across multiple calls (e.g. in a loop or at nearby call sites). And yes, this includes *this (one of the motivating examples in the deducing-this proposal is the ability to pass *this/self by value): imagine an object consisting of a few pointers that could otherwise be passed through registers, but cannot be when it is accessed through a member function, because that forces the creation of a this pointer, which obviously cannot 'point to registers'.

One way of mitigating this problem (for the this pointer) is for a member function to be a simple wrapper that delegates to a static function which accepts all of the arguments by value (where that makes sense, given the type, ABI, etc.); we now also have the deducing-this/explicit object parameter by-value syntax to make this easier - see the sketch below.

Anyway, my point was that unordered containers are complex objects that will always reside in memory, and thus their member functions should not need to be forceinlined (for any measurable speedup).
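A minimal sketch of that delegation pattern (hypothetical type; the C++23 explicit-object form appears only in a comment):

```cpp
#include <cstddef>
#include <cstdint>

class span_like
{
    std::uint64_t* ptr_;
    std::size_t    size_;

    // Static workhorse: 'self' arrives by value, i.e. in registers where
    // the ABI allows, instead of through a this pointer that pins the
    // object to memory.
    static std::uint64_t sum_impl(span_like self)
    {
        std::uint64_t s = 0;
        for (std::size_t i = 0; i != self.size_; ++i) s += self.ptr_[i];
        return s;
    }

public:
    span_like(std::uint64_t* p, std::size_t n) : ptr_(p), size_(n) {}

    // Thin member wrapper; with C++23 one could instead write
    //   std::uint64_t sum(this span_like self) { ... }
    std::uint64_t sum() const { return sum_impl(*this); }
};
```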
I realize that - I was just trying to question/get at the rationale and root cause: perhaps the source of the call overhead (be it stack bouncing or recalculation of a memoizable function prologue) could be mitigated with a less brute-force/'bloating' approach... (PS: sorry for the long delay.)
The differences disappear because the function calls - the very thing we are examining - disappear (as they are, well, forceinlined). So my point was to arrive at a design that would not require sunshine and an intelligent optimizer, or forceinlining - unfortunately the state of the language does not help us much here - but I did come up with something that I used in my vm library (exactly for container code), along the lines of the sketch below.
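Heavily hedged sketch of that idea - the actual vm-library code is not quoted in this thread, so the names and the toy storage scheme below are assumptions; only the shape matters: the Key/T/Args-independent bookkeeping lives in one ordinary, out-of-line function shared by every instantiation, while the only template code is a thin, trivially inlinable shell.

```cpp
#include <cstddef>
#include <new>

struct slot { void* p; bool fresh; };

// Out of line: the allocation/bookkeeping every emplace shares.
// Toy definition so the sketch is self-contained.
slot acquire_slot(unsigned char* storage, std::size_t slot_size, std::size_t idx)
{
    return { storage + idx * slot_size, true };
}

// Thin shell: the only part that depends on T/Args, cheap to inline.
template<typename T, typename... Args>
T* emplace_thin(unsigned char* storage, std::size_t idx, Args&&... args)
{
    slot s = acquire_slot(storage, sizeof(T), idx);
    if (s.fresh) ::new (s.p) T(static_cast<Args&&>(args)...);
    return static_cast<T*>(s.p);
}

int main()
{
    alignas(int) unsigned char buf[4 * sizeof(int)];
    int* p = emplace_thin<int>(buf, 0, 42); // shared logic stays out of line
    return *p == 42 ? 0 : 1;
}
```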
In the case of a 'compatible key' (e.g. string_view -> string) this creates the key object even for failed (preexisting) emplacements (a minor issue) and, more importantly, SBO-based types (like std::string) have nontrivial moves. Both could be avoided with delayed construction.
Oh it would - 'suddenly' the container setup and allocation parts would no longer depend on Key, T, Args... (whatever): they'd be reusable, with no need to inline them...
This only applies to cases where construction can throw, but even for those you can simply split the Key/T/Args-independent path in two: setup-and-allocation and a 'make changes permanent' postprocess. For the cases where construction is noexcept you'd just construct directly and commit unconditionally; both variants are sketched below.
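A hedged sketch of both variants (all names are hypothetical; the toy table_core only stands in for the real bookkeeping):

```cpp
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

// Phase 1 and phase 2 do not depend on Key/T/Args, so they could live out
// of line and be shared across all instantiations. Toy definitions here.
struct table_core
{
    alignas(std::max_align_t) unsigned char storage[64];
    std::size_t size = 0;

    void* allocate_slot()      { return storage; } // phase 1: setup + allocation
    void  rollback_slot(void*) {}                  // undo phase 1
    void  commit_slot(void*)   { ++size; }         // phase 2: make changes permanent
};

template<typename T, typename... Args>
T* emplace_two_phase(table_core& tbl, Args&&... args)
{
    void* p = tbl.allocate_slot();
    if constexpr (std::is_nothrow_constructible_v<T, Args&&...>) {
        // noexcept construction: no rollback path needed at all
        ::new (p) T(std::forward<Args>(args)...);
    } else {
        // throwing construction: commit only after it succeeds, so an
        // exception leaves the container observably unchanged
        try { ::new (p) T(std::forward<Args>(args)...); }
        catch (...) { tbl.rollback_slot(p); throw; }
    }
    tbl.commit_slot(p);
    return static_cast<T*>(p);
}
```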
This is generally not possible. Consider the following:

```cpp
boost::unordered_flat_map<std::string, int, transparent_hash, std::less<>> m;
std::string_view x("hello");
m.emplace(x, 0);
```

This could be made to work by delaying the construction of the std::string key, since the transparent hash can consume the string_view directly. But now consider:

```cpp
std::allocator<char> al;
m.emplace(al, 0);
```

Now, the key has to be constructed from al before it can be hashed at all, so its construction cannot be delayed past the probing phase.
The modifications needed to prototype this case are simple enough:

```cpp
template<typename... Args>
locator nosize_unchecked_emplace_at(
  const arrays_type& arrays_,std::size_t pos0,std::size_t hash,
  Args&&... args)
{
  /* allocate the position first, then construct in place */
  auto loc=nosize_noconstruct_unchecked_emplace_at(arrays_,pos0,hash);
  // WARNING: element construction assumed not to throw
  construct_element(loc.p,std::forward<Args>(args)...);
  return loc;
}

locator nosize_noconstruct_unchecked_emplace_at(
  const arrays_type& arrays_,std::size_t pos0,std::size_t hash)
{
  /* probe group by group for the first available slot */
  for(prober pb(pos0);;pb.next(arrays_.groups_size_mask)){
    auto pos=pb.get();
    auto pg=arrays_.groups()+pos;
    auto mask=pg->match_available();
    if(BOOST_LIKELY(mask!=0)){
      auto n=unchecked_countr_zero(mask);
      auto p=arrays_.elements()+pos*N+n;
      pg->set(n,hash);
      BOOST_UNORDERED_ADD_STATS(cstats.insertion,(pb.length()));
      return {pg,n,p};
    }
    else pg->mark_overflow(hash);
  }
}
```

If you're willing to test this in the context of your project and report back measured improvements/pessimizations with respect to binary size and performance, we could use that data to make an informed decision.
Please reconsider the possible overuse of forceinlining. For example, for unordered_flat_set, insert() is fully inlined at each call site (even with rather complex non-default hashers), except for the unchecked_emplace_with_rehash() part.

Considering that unordered containers are complex objects that can never themselves reside in registers, I'm failing to see how this forceinlining helps anything.

The only thing I see is the common case where the contained objects are trivial objects that can be passed and returned in registers (as is my use case) - then the generic Args&&... signature would unnecessarily cause them to be passed through memory (ignoring the lucky case where the SRA optimization kicks in). For this I'd rather see some kind of modernized boost::call_traits-like mechanism used, along the lines of the sketch below.
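A hedged sketch of what such a mechanism could look like (the trait name and size threshold are illustrative, not an existing Boost facility):

```cpp
#include <type_traits>

// Pick the parameter type at compile time: small trivially copyable types
// travel by value (and hence can sit in registers), everything else goes
// by const reference. Purely illustrative; not Boost.Unordered code.
template<typename T>
using in_t = std::conditional_t<
    std::is_trivially_copyable_v<T> && sizeof(T) <= 2 * sizeof(void*),
    T,          // cheap: by value
    const T&>;  // expensive or non-trivial: by reference

// Example use: for int this resolves to insert(int), for std::string to
// insert(const std::string&). in_t<T> is a non-deduced context, so T must
// already be known to the enclosing class template - fine for containers.
template<typename T>
struct toy_set
{
    bool insert(in_t<T> v); // declaration only; a sketch, not an implementation
};
```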
On a related note - consider skipping the double key construction from ...args: rather, move the key (constructed with key_from at the beginning of emplace_impl) and peel off the ...args used for key construction before forwarding the rest to unchecked_emplace*.
Also, could you not extract the construction from unchecked_emplace_at() and unchecked_emplace_with_rehash(), make them only allocate the space/position, and simply do an in-place construction in one place at the end of emplace_impl()? (That way you wouldn't have to forward Args&& around at all; see the sketch below.)
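A hedged sketch of both suggestions combined (the table interface shown here - key_from, hash_for, allocate_position, construct_at - is hypothetical, not Boost.Unordered's actual internals):

```cpp
#include <cstddef>
#include <utility>

// Build the key exactly once, then stop forwarding Args&& altogether:
// allocation needs only the hash, and the single in-place construction
// happens at the end, moving the already-made key. For a map, the mapped
// value's args would be peeled off and forwarded alongside the key.
template<typename Key, typename Table, typename... Args>
auto emplace_once(Table& tbl, Args&&... args)
{
    Key key = tbl.key_from(std::forward<Args>(args)...); // single key construction
    const std::size_t h = tbl.hash_for(key);
    void* p = tbl.allocate_position(h);          // no Args forwarded past here
    return tbl.construct_at(p, std::move(key));  // one in-place construction
}
```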