
Lucene.Net.Replicator.LocalReplicator.ReplicationSession: Removed unnecessary array allocation #1108

Merged · 1 commit merged into apache:master on Jan 20, 2025

Conversation

NightOwl888
Contributor

  • You've read the Contributor Guide and Code of Conduct.
  • You've included unit or integration tests for your change, where applicable.
  • You've included inline docs for your change, where applicable.
  • There's an open issue for the PR that you are making. If you'd like to propose a change, please open an issue to discuss the change or find an existing issue.

Lucene.Net.Replicator.LocalReplicator.ReplicationSession: Removed unnecessary array allocation

Fixes #1070

Description

This changes ReplicationSession.sessions to use J2N.Collections.Generic.Dictionary<TKey, TValue>, which supports removing entries while iterating forward on target frameworks prior to .NET Core 3.x. This removes a .ToArray() allocation that existed only so an enumerator could be used while cleaning up replication sessions.
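For illustration, here is a minimal standalone sketch of the difference (assuming the J2N behavior described above; the names and data are hypothetical, not taken from the actual ReplicationSession code):

```csharp
using System.Collections.Generic;
using System.Linq;

// On .NET Framework / pre-3.x .NET Core, SCG.Dictionary throws if an entry is
// removed while it is being enumerated, so the matching keys must be snapshotted first.
var scg = new Dictionary<long, string> { [1] = "expired", [2] = "live" };
foreach (long id in scg.Where(p => p.Value == "expired").Select(p => p.Key).ToArray())
    scg.Remove(id); // the .ToArray() allocation exists only to make removal safe

// J2N's dictionary supports removing entries while iterating forward,
// so the snapshot allocation can be dropped.
var j2n = new J2N.Collections.Generic.Dictionary<long, string> { [1] = "expired", [2] = "live" };
foreach (long id in j2n.Where(p => p.Value == "expired").Select(p => p.Key))
    j2n.Remove(id);
```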

…: Use J2N.Collections.Dictionary<TKey, TValue> to allow deleting while iterating forward prior to .NET Core (fixes apache#1070).
@NightOwl888 NightOwl888 added the notes:performance-improvement Contains a performance improvement label Jan 20, 2025
@NightOwl888 NightOwl888 requested a review from paulirwin January 20, 2025 15:26
@NightOwl888 NightOwl888 merged commit 2a0c974 into apache:master Jan 20, 2025
16 checks passed
@jeme
Contributor

jeme commented Jan 21, 2025

Interesting results. Just scratching the surface hints that the J2N implementation is actually worse performance-wise on .NET Framework, whereas on .NET Core it's much closer. The extra allocation does show up for the built-in dictionary, but MS is clearly improving this (the gap shrinks from roughly 0.37 KB to roughly 0.14 KB between .NET 8 and 9).

BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5131/22H2/2022Update)
Intel Core i9-7900X CPU 3.30GHz (Kaby Lake), 1 CPU, 20 logical and 10 physical cores
  [Host]     : .NET Framework 4.8.1 (4.8.9290.0), X64 RyuJIT VectorSize=256
  DefaultJob : .NET Framework 4.8.1 (4.8.9290.0), X64 RyuJIT VectorSize=256

| Method            | Mean     | Error     | StdDev    | Gen0   | Allocated |
|------------------ |---------:|----------:|----------:|-------:|----------:|
| BuiltInDictionary | 1.558 us | 0.0181 us | 0.0170 us | 0.2232 |   1.38 KB |
| J2NDictionary     | 5.136 us | 0.0853 us | 0.0757 us | 0.1755 |   1.09 KB |



BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5131/22H2/2022Update)
Intel Core i9-7900X CPU 3.30GHz (Kaby Lake), 1 CPU, 20 logical and 10 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
  DefaultJob : .NET 8.0.11 (8.0.1124.51707), X64 RyuJIT AVX-512F+CD+BW+DQ+VL

| Method            | Mean     | Error    | StdDev   | Gen0   | Allocated |
|------------------ |---------:|---------:|---------:|-------:|----------:|
| BuiltInDictionary | 964.4 ns | 18.78 ns | 24.41 ns | 0.2079 |   1.46 KB |
| J2NDictionary     | 817.0 ns | 14.83 ns | 13.15 ns | 0.1545 |   1.09 KB |


BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5131/22H2/2022Update)
Intel Core i9-7900X CPU 3.30GHz (Kaby Lake), 1 CPU, 20 logical and 10 physical cores
.NET SDK 9.0.101
  [Host]     : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL
  DefaultJob : .NET 9.0.0 (9.0.24.52809), X64 RyuJIT AVX-512F+CD+BW+DQ+VL

| Method            | Mean     | Error    | StdDev   | Gen0   | Allocated |
|------------------ |---------:|---------:|---------:|-------:|----------:|
| BuiltInDictionary | 827.9 ns | 15.30 ns | 14.31 ns | 0.1755 |   1.23 KB |
| J2NDictionary     | 808.4 ns | 15.96 ns | 19.00 ns | 0.1545 |   1.09 KB |

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Collections.Generic;
using System.Linq;

BenchmarkRunner.Run<DictionaryBenchmarks>();

[MemoryDiagnoser]
public class DictionaryBenchmarks
{
    [Benchmark]
    public int BuiltInDictionary()
    {
        Dictionary<string, string> values = new ();
        AddItems(values);
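        // SCG.Dictionary: snapshot the matching values with ToArray() first, since
        // removing during enumeration throws on .NET Framework / pre-3.x .NET Core.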
        foreach (string b in values.Values.Where(x => x.Length > 1).ToArray())
            values.Remove(b);
        return 1;
    }


    [Benchmark]
    public int J2NDictionary()
    {
        J2N.Collections.Generic.Dictionary<string, string> values = new();
        AddItems(values);
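        // J2N Dictionary: removal while iterating forward is supported,
        // so no ToArray() snapshot is needed.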
        foreach (string b in values.Values.Where(x => x.Length > 1))
            values.Remove(b);
        return 1;
    }

    private void AddItems(IDictionary<string,string> x)
    {
       x.Add("00", "00");
       x.Add("01", "01");
       x.Add("02", "02");
       x.Add("03", "03");
       x.Add("04", "04");
       x.Add("05", "05");
       x.Add("06", "06");
       x.Add("07", "07");
       x.Add("08", "08");
       x.Add("09", "09");
       x.Add("10", "10");
       x.Add("11", "11");
       x.Add("12", "12");
       x.Add("13", "13");
       x.Add("14", "14");
       x.Add("15", "15");
    }
}
```

One overall concern I would have when switching from the core collection implementations to a custom one (e.g. from J2N) is that Microsoft is pushing on performance with every iteration of .NET, while other implementations might not keep up.

@NightOwl888
Contributor Author

A few things to note:

  1. This is the Microsoft implementation of Dictionary<TKey, TValue> which was upgraded to align with .NET 8 in J2N 2.1.0. It is only customized to the extent it took to implement nullable keys, structural formatting, and structural equality. There may be some extra overhead in J2N 2.1 because of the guard clauses and "helper" classes that are used to normalize the behavior between .NET Core and .NET Framework without polluting the codebase with #ifdef statements. Microsoft may also be doing some optimizations at the compiler level that we aren't accounting for.
  2. Due to the fact that the nested types of the collections were mostly all made internal in J2N 2.0.0, we returned the interfaces rather than the concrete types from the API. There is an open issue to fix that in J2N 3.0.0, but it requires a breaking API change. Therefore, calling GetEnumerator() will box, because Enumerator is a struct. To do an apples-to-apples comparison, you need to box the BCL implementation as well: either cast the result of GetEnumerator() to IEnumerator<T>, or do the whole comparison on IDictionary<TKey, TValue> rather than on the concrete type (see the sketch after this list). Also, the comparison should be done on net8.0. Do note that Lucene.NET uses the interfaces extensively in its API, so in many cases this boxing will not be avoided even after the J2N API has been fixed.
  3. Maybe I am missing something, but the warning messages in BenchmarkDotNet say that the results cannot be relied upon unless there is at least 100ms of activity. Normally, I add a for loop inside the benchmark to run the operation enough iterations that it takes over 100ms.
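To make point 2 concrete, here is a minimal sketch of the boxing difference, using only BCL types (an illustration, not code from this PR):

```csharp
using System.Collections.Generic;

var concrete = new Dictionary<string, string> { ["a"] = "aa" };

// Concrete type: foreach binds to Dictionary<,>.ValueCollection.Enumerator,
// a struct, so enumeration does not allocate.
foreach (string v in concrete.Values) { _ = v; }

// Interface view: GetEnumerator() is resolved through IEnumerable<string>,
// so the struct enumerator is boxed. Comparing against J2N 2.1 (which hands
// back its collections as interfaces) is only apples-to-apples if the BCL
// dictionary is boxed the same way.
IDictionary<string, string> viaInterface = concrete;
foreach (string v in viaInterface.Values) { _ = v; }
```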

That being said, perhaps it would be worth it to review the changes to the dictionary in .NET 9 to see whether it makes sense to do another dictionary upgrade in J2N.

@jeme
Contributor

jeme commented Jan 21, 2025

  1. Fair, but it still has the same problem of risking falling behind. Sometimes there is good reason to have a custom implementation though, so I am not saying never, but I think it's worth considering each case.

  2. I am not sure I understand this point. The benchmark tries to "simulate" what happens in a "before" and "after" scenario, based on how it's used in the replicator, where as far as I can see there is no casting of any enumerators?

  3. Was that on your own PC? I am not getting any warnings, but mine is not a brand-new system, so perhaps newer CPUs with improved IPC and clocks execute this much faster.

@NightOwl888
Contributor Author

  1. True, which is why we previously tried to avoid that by cascading the call to a wrapped BCL implementation. Unfortunately, the tradeoff is that prior to .NET Core 3.x there was no support for deletion while iterating forward, which Lucene uses in a few places. So, we either have to conditionally compile or make a normalized implementation in J2N. IMO, it is better to have a normalized implementation so we can push as many conditional compilation statements into J2N as possible, until such time as we don't need the J2N implementation any longer (in this case, when we drop support for everything below .NET Core). But I don't think we should use JCG.Dictionary<TKey, TValue> unless it is required.

  2. All I am saying is that the next release of Lucene.NET will not have a reference to J2N 2.1, so this comparison is not completely accurate as to what will be released and does not account for work that already exists on the main branch in J2N.

    In the benchmark, the foreach that the compiler generates calls SCG.Dictionary<TKey, TValue>.Values.GetEnumerator(), which returns SCG.Dictionary<TKey, TValue>.ValueCollection.Enumerator, a struct. The J2N implementation also returns a struct, but casts it to IEnumerator<TValue>, which will box. To do a direct comparison, the dictionary should be declared as the interface, IDictionary<string, string> values = new Dictionary<string, string>();, so the box happens there as well. In J2N 3.0, this boxing will go away: the concrete types will return a struct, which will match the upstream performance better.

    Also, J2N 2.1 had some boxing in some of the guard clauses, which was recently fixed, along with switching to the unsigned right shift operator instead of a method call. These changes will have a positive impact on the performance of much of the J2N library, including the collections.

  3. Good point on the difference in hardware. That could be what is causing the warnings I am seeing. Of course, that makes me cautious about results from others when the benchmark doesn't include at least 100ms of work.

    (screenshot of the BenchmarkDotNet warnings)

.NET Framework

There is one other difference that could explain what you are seeing on .NET Framework. We patched the implementation to nullify the data buckets only for reference types, whereas .NET Framework did it for both value and reference types. There is a method, RuntimeHelpers.IsReferenceOrContainsReferences<T>(), that is used to determine whether there are any reference types to nullify, but it doesn't exist on .NET Framework, so we use reflection to analyze the type and cache the value the first time a type is used. More iterations would definitely change the result, because that reflection overhead only happens on the first call. Here is the implementation:

https://github.com/NightOwl888/J2N/blob/8845a696e2cd3e2ca0752823c779c72aa7f0e5a1/src/J2N/Runtime/CompilerServices/RuntimeHelper.cs#L31-L84
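For context, the general shape of that pattern is sketched below (heavily simplified and hypothetical; the conditional symbols are illustrative, and the link above is the real implementation):

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Reflection;

// Use the BCL intrinsic where it exists; otherwise compute the answer once
// per type via reflection and cache it so the cost is only paid on first use.
internal static class ReferenceTypeHelper
{
#if NETCOREAPP || NETSTANDARD2_1
    public static bool IsReferenceOrContainsReferences<T>()
        => System.Runtime.CompilerServices.RuntimeHelpers.IsReferenceOrContainsReferences<T>();
#else
    private static readonly ConcurrentDictionary<Type, bool> cache = new ConcurrentDictionary<Type, bool>();

    public static bool IsReferenceOrContainsReferences<T>()
        => cache.GetOrAdd(typeof(T), Analyze);

    private static bool Analyze(Type type)
    {
        if (!type.IsValueType) return true;   // reference types always need to be cleared
        if (type.IsPrimitive || type.IsPointer || type.IsEnum) return false;
        // A struct only needs clearing if one of its (recursive) instance fields holds a reference.
        return type.GetFields(BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic)
                   .Any(f => Analyze(f.FieldType));
    }
#endif
}
```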

So, 2 ways we could deal with this:

  • Review the above implementation to see if there are any performance enhancements that could be made
  • Change it to always clear the references in the collection buckets, as was the case in .NET Framework, but keep the enhancement where RuntimeHelpers.IsReferenceOrContainsReferences<T>() exists (which would require conditional compilation in the J2N collections)

.NET 9 Implementation

Actually, it is possible that the difference between net8.0 and net9.0 is due to a more efficient .ToArray() implementation and this might have little or nothing to do with enhancements to the Dictionary<TKey, TValue> implementation.

@jeme
Contributor

jeme commented Jan 22, 2025

Ohhh... My sincere apologies... Now I get the issue here (the title could be clearer, though, as I think this is the more important focus).

WTF... I did not know there was a difference in how "Remove" behaves in .NET 8/9 vs Framework; that's a contract change between the versions that I did not know of, and it is quite problematic in terms of the NETStandard promise o.O, yikes.

I basically looked at this from a "performance optimization" perspective and went, "but is it really an improvement?" The functional part is way more important IMO.


As a side note, all this talk about dropping .NET Framework support (or NETStandard 2.0) is not something I am truly fond of, as someone who still uses it from that stack, and I currently don't know if we will manage to actually upgrade within the coming years. :(... So hopefully that will be a Lucene 10.x thing (or whichever Java version is the next target) and not a 4.8 final release thing >.<...
