tool.drcacheoff.analysis_unit_tests broken on ARM/AArch32 #7172

egrimley-arm · 2025-01-06T15:36:21Z

In fact it sometimes passes, but very infrequently, about one time in 50 tries for me.

With 7203e72, when it passes, it finishes very quickly.

When it fails the output looks like this:

9: ----------------
9: Testing tool errors
...
9: [scheduler] Reading headers from inputs to find filetypes
9: [scheduler] Output 0 triggered a rebalance @0:
9: [analyzer] Creating 2 worker threads
9: [analyzer] Worker 0 starting on trace shard 0 stream is 0xd31748
9: [analyzer] Worker 1 starting on trace shard 1 stream is 0xd31920
9: [analyzer] Worker 0 hit shard memref error cpuid not supported on trace shard 
9: [scheduler] Queue snapshot: inputs: 1 schedulable, 0 unscheduled, 4 eof
9:   out #0 @28: running #-1; 0 in queue; 0 blocked
9:   out #1 @50012: running #-1; 0 in queue; 0 blocked
9: 
9: [scheduler] Queue snapshot: inputs: 1 schedulable, 0 unscheduled, 4 eof
9:   out #0 @28: running #-1; 0 in queue; 0 blocked
9:   out #1 @100012: running #-1; 0 in queue; 0 blocked
...

When it passes the corresponding part of the output looks like:

9: ----------------
9: Testing tool errors
...
9: [scheduler] Reading headers from inputs to find filetypes
9: [scheduler] Output 0 triggered a rebalance @0:
9: [analyzer] Creating 2 worker threads
9: [analyzer] Worker 0 starting on trace shard 0 stream is 0x1ae9748
9: [analyzer] Worker 1 starting on trace shard 1 stream is 0x1ae9920
9: [analyzer] Worker 0 hit shard memref error cpuid not supported on trace shard 
9: [scheduler] Output 0 triggered a rebalance @28:
9: [scheduler] exiting early at input 4 with 0 live inputs left
9: [analyzer] Worker 1 finished trace shard 
9: [scheduler] Unscheduled queue lock acquired      :         2
9: [scheduler] Unscheduled queue lock contended     :         0
9: [scheduler] Stats for output #0

The test seems to have been broken by f1b2d54 (17 Sep 2024). Before then it seems to always pass. With that commit it nearly always fails with output that looks like this:

9: ----------------
9: Testing tool errors
9: [scheduler] 5 inputs
9: [scheduler] Reading headers from inputs to find filetypes
9: [scheduler] Output 0 triggered a rebalance @0:
9: [analyzer] Creating 2 worker threads
9: [analyzer] Worker 0 starting on trace shard 0 stream is 0x1f7f280
9: [analyzer] Worker 1 starting on trace shard 1 stream is 0x1f7f440
9: [analyzer] Worker 1 hit shard memref error cpuid not supported on trace shard 
9: [scheduler] Output 0 triggered a rebalance @1115050074:
9: [scheduler] Output 0 triggered a rebalance @1130050074:
9: [scheduler] Output 0 triggered a rebalance @1145050074:
9: [scheduler] Output 0 triggered a rebalance @1160050074:
9: [scheduler] Output 0 triggered a rebalance @1175050075:
9: [scheduler] Output 0 triggered a rebalance @1190050075:
1/1 Test #9: tool.drcacheoff.analysis_unit_tests ...***Timeout  90.01 sec

On the rare occasions when it passes:

9: ----------------
9: Testing tool errors
9: [scheduler] 5 inputs
9: [scheduler] Reading headers from inputs to find filetypes
9: [scheduler] Output 0 triggered a rebalance @0:
9: [analyzer] Creating 2 worker threads
9: [analyzer] Worker 0 starting on trace shard 0 stream is 0x25c7280
9: [analyzer] Worker 1 starting on trace shard 1 stream is 0x25c7440
9: [analyzer] Worker 0 hit shard memref error cpuid not supported on trace shard 
9: [scheduler] Output 0 triggered a rebalance @1323071107:
9: [analyzer] Worker 1 finished trace shard 
9: [scheduler] Stats for output #0
...

@derekbruening , since this was your change and only a few months ago, do you have any ideas about this?


derekbruening · 2025-01-06T19:20:22Z

What is the failure in that first "When it fails the output looks like this:" instance? I don't see a failure message: why did the test fail? Is it a hang, like the timeout in the later output? A hang is what the test says will happen if the input isn't given up (comment at line 473).

I would run this under a race detector and see if there is some issue with the atomics used to mark the output inactive. We do run analyzers under ThreadSanitizer and have not seen errors but this is dynamic race detection so it only sees the schedule that happens; running on this machine where it fails might produce something.
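
As a rough illustration of the kind of hang under discussion, here is a minimal C++ sketch. It is not DynamoRIO's scheduler code: `toy_output_t`, `run_toy_schedule`, the input counts, and the deadline parameter are all hypothetical names and values chosen for the example. It assumes a design in which a worker that stops on an error must mark its output inactive through an atomic flag before the remaining worker can take over its inputs. If that store never happens or is never observed, the remaining worker keeps polling, which would resemble the repeated scheduler messages in the logs.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-in for one scheduler output; not DynamoRIO's real type.
struct toy_output_t {
    std::atomic<bool> active{ true };
    std::atomic<int> pending{ 0 }; // Inputs currently assigned to this output.
};

// Returns true if every input was processed before the deadline, false if the
// surviving worker was still waiting when the deadline expired (the "hang").
bool
run_toy_schedule(bool mark_inactive_on_error, std::chrono::milliseconds deadline)
{
    toy_output_t out0, out1;
    out0.pending = 2;
    out1.pending = 3;
    const int total = 5;

    // Worker 0 hits an error right away and stops reading its inputs.
    std::thread failing([&] {
        if (mark_inactive_on_error)
            out0.active.store(false, std::memory_order_release);
    });

    int processed = 0;
    // Worker 1 keeps going until all inputs are done or the deadline passes.
    std::thread surviving([&] {
        auto give_up = std::chrono::steady_clock::now() + deadline;
        while (processed < total && std::chrono::steady_clock::now() < give_up) {
            // Take over the peer's inputs only once it is marked inactive.
            if (!out0.active.load(std::memory_order_acquire))
                out1.pending += out0.pending.exchange(0);
            while (out1.pending > 0) {
                --out1.pending;
                ++processed;
            }
            std::this_thread::yield();
        }
    });

    failing.join();
    surviving.join();
    return processed == total; // Safe to read: the writer thread has been joined.
}
```

Calling `run_toy_schedule(true, ...)` with a generous deadline should return true, while `run_toy_schedule(false, ...)` returns false once the deadline expires, because the two inputs left on the failed output are never handed over. Building a sketch like this with `-fsanitize=thread` is one way to get a feel for ThreadSanitizer reports before applying it to the real test.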

egrimley-arm · 2025-01-06T19:32:36Z

It repeats a similar message ("[scheduler] Queue snapshot: ..." etc.) until it times out.

I think the second case doesn't really hang: it would probably repeat "triggered a rebalance" messages forever if I removed the timeout. Or does that count as a hang?

Yes, I've just let it run for ten minutes. There's a never-ending stream of "Queue snapshot" messages with an occasional "triggered a rebalance" thrown in. Is that worth trying under ThreadSanitizer, if I can work out how?

derekbruening · 2025-01-06T20:27:09Z

Given that the test points at a specific hang (using that term for any non-termination), I would add local diagnostics to understand the behavior around that: did the output successfully get set as inactive, as is done by afdc470, the fix for such a hang?

egrimley-arm added the OpSys-ARM label Jan 6, 2025
