tool.drcacheoff.analysis_unit_tests broken on ARM/AArch32 #7172

Open
egrimley-arm opened this issue Jan 6, 2025 · 3 comments

Comments

@egrimley-arm
Contributor

The test does pass occasionally, but very infrequently: about once in 50 tries for me.

With 7203e72, when it passes, it finishes very quickly.

When it fails the output looks like this:

9: ----------------
9: Testing tool errors
...
9: [scheduler] Reading headers from inputs to find filetypes
9: [scheduler] Output 0 triggered a rebalance @0:
9: [analyzer] Creating 2 worker threads
9: [analyzer] Worker 0 starting on trace shard 0 stream is 0xd31748
9: [analyzer] Worker 1 starting on trace shard 1 stream is 0xd31920
9: [analyzer] Worker 0 hit shard memref error cpuid not supported on trace shard 
9: [scheduler] Queue snapshot: inputs: 1 schedulable, 0 unscheduled, 4 eof
9:   out #0 @28: running #-1; 0 in queue; 0 blocked
9:   out #1 @50012: running #-1; 0 in queue; 0 blocked
9: 
9: [scheduler] Queue snapshot: inputs: 1 schedulable, 0 unscheduled, 4 eof
9:   out #0 @28: running #-1; 0 in queue; 0 blocked
9:   out #1 @100012: running #-1; 0 in queue; 0 blocked
...

When it passes the corresponding part of the output looks like:

9: ----------------
9: Testing tool errors
...
9: [scheduler] Reading headers from inputs to find filetypes
9: [scheduler] Output 0 triggered a rebalance @0:
9: [analyzer] Creating 2 worker threads
9: [analyzer] Worker 0 starting on trace shard 0 stream is 0x1ae9748
9: [analyzer] Worker 1 starting on trace shard 1 stream is 0x1ae9920
9: [analyzer] Worker 0 hit shard memref error cpuid not supported on trace shard 
9: [scheduler] Output 0 triggered a rebalance @28:
9: [scheduler] exiting early at input 4 with 0 live inputs left
9: [analyzer] Worker 1 finished trace shard 
9: [scheduler] Unscheduled queue lock acquired      :         2
9: [scheduler] Unscheduled queue lock contended     :         0
9: [scheduler] Stats for output #0

The test seems to have been broken by f1b2d54 (17 Sep 2024). Before that commit it seems to always pass. At that commit it nearly always fails, with output that looks like this:

9: ----------------
9: Testing tool errors
9: [scheduler] 5 inputs
9: [scheduler] Reading headers from inputs to find filetypes
9: [scheduler] Output 0 triggered a rebalance @0:
9: [analyzer] Creating 2 worker threads
9: [analyzer] Worker 0 starting on trace shard 0 stream is 0x1f7f280
9: [analyzer] Worker 1 starting on trace shard 1 stream is 0x1f7f440
9: [analyzer] Worker 1 hit shard memref error cpuid not supported on trace shard 
9: [scheduler] Output 0 triggered a rebalance @1115050074:
9: [scheduler] Output 0 triggered a rebalance @1130050074:
9: [scheduler] Output 0 triggered a rebalance @1145050074:
9: [scheduler] Output 0 triggered a rebalance @1160050074:
9: [scheduler] Output 0 triggered a rebalance @1175050075:
9: [scheduler] Output 0 triggered a rebalance @1190050075:
1/1 Test #9: tool.drcacheoff.analysis_unit_tests ...***Timeout  90.01 sec

On the rare occasions when it passes:

9: ----------------
9: Testing tool errors
9: [scheduler] 5 inputs
9: [scheduler] Reading headers from inputs to find filetypes
9: [scheduler] Output 0 triggered a rebalance @0:
9: [analyzer] Creating 2 worker threads
9: [analyzer] Worker 0 starting on trace shard 0 stream is 0x25c7280
9: [analyzer] Worker 1 starting on trace shard 1 stream is 0x25c7440
9: [analyzer] Worker 0 hit shard memref error cpuid not supported on trace shard 
9: [scheduler] Output 0 triggered a rebalance @1323071107:
9: [analyzer] Worker 1 finished trace shard 
9: [scheduler] Stats for output #0
...

@derekbruening, since this was your change and only a few months ago, do you have any ideas about this?

@derekbruening
Contributor

What is the failure in that first "When it fails the output looks like this:" instance? I don't see a failure message: why did the test fail? Is it a hang, like the timeout below? A hang is what the test says will happen if the input isn't given up (comment at line 473).

I would run this under a race detector and see if there is some issue with the atomics used to mark the output inactive. We do run analyzers under ThreadSanitizer and have not seen errors, but this is dynamic race detection, so it only sees the schedule that actually happens; running it on this machine, where the test fails, might produce something.
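
To make the suspected failure mode concrete, here is a minimal sketch of it, not the scheduler's actual code. The type `sketch_output_t`, the function `run_sketch`, and the timeout are all hypothetical names invented for illustration. It assumes one worker stops on an error while its output still owns queued inputs, and a second worker can only take those inputs once the first output has been marked inactive through an atomic flag.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for one scheduler output; not DynamoRIO's actual type.
struct sketch_output_t {
    std::atomic<bool> active{true};
    std::atomic<int> queued_inputs{0};
};

// Worker 0 hits an error immediately while its output still owns `inputs`
// queued inputs. Worker 1 has nothing of its own and can only take inputs
// from an output that has been marked inactive.
// Returns the number of inputs worker 1 took over, or -1 if it was still
// waiting when `timeout` expired (the simulated hang).
int
run_sketch(bool mark_inactive_on_error, int inputs, std::chrono::milliseconds timeout)
{
    assert(inputs > 0);
    sketch_output_t out0;
    out0.queued_inputs.store(inputs);

    std::thread worker0([&] {
        // Simulated tool error: the worker stops processing its shard.
        if (mark_inactive_on_error) {
            // Release pairs with the acquire load in worker 1 below.
            out0.active.store(false, std::memory_order_release);
        }
    });

    int processed = -1;
    std::thread worker1([&] {
        auto deadline = std::chrono::steady_clock::now() + timeout;
        while (std::chrono::steady_clock::now() < deadline) {
            if (!out0.active.load(std::memory_order_acquire)) {
                // The other output gave up its inputs: take them and finish.
                processed = out0.queued_inputs.exchange(0);
                return;
            }
            std::this_thread::yield();
        }
    });

    worker0.join();
    worker1.join();
    return processed;
}
```

Calling `run_sketch(false, ...)` models the case where the flag is never set, so the waiting worker spins until its deadline. Building a reproducer like this with `-fsanitize=thread` is one way to get a race detector to look at the flag accesses on the machine where the failure occurs.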

@egrimley-arm
Contributor Author

egrimley-arm commented Jan 6, 2025

In the first case it repeats a similar message ("[scheduler] Queue snapshot:" etc.) until it times out.

I think the second case doesn't really hang: it would probably repeat "triggered a rebalance" messages forever if I removed the timeout. Or does that count as a hang?

Yes, I've just let it run for ten minutes. There's a never-ending stream of "Queue snapshot" messages with an occasional "triggered a rebalance" thrown in. Is that worth trying under ThreadSanitizer, if I can work out how?

@derekbruening
Copy link
Contributor

Given that the test points at a specific hang (using that term for any non-termination), I would add local diagnostics to understand the behavior around that: did the output successfully get set as inactive, as is done by afdc470, the fix for such a hang?
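
As a rough idea of what such local diagnostics could look like, the sketch below logs the transition of an "active" flag and what a waiting worker observes. Both helpers (`set_inactive_logged`, `describe_output`) and the `[diag]` message format are hypothetical and not part of DynamoRIO; the real scheduler's state and logging macros will differ.

```cpp
#include <atomic>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper, not part of DynamoRIO: flips an "active" flag to false
// and logs whether this call actually changed it.
bool
set_inactive_logged(std::atomic<bool> &active, int output_ordinal, std::ostream &log)
{
    bool was_active = active.exchange(false, std::memory_order_acq_rel);
    log << "[diag] output " << output_ordinal
        << (was_active ? " active -> inactive" : " was already inactive") << "\n";
    return was_active;
}

// Hypothetical helper for the wait path: reports what a waiting worker sees.
std::string
describe_output(const std::atomic<bool> &active, int output_ordinal)
{
    std::ostringstream os;
    os << "[diag] output " << output_ordinal << " is "
       << (active.load(std::memory_order_acquire) ? "active" : "inactive");
    return os.str();
}
```

If the transition line never appears in a failing run, the output was never marked inactive. If it appears but the waiting worker keeps reporting "active", the problem is more likely in how the flag is read.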
