In fact it sometimes passes, but very infrequently, about one time in 50 tries for me.
With 7203e72, when it passes, it finishes very quickly.
When it fails the output looks like this:
9: ----------------
9: Testing tool errors
...
9: [scheduler] Reading headers from inputs to find filetypes
9: [scheduler] Output 0 triggered a rebalance @0:
9: [analyzer] Creating 2 worker threads
9: [analyzer] Worker 0 starting on trace shard 0 stream is 0xd31748
9: [analyzer] Worker 1 starting on trace shard 1 stream is 0xd31920
9: [analyzer] Worker 0 hit shard memref error cpuid not supported on trace shard
9: [scheduler] Queue snapshot: inputs: 1 schedulable, 0 unscheduled, 4 eof
9: out #0 @28: running #-1; 0 in queue; 0 blocked
9: out #1 @50012: running #-1; 0 in queue; 0 blocked
9:
9: [scheduler] Queue snapshot: inputs: 1 schedulable, 0 unscheduled, 4 eof
9: out #0 @28: running #-1; 0 in queue; 0 blocked
9: out #1 @100012: running #-1; 0 in queue; 0 blocked
...
When it passes the corresponding part of the output looks like:
9: ----------------
9: Testing tool errors
...
9: [scheduler] Reading headers from inputs to find filetypes
9: [scheduler] Output 0 triggered a rebalance @0:
9: [analyzer] Creating 2 worker threads
9: [analyzer] Worker 0 starting on trace shard 0 stream is 0x1ae9748
9: [analyzer] Worker 1 starting on trace shard 1 stream is 0x1ae9920
9: [analyzer] Worker 0 hit shard memref error cpuid not supported on trace shard
9: [scheduler] Output 0 triggered a rebalance @28:
9: [scheduler] exiting early at input 4 with 0 live inputs left
9: [analyzer] Worker 1 finished trace shard
9: [scheduler] Unscheduled queue lock acquired : 2
9: [scheduler] Unscheduled queue lock contended : 0
9: [scheduler] Stats for output #0
The test seems to have been broken by f1b2d54 (17 Sep 2024). Before then it seems to always pass. With that commit it nearly always fails with output that looks like this:
9: ----------------
9: Testing tool errors
9: [scheduler] 5 inputs
9: [scheduler] Reading headers from inputs to find filetypes
9: [scheduler] Output 0 triggered a rebalance @0:
9: [analyzer] Creating 2 worker threads
9: [analyzer] Worker 0 starting on trace shard 0 stream is 0x1f7f280
9: [analyzer] Worker 1 starting on trace shard 1 stream is 0x1f7f440
9: [analyzer] Worker 1 hit shard memref error cpuid not supported on trace shard
9: [scheduler] Output 0 triggered a rebalance @1115050074:
9: [scheduler] Output 0 triggered a rebalance @1130050074:
9: [scheduler] Output 0 triggered a rebalance @1145050074:
9: [scheduler] Output 0 triggered a rebalance @1160050074:
9: [scheduler] Output 0 triggered a rebalance @1175050075:
9: [scheduler] Output 0 triggered a rebalance @1190050075:
1/1 Test #9: tool.drcacheoff.analysis_unit_tests ...***Timeout 90.01 sec
On the rare occasions when it passes:
9: ----------------
9: Testing tool errors
9: [scheduler] 5 inputs
9: [scheduler] Reading headers from inputs to find filetypes
9: [scheduler] Output 0 triggered a rebalance @0:
9: [analyzer] Creating 2 worker threads
9: [analyzer] Worker 0 starting on trace shard 0 stream is 0x25c7280
9: [analyzer] Worker 1 starting on trace shard 1 stream is 0x25c7440
9: [analyzer] Worker 0 hit shard memref error cpuid not supported on trace shard
9: [scheduler] Output 0 triggered a rebalance @1323071107:
9: [analyzer] Worker 1 finished trace shard
9: [scheduler] Stats for output #0
...
@derekbruening, since this was your change and it was made only a few months ago, do you have any ideas about this?
What is the failure in that first "When it fails the output looks like this:" instance? I'm not seeing a failure message: why did the test fail? Is it a hang, like the timeout below? A hang is what the test says will happen if the input isn't given up (comment at line 473).
I would run this under a race detector and see if there is some issue with the atomics used to mark the output inactive. We do run analyzers under ThreadSanitizer and have not seen errors, but this is dynamic race detection, so it only sees the schedule that actually happens; running it on this machine where it fails might produce something.
It repeats a similar message ("[scheduler] Queue snapshot:" etc.) until it times out.
I think the second case doesn't really hang: it would probably repeat "triggered a rebalance" messages forever if I removed the timeout. Or does that count as a hang?
Yes, I've just let it run for ten minutes. There's a never-ending stream of "Queue snapshot" messages with an occasional "triggered a rebalance" thrown in. Is that worth trying under ThreadSanitizer, if I can work out how?
Given that the test points at a specific hang (using that term for any non-termination), I would add local diagnostics to understand the behavior around that: did the output successfully get set as inactive, as done by afdc470, the fix for such a hang?
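For anyone following along, the pattern being discussed can be sketched in a few lines of C++. This is only a toy model, not the scheduler's actual code: `run_toy_workers`, `output0_active`, `stuck_inputs`, and the bounded poll count are all hypothetical names and simplifications invented for illustration. It assumes that one worker exits on an error while an input is still sitting in its queue, and that the other worker can only pick that input up once the first output has been marked inactive. The real failure polls without end; the sketch gives up after a fixed number of polls so that it terminates.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Toy model (hypothetical, not DynamoRIO code): worker 0 exits on an error
// while one input is still assigned to its output. Worker 1 can take that
// input only after output 0 has been marked inactive.
// Returns true if worker 1 finished, false if it gave up polling.
bool run_toy_workers(bool mark_inactive_on_error, int max_polls = 1000)
{
    std::atomic<bool> output0_active{true};
    std::atomic<bool> worker0_exited{false};
    std::atomic<int> stuck_inputs{1};
    bool finished = false;

    std::thread worker0([&] {
        // Simulated shard error: the worker leaves its processing loop.
        if (mark_inactive_on_error)
            output0_active.store(false, std::memory_order_release);
        worker0_exited.store(true, std::memory_order_release);
    });

    std::thread worker1([&] {
        int polls_after_exit = 0;
        while (polls_after_exit < max_polls) {
            // Read the exit flag first so that, if it is set, the load of
            // output0_active below is guaranteed to see worker 0's store.
            bool peer_exited = worker0_exited.load(std::memory_order_acquire);
            if (!output0_active.load(std::memory_order_acquire)) {
                // Output 0 gave up its input: take it and finish.
                stuck_inputs.fetch_sub(1, std::memory_order_relaxed);
                finished = true;
                return;
            }
            // Only count polls made after worker 0 is known to have exited.
            if (peer_exited)
                ++polls_after_exit;
            std::this_thread::yield();
        }
    });

    worker0.join();
    worker1.join();
    // If worker 1 finished, it must have consumed the stuck input.
    assert(!finished || stuck_inputs.load() == 0);
    return finished;
}
```

In this model, `run_toy_workers(true)` returns true and `run_toy_workers(false)` returns false once the poll budget runs out, which is the toy equivalent of the never-ending polling above. A local diagnostic in the real code would check the analogous condition: whether the erroring worker's output was ever observed as inactive by the other worker. With GCC or Clang, a standalone file like this can be built with `-fsanitize=thread -pthread` to run it under ThreadSanitizer.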