Description
Background: The scheduler differentiates "spinning" and "non-spinning" Ms. We will allow up to GOMAXPROCS/2
Ms to enter the spinning state (assuming all Ps are busy).
Spinning Ms are eligible to steal runnable goroutines or timers from other Ps and will loop several times over all Ps attempting to steal before giving up and stopping.
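To make the cap concrete, here is a minimal sketch of the gate involved (the counters mirror sched.nmspinning and sched.npidle, but the package, variables, and shouldSpin helper are hypothetical stand-ins for logic inside findrunnable, not the real implementation):

```go
package sched // illustrative model only, not the real runtime package

import (
	"runtime"
	"sync/atomic"
)

// Simplified stand-ins for the real scheduler counters.
var (
	nmspinning int32 // models sched.nmspinning: number of spinning Ms
	npidle     int32 // models sched.npidle: number of idle Ps
)

// shouldSpin models the GOMAXPROCS/2 cap: an M may start spinning only
// while spinning Ms number fewer than half of the busy (non-idle) Ps.
func shouldSpin() bool {
	procs := int32(runtime.GOMAXPROCS(0))
	return 2*atomic.LoadInt32(&nmspinning) < procs-atomic.LoadInt32(&npidle)
}
```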
For work conservation, the scheduler ensures that there is at least one spinning thread after readying a goroutine. To support this, spinning Ms adhere to a contract: any work submitted when sched.nmspinning > 0 is guaranteed to be discovered.
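Roughly, the wakeup side of that contract looks like the following sketch (continuing the model above; startSpinningM is a hypothetical stand-in for the machinery that actually starts an M):

```go
// wakeIfNeeded models readying-time wakeup: wake a new spinning M only
// if none exists. The CAS ensures a single waker performs the 0 -> 1
// transition; while nmspinning > 0, the contract says newly submitted
// work will be discovered by some spinning M, so no wakeup is needed.
func wakeIfNeeded() {
	if atomic.LoadInt32(&npidle) == 0 {
		return // no idle P for a new M to run on
	}
	if !atomic.CompareAndSwapInt32(&nmspinning, 0, 1) {
		return // an M is already spinning; it will find the work
	}
	startSpinningM() // hypothetical: start or wake an M in the spinning state
}
```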
To maintain this invariant, spinning Ms must:
- If work is found and this is the last spinning M, wake another M to spin.
- If work is not found and we are going to stop, perform a “delicate dance” to transition to the non-spinning state: decrement sched.nmspinning and then re-check all Ps for new work to steal, in case any was submitted since the earlier steal attempt (see the sketch after this list).
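A minimal sketch of that dance, continuing the model above (allP and hasWork are hypothetical stand-ins for iterating all Ps and checking their run queues and timers):

```go
// stopSpinning models the spinning -> non-spinning transition. The
// ordering matters: decrement first, then recheck. Work submitted after
// the decrement may observe nmspinning == 0 and wake a fresh spinning M;
// work submitted before it is caught by the recheck below.
func stopSpinning() bool {
	atomic.AddInt32(&nmspinning, -1)
	for _, p := range allP() { // hypothetical: snapshot of all Ps
		if hasWork(p) { // hypothetical: runnable G or expired timer on p
			atomic.AddInt32(&nmspinning, 1) // re-arm as spinning
			return false                    // caller jumps back to the steal loop
		}
	}
	return true // safe to park this M
}
```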
This is all good for spinning Ms, but there is a bad case for non-spinning Ms.
If there are already enough spinning Ms, the remaining Ms will skip the main steal loop and go “directly” to stop. However, the stop label is followed by the steal recheck, which is primarily intended for the spinning -> non-spinning transition but is executed by all Ms.
A non-spinning M can detect work in this loop and jump back to the top to try to steal it (we don’t steal directly in this loop to avoid duplicating the somewhat complex stealing logic). Unfortunately, this M is still non-spinning, so assuming the other spinning Ms don’t go away, it will remain non-spinning and skip right over the steal loop back to stop, at which point it will notice the same work again, jump back to the top, and start all over again.
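Putting it together, here is a hypothetical sketch of the control flow (mIsSpinning, stealWork, and park are stand-ins, the g type is a placeholder, and the bookkeeping for actually entering the spinning state is elided; the real findrunnable is considerably more involved):

```go
// findRunnable models the loop described above. When enough Ms are
// already spinning, shouldSpin fails, stealWork is skipped, and a
// non-spinning M that sees work at the recheck just bounces between
// top and the recheck without ever stealing.
func findRunnable() *g {
top:
	if mIsSpinning() || shouldSpin() {
		if gp := stealWork(); gp != nil { // the main steal loop over all Ps
			return gp
		}
	}
	// The "stop" path: recheck all Ps before parking. Intended for the
	// spinning -> non-spinning transition, but every M executes it.
	for _, p := range allP() {
		if hasWork(p) {
			goto top // a non-spinning M has no way to actually steal it
		}
	}
	park() // hypothetical: stop this M until new work wakes it
	goto top
}
```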
Note that this can only occur while there is at least one spinning M, which means that this behavior is bounded: either the spinning M will take the work for itself or will take some other work and the non-spinning M can transition to spinning and take work for itself.
I argue that the current behavior is nearly always sub-optimal:
- If there are <= sched.nmspinning total available units of work, then the spinning Ms will take all of the work and the non-spinning M will ultimately have to sleep anyway.
  - Caveat: of course, new work can be submitted at any time, so having an extra quasi-spinning M that can jump in as spinning could reduce latency.
  - On the other hand, because these “quasi-spinning” Ms aren’t tracked by the scheduler, a spinning M may needlessly wake another spinning M to replace itself even though the “quasi-spinning” M could simply take over.
- If there are > sched.nmspinning total available units of work, then the non-spinning M should eventually get some work. However, the current approach needlessly delays stealing until the existing spinning Ms have found work. I think it would be better if the non-spinning Ms could upgrade themselves to spinning (in excess of the normal GOMAXPROCS/2 limit) when they notice that there is extra work to do, as this would reduce scheduling latency without increasing work (remember that we are basically in a busy loop anyway); see the sketch after this list.
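For example (hypothetical, modifying the findRunnable sketch above), the recheck could upgrade the M instead of merely jumping back:

```go
func findRunnable() *g {
top:
	if mIsSpinning() || shouldSpin() {
		if gp := stealWork(); gp != nil {
			return gp
		}
	}
	for _, p := range allP() {
		if hasWork(p) {
			// Proposed: upgrade to spinning, even beyond the
			// GOMAXPROCS/2 cap, so that the steal loop actually
			// runs after jumping back instead of busy-looping here.
			if !mIsSpinning() {
				becomeSpinning() // hypothetical: set m.spinning, nmspinning++
			}
			goto top
		}
	}
	park()
	goto top
}
```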
I discovered this behavior on a Google-internal workload that is basically the worst-case scenario for this: one M/P/G runs continuously, generating very short-lived work that must be stolen by other Ms, while the other Ms spend most of their time in the scheduler looking for work. A simple change to make non-spinning Ms skip the steal recheck reduces application wall time by 12% and CPU time by 23%. I hope to create a similar open-source reproducer soon.
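In terms of the same sketch, the change I tested amounts to gating the recheck on whether the M was eligible to steal in the first place (again hypothetical; the real patch is against findrunnable):

```go
func findRunnable() *g {
top:
	spinning := mIsSpinning() || shouldSpin() // simplified; state updates elided
	if spinning {
		if gp := stealWork(); gp != nil {
			return gp
		}
	}
	// Tested change: only (ex-)spinning Ms run the pre-park recheck;
	// purely non-spinning Ms park immediately instead of busy-looping.
	if spinning {
		for _, p := range allP() {
			if hasWork(p) {
				goto top
			}
		}
	}
	park()
	goto top
}
```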
Note that this behavior has existed since the introduction of spinning Ms, so this is not a new regression.
Something that is not directly related but perhaps worth investigating: whether we should even perform all of the “delicate dance” checks on every spinning M. It should not affect correctness to limit them to the 1 -> 0 nmspinning transition only, which would further reduce CPU usage.
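In terms of the stopSpinning sketch above, that would mean something like the following (hypothetical), running the recheck only on the final 1 -> 0 transition:

```go
// stopSpinning variant: only the last spinning M performs the recheck.
// Intermediate transitions (e.g. 3 -> 2) skip the O(GOMAXPROCS) scan;
// correctness should only require that the last spinner not miss work.
func stopSpinning() bool {
	if atomic.AddInt32(&nmspinning, -1) > 0 {
		return true // other spinning Ms remain to discover new work
	}
	for _, p := range allP() {
		if hasWork(p) {
			atomic.AddInt32(&nmspinning, 1)
			return false // go steal again
		}
	}
	return true
}
```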