Assembler Software Pipelining
Software pipelining consists of scheduling instructions around the branch point in a loop. For example, a loop taking a checksum of an array of limbs might have a load and an add, but the load would not be for that add; it would be for the add in the next iteration of the loop. Each load is then effectively scheduled back in the previous iteration, so its latency is hidden behind the other work of that iteration.
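The checksum example can be sketched in C to show the shape of the transformation. This is only an illustration under stated assumptions: `limb_t` is a hypothetical stand-in for GMP's `mp_limb_t` (assumed 64 bits here), and a C compiler is free to reschedule either loop, so the point is the ordering of operations that one would write by hand in assembler, not the code the compiler emits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mp_limb_t; 64-bit limbs assumed. */
typedef uint64_t limb_t;

/* Plain loop: each add depends on the load immediately before it,
   so the load latency is exposed on every iteration. */
static limb_t checksum_plain(const limb_t *p, size_t n)
{
    limb_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Software-pipelined loop: the load issued in one iteration feeds
   the add in the next iteration. */
static limb_t checksum_pipelined(const limb_t *p, size_t n)
{
    if (n == 0)
        return 0;

    limb_t sum = 0;
    limb_t cur = p[0];          /* prologue: first load, no add yet */
    for (size_t i = 1; i < n; i++) {
        limb_t next = p[i];     /* load for the NEXT iteration's add */
        sum += cur;             /* add the value loaded last time round */
        cur = next;             /* register move to line up the values */
    }
    sum += cur;                 /* epilogue: last add, no more loads */
    return sum;
}
```

Note the prologue and epilogue needed to start and drain the pipeline, and the `cur = next` move at the bottom of the loop, which is the kind of register shuffling discussed below.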
Naturally this is wanted only for operations such as loads or multiplies that take several cycles to complete, and only where a CPU has multiple functional units, so that other work can proceed while waiting.
A pipeline with several stages has a data value in progress at each stage, and each loop iteration moves every value along by one stage. This is like juggling.
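A three-stage pipeline (load, multiply, add) can be sketched the same way. The operation chosen here, a sum of squares of limbs keeping only the low half of each product, is a hypothetical example picked just to give each stage some work; `limb_t` is again an assumed 64-bit stand-in for `mp_limb_t`. In the steady state one iteration adds the square of `p[i-2]`, multiplies `p[i-1]` and loads `p[i]`, so three values are "in the air" at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mp_limb_t; 64-bit limbs assumed. */
typedef uint64_t limb_t;

/* Sum of p[i]*p[i] (low halves only, wrapping mod 2^64), written as a
   three-stage software pipeline: load -> multiply -> add. */
static limb_t sumsq_pipelined(const limb_t *p, size_t n)
{
    if (n == 0)
        return 0;
    if (n == 1)
        return p[0] * p[0];

    limb_t sum = 0;

    /* Prologue: fill the pipeline. */
    limb_t loaded = p[0];           /* stage 1 holds p[0] */
    limb_t prod = loaded * loaded;  /* stage 2 holds p[0]^2 */
    loaded = p[1];                  /* stage 1 now holds p[1] */

    /* Steady state: every value moves along one stage per iteration.
       The statement order matters: each stage consumes its input
       before the earlier stage overwrites it. */
    for (size_t i = 2; i < n; i++) {
        sum += prod;                /* stage 3: add p[i-2]^2 */
        prod = loaded * loaded;     /* stage 2: multiply p[i-1] */
        loaded = p[i];              /* stage 1: load p[i] */
    }

    /* Epilogue: drain the two values still in flight. */
    sum += prod;
    sum += loaded * loaded;
    return sum;
}
```

Starting and draining take more code as the number of stages grows, which is one cost of deeper pipelines.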
Within the loop some moves between registers may be necessary to have the right values in the right places for each iteration. Loop unrolling can help here: each unrolled block can use different registers for different values, even if some shuffling is still needed just before branching back to the top of the loop.
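Applied to the pipelined checksum, unrolling by two lets the loaded values alternate between two variables, standing in for two registers, so the steady-state loop needs no move at all. As before this is a sketch assuming a 64-bit `limb_t` in place of `mp_limb_t`, and the variable names are hypothetical; in this version the only remaining move is in the tail that handles an odd leftover limb.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for mp_limb_t; 64-bit limbs assumed. */
typedef uint64_t limb_t;

/* Software-pipelined checksum unrolled by 2. Values alternate between
   a and b, so the main loop has no register-to-register move. */
static limb_t checksum_unrolled(const limb_t *p, size_t n)
{
    if (n == 0)
        return 0;

    limb_t sum = 0;
    limb_t a = p[0];            /* prologue: first load */
    size_t i;

    for (i = 1; i + 1 < n; i += 2) {
        limb_t b = p[i];        /* load into b while adding a */
        sum += a;
        a = p[i + 1];           /* load into a while adding b */
        sum += b;
    }

    if (i < n) {                /* odd leftover limb */
        limb_t b = p[i];
        sum += a;
        a = b;                  /* the one move left, outside the loop */
    }

    sum += a;                   /* epilogue: last add */
    return sum;
}
```

Larger unrolling factors give more registers to rotate through, at the cost of longer code for the loop body and the leftover cases.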