Node:Assembler Carry Propagation, Next:Assembler Cache Handling, Previous:Assembler Basics, Up:Assembler Coding
The problem that presents most challenges in GMP is propagating carries from
one limb to the next. In functions like mpn_addmul_1
and
mpn_add_n
, carries are the only dependencies between limb operations.
On processors with carry flags, a straightforward CISC style adc
is
generally best. AMD K6 mpn_addmul_1
however is an example of an
unusual set of circumstances where a branch works out better.
On RISC processors generally an add and compare for overflow is used. This
sort of thing can be seen in mpn/generic/aors_n.c
. Some carry
propagation schemes require 4 instructions, meaning at least 4 cycles per
limb, but other schemes may use just 1 or 2. On wide superscalar processors
performance may be completely determined by the number of dependent
instructions between carry-in and carry-out for each limb.
On vector processors good use can be made of the fact that a carry bit only
very rarely propagates more than one limb. When adding a single bit to a
limb, there's only a carry out if that limb was 0xFF...FF
which on
random data will be only 1 in 2^mp_bits_per_limb. mpn/cray/add_n.c
is an example of this, it adds
all limbs in parallel, adds one set of carry bits in parallel and then only
rarely needs to fall through to a loop propagating further carries.
On the x86s, GCC (as of version 2.95.2) doesn't generate particularly good code
for the RISC style idioms that are necessary to handle carry bits in
C. Often conditional jumps are generated where adc
or sbb
forms
would be better. And so unfortunately almost any loop involving carry bits
needs to be coded in assembler for best results.