Quickly find the remainder of the division of large numbers for divisors of a special kind

In this article, I will talk about one way to calculate x mod p, for a p of the form (2 ** n – omega), where omega is significantly less than 2 ** n. I will write a constant generator in Python. Here are a couple of C++ toy examples that can be exhaustively tested for all possible arguments. And as a serious check – I will calculate 97! mod(2**256 – 2**32 – 977).

Why is this needed?

If you don’t have a suitable library for long arithmetic (or wonderful Python) at hand;
If only additions, multiplications and bit shifts are available to you, but you need to implement the calculation of the remainder of the division;
To optimize for specific divisors, if the speed of the universal libraries does not suit you.

Theory

Given:

Divisor p of the form (2 ** n – omega), and omega is much smaller than (2 ** n);
Divisible x, size m bits, and m > n;
Bitword size s, s <= n, n and m are evenly divisible by s.

First, note that (2 ** n) mod p = (2 ** n) mod (2 ** n – omega) == omega.

Let x_low be the low n bits of x, and x_high be the high (m – n) bits of x, i.e. x can be represented as x_low + x_high * (2 ** n).

Then x mod p = (x_low + x_high * (2**n)) mod p == x_low + x_high * omega (mod p). Using this formula does not guarantee that the result will be in the range [0; p – 1]however, with a sufficiently large number of recursive applications, the result will be in the range [0; (2 ** n) – 1]. With a probability of about omega / (2 ** n) it will be necessary to subtract p from the result in order for it to be in the range [0; p – 1].

Let’s split x into words w[i] of size s bits: x = sum(i = 0 .. (m / s – 1); w[i] *2**(i*s)). This representation of the dividend can also be applied to the formula x mod p == x_low + x_high * omega (mod p); as a result, the exponent of higher words will decrease, and the coefficient of the word will be multiplied by omega. By applying the formula recursively a sufficient number of times, we can ensure that the coefficient for each bit word is less than (2 ** n). That is, the representation x mod p will be obtained as a sum of bit words w[i] of size s bits, multiplied by coefficients of size no more than n bits.

Coefficient Generator in Python

def build_reducer(input_bits, target_bits, limb_size_bits, omega):
  num_input_limbs = input_bits // limb_size_bits
  target_limit = 2 ** target_bits

  def split_and_reduce(coeff):
    low_part = coeff % target_limit
    high_part = coeff // target_limit
    return low_part + high_part * omega

  coeffs = [2 ** (limb_size_bits * i) for i in range(num_input_limbs)]

  while any([k >= target_limit for k in coeffs]):
    coeffs = [split_and_reduce(k) for k in coeffs]

  return coeffs

def print_reducer(coeffs, target_bits):
  for k in coeffs:
    print('%0*x' % (target_bits // 4, k))

Arguments:

input_bits – number of input bits (m);
target_bits – desired number of output bits (n);
limb_size_bits – size of a bit word (s);
omega is one of the constants that specifies the divisor (2 ** n – omega).

Principle of operation:

Determine the number of bit words of size s needed to represent a number of m bits;
We find the power of two (2 ** n), along the boundary of which we will split each coefficient into “lower” (1..n) and “higher” (n+1..m) bits;
We find the powers of two, which are the coefficients of bit words in the decomposition of the dividend;
As long as at least one of the coefficients exceeds (2 ** n) – we divide such coefficients into “lower” and “higher” bits and reassemble a new coefficient from them: (“lower bits” + “higher bits” * omega).

Examples of the coefficient generator:

Coefficients for reducing a 32-bit number to an 8-bit number modulo (2 ** 8 – 17) = 239; word size – 8 bits:

>>> r = build_reducer(32, 8, 8, 17)
>>> print_reducer(r, 8)
01
11
32
85

Coefficients for reducing a 32-bit number to a 16-bit number modulo (2**16 – 666) = 64870; word size – 8 bits:

>>> r = build_reducer(32, 16, 8, 666)
>>> print_reducer(r, 16)
0001
0100
029a
9f34

Coefficients for reducing a 512-bit number to a 256-bit number modulo (2**256 – 2**32 – 977); word size – 32 bits; for convenience, the bits are grouped by 32 bits:

>>> r = build_reducer(512, 256, 32, 2 ** 32 + 977)
>>> print_reducer(r, 256)
00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000001
00000000_00000000_00000000_00000000_00000000_00000000_00000001_00000000
00000000_00000000_00000000_00000000_00000000_00000001_00000000_00000000
00000000_00000000_00000000_00000000_00000001_00000000_00000000_00000000
00000000_00000000_00000000_00000001_00000000_00000000_00000000_00000000
00000000_00000000_00000001_00000000_00000000_00000000_00000000_00000000
00000000_00000001_00000000_00000000_00000000_00000000_00000000_00000000
00000001_00000000_00000000_00000000_00000000_00000000_00000000_00000000
00000000_00000000_00000000_00000000_00000000_00000000_00000001_000003d1
00000000_00000000_00000000_00000000_00000000_00000001_000003d1_00000000
00000000_00000000_00000000_00000000_00000001_000003d1_00000000_00000000
00000000_00000000_00000000_00000001_000003d1_00000000_00000000_00000000
00000000_00000000_00000001_000003d1_00000000_00000000_00000000_00000000
00000000_00000001_000003d1_00000000_00000000_00000000_00000000_00000000
00000001_000003d1_00000000_00000000_00000000_00000000_00000000_00000000
000003d1_00000000_00000000_00000000_00000000_00000000_00000001_000003d1

Coefficients for reducing a 512-bit number to a 256-bit number modulo (2**256 – 2**32 – 977); word size – 64 bits; for convenience, the bits are grouped by 64 bits:

>>> r = build_reducer(512, 256, 64, 2 ** 32 + 977)
>>> print_reducer(r, 256)
0000000000000000_0000000000000000_0000000000000000_0000000000000001
0000000000000000_0000000000000000_0000000000000001_0000000000000000
0000000000000000_0000000000000001_0000000000000000_0000000000000000
0000000000000001_0000000000000000_0000000000000000_0000000000000000
0000000000000000_0000000000000000_0000000000000000_00000001000003d1
0000000000000000_0000000000000000_00000001000003d1_0000000000000000
0000000000000000_00000001000003d1_0000000000000000_0000000000000000
00000001000003d1_0000000000000000_0000000000000000_0000000000000000

Coefficients for reducing a 512-bit number to a 256-bit number modulo (2**256 – 432420386565659656852420866394968145599); word size – 32 bits; for convenience, the bits are grouped by 32 bits:

>>> r = build_reducer(512, 256, 32, 432420386565659656852420866394968145599)
>>> print_reducer(r, 256)
00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000001
00000000_00000000_00000000_00000000_00000000_00000000_00000001_00000000
00000000_00000000_00000000_00000000_00000000_00000001_00000000_00000000
00000000_00000000_00000000_00000000_00000001_00000000_00000000_00000000
00000000_00000000_00000000_00000001_00000000_00000000_00000000_00000000
00000000_00000000_00000001_00000000_00000000_00000000_00000000_00000000
00000000_00000001_00000000_00000000_00000000_00000000_00000000_00000000
00000001_00000000_00000000_00000000_00000000_00000000_00000000_00000000
00000000_00000000_00000000_00000001_45512319_50b75fc4_402da173_2fc9bebf
00000000_00000000_00000001_45512319_50b75fc4_402da173_2fc9bebf_00000000
00000000_00000001_45512319_50b75fc4_402da173_2fc9bebf_00000000_00000000
00000001_45512319_50b75fc4_402da173_2fc9bebf_00000000_00000000_00000000
45512319_50b75fc4_402da173_2fc9bec0_45512319_50b75fc4_402da173_2fc9bebf
50b75fc4_402da173_2fc9bec0_9d671cd5_1b343a1b_66926b57_d2a4c1c6_1536bda7
402da173_2fc9bec0_9d671cd5_81c69bc5_9509b0b0_74ec0aea_8f564d66_7ec7eb3c
2fc9bec0_9d671cd5_81c69bc5_e697f5e4_1f12c33a_0a7b6f4e_3302b92e_a029cecd

Coefficients for reducing a 512-bit number to a 256-bit number modulo (2**256 – 432420386565659656852420866394968145599); word size – 64 bits; for convenience, the bits are grouped by 64 bits:

>>> r = build_reducer(512, 256, 64, 432420386565659656852420866394968145599)
>>> print_reducer(r, 256)
0000000000000000_0000000000000000_0000000000000000_0000000000000001
0000000000000000_0000000000000000_0000000000000001_0000000000000000
0000000000000000_0000000000000001_0000000000000000_0000000000000000
0000000000000001_0000000000000000_0000000000000000_0000000000000000
0000000000000000_0000000000000001_4551231950b75fc4_402da1732fc9bebf
0000000000000001_4551231950b75fc4_402da1732fc9bebf_0000000000000000
4551231950b75fc4_402da1732fc9bec0_4551231950b75fc4_402da1732fc9bebf
402da1732fc9bec0_9d671cd581c69bc5_9509b0b074ec0aea_8f564d667ec7eb3c

ps (2 ** 256 – 2 ** 32 – 977) – parameter p of the SECP256K1 elliptic curve; (2**256 – 432420386565659656852420866394968145599) is the N parameter of the SECP256K1 elliptic curve. (Cm. https://en.bitcoin.it/wiki/Secp256k1)

What can be said by looking at these coefficients:

The fewer single bits in the representation of the omega number, the simpler the coefficients look;
The smaller the size of the bit word, the faster the transfer to the least significant bits, and thus the length of the bit representation x mod p is reduced;
The coefficients at the least significant bit words contain one single bit at positions corresponding to the beginnings of the bit words. Those. the first n bits of the number are taken as is, and groups of high bits are added to them, multiplied by the calculated coefficients.

Toy examples

For the first coefficient example:

const uint32_t modulus = (1 << 8) - 17;

uint32_t mod_exact(uint32_t x) {
    return x % modulus;
}

uint32_t mod_manual(uint32_t x) {
    while (x & 0xFFFFFF00) {
        uint32_t buffer = 0;

        buffer += 0x01 * (x & 0xFF); x >>= 8;
        buffer += 0x11 * (x & 0xFF); x >>= 8;
        buffer += 0x32 * (x & 0xFF); x >>= 8;
        buffer += 0x85 * (x & 0xFF); x >>= 8;

        x = buffer;
    }
    return (x < modulus) ? x : (x - modulus);
}

For the second coefficient example:

const uint32_t modulus = (1 << 16) - 666;

uint32_t mod_exact(uint32_t x) {
    return x % modulus;
}

uint32_t mod_manual(uint32_t x) {
    while (x & 0xFFFF0000) {
        uint32_t buffer = (x & 0xFFFF); x >>= 16;
        buffer += 0x029a * (x & 0xFF); x >>= 8;
        buffer += 0x9f34 * (x & 0xFF); x >>= 8;

        x = buffer;
    }
    return (x < modulus) ? x : (x - modulus);
}

A program to exhaustively test the two examples above:

#include <iostream>

// put some implementation of mod_exact() and mod_manual() here

int main() {
    uint32_t number = 0, fails = 0;
    do {
        uint32_t exact = mod_exact(number);
        uint32_t manual = mod_manual(number);

        fails += ((exact == manual) ? 0 : 1);
        if ((number & 0x00FFFFFF) == 0) {
            std::cout << number << "; fails=" << fails << std::endl;
        }

        number++;
    } while (number != 0);

    std::cout << "done; fails=" << fails << std::endl;
    return 0;
}

Serious example: 97! mod(2256 – 232 – 977)

Let’s take the third example of the operation of the coefficient generator:

3. Коэффициенты для сокращения 512-битного числа до 256-битного по модулю (2 ** 256 - 2 ** 32 - 977); размер слова - 32 бита; для удобства разряды сгруппированы по 32 бита:
>>> r = build_reducer(512, 256, 32, 2 ** 32 + 977)
>>> print_reducer(r, 256)
00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000001
00000000_00000000_00000000_00000000_00000000_00000000_00000001_00000000
00000000_00000000_00000000_00000000_00000000_00000001_00000000_00000000
00000000_00000000_00000000_00000000_00000001_00000000_00000000_00000000
00000000_00000000_00000000_00000001_00000000_00000000_00000000_00000000
00000000_00000000_00000001_00000000_00000000_00000000_00000000_00000000
00000000_00000001_00000000_00000000_00000000_00000000_00000000_00000000
00000001_00000000_00000000_00000000_00000000_00000000_00000000_00000000
00000000_00000000_00000000_00000000_00000000_00000000_00000001_000003d1
00000000_00000000_00000000_00000000_00000000_00000001_000003d1_00000000
00000000_00000000_00000000_00000000_00000001_000003d1_00000000_00000000
00000000_00000000_00000000_00000001_000003d1_00000000_00000000_00000000
00000000_00000000_00000001_000003d1_00000000_00000000_00000000_00000000
00000000_00000001_000003d1_00000000_00000000_00000000_00000000_00000000
00000001_000003d1_00000000_00000000_00000000_00000000_00000000_00000000
000003d1_00000000_00000000_00000000_00000000_00000000_00000001_000003d1

We divide the input 512-bit number into 16 groups of 32 bits each, denote them w[0..15]. Then, as a result of multiplying these groups by non-zero pieces of coefficients, we obtain the following representation x mod p:

    (2 ** 0) *   (w[0] +         977 * w[8] + 977 * w[15]) +
    (2 ** 32) *  (w[1] + w[8] +  977 * w[9] +       w[15]) +
    (2 ** 64) *  (w[2] + w[9] +  977 * w[10]) +
    (2 ** 96) *  (w[3] + w[10] + 977 * w[11]) +
    (2 ** 128) * (w[4] + w[11] + 977 * w[12]) +
    (2 ** 160) * (w[5] + w[12] + 977 * w[13]) +
    (2 ** 192) * (w[6] + w[13] + 977 * w[14]) +
    (2 ** 224) * (w[7] + w[14] + 977 * w[15])

For each power of two (corresponding to one bit word of the result), there is a sum of numbers, which must certainly fit into 64 bits, taking into account the transfer to high bits. Powers of two (2 ** 256) and higher are absent, i.e. taking into account the possible transfer, the result will consist of a maximum of 9 bit words of 32 bits each, and the rest will be equal to zero. Thus, for subsequent iterations, simplified expressions can be used:

    (2 ** 0) *  (w[0] + 977 * w[8]) +
    (2 ** 32) * (w[1] +       w[8]) +
    (2 ** 64) *  w[2] +
    (2 ** 96) *  w[3] +
    (2 ** 128) * w[4] +
    (2 ** 160) * w[5] +
    (2 ** 192) * w[6] +
    (2 ** 224) * w[7]

Using these results, we write a function to reduce a 512-bit number to a 256-bit number modulo (2 ** 256 – 2 ** 32 – 977):

void reduce_512(const uint32_t x[16], uint32_t y[8]) {
    // check if any of bits[257..512] is set
    uint32_t mask = 0;
    for (int i = 8; i < 16; mask |= x[i++]);
    if (mask == 0) return;

    uint64_t buffer = 0;

    // stage 1: reduce 16 limbs into 9
    // (2 ** 0) *   (w[0] +         977 * w[8] + 977 * w[15]) +
    buffer += (uint64_t)x[0] + 977 * (uint64_t)x[8] + 977 * (uint64_t)x[15];
    y[0] = buffer & 0xFFFFFFFF; buffer >>= 32;
    // (2 ** 32) *  (w[1] + w[8] +  977 * w[9] +       w[15]) +
    buffer += (uint64_t)x[1] + (uint64_t)x[8] + 977 * (uint64_t)x[9] + (uint64_t)x[15];
    y[1] = buffer & 0xFFFFFFFF; buffer >>= 32;
    // (2 ** 64) *  (w[2] + w[9] +  977 * w[10]) +
    buffer += (uint64_t)x[2] + (uint64_t)x[9] + 977 * (uint64_t)x[10];
    y[2] = buffer & 0xFFFFFFFF; buffer >>= 32;
    // (2 ** 96) *  (w[3] + w[10] + 977 * w[11]) +
    buffer += (uint64_t)x[3] + (uint64_t)x[10] + 977 * (uint64_t)x[11];
    y[3] = buffer & 0xFFFFFFFF; buffer >>= 32;
    // (2 ** 128) * (w[4] + w[11] + 977 * w[12]) +
    buffer += (uint64_t)x[4] + (uint64_t)x[11] + 977 * (uint64_t)x[12];
    y[4] = buffer & 0xFFFFFFFF; buffer >>= 32;
    // (2 ** 160) * (w[5] + w[12] + 977 * w[13]) +
    buffer += (uint64_t)x[5] + (uint64_t)x[12] + 977 * (uint64_t)x[13];
    y[5] = buffer & 0xFFFFFFFF; buffer >>= 32;
    // (2 ** 192) * (w[6] + w[13] + 977 * w[14]) +
    buffer += (uint64_t)x[6] + (uint64_t)x[13] + 977 * (uint64_t)x[14];
    y[6] = buffer & 0xFFFFFFFF; buffer >>= 32;
    // (2 ** 224) * (w[7] + w[14] + 977 * w[15])
    buffer += (uint64_t)x[7] + (uint64_t)x[14] + 977 * (uint64_t)x[15];
    y[7] = buffer & 0xFFFFFFFF; buffer >>= 32;

    // stage 2: reduce 9 limbs into 8
    while (buffer != 0) {
        // save 9th limb (10 bits max) and flush buffer
        const uint64_t overflow256 = buffer & 0xFFFFFFFF; buffer >>= 32;
        assert(buffer == 0);

        // (2 ** 0) *  (w[0] + 977 * w[8]) +
        buffer += (uint64_t)y[0] + 977 * overflow256;
        y[0] = buffer & 0xFFFFFFFF; buffer >>= 32;
        // (2 ** 32)*  (w[1] +       w[8]) +
        buffer += (uint64_t)y[1] + overflow256;
        y[1] = buffer & 0xFFFFFFFF; buffer >>= 32;
        // (2 ** 64) *  w[2] +
        buffer += (uint64_t)y[2];
        y[2] = buffer & 0xFFFFFFFF; buffer >>= 32;
        // (2 ** 96) *  w[3] +
        buffer += (uint64_t)y[3];
        y[3] = buffer & 0xFFFFFFFF; buffer >>= 32;
        // (2 ** 128) * w[4] +
        buffer += (uint64_t)y[4];
        y[4] = buffer & 0xFFFFFFFF; buffer >>= 32;
        // (2 ** 160) * w[5] +
        buffer += (uint64_t)y[5];
        y[5] = buffer & 0xFFFFFFFF; buffer >>= 32;
        // (2 ** 192) * w[6] +
        buffer += (uint64_t)y[6];
        y[6] = buffer & 0xFFFFFFFF; buffer >>= 32;
        // (2 ** 224) * w[7]
        buffer += (uint64_t)y[7];
        y[7] = buffer & 0xFFFFFFFF; buffer >>= 32;
    }
    
    // 512 bits are now reduced to 256 bits, however the result may exceed
    // (2^256 - 2^32 - 977) and an additional subtraction may be necessary
}

A test program with a pair of factorial 97 numbers (number about 500 bits in size) and the correct calculation result (97! mod (2 ** 256 – 2 ** 32 – 977)), necessary to check the results of the function:

#include <iostream>
#include <cassert>

void reduce_512(const uint32_t x[16], uint32_t y[8]) {
  // put function code here
}

int main() {
    // math.factorial(97), this is about 500 bits long
    // least significant limbs go first
    const uint32_t x[16] = {
        0x00000000, 0x00000000, 0xc0000000, 0xc63bc975,
        0xcb0e1818, 0xfe74c03b, 0x13559f1a, 0xca00bb56,
        0xef9d44bc, 0xf57bf161, 0xf3e3d5c3, 0xab918234,
        0xb69daa20, 0x4532ed8b, 0xafb0a77f, 0x01d62e2f
    };
    // math.factorial(97) % (2 ** 256 - 2 ** 32 - 977)
    // least significant limbs go first
    const uint32_t y_expected[8] = {
        0x7999b163, 0xcf77a9bd, 0x7dfec23d, 0x80718b50,
        0x6655e0fc, 0xcc6efc90, 0xd9b7c95d, 0x7c17a6d2
    };

    uint32_t y[8];
    reduce_512(x, y); // stage 2 is ran once for this input

    // print/verify
    bool ok = true;
    for (int i = 0; i < 8; ok &= (y[i] == y_expected[i]), i++) {
        std::cout << std::hex << y[i] << std::endl;
    }
    std::cout << (ok ? "Ok" : "Failed") << std::endl;

    return 0;
}

ps All examples tested in Microsoft Visual Studio 2022 (v17.5.4)