# Quickly find the remainder of the division of large numbers for divisors of a special kind

In this article, I will talk about one way to calculate x mod p, for a p of the form (2 ** n – omega), where omega is significantly less than 2 ** n. I will write a constant generator in Python. Here are a couple of C++ toy examples that can be exhaustively tested for all possible arguments. And as a serious check – I will calculate 97! mod(2**256 – 2**32 – 977).

## Why is this needed?

• If you don’t have a suitable library for long arithmetic (or wonderful Python) at hand;

• If only additions, multiplications and bit shifts are available to you, but you need to implement the calculation of the remainder of the division;

• To optimize for specific divisors, if the speed of the universal libraries does not suit you.

## Theory

Given:

• Divisor p of the form (2 ** n – omega), and omega is much smaller than (2 ** n);

• Divisible x, size m bits, and m > n;

• Bitword size s, s <= n, n and m are evenly divisible by s.

First, note that (2 ** n) mod p = (2 ** n) mod (2 ** n – omega) == omega.

Let x_low be the low n bits of x, and x_high be the high (m – n) bits of x, i.e. x can be represented as x_low + x_high * (2 ** n).

Then x mod p = (x_low + x_high * (2**n)) mod p == x_low + x_high * omega (mod p). Using this formula does not guarantee that the result will be in the range [0; p – 1]however, with a sufficiently large number of recursive applications, the result will be in the range [0; (2 ** n) – 1]. With a probability of about omega / (2 ** n) it will be necessary to subtract p from the result in order for it to be in the range [0; p – 1].

Let’s split x into words w[i] of size s bits: x = sum(i = 0 .. (m / s – 1); w[i] *2**(i*s)). This representation of the dividend can also be applied to the formula x mod p == x_low + x_high * omega (mod p); as a result, the exponent of higher words will decrease, and the coefficient of the word will be multiplied by omega. By applying the formula recursively a sufficient number of times, we can ensure that the coefficient for each bit word is less than (2 ** n). That is, the representation x mod p will be obtained as a sum of bit words w[i] of size s bits, multiplied by coefficients of size no more than n bits.

## Coefficient Generator in Python

``````def build_reducer(input_bits, target_bits, limb_size_bits, omega):
num_input_limbs = input_bits // limb_size_bits
target_limit = 2 ** target_bits

def split_and_reduce(coeff):
low_part = coeff % target_limit
high_part = coeff // target_limit
return low_part + high_part * omega

coeffs = [2 ** (limb_size_bits * i) for i in range(num_input_limbs)]

while any([k >= target_limit for k in coeffs]):
coeffs = [split_and_reduce(k) for k in coeffs]

return coeffs

def print_reducer(coeffs, target_bits):
for k in coeffs:
print('%0*x' % (target_bits // 4, k))``````

Arguments:

• input_bits – number of input bits (m);

• target_bits – desired number of output bits (n);

• limb_size_bits – size of a bit word (s);

• omega is one of the constants that specifies the divisor (2 ** n – omega).

Principle of operation:

1. Determine the number of bit words of size s needed to represent a number of m bits;

2. We find the power of two (2 ** n), along the boundary of which we will split each coefficient into “lower” (1..n) and “higher” (n+1..m) bits;

3. We find the powers of two, which are the coefficients of bit words in the decomposition of the dividend;

4. As long as at least one of the coefficients exceeds (2 ** n) – we divide such coefficients into “lower” and “higher” bits and reassemble a new coefficient from them: (“lower bits” + “higher bits” * omega).

Examples of the coefficient generator:

1. Coefficients for reducing a 32-bit number to an 8-bit number modulo (2 ** 8 – 17) = 239; word size – 8 bits:

``````>>> r = build_reducer(32, 8, 8, 17)
>>> print_reducer(r, 8)
01
11
32
85``````
1. Coefficients for reducing a 32-bit number to a 16-bit number modulo (2**16 – 666) = 64870; word size – 8 bits:

``````>>> r = build_reducer(32, 16, 8, 666)
>>> print_reducer(r, 16)
0001
0100
029a
9f34``````
1. Coefficients for reducing a 512-bit number to a 256-bit number modulo (2**256 – 2**32 – 977); word size – 32 bits; for convenience, the bits are grouped by 32 bits:

``````>>> r = build_reducer(512, 256, 32, 2 ** 32 + 977)
>>> print_reducer(r, 256)
00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000001
00000000_00000000_00000000_00000000_00000000_00000000_00000001_00000000
00000000_00000000_00000000_00000000_00000000_00000001_00000000_00000000
00000000_00000000_00000000_00000000_00000001_00000000_00000000_00000000
00000000_00000000_00000000_00000001_00000000_00000000_00000000_00000000
00000000_00000000_00000001_00000000_00000000_00000000_00000000_00000000
00000000_00000001_00000000_00000000_00000000_00000000_00000000_00000000
00000001_00000000_00000000_00000000_00000000_00000000_00000000_00000000
00000000_00000000_00000000_00000000_00000000_00000000_00000001_000003d1
00000000_00000000_00000000_00000000_00000000_00000001_000003d1_00000000
00000000_00000000_00000000_00000000_00000001_000003d1_00000000_00000000
00000000_00000000_00000000_00000001_000003d1_00000000_00000000_00000000
00000000_00000000_00000001_000003d1_00000000_00000000_00000000_00000000
00000000_00000001_000003d1_00000000_00000000_00000000_00000000_00000000
00000001_000003d1_00000000_00000000_00000000_00000000_00000000_00000000
000003d1_00000000_00000000_00000000_00000000_00000000_00000001_000003d1``````
1. Coefficients for reducing a 512-bit number to a 256-bit number modulo (2**256 – 2**32 – 977); word size – 64 bits; for convenience, the bits are grouped by 64 bits:

``````>>> r = build_reducer(512, 256, 64, 2 ** 32 + 977)
>>> print_reducer(r, 256)
0000000000000000_0000000000000000_0000000000000000_0000000000000001
0000000000000000_0000000000000000_0000000000000001_0000000000000000
0000000000000000_0000000000000001_0000000000000000_0000000000000000
0000000000000001_0000000000000000_0000000000000000_0000000000000000
0000000000000000_0000000000000000_0000000000000000_00000001000003d1
0000000000000000_0000000000000000_00000001000003d1_0000000000000000
0000000000000000_00000001000003d1_0000000000000000_0000000000000000
00000001000003d1_0000000000000000_0000000000000000_0000000000000000``````
1. Coefficients for reducing a 512-bit number to a 256-bit number modulo (2**256 – 432420386565659656852420866394968145599); word size – 32 bits; for convenience, the bits are grouped by 32 bits:

``````>>> r = build_reducer(512, 256, 32, 432420386565659656852420866394968145599)
>>> print_reducer(r, 256)
00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000001
00000000_00000000_00000000_00000000_00000000_00000000_00000001_00000000
00000000_00000000_00000000_00000000_00000000_00000001_00000000_00000000
00000000_00000000_00000000_00000000_00000001_00000000_00000000_00000000
00000000_00000000_00000000_00000001_00000000_00000000_00000000_00000000
00000000_00000000_00000001_00000000_00000000_00000000_00000000_00000000
00000000_00000001_00000000_00000000_00000000_00000000_00000000_00000000
00000001_00000000_00000000_00000000_00000000_00000000_00000000_00000000
00000000_00000000_00000000_00000001_45512319_50b75fc4_402da173_2fc9bebf
00000000_00000000_00000001_45512319_50b75fc4_402da173_2fc9bebf_00000000
00000000_00000001_45512319_50b75fc4_402da173_2fc9bebf_00000000_00000000
00000001_45512319_50b75fc4_402da173_2fc9bebf_00000000_00000000_00000000
45512319_50b75fc4_402da173_2fc9bec0_45512319_50b75fc4_402da173_2fc9bebf
50b75fc4_402da173_2fc9bec0_9d671cd5_1b343a1b_66926b57_d2a4c1c6_1536bda7
402da173_2fc9bec0_9d671cd5_81c69bc5_9509b0b0_74ec0aea_8f564d66_7ec7eb3c
2fc9bec0_9d671cd5_81c69bc5_e697f5e4_1f12c33a_0a7b6f4e_3302b92e_a029cecd``````
1. Coefficients for reducing a 512-bit number to a 256-bit number modulo (2**256 – 432420386565659656852420866394968145599); word size – 64 bits; for convenience, the bits are grouped by 64 bits:

``````>>> r = build_reducer(512, 256, 64, 432420386565659656852420866394968145599)
>>> print_reducer(r, 256)
0000000000000000_0000000000000000_0000000000000000_0000000000000001
0000000000000000_0000000000000000_0000000000000001_0000000000000000
0000000000000000_0000000000000001_0000000000000000_0000000000000000
0000000000000001_0000000000000000_0000000000000000_0000000000000000
0000000000000000_0000000000000001_4551231950b75fc4_402da1732fc9bebf
0000000000000001_4551231950b75fc4_402da1732fc9bebf_0000000000000000
4551231950b75fc4_402da1732fc9bec0_4551231950b75fc4_402da1732fc9bebf
402da1732fc9bec0_9d671cd581c69bc5_9509b0b074ec0aea_8f564d667ec7eb3c``````

ps (2 ** 256 – 2 ** 32 – 977) – parameter p of the SECP256K1 elliptic curve; (2**256 – 432420386565659656852420866394968145599) is the N parameter of the SECP256K1 elliptic curve. (Cm. https://en.bitcoin.it/wiki/Secp256k1)

What can be said by looking at these coefficients:

1. The fewer single bits in the representation of the omega number, the simpler the coefficients look;

2. The smaller the size of the bit word, the faster the transfer to the least significant bits, and thus the length of the bit representation x mod p is reduced;

3. The coefficients at the least significant bit words contain one single bit at positions corresponding to the beginnings of the bit words. Those. the first n bits of the number are taken as is, and groups of high bits are added to them, multiplied by the calculated coefficients.

## Toy examples

For the first coefficient example:

``````const uint32_t modulus = (1 << 8) - 17;

uint32_t mod_exact(uint32_t x) {
return x % modulus;
}

uint32_t mod_manual(uint32_t x) {
while (x & 0xFFFFFF00) {
uint32_t buffer = 0;

buffer += 0x01 * (x & 0xFF); x >>= 8;
buffer += 0x11 * (x & 0xFF); x >>= 8;
buffer += 0x32 * (x & 0xFF); x >>= 8;
buffer += 0x85 * (x & 0xFF); x >>= 8;

x = buffer;
}
return (x < modulus) ? x : (x - modulus);
}``````

For the second coefficient example:

``````const uint32_t modulus = (1 << 16) - 666;

uint32_t mod_exact(uint32_t x) {
return x % modulus;
}

uint32_t mod_manual(uint32_t x) {
while (x & 0xFFFF0000) {
uint32_t buffer = (x & 0xFFFF); x >>= 16;
buffer += 0x029a * (x & 0xFF); x >>= 8;
buffer += 0x9f34 * (x & 0xFF); x >>= 8;

x = buffer;
}
return (x < modulus) ? x : (x - modulus);
}``````

A program to exhaustively test the two examples above:

``````#include <iostream>

// put some implementation of mod_exact() and mod_manual() here

int main() {
uint32_t number = 0, fails = 0;
do {
uint32_t exact = mod_exact(number);
uint32_t manual = mod_manual(number);

fails += ((exact == manual) ? 0 : 1);
if ((number & 0x00FFFFFF) == 0) {
std::cout << number << "; fails=" << fails << std::endl;
}

number++;
} while (number != 0);

std::cout << "done; fails=" << fails << std::endl;
return 0;
}``````

## Serious example: 97! mod(2**256 – 2**32 – 977)

Let’s take the third example of the operation of the coefficient generator:

``````3. Коэффициенты для сокращения 512-битного числа до 256-битного по модулю (2 ** 256 - 2 ** 32 - 977); размер слова - 32 бита; для удобства разряды сгруппированы по 32 бита:
>>> r = build_reducer(512, 256, 32, 2 ** 32 + 977)
>>> print_reducer(r, 256)
00000000_00000000_00000000_00000000_00000000_00000000_00000000_00000001
00000000_00000000_00000000_00000000_00000000_00000000_00000001_00000000
00000000_00000000_00000000_00000000_00000000_00000001_00000000_00000000
00000000_00000000_00000000_00000000_00000001_00000000_00000000_00000000
00000000_00000000_00000000_00000001_00000000_00000000_00000000_00000000
00000000_00000000_00000001_00000000_00000000_00000000_00000000_00000000
00000000_00000001_00000000_00000000_00000000_00000000_00000000_00000000
00000001_00000000_00000000_00000000_00000000_00000000_00000000_00000000
00000000_00000000_00000000_00000000_00000000_00000000_00000001_000003d1
00000000_00000000_00000000_00000000_00000000_00000001_000003d1_00000000
00000000_00000000_00000000_00000000_00000001_000003d1_00000000_00000000
00000000_00000000_00000000_00000001_000003d1_00000000_00000000_00000000
00000000_00000000_00000001_000003d1_00000000_00000000_00000000_00000000
00000000_00000001_000003d1_00000000_00000000_00000000_00000000_00000000
00000001_000003d1_00000000_00000000_00000000_00000000_00000000_00000000
000003d1_00000000_00000000_00000000_00000000_00000000_00000001_000003d1``````

We divide the input 512-bit number into 16 groups of 32 bits each, denote them w[0..15]. Then, as a result of multiplying these groups by non-zero pieces of coefficients, we obtain the following representation x mod p:

``````    (2 ** 0) *   (w[0] +         977 * w[8] + 977 * w[15]) +
(2 ** 32) *  (w[1] + w[8] +  977 * w[9] +       w[15]) +
(2 ** 64) *  (w[2] + w[9] +  977 * w[10]) +
(2 ** 96) *  (w[3] + w[10] + 977 * w[11]) +
(2 ** 128) * (w[4] + w[11] + 977 * w[12]) +
(2 ** 160) * (w[5] + w[12] + 977 * w[13]) +
(2 ** 192) * (w[6] + w[13] + 977 * w[14]) +
(2 ** 224) * (w[7] + w[14] + 977 * w[15])``````

For each power of two (corresponding to one bit word of the result), there is a sum of numbers, which must certainly fit into 64 bits, taking into account the transfer to high bits. Powers of two (2 ** 256) and higher are absent, i.e. taking into account the possible transfer, the result will consist of a maximum of 9 bit words of 32 bits each, and the rest will be equal to zero. Thus, for subsequent iterations, simplified expressions can be used:

``````    (2 ** 0) *  (w[0] + 977 * w[8]) +
(2 ** 32) * (w[1] +       w[8]) +
(2 ** 64) *  w[2] +
(2 ** 96) *  w[3] +
(2 ** 128) * w[4] +
(2 ** 160) * w[5] +
(2 ** 192) * w[6] +
(2 ** 224) * w[7]``````

Using these results, we write a function to reduce a 512-bit number to a 256-bit number modulo (2 ** 256 – 2 ** 32 – 977):

``````void reduce_512(const uint32_t x[16], uint32_t y[8]) {
// check if any of bits[257..512] is set
uint32_t mask = 0;
for (int i = 8; i < 16; mask |= x[i++]);
if (mask == 0) return;

uint64_t buffer = 0;

// stage 1: reduce 16 limbs into 9
// (2 ** 0) *   (w[0] +         977 * w[8] + 977 * w[15]) +
buffer += (uint64_t)x[0] + 977 * (uint64_t)x[8] + 977 * (uint64_t)x[15];
y[0] = buffer & 0xFFFFFFFF; buffer >>= 32;
// (2 ** 32) *  (w[1] + w[8] +  977 * w[9] +       w[15]) +
buffer += (uint64_t)x[1] + (uint64_t)x[8] + 977 * (uint64_t)x[9] + (uint64_t)x[15];
y[1] = buffer & 0xFFFFFFFF; buffer >>= 32;
// (2 ** 64) *  (w[2] + w[9] +  977 * w[10]) +
buffer += (uint64_t)x[2] + (uint64_t)x[9] + 977 * (uint64_t)x[10];
y[2] = buffer & 0xFFFFFFFF; buffer >>= 32;
// (2 ** 96) *  (w[3] + w[10] + 977 * w[11]) +
buffer += (uint64_t)x[3] + (uint64_t)x[10] + 977 * (uint64_t)x[11];
y[3] = buffer & 0xFFFFFFFF; buffer >>= 32;
// (2 ** 128) * (w[4] + w[11] + 977 * w[12]) +
buffer += (uint64_t)x[4] + (uint64_t)x[11] + 977 * (uint64_t)x[12];
y[4] = buffer & 0xFFFFFFFF; buffer >>= 32;
// (2 ** 160) * (w[5] + w[12] + 977 * w[13]) +
buffer += (uint64_t)x[5] + (uint64_t)x[12] + 977 * (uint64_t)x[13];
y[5] = buffer & 0xFFFFFFFF; buffer >>= 32;
// (2 ** 192) * (w[6] + w[13] + 977 * w[14]) +
buffer += (uint64_t)x[6] + (uint64_t)x[13] + 977 * (uint64_t)x[14];
y[6] = buffer & 0xFFFFFFFF; buffer >>= 32;
// (2 ** 224) * (w[7] + w[14] + 977 * w[15])
buffer += (uint64_t)x[7] + (uint64_t)x[14] + 977 * (uint64_t)x[15];
y[7] = buffer & 0xFFFFFFFF; buffer >>= 32;

// stage 2: reduce 9 limbs into 8
while (buffer != 0) {
// save 9th limb (10 bits max) and flush buffer
const uint64_t overflow256 = buffer & 0xFFFFFFFF; buffer >>= 32;
assert(buffer == 0);

// (2 ** 0) *  (w[0] + 977 * w[8]) +
buffer += (uint64_t)y[0] + 977 * overflow256;
y[0] = buffer & 0xFFFFFFFF; buffer >>= 32;
// (2 ** 32)*  (w[1] +       w[8]) +
buffer += (uint64_t)y[1] + overflow256;
y[1] = buffer & 0xFFFFFFFF; buffer >>= 32;
// (2 ** 64) *  w[2] +
buffer += (uint64_t)y[2];
y[2] = buffer & 0xFFFFFFFF; buffer >>= 32;
// (2 ** 96) *  w[3] +
buffer += (uint64_t)y[3];
y[3] = buffer & 0xFFFFFFFF; buffer >>= 32;
// (2 ** 128) * w[4] +
buffer += (uint64_t)y[4];
y[4] = buffer & 0xFFFFFFFF; buffer >>= 32;
// (2 ** 160) * w[5] +
buffer += (uint64_t)y[5];
y[5] = buffer & 0xFFFFFFFF; buffer >>= 32;
// (2 ** 192) * w[6] +
buffer += (uint64_t)y[6];
y[6] = buffer & 0xFFFFFFFF; buffer >>= 32;
// (2 ** 224) * w[7]
buffer += (uint64_t)y[7];
y[7] = buffer & 0xFFFFFFFF; buffer >>= 32;
}

// 512 bits are now reduced to 256 bits, however the result may exceed
// (2^256 - 2^32 - 977) and an additional subtraction may be necessary
}``````

A test program with a pair of factorial 97 numbers (number about 500 bits in size) and the correct calculation result (97! mod (2 ** 256 – 2 ** 32 – 977)), necessary to check the results of the function:

``````#include <iostream>
#include <cassert>

void reduce_512(const uint32_t x[16], uint32_t y[8]) {
// put function code here
}

int main() {
// math.factorial(97), this is about 500 bits long
// least significant limbs go first
const uint32_t x[16] = {
0x00000000, 0x00000000, 0xc0000000, 0xc63bc975,
0xcb0e1818, 0xfe74c03b, 0x13559f1a, 0xca00bb56,
0xef9d44bc, 0xf57bf161, 0xf3e3d5c3, 0xab918234,
0xb69daa20, 0x4532ed8b, 0xafb0a77f, 0x01d62e2f
};
// math.factorial(97) % (2 ** 256 - 2 ** 32 - 977)
// least significant limbs go first
const uint32_t y_expected[8] = {
0x7999b163, 0xcf77a9bd, 0x7dfec23d, 0x80718b50,
0x6655e0fc, 0xcc6efc90, 0xd9b7c95d, 0x7c17a6d2
};

uint32_t y[8];
reduce_512(x, y); // stage 2 is ran once for this input

// print/verify
bool ok = true;
for (int i = 0; i < 8; ok &= (y[i] == y_expected[i]), i++) {
std::cout << std::hex << y[i] << std::endl;
}
std::cout << (ok ? "Ok" : "Failed") << std::endl;

return 0;
}``````

ps All examples tested in Microsoft Visual Studio 2022 (v17.5.4)