Why does the Switch SDK have three different sines?
While working at Gaijin a few years ago, I had the opportunity to be involved in porting a couple of the company’s games to the Nintendo Switch, a console then vigorously conquering new markets. It was my first major project on this platform, and since neither the team nor the engine developers were familiar with the platform, its build system, or the Nintendo ecosystem in general, we had to find every rake ourselves and carefully step on it. To probe the capabilities of the new platform, in parallel with porting the game we wrote internal middleware (a bundle of the Dagor engine + NX SDK + Jam), and the code gradually grew all sorts of tests, a build matrix, benchmarks, stability runs and other internal checks.

It should be noted that as of 2018 some POSIX functions like poll and send/receive were not implemented in the Switch SDK at all, and we had to write most of the file-handling functions of the POSIX layer ourselves. Eventually we got around to writing benchmarks for the standard library functions, and noticed anomalies in the behavior of some trigonometric functions across different build modes. For reference: the SDK uses a stripped-down version of musl libc (https://www.musl-libc.org/), and everything is statically linked into one big binary by Nintendo’s clang 9 (2018), which is then run on the console. We had no access to the sources of Nintendo’s libc build, but you can always look at the disassembly and more or less reconstruct what is going on.
I will describe my actions in the present tense; just keep in mind that it was 2018 outside. The benchmark code is absolutely trivial: we run the sine (in the SDK all trigonometry lived in tr1) in a loop on the console.
static void nxsdk_sin(benchmark::State& state) {
  // Code inside this loop is measured repeatedly
  float i = 0;
  for (auto _ : state) {
    float p = tr1::sin(i);
    i += 0.1f;
    benchmark::DoNotOptimize(p);
  }
}
BENCHMARK(nxsdk_sin);
We get the following results (less time is better)
-----------(lower is better)--------------------------
Benchmark Time CPU Iterations
------------------------------------------------------
nxsdk_sin 8.68 ns 8.58 ns 74666667 <<< https://git.musl-libc.org/cgit/musl/tree/src/math/sin.c
nxsdk_sin_vec 8.35 ns 8.30 ns 75825003 <<< sin(vec4f)
dagor_sin 5.48 ns 5.47 ns 102504001
glibc_sin 8.28 ns 8.08 ns 77623654 <<< https://github.com/lattera/glibc/blob/master/sysdeps/ieee754/dbl-64/s_sin.c
So far everything is as expected: the formulas for computing sin in musl are roughly the same as in glibc, and the timings are about the same too. But once we turn on -O3 + -math:precise, Nintendo’s sine becomes almost a third faster. Either the compiler is too smart, or the benchmark is lying and the moon is in the wrong phase, or something else is going on. When a colleague first showed me these results, I told him his bench was broken and to go double-check it, and forgot about the whole thing. The next day we looked at this test together, because there really was nowhere for that perf to come from (or so I thought at first). As expected, glibc_sin shows a slight drop in time thanks to clang’s aggressive optimizations.
-----------(lower is better)--------------------------
Benchmark Time CPU Iterations
------------------------------------------------------
nxsdk_sin 6.03 ns 6.00 ns 87625024
nxsdk_sin_vec 5.82 ns 5.78 ns 90022043
dagor_sin 5.38 ns 5.27 ns 104602002
glibc_sin 8.08 ns 8.05 ns 78214356
An even more “different” result came out when the test was run with the -O3 + -math:fast flags. Here I was genuinely surprised, because at the time there was nothing faster than dagor_sin at comparable accuracy, and all the sines passed the correctness tests equally well: the discrepancy between the implementations and std::sin did not exceed 1e-6.
-----------(lower is better)--------------------------
Benchmark Time CPU Iterations
------------------------------------------------------
nxsdk_sin 4.03 ns 3.99 ns 132307692
nxsdk_sin_vec 4.09 ns 4.08 ns 131403251
dagor_sin 5.13 ns 5.10 ns 110400021
glibc_sin 7.43 ns 7.16 ns 79544722
And here I had to get into the disassembly to understand what the geniuses at Nintendo had come up with to achieve such impressive numbers in the benchmark.
Let’s start with the -O1 (speed) mode. I will quote only the important sections so the overall picture is clear, and I will try not to post the whole disassembly, you never know what kind of IP is in there. The sin function shipped with the NX SDK is an alias for __sin_nx_502 (it was 2018; the current SDKs were 4.0 and 5.0), apparently the function name is generated from the SDK version.
__sin_nx_502
__sin_nx_502(double):
sub sp, sp, #32
stp x29, x30, [sp, #16]
add x29, sp, #16
fmov x8, d0
mov w9, #8699
ubfx x8, x8, #32, #31
movk w9, #16361, lsl #16
cmp w8, w9
b.hi .ERIT1_3 // .ERITX_X -> .BBX_X arm
lsr w9, w8, #20
cmp w9, #996
b.hi .ERIT1_5
mov x9, #4066750463515557888
mov x10, #5147614374084476928
fmov d1, x9
fmov d2, x10
fmul d1, d0, d1
fadd d2, d0, d2
cmp w8, #256, lsl #12
fcsel d1, d1, d2, lo
str d1, [sp]
ldp x29, x30, [sp, #16]
add sp, sp, #32
ret
.ERIT1_3:
lsr w8, w8, #20
cmp w8, #2047
b.lo .ERIT1_6
fsub d0, d0, d0
ldp x29, x30, [sp, #16]
add sp, sp, #32
ret
.ERIT1_5:
adrp x8, .ERIT1_0
... uninteresting code
add sp, sp, #32
ret
.ERIT1_6:
mov x0, sp
bl __rem_pio2_nx_502(double, double*) // call __rem_pio2
and w8, w0, #0x3
... uninteresting code
ldp x29, x30, [sp, #16]
add sp, sp, #32
ret
.ERIT1_10:
ldp d1, d0, [sp]
...
fadd d2, d2, d3
fmov d5, #0.50000000
ldr d3, [x8, :lo12:.ERIT1_5]
...
ldp x29, x30, [sp, #16]
add sp, sp, #32
ret
.ERIT1_11:
ldp d0, d1, [sp]
bl __cos_nx_502(double, double) // call __cos
ldp x29, x30, [sp, #16]
add sp, sp, #32
ret
.ERIT1_12:
ldp d0, d1, [sp]
bl __cos_nx_502(double, double) // call __cos
fneg d0, d0
ldp x29, x30, [sp, #16]
add sp, sp, #32
ret
Let’s quickly go over the code structure. It almost completely mirrors the standard function from musl: you can recognize it by the calls to __rem_pio2/__cos.
double sin(double x) {
    double y[2];
    uint32_t ix;
    unsigned n;

    /* High word of x. */
    GET_HIGH_WORD(ix, x);
    ix &= 0x7fffffff;

    /* |x| ~< pi/4 */
    if (ix <= 0x3fe921fb) {
        if (ix < 0x3e500000) { /* |x| < 2**-26 */
            /* raise inexact if x != 0 and underflow if subnormal */
            FORCE_EVAL(ix < 0x00100000 ? x/0x1p120f : x+0x1p120f);
            return x;
        }
        return __sin(x, 0.0, 0);
    }

    /* sin(Inf or NaN) is NaN */
    if (ix >= 0x7ff00000)
        return x - x;

    /* argument reduction needed */
    n = __rem_pio2(x, y);
    switch (n & 3) {
    case 0: return  __sin(y[0], y[1], 1);
    case 1: return  __cos(y[0], y[1]);
    case 2: return -__sin(y[0], y[1], 1);
    default:
        return -__cos(y[0], y[1]);
    }
}
But no matter how smart the compiler is, it will not squeeze more than a 10-15% gain out of this code, whatever modes you turn on. We dig deeper, enable -O3 + -math:precise and look at what got generated this time. Now we get an alias to another function, __sin_dorf_nx_502; the code is very linear, few branches, mostly additions and multiplications.
__sin_dorf_nx_502
__sin_dorf_nx_502(float):
push {r4, r5, r6, r7, r8, r9, r10, r11, lr}
add r11, sp, #28
sub sp, sp, #4
ldr r1, .EVRT0_0
mov r4, r0
bl __aeabi_fmul
and r1, r4, #-2147483648
orr r1, r1, #1056964608
bl __aeabi_fadd
bl __aeabi_f2uiz
mov r9, r0
bl __aeabi_ui2f
ldr r8, .EVRT0_1
mov r5, r0
mov r1, r5
ldr r0, [r8, #24]
bl __aeabi_fmul
mov r1, r0
mov r0, r4
bl __aeabi_fsub
mov r4, r0
ldr r0, [r8, #28]
mov r1, r5
bl __aeabi_fmul
mov r1, r0
mov r0, r4
bl __aeabi_fsub
mov r1, r0
mov r7, r0
bl __aeabi_fmul
mov r5, r0
mov r0, r7
mov r1, r5
bl __aeabi_fmul
mov r6, r0
ldr r0, [r8, #8]
ldm r8, {r4, r10}
mov r1, r5
str r0, [sp]
ldr r0, [r8, #12]
bl __aeabi_fmul
ldr r1, [r8, #16]
bl __aeabi_fadd
mov r1, r0
mov r0, r5
bl __aeabi_fmul
ldr r1, [r8, #20]
bl __aeabi_fadd
mov r1, r0
mov r0, r6
bl __aeabi_fmul
mov r1, r0
mov r0, r7
bl __aeabi_fadd
mov r7, r0
mov r0, r4
mov r1, r5
bl __aeabi_fmul
mov r1, r0
mov r0, r10
bl __aeabi_fadd
mov r1, r0
mov r0, r5
bl __aeabi_fmul
mov r1, r0
ldr r0, [sp]
bl __aeabi_fadd
mov r1, r0
mov r0, r5
bl __aeabi_fmul
mov r1, #1065353216
bl __aeabi_fadd
tst r9, #1
moveq r0, r7
tst r9, #2
eorne r0, r0, #-2147483648
sub sp, r11, #28
pop {r4, r5, r6, r7, r8, r9, r10, r11, lr}
bx lr
.EVRT0_0:
.long 1059256694 @ 0x3f22f976
.EVRT0_1:
.long .KI_MergedGlobals
...
.KI_MergedGlobals: // here is the parameter block for the polynomial; it lived somewhere in the data section
.long 3132246419 @ float -0.001360224
.long 1026203702 @ float 0.04165669
.long 3204448223 @ float -0.4999990
.long 3108801646 @ float -1.950727E-4
.long 1007190850 @ float 0.008332075
.long 3190467233 @ float -0.1666665
.long 1070141402 @ float 1.570796 /// this is Pi/2
.long 866263401 @ float 7.549790E-8
This looks very much like a polynomial approximation of the sine. If we try to reconstruct the approximate logic in pseudocode, it comes out like the code below, and that already matches what the benchmark shows.
float __sin_dorf_nx_502(float x) {
  // constants recovered from .EVRT0_0 and .KI_MergedGlobals
  const float TWO_OVER_PI = 0.636619772f; // .EVRT0_0
  const float C[8] = {
    -0.001360224f, 0.04165669f, -0.4999990f,  // cos polynomial
    -1.950727E-4f, 0.008332075f, -0.1666665f, // sin polynomial
    1.570796f,     // PI/2, high part
    7.549790E-8f   // PI/2, low-order correction
  };
  // round x * 2/pi to the nearest integer; the orr with 0x3F000000 (0.5f)
  // in the assembly is copysign(0.5f, x) added before truncation
  float t = x * TWO_OVER_PI + copysignf(0.5f, x);
  int n = (int)t;   // quadrant index
  float fn = (float)n;
  // two-step range reduction into [-pi/4, pi/4]
  float r = x - fn * C[6];
  r = r - fn * C[7];
  float r2 = r * r;
  // odd (sin) polynomial: r + r^3 * (C[5] + C[4]*r2 + C[3]*r2^2)
  float s = ((C[3] * r2 + C[4]) * r2 + C[5]) * (r2 * r) + r;
  // even (cos) polynomial: 1 + r2 * (C[2] + C[1]*r2 + C[0]*r2^2)
  float c = ((C[0] * r2 + C[1]) * r2 + C[2]) * r2 + 1.0f;
  // pick sin or cos by quadrant, flip the sign in quadrants 2 and 3
  float result = (n & 1) ? c : s;
  if (n & 2)
    result = -result;
  return result;
}
I think you can already guess what happens when we build the sample with -O3 + -math:fast? That’s right: an alias for yet one more function, the final version used in the release build, the one the games were sent to certification with.
__sin_corf_nx_502
__sin_corf_nx_502(float):
push {r4, r5, r6, r7, r8, r9, r10, r11, lr}
add r11, sp, #28
sub sp, sp, #4
bl __aeabi_f2d
ldr r2, .DUIE0_0
ldr r3, .DUIE0_1
mov r5, r0
mov r6, r1
bl __aeabi_dmul
bl __aeabi_d2iz
mov r10, r0
bl __aeabi_i2d
ldr r3, .DUIE0_2
mov r2, #1610612736
bl __aeabi_dmul
mov r2, r0
mov r3, r1
mov r0, r5
mov r1, r6
bl __aeabi_dadd
bl __aeabi_d2f
ldr r1, .DUIE0_3
mov r7, r0
and r0, r10, #255
ldr r8, [r1]
ldr r0, [r8, r0, lsl #2]
bl __aeabi_f2d
mov r3, #266338304
mov r2, #0
str r0, [sp]
mov r9, r1
orr r3, r3, #-1342177280
bl __aeabi_dmul
mov r5, r0
mov r0, r7
mov r6, r1
bl __aeabi_f2d
mov r7, r0
mov r4, r1
mov r0, r5
mov r1, r6
mov r2, r7
mov r3, r4
bl __aeabi_dmul
mov r5, r0
add r0, r10, #64
mov r6, r1
and r0, r0, #255
ldr r0, [r8, r0, lsl #2]
bl __aeabi_f2d
mov r2, r5
mov r3, r6
bl __aeabi_dadd
mov r2, r7
mov r3, r4
bl __aeabi_dmul
ldr r2, [sp]
mov r3, r9
bl __aeabi_dadd
bl __aeabi_d2f
sub sp, r11, #28
pop {r4, r5, r6, r7, r8, r9, r10, r11, lr}
bx lr
.DUIE0_0:
.long 1682373044 @ 0x6446f9b4
.DUIE0_1:
.long 1078222640 @ 0x40445f30
.DUIE0_2:
.long 3214483963 @ 0xbf9921fb
.DUIE0_3:
.long __sin_corf_duie_nx_502
The pseudocode for this assembly comes out roughly as below, and judging by the logic, this is a table-based sine: the fastest of the three, and accuracy does not suffer much.
float __sin_corf_nx_502(float x) {
  // constants recovered from the data section
  const double INV_STEP = 40.743665431525209;  // .DUIE0_1:.DUIE0_0 = 128/PI
  const double NEG_STEP = -0.024543692606170;  // .DUIE0_2 = -PI/128
  // __sin_corf_duie_nx_502: a 256-entry table, apparently
  // table[k] = sinf(k * PI/128), one full period in PI/128 steps
  const float* table = __sin_corf_duie_nx_502;

  int n = (int)((double)x * INV_STEP);          // index of the PI/128 slice
  double r = (double)x + (double)n * NEG_STEP;  // residual within the slice
  double s = table[n & 255];                    // sin(n * PI/128)
  double c = table[(n + 64) & 255];  // +64 is a quarter period: cos(n * PI/128)
  // second-order Taylor expansion around the table node:
  // sin(a + r) ~ sin(a) + r*cos(a) - r^2/2 * sin(a)
  double result = s + r * (c - 0.5 * s * r);
  return (float)result;
}
If you dig deeper into the internet on the topic of fast sines, you can find, for example, another approximation: https://en.wikipedia.org/wiki/Bhaskara_I’s_sine_approximation_formula. Its error is on the order of 10^-3, which is enough for most problems.
Or take a look at an article by Lars Nyland (NVIDIA), one of the authors of CUDA.
For sines, the Dagor engine uses its own implementation; it behaves equally well and equally fast on all platforms, you can find it here. Don’t take this as advertising, I have not worked for the company for a long time; maybe it will come in handy for someone. Best regards to my former colleagues.
Conclusions
We didn’t get any practical benefit out of this, but it was interesting to understand how the SDK works under the hood. Honor and praise to the Japanese engineers who gave the world this wonderful console. I am left with only one question: why three?
UPD: a friend wrote to me privately that this is how Nintendo tracked the use of old and leaked SDKs: without an active developer key, the first implementation was always called, so a build submitted for certification helped to find stolen accounts and keys. I cannot verify this, I can only take it on faith 🙂