Why does the Switch SDK have three different sines?

While working at Gaijin a few years ago, I had the opportunity to help port a couple of the company’s games to the Nintendo Switch, which was then vigorously conquering new markets. It was my first major project on this platform, and since neither the team nor the engine developers were familiar with the platform, its build system, or the Nintendo ecosystem in general, we had to find every rake ourselves and step on it carefully.

To explore the new platform’s capabilities, an internal middleware stack (the Dagor engine + NX SDK + Jam) was written in parallel with porting the game, and the code gradually grew all sorts of tests, a build matrix, benchmarks, stability runs and other internal checks. Note that as of 2018 some POSIX functions such as poll and send/recv were not implemented in the Switch SDK at all, and we had to write most of the file-handling POSIX layer ourselves.

Eventually I got around to writing benchmarks for the standard library functions, and noticed anomalies in the behavior of some trigonometric functions across different build modes. For reference: the SDK uses a stripped-down version of musl libc (https://www.musl-libc.org/), everything is statically linked into one big binary by Nintendo’s clang 9 (2018), and the result runs on the console. We had no access to the sources of Nintendo’s build of libc, but you can always look at the disassembly and more or less reconstruct what is going on.

To sine or not to sine?

I will describe my actions in the present tense; just remember that it was 2018 outside. The benchmark code is utterly trivial: we spin the sine (in this SDK all trigonometry lived in tr1) in a loop on the console

static void nxsdk_sin(benchmark::State& state) {
    // Code inside this loop is measured repeatedly
    float i = 0;
    for( auto _ : state) {
        float p = tr1::sin(i); // feed a varying argument; sin(p) would read an uninitialized value

        i += 0.1f;
        benchmark::DoNotOptimize(p);
    }
}
BENCHMARK(nxsdk_sin);

We get the following results (less time is better):

-----------(lower is better)--------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
nxsdk_sin         8.68 ns         8.58 ns     74666667 <<< https://git.musl-libc.org/cgit/musl/tree/src/math/sin.c
nxsdk_sin_vec     8.35 ns         8.30 ns     75825003 <<< sin(vec4f)
dagor_sin         5.48 ns         5.47 ns    102504001
glibc_sin         8.28 ns         8.08 ns     77623654 <<< https://github.com/lattera/glibc/blob/master/sysdeps/ieee754/dbl-64/s_sin.c

So far everything is as expected: the sine formulas in musl are about the same as in glibc, and the timings are about the same too. But once we turn on -O3 + -math:precise, Nintendo’s sine suddenly becomes almost a third faster. Either the compiler is too smart, or the benchmark is lying and the moon is in the wrong phase, or something else is going on. When my colleague first showed me these results, I told him his bench was broken and to go double-check, and then forgot about it. The next day we went through the test together, because there was simply nowhere for that perf to come from (or so I thought at first). As expected, glibc_sin gets a slight speedup from clang’s aggressive optimizations.

-----------(lower is better)--------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
nxsdk_sin         6.03 ns         6.00 ns     87625024
nxsdk_sin_vec     5.82 ns         5.78 ns     90022043
dagor_sin         5.38 ns         5.27 ns    104602002
glibc_sin         8.08 ns         8.05 ns     78214356

An even “stranger” result came from running the test with -O3 + -math:fast. Here I was genuinely surprised, because at the time nothing was faster than dagor_sin at comparable accuracy, yet all the sines passed the correctness tests equally well: the discrepancy against std::sin never exceeded 1e-6.

-----------(lower is better)--------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
nxsdk_sin         4.03 ns         3.99 ns    132307692
nxsdk_sin_vec     4.09 ns         4.08 ns    131403251
dagor_sin         5.13 ns         5.10 ns    110400021
glibc_sin         7.43 ns         7.16 ns     79544722

At this point I had to dig into the disassembly to understand what the geniuses at Nintendo had come up with to reach such impressive benchmark numbers.

Let’s start with the -O1 (speed) mode. I will quote only the important sections so the picture is clear, and I will try not to post the entire disassembly — you never know whose IP is in there. The sin function shipped with the NX SDK is an alias for __sin_nx_502 (2018; the current SDKs are 4.0 and 5.0), so the function name is apparently generated from the SDK version.

__sin_nx_502
__sin_nx_502(double):
        sub     sp, sp, #32            
        stp     x29, x30, [sp, #16]     
        add     x29, sp, #16           
        fmov    x8, d0
        mov     w9, #8699
        ubfx    x8, x8, #32, #31
        movk    w9, #16361, lsl #16
        cmp     w8, w9
        b.hi    .ERIT1_3                // .ERITX_X -> .BBX_X arm
        lsr     w9, w8, #20
        cmp     w9, #996              
        b.hi    .ERIT1_5
        mov     x9, #4066750463515557888
        mov     x10, #5147614374084476928
        fmov    d1, x9
        fmov    d2, x10
        fmul    d1, d0, d1
        fadd    d2, d0, d2
        cmp     w8, #256, lsl #12     
        fcsel   d1, d1, d2, lo
        str     d1, [sp]
        ldp     x29, x30, [sp, #16]     
        add     sp, sp, #32            
        ret
.ERIT1_3:
        lsr     w8, w8, #20
        cmp     w8, #2047              
        b.lo    .ERIT1_6
        fsub    d0, d0, d0
        ldp     x29, x30, [sp, #16]    
        add     sp, sp, #32            
        ret
.ERIT1_5:
        adrp    x8, .ERIT1_0
        ... uninteresting code
        add     sp, sp, #32           
        ret
.ERIT1_6:
        mov     x0, sp
        bl      __rem_pio2_nx_502(double, double*) // call __rem_pio2
        and     w8, w0, #0x3
        ... uninteresting code
        ldp     x29, x30, [sp, #16]     
        add     sp, sp, #32             
        ret
.ERIT1_10:
        ldp     d1, d0, [sp]
        ...
        fadd    d2, d2, d3
        fmov    d5, #0.50000000
        ldr     d3, [x8, :lo12:.ERIT1_5]
        ...
        ldp     x29, x30, [sp, #16]    
        add     sp, sp, #32            
        ret
.ERIT1_11:
        ldp     d0, d1, [sp]
        bl      __cos_nx_502(double, double)  // call __cos
        ldp     x29, x30, [sp, #16]     
        add     sp, sp, #32           
        ret
.ERIT1_12:
        ldp     d0, d1, [sp]
        bl      __cos_nx_502(double, double) // call __cos
        fneg    d0, d0
        ldp     x29, x30, [sp, #16]     
        add     sp, sp, #32            
        ret

Let’s quickly go over the structure of this code: it almost exactly mirrors the standard musl function — you can recognize the __rem_pio2/__cos call signatures.

double sin(double x) {
    double y[2];
    uint32_t ix;
    unsigned n;

    /* High word of x. */
    GET_HIGH_WORD(ix, x);
    ix &= 0x7fffffff;

    /* |x| ~< pi/4 */
    if (ix <= 0x3fe921fb) {
        if (ix < 0x3e500000) {  /* |x| < 2**-26 */
            /* raise inexact if x != 0 and underflow if subnormal*/
            FORCE_EVAL(ix < 0x00100000 ? x/0x1p120f : x+0x1p120f);
            return x;
        }
        return __sin(x, 0.0, 0);
    }

    /* sin(Inf or NaN) is NaN */
    if (ix >= 0x7ff00000)
        return x - x;

    /* argument reduction needed */
    n = __rem_pio2(x, y);
    switch (n&3) {
    case 0: return  __sin(y[0], y[1], 1);
    case 1: return  __cos(y[0], y[1]);
    case 2: return -__sin(y[0], y[1], 1);
    default:
        return -__cos(y[0], y[1]);
    }
}

But no matter how smart the compiler is, it won’t squeeze more than 10-15% out of this code, whatever modes you turn on. We dig deeper: enable -O3 + -math:precise and look at what was generated this time. Now we get an alias to a different function, __sin_dorf_nx_502. The code is very linear — few branches, mostly additions and multiplications.

__sin_dorf_nx_502
__sin_dorf_nx_502(float):
        push    {r4, r5, r6, r7, r8, r9, r10, r11, lr}
        add     r11, sp, #28
        sub     sp, sp, #4
        ldr     r1, .EVRT0_0
        mov     r4, r0
        bl      __aeabi_fmul
        and     r1, r4, #-2147483648
        orr     r1, r1, #1056964608
        bl      __aeabi_fadd
        bl      __aeabi_f2uiz
        mov     r9, r0
        bl      __aeabi_ui2f
        ldr     r8, .EVRT0_1
        mov     r5, r0
        mov     r1, r5
        ldr     r0, [r8, #24]
        bl      __aeabi_fmul
        mov     r1, r0
        mov     r0, r4
        bl      __aeabi_fsub
        mov     r4, r0
        ldr     r0, [r8, #28]
        mov     r1, r5
        bl      __aeabi_fmul
        mov     r1, r0
        mov     r0, r4
        bl      __aeabi_fsub
        mov     r1, r0
        mov     r7, r0
        bl      __aeabi_fmul
        mov     r5, r0
        mov     r0, r7
        mov     r1, r5
        bl      __aeabi_fmul
        mov     r6, r0
        ldr     r0, [r8, #8]
        ldm     r8, {r4, r10}
        mov     r1, r5
        str     r0, [sp]                
        ldr     r0, [r8, #12]
        bl      __aeabi_fmul
        ldr     r1, [r8, #16]
        bl      __aeabi_fadd
        mov     r1, r0
        mov     r0, r5
        bl      __aeabi_fmul
        ldr     r1, [r8, #20]
        bl      __aeabi_fadd
        mov     r1, r0
        mov     r0, r6
        bl      __aeabi_fmul
        mov     r1, r0
        mov     r0, r7
        bl      __aeabi_fadd
        mov     r7, r0
        mov     r0, r4
        mov     r1, r5
        bl      __aeabi_fmul
        mov     r1, r0
        mov     r0, r10
        bl      __aeabi_fadd
        mov     r1, r0
        mov     r0, r5
        bl      __aeabi_fmul
        mov     r1, r0
        ldr     r0, [sp]                
        bl      __aeabi_fadd
        mov     r1, r0
        mov     r0, r5
        bl      __aeabi_fmul
        mov     r1, #1065353216
        bl      __aeabi_fadd
        tst     r9, #1
        moveq   r0, r7
        tst     r9, #2
        eorne   r0, r0, #-2147483648
        sub     sp, r11, #28
        pop     {r4, r5, r6, r7, r8, r9, r10, r11, lr}
        bx      lr
.EVRT0_0:
        .long   1059256694              @ 0x3f22f976
.EVRT0_1:
        .long   .KI_MergedGlobals
        ...
.KI_MergedGlobals: // coefficient block for the polynomial; it lived somewhere in the data section
        .long   3132246419              @ float -0.001360224
        .long   1026203702              @ float 0.04165669
        .long   3204448223              @ float -0.4999990
        .long   3108801646              @ float -1.950727E-4
        .long   1007190850              @ float 0.008332075
        .long   3190467233              @ float -0.1666665
        .long   1070141402              @ float 1.570796 /// this is Pi/2
        .long   866263401               @ float 7.549790E-8

This looks very much like a polynomial approximation of the sine. If we try to reconstruct the approximate logic in pseudocode, we get something like the code below — and that already matches what the benchmark shows.

float __sin_dorf_nx_502(float x) {
    // .EVRT0_0 = 0x3F22F976 = 2/pi
    const float TWO_OVER_PI = 0.636619772f;

    // .KI_MergedGlobals: minimax polynomial coefficients, plus pi/2 split in two
    const float C_COS[3] = { -0.001360224f, 0.04165669f, -0.4999990f };
    const float C_SIN[3] = { -1.950727E-4f, 0.008332075f, -0.1666665f };
    const float PI_2_HI  = 1.570796f;    // high part of pi/2
    const float PI_2_LO  = 7.549790E-8f; // low part of pi/2

    // n = round(x / (pi/2)); the orr with 0x3F000000 builds copysign(0.5f, x)
    int n = (int)(x * TWO_OVER_PI + copysignf(0.5f, x));
    float fn = (float)n;

    // Cody-Waite style reduction: r = x - n*pi/2, with pi/2 split in two constants
    float r  = (x - fn * PI_2_HI) - fn * PI_2_LO;
    float r2 = r * r;
    float r3 = r * r2;

    // sin(r) ~ r + r^3 * P(r^2), cos(r) ~ 1 + r^2 * Q(r^2)
    float s = r + r3 * (C_SIN[2] + r2 * (C_SIN[1] + r2 * C_SIN[0]));
    float c = 1.0f + r2 * (C_COS[2] + r2 * (C_COS[1] + r2 * C_COS[0]));

    // odd quadrants take the cosine branch; quadrants 2 and 3 flip the sign
    float result = (n & 1) ? c : s;
    if (n & 2)
        result = -result;
    return result;
}

I think you can already guess what happens when we build the sample with -O3 + -math:fast? That’s right: an alias to yet one more function — the final version used in release builds, the one the games were sent to certification with.

__sin_corf_nx_502
__sin_corf_nx_502(float):
        push    {r4, r5, r6, r7, r8, r9, r10, r11, lr}
        add     r11, sp, #28
        sub     sp, sp, #4
        bl      __aeabi_f2d
        ldr     r2, .DUIE0_0
        ldr     r3, .DUIE0_1
        mov     r5, r0
        mov     r6, r1
        bl      __aeabi_dmul
        bl      __aeabi_d2iz
        mov     r10, r0
        bl      __aeabi_i2d
        ldr     r3, .DUIE0_2
        mov     r2, #1610612736
        bl      __aeabi_dmul
        mov     r2, r0
        mov     r3, r1
        mov     r0, r5
        mov     r1, r6
        bl      __aeabi_dadd
        bl      __aeabi_d2f
        ldr     r1, .DUIE0_3
        mov     r7, r0
        and     r0, r10, #255
        ldr     r8, [r1]
        ldr     r0, [r8, r0, lsl #2]
        bl      __aeabi_f2d
        mov     r3, #266338304
        mov     r2, #0
        str     r0, [sp]              
        mov     r9, r1
        orr     r3, r3, #-1342177280
        bl      __aeabi_dmul
        mov     r5, r0
        mov     r0, r7
        mov     r6, r1
        bl      __aeabi_f2d
        mov     r7, r0
        mov     r4, r1
        mov     r0, r5
        mov     r1, r6
        mov     r2, r7
        mov     r3, r4
        bl      __aeabi_dmul
        mov     r5, r0
        add     r0, r10, #64
        mov     r6, r1
        and     r0, r0, #255
        ldr     r0, [r8, r0, lsl #2]
        bl      __aeabi_f2d
        mov     r2, r5
        mov     r3, r6
        bl      __aeabi_dadd
        mov     r2, r7
        mov     r3, r4
        bl      __aeabi_dmul
        ldr     r2, [sp]               
        mov     r3, r9
        bl      __aeabi_dadd
        bl      __aeabi_d2f
        sub     sp, r11, #28
        pop     {r4, r5, r6, r7, r8, r9, r10, r11, lr}
        bx      lr
.DUIE0_0:
        .long   1682373044              @ 0x6446f9b4
.DUIE0_1:
        .long   1078222640              @ 0x40445f30
.DUIE0_2:
        .long   3214483963              @ 0xbf9921fb
.DUIE0_3:
        .long   __sin_corf_duie_nx_502

The pseudocode for this assembly fragment looks as follows; judging by the logic, this is a table-based sine. It is the fastest of the three, and the accuracy does not suffer much.

float __sin_corf_nx_502(float x) {
    // .DUIE0_0/.DUIE0_1 form the double 128/pi; .DUIE0_2 (with 0x60000000) is -pi/128
    const double N_OVER_PI     = 40.743665431525205; // 128/pi
    const double NEG_PI_OVER_N = -0.024543692606;    // -pi/128

    // assumed: a 256-entry table holding sin(k * pi/128) for k = 0..255
    const float* table = __sin_corf_duie_nx_502;

    // table index: n = trunc(x * 128/pi)
    int n = (int)((double)x * N_OVER_PI);

    // residual angle r = x - n*pi/128, computed in double, narrowed to float
    float r = (float)((double)x + (double)n * NEG_PI_OVER_N);

    // s = sin(n*pi/128); shifting the index by 64 entries adds pi/2, giving cos
    double s = table[n & 255];
    double c = table[(n + 64) & 255];

    // sin(n*pi/128 + r) ~ s*(1 - r^2/2) + c*r, evaluated as s + r*(c - 0.5*s*r)
    return (float)(s + (double)r * (c - 0.5 * s * (double)r));
}

If you dig around the Internet on the topic of fast sines, you can also find, for example, Bhaskara I’s approximation (https://en.wikipedia.org/wiki/Bhaskara_I%27s_sine_approximation_formula); its maximum error is about 0.0016, which is good enough for many tasks.

Or have a look at the article by Lars Nyland (NVIDIA), one of the authors of CUDA.

For sines, the Dagor engine uses its own implementation that behaves equally well and fast on all platforms; you can find it here. Don’t take this as advertising — I haven’t worked for the company in a long time; maybe it will come in handy for someone. Best regards to my former colleagues.

Conclusions

We didn’t get any practical benefit out of this, but it was interesting to understand how the SDK works under the hood. Honor and praise to the Japanese engineers who gave the world this wonderful console. I’m left with only one question: why three?

UPD: a friend wrote to me privately that this is how Nintendo tracked the use of old and leaked SDKs: without an active developer key the first implementation was always selected, so a build submitted for certification helped identify stolen accounts and keys. I can’t verify this, I can only take it on faith 🙂
