How to squeeze 1.5 teraflops of 32-bit floating-point performance out of a single M1 processor