We continue to overclock FizzBuzz

After writing the first article about FizzBuzz (which, unexpectedly for me, became the Editor’s Choice at Technotext 2021), I had thoughts about what could be accelerated, but all the time it was not up to it. And then I got a glove.

So I had to uncover the correct GCC in order to measure the code with @ChePeter.

A couple of notes before dipping your hands into the code:

  • @ChePeter compared the speed of his solution with my not the fastest solution. But, as I understand it, there is no support for i5-9400 BMI1so the customprint2 variant is indeed the fastest one that runs on this processor

  • the code from the “pensioner” brought to mind an expression from 1983: “The determined Real Programmer can write FORTRAN programs in any language“. I can only say that such code is unlikely to pass the review

  • “Well, there is nothing to parallelize here, creating threads will eat up more processor time than actual calculations” – I completely disagree with this statement. Just this task is quite well parallelized, and in the case of my previous run, multithreading immediately gave an increase of more than 2.5 times

The task remains the same, but a restriction is added – do not use intrinsics. “retired” code on my computer (output redirected to /dev/null) works out in 2.337 seconds, so I need to beat this time. But in fact, I want to “fit” into a second – last time I missed some 51 milliseconds. But then I had intrinsiki, now I’ll try to do without them.

Since intrinsics are excluded, buffer reuse becomes not so efficient, that is, we will collect the buffer each time from scratch. But even here there is where to roam.

Optimization of converting a number into a string

In the previous version, I used my analogue of the function itoa, and she divided the number by 10 in the cycle. The cycle is bad, the division is bad, we throw out both in figs, and replace the lookup with a table. Since we have a limit of a billion (actually 999999990, as the largest multiple of 15 up to a billion), we need to be able to work with 9-digit numbers, a table of a billion 10-byte rows will be huge, not good. Then you can divide the number into two parts, in this case there is enough table for 5-digit numbers, that is, 100000 6-byte lines, this is already tolerable, but not very much yet. However, if we process not 15 numbers, but 30 in one pass, then the last digit of the number will be determined by its serial number in the 30-number series, and then it is enough for us to translate an 8-digit number into a string, and since we divide it by two parts, then it will be two lookups in a table of 10000. And since each part we get 4 bytes, it is convenient for us to store them not as strings, but as 32-bit int‘s, and this will be useful to us later. We get a table of 40K, which fits perfectly into any cache, so we can expect that the search for such a table will fly.

Due to the fact that we do not need to determine the last digit of the number, and we have a series of 30 numbers, we need to search the table only 3 times per series, when tens are switched. Nice side effect. In addition, we can almost completely get rid of the division – we just increment the lower part of the number, and check if we need to do a carry. With one shot we will lay down a whole herd (a flock? a joint?) of hares.

Reducing the number of function calls

In previous versions of FizzBuzz, the standard library functions were used quite extensively, and in the first place memcpy(). I sure that memcpy() works very fast and uses available vector instructions, but in our task we need to constantly copy small buffers – from 4 to 10 bytes, which can be safely packed into a couple of assignment instructions instead of calling a function. Since C does not allow array assignments, all buffers must be converted to int‘s fixed size. Here, of course, little endian creates problems on x86, you have to think very hard about the byte order in these “numbers”, but it’s worth it.

Different types of workers

The new number-to-string technique will only work for 9-digit numbers, which make up 90% of our number field. You can also make a separate option for 8-digit numbers, this will cover another 9%, and the remaining percentage can be processed with a slower, but universal processor. But this means that you need to be able to set different handlers for different ranges of numbers. This will complicate the task manager a little, but without it it will be impossible to implement new ideas without abandoning multithreading.

Even without counting two lookup tables for 10000 and 1000 numbers (one could do with big endian, for little endian it will require additional dances that will inevitably hit performance), the code has become noticeably larger, so I will give only the most important a piece – a worker that processes 9-digit numbers (that is, 90% of the numeric field):

#define FIZZ do { *((uint64_t *)cur) = 0x0a7a7a6946; cur += 5; } while (0)
#define BUZZ do { *((uint64_t *)cur) = 0x0a7a7a7542; cur += 5; } while (0)
#define FIZZBUZZ do { *((uint64_t *)cur) = 0x7a7a75427a7a6946; cur += 8; *((uint8_t *)cur) = 0x0a; cur++; } while (0)
#define FIZZ_BUZZ do { *((uint64_t *)cur) = 0x7a75420a7a7a6946; cur += 8; *((uint16_t *)cur) = 0x0a7a; cur += 2; } while (0)
#define BUZZ_FIZZ do { *((uint64_t *)cur) = 0x7a69460a7a7a7542; cur += 8; *((uint16_t *)cur) = 0x0a7a; cur += 2; } while (0)
#define DIGIT(A) do {*cur = A; cur++; *cur="\n"; cur ++; } while (0)
#define NUM do {*((uint32_t *)cur) = high; cur += 4; *((uint32_t *)cur) = low; cur += 4; } while (0)
#define NORM do { l++; if (l >= 10000) { l -= 10000; h += 1; high = table10K[h]; } low = table10K[l]; } while(0)

void *fast9(void *arg) {
	struct thread_data *data = (struct thread_data *)arg;
	char *cur = data->buf;

	int first_digits = data->first / 10;
	int h = first_digits / 10000;    // decimal digits 1-4
	int l = first_digits % 10000;    // decimal digits 5-8
	uint32_t high = table10K[h];
	uint32_t low = table10K[l];
	for (int i = data->first; i <= data->last; i += 30) {
		NUM; DIGIT('1');	// 1
		NUM; DIGIT('2');	// 2
		FIZZ;				// 3
		NUM; DIGIT('4');	// 4
		BUZZ_FIZZ;			// 5, 6
		NUM; DIGIT('7');	// 7
		NUM; DIGIT('8');	// 8
		FIZZ_BUZZ;			// 9, 10
		NORM;
		NUM; DIGIT('1');	// 11
		FIZZ;				// 12
		NUM; DIGIT('3');	// 13
		NUM; DIGIT('4');	// 14
		FIZZBUZZ;			// 15
		NUM; DIGIT('6');	// 16
		NUM; DIGIT('7');	// 17
		FIZZ;				// 18
		NUM; DIGIT('9');	// 19
		BUZZ_FIZZ;			// 20, 21
		NORM;
		NUM; DIGIT('2');	// 22
		NUM; DIGIT('3');	// 23
		FIZZ_BUZZ;			// 24, 25
		NUM; DIGIT('6');	// 26
		FIZZ;				// 27
		NUM; DIGIT('8');	// 28
		NUM; DIGIT('9');	// 29
		FIZZBUZZ;			// 30
		NORM;
	}

	data->buflen = cur - data->buf;
	pthread_exit(NULL);
}

The mysterious hex constants are the words Fizz and Buzz in different versions, represented as little endian numbers.

Full source here.

Result

Well, and most importantly – did you manage to get rid of the “pensioner”?

$ time ./multithreaded2 >/dev/null
real 0m0.692s
user 0m2.659s
sys 0m0.033s

Yes, more than 3 times faster. And we managed to enter in a second, moreover, with a huge margin. Interestingly, I updated the data for all tests and now even multithreaded perfectly enters in a second (but the new version is still faster). I don’t know if this is due to the newer kernel (5.17 instead of 4.12) or the newer gcc (11.3 instead of 7.5).

Similar Posts

Leave a Reply