reduced the size to 14kB and got rid of the C++ runtime
Formatting library {fmt} is known for its low impact on binary size. Most often, its code is several times smaller than libraries such as IOStreams, Boost Format or, ironically, tinyformat. This is achieved through careful application of type erasure at different levels, which minimizes unnecessary use of templates.
Formatting arguments are passed via format_args
with erased types:
auto vformat(string_view fmt, format_args args) -> std::string;
template <typename... T>
auto format(format_string<T...> fmt, T&&... args) -> std::string {
return vformat(fmt, fmt::make_format_args(args...));
}
As you can see, format
delegates all non-template work, vformat
.
For output iterators and other output types, types are erased using a special buffer API.
This approach reduces the use of templates to a minimal top-level layer, while reducing the binary size and compile time.
For example, the following code:
// test.cc
#include <fmt/base.h>
int main() {
fmt::print("The answer is {}.", 42);
}
Compiles to
.LC0:
.string "The answer is {}."
main:
sub rsp, 24
mov eax, 1
mov edi, OFFSET FLAT:.LC0
mov esi, 17
mov rcx, rsp
mov rdx, rax
mov DWORD PTR [rsp], 42
call fmt::v11::vprint(fmt::v11::basic_string_view<char>, fmt::v11::basic_format_args<fmt::v11::context>)
xor eax, eax
add rsp, 24
ret
It is significantly smaller than the equivalent IOStreams code and is comparable to printf
:
.LC0:
.string "The answer is %d."
main:
sub rsp, 8
mov esi, 42
mov edi, OFFSET FLAT:.LC0
xor eax, eax
call printf
xor eax, eax
add rsp, 8
ret
Unlike printf
, {fmt}
provides full type safety at runtime. Errors in format strings can be caught at compile time, but even if strings are defined at runtime, errors are treated as exceptions, preventing undefined behavior, memory corruption, and potential crashes. Also, in general, calls {fmt}
more efficient, especially when positional arguments are used, which is what C is for varargs
not very suitable.
In 2020, I dedicated some time library size optimizationsuccessfully reducing its size to less than 100 kB (only ~57 kB when using -Os -flto
). Much has changed since then. In particular, {fmt}
now uses an outstanding algorithm Dragonbox for formatting floating point numbers, courtesy of its author Junekey Jeon. Let's see how these changes affected the binary file size and whether it can be reduced even further.
One might ask: why exactly the size of the binary file?
Usage {fmt}
on devices with limited memory has generated considerable interest – in particular, these examples from the distant past: #758 And #1226. A particularly intriguing case is programming for retro machines, where people use {fmt}
for systems such as Amiga (#4054).
We will use the same methodology as earlierexamining the size of the executable file of a program using {fmt}
as this will be most relevant for most users. All tests will be performed on a platform with aarch64 Ubuntu 22.04 and GCC 11.4.0.
First of all, let's define the starting point: what is the binary file size of the latest version? {fmt}
(11.0.2)?
$ git checkout 11.0.2
$ g++ -Os -flto -DNDEBUG -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 75K Aug 30 19:24 a.out
The size of the resulting file is 75 kB. On the positive side, despite the significant number of changes over the last 4 years, the size has not increased significantly.
It's time to consider possible optimization paths. One of the first candidates might be disabling locale support. All formatting in {fmt}
locale-independent by default (which breaks the C++ tradition of having bad defaults), but is still accessible via the format specifier L
. It can be disabled using a macro. FMT_STATIC_THOUSANDS_SEPARATOR
:
$ g++ -Os -flto -DNDEBUG "-DFMT_STATIC_THOUSANDS_SEPARATOR=','" \
-I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 71K Aug 30 19:25 a.out
Disabling locale support reduces the binary file size to 71 kB.
Now let's check the result with our trusty tool – Bloaty:
$ bloaty -d symbols a.out
FILE SIZE VM SIZE
-------------- --------------
43.8% 41.1Ki 43.6% 29.0Ki [121 Others]
6.4% 6.04Ki 8.1% 5.42Ki fmt::v11::detail::do_write_float<>()
5.9% 5.50Ki 7.5% 4.98Ki fmt::v11::detail::write_int_noinline<>()
5.7% 5.32Ki 5.8% 3.88Ki fmt::v11::detail::write<>()
5.4% 5.02Ki 7.2% 4.81Ki fmt::v11::detail::parse_replacement_field<>()
3.9% 3.69Ki 3.7% 2.49Ki fmt::v11::detail::format_uint<>()
3.2% 3.00Ki 0.0% 0 [section .symtab]
2.7% 2.50Ki 0.0% 0 [section .strtab]
2.3% 2.12Ki 2.9% 1.93Ki fmt::v11::detail::dragonbox::to_decimal<>()
2.0% 1.89Ki 2.4% 1.61Ki fmt::v11::detail::write_int<>()
2.0% 1.88Ki 0.0% 0 [ELF Section Headers]
1.9% 1.79Ki 2.5% 1.66Ki fmt::v11::detail::write_float<>()
1.9% 1.78Ki 2.7% 1.78Ki [section .dynstr]
1.8% 1.72Ki 2.4% 1.62Ki fmt::v11::detail::format_dragon()
1.8% 1.68Ki 1.5% 1016 fmt::v11::detail::format_decimal<>()
1.6% 1.52Ki 2.1% 1.41Ki fmt::v11::detail::format_float<>()
1.6% 1.49Ki 0.0% 0 [Unmapped]
1.5% 1.45Ki 2.2% 1.45Ki [section .dynsym]
1.5% 1.45Ki 2.0% 1.31Ki fmt::v11::detail::write_loc()
1.5% 1.44Ki 2.2% 1.44Ki [section .rodata]
1.5% 1.40Ki 1.1% 764 fmt::v11::detail::do_write_float<>()::{lambda()#2}::operator()()
100.0% 93.8Ki 100.0% 66.6Ki TOTAL
A significant portion of the binary file is code devoted to formatting numbers, especially floating point numbers. The latter formatting also relies on the use of large tables, not shown here. But what if we don't need floating point support? {fmt}
allows you to turn it off, although the method of disabling it is a bit spontaneous and does not apply to other types.
The main problem is that formatting functions need to know about all the types being formatted. But is this really true? This is true for printf
defined by the C standard, but not required for {fmt}
. {fmt}
supports an extension API that allows formatting arbitrary types without knowing their full set in advance. While built-in and string types are treated specifically to optimize performance, a different approach may be required when optimizing file size. By removing special handling and handling each type using the extension API, we can avoid the overhead of unused types.
I created an experimental one implementation of this idea. By setting the value to 0 for the macro FMT_BUILTIN_TYPES
only int
is handled specifically, and all other types go through the extension API. We still need to know about int
for dynamic width and precision, for example:
fmt::print("{:{}}\n", "hello", 10); // prints "hello "
This gives us a “don't pay for what you don't use” model, although it does increase the size of each call in the binary a bit. If you handle floating point or other types, the corresponding code will still be included in the final assembly. And while it is possible to reduce the implementation of floating point handling, we will not touch on this topic in this article.
Using FMT_BUILTIN_TYPES=0
the binary file size has been reduced to 31 kB, which gives us a significant improvement:
$ git checkout 377cf20
$ g++ -Os -flto -DNDEBUG \
"-DFMT_STATIC_THOUSANDS_SEPARATOR=','" -DFMT_BUILTIN_TYPES=0 \
-I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 31K Aug 30 19:37 a.out
However, Bloaty's updated results show that some locale artifacts still persist, such as digit_grouping
:
$ bloaty -d fullsymbols a.out
FILE SIZE VM SIZE
-------------- --------------
41.8% 18.0Ki 39.7% 11.0Ki [84 Others]
6.4% 2.77Ki 0.0% 0 [section .symtab]
5.3% 2.28Ki 0.0% 0 [section .strtab]
4.6% 1.99Ki 6.9% 1.90Ki fmt::v11::detail::format_handler<char>::on_format_specs(int, char const*, char const*)
4.4% 1.88Ki 0.0% 0 [ELF Section Headers]
4.1% 1.78Ki 5.8% 1.61Ki fmt::v11::basic_appender<char> fmt::v11::detail::write_int_noinline<char, fmt::v11::basic_appender<char>, unsigned int>(fmt::v11::basic_appender<char>, fmt::v11::detail::write_int_arg<unsigned int>, fmt::v11::format_specs const&, fmt::v11::detail::locale_ref) (.constprop.0)
3.7% 1.60Ki 5.8% 1.60Ki [section .dynstr]
3.5% 1.50Ki 4.8% 1.34Ki void fmt::v11::detail::vformat_to<char>(fmt::v11::detail::buffer<char>&, fmt::v11::basic_string_view<char>, fmt::v11::detail::vformat_args<char>::type, fmt::v11::detail::locale_ref) (.constprop.0)
3.5% 1.49Ki 4.9% 1.35Ki fmt::v11::basic_appender<char> fmt::v11::detail::write_int<fmt::v11::basic_appender<char>, unsigned __int128, char>(fmt::v11::basic_appender<char>, unsigned __int128, unsigned int, fmt::v11::format_specs const&, fmt::v11::detail::digit_grouping<char> const&)
3.1% 1.31Ki 4.7% 1.31Ki [section .dynsym]
3.0% 1.29Ki 4.2% 1.15Ki fmt::v11::basic_appender<char> fmt::v11::detail::write_int<fmt::v11::basic_appender<char>, unsigned long, char>(fmt::v11::basic_appender<char>, unsigned long, unsigned int, fmt::v11::format_specs const&, fmt::v11::detail::digit_grouping<char> const&)
After disabling these artifacts in commits e582d37 And b3ccc2das well as adding a more convenient option for opting out via macro FMT_USE_LOCALE
the binary size has been reduced to 27 kB:
$ git checkout b3ccc2d
$ g++ -Os -flto -DNDEBUG -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 \
-I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 27K Aug 30 19:38 a.out
There are several places in the library where size is sacrificed for speed. For example, consider the function for calculating the number of decimal digits:
auto do_count_digits(uint32_t n) -> int {
// An optimization by Kendall Willets from https://bit.ly/3uOIQrB.
// This increments the upper 32 bits (log10(T) - 1) when >= T is added.
# define FMT_INC(T) (((sizeof(#T) - 1ull) << 32) - T)
static constexpr uint64_t table[] = {
FMT_INC(0), FMT_INC(0), FMT_INC(0), // 8
FMT_INC(10), FMT_INC(10), FMT_INC(10), // 64
FMT_INC(100), FMT_INC(100), FMT_INC(100), // 512
FMT_INC(1000), FMT_INC(1000), FMT_INC(1000), // 4096
FMT_INC(10000), FMT_INC(10000), FMT_INC(10000), // 32k
FMT_INC(100000), FMT_INC(100000), FMT_INC(100000), // 256k
FMT_INC(1000000), FMT_INC(1000000), FMT_INC(1000000), // 2048k
FMT_INC(10000000), FMT_INC(10000000), FMT_INC(10000000), // 16M
FMT_INC(100000000), FMT_INC(100000000), FMT_INC(100000000), // 128M
FMT_INC(1000000000), FMT_INC(1000000000), FMT_INC(1000000000), // 1024M
FMT_INC(1000000000), FMT_INC(1000000000) // 4B
};
auto inc = table[__builtin_clz(n | 1) ^ 31];
return static_cast<int>((n + inc) >> 32);
}
The table used here is 256 bytes. There is no universal solution, and making changes may negatively affect other situations. Fortunately, there is another implementation of this function for cases where __builtin_clz
unavailable, for example, constexpr
:
template <typename T> constexpr auto count_digits_fallback(T n) -> int {
int count = 1;
for (;;) {
// Integer division is slow so do it for a group of four digits instead
// of for every digit. The idea comes from the talk by Alexandrescu
// "Three Optimization Tips for C++". See speed-test for a comparison.
if (n < 10) return count;
if (n < 100) return count + 1;
if (n < 1000) return count + 2;
if (n < 10000) return count + 3;
n /= 10000u;
count += 4;
}
}
All that remains is to give users the option to choose whether to use the alternative implementation. As you might guess, this is done with another configuration macro FMT_OPTIMIZE_SIZE
:
auto count_digits(uint32_t n) -> int {
#ifdef FMT_BUILTIN_CLZ
if (!is_constant_evaluated() && !FMT_OPTIMIZE_SIZE) return do_count_digits(n);
#endif
return count_digits_fallback(n);
}
With this and a few similar changes we reduced the binary size to 23 kB:
$ git checkout 8e3da9d
$ g++ -Os -flto -DNDEBUG -I include \
-DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 23K Aug 30 19:41 a.out
We can probably reduce the binary size even further by playing with the settings, but let's turn our attention to something more global – the C++ standard library. What's the point of optimizing the size when we end up with 1-2 megabytes of C++ runtime?
Although {fmt} tries to rely as little as possible on the standard library, can we remove this dependency altogether? One obvious problem is exceptions, which we can disable with FMT_THROW, for example by specifying the value abort. In general, this is not recommended, but may be suitable in some cases, given that most errors are caught at compile time.
Let's try compiling with the ‑nodefaultlibs flag and exceptions disabled:
$ g++ -Os -flto -DNDEBUG -I include \
-DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
'-DFMT_THROW(s)=abort()' -fno-exceptions test.cc src/format.cc \
-nodefaultlibs -lc
/usr/bin/ld: /tmp/cc04DFeK.ltrans0.ltrans.o: in function `fmt::v11::basic_memory_buffer<char, 500ul, std::allocator<char> >::grow(fmt::v11::detail::buffer<char>&, unsigned long)':
<artificial>:(.text+0xaa8): undefined reference to `std::__throw_bad_alloc()'
/usr/bin/ld: <artificial>:(.text+0xab8): undefined reference to `operator new(unsigned long)'
/usr/bin/ld: <artificial>:(.text+0xaf8): undefined reference to `operator delete(void*, unsigned long)'
/usr/bin/ld: /tmp/cc04DFeK.ltrans0.ltrans.o: in function `fmt::v11::vprint_buffered(_IO_FILE*, fmt::v11::basic_string_view<char>, fmt::v11::basic_format_args<fmt::v11::context>) [clone .constprop.0]':
<artificial>:(.text+0x18c4): undefined reference to `operator delete(void*, unsigned long)'
collect2: error: ld returned 1 exit status
Surprisingly, this approach almost worked. The only dependency on the C++ runtime comes from fmt::basic_memory_buffer
– a small buffer on the stack that can be expanded into dynamic memory if necessary.
fmt::print
can write directly to the buffer FILE
and generally does not require dynamic memory allocation, so we could remove the dependency on fmt::basic_memory_buffer
from fmt::print
. But since it may be used elsewhere, a more correct approach would be to replace the standard allocator with malloc
And free
instead of new
And delete
.
template <typename T> struct allocator {
using value_type = T;
T* allocate(size_t n) {
FMT_ASSERT(n <= max_value<size_t>() / sizeof(T), "");
T* p = static_cast<T*>(malloc(n * sizeof(T)));
if (!p) FMT_THROW(std::bad_alloc());
return p;
}
void deallocate(T* p, size_t) { free(p); }
};
This reduces the binary size to 14 kB:
$ git checkout c0fab5e
$ g++ -Os -flto -DNDEBUG -I include \
-DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
'-DFMT_THROW(s)=abort()' -fno-exceptions test.cc src/format.cc \
-nodefaultlibs -lc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 14K Aug 30 19:06 a.out
Given that a C program with an empty function main
takes up 6 kB in the system used, {fmt}
adds less than 10 kB to the binary size.
We can also verify that we are no longer dependent on the C++ runtime:
$ ldd a.out
linux-vdso.so.1 (0x0000ffffb0738000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffb0530000)
/lib/ld-linux-aarch64.so.1 (0x0000ffffb06ff000)
I hope you found it interesting, and good luck with the inline formatting!