reduced the size to 14kB and got rid of the C++ runtime

Formatting library {fmt} is known for its low impact on binary size. Most often, its code is several times smaller than libraries such as IOStreams, Boost Format or, ironically, tinyformat. This is achieved through careful application of type erasure at different levels, which minimizes unnecessary use of templates.

Formatting arguments are passed via format_args with erased types:

auto vformat(string_view fmt, format_args args) -> std::string;

template <typename... T>
auto format(format_string<T...> fmt, T&&... args) -> std::string {
  return vformat(fmt, fmt::make_format_args(args...));
}

As you can see, format delegates all non-template work, vformat.

For output iterators and other output types, types are erased using a special buffer API.

This approach reduces the use of templates to a minimal top-level layer, while reducing the binary size and compile time.

For example, the following code:

// test.cc
#include <fmt/base.h>

int main() {
  fmt::print("The answer is {}.", 42);
}

Compiles to

.LC0:
        .string "The answer is {}."
main:
        sub     rsp, 24
        mov     eax, 1
        mov     edi, OFFSET FLAT:.LC0
        mov     esi, 17
        mov     rcx, rsp
        mov     rdx, rax
        mov     DWORD PTR [rsp], 42
        call    fmt::v11::vprint(fmt::v11::basic_string_view<char>, fmt::v11::basic_format_args<fmt::v11::context>)
        xor     eax, eax
        add     rsp, 24
        ret

It is significantly smaller than the equivalent IOStreams code and is comparable to printf:

.LC0:
        .string "The answer is %d."
main:
        sub     rsp, 8
        mov     esi, 42
        mov     edi, OFFSET FLAT:.LC0
        xor     eax, eax
        call    printf
        xor     eax, eax
        add     rsp, 8
        ret

Unlike printf, {fmt} provides full type safety at runtime. Errors in format strings can be caught at compile time, but even if strings are defined at runtime, errors are treated as exceptions, preventing undefined behavior, memory corruption, and potential crashes. Also, in general, calls {fmt} more efficient, especially when positional arguments are used, which is what C is for varargs not very suitable.

In 2020, I dedicated some time library size optimizationsuccessfully reducing its size to less than 100 kB (only ~57 kB when using -Os -flto). Much has changed since then. In particular, {fmt} now uses an outstanding algorithm Dragonbox for formatting floating point numbers, courtesy of its author Junekey Jeon. Let's see how these changes affected the binary file size and whether it can be reduced even further.

One might ask: why exactly the size of the binary file?

Usage {fmt} on devices with limited memory has generated considerable interest – in particular, these examples from the distant past: #758 And #1226. A particularly intriguing case is programming for retro machines, where people use {fmt} for systems such as Amiga (#4054).

We will use the same methodology as earlierexamining the size of the executable file of a program using {fmt}as this will be most relevant for most users. All tests will be performed on a platform with aarch64 Ubuntu 22.04 and GCC 11.4.0.

First of all, let's define the starting point: what is the binary file size of the latest version? {fmt} (11.0.2)?

$ git checkout 11.0.2
$ g++ -Os -flto -DNDEBUG -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 75K Aug 30 19:24 a.out

The size of the resulting file is 75 kB. On the positive side, despite the significant number of changes over the last 4 years, the size has not increased significantly.

It's time to consider possible optimization paths. One of the first candidates might be disabling locale support. All formatting in {fmt} locale-independent by default (which breaks the C++ tradition of having bad defaults), but is still accessible via the format specifier L. It can be disabled using a macro. FMT_STATIC_THOUSANDS_SEPARATOR:

$ g++ -Os -flto -DNDEBUG "-DFMT_STATIC_THOUSANDS_SEPARATOR=','" \
      -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 71K Aug 30 19:25 a.out

Disabling locale support reduces the binary file size to 71 kB.

Now let's check the result with our trusty tool – Bloaty:

$ bloaty -d symbols a.out

    FILE SIZE        VM SIZE
 --------------  --------------
  43.8%  41.1Ki  43.6%  29.0Ki    [121 Others]
   6.4%  6.04Ki   8.1%  5.42Ki    fmt::v11::detail::do_write_float<>()
   5.9%  5.50Ki   7.5%  4.98Ki    fmt::v11::detail::write_int_noinline<>()
   5.7%  5.32Ki   5.8%  3.88Ki    fmt::v11::detail::write<>()
   5.4%  5.02Ki   7.2%  4.81Ki    fmt::v11::detail::parse_replacement_field<>()
   3.9%  3.69Ki   3.7%  2.49Ki    fmt::v11::detail::format_uint<>()
   3.2%  3.00Ki   0.0%       0    [section .symtab]
   2.7%  2.50Ki   0.0%       0    [section .strtab]
   2.3%  2.12Ki   2.9%  1.93Ki    fmt::v11::detail::dragonbox::to_decimal<>()
   2.0%  1.89Ki   2.4%  1.61Ki    fmt::v11::detail::write_int<>()
   2.0%  1.88Ki   0.0%       0    [ELF Section Headers]
   1.9%  1.79Ki   2.5%  1.66Ki    fmt::v11::detail::write_float<>()
   1.9%  1.78Ki   2.7%  1.78Ki    [section .dynstr]
   1.8%  1.72Ki   2.4%  1.62Ki    fmt::v11::detail::format_dragon()
   1.8%  1.68Ki   1.5%    1016    fmt::v11::detail::format_decimal<>()
   1.6%  1.52Ki   2.1%  1.41Ki    fmt::v11::detail::format_float<>()
   1.6%  1.49Ki   0.0%       0    [Unmapped]
   1.5%  1.45Ki   2.2%  1.45Ki    [section .dynsym]
   1.5%  1.45Ki   2.0%  1.31Ki    fmt::v11::detail::write_loc()
   1.5%  1.44Ki   2.2%  1.44Ki    [section .rodata]
   1.5%  1.40Ki   1.1%     764    fmt::v11::detail::do_write_float<>()::{lambda()#2}::operator()()
 100.0%  93.8Ki 100.0%  66.6Ki    TOTAL

A significant portion of the binary file is code devoted to formatting numbers, especially floating point numbers. The latter formatting also relies on the use of large tables, not shown here. But what if we don't need floating point support? {fmt} allows you to turn it off, although the method of disabling it is a bit spontaneous and does not apply to other types.

The main problem is that formatting functions need to know about all the types being formatted. But is this really true? This is true for printfdefined by the C standard, but not required for {fmt}. {fmt} supports an extension API that allows formatting arbitrary types without knowing their full set in advance. While built-in and string types are treated specifically to optimize performance, a different approach may be required when optimizing file size. By removing special handling and handling each type using the extension API, we can avoid the overhead of unused types.

I created an experimental one implementation of this idea. By setting the value to 0 for the macro FMT_BUILTIN_TYPESonly int is handled specifically, and all other types go through the extension API. We still need to know about int for dynamic width and precision, for example:

fmt::print("{:{}}\n", "hello", 10); // prints "hello     "

This gives us a “don't pay for what you don't use” model, although it does increase the size of each call in the binary a bit. If you handle floating point or other types, the corresponding code will still be included in the final assembly. And while it is possible to reduce the implementation of floating point handling, we will not touch on this topic in this article.

Using FMT_BUILTIN_TYPES=0the binary file size has been reduced to 31 kB, which gives us a significant improvement:

$ git checkout 377cf20
$ g++ -Os -flto -DNDEBUG \
      "-DFMT_STATIC_THOUSANDS_SEPARATOR=','" -DFMT_BUILTIN_TYPES=0 \
      -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 31K Aug 30 19:37 a.out

However, Bloaty's updated results show that some locale artifacts still persist, such as digit_grouping:

$ bloaty -d fullsymbols a.out

    FILE SIZE        VM SIZE
 --------------  --------------
  41.8%  18.0Ki  39.7%  11.0Ki    [84 Others]
   6.4%  2.77Ki   0.0%       0    [section .symtab]
   5.3%  2.28Ki   0.0%       0    [section .strtab]
   4.6%  1.99Ki   6.9%  1.90Ki    fmt::v11::detail::format_handler<char>::on_format_specs(int, char const*, char const*)
   4.4%  1.88Ki   0.0%       0    [ELF Section Headers]
   4.1%  1.78Ki   5.8%  1.61Ki    fmt::v11::basic_appender<char> fmt::v11::detail::write_int_noinline<char, fmt::v11::basic_appender<char>, unsigned int>(fmt::v11::basic_appender<char>, fmt::v11::detail::write_int_arg<unsigned int>, fmt::v11::format_specs const&, fmt::v11::detail::locale_ref) (.constprop.0)
   3.7%  1.60Ki   5.8%  1.60Ki    [section .dynstr]
   3.5%  1.50Ki   4.8%  1.34Ki    void fmt::v11::detail::vformat_to<char>(fmt::v11::detail::buffer<char>&, fmt::v11::basic_string_view<char>, fmt::v11::detail::vformat_args<char>::type, fmt::v11::detail::locale_ref) (.constprop.0)
   3.5%  1.49Ki   4.9%  1.35Ki    fmt::v11::basic_appender<char> fmt::v11::detail::write_int<fmt::v11::basic_appender<char>, unsigned __int128, char>(fmt::v11::basic_appender<char>, unsigned __int128, unsigned int, fmt::v11::format_specs const&, fmt::v11::detail::digit_grouping<char> const&)
   3.1%  1.31Ki   4.7%  1.31Ki    [section .dynsym]
   3.0%  1.29Ki   4.2%  1.15Ki    fmt::v11::basic_appender<char> fmt::v11::detail::write_int<fmt::v11::basic_appender<char>, unsigned long, char>(fmt::v11::basic_appender<char>, unsigned long, unsigned int, fmt::v11::format_specs const&, fmt::v11::detail::digit_grouping<char> const&)

After disabling these artifacts in commits e582d37 And b3ccc2das well as adding a more convenient option for opting out via macro FMT_USE_LOCALEthe binary size has been reduced to 27 kB:

$ git checkout b3ccc2d
$ g++ -Os -flto -DNDEBUG -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 \
      -I include test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 27K Aug 30 19:38 a.out

There are several places in the library where size is sacrificed for speed. For example, consider the function for calculating the number of decimal digits:

auto do_count_digits(uint32_t n) -> int {
// An optimization by Kendall Willets from https://bit.ly/3uOIQrB.
// This increments the upper 32 bits (log10(T) - 1) when >= T is added.
#  define FMT_INC(T) (((sizeof(#T) - 1ull) << 32) - T)
  static constexpr uint64_t table[] = {
      FMT_INC(0),          FMT_INC(0),          FMT_INC(0),           // 8
      FMT_INC(10),         FMT_INC(10),         FMT_INC(10),          // 64
      FMT_INC(100),        FMT_INC(100),        FMT_INC(100),         // 512
      FMT_INC(1000),       FMT_INC(1000),       FMT_INC(1000),        // 4096
      FMT_INC(10000),      FMT_INC(10000),      FMT_INC(10000),       // 32k
      FMT_INC(100000),     FMT_INC(100000),     FMT_INC(100000),      // 256k
      FMT_INC(1000000),    FMT_INC(1000000),    FMT_INC(1000000),     // 2048k
      FMT_INC(10000000),   FMT_INC(10000000),   FMT_INC(10000000),    // 16M
      FMT_INC(100000000),  FMT_INC(100000000),  FMT_INC(100000000),   // 128M
      FMT_INC(1000000000), FMT_INC(1000000000), FMT_INC(1000000000),  // 1024M
      FMT_INC(1000000000), FMT_INC(1000000000)                        // 4B
  };
  auto inc = table[__builtin_clz(n | 1) ^ 31];
  return static_cast<int>((n + inc) >> 32);
}

The table used here is 256 bytes. There is no universal solution, and making changes may negatively affect other situations. Fortunately, there is another implementation of this function for cases where __builtin_clz unavailable, for example, constexpr:

template <typename T> constexpr auto count_digits_fallback(T n) -> int {
  int count = 1;
  for (;;) {
    // Integer division is slow so do it for a group of four digits instead
    // of for every digit. The idea comes from the talk by Alexandrescu
    // "Three Optimization Tips for C++". See speed-test for a comparison.
    if (n < 10) return count;
    if (n < 100) return count + 1;
    if (n < 1000) return count + 2;
    if (n < 10000) return count + 3;
    n /= 10000u;
    count += 4;
  }
}

All that remains is to give users the option to choose whether to use the alternative implementation. As you might guess, this is done with another configuration macro FMT_OPTIMIZE_SIZE:

auto count_digits(uint32_t n) -> int {
#ifdef FMT_BUILTIN_CLZ
  if (!is_constant_evaluated() && !FMT_OPTIMIZE_SIZE) return do_count_digits(n);
#endif
  return count_digits_fallback(n);
}

With this and a few similar changes we reduced the binary size to 23 kB:

$ git checkout 8e3da9d
$ g++ -Os -flto -DNDEBUG -I include \
      -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
      test.cc src/format.cc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 23K Aug 30 19:41 a.out

We can probably reduce the binary size even further by playing with the settings, but let's turn our attention to something more global – the C++ standard library. What's the point of optimizing the size when we end up with 1-2 megabytes of C++ runtime?

Although {fmt} tries to rely as little as possible on the standard library, can we remove this dependency altogether? One obvious problem is exceptions, which we can disable with FMT_THROW, for example by specifying the value abort. In general, this is not recommended, but may be suitable in some cases, given that most errors are caught at compile time.

Let's try compiling with the ‑nodefaultlibs flag and exceptions disabled:

$ g++ -Os -flto -DNDEBUG -I include \
      -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
      '-DFMT_THROW(s)=abort()' -fno-exceptions test.cc src/format.cc \
      -nodefaultlibs -lc

/usr/bin/ld: /tmp/cc04DFeK.ltrans0.ltrans.o: in function `fmt::v11::basic_memory_buffer<char, 500ul, std::allocator<char> >::grow(fmt::v11::detail::buffer<char>&, unsigned long)':
<artificial>:(.text+0xaa8): undefined reference to `std::__throw_bad_alloc()'
/usr/bin/ld: <artificial>:(.text+0xab8): undefined reference to `operator new(unsigned long)'
/usr/bin/ld: <artificial>:(.text+0xaf8): undefined reference to `operator delete(void*, unsigned long)'
/usr/bin/ld: /tmp/cc04DFeK.ltrans0.ltrans.o: in function `fmt::v11::vprint_buffered(_IO_FILE*, fmt::v11::basic_string_view<char>, fmt::v11::basic_format_args<fmt::v11::context>) [clone .constprop.0]':
<artificial>:(.text+0x18c4): undefined reference to `operator delete(void*, unsigned long)'
collect2: error: ld returned 1 exit status

Surprisingly, this approach almost worked. The only dependency on the C++ runtime comes from fmt::basic_memory_buffer – a small buffer on the stack that can be expanded into dynamic memory if necessary.

fmt::print can write directly to the buffer FILE and generally does not require dynamic memory allocation, so we could remove the dependency on fmt::basic_memory_buffer from fmt::print. But since it may be used elsewhere, a more correct approach would be to replace the standard allocator with malloc And free instead of new And delete.

template <typename T> struct allocator {
  using value_type = T;

  T* allocate(size_t n) {
    FMT_ASSERT(n <= max_value<size_t>() / sizeof(T), "");
    T* p = static_cast<T*>(malloc(n * sizeof(T)));
    if (!p) FMT_THROW(std::bad_alloc());
    return p;
  }

  void deallocate(T* p, size_t) { free(p); }
};

This reduces the binary size to 14 kB:

$ git checkout c0fab5e
$ g++ -Os -flto -DNDEBUG -I include \
      -DFMT_USE_LOCALE=0 -DFMT_BUILTIN_TYPES=0 -DFMT_OPTIMIZE_SIZE=1 \
      '-DFMT_THROW(s)=abort()' -fno-exceptions test.cc src/format.cc \
      -nodefaultlibs -lc
$ strip a.out && ls -lh a.out
-rwxrwxr-x 1 vagrant vagrant 14K Aug 30 19:06 a.out

Given that a C program with an empty function main takes up 6 kB in the system used, {fmt} adds less than 10 kB to the binary size.

We can also verify that we are no longer dependent on the C++ runtime:

$ ldd a.out
        linux-vdso.so.1 (0x0000ffffb0738000)
        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000ffffb0530000)
        /lib/ld-linux-aarch64.so.1 (0x0000ffffb06ff000)

I hope you found it interesting, and good luck with the inline formatting!

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *