Toy implementation of fixed point numbers in C++

There is no base in C++ fixed point number typethere are no classes for them in the standard library either. At the same time, working with floating-point numbers (double, float) can often be non-obvious (for example, answer the question: is the addition operation on them associative?), in addition, the language provides the (often criticized) ability to overload arithmetic operators, pushing us to create our own data type.

Before writing the code, let's review the math, namely the representation of numbers in the uint8_t, int8_t types and the features of arithmetic operations on them. So, the addition of two uint8_t occurs modulo 256, that is 1 + 2 = 3, but 1 + 255 = 0, for int8_t negative values can be entered as follows: negative numbers correspond to those unsigned numbers from uint8_t that, when added modulo 256, will give zero, that is, -1 will look like 255 (FF) in memory. The boundaries of the int8_t type are -128…+127, for negative numbers the most significant bit is always 1. When multiplying two int8_t, we get a result of the int16_t type, the quotient of dividing int16_t by uint8_t will have the uint8_t type. All this information is absolutely trivial, but it is necessary for further understanding of the article.

So, let's move on to the main idea: what if we mentally take a value of type int8_t and say that now it is not a number of units, but, say, the number 1/4 (let's say it in words: this number shows how many fourth parts are in the original number)? Then we encapsulate this variable in a class field and overload the main arithmetic operators for it and write our own operator string() for the correct output of such numbers. Below, you can see what comes out of this idea.

#include <stdexcept>
#include <cmath>
#include <climits>
#include <iostream>
#include <sstream>
#include <iomanip>

template <typename T>
constexpr T ipow(T num, unsigned int pow)
{
    return (pow >= sizeof(unsigned int)*8) ? 0 :
               pow == 0 ? 1 : num * ipow(num, pow-1);
}

// fills and return given number of ones in the least significant bits of a byte
constexpr uint8_t mask(uint8_t num)
{
    return num == 0 ? 0 : (mask(num - 1) << 1) | 1;
}

// FractionLength - how many bits are after a period
template <uint8_t FractionLength>
class FixedPoint
{
public:

    explicit FixedPoint(int decimal, unsigned int fraction = 0): val(0)
    {
        if (decimal > max_dec || decimal < min_dec || (decimal == min_dec && fraction) || (fraction >= full_fraction))
        {
            throw std::invalid_argument("It won't fit our so few bits");
        }

        int8_t sign = decimal > 0 ? +1 : -1;

        val |= fraction;
        val |= (decimal << FractionLength);
        val *= sign;
    }

    explicit FixedPoint(double v): val(0)
    {
        double decimal_double = 0.0;
        double fraction = std::modf(v, &decimal_double);

        if (decimal_double > max_dec || decimal_double < min_dec)
        {
            throw std::invalid_argument("It won't fit our so few bits");
        }

        int8_t decimal_part = static_cast<int8_t>(decimal_double);
        int8_t sign = v > 0 ? +1 : -1;
        decimal_part = std::abs(decimal_part);

        uint32_t fraction_decimal = std::abs(fraction)* ipow(10, FractionLength + 1);
        uint8_t count = fraction_decimal / fraction_unit_halv;
        count += count & 0x01;
        count >>= 1;

        if (count && sign == -1 && decimal_part == min_dec_abs)
        {
            throw std::invalid_argument("It won't fit our so few bits");
        }

        if (count == full_fraction)
        {
            decimal_part += 1;

            auto decimal_part_signed = sign * decimal_part;

            if (decimal_part_signed > max_dec || decimal_part_signed < min_dec)
            {
                throw std::invalid_argument("It won't fit our so few bits");
            }
        }
        else
        {
            val |= count;
        }

        val |= (decimal_part << FractionLength);

        val *= sign;

    }


    FixedPoint operator + (const FixedPoint& r) const
    {
        return makeFx(val + r.val);
    }

    FixedPoint operator - (const FixedPoint& r) const
    {
        return makeFx(val - r.val);
    }

    FixedPoint operator - () const
    {
        return makeFx(-val);
    }

    FixedPoint operator * (const FixedPoint& r) const
    {
        uint16_t temp = val * r.val;
        temp >>= (FractionLength - 1);
        uint8_t unit = temp & 0x01;

        return makeFx((temp >> 1) + unit);
    }

    FixedPoint operator / (const FixedPoint& r) const
    {
        uint16_t left = val;
        left <<= (FractionLength + 1);
        uint8_t temp = left / r.val;
        uint8_t unit = temp & 0x01;

        return makeFx((temp >> 1) + unit );
    }

    operator std::string() const
    {
        std::stringstream res;
        uint8_t temp = val;
        auto fraction = temp & fraction_mask;

        if (sign_mask & temp)
        {
            res << '-';

            temp = ~temp + 1;
            fraction = temp & fraction_mask;
        }

        res << (temp >> FractionLength);


        if (fraction)
        {
            res << '.' << std::setw(FractionLength) << fraction * fraction_unit;
        }

        return res.str();
    }

private:

    int8_t val;
    static constexpr uint8_t decimal_lenght = CHAR_BIT - FractionLength;
    static constexpr uint8_t max_dec = (1 << (decimal_lenght - 1)) - 1;
    static constexpr int8_t  min_dec_abs = (1 << (decimal_lenght - 1));
    static constexpr int8_t  min_dec = -min_dec_abs;
    static constexpr uint32_t fraction_unit = ipow(10, FractionLength) / (1 << FractionLength);
    static constexpr uint32_t fraction_unit_halv = ipow(10, FractionLength + 1) / (1 << (FractionLength + 1));
    static constexpr uint8_t full_fraction = ipow(2, FractionLength);
    static constexpr uint8_t sign_mask = 0x80;
    static constexpr uint8_t fraction_mask = mask(FractionLength);
    static constexpr uint8_t decimal_mask = 0xFF & ~FractionLength;


    explicit FixedPoint(): val(0)
    {
    }
    
    static FixedPoint makeFx(int8_t v)
    {
        FixedPoint fp;
        fp.val = v;
        return fp;
    }
};

template <uint8_t FractionLength>
std::ostream& operator << (std::ostream& out, const FixedPoint<FractionLength>& number)
{
    out << std::string(number);

    return out;
}

template <uint8_t FractionLength>
const FixedPoint<FractionLength> operator + (const FixedPoint<FractionLength>& l, int r)
{
    return l + FixedPoint<FractionLength>(r);
}

template <uint8_t FractionLength>
const FixedPoint<FractionLength> operator + (int l, const FixedPoint<FractionLength>& r)
{
    return r + FixedPoint<FractionLength>(l);
}

template <uint8_t FractionLength>
const FixedPoint<FractionLength> operator - (const FixedPoint<FractionLength>& l, int r)
{
    return l - FixedPoint<FractionLength>(r);
}

template <uint8_t FractionLength>
const FixedPoint<FractionLength> operator - (int l, const FixedPoint<FractionLength>& r)
{
    return r - FixedPoint<FractionLength>(l);
}

template <uint8_t FractionLength>
const FixedPoint<FractionLength> operator * (const FixedPoint<FractionLength>& l, int r)
{
    return l * FixedPoint<FractionLength>(r);
}

template <uint8_t FractionLength>
const FixedPoint<FractionLength> operator * (int l, const FixedPoint<FractionLength>& r)
{
    return r * FixedPoint<FractionLength>(l);
}

template <uint8_t FractionLength>
const FixedPoint<FractionLength> operator / (const FixedPoint<FractionLength>& l, int r)
{
    return l / FixedPoint<FractionLength>(r);
}

template <uint8_t FractionLength>
const FixedPoint<FractionLength> operator *= (FixedPoint<FractionLength>& l, const FixedPoint<FractionLength>& r)
{
    l = l * r;
    return l;
}

template <uint8_t FractionLength>
const FixedPoint<FractionLength> operator += (FixedPoint<FractionLength>& l, const FixedPoint<FractionLength>& r)
{
    l = l + r;
    return l;
}

using FixedPoint_2 = FixedPoint<2>;
using FixedPoint_4 = FixedPoint<4>;

Explanations:

The longest and most complex function here is the FixedPoint(double v) constructor, it is only there for ease of testing – skip it on first reading
The FixedPoint(int decimal, unsigned int fraction = 0) constructor is much simpler, but is mainly needed for overloaded arithmetic operators where one of the arguments is of type int
Constructors check for overflows of the val field, but arithmetic operators do not. This is a conscious decision: it is inconvenient to initially work with invalid numbers that were truncated and guess why the whole program is working incorrectly, at the same time there is no need to check for this everywhere, just as the compiler does not check for built-in types
Arithmetic operators outside the class are trivially based on the operators defined in the class
Addition and subtraction are trivial, it makes no difference what you add, hundredths (fourths) or wholes
Private constructor and function makeFx – a workaround to “fill” into the val field, in some operators

What's going on in operator * and operator / ? Let me explain: if you multiply 2 numbers representing the quantity 1/4 you get a number representing the quantity 1/16 – so we shift them to the right by 2 places, but by shifting them like this we do rough rounding – we take into account the last shifted place, that is, we get the number 0.10₂before the shift, without this we would lose it completely, and so we do it according to the standard rounding rules. With the division operator, it's a slightly different story: when dividing the number of 1/4 by each other, the result moves to the right by 2 digits, but in the code I move the dividend to the left by 3 digits in advance – the 3rd digit is again needed for better rounding.

string operator() does 3 things, selects the integer part and outputs it as an integer, selects the fractional part and converts it to a decimal number (hundredths) and outputs it as an integer, the last one is determining the sign, we do this through the most significant bit, the string ~temp + 1 is the same as a unary minus, it's just strange to use a unary minus over an unsigned type.

I'll give you a small example where we'll see how our type works.

template <typename T>
T poly(T x)
{
    return (2*x +1)*x - 3;
}

double x = 0.261799;
FixedPoint_4 fx{x};

std::cout << poly(x) << ' ' << poly(fx) << std::endl;

Result of work:

-2.60112 -2.6250

In my opinion, it's not bad, for something that was created “on the fly!”

So, what can be done to remove the word “toy” from the class attributes:

use a larger base type, best of all int32_t – intermediate results will be in int64_t, you won't have to come up with some “long” multiplication
overload more operators
cover with tests