Step-by-step improvement of algorithm performance

I recently had the opportunity to work on a new approximate nearest neighbor search algorithm called RaBitQ. The author of this algorithm has already provided a fairly fast implementation in C++. I tried rewriting this algorithm in Rust (another case of “why not rewrite it in Rust”). However, I found that my implementation was much slower than the original. Below I will describe how I improved its performance step by step.

Preparing the environment

Datasets

In this case, choosing adequate datasets is the most important step. Since the above-mentioned article describing the C++ implementation already reports results on the sift_dim128_1m_l2 and gist_dim960_1m_l2 datasets, the 128- and 960-dimensional variants can be considered typical, and 1_000_000 vectors should be enough for benchmarking. These datasets can be downloaded from here. (Yes, I know this site does not support TLS and only allows FTP downloads.)

These datasets use the fvecs/ivecs format, which is simply a sequence of vectors:

| dim (4 bytes) | vector (4 * dim bytes) |
| dim (4 bytes) | vector (4 * dim bytes) |
...
| dim (4 bytes) | vector (4 * dim bytes) |

You can download the script for reading and writing them here.

Profiling tool

To profile Rust code I use samply. It integrates well with the Firefox Profiler, and the profiling results can be shared with colleagues by uploading them to the cloud. Here is an example that profiles the C++ version. The most useful views are the Flame Graph and the Call Tree. Remember to grant access to performance events and increase the mlock limit:

echo '1' | sudo tee /proc/sys/kernel/perf_event_paranoid
sudo sysctl kernel.perf_event_mlock_kb=2048

You will also find the Compiler Explorer (Godbolt) useful, which provides a convenient comparison of the assembly generated for C++ and Rust functions.

Cargo profile

To include debugging information in the release build, you can add another profile to Cargo.toml:

[profile.perf]
inherits = "release"
debug = true
codegen-units = 16

The ease of profiling depends heavily on the compilation overhead and how quickly the program runs.

  • With cargo build, compilation is fast, but the code may run slower than pure Python

  • With cargo build --release the code is fast, but compilation may take a long time

For benchmarking there is no other option but to use opt-level = 3.

Several times I came across advice to set the following settings:

codegen-units = 1
lto = "fat"
panic = "abort"

But in my practice, they only slowed down compilation and did not improve performance at all.

Benchmarking Tools

Criterion is a good benchmarking tool based on statistical analysis. I originally created a separate repository in which I collected all the benchmarking code, but it turned out that all of it is better kept in the same repository.

Here it should be noted in particular that the benchmarking results are not very stable. I've seen differences of up to ±10% without making any changes to the code. If you're working on a laptop the situation can be even worse, since high temperatures may cause the CPU to throttle.

I advise benchmarking with several different function parameters. In this case I use different dimensions. If the results improve across all dimensions, this usually means the optimization is effective.
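
As a rough illustration, a minimal Criterion benchmark over several dimensions might look like this; the l2_squared function is just a stand-in placeholder, not the actual code from my benchmark repository:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// a stand-in for the function under test
fn l2_squared(x: &[f32], y: &[f32]) -> f32 {
    x.iter().zip(y).map(|(a, b)| (a - b) * (a - b)).sum()
}

fn bench_l2(c: &mut Criterion) {
    let mut group = c.benchmark_group("l2_squared");
    // run the same benchmark for several dimensions
    for dim in [128usize, 960] {
        let x: Vec<f32> = (0..dim).map(|i| i as f32).collect();
        let y: Vec<f32> = (0..dim).map(|i| (i as f32) * 0.5).collect();
        group.bench_function(format!("dim_{dim}"), |b| {
            b.iter(|| l2_squared(black_box(&x), black_box(&y)))
        });
    }
    group.finish();
}

criterion_group!(benches, bench_l2);
criterion_main!(benches);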

Metrics

Don't forget to add some metrics from the start. Many bugs and performance issues can be spotted by checking the metrics. I use AtomicU64 directly, since the current requirements are simple. Later I may switch to Prometheus metrics.
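
For example, a couple of counters based on AtomicU64 might look like this (the field names are only illustrative, not the ones used in the actual code):

use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Debug, Default)]
struct Metrics {
    // how many queries were served
    query_count: AtomicU64,
    // how many candidates had to be re-ranked with the exact distance
    rerank_count: AtomicU64,
}

impl Metrics {
    fn record_query(&self) {
        self.query_count.fetch_add(1, Ordering::Relaxed);
    }

    fn record_rerank(&self) {
        self.rerank_count.fetch_add(1, Ordering::Relaxed);
    }
}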

Please note that going overboard with metrics/logging/tracing can also impact performance. So add them with caution.

Resources

While benchmarking, I noticed that the end-to-end QPS (queries per second) is extremely unstable. From one day to the next I could see a 15% improvement or degradation without even recompiling the code. Then I noticed that the CPUs were not completely idle, since I was running VS Code + rust-analyzer. They don't seem to load the processor much, but they had a very noticeable effect on the benchmarking results. Although I work with an Intel Core i7-13700K, which has 8 performance and 8 efficiency cores, the program is single-threaded.

With taskset I pin the process to a specific CPU core, so it is not affected by being scheduled across the mixed core types.

Please note: Intel Core 13th/14th generation processors suffer from instability caused by excessively high voltage. I fixed this problem in the BIOS.

Cloud virtual machines are probably not affected by CPU temperature, but cloud providers may have their own policies for CPU cycle stealing and overselling.

Step-by-step improvement

Let's start with a simplified implementation

My first release of the RaBitQ algorithm was based on the linear algebra library nalgebra. This choice was made, first of all, because I needed QR decomposition to obtain an orthogonal matrix, which is a key step of the RaBitQ algorithm. In addition, a mature linear algebra library provides many useful functions for matrix and vector operations, and this library really did make implementing the algorithm easier for me. Imagine implementing an algorithm that involves multiplying, projecting, and decomposing matrices in Python without numpy. It's a nightmare.

I expected good performance because nalgebra is optimized for exactly this kind of scenario. But according to the benchmarks, the algorithm was much slower than I expected. I suspect it would be much faster if reimplemented with numpy 🙁

According to the profiling, a lot of f32::clone() calls are being made. They take up approximately 33% of the total time, or even 44% if you focus on the query_one function. Here I remembered a fairly common trick: pre-allocate memory for several vectors and reuse it across iterations. So instead of (x - y).norm_squared() I declare another vector in advance that stores the result of (x - y), and the code ultimately takes the form x.sub_to(y, &mut z); z.norm_squared(). See commit 23f9aff.
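
A minimal sketch of what this looks like with nalgebra's DVector (the function name and signature are mine, for illustration only):

use nalgebra::DVector;

// `residual` is allocated once by the caller and reused on every call,
// so computing the distance no longer allocates a temporary vector.
fn l2_squared_into(x: &DVector<f32>, y: &DVector<f32>, residual: &mut DVector<f32>) -> f32 {
    x.sub_to(y, residual);   // residual = x - y
    residual.norm_squared()  // squared L2 norm of the difference
}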

Here, as in most linear algebra libraries, matrices are stored in column-major order, which means that iterating over columns is faster than over rows. This is a little annoying, since the matrix has to be transposed before iteration, and not all vector/matrix multiplications can catch a dimension mismatch at compile time (1 x dyn or dyn x 1).

Target CPU

RaBitQ uses a binary dot product to estimate the approximate distance, which is calculated as follows:

fn binary_dot_product(x: &[u64], y: &[u64]) -> u32 {
    assert_eq!(x.len(), y.len());
    let mut res = 0;
    for i in 0..x.len() {
        res += (x[i] & y[i]).count_ones();
    }
    res
}

I assumed that u64::count_ones() would directly use the corresponding instruction. It turns out I still have to enable the popcnt feature at compile time. This can be done with RUSTFLAGS="-C target-feature=+popcnt", but I prefer RUSTFLAGS="-C target-cpu=native", which enables all features supported by the current CPU at once. True, this also makes the binary non-portable, but at this stage that is acceptable. The following sections also require this environment variable in order to take advantage of AVX2.

You can check the capabilities of your CPU with the following command:

rustc --print=cfg -C target-cpu=native | rg target_feature

SIMD

When searching for nearest neighbors, the distance function plays a key role, and in this case it is the Euclidean distance. As a rule, we use the squared L2 distance so we don't have to take the square root. Here is what a simplified implementation looks like:

{
    y.sub_to(x, &mut residual);
    residual.norm_squared()
}

After profiling I found that f32::clone() was still there. After checking the nalgebra source code, I found out that for some reason this library performs a lot of clones; why, I don't know. So I decided to write the SIMD code manually. Fortunately, hnswlib (a popular implementation of HNSW, the small-world nearest neighbor search algorithm) already implements this.

As a result, f32::clone() is eliminated from the distance calculation, and the QPS on SIFT (Scale Invariant Feature Transform) improves by 28%. See commit 5f82fcc for yourself.
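
Here is a rough sketch of what an AVX2 squared L2 distance looks like; it is not the exact code from commit 5f82fcc, and it assumes the dimension is a multiple of 8:

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn l2_squared_avx2(x: &[f32], y: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    debug_assert_eq!(x.len(), y.len());
    let mut sum = _mm256_setzero_ps();
    for (a, b) in x.chunks_exact(8).zip(y.chunks_exact(8)) {
        let va = _mm256_loadu_ps(a.as_ptr());
        let vb = _mm256_loadu_ps(b.as_ptr());
        let diff = _mm256_sub_ps(va, vb);
        sum = _mm256_fmadd_ps(diff, diff, sum); // sum += diff * diff
    }
    // horizontal sum of the 8 lanes
    let mut buf = [0.0f32; 8];
    _mm256_storeu_ps(buf.as_mut_ptr(), sum);
    buf.iter().sum()
}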

My CPU doesn't support AVX512, so I use the AVX2 version. You can check the Steam Hardware Stats: SIMD support is listed in the “Other Settings” section. 100% of users have SSE3, 94.61% have AVX2, and only 13.06% have AVX512F. Of course, these statistics are not entirely objective, since most Intel CPUs support AVX512 but not all users are gamers.

The most useful guide to working with SIMD is the Intel Intrinsics Guide. It is better to download the entire site, since working with it online is not very convenient. Don't forget to check the latency and throughput of the intrinsics, otherwise the code may end up slower than the scalar version.

Another good resource is the x86 intrinsics cheat sheet. Good for beginners like me.

@ashvardanian wrote a post about masks, which explains how to handle the tail elements (AVX512 required).

Add this to make the code work on other platforms:

#[cfg(any(target_arch = "x86_64", target_arch = "x86"))]
{
    if is_x86_feature_detected!("avx2") {
        // AVX2 version
    } else {
        // scalar fallback version
    }
}

Rust has several useful crates that help you write cleaner cfg code for SIMD, but for now let's keep it simple.

More about SIMD

SIMD is like a hammer, so now I need to look through the code for more nails to hit with it.

  • The binarize_vector function can be rewritten with AVX2, as done in commit f114fc1. This improves the QPS for GIST by 32%.

Unlike the original C++ version, this implementation is also branchless. With opt-level=3 enabled, the compiler can optimize it well. See the assembly code.

As @andrewaylett pointed out, opt-level=3 is able to optimize this:

- let shift = if (i / 32) % 2 == 0 { 32 } else { 0 };
+ let shift = ((i >> 5) & 1) << 5;  // the reversed version, since I store the binary code in u64, which has the opposite bit order from the C++ version

@NovaX was the first to point out that this is equivalent to i & 32, which is even more readable.

Scalar quantization

To further remove f32::clone() from the code, I decided to replace even more nalgebra functions with manual implementations. The most frequently used of these are min and max. This is the nalgebra version:

let lower_bound = residual.min();
let upper_bound = residual.max();

The same can be done manually like this:

fn min_max(vec: &[f32]) -> (f32, f32) {
    let mut min = f32::MAX;
    let mut max = f32::MIN;
    for v in vec.iter() {
        if *v < min {
            min = *v;
        }
        if *v > max {
            max = *v;
        }
    }
    (min, max)
}

I'm used to f32::min() and f32::max() because they are convenient. But for vectors that are not monotonically increasing or decreasing, the if version turns out to be faster.

To avoid iterating over the vector several times in a chain of function calls, and therefore computing the scalar quantization in several separate passes:

let y_scaled = residual.add_scalar(-lower_bound) * one_over_delta + &self.rand_bias;
let y_quantized = y_scaled.map(|v| v.to_u8().expect("convert to u8 error"));
let scalar_sum = y_quantized.iter().fold(0u32, |acc, &v| acc + v as u32);

you can write the following loop:

{
    let mut sum = 0u32;
    for i in 0..vec.len() {
        let q = ((vec[i] - lower_bound) * multiplier + bias[i]) as u8;
        quantized[i] = q;
        sum += q as u32;
    }
    sum
}

With scalar quantization we can be sure that the f32 value fits into a u8, so we use as u8 instead of to_u8().unwrap().

Commits af39c1c and d2d51b0 improve the QPS for GIST by 31%.

The next part can also be rewritten with SIMD, which improves the QPS for GIST by another 12%.

I also tried replacing tr_mul (the vector projection) with SIMD. It turns out that nalgebra uses BLAS there, so the performance remained unchanged.

Faer: another algebra package

While figuring out how to solve the f32::clone() problem, I discovered faer, another linear algebra crate for Rust. It is heavily optimized with SIMD and gives better performance when iterating over rows and columns. Its QR decomposition is also much faster than nalgebra's. Commit 0411821 speeds up the training phase.

Moreover, now (after commit 0d969bd) I can use these vectors as plain slices, without the ColRef or RowRef wrappers.

I must admit that if I had used faer from the very beginning, I would have avoided a lot of these problems. In any case, the experience was very instructive.

Binary dot product

I thought popcnt had already solved the binary dot product problem, but the flame graph shows that count_ones() takes only 7% of binary_dot_product. Although AVX512 has the vpopcntq instruction, I would rather use the AVX2 emulation since it is more widely available.

Here is a good reference on implementing popcount with AVX2. Commit edabd4a re-implements this in Rust, which improves the QPS for GIST by 11%. This trick only works when the vector has more than 256 dimensions, that is, 256 bits in the binary representation.
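
For reference, a sketch of that approach (the classic 4-bit lookup-table popcount, not the exact code from commit edabd4a), assuming the binary vectors are a multiple of 4 u64 words, i.e. 256 bits:

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn binary_dot_product_avx2(x: &[u64], y: &[u64]) -> u32 {
    use std::arch::x86_64::*;
    // popcount of every possible nibble value 0..=15
    let lut = _mm256_setr_epi8(
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
    );
    let low_mask = _mm256_set1_epi8(0x0f);
    let mut acc = _mm256_setzero_si256();
    for (cx, cy) in x.chunks_exact(4).zip(y.chunks_exact(4)) {
        // AND the two binary vectors, 256 bits at a time
        let v = _mm256_and_si256(
            _mm256_loadu_si256(cx.as_ptr() as *const __m256i),
            _mm256_loadu_si256(cy.as_ptr() as *const __m256i),
        );
        // popcount each byte via two nibble table lookups
        let lo = _mm256_and_si256(v, low_mask);
        let hi = _mm256_and_si256(_mm256_srli_epi16::<4>(v), low_mask);
        let counts = _mm256_add_epi8(
            _mm256_shuffle_epi8(lut, lo),
            _mm256_shuffle_epi8(lut, hi),
        );
        // sum the byte counts of each 64-bit lane into the accumulator
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(counts, _mm256_setzero_si256()));
    }
    let mut buf = [0u64; 4];
    _mm256_storeu_si256(buf.as_mut_ptr() as *mut __m256i, acc);
    buf.iter().sum::<u64>() as u32
}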

Inline attribute

Use the #[inline] attribute with caution. Adding it to all the SIMD functions improves the QPS for GIST by 5%.

I/O

Let's start by giving a little context here.

The current implementation is based on the IVF algorithm, which uses k-means to cluster the vectors; the centroids are kept in memory. The query vector is compared only with the clusters that have a smaller l2_squared_distance(query, centroid).

There is a parameter called n_probe that determines how many nearby clusters will be probed. A larger n_probe increases recall but decreases QPS.

In RaBitQ, the binary dot product is used to estimate the approximate distance. If this value is below the current threshold, the algorithm re-ranks the candidate using the original squared L2 distance and updates the threshold accordingly.
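
A simplified sketch of that logic (the names are illustrative, and the bookkeeping in the real code is more involved):

fn l2_squared(x: &[f32], y: &[f32]) -> f32 {
    x.iter().zip(y).map(|(a, b)| (a - b) * (a - b)).sum()
}

// Returns the exact distance only for candidates whose estimated distance
// beats the current threshold; otherwise the candidate is pruned.
fn maybe_rerank(estimated: f32, threshold: &mut f32, query: &[f32], candidate: &[f32]) -> Option<f32> {
    if estimated >= *threshold {
        return None; // pruned by the approximate distance
    }
    let exact = l2_squared(query, candidate);
    if exact < *threshold {
        *threshold = exact; // tighten the threshold for later candidates
    }
    Some(exact)
}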

Previously I used slice::select_nth_unstable, which only selects the n nearest clusters but does not sort them. Probing clusters located far from the query first increases the probability of re-ranking, which in turn increases the number of squared L2 distance computations. Sorting the selected n clusters improved the QPS for GIST by 4%.
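
A minimal sketch of the difference (names are illustrative): first select the n_probe closest centroids, then sort that prefix so the closest clusters are probed first.

fn select_probe_clusters(mut dists: Vec<(f32, usize)>, n_probe: usize) -> Vec<(f32, usize)> {
    let n = n_probe.min(dists.len());
    if n == 0 {
        return Vec::new();
    }
    // partition: the n smallest distances end up in the front, unordered
    dists.select_nth_unstable_by(n - 1, |a, b| a.0.total_cmp(&b.0));
    dists.truncate(n);
    // sort the selected clusters so the nearest one is probed first
    dists.sort_unstable_by(|a, b| a.0.total_cmp(&b.0));
    dists
}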

Another trick is to sort the vectors within each cluster by their distance to the centroid. Commit ea13ebc improved the QPS for GIST by another 4%.

There are several pieces of metadata that help estimate the approximate distance for each vector:

  • factor_ip: f32

  • factor_ppc: f32

  • error: f32

  • x_c_distance_square: f32

Previously I used 4 separate Vec to store them, which is not very I/O friendly, since the estimate for each vector requires indexing vector[i] in all four. When I combined them into one struct in commit bb440e3, the QPS for GIST improved by 2.5%. This works well because it's 4 x f32, so I can use the C representation directly:

#[derive(Debug, Clone, Copy, Default, Serialize, Deserialize)]
#[repr(C)]
struct Factor {
    factor_ip: f32,
    factor_ppc: f32,
    error_bound: f32,
    center_distance_square: f32,
}

Unfortunately, faer does not support u64 vectors, so the binary representation of the vectors had to be stored in a Vec<Vec<u64>>. When I flattened it into a single Vec<u64> in commit 48236b2, the QPS for GIST improved by 2%.
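
A sketch of the flattened layout (illustrative names): all binary codes live in one contiguous buffer and each vector's code is addressed by offset.

struct BinaryCodes {
    num_blocks: usize, // number of u64 words per vector
    data: Vec<u64>,    // length = num_vectors * num_blocks
}

impl BinaryCodes {
    fn code(&self, i: usize) -> &[u64] {
        &self.data[i * self.num_blocks..(i + 1) * self.num_blocks]
    }
}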

Const generics

The C++ version uses templates to generate code for different dimensions. Rust has const generics for this. I haven't tried it, because recompiling the code for each new set of dimensions is only applicable in specific cases, for example, inside a company that uses a small fixed set of dimensions. For a public library, it's better to offer a single solution so users don't have to recompile the code themselves.
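
For the record, a hypothetical sketch of what such a const-generic distance function could look like; this is not code from either implementation:

// DIM is fixed at compile time, so the compiler can fully unroll the loop
// and pick the widest SIMD instructions for that exact dimension.
fn l2_squared_const<const DIM: usize>(x: &[f32; DIM], y: &[f32; DIM]) -> f32 {
    let mut sum = 0.0f32;
    for i in 0..DIM {
        let d = x[i] - y[i];
        sum += d * d;
    }
    sum
}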

Other tools

There is a collection of bounds-check recipes that illustrates with several examples how to get rid of bounds checking in safe Rust.
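
A typical recipe of this kind (a minimal sketch, not taken verbatim from that collection): asserting the lengths once before the loop lets the compiler elide the per-iteration bounds checks.

fn add_assign(dst: &mut [f32], src: &[f32]) {
    // a single up-front check instead of two checks on every iteration
    assert_eq!(dst.len(), src.len());
    for i in 0..dst.len() {
        dst[i] += src[i];
    }
}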

I tried PGO and BOLT, but did not achieve any improvement.

Switching to jemalloc or mimalloc also didn't help performance at all.

Conclusion

  • SIMD is an excellent tool if used correctly

  • I/O is also important, especially for large datasets

On the GIST dataset the current version has the same performance as the C++ version. Where I rely more on SIMD, the C++ version uses const generics.
