How CI tests were created and optimized

Modern development teams test every code change before merging. This isn’t just common practice: along with code reviews, it’s the default standard in almost every company’s codebase. We call it CI (continuous integration) testing. As a result, the average organization runs hundreds of test suites per day.

Continuous integration testing has not always been with us, though, unlike testing in general. In my observation, CI is the result of testing becoming steadily faster. Let's look at how this happened and how testing will continue to speed up.

The slowest way to test code is reading

In the 1980s, software testing was slow. Most testing focused on searching for possible errors in code by hand. Michael Fagan popularized “Fagan inspections”, in which teams of developers would review code printouts looking for errors. The process was time-consuming, manual, and more akin to intensive code review than what we would think of today as software testing.

In the 1990s, unit testing of software became more common. But for a while, unit tests were written mostly by dedicated software testers using a company's own tools and methods. Some believed that source code authors had blind spots when testing their own code; frankly, for similar reasons, we still don't fully trust developers to test changes to their own code.

Back then, tests weren't run very often, for two reasons: first, they weren't always written by the software's authors themselves; second, tests could run slowly. Computers were slower back then, and depending on the complexity of the tests, a single run of a test suite could take hours or even a day. If the tests were written by someone other than the developer, and the suite wasn't run until the next evening, it could be days before the developer learned why their change was breaking the build.

Periodic self-testing

Published in 1999, Kent Beck's book “Extreme Programming Explained” helped change the culture of software development: extreme programming (XP) encouraged developers to write small, isolated tests for each new piece of code they contributed.

According to XP proponents, programmers could learn to be effective testers, at least at the level of individual sections of code, whereas a separate group of testers would hopelessly slow down the feedback loop that tests provided. xUnit played an important role here: it was designed specifically to minimize the time lost by programmers writing tests.

Code authors began writing their own tests, which meant new code was tested earlier than the integration phase. Faster testing meant that developers got feedback faster. But self-testing was voluntary: you had to rely on authors to conscientiously run local tests before merging. Moreover, a passing test run reflected the author's local machine, not some reference server. Codebases could still break the next time someone else built them and ran the test suite.

Launching an automated server to test changes

While Google started automating its test builds in 2003, it took the software industry as a whole a little longer to do the same. But automation was desperately needed.

Software systems were becoming larger and more complex. To make matters worse, new versions were being delivered to users frequently, sometimes several times a day. This was a far cry from the world of “boxed software” updated only once or twice a year.

The ability of humans to manually test every system behavior has not kept pace with the growth in features and platforms in most software.

— Software Engineering at Google

Sun Microsystems developer Kohsuke Kawaguchi played a key role in ushering in a new era of testing. In 2004, he created Hudson, later renamed Jenkins after a conflict with Oracle. At his day job, Kohsuke was “tired of incurring the wrath of his team every time his code broke the build.” He could have manually run tests before each code check-in, but instead he did what a typical programmer does: he wrote an automated program. Hudson acted as a long-lived test server that could automatically test each code change as it was integrated into the codebase.

Kohsuke open-sourced Hudson, and it became wildly popular. The first generation of automated continuous integration testing had begun, and for the first time it became common to test every change to code as it was written. Similar tools like Bamboo and TeamCity quickly took off, but the open-source Hudson remained the most popular.

Pay someone else to automatically test changes

By the late 2000s, code hosting had moved to the cloud. While teams used to run their own Subversion servers to host and integrate code changes, more and more people now began to place their code on GitHub. Continuous integration tests followed this trend and also moved to the cloud, with products like CircleCI and Travis CI appearing in 2011. Now even smaller companies could outsource the maintenance of their CI job runners. Larger, older companies generally stayed with Jenkins, both because they could afford to keep maintaining CI servers themselves and because Jenkins offered more fine-grained control.

In the mid-2010s, we witnessed two evolutions of cloud CI systems.

  1. Zero-maintenance CI systems merged with code hosting services. GitLab was the first to offer this all-in-one solution: it let users run their CI tests on the same platform where they reviewed and merged changes. Microsoft acquired GitHub in 2018 and followed with the release of GitHub Actions, backed by Microsoft's Azure DevOps product. With that, the two most popular code hosting platforms both offered integrated CI test execution.

  2. Large organizations migrated from Jenkins to more modern self-hosted options. Buildkite, a popular modern option launched in 2013, allowed companies to get the benefits of web-based dashboards and orchestration while still hosting their code and running tests on their own machines. Later, GitHub and GitLab offered their own self-hosted CI job runners, and some companies with a lot of manual testing decided to run their tests in CodeDeploy pipelines on AWS or in Azure DevOps.

The software testing process can be viewed in terms of speed and cost:

  • Days and weeks. In the 1980s, code changes were slowly reviewed by hand to find bugs. Test suites might be run overnight or just before release.

  • Days and nights. In the 1990s, automated tests were written more and more often, either by specialized testers or by the code authors themselves. Code changes began to be tested before, rather than after, merging into the rest of the codebase.

  • Hours and minutes. In the early 2000s, the first automated integration testing servers emerged and became popular, leading to testing of every change as it was merged into the code base.

  • Minutes. Around 2011, zero-maintenance CI testing services became available, making it cost-effective even for smaller teams to test every change.

Reducing the time it takes to test one change

Best practice aims to keep CI time at about 10–15 minutes so that developers can maintain short iteration loops. But this is becoming increasingly difficult as codebases and test suites grow larger every year.

Developers won't wait for slow tests. The slower a test runs, the less often it will be run, and the longer it takes to re-run the test after a failure.

— Software Engineering at Google

There are only three ways to speed up anything in software: vertical scaling, parallelism, and caching. In the case of CI, all three are used, with caching and parallelism receiving more and more attention in recent years.

First, for decades Moore's Law ensured that ever more powerful processors could run test suites faster, albeit at a higher cost. With on-demand cloud services, developers can flip a setting in AWS or GitHub Actions to pay for a more powerful server and hope their test suite runs faster.

Second, CI providers have gradually gotten better at parallelization. Buildkite, GitHub Actions, and other providers let users define graphs of test stage dependencies, which allows different machines to share context and run tests in parallel for the same code change. Cloud computing lets organizations allocate a practically unlimited number of parallel hosts to run tests without fear of running out of resources. Finally, sophisticated build tools such as Bazel and Buck let large codebases compute build graphs and parallelize building and testing based on the dependency graph in the code itself.
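To make the stage-graph idea concrete, here is a minimal Python sketch of a scheduler that runs every stage whose dependencies are satisfied in parallel, wave by wave. The stage names and the graph are invented for illustration; real providers dispatch each stage to a separate worker machine rather than a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

# Invented example graph: each stage lists the stages it depends on.
STAGES = {
    "lint": [],
    "build": [],
    "unit-tests": ["build"],
    "integration-tests": ["build"],
    "deploy-preview": ["lint", "unit-tests", "integration-tests"],
}

def run_stage(name):
    # A real CI system would run this stage on a worker machine.
    return f"{name}: passed"

def run_pipeline(stages):
    done, results = set(), []
    with ThreadPoolExecutor() as pool:
        while len(done) < len(stages):
            # Every stage whose dependencies have all finished can run now,
            # concurrently with the other ready stages.
            ready = [s for s, deps in stages.items()
                     if s not in done and all(d in done for d in deps)]
            results.extend(pool.map(run_stage, ready))
            done.update(ready)
    return results

print(run_pipeline(STAGES))
```

Here `lint` and `build` run in the first wave, the two test stages run together in the second, and `deploy-preview` only starts once everything it depends on has passed.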

Third, caching systems in CI have evolved to minimize repetitive work. CI job runners typically support remote caching of setup and build steps, allowing tests to skip preliminary setup work if parts of the codebase have not changed.
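The caching idea can be sketched in a few lines of Python: derive a cache key from a step's name and the contents of its inputs, and skip the work when that key has been seen before. The step name and input contents here are invented for illustration, and a real remote cache would live on a shared server rather than in a dictionary:

```python
import hashlib

cache = {}  # maps cache key -> stored result of an expensive step

def cache_key(step_name, inputs):
    """Hash the step name together with the contents of every input."""
    h = hashlib.sha256(step_name.encode())
    for content in inputs:            # raw bytes stand in for file contents
        h.update(content)
    return h.hexdigest()

def run_step(step_name, inputs, action):
    key = cache_key(step_name, inputs)
    if key in cache:
        return cache[key], True       # cache hit: skip the work entirely
    result = action()                 # cache miss: do the expensive work
    cache[key] = result
    return result, False

# First run does the work; an identical second run is a cache hit.
deps = [b"requirements.txt contents"]
out1, hit1 = run_step("install-deps", deps, lambda: "installed")
out2, hit2 = run_step("install-deps", deps, lambda: "installed")
print(hit1, hit2)
```

Because the key is derived from the inputs' contents, any change to those inputs produces a new key and forces the step to run again, which is exactly the invalidation behavior a CI cache needs.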

Speeding up the testing of all changes

Development teams eventually hit a theoretical limit on how quickly they can test a single code change, as long as the requirement is “run all the tests and the full build for every code change.”

Yet development patterns continue to be optimized for speed.

Question: What could be faster than running CI for every code change on fast computers, with parallel tests and heavy caching?

Answer: Don't run some tests for this change at all.

Much like the pre-CI days, some organizations with high development velocity use batching and dependencies between pull requests to save compute resources and give developers feedback faster. We see this happening in two places. The first is merge queues. Internal merge queues at large companies like Uber provide batching and test skipping. The idea: if you establish an order in which changes land in your codebase, you no longer have to test every change in the queue as rigorously as before, although this has some downsides.

Chromium uses a variant of this approach, called the Commit Queue, to keep the master branch always green. Changes that pass stage one are selected every few hours for stage two, a large test suite that takes about four hours. If the build breaks at this stage, the entire batch of changes is rejected, and build sheriffs and their assistants step in to attribute failures to specific faulty changes. Note that this approach lands batches of changes, not individual commits.
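The batching-plus-bisection idea described above can be sketched as follows: test a whole batch of changes in one CI run, and only if that run fails, binary-search the batch to find the culprit. The change names and the failure predicate are invented for illustration; this is the general technique, not Chromium's actual Commit Queue. The sketch assumes the batch contains at most one faulty change:

```python
def batch_passes(changes, is_bad):
    """One CI run covering a whole batch of changes (invented predicate)."""
    return not any(is_bad(c) for c in changes)

def find_culprit(changes, is_bad):
    """Binary-search a failing batch for the single faulty change."""
    lo, hi = 0, len(changes)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # If the first half passes, the culprit is in the second half.
        if batch_passes(changes[lo:mid], is_bad):
            lo = mid
        else:
            hi = mid
    return changes[lo]

changes = ["c1", "c2", "c3-bad", "c4", "c5"]
is_bad = lambda c: "bad" in c

if batch_passes(changes, is_bad):
    print("merge whole batch")   # one CI run instead of one per change
else:
    print("culprit:", find_culprit(changes, is_bad))
```

In the happy path, a batch of N changes costs one CI run instead of N; in the failing path, bisection finds the culprit in roughly log2(N) extra runs.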

The second place where CI can be skipped is stacked code changes, popularized by Facebook. When a developer sets up a series of small pull requests, they are implicitly describing the desired order in which to merge those changes. As with a merge queue, CI can batch these changes and git-bisect them if any failures are detected. Failures at the bottom of the stack can notify developers before changes further up the stack are tested.

Whereas dependency graphs in testing previously offered early failure detection and saved compute when testing a single change, they now provide the same benefits across multiple pull requests. The savings are significant: even though cloud resources offer near-infinite horizontal scalability, testing can account for 10–20% of a company's overall cloud computing costs.

Testing is faster than running code

The fastest form of CI testing I know of today involves batch testing many changes at once, and using as much parallelism and caching as possible.

Before the advent of batch integration testing, some developers still reviewed code changes manually, sometimes looking at printouts at their desks to check for errors. We abandoned this method of review because machines could now execute code faster than humans could read and comprehend it.

This ratio may change with the advent of large language models. I suspect we are on the verge of fast, cheap code analysis by AI. I said earlier that there are only three ways to speed up computation: faster chips, parallelism, and caching. Technically, there is a fourth option if you are willing to live with fuzzy results: probabilistically predicting the outcome. Fun fact: CPUs already do this today with branch prediction.

It may not replace unit tests and human code reviews, but AI can find common errors in proposed code changes in ten seconds or less: linter issues, patterns inconsistent with the rest of the codebase, spelling errors, and other mistakes. Existing CI orchestrators can trigger AI-powered reviews and return results faster than other tests. Alternatively, AI-powered reviews could become so fast and cheap that they run passively in the background of developers’ editors, like Copilot. Unlike traditional CI tests, AI reviews don’t require the entire codebase to identify issues.

Will AI-style tests become popular? It's unclear, but companies are trying to move in that direction. If AI tests ever get good enough to be widely adopted, it will become another technological example of “everything old is new again.”


