Refactoring in depth


Refactoring is "a controlled technique for improving the design of an existing code base" [Fowler]. A great deal has been written about code smells and micro-scale refactoring techniques (there are, for example, books and entire websites devoted to them). Here I want to take a close look at how exactly and in what order these techniques should be applied. In particular, I will argue that refactoring is best done in depth: start at the boundary with the external API, and then work your way deeper into the code, moving on to classes, methods, algorithms, types, tests, and variable names.

The code examples in this post are written in Rust, but the technique of refactoring in depth applies to other programming languages as well. I chose Rust because the stronger the type system, the more convenient refactoring becomes.

End-to-end example

Inspired by an example that Stefan demonstrated at his "Refactoring Rust" workshop at the EuroRust 2023 conference, I will offer the following one. Let's say we have a very convoluted function parse_magic_key that loops through the files in a directory, finds all the JSON files, parses them looking for a specific key, and returns all the corresponding values. Maybe this function was once written for a prototype, or thrown together during a hackathon, or maybe it simply grew on its own to hundreds of lines. Below is a shortened version of this mind-boggling function; it might look something like this:

use std::fs;
use std::fs::File;
use std::io::{BufRead, BufReader};

/// Parses and returns the `magic_key` fields from all JSON files in the given directory.
pub fn parse_magic_key(path: &str) -> Vec<String> {
    let mut results = Vec::new();
    for path in fs::read_dir(path).unwrap() {
        let path = path.unwrap().path();
        if path.extension().unwrap() == "json" {
            // Below are hundreds of lines of overcomplicated/ugly/brittle parsing logic,
            // condensed here to just a few lines.
            let file = BufReader::new(File::open(path).unwrap());
            for line in file.lines() {
                let line = line.unwrap();
                let parts: Vec<&str> = line.split(":").collect();
                if parts.len() == 2 && parts.get(0).unwrap().contains("magic_key") {
                    results.push((*parts.get(1).unwrap()).trim().to_string());
                }
            }
        }
    }

    results
}

#[cfg(test)]
mod tests {
    use std::fs;
    use tempdir::TempDir;
    use crate::parser::parse_magic_key;

    #[test]
    fn can_parse_dir() {
        let tmp_dir = TempDir::new("").unwrap();
        let dir = tmp_dir.as_ref();
        fs::write(dir.join("positive.json"), r#"
            {
              "magic_key": "1"
            }"#).unwrap();
        fs::write(dir.join("negative.json"), r#"
            {
              "foo": "2"
            }"#).unwrap();
        fs::write(dir.join("negative.not-json"), r#"
            {
              "magic_key": "3"
            }"#).unwrap();

        assert_eq!(parse_magic_key(dir.to_str().unwrap()), vec![r#""1""#]);
    }
}

Of course, this code has more disadvantages than advantages. For example:

  • Naive, hacky, and fragile parsing logic
  • No proper error handling
  • It is hard to tell what the function's expected behavior is
  • Convoluted directory traversal and parsing logic
  • Hard-coded assumptions about the JSON format, making it difficult to extend
  • The function is difficult to extend and evolve
  • Inflexible and unclear API (e.g., what exactly does path: &str refer to?)
  • Space complexity of O(|result|)

Let's say we want to refactor this code. Naturally, the first question is: where do we start?

Should we tackle the smallest and most obvious details first, such as finding better variable and method names or splitting the method into several smaller ones? Should we begin by fixing the parsing logic, for example by using serde_json instead of this flawed hand-rolled parser? Or should we start with the API after all, that is, with the types and method signatures?

Taking a closer look, we can distinguish two directions we can take when refactoring: inward and outward.

Inward or outward?

When refactoring outward, we start with the implementation details. We fix the parsing logic (and add tests) and look for better variable names. Next, we might add support for formats other than JSON. Finally, once we are satisfied with the internals, we fix the API.

Wait, why do we need to fix the API at all? And how do we know how to fix it? Well, I would argue that any sensible refactoring effort should start with a hypothesis and explicit design goals: we are refactoring because ____. By the way, "because the code base is ugly" is almost certainly not a useful goal. Rather, we refactor a code base because its structure or implementation details no longer fit the requirements and the ways in which it is used. For example:

  • We need a more flexible API to reuse the module in another context or application.
    In the code at hand: since we pass a path to a directory, we cannot selectively parse only some of the JSON files; it is all or nothing.
  • The code base is designed in a way that inherently limits performance, and that is no longer acceptable to us.
    In the code at hand: since the entire result must fit in main memory, we cannot parse large amounts of data.

Often, such changes in requirements translate directly into changes to the API of a module or function. This is where the idea of refactoring in depth comes from: first we change the API, and then we work our way inward through smaller and smaller details until the implementation satisfies us. This idea has been described before, albeit in more "corporate" language and in an enterprise context, under the name of the strangler pattern.

Refactoring in depth

Step 0: Understand the old API and implementation

Unless you are unconstrained and starting from scratch, you will have to understand all the documented, and often even the unintended, behaviors of the old API and its entire implementation. Then, keeping Chesterton's fence in mind, list the aspects of the old API that you can do without. By identifying which constraints or call patterns are no longer used, and therefore no longer needed, you can often simplify an existing system considerably.
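
One lightweight way to capture those behaviors is a characterization test: before changing anything, pin the current behavior (including its odd corners) in a test so that later steps cannot alter it unnoticed. Below is a minimal sketch, assuming it sits in the existing test module and reuses the same tempdir-based setup; which behaviors are worth pinning is, of course, project-specific:

    #[test]
    fn old_api_ignores_files_without_json_extension() {
        // Characterization test: pins down behavior the old API already has,
        // intended or not, so later refactoring steps cannot change it silently.
        let tmp_dir = TempDir::new("").unwrap();
        let dir = tmp_dir.as_ref();
        fs::write(dir.join("data.not-json"), r#"{ "magic_key": "1" }"#).unwrap();

        // Today, files without a `.json` extension are silently skipped; the new
        // API has to either keep this behavior or drop it deliberately.
        assert!(parse_magic_key(dir.to_str().unwrap()).is_empty());
    }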

Step 1: Design a new API

Once we've analyzed the old API and figured out what needs to be kept, what can be thrown away, and what needs to be added or changed, it's time to write the new API. In our example, the new API might look something like this:

pub fn parse(files: impl IntoIterator<Item=PathBuf>)
    -> impl Iterator<Item=String> {
  todo!()
}

First, we changed the input parameter from a directory path to an iterator over the input files, so the caller can now control which files get parsed. Second, the function returns an iterator over the results, which makes it easy for the caller to consume them as a stream within bounded memory. Of course, this API shape won't suit every case, but for the purposes of this exercise let's assume it is what we need. (To keep the example simple, I decided not to address error handling here.)
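
For illustration, here is how a hypothetical caller could use the new API to parse only a hand-picked subset of files and consume the results lazily, something the old signature could not express. The directory scan, the filtering predicate, and the function name print_matching_keys are placeholders, not part of the proposed API:

use std::fs;
use std::io;
use std::path::PathBuf;

fn print_matching_keys(dir: &str) -> io::Result<()> {
    // The caller now decides which files are handed to the parser...
    let files: Vec<PathBuf> = fs::read_dir(dir)?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|p| p.extension().map_or(false, |e| e == "json"))
        .collect();

    // ...and consumes the results one at a time, as a stream.
    for value in parse(files) {
        println!("{value}");
    }
    Ok(())
}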

Step 2: Implement the new API using the old code

In the next step, we implement the new API on top of the old code, making as few changes to it as possible. This is important: at this stage we are not trying to write a perfect new implementation; rather, we want to reuse the existing one and adapt it to the new API. Here, for example, is what such an adaptation might look like:

/// Parses and returns the `magic_key` fields from all given JSON files.
/// TODO: error handling
pub fn parse(files: impl IntoIterator<Item=PathBuf>)
    -> impl Iterator<Item=String> {
    // TODO: parse the results incrementally instead of collecting them all into one vector
    let mut results = vec![];
    for path in files {
        if let Ok(file) = File::open(&path) {
            let extension = path.extension().map(|p| p.to_str()).flatten().expect("Failed to determine file extension");
            let file_results = match extension {
                "json" => parse_json(file),
                // TODO: YAML support
                // TODO: XML support
                _ => {
                    println!("Unknown extension: {extension}");
                    vec![]
                }
            };

            results.extend(file_results);
        }
    }

    results.into_iter()
}

fn parse_json(read: impl Read) -> Vec<String> {
    let reader = BufReader::new(read);
    let mut results = Vec::new();
    for line in reader.lines() {
        let line = line.unwrap();
        let parts: Vec<&str> = line.split(":").collect();
        // TODO: use a streaming JSON parser
        // TODO: make the parsing logic less brittle
        if parts.len() == 2 && parts.get(0).unwrap().contains("magic_key") {
            results.push((*parts.get(1).unwrap()).trim().to_string());
        }
    }

    results
}

/// Parses and returns the `magic_key` fields from all JSON files in the given directory.
pub fn parse_magic_key(path: &str) -> Vec<String> {
    let mut results = Vec::new();
    for path in fs::read_dir(path).unwrap() {
        let path = path.unwrap().path();
        if path.extension().unwrap() == "json" {
            let file = File::open(path).unwrap();
            results.append(&mut parse_json(file));
        }
    }

    results
}

Here's what we did:

  • Extracted the parsing code into a new private function parse_json
  • Rewrote the original method parse_magic_key in terms of the extracted parse_json
  • Implemented the new parse function using the extracted parse_json
  • Added TODOs wherever we were a bit lazy, marking things that will need to be refactored or optimized later

Note that the API, behavior, and tests of the original method parse_magic_key remain unchanged.

Step 3: Write tests for the new API

Next, you can write some tests for the new API, for example:

    #[test]
    fn can_parse_json() {
        assert_eq!(parse_json(r#"
            {
              "magic_key": "1"
            }"#.as_bytes()), vec![r#""1""#]);

        assert!(parse_json(r#"
            {
              "foo": "1"
            }"#.as_bytes()).is_empty());
    }

    #[test]
    fn can_parse_files() {
        let tmp_dir = TempDir::new("").unwrap();
        let dir = tmp_dir.as_ref();
        fs::write(dir.join("positive.json"), r#"
            {
              "magic_key": "1"
            }"#).unwrap();
        fs::write(dir.join("negative.json"), r#"
            {
              "foo": "2"
            }"#).unwrap();
        fs::write(dir.join("negative.not-json"), r#"
            {
              "magic_key": "3"
            }"#).unwrap();

        let files = vec![dir.join("positive.json"), dir.join("negative.json"), dir.join("negative.not-json")];
        let results: Vec<String> = parse(files).collect();
        assert_eq!(results, vec![r#""1""#]);
    }

Step 4: Fix the internals

Finally, we can work through all the TODOs we collected along the way: for example, make the parsing logic robust and add support for other file types. Note that the initial implementation from step 2 still has the same memory requirements as the original one. The key point, however, is that the new API is designed so that this extra efficiency can be added later.
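
To make that idea of delayed efficiency concrete, here is a rough sketch, not a final implementation, of how parse could become lazier: results are produced file by file via flat_map, reusing the parse_json helper from step 2, while each file's results are still buffered in a Vec (replacing that with a truly streaming parser remains on the TODO list):

use std::fs::File;
use std::path::PathBuf;

/// Lazier variant: yields results file by file instead of collecting
/// everything up front. Per-file results are still buffered in a Vec,
/// so a fully streaming JSON parser is the remaining TODO.
pub fn parse(files: impl IntoIterator<Item=PathBuf>)
    -> impl Iterator<Item=String> {
    files.into_iter().flat_map(|path| {
        match path.extension().and_then(|e| e.to_str()) {
            // Reuse the parse_json helper extracted in step 2.
            Some("json") => File::open(&path).map(parse_json).unwrap_or_default(),
            // TODO: YAML/XML support; report unknown extensions instead of skipping them
            _ => Vec::new(),
        }
    })
}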

Note that steps 1-2-3-4 can be done as independent commits or even pull requests, and most of the TODOs from step 4 can be tackled in parallel. They can be reviewed independently, the build stays green after each step, and every existing call to the old parsing method remains valid. This way of changing code could be called "adiabatic".

Once you're done with the refactoring, you can migrate all API consumers and eventually remove both the old API and its implementation.
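
During that migration window, Rust's built-in #[deprecated] attribute can point remaining callers at the new entry point. One possible sketch, in which the old function simply delegates to the new one (note that, unlike the original, this version skips files without an extension instead of panicking on them):

use std::fs;
use std::path::PathBuf;

/// Old entry point, kept only until all callers have migrated to `parse`.
#[deprecated(note = "use `parse` with an explicit list of files instead")]
pub fn parse_magic_key(path: &str) -> Vec<String> {
    // Delegate to the new implementation so only one code path is left to maintain.
    let files: Vec<PathBuf> = fs::read_dir(path)
        .unwrap()
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|p| p.extension().map_or(false, |e| e == "json"))
        .collect();
    parse(files).collect()
}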

But why?

We have looked at how to refactor in depth, but we haven't yet discussed why to do it this way. There are "internal" and "external" reasons why, in my experience, refactoring inward is usually more effective than refactoring outward.

On the "internal" side, the new API we designed in step 1 gives us a target for the rest of the refactoring. That is, we can reasonably consider the refactoring complete once we find a sensible implementation of that API. After all, the new API should capture all the existing plus the expected new usage patterns (e.g., incremental parsing) and constraints (e.g., a memory limit) of our library or application.

For example, since we are asked to produce an impl Iterator<Item=String> rather than a Vec<String>, we can aim for an implementation that uses a bounded amount of memory. This pushes us towards a streaming solution, such as an LL parser that reads its input from beginning to end. In addition, we still have the tests (and possibly benchmarks) of the "old" behavior, which we can use to guard against regressions.

The "external" aspect is even more interesting. If we write the new API first, we (partially) decouple the refactoring work from the productive work of the developers who want to use the new API. Once the new API lands on the mainline, your teammates can start experimenting with it. Ideally, they will make progress on their own, each working on their own part of the codebase (while using the new API). More likely, they will give you feedback on the API, because something will definitely go wrong. This is real agile, without an agile coach.

The opposite is also true: you have almost certainly run into the downside of the reverse approach: "The new API is still rough, because the refactoring started from the internal details of the algorithm and the data layer is being rebuilt first. We'll let you know when it's ready, give it another week or two [sure!]".

Conclusion

I hope I've convinced you to run your next refactoring project in depth: write the API first, and then do everything else. As with everything in software engineering, this isn't a panacea, but it's worth a try.

