How we delete our employees' accounts every day

I strongly believe in the need for automated tests and take a very disciplined approach to writing them. It is incredibly difficult to maintain functional correctness in software and even more difficult to avoid regressions. As author Michael Feathers put it, “legacy code is simply code without tests.”

For some things (server endpoints, database schemas, UI library components) testing is fairly straightforward. Others are harder to test, such as endpoints that call third-party APIs, React web pages with complex state, or asynchronous processes that require detailed database entries. When I worked at Airbnb, I had difficulty testing password changes via email because sending email is usually outsourced.

Still, this kind of functionality deserves to be tested for two reasons. First, avoiding regressions matters here too, and complexity makes them even more likely. Second, the obligation to test complex functionality often forces developers to design it so that testing is possible. Writing tests early can lead to more concise interfaces and less coupling, which improves the quality of the codebase in the long run.

To test or not to test?

Nothing is perfect: sometimes there is time to build a feature but not to write automated tests for it. Why? It is like the P versus NP problem in reverse: many kinds of functionality are easier to create than to verify. Imagine a React to-do list app in which you delete an item by swiping it off the screen. You can build this in half an hour and then spend hours or even days writing automated UI tests that check whether the swipe works. This imbalance, coupled with the time pressure the business imposes, leads teams to ship functionality and skip the costly tests.

Is this bad? Pragmatically speaking, not always: there are cases where taking on debt and shipping functionality without tests is a justified decision. An untested feature may bring significant benefits while testing it would take a lot of effort, and your resources are limited. Perhaps your team is short-staffed, or you are writing a personal project in the evenings. If you force yourself to write automated unit tests for your to-do list app, it may never reach its first ten early adopters. However, if you still don't test swipes once you have a million users, you are frankly asking for trouble.

Putting off automated testing can easily lead to a downward spiral, but building a product involves the art of judging strategically when to borrow. Such “loans” let the team validate ideas quickly, and the value that is discovered can then be used to pay off the technical debt (with interest). The same is true for startups raising venture capital and for teams working on an MVP. If you spend too much time on expensive tests from the beginning, you may not have enough time left to deliver results, learn, and change course.

[Image text]

How to ensure your code is never reused: “It took some extra effort, but now we can use it in all future projects.”
How to ensure your code lives forever: “Let's not overthink it; if anyone is still using this code that far in the future, we'll have bigger problems.”

Let's be creative

Keep in mind that having no automated tests does not mean giving up on testing altogether. By default, a lack of tests means you are quietly forcing users to find bugs for you. Monitoring production traffic and having a well-rehearsed rapid response process can act as a weak substitute for automated testing, but only if you have quick rollbacks and feature flags at your disposal and keep your finger on the pulse.

However, it is better to expose only part of your traffic rather than all of it: to catch a regression, it is usually enough for a few users to hit the bug, not all of them. Canary releases and beta testers serve well here. The problem remains that real users still experience the regressions, so careful monitoring is needed to notice when they are having a hard time.
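As a rough illustration of routing only part of the traffic to a new code path, here is a minimal sketch of deterministic percentage bucketing. It is not taken from any particular product: the function name `in_canary` and the parameters `user_id`, `feature`, and `percent` are hypothetical, and real rollout systems usually add targeting rules and kill switches on top of this.

```python
import hashlib


def in_canary(user_id: str, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the canary slice for a feature.

    Hypothetical helper: hashing the feature name together with the user ID
    gives every user a stable bucket, so the same person sees the same
    variant on every request, and different features get different slices.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket in the range 0..9999.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # percent is expressed in the range 0..100, so 5 means 5% of users.
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    exposed = [u for u in users if in_canary(u, "swipe-to-delete", 5)]
    print(f"{len(exposed)} of {len(users)} users are in the canary slice")
```

Because the bucket depends only on the feature name and the user ID, raising `percent` keeps everyone who was already in the slice and only adds new users to it.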

Dogfooding is even better than beta testers. Let the functionality you do not want to write automated tests for be tested by real users, except that those users are you. You won't abandon your own product in a rage, and your own eyes work great as a dynamic alerting tool (just remember to close them for eight hours every night).

The struggle for onboarding

So, we have a whole range of ways to test functionality – automated tests, canary releases, beta groups, dogfooding. The Graphite development team uses all of these techniques, and yet we still have blind spots in functionality.

One of the hardest things for us to test has been onboarding in the product. The code involved is complicated: it includes a GitHub OAuth flow, asynchronous loading of metadata from the repository, queries to our own databases, and custom UI elements that are not used anywhere else in the application. But despite all the difficulties, onboarding is critical, and it has to be tested somehow.

Classic synthetic tests turn out to be less reliable than usual here because of GitHub's bot detection at login. Testing with canary traffic doesn't help much, since users who drop out during onboarding rarely report a problem, and in the logs they often look like ordinary drop-off rather than someone who is stuck. Beta groups rarely catch anything because they only go through onboarding once; the same can be said of dogfooding in its traditional form.

Let's start the roulette

The solution we arrived at in Graphite was a roulette script that, every day at nine in the morning, randomly selects one of our developers' Graphite accounts for deletion. We don't just force the person to go through onboarding again; we demolish the entire account, with its tokens, configured filters, uploaded GIFs, and so on.
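The idea can be sketched in a few lines. This is only an illustration under stated assumptions, not Graphite's actual script: `run_roulette`, `engineer_accounts`, and the `delete_account` callback are hypothetical names, the account store is a plain in-memory dictionary, and the daily schedule is assumed to come from an external scheduler such as cron.

```python
import random
from typing import Callable, Optional, Sequence


def run_roulette(
    engineer_accounts: Sequence[str],
    delete_account: Callable[[str], None],
    rng: Optional[random.Random] = None,
) -> str:
    """Pick one employee account uniformly at random and delete it.

    delete_account is a hypothetical callback standing in for whatever
    removes the account together with its tokens, filters, and uploads.
    """
    if not engineer_accounts:
        raise ValueError("no employee accounts to choose from")
    rng = rng or random.Random()
    chosen = rng.choice(list(engineer_accounts))
    delete_account(chosen)
    return chosen


if __name__ == "__main__":
    # Hypothetical in-memory account store used only for this demo.
    accounts = {
        "alice": {"filters": 3},
        "bob": {"filters": 1},
        "carol": {"filters": 7},
    }
    # Assumed scheduling: a cron entry such as "0 9 * * *" would run this
    # script once a day at 09:00.
    chosen = run_roulette(list(accounts), lambda name: accounts.pop(name))
    print(f"Deleted the account of {chosen} for onboarding dogfooding")
```

Passing the deletion logic in as a callback keeps the random choice separate from the destructive step, so the selection can be tested without touching real accounts.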

Does this annoy people on the team? Of course. After all, they come to work to write code for new functionality, not to find themselves kicked out of the system and forced to recreate an account from scratch. Note that we are only talking about their account in the Graphite product; their access to GitHub and other company accounts remains intact. We introduced this technique cautiously at first, but the benefits were immediate.

[Image text]

The dbradf account has been deleted for onboarding dogfooding.
Bad experience: I went to the environment and was greeted by a “sharpening pencils” page. It feels like the backend is down.
Unknown error when selecting default repositories.
The “Continue” button is disabled and gives no hint as to why. I'm stuck and can't continue onboarding.
I got past it by refreshing the page and selecting other repositories.
Now I've been sitting at the “Graphite is updating – this may take a few minutes” screen for a surprisingly long time. It's been almost three minutes and the wheel is still spinning.
I refreshed the page and it is still spinning.

Like most products, Graphite strives to make onboarding fast, painless, and bug-free. For us, the best way to guarantee this is to make sure that every day someone from the company suffers through the procedure themselves. Each person on our product team has their account deleted about once a month on average. But as we found out, having one employee in the line of fire every day is enough to find bugs and motivate us to fix them.
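The “about once a month” figure follows from simple arithmetic if one account is picked uniformly at random each day. The sketch below works through it; the team size of 30 is purely an illustrative assumption (the article does not state the real number), and the function names are hypothetical.

```python
def expected_days_between_deletions(team_size: int) -> float:
    """Mean wait, in days, until a given person is picked.

    With one uniform pick per day, each person is chosen with probability
    1 / team_size, so the wait follows a geometric distribution whose mean
    is team_size days.
    """
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return float(team_size)


def chance_hit_within(team_size: int, days: int) -> float:
    """Probability that a given person is picked at least once in `days` days."""
    if team_size < 1 or days < 0:
        raise ValueError("team_size must be >= 1 and days must be >= 0")
    return 1.0 - (1.0 - 1.0 / team_size) ** days


if __name__ == "__main__":
    team_size = 30  # illustrative assumption, not a figure from the article
    print(expected_days_between_deletions(team_size))
    print(round(chance_hit_within(team_size, 30), 3))
```

Under that assumption a given person waits 30 days on average, yet has only about a 64% chance of being picked in any particular 30-day window, because the picks are independent and some people get hit twice while others are skipped.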

Deleting employee accounts let us cover, through dogfooding, one of the most important and hardest-to-test areas of the product. We caught dozens of bugs and cultivated empathy for the user in a place that is usually a blind spot. I strongly encourage other product teams to consider automatically deleting employee accounts to reap the same benefits.

Current limitations

Does this scheme work perfectly? No. Employees whose accounts are deleted recreate them inside the existing Graphite organization. In other words, they still miss some of what real users encounter when they first install Graphite in a new organization, such as the lengthy initial synchronization. In the future, I would like to try a scheme where the entire company is removed from Graphite once a month to improve dogfooding, although that would get in the way of morning code changes much more seriously, which worries me.

Can any product afford to delete accounts as often as we do? I believe not – some accounts accumulate not only configurations, but also a lot of other valuable user content. Instagram or Google Docs, say, are unlikely to implement such a radical approach without consequences. But many services, especially those where user-generated data is not affected by the deletion of personal accounts, could use it. Products like Datadog, Vercel, Hex or Superhuman could easily delete employee accounts once a month. Of course, people would have to re-configure their dashboards and filters, but that's the point.

For the future

Will we continue to delete employee accounts in Graphite? Most likely, yes. However, dogfooding does not replace automated testing: we keep working on proper unit and end-to-end tests and invest heavily in them. But dogfooding differs from automated testing just enough for the two to reinforce each other. Dogfooding catches unknown unknowns in a way that automated tests cannot. It builds developers' empathy for users, as well as opinions about the product that can feed into future updates. If recreating your own account is not easy, what is it like for a new user starting from scratch?