The Ultimate Productivity Approach to Software Design
How and why Shopify moved from a monolithic architecture to a modular-monolithic one.
Shopify has one of the largest Ruby on Rails codebases, built over a decade by over a thousand developers. It includes a wide variety of features, such as invoicing merchants, managing third-party apps, updating product information, handling shipping, and more.
The system was originally built as a monolith, meaning all these different functionalities were built into a single code base with no boundaries between them. This architecture worked fine for many years, but eventually we reached a point where the disadvantages of the monolith outweighed the advantages. We had to make a choice about how to proceed.
Microservices have grown in popularity in recent years and have been hailed as a one-stop solution to all the problems that arise when using monoliths. However, our own collective experience told us that there is no universal solution, and microservices bring their own challenges. We decided to turn Shopify into a modular monolith, meaning that we would keep all the code in one codebase, but ensure that the boundaries between the different components were defined and respected.
Any software architecture has its own set of pros and cons. Depending on what stage of development the application is at, it will make sense to use different solutions for it. The transition from a monolith to a modular monolith was the next logical step for us.
Monolithic architecture
According to Wikipedia, a monolith is a software system in which functionally distinct aspects are intertwined rather than having architecturally separate components. In Shopify’s case, this meant that the code handling shipping calculations lived next to the code handling checkout, with little to stop them from calling each other. Over time, this led to code handling different business processes being overly coupled.
Advantages of monolithic systems
Monolithic architecture is the easiest to implement. If you don't use any architecture, you'll likely end up with a monolith. This is especially true for the Ruby on Rails framework, which lends itself to monoliths due to the global availability of all code at the application level. Monolithic architecture can take an application very far because it's easy to develop and allows teams to progress very quickly at the start, and therefore roll out their product to customers sooner.
Having your entire codebase in one place and deploying your application in one place has a lot of benefits. You only have to maintain one repository, and you can easily search and find all the functionality in one folder. You also only have to maintain one testing and deployment pipeline, which, depending on the complexity of your application, can save you a lot of overhead. Creating, configuring, and maintaining these pipelines can be expensive because you need to be intentional about keeping them consistent. Since all the code is deployed to one application, all the data can be stored in one common database. If you need to retrieve a piece of data, it's a simple database query.
Because monoliths are deployed in a single location, there is only one set of infrastructure to manage. Most Ruby applications come with a database, a web server, the ability to run background jobs, and often other infrastructure components like Redis, Kafka, Elasticsearch, etc. Each additional piece of infrastructure means more time spent on operations work rather than on building the product. More infrastructure also means more points of failure, which makes your applications less resilient and less secure.
One of the biggest advantages of a monolithic architecture over multiple separate services is that you can access different components directly, rather than through web service APIs. This means you don’t have to worry about API versioning, backward compatibility, or potential delays.
Disadvantages of monolithic systems
However, once an app reaches a certain scale, or the team building it reaches a certain scale, it eventually outgrows a monolithic architecture. This happened at Shopify in 2016, and manifested itself in the ever-increasing complexity of building and testing new features. A couple of things in particular served as red flags for us.
The application was very fragile, and new code had unexpected consequences. A seemingly innocuous change could cause a cascade of unrelated test failures. For example, if the code that calculated shipping rates called the code that calculated tax rates, then changing the way the tax rates were calculated could affect the outcome of the shipping rate calculation, but it might not be obvious why. This was due to tight coupling and a lack of boundaries, which also meant that tests were difficult to write and ran very slowly during continuous integration.
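The shipping and tax example can be made concrete with a minimal sketch. The class names, the rate accessor, and the amounts below are hypothetical and chosen only for illustration; they are not Shopify's actual code.

```ruby
# Hypothetical classes illustrating hidden coupling between two domains.
class TaxCalculator
  class << self
    # Tax rate in basis points (1000 = 10%); integers avoid float rounding.
    attr_accessor :rate_bps
  end
  self.rate_bps = 1000

  def self.tax_for(amount_cents)
    amount_cents * rate_bps / 10_000
  end
end

class ShippingCalculator
  BASE_RATE_CENTS = 500

  # Hidden coupling: shipping reaches directly into tax logic, so any
  # change to TaxCalculator also changes every shipping quote.
  def self.quote_cents
    BASE_RATE_CENTS + TaxCalculator.tax_for(BASE_RATE_CENTS)
  end
end

quote_before = ShippingCalculator.quote_cents # 500 + 50

# A change that looks local to the tax domain...
TaxCalculator.rate_bps = 1500

# ...silently alters the shipping result, and any test asserting on it.
quote_after = ShippingCalculator.quote_cents  # 500 + 75
```

Nothing in ShippingCalculator's own file changed between the two quotes, which is why failures of this kind are hard to trace back to their cause.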
Developing at Shopify required taking into account a lot of context for seemingly simple changes. When new Shopify employees joined the team and were introduced to the codebase, they had a lot of information to absorb before they could begin working. For example, a new developer joining the shipping team should only have needed to understand the implementation of the shipping business logic. In reality, that new developer also needed to understand how orders are created, how we process payments, and more, because everything was tightly coupled. That's a lot of information to keep in mind just to ship your first feature. In complex monolithic applications, the learning curve is steep.
All the problems we encountered stemmed directly from the lack of boundaries between different functional areas in our code. It was clear that we needed to loosen the coupling between different areas, but the question was how to do it.
Microservice architecture
One solution that is currently gaining popularity in the industry is microservices. Microservice architecture is an approach to application development in which a large application is built as a set of smaller services that are deployed independently of each other. While microservices might have solved the problems we were facing, they would have introduced a whole new set of problems.
We would have to maintain multiple different testing and deployment pipelines, incur infrastructure overhead for each service, and not always have access to the right data when we need it. Since each service is deployed independently of the others, communication between services means traversing the network, which increases latency and reduces reliability every time we access it. Additionally, large refactorings that affect multiple services can be tedious, as they require changes to all dependent services and deployment coordination mechanisms.
Modular monoliths
We needed a solution that would increase modularity without increasing the number of deployable units, allowing us to get the benefits of monoliths and microservices without their drawbacks.
Monolith vs. Microservices by Simon Brown.
A modular monolith is a system in which all code runs as a single application, and there are well-defined boundaries between different domains.
Implementing a Modular Monolith in Shopify: Componentization
When it became clear that we had outgrown the monolithic structure and it was impacting performance, a survey was sent out to all developers working on our core system to identify the main pain points. We knew we had a problem, but we wanted to be data-driven when designing a solution, so that it addressed the problems developers actually reported rather than anecdotal ones.
Based on the results of this survey, we decided to split our codebase. In early 2017, a small but strong team was assembled to tackle this task. The project was initially called “Break-Core-Up-Into-Multiple-Pieces,” and over time we renamed this process “componentization.”
Code organization
The first issue we tackled was code organization. Our code was set up as a typical Rails application: by software concepts (models, views, controllers). The goal was to reorganize it by real-world concepts (like orders, shipping, inventory, and billing) to make it easier to find the code, find people who understand the code, and understand the individual pieces on their own. Each component would be structured as its own mini Rails application, with the goal of eventually namespacing the components as Ruby modules. We hoped that the new organization would highlight areas that were unnecessarily coupled.
Realistic Reorganization: Before and After.
Building the initial list of components required a lot of research and engagement from stakeholders across the company. We did this by listing every Ruby class (about 6,000 in total) in a massive spreadsheet and manually notating which component it belonged to. Even though no code was changed during this process, it still affected the entire codebase and was potentially very risky if we got it wrong. We accomplished this with one large pull request, built with automated scripts.
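As a rough illustration of how a class-to-component mapping could drive an automated move, here is a minimal sketch. The mapping, the `components/<name>/app/...` target layout, and the helper names are assumptions made for this example; the source does not describe the actual scripts.

```ruby
# Hypothetical stand-in for the spreadsheet: class name => component.
COMPONENT_FOR_CLASS = {
  "Order" => "orders",
  "ShippingRate" => "shipping"
}.freeze

# Converts "ShippingRate" to "shipping_rate" without ActiveSupport.
def underscore(class_name)
  class_name.gsub(/([a-z\d])([A-Z])/, '\1_\2').downcase
end

# Returns [old_path, new_path] pairs. Only paths are computed here;
# nothing is moved on disk. The directory layout is an assumption.
def planned_moves(mapping, layer: "models")
  mapping.map do |class_name, component|
    file = "#{underscore(class_name)}.rb"
    ["app/#{layer}/#{file}", "components/#{component}/app/#{layer}/#{file}"]
  end
end

moves = planned_moves(COMPONENT_FOR_CLASS)
```

Producing the list of moves separately from applying them makes it possible to review the whole plan before any file is touched, which matters when a single change affects the entire codebase.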
Since the changes we made were just file moves, potential failures could have occurred because our code didn't “know” where to find object definitions, which would have resulted in runtime errors. Our codebase is well tested, so by running our tests locally and in CI without failures, and running as much functionality as possible locally and in staging, we were able to ensure that nothing was missed. We decided to do all of this in a single PR to minimize developer disruption. Unfortunately, as a result of this change, we lost a lot of Git history in GitHub, where file moves were incorrectly counted as deletions and creations rather than renames. We can still track the lineage with git log --follow, which follows a file's history across moves, but GitHub doesn't recognize the moves.
Isolating dependencies
The next step was to isolate dependencies by separating business domains from each other. Each component defined a clean, specialized interface with domain boundaries expressed through a public API. Each component was given exclusive ownership of the data associated with it.
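One way to express a component's public API in plain Ruby is to expose a single entry point and hide everything else with `private_constant`. This is a minimal sketch under that assumption; the module names, the rate table, and the method signature are hypothetical, and the source does not say that this is the mechanism Shopify uses.

```ruby
# Hypothetical Tax component: one public entry point, internals hidden.
module Tax
  # Public API: the only constant other components are meant to use.
  module Calculator
    def self.tax_cents(amount_cents:, region:)
      amount_cents * RateTable.basis_points_for(region) / 10_000
    end
  end

  # Internal detail owned exclusively by the Tax component.
  module RateTable
    RATES_BPS = { "CA-ON" => 1300 }.freeze

    def self.basis_points_for(region)
      RATES_BPS.fetch(region, 0)
    end
  end

  # Code outside Tax that references Tax::RateTable now gets a NameError.
  private_constant :RateTable
end

tax = Tax::Calculator.tax_cents(amount_cents: 10_000, region: "CA-ON")
```

Callers in other components depend only on `Tax::Calculator`, so the rate table can be restructured or replaced without touching them.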
While the team couldn't do this across Shopify's entire codebase on its own, because the work required subject matter experts from every domain, it did define patterns and provide tools to accomplish the task.
We built a tool called Wedge that tracks each component’s progress toward isolation. It identifies any domain boundary violations (when another component is accessed through something other than its publicly defined API) as well as data coupling between components. To do this, we wrote a tool that hooks into Ruby tracepoints during CI to get a full call graph. We then sort the callers and callees by component, picking out only the calls that cross component boundaries, and send them to Wedge. Along with these calls, we send some additional data from code analysis, such as ActiveRecord associations and inheritance. Wedge then determines which of these cross-component things (calls, associations, inheritance) are OK, and which are violations. In general:
- Cross-component associations always violate componentization.
- Calls are OK only to things that are explicitly public.
Wedge then calculates the overall score and lists the violations for each component.
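The call-graph step can be sketched with Ruby's built-in `TracePoint`. This is not Wedge's implementation, only a minimal illustration of the idea; the sample classes, the rule that a class's top-level namespace is its component, and the helper names are all assumptions.

```ruby
# Hypothetical components used only to generate some calls.
module Tax
  class Calculator
    def tax_cents(amount_cents)
      amount_cents / 10
    end
  end
end

module Shipping
  class Quote
    def total_cents
      500 + Tax::Calculator.new.tax_cents(500)
    end
  end
end

# Assumption: the top-level namespace of a class names its component.
# Singleton classes have no name and map to nil, so they are skipped.
def component_of(klass)
  klass.name.to_s.split("::").first
end

# Runs the block under a TracePoint and returns the unique
# [caller_component, callee_component] pairs that cross a boundary.
def cross_component_calls
  stack = []
  edges = []
  trace = TracePoint.new(:call, :return) do |tp|
    if tp.event == :call
      callee = component_of(tp.defined_class)
      caller_component = stack.last
      if caller_component && callee && caller_component != callee
        edges << [caller_component, callee]
      end
      stack.push(callee)
    else
      stack.pop
    end
  end
  trace.enable { yield }
  edges.uniq
end

edges = cross_component_calls { Shipping::Quote.new.total_cents }
```

A real tool would then compare each edge against the callee's public API to decide whether it is a violation; this sketch stops at collecting the cross-component edges.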
Shopify's Wedge tracks progress towards each component goal.
As a next step, we will plot the ratings over time and display significant differences so people can see why and when the rating changed.
Enforcing boundaries
In the long term, we would like to take it one step further and ensure that these boundaries are respected programmatically. An article by Dan Manges provides a detailed example of how one application team achieved boundary enforcement. While we're still exploring the approach we want to take, in general we plan to have each component only load components that it explicitly depends on. This will cause runtime errors if it attempts to access code in a component that it hasn't declared a dependency on. We may also cause runtime errors or test failures when components are accessed through anything other than their public API.
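The idea of failing at runtime on an undeclared dependency can be sketched as follows. Since the approach is described as still being explored, this is purely an illustration: the registry class, its method names, and the error class are hypothetical, and it checks declared dependencies rather than controlling what gets loaded.

```ruby
# Hypothetical error raised when a component reaches outside its
# declared dependencies.
class UndeclaredDependencyError < StandardError; end

# Hypothetical registry of explicit component dependencies.
class ComponentRegistry
  def initialize
    @dependencies = Hash.new { |hash, name| hash[name] = [] }
  end

  # Records that `component` is allowed to use each of `depends_on`.
  def declare(component, depends_on: [])
    @dependencies[component].concat(depends_on)
  end

  # Raises unless `from` declared a dependency on `to`.
  def check!(from:, to:)
    return if from == to || @dependencies[from].include?(to)

    raise UndeclaredDependencyError,
          "#{from} has not declared a dependency on #{to}"
  end
end

registry = ComponentRegistry.new
registry.declare(:shipping, depends_on: [:tax])

registry.check!(from: :shipping, to: :tax) # declared, so no error

begin
  registry.check!(from: :tax, to: :shipping)
rescue UndeclaredDependencyError => e
  violation_message = e.message
end
```

Making the dependency list explicit also documents the intended direction of each relationship, so a reverse call like the one above fails loudly instead of creating a new coupling unnoticed.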
We also want to untangle the domain dependency graph, eliminating accidental and circular dependencies. Achieving complete isolation is an ongoing task, but one that all Shopify developers are committed to, and we are already seeing some of the expected benefits. For example, we had a legacy tax engine that was no longer meeting the needs of our merchants. Before the efforts described in this post, replacing the old system with a new one would have been a nearly impossible task. However, because we put so much effort into isolating dependencies, we were able to replace our tax engine with an entirely new tax calculation system.
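Circular dependencies between components can be found by looking for strongly connected components in the dependency graph. The sketch below uses Ruby's standard `tsort` library; the graph class and the sample dependency data are hypothetical and serve only as an illustration.

```ruby
require "tsort"

# Hypothetical wrapper around a { component => [dependencies] } hash.
class DependencyGraph
  include TSort

  def initialize(dependencies)
    @dependencies = dependencies
  end

  # TSort hook: enumerate every known component.
  def tsort_each_node(&block)
    @dependencies.each_key(&block)
  end

  # TSort hook: enumerate the components a given component depends on.
  def tsort_each_child(node, &block)
    @dependencies.fetch(node, []).each(&block)
  end

  # Groups of two or more components that depend on each other in a loop.
  def cycles
    strongly_connected_components.select { |group| group.size > 1 }
  end
end

graph = DependencyGraph.new(
  "orders" => ["shipping"],
  "shipping" => ["tax"],
  "tax" => ["orders"], # closes the loop orders -> shipping -> tax -> orders
  "billing" => ["tax"]
)

cycles = graph.cycles.map(&:sort)
```

Here `billing` depends on a component inside the cycle but is not part of it, so it is not reported; only the three mutually dependent components are.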
In conclusion, it should be said that no architecture at all is often the best architecture in the early stages of working on a system. This doesn't mean you shouldn't implement good software practices, but you shouldn't spend weeks or months trying to design a complex system you don't yet know. Martin Fowler's Design Stamina Hypothesis illustrates this idea well, explaining that in the early stages of most apps, you can move very quickly without design. It makes sense to trade off quality of design against time to market. Once the rate of adding features and functionality starts to slow down, it’s time to invest in good design.
The best time to refactor and rebuild is as late as possible, because as you develop you are constantly learning more about your system and business domain. Designing a complex microservices system before you have a thorough understanding of the domain is a risky move that many people stumble on. According to Martin Fowler: “In almost every case where I've heard of a system being built as microservices from scratch, it ended up in serious trouble… You shouldn't start a new project with microservices, even if you're confident your application will be large enough to justify the approach.”
Good software architecture is an evolving task, and the scale at which you work will determine which solution is right for your application. Monoliths, modular monoliths, and service-oriented architectures fall along an evolutionary scale as your application grows in complexity. Each architecture is right for a different size of team and application, and moving from one level to the next will be painful. When you start hitting many of the pain points described in this article, you know you've outgrown your current solution and it's time to move on to the next.