How to use GraphQL Federation to incrementally migrate from a Monolith (Python) to Microservices (Go)

Or how to change the foundation of an old house so that it does not collapse

About 10 years ago, we chose Python 2 to develop our monolithic learning platform. But the industry has changed dramatically since then. Python 2 was officially buried on January 1, 2020. In a previous article, we explained why we decided not to migrate to Python 3.

Millions of people use our platform every month.

We took some risk when we decided to rewrite our backend in Go and change the architecture.

We chose Go for several reasons:

  1. High compilation speed.
  2. Saving RAM.
  3. Quite a wide selection of IDEs with Go support.

But we took an approach that minimized the risk.

GraphQL Federation

We decided to build our new architecture around Apollo GraphQL Federation. GraphQL was created by Facebook developers as an alternative to REST APIs. Federation is about building a single gateway for multiple services. Each service can have its own GraphQL schema. The common gateway combines their schemas, generates a single API, and lets a single request span multiple services.

Before we go further, I would like to highlight the following:

  1. Unlike REST APIs, each GraphQL server has its own typed data schema. This lets a client request exactly the combination of fields it needs and nothing more.
  2. A REST API gateway routes a request to only one backend service; the GraphQL gateway generates a query plan for an arbitrary number of backend services and returns selections from them in a single combined response.
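The second point can be sketched in Go. This is only a toy illustration of the idea, not how the Apollo gateway actually plans queries: the `fetcher` type, the `plan`, `execute`, and `staticService` functions, and the service names are all hypothetical, and a "query" is reduced to a flat list of field names.

```go
package main

import (
	"fmt"
	"sort"
)

// fetcher stands in for one backend GraphQL service (hypothetical).
// It returns values for the fields it was asked for.
type fetcher func(fields []string) map[string]string

// staticService returns a fetcher that serves requested fields from a fixed map.
func staticService(data map[string]string) fetcher {
	return func(fields []string) map[string]string {
		out := make(map[string]string, len(fields))
		for _, f := range fields {
			if v, ok := data[f]; ok {
				out[f] = v
			}
		}
		return out
	}
}

// plan groups the requested fields by the service that owns them.
func plan(owners map[string]string, fields []string) map[string][]string {
	p := make(map[string][]string)
	for _, f := range fields {
		svc, ok := owners[f]
		if !ok {
			continue // a real gateway would reject unknown fields during validation
		}
		p[svc] = append(p[svc], f)
	}
	return p
}

// execute queries every service in the plan and merges the partial results
// into a single response.
func execute(p map[string][]string, services map[string]fetcher) map[string]string {
	out := make(map[string]string)
	for svc, fields := range p {
		fetch, ok := services[svc]
		if !ok {
			continue
		}
		for k, v := range fetch(fields) {
			out[k] = v
		}
	}
	return out
}

func main() {
	// Assumed ownership: one field lives in the monolith, one in a Go service.
	owners := map[string]string{"title": "monolith", "createdDate": "assignments"}
	services := map[string]fetcher{
		"monolith":    staticService(map[string]string{"title": "Homework 1"}),
		"assignments": staticService(map[string]string{"createdDate": "2020-01-01T00:00:00Z"}),
	}

	res := execute(plan(owners, []string{"title", "createdDate"}), services)

	keys := make([]string, 0, len(res))
	for k := range res {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s: %s\n", k, res[k])
	}
}
```

The client sees one response even though two services answered, which is the property the rest of the migration relies on.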

So, having included the GraphQL gateway in our system, we get something like this:

[Image: architecture diagram with the GraphQL gateway]

The gateway (aka the graphql-gateway service) is responsible for creating a query plan and sending GraphQL queries to our other services – not just the monolith. Our Go services have their own GraphQL schemas. To form responses to requests, we use gqlgen (this is a GraphQL library for Go).

Since GraphQL Federation provides a common GraphQL schema, and the gateway bundles all the individual service schemas into one, our monolith interacts with it just like any other service. This is a fundamental point.

Next, we will talk about how we customized the Apollo GraphQL server to safely move from our monolith (Python) to a microservice architecture (Go).

Side-by-side testing

GraphQL “thinks” in terms of sets of objects and fields of certain types. The code that knows how to handle an incoming request and how to fetch the data for a given field is called a resolver.
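To make the term concrete, here is a minimal sketch of a field resolver in plain Go. The method shape (a context, the parent object, and a value/error return) is loosely modeled on the resolver methods that gqlgen generates, but nothing here imports gqlgen. The `Assignment` struct, the `assignmentResolver` type, and the in-memory map are assumptions made only for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Assignment is a hypothetical parent object; only its ID matters here.
type Assignment struct {
	ID string
}

// assignmentResolver resolves fields of Assignment. The map stands in for
// whatever datastore a real service would query (an assumption).
type assignmentResolver struct {
	createdDates map[string]string // assignment ID -> RFC 3339 timestamp
}

// CreatedDate resolves Assignment.createdDate. A nil result with a nil error
// means the field is null for this object.
func (r *assignmentResolver) CreatedDate(ctx context.Context, obj *Assignment) (*time.Time, error) {
	if obj == nil {
		return nil, errors.New("assignment is nil")
	}
	raw, ok := r.createdDates[obj.ID]
	if !ok {
		return nil, nil
	}
	t, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return nil, fmt.Errorf("assignment %s: %w", obj.ID, err)
	}
	return &t, nil
}

func main() {
	r := &assignmentResolver{createdDates: map[string]string{"a1": "2020-01-01T00:00:00Z"}}
	created, err := r.CreatedDate(context.Background(), &Assignment{ID: "a1"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(created.Format(time.RFC3339))
}
```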

Let’s consider the migration process using an example of the data type for assignments:

    type Assignment {
      createdDate: Time
      …
    }

In reality we have many more fields, but the process looks the same for each of them.

Let’s say we want this field from the monolith to be served by our new service written in Go. How can we be sure that the new service will return the same data as the monolith? For this, we use an approach similar to the Scientist library: we request data from both the monolith and the new service, then compare the results and return only one of them.

Step 1: manual mode

When the user asks for the value of the createdDate field, our GraphQL gateway first accesses the monolith (which is written in Python, remember).

In the first step, we need to ensure that the field can be added to the new assignments service written in Go. The file with the .graphql extension should contain the following schema definition:

    extend type Assignment @key(fields: "id") {
      id: ID! @external
      createdDate: Time @migrate(from: "python", state: "manual")
    }

Here we are using Federation to say that the service adds a createdDate field to the Assignment type. The object is looked up by id. We also add a “secret ingredient”: the @migrate directive. We wrote code that understands these directives and generates several schemas that the GraphQL gateway uses when deciding where to route a request.

In manual mode, requests are routed only to the monolith. We must take this into account when developing a new service. To get the value of the createdDate field, we can still access the monolith directly (in primary mode), or we can query the GraphQL gateway using the schema in manual mode. Both options should work.

Step 2: side-by-side mode

After we have written the resolver code for the createdDate field, we switch it to side-by-side mode:

    extend type Assignment @key(fields: "id") {
      id: ID! @external
      createdDate: Time @migrate(from: "python", state: "side-by-side")
    }

And now the gateway will access both the monolith (Python) and the new service (Go). It will compare the results, log the cases where there are differences, and return the result from the monolith to the user.
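The comparison step can be sketched roughly as follows. This is a simplified illustration in the spirit of Scientist, not our gateway code: the `fetch` type, the `mismatch` struct, and the `sideBySide` function are hypothetical names, and the two calls run sequentially here for brevity, whereas a real gateway would more likely run them concurrently.

```go
package main

import (
	"fmt"
	"reflect"
)

// fetch stands in for resolving one field from one backend (hypothetical).
type fetch func() (any, error)

// mismatch describes one discrepancy between the monolith and the new service.
type mismatch struct {
	Field        string
	Control      any   // value returned by the monolith
	Candidate    any   // value returned by the new service
	CandidateErr error // error returned by the new service, if any
}

// sideBySide calls both backends, reports any difference, and always returns
// the control (monolith) result to the caller.
func sideBySide(field string, control, candidate fetch, report func(mismatch)) (any, error) {
	want, err := control()
	if err != nil {
		// The monolith is still the source of truth, so its error is returned as-is.
		return nil, err
	}
	got, candErr := candidate()
	if candErr != nil || !reflect.DeepEqual(want, got) {
		report(mismatch{Field: field, Control: want, Candidate: got, CandidateErr: candErr})
	}
	return want, nil
}

func main() {
	python := func() (any, error) { return "2020-01-01T00:00:00Z", nil }
	golang := func() (any, error) { return "2020-01-01T00:00:01Z", nil }
	logDiff := func(m mismatch) {
		fmt.Printf("mismatch on %s: control=%v candidate=%v err=%v\n",
			m.Field, m.Control, m.Candidate, m.CandidateErr)
	}

	value, err := sideBySide("Assignment.createdDate", python, golang, logDiff)
	fmt.Println(value, err)
}
```

Because the candidate's result and errors never reach the user, a bug in the new service shows up only as a logged discrepancy.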

This mode really instills a lot of confidence that our system will not be buggy during the migration process. Over the years, millions of users and “kilotons” of data have gone through our frontend and backend. By observing how this code works in real conditions, we can make sure that even rare cases and random outliers are captured and then processed stably and correctly.

During testing, we receive such reports.

These reports focus on cases where the monolith and the new service return different results.

At first, we often encountered such cases. Over time, we have learned to identify such problems, assess them for criticality and, if necessary, eliminate them.

When working with our dev servers, we use tools that highlight differences in color. This makes it easier to analyze problems and test solutions.

What about mutations?

You might be wondering: if we run the same logic in both Python and Go, what happens to code that modifies data rather than just querying it? In GraphQL terms, this is called a mutation.

Our side-by-side tests do not take mutations into account. We looked at some of the approaches to doing this – they turned out to be more complex than we thought. But we have developed an approach that helps solve the very problem of mutations.

Step 2.5: canary mode

If we have a field or mutation that has successfully survived to the production stage, we enable the canary mode.

    extend type Assignment @key(fields: "id") {
      id: ID! @external
      createdDate: Time @migrate(from: "python", state: "canary")
    }

Canary fields and mutations are served by the Go service for a small percentage of our users. In addition, internal users of the platform test the canary schema. This is a fairly safe way to test complex changes. We can quickly disable the canary schema if something doesn’t work as expected.
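One common way to pick "a small percentage of users" is to hash a stable user identifier into a bucket. The sketch below assumes that approach; the article does not say how users are actually selected, and the `inCanary` function and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// inCanary reports whether a request should be served from the canary schema.
// Assumptions: enabled is a kill switch, internal users always get the canary
// schema, and other users are bucketed by a hash of their ID.
func inCanary(userID string, internal bool, percent uint32, enabled bool) bool {
	if !enabled {
		return false // the canary schema has been switched off
	}
	if internal {
		return true
	}
	h := fnv.New32a()
	h.Write([]byte(userID)) // Write on an FNV hash never returns an error
	// The same user always lands in the same bucket, so their experience is stable.
	return h.Sum32()%100 < percent
}

func main() {
	for _, id := range []string{"user-1", "user-2", "user-3"} {
		fmt.Println(id, inCanary(id, false, 5, true))
	}
	fmt.Println("staff", inCanary("staff-1", true, 5, true))
}
```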

We only use one canary schema at a time. In practice, not many fields and mutations are in canary mode at the same time, so I do not expect this to cause problems in the future. This is a good compromise because the schema is quite large (over 5000 fields) and the gateway instances must store three schemas in memory: primary, manual, and canary.
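The relationship between a field's @migrate state and the three schemas can be pictured as a small routing table. The exact rules the gateway applies are not spelled out in this article, so the table below is one plausible reading and every rule in it is an assumption: the manual schema routes every annotated field to Go, the canary schema routes canary and migrated fields to Go, and the primary schema routes only migrated fields to Go. The side-by-side comparison is not modeled. All type and function names are hypothetical.

```go
package main

import "fmt"

// state is the value of the state argument of the @migrate directive.
type state string

const (
	stateManual     state = "manual"
	stateSideBySide state = "side-by-side"
	stateCanary     state = "canary"
	stateMigrated   state = "migrated"
)

// schema identifies one of the three schemas the gateway keeps in memory.
type schema string

const (
	schemaPrimary schema = "primary"
	schemaManual  schema = "manual"
	schemaCanary  schema = "canary"
)

// owner returns which backend serves a field in the given schema.
// These rules are assumptions for illustration, not the gateway's actual logic.
func owner(s state, sch schema) string {
	switch sch {
	case schemaManual:
		return "go" // developers opt in explicitly to exercise the new service
	case schemaCanary:
		if s == stateCanary || s == stateMigrated {
			return "go"
		}
	case schemaPrimary:
		if s == stateMigrated {
			return "go"
		}
	}
	return "python"
}

func main() {
	states := []state{stateManual, stateSideBySide, stateCanary, stateMigrated}
	schemas := []schema{schemaPrimary, schemaManual, schemaCanary}
	for _, s := range states {
		for _, sch := range schemas {
			fmt.Printf("state=%-12s schema=%-7s -> %s\n", s, sch, owner(s, sch))
		}
	}
}
```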

Step 3: migrated mode

In this step, the createdDate field should be in migrated mode:

    extend type Assignment @key(fields: "id") {
      id: ID! @external
      createdDate: Time @migrate(from: "python", state: "migrated")
    }

In this mode, the GraphQL gateway sends requests only to the new service written in Go. But at any moment we can see how the monolith would process the same request. This makes it much easier to deploy and roll back changes if something goes wrong.

Step 4: Completing the migration

After a successful deployment, we no longer need the monolith code for this field, and we remove the @migrate directive from the schema:

    extend type Assignment @key(fields: "id") {
      id: ID! @external
      createdDate: Time
    }

From now on, the gateway will interpret the Assignment.createdDate expression as getting a field value from a new service written in Go.

This is what incremental migration looks like!

And how far have we gone?

We completed our side-by-side testing infrastructure just this year. This allowed us to rewrite a large amount of code in Go safely, slowly but surely. Throughout the year, we have maintained the high availability of the platform against the backdrop of growing traffic. At the time of this writing, ~40% of our GraphQL fields have been moved to Go services. So the approach we described has worked well in the migration process.

Even after the project is completed, we can continue to use this approach for other tasks related to changing our architecture.

P.S. Steve Coffman gave a talk on this topic at Google Open Source Live. You can watch the recording on YouTube or just view the presentation.
