In this article, we describe how we deployed a Service Mesh, solved some of the problems of a microservice architecture, and reduced the load on developers and infrastructure engineers.
Why we needed a Service Mesh
Service Mesh is gaining popularity, so I see little point in once again defining it and describing why it might come in handy. If you do not know what it is, but your company has many teams and services, or you are going to split up a monolith, you should familiarize yourself with the subject. A good place to start is the article written by William Morgan, creator of the first Service Mesh project.
Over the past year, the number of product teams in the company has grown 1.5 times. At the time of this writing, we have more than 20 teams, almost all of which include developers. As their number grew, developing the monolith did not get any easier, so we began to move more actively towards a microservice architecture. Although we are still near the beginning of this journey, services for various purposes already began to multiply at a decent pace 3-4 years ago.
Of course, the more services there are, the harder it is to keep track of them and to manage them and the infrastructure configuration. This would be especially difficult in a traditional setup with a wall between developers and admins. Therefore, as we created microservices, we also began handing most of the responsibility for launching and operating a service to developers, while providing Ops expertise as a service. However, developers, of course, do not know all the subtleties of setting up the infrastructure.
A Service Mesh conveniently combines the following things and abstracts them away from developers:
- Service Discovery;
- Distributed tracing;
- Circuit Breaking / Retries / Timeouts;
- Monitoring / telemetry of service interactions, and many others.
First of all, it should be noted that we deploy in the cloud and package services in Docker containers. The containers are orchestrated by Nomad, which is used in conjunction with Consul. This choice was made when Kubernetes was still not very popular or stable, and even then it was much harder to operate and configure than Nomad.
For those who are only familiar with Kubernetes, I’ll give a little explanation about how everything works in the HashiCorp stack.
Nomad is responsible for orchestrating Docker containers (it also supports QEMU, raw and isolated fork/exec, and Java workloads). Consul is responsible for Service Discovery and can act as a DNS server.
Nomad and Consul agents are installed on each host in the cloud. In addition, there are several master servers (usually three or five) that form a quorum and are responsible for coordination.
Nomad and Consul are tightly integrated. Consul monitors whether the agent (node) is alive and whether the service's health checks pass. A service can have several checks at once (TCP, HTTP, gRPC, script). In the Nomad job specification you can also interpolate values from the Consul key-value store or from Vault secrets, for example, to set environment variables inside a container or to render template files. For this, Nomad uses consul-template.
Both Nomad and Consul provide an HTTP API. From Nomad, for example, you can get all the data about the status of running tasks, and from Consul a list of IP:port pairs for a service, taking health checks into account. What k8s calls a Pod is called an Allocation in Nomad; I will use this term below.
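To make the "IP:port list taking health checks into account" concrete, here is a minimal sketch of parsing the body returned by Consul's `/v1/health/service/<name>?passing` endpoint. The function name and the sample addresses are made up for illustration, and the payload is trimmed to the few fields the sketch needs; this is not the code used in our system.

```python
import json


def healthy_endpoints(health_response_body):
    """Return (ip, port) pairs from a Consul health-service response body."""
    endpoints = []
    for entry in json.loads(health_response_body):
        service = entry["Service"]
        # When Service.Address is empty, fall back to the address of the node.
        ip = service.get("Address") or entry["Node"]["Address"]
        endpoints.append((ip, service["Port"]))
    return endpoints


# Trimmed sample payload with made-up addresses (assumption, for illustration only).
sample = json.dumps([
    {"Node": {"Address": "10.0.0.5"}, "Service": {"Address": "", "Port": 23045}},
    {"Node": {"Address": "10.0.0.6"}, "Service": {"Address": "10.0.0.16", "Port": 8080}},
])

print(healthy_endpoints(sample))
```

With the `passing` filter, Consul itself drops instances whose checks fail, so the client only has to read addresses and ports.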
Our company has long followed an approach that does without a dedicated operations engineer position. In other words, we want developers to be as self-sufficient as possible: they write the service, deploy it, and monitor it themselves.
To do this, we have long been using an in-house frontend for the HashiCorp stack, where developers themselves register services and deploy them to staging or production. The Service Mesh is integrated with this system so that developers can handle trivial cases, like adding Redis to their service or connecting one service to another, on their own.
About proxies in Service Mesh
A Service Mesh has two so-called planes: the data plane and the control plane. The data plane is usually implemented as proxies that are distributed across hosts in one way or another. The proxies are configured through the control plane, which is usually centralized. Interservice communication goes through these proxies.
The most common proxy deployment options are one proxy per host or a sidecar (a dedicated container in the Pod / allocation). It is also possible to deploy proxies on dedicated hosts, but in that case network latency becomes noticeable.
Different Service Mesh implementations use different proxies. Linkerd uses linkerd-proxy, while Consul Connect and Istio allow the use of third-party proxies. Envoy from Lyft most often plays this role. It is an open source proxy that was originally developed for this purpose.
Why ready-made implementations did not suit us
We did not use ready-made implementations, and here’s why:
- When we began our transition to Service Mesh in early 2018, the concept was new and the implementations were still immature. We would inevitably have had to deal with the proxy implementation directly, bypassing the abstraction.
- For our purposes, the implementations were too heavyweight.
- The centralized control plane did not suit us.
Moving away from a centralized control plane
Our first attempt was to build our own centralized control plane: a separate service that would let developers configure the Service Mesh flexibly. Pretty soon it became clear that developers would not be able to keep track of both deployment and routing at the same time. Also, integrating these two systems is far from trivial and, by our estimates, would have taken quite a lot of time.
We decided to build Service Mesh management into the deployment system, so that when the application configuration is edited, the part of the Mesh that covers this application is updated as well. The number of Envoy knobs we expose was, of course, reduced so as not to scare the developer.
Envoy as a data plane
We launch Envoy in sidecar containers next to the main application container. We chose sidecar containers rather than a single proxy per host. This way each application gets its own proxy configuration, and scaling comes for free. For incoming requests (ingress), the proxy presents itself to Consul as the service itself; for outgoing (egress) connections, it opens one port for each service dependency. The configuration is mostly static: it lists the outgoing services and the settings for interacting with each of them, such as timeouts and the limit on the number of retries.
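The "one egress port per dependency" idea can be sketched as follows. This is a deliberately simplified illustration, not Envoy's configuration schema and not our deployment code: the `Dependency` class, the `plan_egress` function, the starting port, and the default values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    """One outgoing dependency with the few settings exposed to developers."""
    name: str
    timeout_s: float = 1.0
    max_retries: int = 2


def plan_egress(dependencies, first_port=10000):
    """Assign one local listener port per dependency, in a stable sorted order."""
    plan = {}
    ordered = sorted(dependencies, key=lambda dep: dep.name)
    for offset, dep in enumerate(ordered):
        plan[dep.name] = {
            "listen": ("127.0.0.1", first_port + offset),
            "timeout_s": dep.timeout_s,
            "max_retries": dep.max_retries,
        }
    return plan


deps = [Dependency("users", timeout_s=0.5), Dependency("billing", max_retries=0)]
for name, settings in plan_egress(deps).items():
    print(name, settings)
```

The application then talks to each dependency through its own local port, and the proxy applies the timeout and retry settings for that dependency.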
Connecting the sidecars to each other is easy: Nomad gives every container in the allocation environment variables with the addresses of its neighbors.
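As a rough sketch, a container could read a neighbor's address like this. It assumes the `NOMAD_ADDR_<task>_<label>` naming pattern that Nomad documents for tasks in the same group; the task name `envoy`, the port label `egress`, and the address are invented for the example.

```python
import os


def neighbor_address(task, port_label, env=os.environ):
    """Return (host, port) of another task in the same allocation."""
    raw = env[f"NOMAD_ADDR_{task}_{port_label}"]
    # The value has the form "host:port"; split on the last colon.
    host, _, port = raw.rpartition(":")
    return host, int(port)


# A fake environment stands in for what Nomad would inject (assumption).
fake_env = {"NOMAD_ADDR_envoy_egress": "10.0.0.5:23046"}
print(neighbor_address("envoy", "egress", fake_env))
```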
But what about Service Discovery? How does the outgoing proxy find out where other services are? To do this, we launch another sidecar: an in-house control plane agent. Functionally, it implements a small subset of the full control plane API and is responsible for translating the Consul API into the Envoy API. We have implemented only part of the Endpoint Discovery Service (EDS). The agent subscribes to all updates to the list of "healthy" instances of each service dependency in Consul. Envoy periodically polls the agent to get the current dependency addresses.
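The translation step performed by such an agent might look roughly like the sketch below: a list of healthy (ip, port) pairs becomes an EDS-style `ClusterLoadAssignment` wrapped in a discovery response. The function names are hypothetical, and the v3 type URL is an assumption chosen for the example; an agent written in 2018 would have targeted an earlier version of the Envoy API.

```python
import json

# Assumption: the v3 resource type; earlier Envoy API versions use another URL.
EDS_TYPE_URL = "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment"


def to_cluster_load_assignment(cluster_name, endpoints):
    """Build an EDS resource from healthy (ip, port) pairs."""
    return {
        "@type": EDS_TYPE_URL,
        "cluster_name": cluster_name,
        "endpoints": [{
            "lb_endpoints": [
                {"endpoint": {"address": {"socket_address": {
                    "address": ip, "port_value": port}}}}
                for ip, port in endpoints
            ],
        }],
    }


def discovery_response(version, assignments):
    """Wrap resources in the envelope that Envoy expects when it polls."""
    return {
        "version_info": version,
        "type_url": EDS_TYPE_URL,
        "resources": assignments,
    }


resource = to_cluster_load_assignment("billing", [("10.0.0.5", 23045)])
print(json.dumps(discovery_response("1", [resource]), indent=2))
```

Because each agent serves only the dependencies of its own service, the state it has to keep stays small.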
Telemetry and traffic visualization
We use Prometheus to collect telemetry. Envoy provides detailed metrics such as the number of requests per second, the number of active connections, circuit breaker trip statistics, and much, much more. We build our dashboards from these metrics.
We also have a service that, based on telemetry from Prometheus, builds a graph of service dependencies and visualizes traffic intensity using Vizceral from Netflix.
Tracing the request path
Telemetry is wonderful and helps a lot. However, the picture would be incomplete if it were impossible to trace the path of a particular request. Fortunately, Envoy supports distributed tracing in the Zipkin / Jaeger format (X-B3-* headers).
That said, we have moved, although not completely, from Jaeger to Elastic APM (Application Performance Monitoring). This analytics service is part of Kibana, which we host ourselves. The product from Elastic is, in our opinion, a more comprehensive solution than Jaeger: it combines logs, APM, and distributed tracing. The Jaeger traces collected by Envoy do not let you look inside the application, but they have helped us more than once and are still used for services built on third-party images, where there is no way to add Elastic APM. With both Jaeger and Elastic APM, we had to modify the application code so that it passes HTTP headers along the call chain and writes the trace ID to the logs. For Jaeger we use an in-house library, while Elastic provides an agent for every language we use.
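The header propagation described above can be sketched in a few lines. This is an illustration rather than our library: the function names are hypothetical, and the header list is the B3 set plus `x-request-id`, which applications behind Envoy are expected to forward.

```python
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
)


def propagate_trace_headers(incoming):
    """Pick the tracing headers of an incoming request for outgoing calls."""
    # HTTP header names are case-insensitive, so normalize before matching.
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


def trace_id_for_logs(incoming):
    """Return the trace ID to attach to log records, or "-" if absent."""
    return propagate_trace_headers(incoming).get("x-b3-traceid", "-")


request_headers = {
    "X-B3-TraceId": "463ac35c9f6413ad",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
    "Content-Type": "application/json",
}
print(propagate_trace_headers(request_headers))
print(trace_id_for_logs(request_headers))
```

The application copies the returned headers onto every outgoing request it makes while handling the incoming one, which lets the proxies stitch the spans into a single trace.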
Summary of implementation and next steps
At the moment, the Service Mesh covers all network interactions within each of our virtual networks. This applies only to the services of development teams; we decided not to cover infrastructure services yet. The traffic is mostly HTTP/1, with a bit of plain TCP and Redis. Traffic between networks does not go through the Mesh yet, but we plan to bring it in too. The same applies to how we currently configure external load balancers: they still hold too much routing logic that would fit much more naturally in the Service Mesh.
Investing in Ops expertise within specific teams will let us make fuller use of the Mesh's capabilities: traffic splitting, fault injection, and more advanced canary releases than we have now. Most likely, we will switch to an industry standard like Istio and a centralized configuration.
Pros and Cons of our Service Mesh
So, for ourselves, we highlight the following advantages of our Service Mesh implementation approach:
- When a service is deployed, only its own configuration and its own proxy change, so there is no way to break the entire network at once.
- Service Discovery becomes much more compact, since you only need to watch the dependencies of one particular service. The control plane agent's memory consumption is about 10 MB and varies, though not significantly, with the number of services listed as dependencies. A dozen dependencies, say, will not add even a few megabytes. Envoy, meanwhile, can be limited to a maximum of 150 MB of RAM, which is enough for our most loaded services. Services that handle less than 100 RPS fit into 30 MB.
- Simplicity for the developer. With one checkbox and 2-3 knobs, they get a configured proxy for incoming and outgoing requests, plus tracing.
- The solution does not require manual intervention: since it went into production, I have never had to attend to it.
There are also disadvantages:
- The configuration is not very flexible, since we deliberately reduced the number of proxy settings.
- To start using a new Envoy feature, you have to modify the code of the in-house frontend system.
- To update the proxy everywhere, you have to do a mass redeploy (on the other hand, the process is easier to control).