Economical APM for a backend, or how to use Loki + Grafana + application logs for APM


Problem

The API was receiving more and more load, and we needed to analyze and then optimize a large backend. The process of optimizing a typical PHP backend usually includes:

  • optimization of SQL queries in the database;

  • optimizing cache usage (raising the hit rate, caching more data);

  • optimization of internal backend processes (moving heavy work to asynchronous execution, optimizing internal data structures).

All of this had already been done, but the effect was not sufficient: the API was large and had many services with differing logic, complexity and interconnections. Spot improvements helped, but we wanted to look at the entire backend from above: to assess the data flows, evaluate the load placed on each of the external storages (database, cache, services), and optimize based on the collected data.

Existing APM implementations (New Relic, Datadog, Elastic APM agent, Dynatrace, OpenTracing) were considered but not chosen, since they would require adding external PHP extensions to the project, buying a license, and/or deploying additional nodes in the infrastructure.

I agree that for APM purposes a separate store of aggregated data – metrics on which graphs are built – is better suited than log analysis: it is faster, the data lives longer (in terms of retention), and it is generally the more correct approach. However, this monitoring was implemented progressively, over several months, shifting the emphasis in the data step by step without any infrastructure changes. It was also partly an experiment: how good can log-based monitoring be with minimal effort?

Solution

It was decided to use the application logs from nginx and the application server (php-fpm in this case), shipped to Loki, to store aggregated request/response information, which is then aggregated by Grafana + Loki and used to build graphs.

In terms of implementation:

  1. Changing the nginx and php-fpm log formats – adding extra fields and switching the format to JSON.
    This is quite simple: everything is in the documentation; the format is set via log_format and access.format, respectively.

  2. Adding logging of connections and of query/command execution, with timings, for the database and the cache in the application.
    This was not difficult, because the framework in use allows overriding the base classes for commands and queries. The implementation is very framework-dependent, but most often it comes down to overriding the base interaction classes (PDO, cache) with a simple $dbConnectionTime += microtime(true) - $start_connection_time.

  3. Wrapping all external service calls into a shim (named connector), which measures the call time, validates the response (including its structure), and logs whether the call succeeded.
    This even proved useful, because now all calls to external systems (there are already about 15 of them) are executed, validated and logged uniformly.
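To illustrate point 2, here is a minimal sketch of a PDO subclass that accumulates connection and query timings into a per-request collector. The class and property names are hypothetical; the real implementation depends on the framework.

```php
<?php
// Sketch of point 2 (hypothetical names): a PDO subclass that accumulates
// connection and query timings into a per-request collector.

class RequestTimings
{
    public float $dbConnectionTime = 0.0;
    public float $dbQueriesTime = 0.0;
    public int $dbQueries = 0;

    public function toArray(): array
    {
        return [
            'dbConnectionTime' => $this->dbConnectionTime,
            'dbQueriesTime'    => $this->dbQueriesTime,
            'dbQueries'        => $this->dbQueries,
        ];
    }
}

class TimedPdo extends PDO
{
    public function __construct(
        private RequestTimings $timings,
        string $dsn,
        ?string $user = null,
        ?string $password = null,
    ) {
        // measure connection time
        $start = microtime(true);
        parent::__construct($dsn, $user, $password);
        $this->timings->dbConnectionTime += microtime(true) - $start;
    }

    public function query(string $statement, ...$args): PDOStatement|false
    {
        // measure and count every executed query
        $start = microtime(true);
        $result = parent::query($statement, ...$args);
        $this->timings->dbQueriesTime += microtime(true) - $start;
        $this->timings->dbQueries++;
        return $result;
    }
}
```

The same pattern applies to prepare()/exec() and to the cache client; at the end of the request the collector is serialized into the timings message described below.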

nginx and application log formats

The nginx logs look pretty simple:

{
    "body_bytes_sent": "3186",
    "client_ip": "10.***.***.***",
    "client_req_id": "cad148c1-****-4819-****-035a4a9ace14",
    "connection": "10186566",
    "http_user_agent": "***",
    "pid": "16",
    "request_method": "POST",
    "request_time": "0.102",
    "request_uri": "/***/***/***?***=***",
    "request_uri_path": "/***/***/***",
    "status": "200",
    "time_local": "10/Apr/2023:23:27:15 +0300",
    "upstream_connect_time": "0.000",
    "upstream_header_time": "0.101",
    "upstream_response_time": "0.101",
    "uri": "/index.php"
}
nginx settings
# common (http context)
http {
    # ...
    log_format json_upstream_logs escape=json '{"time_local":"$time_local","client_req_id":"$client_req_id","client_ip":"$client_ip","request_method":"$request_method","request_uri":"$request_uri","request_uri_path":"$request_uri_path","status":"$status","body_bytes_sent":"$body_bytes_sent","http_user_agent":"$http_user_agent","request_time":"$request_time","uri":"$uri","upstream_connect_time":"$upstream_connect_time","upstream_header_time":"$upstream_header_time","upstream_response_time":"$upstream_response_time","pid":"$pid","connection":"$connection"}';
    # ...
    map $request_uri $request_uri_path {
        "~^(?P<path>[^?]*)(\?.*)?$"  $path;
    }
}

# server context
server {
    # Determine the request id; if the request does not carry one, generate it
    if ($arg_client_req_id !~ $arg_client_req_id) {
        set $client_req_id $request_id;
    }
    if ($arg_request_id) {
        set $client_req_id $arg_request_id;
    }
    if ($arg_client_req_id) {
        set $client_req_id $arg_client_req_id;
    }
    # Determine the client IP address
    # requests arrive via a proxying server, so the HTTP headers can be trusted
    if ($http_x_real_ip !~ $http_x_real_ip) {
        set $client_ip $remote_addr;
    }
    if ($http_x_real_ip) {
        set $client_ip $http_x_real_ip;
    }


    location ~ ^/index\.php$ {
        # ...
        access_log  /var/log/nginx/access.log  json_upstream_logs;
        # ...
    }
}

The reworked php-fpm logs look pretty simple too:

{
    "C": "43.21",
    "M": "2097152",
    "P": "9",
    "Q": "",
    "R": "-",
    "RequestId": "f894d70d5910***3772a94261f7c40ca",
    "T": "20/Apr/2023:16:17:19 +0300",
    "d": "0.023",
    "f": "/var/www/html/web//index.php",
    "l": "42",
    "m": "POST",
    "n": "www",
    "p": "17",
    "q": "",
    "r": "/index.php",
    "s": "200",
    "t": "20/Apr/2023:16:17:19 +0300",
    "u": ""
}
php fpm settings
access.format="{"RequestId":"%{REQUEST_ID}e","C":"%C","d":"%d","f":"%f","l":"%l","m":"%m","M":"%M","n":"%n","P":"%P","p":"%p","q":"%q","Q":"%Q","r":"%r","R":"%R","s":"%s","T":"%T","t":"%t","u":"%u"}"
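One nuance: %{REQUEST_ID}e reads an environment variable, so the request id has to reach php-fpm through the FastCGI environment. This is not shown in the config above; one possible way (an assumption on my part, not the project's exact config) is a fastcgi_param in the nginx location:

```nginx
location ~ ^/index\.php$ {
    # pass the request id computed in the server block into the FastCGI
    # environment, where php-fpm exposes it to access.format as %{REQUEST_ID}e
    fastcgi_param REQUEST_ID $client_req_id;
}
```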

Timings in the application

We add meta-information to all application logs (to make it possible to trace a request across different services):

  • request id;

  • the service being called;

  • the id of the authorized user and of the session.

First, we wrap everything: database connections and query execution, cache connections and command execution, and we measure time and data volume.

From the collected data we form the first category of messages – timings – describing the progress of the request. One message is generated per request, at the end of its execution:

{
    "cacheCommands": 14,
    "cacheCommandsTime": 0.005800008773803711,
    "cacheConnectionTime": 0.0006201267242431641,
    "cacheReads": 7,
    "cacheReadsSize": 7989,
    "cacheWrites": 4,
    "cacheWritesSize": 1480,
    "dbConnectionTime": 0.0017120838165283203,
    "dbQueries": 5,
    "dbQueriesTime": 0.0563662052154541,
    "externalCalls": 1,
    "externalCallsTime": 0.03611111640930176
}
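For illustration, forming and emitting such a message might look like this. This is a sketch: the function name and the logger call are assumptions, and in the real code the counters are filled by the overridden db/cache classes.

```php
<?php
// Sketch: building the per-request "timings" message (names hypothetical).

function formatTimingsLine(array $timings): string
{
    // "category" is what the dashboards filter on: |= `:"timings"`.
    // Loki's json stage then turns message.* fields into message_* labels.
    return json_encode([
        'category' => 'timings',
        'message'  => $timings,
    ]);
}

// In the real application these counters are incremented by the
// overridden db/cache/connector classes during the request.
$timings = [
    'dbConnectionTime'  => 0.0,
    'dbQueries'         => 0,
    'dbQueriesTime'     => 0.0,
    'cacheCommands'     => 0,
    'cacheCommandsTime' => 0.0,
    'externalCalls'     => 0,
    'externalCallsTime' => 0.0,
];

// Emit exactly one message per request, after the response is produced.
register_shutdown_function(function () use (&$timings): void {
    error_log(formatTimingsLine($timings));
});
```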

Calls to external services

From the success status and execution time of requests to external services we form the second category – external_calls – one message in the log per call.

{
    "requestTime": 0.03611111640930176,
    "requestUrl": "http://10.***.***.***/***/***/***/***",
    "systemId": "***"
}
{
    "error": "cURL error 28: Connection timeout after 1001 ms (see https://curl.haxx.se/libcurl/c/libcurl-errors.html): 0",
    "request":
    {
        "body": "",
        "url": "http://***.***.***:8280/***/***/***?***=***"
    },
    "requestTime": 1.0022687911987305,
    "response": null,
    "systemId": "***"
}
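A sketch of the connector shim from point 3 (the function name and log structure are hypothetical; the real connector also validates the response structure):

```php
<?php
// Hypothetical sketch of the "connector" shim around external calls:
// measures the call time and logs one external_calls message per call.

function callExternal(string $systemId, string $url, callable $httpGet): array
{
    $start = microtime(true);
    try {
        $response = $httpGet($url);  // e.g. a curl/Guzzle wrapper
        $entry = [
            'requestTime' => microtime(true) - $start,
            'requestUrl'  => $url,
            'systemId'    => $systemId,
        ];
        $level = 'info';
    } catch (Throwable $e) {
        $response = null;
        $entry = [
            'error'       => $e->getMessage(),
            'request'     => ['body' => '', 'url' => $url],
            'requestTime' => microtime(true) - $start,
            'response'    => null,
            'systemId'    => $systemId,
        ];
        $level = 'error';
    }
    // one log line per call; matched in Grafana via |= `:"external_calls"`
    // and the |=`info` / |=`error` filters
    error_log(json_encode([
        'category' => 'external_calls',
        'level'    => $level,
        'message'  => $entry,
    ]));
    return ['response' => $response, 'log' => $entry];
}
```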

Visualization

Then all that remains is to arm ourselves with LogQL and Grafana and write queries against the log storage.

We define two variables that are selectable on both dashboards: the Loki stream selectors $nginx_source and $fpm_source (this is how switching between the environments under study is implemented).
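For illustration, the two variables expand to Loki stream selectors along these lines (the label names here are my assumption; they depend on how the logs are shipped to Loki):

```logql
# hypothetical values of the dashboard variables
$nginx_source = {job="nginx", env="prod"}
$fpm_source   = {job="php-fpm", env="prod"}
```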

The entire APM is divided into two dashboards:

  1. performance status. To evaluate “how well everything is doing.”

  2. Remote Calls. To study interaction with other services.

performance status

Consists of the following blocks:

API response statistics by individual services/response codes.

  1. RPS. The number of requests to the entire API, broken down by nginx response code.

  2. summary for {code}. Response codes are selected from a Grafana drop-down list; based on the selection, the following are built:

    1. Average Response time (nginx/fpm).

    2. response time. Response percentiles based on nginx data.

  3. Response count by path for {code}. Number of service responses for the selected response code.

  4. Response size by path. Average response size of services.

  5. Response time by path. Average service response time.

What does it look like
RPS on all API requests


How is formed

{{status}}: sum by (status) (rate($nginx_source | json | __error__ != "JSONParserErr"[1s]))

Average API response time


How is formed

avg by nginx – {{status}}

avg_over_time( $nginx_source | json | __error__ != "JSONParserErr" | status=$http_code | unwrap request_time | __error__=""[1s] ) by (status)

connect_time – {{status}}

avg_over_time($nginx_source | json | __error__ != "JSONParserErr" | status=$http_code | unwrap upstream_connect_time | __error__=""[1s]) by (status)

header_time – {{status}}

avg_over_time($nginx_source | json | __error__ != "JSONParserErr" | status=$http_code | unwrap upstream_header_time | __error__=""[1s]) by (status)

response_time – {{status}}

avg_over_time($nginx_source | json | __error__ != "JSONParserErr" | status=$http_code | unwrap upstream_response_time | __error__=""[1s]) by (status)

avg by php-fpm – {{s}}

avg_over_time($fpm_source | json | __error__ != "JSONParserErr" | s=$http_code | unwrap d | __error__=""[1s]) by (s)

Response Time Percentiles


How is formed

max by nginx – {{status}}

max_over_time($nginx_source | json | __error__ != "JSONParserErr" | status=$http_code | unwrap request_time | __error__=""[1s]) by (status)

95% by nginx – {{status}}

quantile_over_time(0.95, $nginx_source | json | __error__ != "JSONParserErr" | status=$http_code | unwrap request_time | __error__=""[1s]) by (status)

90% by nginx – {{status}}

quantile_over_time(0.9, $nginx_source | json | __error__ != "JSONParserErr" | status=$http_code | unwrap request_time | __error__=""[1s]) by (status)

70% by nginx – {{status}}

quantile_over_time(0.7, $nginx_source | json | __error__ != "JSONParserErr" | status=$http_code | unwrap request_time | __error__=""[1s]) by (status)

50% by nginx – {{status}}

quantile_over_time(0.5, $nginx_source | json | __error__ != "JSONParserErr" | status=$http_code | unwrap request_time | __error__=""[1s]) by (status)

30% by nginx – {{status}}

quantile_over_time(0.3, $nginx_source | json | __error__ != "JSONParserErr" | status=$http_code | unwrap request_time | __error__=""[1s]) by (status)

Number of responses by service


How is formed

{{request_uri_path}}: sum by (request_uri_path) (rate($nginx_source | json | __error__ != "JSONParserErr" | status=$http_code [1s]))

Service response size


How is formed

{{request_uri_path}}: avg_over_time($nginx_source | json | __error__ != "JSONParserErr" | unwrap body_bytes_sent | __error__=""[1s]) by (request_uri_path)

Average response time by service


How is formed

{{request_uri_path}}: avg_over_time($nginx_source | json | __error__ != "JSONParserErr" | unwrap request_time | __error__=""[1s]) by (request_uri_path)

Summary statistics of the application broken down by individual storages (based on fpm data).

  1. DB/cache/calls summary. Time spent by the application on various activities: working with the database, the cache and other services.

  2. queries and commands. The number of DB queries and of cache reads/writes.

  3. cache size. The amount of data read from and written to the cache.

  4. External calls summary. Distribution of time spent working with external services.

What does it look like
Elapsed time summary


How is formed

connect to db time

sum(sum_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_dbConnectionTime | __error__=""[1s]))

db queries time

sum(sum_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_dbQueriesTime | __error__=""[1s]))

connect to cache time

sum(sum_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_cacheConnectionTime | __error__=""[1s]))

cache time

sum(sum_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_cacheCommandsTime | __error__=""[1s]))

external calls time

sum(sum_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_externalCallsTime | __error__=""[1s]))

Querying the database and cache


How is formed

db queries

sum(sum_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_dbQueries | __error__=""[1s]))

cache commands

sum(sum_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_cacheCommands | __error__=""[1s]))

cache reads

sum(sum_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_cacheReads | __error__=""[1s]))

cache writes

sum(sum_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_cacheWrites | __error__=""[1s]))

The volume of writing and reading from the cache


How is formed

read from cache

sum(sum_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_cacheReadsSize | __error__=""[1s]))

write to cache

sum(sum_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_cacheWritesSize | __error__=""[1s]))

Time spent calling external services


How is formed

{{message_systemId}}: avg_over_time($fpm_source |=`:"external_calls"` | json | __error__ != "JSONParserErr" | unwrap message_requestTime | __error__=""[1s]) by (message_systemId)

Statistics of work with the database broken down by individual services.

  1. Db time. Average time spent working with the database.

  2. DB queries count. Average number of database queries.

What does it look like
Time of work with the database by services


How is formed

{{request_uri_path}}: avg_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_dbQueriesTime | __error__=""[1s]) by (method)

Number of queries to the database by service


How is formed

{{request_uri_path}}: avg_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_dbQueries | __error__=""[1s]) by (method)

Statistics of work with the cache broken down by individual services.

  1. Cache commands time. Average time spent on cache commands.

  2. Cache commands count. Average number of commands sent to the cache.

  3. Cache write size. Average size of data written to the cache.

  4. Cache reads size. Average size of data read from the cache.

What does it look like
Cache time by service


How is formed

{{request_uri_path}}: avg_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_cacheCommandsTime | __error__=""[1s]) by (method)

Number of commands in the cache by service


How is formed

{{request_uri_path}}: avg_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_cacheCommands | __error__=""[1s]) by (method)

Cache write size by service

How is formed

{{request_uri_path}}: quantile_over_time(0.9, $fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_cacheWritesSize | __error__=""[1s]) by (method)

Cache read size by service

How is formed

{{request_uri_path}}: quantile_over_time(0.9, $fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_cacheReadsSize | __error__=""[1s]) by (method)

Statistics of work with other APIs broken down by individual services.

  1. External calls time. Time spent working with external services.

What does it look like
Time spent calling external services by service


How is formed

{{request_uri_path}}: avg_over_time($fpm_source |=`:"timings"` | json | __error__ != "JSONParserErr" | unwrap message_externalCallsTime | __error__=""[1s]) by (method)

Utilization of CPU and memory by workers.

  1. PHP avg memory & CPU.

What does it look like
How is formed

php cpu cons. % – {{s}}

avg_over_time($fpm_source | json | __error__ != "JSONParserErr" | unwrap C | __error__=""[1s]) by (s)

php mem cons. – {{s}}

avg_over_time($fpm_source | json | __error__ != "JSONParserErr" | unwrap M | __error__=""[1s]) by (s)

Non-standard nginx / fpm logs.

  1. nginx messages. All messages from nginx that are not access logs.

  2. php fpm messages. The same, only for php.

What does it look like
How is formed

nginx logs

$nginx_source !=`client_req_id`

php fpm logs

$fpm_source !=`RequestId` !=`category` !=`executing too slow` !=`failed to ptrace`

Summary of application exceptions.

  1. Exceptions. All lines from application logs that contain Exception.

  2. Exception messages. Detailed log lines with exceptions, to investigate right on the spot.

What does it look like
How is formed

{{category}}: sum by (category) (rate($fpm_source |=`Exception` | json | __error__ != "JSONParserErr" [1s]))

external calls

It consists of a variable number of blocks, each containing a summary of the interaction with one external service that the monolith calls.

  1. RPS chart & avg. time + errors. Graph that shows:

    1. The number of successful responses from the service.

    2. Average and maximum response time of the service.

    3. Errors (grouped by type) when calling the service and when parsing or validating the response.

  2. Service call logs. All calls, listed in reverse chronological order.

What does it look like
Graph of requests – successful and not so successful – and average/maximum response time

How is formed

Successful Requests

{{message_systemId}}

sum by (message_systemId) (count_over_time($fpm_source |=`external_calls` |=`systemId":"$external_system` |=`info` | json | __error__ != "JSONParserErr" [10s]))

Failed Requests

errored – {{message_systemId}} : {{message_error}}

sum by (message_systemId, message_error) (count_over_time($fpm_source |=`external_calls` |=`systemId":"$external_system` |=`error` | json | __error__ != "JSONParserErr" [10s]))

Average execution time

avg request time – {{message_systemId}}

avg_over_time($fpm_source |=`external_calls` |=`systemId":"$external_system` | json | __error__ != "JSONParserErr" | unwrap message_requestTime | __error__=""[10s]) by (message_systemId)

Maximum execution time

max request time – {{message_systemId}}

max_over_time($fpm_source |=`external_calls` |=`systemId":"$external_system` | json | __error__ != "JSONParserErr" | unwrap message_requestTime | __error__=""[10s]) by (message_systemId)

Response time to dive into the details of specific requests


How is formed

$fpm_source |=`external_calls` |=`systemId":"$external_system` | json | __error__ != "JSONParserErr" | line_format `{{.timestamp}}: {{ printf "%s" .message_requestTime }} {{ .message_error }}`

conclusions

It turned out that you can build your own APM from the means at hand.

It has downsides:

  • Short data retention; we fill the already heavy log storage with metrics;

  • Slow data aggregation;

  • You have to manually change configurations or override (embed into) the classes that work with the database/cache/external services.
    Also, do not forget about "hidden" calls to external storage: various integration libraries (s3, oauth2.0) that reach out to external systems and should also be monitored. For example, we also monitor requests to the oauth2 data server (the request is also proxied, with a double retry) – we only had to inherit from a couple of the library's base classes and add data logging.

All the minuses except the last one can be eliminated by switching to a different storage or to a ready-made APM implementation.

There are pluses:

  • Everything runs on the software that is already installed; you don't need to install anything new.

  • When investigating user problems (especially complaints about long response times), you can immediately check the execution timings of the various requests without leaving the business-logic logs;
    For example, our API was often slow precisely because of long calls to external services. By putting an nginx proxy in front of these services, we reduced the network overhead.

  • During development, you can immediately assess the complexity of the service being created from the container logs: you can see how many times it went to the database and to the cache, how many queries it executed, and which of them were heavier than the others.
