How to reduce log sizes without losing functionality

Developers are periodically torn between the desire to log as much information as possible and the need to make the volume of logs reasonably compact.

When we are talking about a small local service handling 10 requests a day, you can afford to log everything: the full text of the request, the full text of the response, and a pile of intermediate information (what came back from the database, what request was sent to the external service and what it answered, and so on). When we are talking about a relatively high-load service processing around 1000 rps, even a small fraction of this information can fill a modern hard drive within a couple of days.

A logical question arises: how to log only the necessary information?

Issues

The vast majority of modern logging systems support multiple message levels. In different systems they may be named slightly differently, but the essence is generally the same: Debug, Info, Warn, Error, Critical.

I will not dwell on them in detail; I will only say that they solve the described problem very conditionally. Yes, we can prioritize messages correctly and decide, for example, to write only messages at Info level and above to the logs; then, when we start running out of space, we can raise the threshold to Warn by changing a single line in the configuration file, discarding everything below it. This will solve the log size problem, but alas, only partially.
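In log4net, for instance, that single line is typically the level threshold on the root logger. A sketch of a standard log4net XML configuration (the appender name here is made up):

```xml
<log4net>
  <root>
    <!-- Raise this from INFO to WARN to shrink the logs -->
    <level value="INFO" />
    <appender-ref ref="RollingFileAppender" />
  </root>
</log4net>
```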

Quite often, Warn, Error and Critical events do not occur on their own but are preceded by other, lower-priority messages. For example, suppose the user applies some rule to all elements of a list, and after querying the database and applying some business logic the list turns out to be empty when it should not be: this is a typical Warn-level event, possibly an Error. But if we did not record the database response beforehand, the logs will not tell us the reason for the event. Perhaps the problem is in the database data, perhaps on the logic side. The trouble is that at the moment the database returns its response, we do not yet know whether a Warn-level event will occur, so we cannot tell whether the response is worth writing or will just become one more entry bloating the logs.

Another common example is logging long-running requests. To improve performance, it is useful to track bottlenecks in such requests. Moreover, if a request consists of many subqueries, each of which individually completes faster than the threshold value, we learn about the problem only at the very end, when the intermediate data is already gone.

In other words, we have a need to write to the log not only the data that we know at the time of the occurrence of a Warn level event or higher, but also some background history that could previously have been saved with a lower priority.

Modern logging systems, Serilog for example, have a concept of log enrichment. But this mechanism is better suited to adding extra parameters to every record, such as a trace identifier, than to grouping data across several records.
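For reference, Serilog-style enrichment looks roughly like this (a sketch, assuming the Serilog and Serilog.Sinks.Console packages; every record in the scope gets the same property, but the records themselves are still written individually):

```csharp
using System;
using Serilog;
using Serilog.Context;

// One-time setup: enable enrichment from the ambient LogContext
Log.Logger = new LoggerConfiguration()
    .Enrich.FromLogContext()
    .WriteTo.Console(outputTemplate: "[{TraceId}] {Message:lj}{NewLine}")
    .CreateLogger();

// Every record written inside this scope carries the same TraceId property
using (LogContext.PushProperty("TraceId", Guid.NewGuid()))
{
    Log.Information("Request received");
    Log.Information("DB call finished");
}
```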

In general, you need to write something of your own…

And some code

At the conceptual level, the idea is actually quite simple: we need some request-scoped storage in which we accumulate low-priority records and then flush them all at once when necessary. The only problem is that in an existing, working system, reworking the established logging is often too expensive. We would like functionality that groups low-priority logs without massive changes to the code. And such functionality exists.

Let's look at it using log4net as an example, since it is still actively used in our project. log4net has a logical-thread-scoped context (LogicalThreadContext) to which temporary values can be attached. After some simplifications, the code looks like the listings below. Similar code can be written for Serilog.

Let's create a wrapper over LoggerFactory to return the overloaded logger.

internal class DecoratedLoggerFactory : ILoggerFactory
{
	private readonly ILoggerFactory log4NetLoggerFactory;

	public DecoratedLoggerFactory(ILoggerFactory log4NetLoggerFactory)
	{
		this.log4NetLoggerFactory = log4NetLoggerFactory;
	}

	public Logger CreateLogger(ILoggerRepository repository, string name)
	{
		// The default logger is created only to resolve the final name;
		// we then substitute our decorated implementation for it
		var logger = log4NetLoggerFactory.CreateLogger(repository, name);
		return new DecoratedLog4NetLogger(logger.Name);
	}
}

And then the logger itself.

internal class DecoratedLog4NetLogger : Logger
{
	public DecoratedLog4NetLogger(string name) : base(name) { }

	public override void Log(Level level, object message, Exception exception)
	{
		if(IsEnabledFor(level)
			&& LogicalThreadContext.Properties["trace"] is TraceContext traceContext
			&& traceContext.IsTraceEnabled)
		{
			// High-priority messages are logged immediately as well as traced
			if(level > Level.Info)
				base.Log(level, message, exception);
			traceContext.Add(level, $"{message} {exception}");
		}
		else
		{
			base.Log(level, message, exception);
		}
	}

	// similar code for the other overridden methods
}

The code of the overridden method is quite simple. We check that logging is enabled for the given level and that the LogicalThreadContext contains the expected property of the expected type; if so, we add the entry to the traceContext object. High-priority records are, in addition, logged immediately rather than merely stashed in the context “until better times”: you need to find out about serious problems at once, because in a real emergency the logs accumulated in the context may never get written.

The additional IsTraceEnabled check makes it possible to control tracing per request, so that storing logs for a specific request can be switched off at any time.
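A hypothetical call site for that switch, using the TraceContext class and the "trace" property key from the listings in this article, might look like this:

```csharp
// Skip tracing for, say, a high-frequency health-check endpoint:
// the context stays in place, but stops accumulating entries
if(LogicalThreadContext.Properties["trace"] is TraceContext traceContext)
	traceContext.IsTraceEnabled = false;
```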

The TraceContext object is a collection of TraceItem objects that store the level of the message, the time it occurred, the text itself, and the offset value of the moment of recording relative to the beginning of the trace.

internal class TraceItem
{
	public TraceItem(Level level, DateTime utcNow, string message, double offset)
	{
		Message = message;
		Offset = offset;
		WritedAt = utcNow;
		Level = level;
	}

	public string Message { get; private set; }
	public double Offset { get; private set; }
	public DateTime WritedAt { get; private set; }
	public Level Level { get; private set; }
}

And the TraceContext object itself looks like this:

public class TraceContext
{
	private readonly ConcurrentQueue<TraceItem> items = new();
	private readonly Stopwatch stopwatch = new();
	private readonly object lockObject = new();

	public bool IsTraceEnabled { get; set; }
	public Level Level { get; private set; }

	public TraceContext()
	{
		IsTraceEnabled = true;
		Level = Level.Debug;
		stopwatch.Start();
	}

	public void Add(Level level, string message)
	{
		if(!IsTraceEnabled)
			return;

		// Raise the trace-wide level if needed; the double check under the lock
		// ensures a lower priority never overwrites a higher one
		if(level > Level)
		{
			lock(lockObject)
			{
				if(level > Level)
				{
					Level = level;
				}
			}
		}

		items.Enqueue(new TraceItem(level, DateTime.UtcNow, message, stopwatch.Elapsed.TotalMilliseconds));
	}

	public override string ToString()
	{
		if(!IsTraceEnabled)
			return string.Empty;

		StringBuilder stringBuilder = new();
		stringBuilder.AppendFormat(CultureInfo.InvariantCulture, "Max level: {0}. Total time: {1} ms", Level, stopwatch.Elapsed.TotalMilliseconds);
		stringBuilder.AppendLine();
		var i = 0;
		foreach(var e in items)
		{
			stringBuilder.AppendFormat(CultureInfo.InvariantCulture,
                    "{0} of {1} [{2}] at {3}, offset {4}: {5}", i + 1, items.Count, e.Level.ToString().ToUpper(), e.WritedAt.ToString("O"), e.Offset, e.Message);
			stringBuilder.AppendLine();
			i++;
		}
		return stringBuilder.ToString();
	}
}

In this code we are essentially doing three things.

First, when we add an entry, we save it to a thread-safe collection along with the time elapsed since the trace began. Secondly, when necessary, we raise the level of the entire trace from the initial Debug to the level of the current message, making sure a lower priority does not overwrite a higher one that has already been set. And thirdly, we override ToString to output the data in a fixed format, so that we get the entire detailed “sheet” for a request in a single record.

And the last thing we can do is write Middleware that will intercept requests and check them for signs indicating that the request details should be logged.

public class LoggingMiddleware
{
	private readonly RequestDelegate next;
	private readonly ILogger logger;
	private readonly ISettings settingsProvider;

	public LoggingMiddleware(RequestDelegate next, ILogger<LoggingMiddleware> logger, ISettings settingsProvider)
	{
		this.next = next;
		this.logger = logger;
		this.settingsProvider = settingsProvider;
	}

	public async Task InvokeAsync(HttpContext context)
	{
		var sw = Stopwatch.StartNew();
		// The global switch lets us turn tracing off quickly if something goes wrong
		LogicalThreadContext.Properties["trace"] = settingsProvider.EnableGlobalTracing
			? new TraceContext()
			: null;

		await next(context);

		sw.Stop();
		if(LogicalThreadContext.Properties["trace"] is TraceContext traceContext
			&& (traceContext.Level >= Level.Warn
				|| sw.ElapsedMilliseconds >= settingsProvider.LongCallTriggerDurationMs))
		{
			if(traceContext.Level >= Level.Warn)
				// Convert maps the log4net Level to the corresponding Microsoft LogLevel
				logger.Log(Convert(traceContext.Level), traceContext.ToString());
			else
				logger.LogWarning($"Long call. {traceContext}");
		}
		LogicalThreadContext.Properties["trace"] = null;
	}
}

Essentially, we look at whether request tracing is enabled at the global level (so we can turn it off quickly if something goes wrong) and whether tracing is allowed for the current request. We then check whether everything went fine with the request, and if not, we write the complete picture, with all the details, to the log.
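The middleware is plugged into the ASP.NET Core pipeline in the usual way; it should be registered early so that the TraceContext covers everything downstream (`app` here is the application's IApplicationBuilder):

```csharp
// Register before MVC/endpoints so the whole request is traced
app.UseMiddleware<LoggingMiddleware>();
```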

The result in the logs looks something like this:
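An illustrative, made-up trace in the ToString format shown above (the messages and timings are invented for the example):

```
Max level: WARN. Total time: 1532.8 ms
1 of 3 [DEBUG] at 2024-05-17T10:21:03.1204418Z, offset 0.4: Loading rules for user 42
2 of 3 [DEBUG] at 2024-05-17T10:21:03.6401772Z, offset 520.3: DB returned 0 rows
3 of 3 [WARN] at 2024-05-17T10:21:04.6529344Z, offset 1531.9: Rule list is empty
```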

Since a Warn-level entry is present, the entire trace meets the logging criterion and is written at that level.

Conclusion

Based on the results of practical application, I can say that this code reduced log volume by more than a factor of three without any real loss of information. And this is far from the limit, since some types of logs were deliberately left outside this mechanism.

Separately, the developers noted that working with the resulting consolidated “sheet” is somewhat more convenient than first finding a problematic request by indirect signs and then pulling out every record with the same id.

The obvious disadvantages of this approach are the cost of keeping log entries in memory and the extra string concatenation when the trace is finally written. But we expected these costs to be insignificant against the general background, and in practice there has indeed been no noticeable increase in memory or CPU usage.
