CutTheLog – when it’s too big



Parsing logs is a routine task for system administrators and DevOps engineers, and it is usually done on a schedule. Data is constantly appended to the end of the file, so the file has to be reopened periodically and the new records processed. Each time, you need to find the place where the previous read stopped. Let’s automate that search so that it doesn’t require scanning the log from the beginning.

The first thing that comes to mind is to remember the position in the file where we left off. The tell function returns it. On the next iteration, we jump straight to that position and continue reading from there. The seek function does this at practically no cost.
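
A minimal sketch of this naive scheme, using only the standard file methods seek and tell. The function name read_new_lines and the temporary file are illustrative assumptions, not part of any library:

```python
import os
import tempfile


def read_new_lines(path, offset=0):
    """Return (lines, new_offset) for everything written after `offset`."""
    with open(path, 'rb') as f:
        f.seek(offset)          # jump straight to where we stopped last time
        lines = f.readlines()   # read everything that was appended since
        return lines, f.tell()  # remember the new end position


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'example.log')
        with open(path, 'wb') as f:
            f.write(b'one\ntwo\n')
        lines, offset = read_new_lines(path)
        print(lines, offset)
        with open(path, 'ab') as f:
            f.write(b'three\n')
        lines, offset = read_new_lines(path, offset)
        print(lines, offset)
```

The caller has to keep the returned offset somewhere between runs and pass it back in on the next call.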

The scheme is simple and it works, but problems arise when the log is rotated. First, seeking beyond the end of the file does not raise an error; reading simply returns empty data. This means that before seeking, you need to check that the saved offset does not exceed the file size, and if it does, read the log from the beginning. Second, even with this check, enough data may have been written after the rotation for the check to pass. Then, as if nothing had happened, we continue “from the place where we finished” and skip a chunk of data.
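
Both problems can be reproduced in a few lines. This is a sketch under assumptions: read_with_size_check is a hypothetical helper, and rotation is simulated by overwriting a temporary file with new content.

```python
import os
import tempfile


def read_with_size_check(path, offset=0):
    """Read from `offset`, starting over if the file is shorter than that."""
    if offset > os.path.getsize(path):
        offset = 0  # the file shrank, so it was probably rotated
    with open(path, 'rb') as f:
        f.seek(offset)
        lines = f.readlines()
        return lines, f.tell()


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'example.log')
        with open(path, 'wb') as f:
            f.write(b'one\ntwo\n')              # 8 bytes
        _, offset = read_with_size_check(path)  # offset is now 8

        # Seeking past the end is not an error; the read is just empty.
        with open(path, 'rb') as f:
            f.seek(1000)
            print(f.read())                     # b''

        # Simulated rotation: the new file is already longer than 8 bytes,
        # so the size check passes and the first 8 bytes are skipped.
        with open(path, 'wb') as f:
            f.write(b'seven\neight\nnine\n')    # 17 bytes
        lines, offset = read_with_size_check(path, offset)
        print(lines)                            # [b'ght\n', b'nine\n']
```

The second read starts in the middle of the line "eight", which is exactly the silent data loss described above.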

There is a way to improve the algorithm: remember the offset not of the end of the file but of the beginning of the last line read, together with the line itself. As before, we jump to the cached position, then read a line and compare it with the saved one. If they match, we continue reading from there. If they differ, we return to the beginning of the file. Since each log line usually contains a timestamp, it is very unlikely that the check passes by accident after a rotation.
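
The improved check might look roughly like this. It is an illustrative sketch of the idea, not the code of the module described below; the name read_unseen is an assumption, and so is the choice to ignore a trailing line that has no newline yet.

```python
import os
import tempfile


def read_unseen(path, offset=0, last_line=None):
    """Return (new_lines, offset, last_line) using the last-line check."""
    new_lines = []
    with open(path, 'rb') as f:
        if last_line is not None:
            f.seek(offset)
            if f.readline() != last_line:
                f.seek(0)  # saved line is gone: rotated, so start over
        while True:
            pos = f.tell()
            line = f.readline()
            # Assumption: a line without a newline is still being written,
            # so leave it for the next run.
            if not line.endswith(b'\n'):
                break
            new_lines.append(line)
            offset, last_line = pos, line  # start and value of last line
    return new_lines, offset, last_line


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'example.log')
        with open(path, 'wb') as f:
            f.write(b'one\ntwo\n')
        lines, offset, last = read_unseen(path)
        print(lines, offset, last)
        # Simulated rotation with a longer file: the comparison fails,
        # so the whole new file is returned instead of a truncated tail.
        with open(path, 'wb') as f:
            f.write(b'seven\neight\nnine\n')
        lines, offset, last = read_unseen(path, offset, last)
        print(lines, offset, last)
```

If nothing new was written, the function returns an empty list and leaves the saved offset and line unchanged, so the next call repeats the same check.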

I implemented this idea in a small Python module called cutthelog. It has a single class, CutTheLog, which opens the file and performs the check described above. The result is an iterator over the lines of the log that have not been seen yet. To create an object, we pass the path to the file and, optionally, the offset and value of the line on which we finished reading. It looks like this:

import sys
from cutthelog import CutTheLog

ctl = CutTheLog('/var/log/syslog', offset=7777, last_line=b'...')
with ctl as line_iter:
    print('Starting at:', ctl.get_position(), end='', file=sys.stderr)
    for line in line_iter:
        print(line.decode(), end='')
print('Ending at:', ctl.get_position(), end='', file=sys.stderr)

Log lines are of type bytes, since even on my desktop Ubuntu the log /var/log/syslog contains characters that cannot be decoded into Unicode. Each line returned by the iterator updates the position stored in the object. This means that we can stop after an arbitrary number of lines rather than reading to the end of the file. The class also has methods for saving the cached position to disk and reading it back.
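
Because the lines arrive as bytes, a plain decode() can raise UnicodeDecodeError on such characters. One possible way to print them anyway is the standard errors='replace' handler; the helper name decode_log_line and the sample bytes below are made up for illustration:

```python
def decode_log_line(raw):
    """Decode a log line, replacing undecodable bytes with U+FFFD."""
    return raw.decode('utf-8', errors='replace')


if __name__ == '__main__':
    raw = b'caf\xe9 started\n'  # \xe9 is not valid UTF-8 in this position
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError as exc:
        print('strict decoding failed:', exc)
    print(decode_log_line(raw), end='')
```

The comparison with the saved last line should still be done on the raw bytes, since the replacement loses information.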

The module also works as a command-line utility. It prints the contents of the file to the console and stores the position it reached in the cache. The next time it is run, only the lines written since the previous read are printed. It looks like this:

$ echo -e "one\ntwo\nthree" > example
$ cutthelog example
one
two
three
$ cutthelog example
$ echo -e "four\nfive\nsix" >> example
$ cutthelog example
four
five
six
$ cutthelog example
$ echo -e "seven\neight\nnine\nten\neleven\ntwelve" > example
$ cutthelog example
seven
eight
nine
ten
eleven
twelve

The cache is stored in the file .cutthelog in the user’s home directory, so running the utility as a different user will print the entire file. The path to the cache can be set manually with the -c/--cache-file option.
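
To illustrate what such a cache has to hold, here is a sketch that keeps an offset and a last line per log path. The JSON layout, the base64 encoding, and the function names are all assumptions made for this example; the actual format of .cutthelog may be different.

```python
import base64
import json
import os
import tempfile


def load_cache(cache_path):
    """Return the whole cache as a dict, or an empty dict if it is missing."""
    try:
        with open(cache_path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def save_position(cache_path, log_path, offset, last_line):
    """Store (offset, last_line) for `log_path` in a hypothetical JSON cache."""
    cache = load_cache(cache_path)
    cache[os.path.abspath(log_path)] = {
        'offset': offset,
        # JSON cannot hold raw bytes, so the line is stored as base64 text.
        'last_line': base64.b64encode(last_line).decode('ascii'),
    }
    with open(cache_path, 'w') as f:
        json.dump(cache, f)


def load_position(cache_path, log_path):
    """Return (offset, last_line), or (0, None) for a log never seen before."""
    entry = load_cache(cache_path).get(os.path.abspath(log_path))
    if entry is None:
        return 0, None
    return entry['offset'], base64.b64decode(entry['last_line'])


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        cache_path = os.path.join(tmp, 'cache.json')
        save_position(cache_path, 'example', 8, b'three\n')
        print(load_position(cache_path, 'example'))
        print(load_position(cache_path, 'other'))
```

Keying the entries by absolute path lets one cache file serve several logs.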

The module code is on GitHub, and the Python package is available on PyPI. I hope you find them useful.
