Understanding system calls in Linux with strace

Translation of the article prepared especially for students basic and advanced Administrator Linux courses.


The syscall is the mechanism by which user programs interact with the Linux kernel, and strace is a powerful tool for keeping track of them. To better understand how the operating system works, it is helpful to understand how they work.

The operating system can be divided into two modes of operation:

  • Kernel mode is the privileged mode used by the operating system kernel.
  • User mode is the mode in which most user applications run.

Users typically use command line utilities and a graphical interface (GUI) for their daily work. At the same time, system calls invisibly work in the background, referring to the kernel to do the work.

System calls are very similar to function calls in the sense that arguments are passed in and they return values. The only difference is that system calls work at the kernel level, but functions do not. Switching from user mode to kernel mode is carried out using a special mechanism interrupts

Most of these details are hidden from the user in the system libraries (glibc on Linux systems). System calls are generic in nature, but despite this, the mechanics of their execution are largely hardware-dependent.

This article explores several practical examples of parsing system calls using strace… The examples use Red Hat Enterprise Linux, but all commands should work on other Linux distributions as well:

[root@sandbox ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.7 (Maipo)
[root@sandbox ~]#
[root@sandbox ~]# uname -r
3.10.0-1062.el7.x86_64
[root@sandbox ~]#

First, make sure you have the necessary tools installed on your system. Check if installed strace can be done using the command below. To view the version strace run it with the -V parameter:

[root@sandbox ~]# rpm -qa | grep -i strace
strace-4.12-9.el7.x86_64
[root@sandbox ~]#
[root@sandbox ~]# strace -V
strace -- version 4.12
[root@sandbox ~]#

If strace not installed, then install by running:

yum install strace

For example, create a test directory at /tmp and two files using the command touch:

[root@sandbox ~]# cd /tmp/
[root@sandbox tmp]#
[root@sandbox tmp]# mkdir testdir
[root@sandbox tmp]#
[root@sandbox tmp]# touch testdir/file1
[root@sandbox tmp]# touch testdir/file2
[root@sandbox tmp]#

(I am using the directory /tmp just because everyone has access to it, but you can use any other.)

Using the command ls check what’s in the directory testdir files were created:

[root@sandbox tmp]# ls testdir/
file1  file2
[root@sandbox tmp]#

You are probably using the command ls every day without realizing that system calls are running under the hood. This is where abstraction comes into play. This is how this command works:

Утилита командной строки -> Функции системных библиотек (glibc) -> Системные вызовы

Command ls calls functions from Linux system libraries (glibc). These libraries, in turn, call system calls, which do most of the work.

If you want to know what functions were called from the glibc library, then use the command ltrace with the next command ls testdir/:

ltrace ls testdir/

If ltrace not installed, then install:

yum install ltrace

There will be a lot of information on the screen, but don’t worry – we’ll cover that later. Here are some of the important library functions from the output ltrace:

opendir("testdir/")                                  = { 3 }
readdir({ 3 })                                       = { 101879119, "." }
readdir({ 3 })                                       = { 134, ".." }
readdir({ 3 })                                       = { 101879120, "file1" }
strlen("file1")                                      = 5
memcpy(0x1665be0, "file1", 6)                      = 0x1665be0
readdir({ 3 })                                       = { 101879122, "file2" }
strlen("file2")                                      = 5
memcpy(0x166dcb0, "file2", 6)                      = 0x166dcb0
readdir({ 3 })                                       = nil
closedir({ 3 })    

By examining this output, you can probably understand what’s going on. Directory named testdir opens with a library function opendirfollowed by function calls readdirreading the contents of the directory. At the end there is a function call closedirwhich closes the previously opened directory. For now, ignore the rest of the functions like strlen and memcpy

As you can see, it is easy to see the library functions being called, but in this article we will focus on the system calls that are called by the system library functions.

To view system calls use strace with the team ls testdiras shown below. And again you get a bunch of incoherent information:

[root@sandbox tmp]# strace ls testdir/
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
brk(NULL)                               = 0x1f12000
<<< truncated strace output >>>
write(1, "file1  file2n", 13file1  file2
)          = 13
close(1)                                = 0
munmap(0x7fd002c8d000, 4096)            = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
[root@sandbox tmp]#

As a result of execution strace you will get a list of system calls made when the command was running ls… All system calls can be divided into the following categories:

  • Process management
  • File management
  • Directory and file system management
  • Other

There is a convenient way to analyze the received information – write the output to a file using the option -o

[root@sandbox tmp]# strace -o trace.log ls testdir/
file1  file2
[root@sandbox tmp]#

This time there will be no data on the screen – the command ls will work as expected by showing a list of files and writing all the output strace to file trace.log… For a simple command ls the file contains almost 100 lines:

[root@sandbox tmp]# ls -l trace.log
-rw-r--r--. 1 root root 7809 Oct 12 13:52 trace.log
[root@sandbox tmp]#
[root@sandbox tmp]# wc -l trace.log
114 trace.log
[root@sandbox tmp]#

Take a look at the first line in the file trace.log:

execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
  • At the beginning of the line is the name of the system call being executed — execve.
  • The text in parentheses is the arguments passed to the system call.
  • The number after the = sign (in this case, 0) is the value returned by the system call.

Now the result doesn’t seem too scary, does it? And you can apply the same logic for other lines as well.

Pay attention to the only command you called – ls testdir… You know the directory name used by the command lsso why not use grep for testdir in file trace.log and not see what is found? Look closely at the result:

[root@sandbox tmp]# grep testdir trace.log
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
stat("testdir/", {st_mode=S_IFDIR|0755, st_size=32, ...}) = 0
openat(AT_FDCWD, "testdir/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
[root@sandbox tmp]#

Coming back to the above analysis execve, can you tell what the next system call is doing?

execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0

You don’t have to remember all the system calls and what they do: everything is in the documentation. Man pages rush to the rescue! Make sure the package is installed before running man command man-pages:

[root@sandbox tmp]# rpm -qa | grep -i man-pages
man-pages-3.53-5.el7.noarch
[root@sandbox tmp]#

Remember you need to add “2” between the command man and the name of the system call. If you read in man about man (man man), you will see that section 2 is reserved for system calls. Likewise, if you want information about library functions, then you need to add 3 between man and the name of the library function.

Below are the section numbers man:

1. Выполняемые программы или команды для командной оболочки.
2. Системные вызовы (функции, предоставляемые ядром).
3. Библиотечные вызовы (функции программных библиотек).
4. Специальные файлы (которые обычно находятся в /dev).

To view the documentation for a system call, run man with the name of that system call.

man 2 execve

As per the documentation, the system call execve executes the program that is passed to it in parameters (in this case, it is ls). Additional parameters for ls are also passed to it. In this example it is testdir… Hence, this system call simply runs ls from testdir as a parameter:

'execve - execute program'

'DESCRIPTION
       execve()  executes  the  program  pointed to by filename'

Into the next system call stat parameter is passed testdir:

stat("testdir/", {st_mode=S_IFDIR|0755, st_size=32, ...}) = 0

To view the documentation use man 2 stat… The stat system call returns information about the specified file. Remember that everything in Linux is a file, including directories.

Next system call openat opens testdir… Note that the return value is 3. This is the file descriptor that will be used in subsequent system calls:

openat(AT_FDCWD, "testdir/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3

Now open the file

trace.log

and notice the line following the system call openat… You will see a system call getdentswhich does most of the work required to execute the command ls testdir… Now let’s execute grep getdents for file trace.log:

[root@sandbox tmp]# grep getdents trace.log
getdents(3, /* 4 entries */, 32768)     = 112
getdents(3, /* 0 entries */, 32768)     = 0
[root@sandbox tmp]#

The documentation (man getdents) it says that getdents reads directory entries, this is actually what we need. Note that the argument for getdent equals 3 – this is the file descriptor obtained earlier from the system call openat

Now that the contents of the directory have been received, we need a way to display the information in the terminal. So we do grep for another system call writewhich is used to output to the terminal:

[root@sandbox tmp]# grep write trace.log
write(1, "file1  file2n", 13)          = 13
[root@sandbox tmp]#

In the arguments, you can see the names of the files that will be output: file1 and file2… For the first argument (1), remember that on Linux, three file descriptors are opened by default for any process:

  • 0 – standard input stream
  • 1 – standard output stream
  • 2 – standard error stream

So the system call write deduces file1 and file2 to standard output, which is the terminal, denoted by the number 1.

Now you know which system calls did most of the work for the command ls testdir/… But what about the other 100+ system calls in the file trace.log?

The operating system does a lot of supporting things to start the process, so much of what you see in the file trace.log Is the initialization and cleaning of the process. Take a look at the trace.log file in its entirety and try to understand what happens when the command is run. ls

Now you can analyze system calls for any program. The strace utility also provides many useful command line options, some of which are described below.

Default strace does not display all information about system calls. However, she has the option -v verbosewhich will show additional information about each system call:

strace -v ls testdir

It’s good practice to use a parameter -f to keep track of child processes created by a running process:

strace -f ls testdir

What if you only want the names of the system calls, the number of times they are run, and the percentage of time spent executing? You can use the option -cto get these statistics:

strace -c ls testdir/

If you want to trace a specific system call, for example open, and ignore the others, you can use the option -e with the system call name:

[root@sandbox tmp]# strace -e open ls testdir
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libacl.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpcre.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
file1  file2
+++ exited with 0 +++
[root@sandbox tmp]#

What if you need to filter by multiple system calls? Don’t worry, you can use the same option -e and separate the required system calls with a comma. For example, for write and getdent:

[root@sandbox tmp]# strace -e write,getdents ls testdir
getdents(3, /* 4 entries */, 32768)     = 112
getdents(3, /* 0 entries */, 32768)     = 0
write(1, "file1  file2n", 13file1  file2
)          = 13
+++ exited with 0 +++
[root@sandbox tmp]#

So far, we’ve only tracked explicit command runs. But what about the commands that were run earlier? What if you want to track demons? To do this, have strace there is a special option -pto which you can pass the process ID.

We will not start the daemon, but use the command catwhich displays the contents of the file passed as an argument to it. But if no argument is specified, then the command cat will just wait for user input. After entering text, it will display the entered text on the screen. And so on until the user clicks Ctrl+C to exit.

Run the command cat on one terminal.

[root@sandbox tmp]# cat

On another terminal, find the process id (PID) using the command ps:

[root@sandbox ~]# ps -ef | grep cat
root      22443  20164  0 14:19 pts/0    00:00:00 cat
root      22482  20300  0 14:20 pts/1    00:00:00 grep --color=auto cat
[root@sandbox ~]#

Now run strace with option -p and the PID that you found with ps… After launch strace will display information about the process to which it connected, as well as its PID. Now strace monitors system calls executed by the command cat… The first syscall you’ll see is read, waiting for input from stream 0, that is, from standard input, which is now the terminal the command is running on. cat:

[root@sandbox ~]# strace -p 22443
strace: Process 22443 attached
read(0,

Now go back to the terminal where you left the running command cat, and enter some text. For demonstration, I entered x0x0… note that cat just repeated what I entered and x0x0 will appear twice on the screen.

[root@sandbox tmp]# cat
x0x0
x0x0

Go back to the terminal where strace was connected to the process cat… Now you see two new system calls: the previous one readwho has now read x0x0, and one more to write writewhich writes x0x0 back to the terminal, and again a new one readthat is waiting to be read from the terminal. Note that standard input (0) and standard output (1) are on the same terminal:

[root@sandbox ~]# strace -p 22443
strace: Process 22443 attached
read(0, "x0x0n", 65536)                = 5
write(1, "x0x0n", 5)                   = 5
read(0,

Imagine how a startup can benefit you strace for demons: you can see everything that is done in the background. Complete the command

cat

by clicking

Ctrl+C

… This will also terminate the session

strace

since the monitored process has been terminated.

To view the timestamps of system calls use the option -t:

[root@sandbox ~]#strace -t ls testdir/

14:24:47 execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
14:24:47 brk(NULL)                      = 0x1f07000
14:24:47 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2530bc8000
14:24:47 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
14:24:47 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3

What if you want to know the time spent between system calls? There is a convenient option -rwhich shows the time taken to execute each system call. Pretty helpful, isn’t it?

[root@sandbox ~]#strace -r ls testdir/

0.000000 execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
0.000368 brk(NULL)                 = 0x1966000
0.000073 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb6b1155000
0.000047 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
0.000119 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3

Conclusion

Utility strace very handy for learning about system calls in Linux. For other command line options, see man and online documentation.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *