Understanding system calls in Linux with strace
The syscall is the mechanism by which user programs interact with the Linux kernel, and strace is a powerful tool for keeping track of them. To better understand how the operating system works, it is helpful to understand how they work.
The operating system can be divided into two modes of operation:
- Kernel mode is the privileged mode used by the operating system kernel.
- User mode is the mode in which most user applications run.
Users typically use command line utilities and a graphical interface (GUI) for their daily work. At the same time, system calls invisibly work in the background, referring to the kernel to do the work.
System calls are very similar to function calls in the sense that arguments are passed in and they return values. The only difference is that system calls work at the kernel level, but functions do not. Switching from user mode to kernel mode is carried out using a special mechanism interrupts…
Most of these details are hidden from the user in the system libraries (glibc on Linux systems). System calls are generic in nature, but despite this, the mechanics of their execution are largely hardware-dependent.
This article explores several practical examples of parsing system calls using strace
… The examples use Red Hat Enterprise Linux, but all commands should work on other Linux distributions as well:
[root@sandbox ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.7 (Maipo)
[root@sandbox ~]#
[root@sandbox ~]# uname -r
3.10.0-1062.el7.x86_64
[root@sandbox ~]#
First, make sure you have the necessary tools installed on your system. Check if installed strace
can be done using the command below. To view the version strace
run it with the -V parameter:
[root@sandbox ~]# rpm -qa | grep -i strace
strace-4.12-9.el7.x86_64
[root@sandbox ~]#
[root@sandbox ~]# strace -V
strace -- version 4.12
[root@sandbox ~]#
If strace
not installed, then install by running:
yum install strace
For example, create a test directory at /tmp
and two files using the command touch
:
[root@sandbox ~]# cd /tmp/
[root@sandbox tmp]#
[root@sandbox tmp]# mkdir testdir
[root@sandbox tmp]#
[root@sandbox tmp]# touch testdir/file1
[root@sandbox tmp]# touch testdir/file2
[root@sandbox tmp]#
(I am using the directory /tmp
just because everyone has access to it, but you can use any other.)
Using the command ls
check what’s in the directory testdir
files were created:
[root@sandbox tmp]# ls testdir/
file1 file2
[root@sandbox tmp]#
You are probably using the command ls
every day without realizing that system calls are running under the hood. This is where abstraction comes into play. This is how this command works:
Утилита командной строки -> Функции системных библиотек (glibc) -> Системные вызовы
Command ls
calls functions from Linux system libraries (glibc). These libraries, in turn, call system calls, which do most of the work.
If you want to know what functions were called from the glibc library, then use the command ltrace
with the next command ls testdir/
:
ltrace ls testdir/
If ltrace
not installed, then install:
yum install ltrace
There will be a lot of information on the screen, but don’t worry – we’ll cover that later. Here are some of the important library functions from the output ltrace
:
opendir("testdir/") = { 3 }
readdir({ 3 }) = { 101879119, "." }
readdir({ 3 }) = { 134, ".." }
readdir({ 3 }) = { 101879120, "file1" }
strlen("file1") = 5
memcpy(0x1665be0, "file1 ", 6) = 0x1665be0
readdir({ 3 }) = { 101879122, "file2" }
strlen("file2") = 5
memcpy(0x166dcb0, "file2 ", 6) = 0x166dcb0
readdir({ 3 }) = nil
closedir({ 3 })
By examining this output, you can probably understand what’s going on. Directory named testdir
opens with a library function opendir
followed by function calls readdir
reading the contents of the directory. At the end there is a function call closedir
which closes the previously opened directory. For now, ignore the rest of the functions like strlen
and memcpy
…
As you can see, it is easy to see the library functions being called, but in this article we will focus on the system calls that are called by the system library functions.
To view system calls use strace
with the team ls testdir
as shown below. And again you get a bunch of incoherent information:
[root@sandbox tmp]# strace ls testdir/
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
brk(NULL) = 0x1f12000
<<< truncated strace output >>>
write(1, "file1 file2n", 13file1 file2
) = 13
close(1) = 0
munmap(0x7fd002c8d000, 4096) = 0
close(2) = 0
exit_group(0) = ?
+++ exited with 0 +++
[root@sandbox tmp]#
As a result of execution strace
you will get a list of system calls made when the command was running ls
… All system calls can be divided into the following categories:
- Process management
- File management
- Directory and file system management
- Other
There is a convenient way to analyze the received information – write the output to a file using the option -o
…
[root@sandbox tmp]# strace -o trace.log ls testdir/
file1 file2
[root@sandbox tmp]#
This time there will be no data on the screen – the command ls
will work as expected by showing a list of files and writing all the output strace
to file trace.log
… For a simple command ls
the file contains almost 100 lines:
[root@sandbox tmp]# ls -l trace.log
-rw-r--r--. 1 root root 7809 Oct 12 13:52 trace.log
[root@sandbox tmp]#
[root@sandbox tmp]# wc -l trace.log
114 trace.log
[root@sandbox tmp]#
Take a look at the first line in the file trace.log
:
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
- At the beginning of the line is the name of the system call being executed — execve.
- The text in parentheses is the arguments passed to the system call.
- The number after the = sign (in this case, 0) is the value returned by the system call.
Now the result doesn’t seem too scary, does it? And you can apply the same logic for other lines as well.
Pay attention to the only command you called – ls testdir
… You know the directory name used by the command ls
so why not use grep
for testdir
in file trace.log
and not see what is found? Look closely at the result:
[root@sandbox tmp]# grep testdir trace.log
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
stat("testdir/", {st_mode=S_IFDIR|0755, st_size=32, ...}) = 0
openat(AT_FDCWD, "testdir/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
[root@sandbox tmp]#
Coming back to the above analysis execve
, can you tell what the next system call is doing?
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
You don’t have to remember all the system calls and what they do: everything is in the documentation. Man pages rush to the rescue! Make sure the package is installed before running man command man-pages
:
[root@sandbox tmp]# rpm -qa | grep -i man-pages
man-pages-3.53-5.el7.noarch
[root@sandbox tmp]#
Remember you need to add “2” between the command man
and the name of the system call. If you read in man
about man
(man man
), you will see that section 2 is reserved for system calls. Likewise, if you want information about library functions, then you need to add 3 between man
and the name of the library function.
Below are the section numbers man
:
1. Выполняемые программы или команды для командной оболочки.
2. Системные вызовы (функции, предоставляемые ядром).
3. Библиотечные вызовы (функции программных библиотек).
4. Специальные файлы (которые обычно находятся в /dev).
To view the documentation for a system call, run man with the name of that system call.
man 2 execve
As per the documentation, the system call execve
executes the program that is passed to it in parameters (in this case, it is ls
). Additional parameters for ls are also passed to it. In this example it is testdir
… Hence, this system call simply runs ls
from testdir
as a parameter:
'execve - execute program'
'DESCRIPTION
execve() executes the program pointed to by filename'
Into the next system call stat
parameter is passed testdir
:
stat("testdir/", {st_mode=S_IFDIR|0755, st_size=32, ...}) = 0
To view the documentation use man 2 stat
… The stat system call returns information about the specified file. Remember that everything in Linux is a file, including directories.
Next system call openat
opens testdir
… Note that the return value is 3. This is the file descriptor that will be used in subsequent system calls:
openat(AT_FDCWD, "testdir/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
Now open the file
trace.log
and notice the line following the system call openat
… You will see a system call getdents
which does most of the work required to execute the command ls testdir
… Now let’s execute grep getdents
for file trace.log
:
[root@sandbox tmp]# grep getdents trace.log
getdents(3, /* 4 entries */, 32768) = 112
getdents(3, /* 0 entries */, 32768) = 0
[root@sandbox tmp]#
The documentation (man getdents
) it says that getdents
reads directory entries, this is actually what we need. Note that the argument for getdent
equals 3 – this is the file descriptor obtained earlier from the system call openat
…
Now that the contents of the directory have been received, we need a way to display the information in the terminal. So we do grep
for another system call write
which is used to output to the terminal:
[root@sandbox tmp]# grep write trace.log
write(1, "file1 file2n", 13) = 13
[root@sandbox tmp]#
In the arguments, you can see the names of the files that will be output: file1
and file2
… For the first argument (1), remember that on Linux, three file descriptors are opened by default for any process:
- 0 – standard input stream
- 1 – standard output stream
- 2 – standard error stream
So the system call write
deduces file1
and file2
to standard output, which is the terminal, denoted by the number 1.
Now you know which system calls did most of the work for the command ls testdir/
… But what about the other 100+ system calls in the file trace.log
?
The operating system does a lot of supporting things to start the process, so much of what you see in the file trace.log
Is the initialization and cleaning of the process. Take a look at the trace.log file in its entirety and try to understand what happens when the command is run. ls
…
Now you can analyze system calls for any program. The strace utility also provides many useful command line options, some of which are described below.
Default strace
does not display all information about system calls. However, she has the option -v verbose
which will show additional information about each system call:
strace -v ls testdir
It’s good practice to use a parameter -f
to keep track of child processes created by a running process:
strace -f ls testdir
What if you only want the names of the system calls, the number of times they are run, and the percentage of time spent executing? You can use the option -c
to get these statistics:
strace -c ls testdir/
If you want to trace a specific system call, for example open
, and ignore the others, you can use the option -e
with the system call name:
[root@sandbox tmp]# strace -e open ls testdir
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libacl.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpcre.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
file1 file2
+++ exited with 0 +++
[root@sandbox tmp]#
What if you need to filter by multiple system calls? Don’t worry, you can use the same option -e
and separate the required system calls with a comma. For example, for write
and getdent
:
[root@sandbox tmp]# strace -e write,getdents ls testdir
getdents(3, /* 4 entries */, 32768) = 112
getdents(3, /* 0 entries */, 32768) = 0
write(1, "file1 file2n", 13file1 file2
) = 13
+++ exited with 0 +++
[root@sandbox tmp]#
So far, we’ve only tracked explicit command runs. But what about the commands that were run earlier? What if you want to track demons? To do this, have strace
there is a special option -p
to which you can pass the process ID.
We will not start the daemon, but use the command cat
which displays the contents of the file passed as an argument to it. But if no argument is specified, then the command cat
will just wait for user input. After entering text, it will display the entered text on the screen. And so on until the user clicks Ctrl+C
to exit.
Run the command cat
on one terminal.
[root@sandbox tmp]# cat
On another terminal, find the process id (PID) using the command ps
:
[root@sandbox ~]# ps -ef | grep cat
root 22443 20164 0 14:19 pts/0 00:00:00 cat
root 22482 20300 0 14:20 pts/1 00:00:00 grep --color=auto cat
[root@sandbox ~]#
Now run strace
with option -p
and the PID that you found with ps
… After launch strace
will display information about the process to which it connected, as well as its PID. Now strace
monitors system calls executed by the command cat
… The first syscall you’ll see is read, waiting for input from stream 0, that is, from standard input, which is now the terminal the command is running on. cat
:
[root@sandbox ~]# strace -p 22443
strace: Process 22443 attached
read(0,
Now go back to the terminal where you left the running command cat
, and enter some text. For demonstration, I entered x0x0
… note that cat
just repeated what I entered and x0x0
will appear twice on the screen.
[root@sandbox tmp]# cat
x0x0
x0x0
Go back to the terminal where strace
was connected to the process cat
… Now you see two new system calls: the previous one read
who has now read x0x0
, and one more to write write
which writes x0x0
back to the terminal, and again a new one read
that is waiting to be read from the terminal. Note that standard input (0) and standard output (1) are on the same terminal:
[root@sandbox ~]# strace -p 22443
strace: Process 22443 attached
read(0, "x0x0n", 65536) = 5
write(1, "x0x0n", 5) = 5
read(0,
Imagine how a startup can benefit you strace
for demons: you can see everything that is done in the background. Complete the command
cat
by clicking
Ctrl+C
… This will also terminate the session
strace
since the monitored process has been terminated.
To view the timestamps of system calls use the option -t
:
[root@sandbox ~]#strace -t ls testdir/
14:24:47 execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
14:24:47 brk(NULL) = 0x1f07000
14:24:47 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2530bc8000
14:24:47 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
14:24:47 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
What if you want to know the time spent between system calls? There is a convenient option -r
which shows the time taken to execute each system call. Pretty helpful, isn’t it?
[root@sandbox ~]#strace -r ls testdir/
0.000000 execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
0.000368 brk(NULL) = 0x1966000
0.000073 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb6b1155000
0.000047 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
0.000119 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
Conclusion
Utility strace
very handy for learning about system calls in Linux. For other command line options, see man and online documentation.