Linux: A Detailed Introduction to the strace Command

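A system call is the programmatic way in which a program requests a service from the kernel, and strace is a powerful tool that traces the interaction between user processes and the Linux kernel.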

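To understand how an operating system works, you first need to understand how system calls work. One of the main functions of an operating system is to provide abstractions to user programs.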

An operating system can be roughly divided into two modes:

Kernel mode: a privileged and powerful mode used by the operating system kernel

User mode: where most user applications run

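Users mostly rely on command-line programs and graphical user interfaces (GUIs) for everyday tasks. System calls work silently in the background, interacting with the kernel to get the work done.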

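A system call is very similar to a function call: both accept and process arguments and return values. The key difference is that a system call enters the kernel, while a function call does not. The switch from user space to kernel space is made through a special trap mechanism.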

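The following sections run strace on some common commands to analyze the system calls each one makes and to explore some practical examples. The examples use Red Hat Enterprise Linux, but the commands should work the same way on other Linux distributions: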

[root@sandbox ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.7 (Maipo)
[root@sandbox ~]# 
[root@sandbox ~]# uname -r
3.10.0-1062.el7.x86_64
[root@sandbox ~]#

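First, make sure the required tools are installed on the system. Use the rpm command below to verify that strace is installed, and check the version of the strace utility with the -V option: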

[root@sandbox ~]# rpm -qa | grep -i strace
strace-4.12-9.el7.x86_64
[root@sandbox ~]# 
[root@sandbox ~]# strace -V
strace -- version 4.12
[root@sandbox ~]#

If it is not installed, install it with:

yum install strace

For demonstration purposes, create a test directory in /tmp and create two files in it with the touch command:

[root@sandbox ~]# cd /tmp/
[root@sandbox tmp]# 
[root@sandbox tmp]# mkdir testdir
[root@sandbox tmp]# 
[root@sandbox tmp]# touch testdir/file1
[root@sandbox tmp]# touch testdir/file2
[root@sandbox tmp]#

(I use the /tmp directory because everyone has access to it, but you can choose another directory.)

Use the ls command on the testdir directory to verify that the files were created:

[root@sandbox tmp]# ls testdir/
file1  file2
[root@sandbox tmp]#

You probably use the ls command every day, but underneath it works on top of system calls. It works like this:
Command-line utility -> Invokes functions from system libraries (glibc) -> Invokes system calls

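On Linux, the ls command internally calls functions from the system library (glibc). These library functions in turn invoke the system calls that do most of the work.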

To find out which functions are called from the glibc library, use the ltrace command followed by the regular ls testdir/ command:

ltrace ls testdir/

If ltrace is not installed, install it with:

yum install ltrace

A pile of output is dumped to the screen; don't worry, just follow along. The library functions in the ltrace output that matter for this example include:

opendir("testdir/")                                  = { 3 }
readdir({ 3 })                                       = { 101879119, "." }
readdir({ 3 })                                       = { 134, ".." }
readdir({ 3 })                                       = { 101879120, "file1" }
strlen("file1")                                      = 5
memcpy(0x1665be0, "file1\0", 6)                      = 0x1665be0
readdir({ 3 })                                       = { 101879122, "file2" }
strlen("file2")                                      = 5
memcpy(0x166dcb0, "file2\0", 6)                      = 0x166dcb0
readdir({ 3 })                                       = nil
closedir({ 3 })

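Looking at the output above, you can probably work out what is happening. The opendir library function opens a directory named testdir, and readdir is then called repeatedly to read the directory's contents. Finally, closedir closes the directory that was opened earlier. Ignore the strlen and memcpy calls for now.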

You can see which library functions are being called, but this article focuses on the system calls that those library functions invoke.

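Similarly, to find out which system calls are made, just put strace in front of the ls testdir command, as shown below. Once again, a pile of what looks like gibberish is dumped to your screen: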

[root@sandbox tmp]# strace ls testdir/
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
brk(NULL)                               = 0x1f12000
<<< truncated strace output >>>
write(1, "file1  file2\n", 13file1  file2
)          = 13
close(1)                                = 0
munmap(0x7fd002c8d000, 4096)            = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
[root@sandbox tmp]#

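The output shown after running strace is simply the list of system calls made in order to run the ls command. Each system call serves a specific purpose for the operating system, and they can be broadly grouped as follows:
1. Process management system calls
2. File management system calls
3. Directory and filesystem management system calls
4. Other system calls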

An easier way to analyze the information dumped to the screen is to log it to a file with strace's -o option. Add a suitable file name after the -o flag and run the command again:

[root@sandbox tmp]# strace -o trace.log ls testdir/
file1  file2
[root@sandbox tmp]#

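This time no trace output is dumped to the screen: ls works as expected and prints the file names, while all of the trace output is logged to the file trace.log. For a simple ls command, the file already has over 100 lines: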

[root@sandbox tmp]# ls -l trace.log 
-rw-r--r--. 1 root root 7809 Oct 12 13:52 trace.log
[root@sandbox tmp]# 
[root@sandbox tmp]# wc -l trace.log 
114 trace.log
[root@sandbox tmp]#

Look at the first line of trace.log in this example:

execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0

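1. The first word on the line, execve, is the name of the system call being executed.

2. The text inside the parentheses is the arguments supplied to the system call.

3. The number after the = sign (0 in this case) is the value returned by the execve system call.

This is just one worked example; the same logic applies to the other lines.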
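
The three parts can also be pulled apart with plain shell parameter expansion. This is a minimal sketch that works on the sample line above; the variable names (line, name, args, ret) are illustrative only, and the approach assumes the simple one-line format shown here, not unfinished or resumed calls.

```shell
# Sample line copied from trace.log above.
line='execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0'

# 1. System call name: everything before the first "(".
name=${line%%\(*}

# 3. Return value: everything after the last " = ".
ret=${line##* = }

# 2. Arguments: drop the name and "(" from the front,
#    then ") = <return value>" from the back.
args=${line#*\(}
args=${args%\) = *}

printf 'name: %s\n' "$name"
printf 'args: %s\n' "$args"
printf 'ret:  %s\n' "$ret"
```

The same expansions can be applied line by line in a while read loop to process a whole trace file.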

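Now turn your attention to the single command that was invoked, ls testdir. You know the directory name that ls was given, so grep for testdir in trace.log and look at each resulting line in detail: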

[root@sandbox tmp]# grep testdir trace.log
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
stat("testdir/", {st_mode=S_IFDIR|0755, st_size=32, ...}) = 0
openat(AT_FDCWD, "testdir/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
[root@sandbox tmp]#

Thinking back to the analysis of execve above, can you tell what this system call does?

execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0

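You do not need to memorize every system call or what it does, because you can consult the documentation when you need to. Man pages to the rescue! Before running the man command, make sure the following package is installed: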

[root@sandbox tmp]# rpm -qa | grep -i man-pages
man-pages-3.53-5.el7.noarch
[root@sandbox tmp]#

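Remember to add a 2 between the man command and the name of the system call you want to look up. If you read man's own manual page with man man, you will see that section 2 is reserved for system calls. Likewise, if you need information about a library function, add a 3 between man and the function name.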

[root@sandbox tmp]# man man
       1   Executable programs or shell commands
       2   System calls (functions provided by the kernel)
       3   Library calls (functions within program libraries)
       4   Special files (usually found in /dev)
       5   File formats and conventions eg /etc/passwd
       6   Games
       7   Miscellaneous (including macro packages and conventions), e.g. man(7), groff(7)
       8   System administration commands (usually only for root)
       9   Kernel routines [Non standard]

The following man command shows the documentation for the execve system call from the example above:

man 2 execve

The output looks like this:

EXECVE(2)                  Linux Programmer's Manual                 EXECVE(2)

DESCRIPTION
       execve()  executes  the  program  pointed to by filename.  filename must be either a binary executable, or a script starting with a line of the form "#! interpreter [arg]".  In the latter case, the interpreter must be a
       valid pathname for an executable which is not itself a script, which will be invoked as interpreter [arg] filename

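According to the execve man page, this system call executes a program that is passed in as an argument (ls in this case). Additional arguments can be passed on to ls as well, such as testdir in this example. So this system call simply runs ls with testdir as its argument.

The next system call, stat, takes testdir as an argument: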

stat("testdir/", {st_mode=S_IFDIR|0755, st_size=32, ...}) = 0

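Use man 2 stat to read the documentation. stat is the system call that retrieves a file's status. Remember that everything in Linux is a file, including directories.

Next, the openat system call opens testdir. Note the 3 that is returned. This is a file descriptor, which later system calls will use: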

openat(AT_FDCWD, "testdir/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3

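So far, so good. Now open the trace.log file and go to the line after the openat system call. You will see the getdents system call being invoked; it does most of what is needed to run the ls testdir command. Now grep for getdents in trace.log: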

[root@sandbox tmp]# grep getdents trace.log 
getdents(3, /* 4 entries */, 32768)     = 112
getdents(3, /* 0 entries */, 32768)     = 0
[root@sandbox tmp]#

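According to man 2 getdents, getdents stands for "get directory entries". Note that the first argument to getdents is 3, the file descriptor returned by the openat system call above.

Now that the directory listing is available, there needs to be a way to display it in the terminal. So grep for another system call, write, which writes the output to the terminal: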

[root@sandbox tmp]# grep write trace.log
write(1, "file1  file2\n", 13)          = 13
[root@sandbox tmp]#

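Among these arguments you can see the file names that will be displayed: file1 and file2. As for the first argument (1), remember that in Linux, whenever a process runs, three file descriptors are opened for it by default:
0 - Standard input
1 - Standard output
2 - Standard error

So the write system call is displaying file1 and file2 on standard output (file descriptor 1), which by default is the terminal screen.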
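
To see that descriptors 1 and 2 really are separate streams, the following sketch captures each one on its own. The function name emit and the variables out and err are made up for illustration. It is also relevant to strace itself, which writes its trace to standard error, so the trace can be redirected separately from the traced program's output.

```shell
# Emit one line on standard output (fd 1) and one on standard error (fd 2).
emit() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Capture only fd 1; fd 2 is discarded.
out=$(emit 2>/dev/null)

# Capture only fd 2: first point fd 2 at the capture pipe,
# then send fd 1 to /dev/null.
err=$(emit 2>&1 >/dev/null)

echo "fd 1 carried: $out"
echo "fd 2 carried: $err"
```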

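You now know which system calls did most of the work for the ls testdir/ command. But what about the other 100-plus system calls in trace.log? The operating system has to do a lot of housekeeping to run a process, so much of what you see in the log file is initialization and cleanup. Read through the whole trace.log file and try to understand what happens to make the ls command work.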
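
One way to get an overview before reading a long trace line by line is to tally the calls by name. The sketch below does this with awk, sort, and uniq on a handful of lines copied from the output shown in this article; in practice you would point awk at the real trace.log. The temporary file and the variable names are assumptions for the example.

```shell
# A few lines taken from the trace output shown in this article.
log=$(mktemp)
cat > "$log" <<'EOF'
execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
brk(NULL)                               = 0x1f12000
openat(AT_FDCWD, "testdir/", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
getdents(3, /* 4 entries */, 32768)     = 112
getdents(3, /* 0 entries */, 32768)     = 0
write(1, "file1  file2\n", 13)          = 13
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
EOF

# Keep lines that begin with a syscall name followed by "(",
# print the name, then count occurrences, most frequent first.
summary=$(awk -F'(' '/^[a-z_0-9]+[(]/ { print $1 }' "$log" \
    | sort | uniq -c | sort -k1,1nr -k2)
echo "$summary"

rm -f "$log"
```

The status line beginning with +++ is skipped because it does not start with a call name.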

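Now that you know how to analyze the system calls for a given command, you can apply this knowledge to other commands to see which system calls they make. strace provides many useful command-line options that make tracing easier; some of them are described below.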

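By default, strace abbreviates some of the information it prints, such as environment variables and stat structures. The handy -v (verbose) option prints this additional information for each system call in full: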

strace -v ls testdir

Using the -f option whenever you run strace is good practice. It lets strace also trace any child processes created by the process currently being traced:

strace -f ls testdir
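
The effect of -f is easiest to see on a command that actually forks. The sketch below traces a small shell pipeline with and without -f and compares how many execve calls were recorded. It assumes strace is installed and that tracing is permitted in the current environment; if not, it only says so. The pipeline, file names, and variable names are chosen for illustration.

```shell
# Skip gracefully where strace is missing or ptrace is not permitted.
if ! command -v strace >/dev/null 2>&1 || ! strace -o /dev/null true 2>/dev/null; then
    echo "strace is not usable here"
    status=skipped
else
    dir=$(mktemp -d)

    # Without -f only the top-level sh process is traced.
    strace -o "$dir/plain.log" sh -c 'ls / | wc -l' >/dev/null

    # With -f the children (ls and wc) are traced as well;
    # each line in the log is then prefixed with a PID.
    strace -f -o "$dir/follow.log" sh -c 'ls / | wc -l' >/dev/null

    plain=$(grep -c '^execve(' "$dir/plain.log")
    follow=$(grep -c 'execve(' "$dir/follow.log")
    echo "execve calls recorded without -f: $plain, with -f: $follow"

    rm -rf "$dir"
    status=traced
fi
```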

Suppose you only need the names of the system calls, how many times each one ran, and the percentage of time spent in each. The -c option gives you these statistics:

strace -c ls testdir/
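
The summary table can be combined with -o so that it is saved to a file, and with -S to choose the sort column. A short sketch, again assuming strace is installed and usable; the traced command (ls /) and the variable names are placeholders for the example.

```shell
# Skip gracefully where strace is missing or ptrace is not permitted.
if ! command -v strace >/dev/null 2>&1 || ! strace -o /dev/null true 2>/dev/null; then
    echo "strace is not usable here"
    status=skipped
else
    out=$(mktemp)

    # -c prints a summary table instead of the per-call trace,
    # -S calls sorts the rows by number of calls,
    # -o writes the table to a file instead of standard error.
    strace -c -S calls -o "$out" ls / >/dev/null

    cat "$out"

    # The table ends with a "total" row.
    totals=$(grep -c 'total' "$out" || true)
    rm -f "$out"
    status=traced
fi
```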

Suppose you want to focus on one particular system call, for example open, and ignore the rest. Use the -e option followed by the system call name:

[root@sandbox tmp]# strace -e open ls testdir
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libcap.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libacl.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpcre.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libdl.so.2", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libattr.so.1", O_RDONLY|O_CLOEXEC) = 3
open("/lib64/libpthread.so.0", O_RDONLY|O_CLOEXEC) = 3
open("/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
file1  file2
+++ exited with 0 +++
[root@sandbox tmp]#

To focus on several system calls, use the same -e option and separate the system call names with commas. For example, to see the write and getdents system calls:

[root@sandbox tmp]# strace -e write,getdents ls testdir
getdents(3, /* 4 entries */, 32768)     = 112
getdents(3, /* 0 entries */, 32768)     = 0
write(1, "file1  file2\n", 13file1  file2
)          = 13
+++ exited with 0 +++
[root@sandbox tmp]#
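
The short form -e open is equivalent to the explicit -e trace=open. One caveat, stated as an assumption about systems other than the RHEL 7 host used here: on newer distributions the C library may issue openat instead of open, and getdents64 instead of getdents, so a filter copied from this article may match nothing. Listing several names, or using a class such as trace=file, is one way around that. The sketch below uses the explicit form on a command whose only interesting call is write, and checks that nothing else was logged.

```shell
# Skip gracefully where strace is missing or ptrace is not permitted.
if ! command -v strace >/dev/null 2>&1 || ! strace -o /dev/null true 2>/dev/null; then
    echo "strace is not usable here"
    status=skipped
else
    log=$(mktemp)

    # trace=write restricts the log to write calls; several names can be
    # given as a comma-separated list, for example trace=write,close.
    strace -e trace=write -o "$log" sh -c 'echo hello' >/dev/null

    cat "$log"

    # Count lines that are neither the selected call nor one of
    # strace's own status lines ("+++ ..." or "--- ...").
    others=$(grep -v -e '^write(' -e '^+++' -e '^---' "$log" | wc -l | tr -d ' ')
    rm -f "$log"
    status=traced
fi
```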

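The examples so far have all traced a command that strace itself starts. But what about commands that are already running? For example, what if you want to trace a daemon, which is simply a long-running process? For this, strace provides a special -p option, to which you supply a process ID.

Rather than running strace against a daemon, this demonstration uses the cat command. Given a file name as an argument, cat normally displays the file's contents. With no arguments, cat simply waits at the terminal for the user to type text. After each line is entered it repeats the text back, until the user presses Ctrl+C to exit.

Run the cat command from one terminal; it prints nothing and simply waits (remember that cat is still running and has not exited):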

[root@sandbox tmp]# cat

From another terminal, find the process identifier (PID) with the ps command:

[root@sandbox ~]# ps -ef | grep cat
root      22443  20164  0 14:19 pts/0    00:00:00 cat
root      22482  20300  0 14:20 pts/1    00:00:00 grep --color=auto cat
[root@sandbox ~]#
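
Note that ps -ef | grep cat also lists the grep process itself (PID 22482 above), so the right line has to be picked by eye. When you start the target yourself from a script, the shell already knows the PID through $!. The sketch below uses sleep as a stand-in for a long-running daemon (an assumption for the example) and only prints the strace command rather than attaching, since on systems that restrict ptrace, attaching to a process that is not your own child may require root.

```shell
# Stand-in for a long-running daemon (assumption: any background process will do).
sleep 30 &
pid=$!

# kill -0 delivers no signal; it only reports whether the PID
# exists and can be signalled by the current user.
if kill -0 "$pid" 2>/dev/null; then
    alive=yes
    echo "target PID: $pid"
    echo "attach with: strace -p $pid"
else
    alive=no
fi

# Clean up the stand-in process.
kill "$pid" 2>/dev/null
wait "$pid" 2>/dev/null || true
```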

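Now run strace against the running process with the -p option and the PID found with ps above. Once strace starts, its output confirms that it has attached to the process and shows the PID. strace is now tracing the system calls made by the cat command. The first system call you see is read, which is waiting for input on file descriptor 0, standard input, which is the terminal where cat is running: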

[root@sandbox ~]# strace -p 22443
strace: Process 22443 attached
read(0,

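Now go back to the terminal where cat is running and type some text. I typed x0x0 for demonstration purposes. Notice how cat simply repeats what I typed, so x0x0 appears twice: the first is my input, and the second is the output echoed by cat: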

[root@sandbox tmp]# cat
x0x0
x0x0

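Return to the terminal where strace is attached to the cat process. The earlier read call has now completed, having read x0x0 from the terminal; it is followed by a write call that writes x0x0 back to the terminal, and by a new read that is waiting for more input from the terminal. Note that standard input (0) and standard output (1) are both on the same terminal: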

[root@sandbox ~]# strace -p 22443
strace: Process 22443 attached
read(0, "x0x0\n", 65536)                = 5
write(1, "x0x0\n", 5)                   = 5
read(0,

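Imagine how useful this is when you run strace against a daemon to see everything it does in the background. Press Ctrl+C to exit the cat command; this also ends the strace session, because the process is no longer running.

To see a timestamp on every system call, use the -t option with strace, for example strace -t ls testdir/: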

14:24:47 execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
14:24:47 brk(NULL)                      = 0x1f07000
14:24:47 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2530bc8000
14:24:47 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
14:24:47 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3

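If you want to know how much time elapses between system calls, strace has a handy -r option. It prints a relative timestamp on entry to each system call, that is, the time since the previous system call began. This is very useful: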

[root@sandbox ~]# strace -r ls testdir/
0.000000 execve("/usr/bin/ls", ["ls", "testdir/"], [/* 40 vars */]) = 0
0.000368 brk(NULL)                 = 0x1966000
0.000073 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb6b1155000
0.000047 access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
0.000119 open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
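
The -r values measure the gap between the starts of consecutive calls, not how long each call itself ran. For the latter, strace also has a -T option, which appends the time spent inside each call in angle brackets at the end of the line. The two can be combined, as in this sketch; it assumes strace is installed and usable, and the traced command and variable names are placeholders.

```shell
# Skip gracefully where strace is missing or ptrace is not permitted.
if ! command -v strace >/dev/null 2>&1 || ! strace -o /dev/null true 2>/dev/null; then
    echo "strace is not usable here"
    status=skipped
else
    log=$(mktemp)

    # -r: relative timestamp at the start of each line
    # -T: time spent inside the call, appended as <seconds>
    strace -r -T -o "$log" ls / >/dev/null

    head -n 5 "$log"

    # Count lines that end with a <seconds> duration.
    timed=$(grep -c '<[0-9][0-9.]*>$' "$log" || true)
    rm -f "$log"
    status=traced
fi
```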

Conclusion

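The strace utility is very handy for understanding system calls on Linux. To learn about its other command-line options, see the man page and the online documentation.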