If you have a few years of experience in the Linux ecosystem, and you’re interested in sharing that experience with the community, have a look at our Contribution Guidelines.

1. Overview

When working with Linux systems, we will eventually reach the point where we’ll need to figure out what killed our process and why. In this article, we’ll explain how to troubleshoot that.

We’ll start by understanding how a process terminates in Linux. Next, we’ll show where to find the relevant logs for when the kernel decides to kill a process. Finally, we’ll examine the reason why that procedure kicked in.

2. Understanding How a Process Terminates

There are generally two ways that a process can terminate in Linux:

  • voluntarily: calling the exit() system call. This means the process has finished its tasks, so it chooses to terminate.
  • involuntarily: when receiving a signal. This signal can be sent by another user, another process, or Linux itself.

We’ll focus on the latter case here and see what signals are and how they work. We’ll then go over those signals that are relevant to process termination.

2.1. Introduction to Linux Signals

Signals are one of the mechanisms for inter-process communication (IPC) in Linux. When a process receives a signal, it interrupts its normal execution path and, unless it explicitly ignores that particular signal, executes the corresponding signal handler.

A signal handler is a small routine that dictates what the process should do when it receives a specific signal. A process may either define its own handler for one or more signals or rely on the default action that Linux provides. As we’ll see in a later section, SIGKILL is a special case: a process can neither ignore it nor override its handler. The same is true of SIGSTOP, which pauses a process rather than terminating it.
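To make the idea of a custom handler concrete, here’s a minimal Python sketch. It assumes we want to react to SIGTERM (one of the termination signals covered in the next section); the function name catch_sigterm_once is ours, purely for illustration. The process installs a handler, sends the signal to itself, and then restores the previous handler:

```python
import os
import signal
import time


def catch_sigterm_once():
    """Install a SIGTERM handler, signal ourselves, and report what arrived."""
    received = []

    def handler(signum, frame):
        # Runs instead of the default action, which would end the process.
        received.append(signal.Signals(signum).name)

    previous = signal.signal(signal.SIGTERM, handler)
    try:
        os.kill(os.getpid(), signal.SIGTERM)  # send SIGTERM to ourselves
        # Give the interpreter a moment to run the Python-level handler.
        deadline = time.monotonic() + 1.0
        while not received and time.monotonic() < deadline:
            time.sleep(0.01)
    finally:
        # Put back whatever handler was installed before the demo.
        restore = previous if previous is not None else signal.SIG_DFL
        signal.signal(signal.SIGTERM, restore)
    return received


if __name__ == "__main__":
    print(catch_sigterm_once())
```

Without the handler, the same os.kill() call would simply terminate the script, since the default action for SIGTERM is to end the process.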

2.2. Termination Signals

Now, let’s briefly mention those signals that are relevant to terminating a process:

  • SIGTERM: This is a “nice” way to ask a process to terminate, meaning that it can perform some cleanup operations and shut down gracefully.
  • SIGINT: Indicates that a user interrupted the process by sending the INTR character (CTRL + c).
  • SIGQUIT: Similar to SIGTERM, except that it also produces a core dump during the termination process.
  • SIGHUP: This indicates that a user’s terminal is disconnected for some reason.
  • SIGKILL: This special signal can’t be ignored or handled, and it immediately kills the process.

In the next section, we’ll take a closer look at SIGKILL, as this is the signal Linux uses when it needs to forcibly terminate our processes.

2.3. The SIGKILL Signal

A process that receives SIGKILL gets no chance to react. Unlike SIGTERM or SIGQUIT, SIGKILL can’t be blocked, ignored, or handled. For that reason, it’s often seen as the last resort when we need to terminate a process immediately.
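We can observe this restriction from inside a program. The following Python sketch (the helper name can_install_handler is hypothetical) tries to register a do-nothing handler for a given signal and reports whether the kernel accepted it:

```python
import signal


def can_install_handler(signum):
    """Return True if this process is allowed to replace the signal's action."""
    try:
        previous = signal.signal(signum, lambda number, frame: None)
    except OSError:
        # The kernel rejects the request for SIGKILL (and SIGSTOP).
        return False
    # Undo the temporary change so the demo leaves no trace.
    restore = previous if previous is not None else signal.SIG_DFL
    signal.signal(signum, restore)
    return True


if __name__ == "__main__":
    for sig in (signal.SIGTERM, signal.SIGQUIT, signal.SIGKILL):
        print(sig.name, can_install_handler(sig))
```

On Linux, we’d expect the check to succeed for SIGTERM and SIGQUIT and to fail for SIGKILL.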

A common practice when trying to terminate a process is to try with a SIGTERM or SIGQUIT first, and if it doesn’t stop after a reasonable amount of time, then force it via SIGKILL.
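This escalation can be sketched in a few lines of Python. The example assumes that we started the process ourselves via subprocess.Popen; the function name stop_process and the default grace period are illustrative choices, not a standard:

```python
import subprocess


def stop_process(proc, grace_seconds=5.0):
    """Ask politely with SIGTERM, then force termination with SIGKILL."""
    proc.terminate()  # sends SIGTERM
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        # The process ignored or outlived SIGTERM, so we stop asking.
        proc.kill()  # sends SIGKILL
        proc.wait()
    # A negative return code is the number of the signal that ended the child.
    return proc.returncode


if __name__ == "__main__":
    child = subprocess.Popen(["sleep", "20"])
    print(stop_process(child, grace_seconds=2.0))
```

The right grace period depends on how long the process needs for its cleanup, so it’s worth making it configurable rather than hard-coding it.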

As we’ll see in a later section, Linux may also send SIGKILL to a process to enforce immediate termination when the operating system is struggling with its resource utilization.

3. Finding Out Who Killed the Process

Now that we’ve seen how processes are killed, we can look into the root causes of a process termination and find more information about it.

Assuming that the process didn’t exit voluntarily, the only way that it can be terminated is via one of the signals that we mentioned earlier. These signals can be sent either by another process or by the Linux kernel. We’ll examine both scenarios in the coming sections but focus more on kernel-initiated process termination.

3.2. Process-Initiated Signals

On many occasions, some other user or process may choose to kill a process. As described earlier, this will eventually happen via a SIGTERM or a SIGKILL signal (or a combination of both).

Here’s an example of a user that stops a process with the pkill command.

We first open up a terminal and create a new process that sleeps for 20 seconds:

$ sleep 20

Then, from another terminal:

$ pkill sleep

Finally, going back to our first terminal, we get:

$ sleep 20
Terminated

Note the Terminated message, which the shell prints to tell us that a signal ended the process. If, however, the process isn’t attached to a terminal (because, for example, the user started it from the graphical user interface or a cron job), it’s more difficult to understand what happened. In that case, we’d have to set things up ahead of time to monitor this sort of user activity. Tools such as psacct or auditd can help us accomplish that.
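When we’re the parent of the process, we can also read the terminating signal from its exit status instead of relying on the shell’s message. Here’s a small Python sketch with hypothetical helper names (describe_exit, run_and_interrupt). It relies on two conventions: subprocess reports a child killed by signal N as return code -N, while shells such as bash report it as exit status 128 + N:

```python
import signal
import subprocess


def describe_exit(returncode):
    """Explain a return code as reported by subprocess.Popen."""
    if returncode < 0:
        number = -returncode
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = f"signal {number}"  # no symbolic name known to Python
        # Shells such as bash report the same event as 128 + signal number.
        return f"killed by {name} (shell exit status {128 + number})"
    return f"exited on its own with status {returncode}"


def run_and_interrupt(argv, sig=signal.SIGTERM):
    """Start a child, send it a signal, and describe how it ended."""
    child = subprocess.Popen(argv)
    child.send_signal(sig)
    child.wait()
    return describe_exit(child.returncode)


if __name__ == "__main__":
    print(run_and_interrupt(["sleep", "20"]))
```

This tells us which signal ended the process, but not who sent it. For the sender, we still need an auditing tool like the ones mentioned above.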

3.3. Kernel-Initiated Signals

The Linux kernel may also decide to terminate one or more processes when the system is running low on resources. A very common example of that is the out-of-memory (OOM) killer, which takes action when the system’s physical memory is getting exhausted.

When this event happens, the kernel logs the relevant information into the kernel log buffer, which is made available through /dev/kmsg. Several tools make reading from that virtual device easier, with the most popular being dmesg and journalctl. Let’s see some examples.

First, let’s trigger the OOM killer:

$ (echo "li = []" ; echo "for r in range(9999999999999999): li.append(str(r))") | python
Killed

Now, let’s examine the relevant logs with dmesg:

$ sudo dmesg | tail -7
[427918.962500] [ 142394] 1000 142394 58121 2138 335872 0 0 Socket Process
[427918.962505] [ 179856] 1000 179856 660680 21527 1208320 0 0 Web Content
[427918.962508] [ 179902] 1000 179902 605502 3489 483328 0 0 Web Content
[427918.962510] [ 179944] 1000 179944 3197660 3175506 25534464 0 0 python
[427918.962514] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-2.scope,task=python,pid=179944,uid=1000
[427918.962531] Out of memory: Killed process 179944 (python) total-vm:12790640kB, anon-rss:12702024kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:24936kB oom_score_adj:0
[427919.411464] oom_reaper: reaped process 179944 (python), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

We can also investigate the logs with journalctl:

$ journalctl --list-boots | \
    awk '{ print $1 }' | \
    xargs -I{} journalctl --utc --no-pager -b {} -kqg 'killed process' -o verbose --output-fields=MESSAGE
Fri 2021-12-10 17:49:55.782801 UTC [s=38b35d6842d24a09ab14c8735cd79ff7;i=aeef;b=cd40ed63d47d4814b4c2c0f9ab73341f;m=63a20e170d;t=5d2ce59d50891;x=cdd950f6be42011b]
    MESSAGE=Out of memory: Killed process 179944 (python) total-vm:12790640kB, anon-rss:12702024kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:24936kB oom_score_adj:0
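If we need to process these messages in a script, we can extract the interesting fields with a regular expression. The Python sketch below assumes the message layout shown in the output above; the wording of OOM messages differs between kernel versions, so the pattern may need adjusting. The names OOM_KILL and find_oom_kills are ours:

```python
import re

# Matches the "Killed process" line as it appears in the sample output.
OOM_KILL = re.compile(
    r"Out of memory: Killed process (?P<pid>\d+) \((?P<name>[^)]+)\)"
    r"(?:.*?anon-rss:(?P<anon_rss_kb>\d+)kB)?"
)


def find_oom_kills(lines):
    """Return one dict per OOM kill found in the given log lines."""
    kills = []
    for line in lines:
        match = OOM_KILL.search(line)
        if match is None:
            continue
        rss = match.group("anon_rss_kb")
        kills.append({
            "pid": int(match.group("pid")),
            "name": match.group("name"),
            # Anonymous resident memory at kill time, if the line reports it.
            "anon_rss_kb": int(rss) if rss is not None else None,
        })
    return kills
```

We could feed this function the lines of dmesg or journalctl -k output, for example by reading them from standard input.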

4. The Out-of-Memory Killer

In this section, we’ll briefly touch upon the OOM killer and its underlying mechanics.

Linux uses virtual memory, which means that each process gets its own address space and, from its point of view, may appear to have the machine’s entire memory available. This makes programming much easier, and it also lets the kernel overcommit memory: it can grant allocation requests for more memory than is physically available, on the assumption that processes rarely use everything they ask for at the same time. If, however, processes actually try to use more memory than the system can provide, the kernel can no longer honor those promises. This is where the OOM killer comes into the picture.

When the system is running out of memory, the job of the OOM killer is to pick the minimum number of processes and terminate them. It uses a badness score – which is available through procfs via /proc/<pid>/oom_score – to decide which processes to kill. While making that decision, it tries to minimize the damage by making sure that it:

  • minimizes the lost work
  • recovers as much memory as possible
  • doesn’t kill innocent processes, but only those that consume a lot of memory
  • minimizes the number of killed processes (ideally, to just one)
  • kills the process(es) that the user would expect

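The OOM killer is also configurable. For example, we can bias the score of an individual process by writing a value between -1000 and 1000 to /proc/<pid>/oom_score_adj, and we can tune system-wide behavior through sysctl settings such as vm.overcommit_memory, vm.panic_on_oom, and vm.oom_kill_allocating_task.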
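To see which processes the OOM killer currently considers the most likely victims, we can read the badness score of every process from procfs. The Python sketch below uses hypothetical helper names and assumes a Linux system with /proc mounted. It also reads oom_score_adj, the per-process adjustment that appears in the kernel log output shown earlier:

```python
from pathlib import Path


def read_oom_values(pid="self"):
    """Return (oom_score, oom_score_adj) for a PID, or None if unavailable."""
    base = Path("/proc") / str(pid)
    try:
        score = int((base / "oom_score").read_text())
        adjustment = int((base / "oom_score_adj").read_text())
    except (OSError, ValueError):
        return None  # process is gone, access is denied, or /proc is missing
    return score, adjustment


def top_oom_candidates(limit=5):
    """List (score, pid, name) tuples, highest badness score first."""
    candidates = []
    try:
        entries = list(Path("/proc").iterdir())
    except OSError:
        return candidates  # not a Linux system, or /proc is not mounted
    for entry in entries:
        if not entry.name.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        values = read_oom_values(entry.name)
        if values is None:
            continue
        try:
            name = (entry / "comm").read_text().strip()
        except OSError:
            name = "?"
        candidates.append((values[0], int(entry.name), name))
    return sorted(candidates, reverse=True)[:limit]


if __name__ == "__main__":
    for score, pid, name in top_oom_candidates():
        print(f"{score:>6} {pid:>8} {name}")
```

A higher score means the process is more likely to be chosen when memory runs out.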

5. Conclusion

In this article, we saw how to find out what killed our Linux process and why. First, we looked at how processes terminate and at the most common signals for process termination. Then, we explained how another process or the Linux kernel might kill our processes and where to find the relevant logs. Finally, we touched upon the basics of the OOM killer.
