Killing a process and all of its descendants
Killing processes in a Unix-like system can be trickier than expected. Last week I was debugging an odd issue related to job stopping on Semaphore. More specifically, an issue related to the killing of a running process in a job.
Here are the highlights of what I learned:
- Unix-like operating systems have sophisticated process relationships.
- Sending signals to all processes in a session is not trivial with syscalls.
- Processes started with exec inherit their parent signal configuration.
- There are many challenges with handling orphaned process groups.
There are multiple types of relationships between processes. Parent-child, process groups, sessions, and session leaders. However, the details are not uniform across operating systems like Linux and macOS. POSIX compliant operating systems support sending signals to process groups with a negative PID number.
Killing a parent doesn’t kill the child processes
Every process has a parent. We can observe this with pstree
or the ps
utility.
# start two dummy processes
$ sleep 100 &
$ sleep 101 &
$ pstree -p
init(1)-+
|-bash(29051)-+-pstree(29251)
|-sleep(28919)
`-sleep(28964)
$ ps j -A
PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
0 1 1 1 ? -1 Ss 0 0:03 /sbin/init
29051 1470 1470 29051 pts/2 2386 SN 1000 0:00 sleep 100
29051 1538 1538 29051 pts/2 2386 SN 1000 0:00 sleep 101
29051 2386 2386 29051 pts/2 2386 R+ 1000 0:00 ps j -A
1 29051 29051 29051 pts/2 2386 Ss 1000 0:00 -bash
The ps
command displays the PID (id of the process), and the PPID (parent ID
of the process).
I held a very incorrect assumption about this relationship. I thought that if I kill the parent of a process, it kills the children of that process too. However, this is incorrect. Instead, child processes become orphaned, and the init process re-parents them.
Let’s see the re-parenting in action by killing the bash process — the current parent of the sleep commands — and observe the changes.
$ kill 29051 # killing the bash process
$ pstree -A
init(1)-+
|-sleep(28919)
`-sleep(28965)
The re-parenting behavior was odd to me. For example, when I SSH into a server, start a process, and exit, the started process is killed. I wrongly assumed this is the default behavior on Linux. It turns that killing of processes when I leave an SSH session is related to process groups, session leaders, and controlling terminals.
What are process groups and session leaders?
Let’s observe the output of ps j
from the previous example again.
$ ps j -A
PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
0 1 1 1 ? -1 Ss 0 0:03 /sbin/init
29051 1470 1470 29051 pts/2 2386 SN 1000 0:00 sleep 100
29051 1538 1538 29051 pts/2 2386 SN 1000 0:00 sleep 101
29051 2386 2386 29051 pts/2 2386 R+ 1000 0:00 ps j -A
1 29051 29051 29051 pts/2 2386 Ss 1000 0:00 -bash
Apart from the parent-child relationship expressed by PPID and PID, we have two other relationships:
- Process groups represented by PGID
- Sessions represented by SID
Process groups are observable in shells that support job control, like bash
and zsh
, that are creating a process group for every pipeline of commands. A
process group is a collection of one or more processes (usually associated with
the same job) that can receive signals from the same terminal. Each process
group has a unique process group ID.
# start a process group that consists of tail and grep
$ tail -f /var/log/syslog | grep "CRON" &
$ ps j
PPID PID PGID SID TTY TPGID STAT UID TIME COMMAND
29051 19701 19701 29051 pts/2 19784 SN 1000 0:00 tail -f /var/log/syslog
29051 19702 19701 29051 pts/2 19784 SN 1000 0:00 grep CRON
29051 19784 19784 29051 pts/2 19784 R+ 1000 0:00 ps j
29050 29051 29051 29051 pts/2 19784 Ss 1000 0:00 -bash
Notice that the PGID of tail
and grep
is the same in the previous snippet.
A session is a collection of process groups, usually associated with one controlling terminals and a session leader process. If a session has a controlling terminal, it has a single foreground process group, and all other process groups in the session are background process groups.
Not all bash processes are sessions, but when you SSH into a remote server, you usually get a session. When bash runs as a session leader, it propagates the SIGHUP signal to its children. SIGHUP propagation to children was the core reason for my long-held belief that children are dying along with the parents.
Sessions are not consistent across Unix implementations
In the previous examples, you can notice the occurrence of SID, the session ID of the process. It is the ID shared by all processes in a session.
However, you need to keep in mind that this is not true across all Unix implementations. The Single UNIX Specification talks only about a “session leader”; there is no “session ID” similar to a process ID or a process group ID. A session leader is a single process that has a unique process ID, so we could talk about a session ID that is the process ID of the session leader.
System V Release 4 introduced Session IDs.
In practice, this means that you get session ID in the ps
output on Linux, but
on BSD and its variants like MacOS, the session ID isn’t present or always zero.
Killing all processes in a process group or session
We can use that PGID to send a signal to the whole group with the kill utility:
$ kill -SIGTERM -- -19701
We used a negative number -19701
to send a signal to the group. If kill
receives a positive number, it kills the process with that ID. If we pass a
negative number, it kills the process group with that PGID.
The negative number comes from the system call definition directly.
Killing all processes in a session is quite different. As explained in the
previous section, some systems don’t have a notion of a session ID. Even the
ones that have session IDs, like Linux, don’t have a system call to kill all
processes in a session. You need to walk the /proc
tree, collect the SIDs, and
terminate the processes.
Pgrep implements the algorithm for walking, collecting, and process killing by session ID. Use the following snipped:
pkill -s <SID>
Nohup propagation to process descendants
Ignored signals, like the ones ignored with nohup
, are propagated to all
descendants of a process. This propagation was the final bottleneck in my bug
hunting exercise last week.
In my program — an agent for running bash commands — I verified that I have an established a bash session that has a controlling terminal. It is the session leader of the processes started in that bash session. My process tree looks like this:
agent -+
+- bash (session leader) -+
| - process1
| - process2
I assumed that when I kill the bash session with SIGHUP, it kills the children as well. Integration tests on the agent also verified this.
However, what I missed was that the agent is started with nohup
. When you
start a subprocess with exec
, like we start the bash process in the agent, it
inherits the signals states from its parents.
This last one took me by surprise.