a simple watchdog for subprocesses

A problem I often encounter when writing code that deals with subprocesses is the issue of orphans. Orphans are perfectly useful in many situations, such as keeping a program running via disown. They’re also a real annoyance when you’re trying to fork worker processes: every time your parent process has a bug, you end up with orphaned workers that you have to terminate manually.

For instance, I might want to launch a bunch of worker processes, each listening on a separate port:

import subprocess

for i in range(num_procs):
  subprocess.Popen(['python', 'worker.py', str(9999 + i)])

The orphan problem crops up if the parent dies without terminating the worker processes (gracefully or otherwise). After the parent process is killed, the workers are left running in the background, holding onto both their resources and their ports. We now have to terminate them manually before we can try running again:

pkill -9 -f worker.py

You can use the atexit functionality from most languages to try to clean up, but this doesn’t help if your process segfaults or meets some other ignominious end.
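As a rough sketch of the atexit approach in Python (the `children` list and the `spawn` and `terminate_children` helpers are hypothetical names made up for this illustration):

```python
import atexit
import subprocess
import sys

# Hypothetical registry of every worker this parent has launched.
children = []

def spawn(args):
    """Start a child process and remember it for cleanup."""
    proc = subprocess.Popen(args)
    children.append(proc)
    return proc

def terminate_children():
    """Terminate any child that is still running, then reap it."""
    for proc in children:
        if proc.poll() is None:
            proc.terminate()
            proc.wait()

# Runs on normal interpreter exit only. It is skipped if the parent
# segfaults, is killed by an unhandled signal, or calls os._exit().
atexit.register(terminate_children)
```

This covers the clean-exit case, which is exactly why it is not enough on its own: the handler never gets a chance to run in the failure modes that produce orphans.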

In a “real” environment, this is typically solved by having a cluster manager which schedules jobs and can start and kill processes as needed (for example, SLURM or Torque). But often we’re running in an environment where this is unavailable (either our local machine, or we just haven’t bothered to set something up). What do we do?

The solution here is a watchdog timer which terminates the workers if the master process dies. This can be accomplished in a number of ways, including via the RPC system, but I’ve found a simple mechanism that works well for me: polling the stdin channel.

import os
import select
import threading

class FileWatchdog(threading.Thread):
  """Watchdog for a file (typically sys.stdin).

  When the file closes, terminate the process.
  (This typically occurs when the parent process is lost.)
  """
  def __init__(self, file_handle):
    threading.Thread.__init__(self, name='WatchdogThread')
    self.file_handle = file_handle
    # self.log = open('/tmp/watchdog.%d' % os.getpid(), 'w')

  def run(self):
    f = [self.file_handle]
    while 1:
      # Wake up at least once a second. A pipe whose write end has been
      # closed shows up as readable (EOF), so check the read set.
      r, w, x = select.select(f, [], f, 1.0)
      # print('Watchdog running: %s %s %s' % (r, w, x), file=self.log)
      # self.log.flush()
      if r:
        # print('Watchdog: file closed.  Shutting down.', file=self.log)
        # self.log.flush()
        os._exit(1)  # exit at once, without waiting on other threads

I just have my worker processes spawn a watchdog thread at startup time, and they will terminate themselves as soon as they lose their connection to the master process. This simple mechanism falls apart if you’re using stdin to communicate with your child processes, but it works well for most other simple needs.
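For the worker to notice that the parent is gone, its stdin has to be a pipe whose write end is held by the parent. Here is a minimal end-to-end sketch of that wiring, assuming Python 3 on a POSIX system (select on a pipe does not work on Windows). The `WORKER_SOURCE` string, the exit code 3, and the `launch_worker` and `simulate_parent_exit` helpers are all hypothetical names for this illustration, and the inline `watch` function is a condensed stand-in for the watchdog thread above.

```python
import subprocess
import sys

# Hypothetical worker: starts a watchdog thread on stdin, then "works".
WORKER_SOURCE = """
import os, select, sys, threading, time

def watch(handle):
    while True:
        readable, _, _ = select.select([handle], [], [], 1.0)
        if readable:      # EOF: the parent's end of the pipe is closed
            os._exit(3)   # arbitrary exit code chosen for this demo

threading.Thread(target=watch, args=(sys.stdin,), daemon=True).start()
time.sleep(60)  # stand-in for the real work loop
"""

def launch_worker():
    """Start a worker whose stdin is a pipe owned by this process."""
    return subprocess.Popen([sys.executable, '-c', WORKER_SOURCE],
                            stdin=subprocess.PIPE)

def simulate_parent_exit(proc, timeout=10):
    """Close our end of the pipe, as the OS would if we died, and
    return the worker's exit code."""
    proc.stdin.close()
    return proc.wait(timeout=timeout)
```

When the parent really does die, the operating system closes the write end of the pipe on its behalf, so the worker sees the same EOF that `simulate_parent_exit` triggers by hand.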