Use tmux to debug distributed Python programs

Posted on 2021-03-06 Last updated on 2021-03-07

Debugging distributed programs is always hard. Not only is concurrency notoriously tricky, but we also lack good tools, or simply don't know about the tools that exist. I found that tmux can manage multiple windows, which makes it possible to control many nodes without a GUI.

Usage of tmux

Here is my tmux cheat sheet. For more details, see https://gist.github.com/henrik/1967800.

Create Session and Window

# Shell Commands
tmux new -s s1 # Create a session named s1
tmux neww -t s1: htop # Create a window in session s1 and launch htop

# tmux Control
Ctrl-b 0 # Switch to window 0
Ctrl-b 1 # Switch to window 1
...
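
Since the launcher later in this post is written in Python, the same two shell commands can also be driven from a script. Below is a minimal sketch, not part of the original cheat sheet: the helper names tmux_new_session and tmux_new_window are made up for illustration, and the -d flag (create the session detached) is my own addition so that a script does not get stuck inside tmux.

```python
def tmux_new_session(name):
    """Return the argv that creates a detached tmux session called `name`."""
    # -d is an assumption added here: it keeps tmux from attaching to the
    # new session, which is what a non-interactive script usually wants.
    return ["tmux", "new-session", "-d", "-s", name]


def tmux_new_window(session, command):
    """Return the argv that opens a window in `session` running `command`."""
    # The trailing colon means "this session, next free window index".
    return ["tmux", "new-window", "-t", session + ":", command]


# Usage (requires tmux on PATH):
#   import subprocess
#   subprocess.run(tmux_new_session("s1"), check=True)
#   subprocess.run(tmux_new_window("s1", "htop"), check=True)
```

Returning argument lists instead of one long string avoids a second round of shell quoting when the command is passed to subprocess.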

Window / Pane Conversion

Note: After pressing Ctrl-b and :, you can use Tab for autocompletion and the arrow keys to browse the command history.

Ctrl-b :join-pane -s :2 # Move window 2 into a new split pane
Ctrl-b :break-pane # Move the active pane into its own window

In join-pane, -s names the source and -t the target. This is not universal: in tmux new, -s sets the session name.

Example

RPC via SSH

This launcher spawns several processes on remote machines. (Source: DGL Library)

import subprocess
from threading import Thread

def execute_remote(cmd, ip, port, thread_list):
    """execute command line on remote machine via ssh"""
    cmd = 'ssh -o StrictHostKeyChecking=no -p ' + str(port) + ' ' + ip + ' \'' + cmd + '\''
    # thread func to run the job
    def run(cmd):
        subprocess.check_call(cmd, shell = True)

    thread = Thread(target = run, args=(cmd,))
    thread.daemon = True  # daemon thread: does not keep the launcher alive on exit
    thread.start()
    thread_list.append(thread)

To debug the program, we need to create a session on the login node first.

login$ tmux new -s dgl

Then modify the launcher's source code so that each newly spawned process runs in its own tmux window.

Put tmux neww at the beginning of the command.
Put ;bash -i at the end to keep the window from closing after the program exits.

def execute_remote(cmd, ip, port, thread_list):
    """execute command line on remote machine via ssh"""
    cmd = 'tmux neww -t dgl: ssh -o StrictHostKeyChecking=no -p ' + str(port) + ' ' + ip + ' \'' + cmd + '\'' + ';bash -i'
    ...
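
One detail to watch when building this string: the launcher hands it to a shell (shell = True), and an unquoted ; is parsed by that local shell rather than by tmux. A possible way to make sure bash -i runs inside the new window is to quote the whole window command as a single argument, as the MPI example in the next section does. The sketch below is my own variant, not DGL's code; the helper name build_tmux_ssh_command is hypothetical, and it relies only on shlex.quote from the standard library.

```python
import shlex


def build_tmux_ssh_command(cmd, ip, port, session="dgl"):
    """Build a shell string that opens a tmux window running `cmd` over ssh."""
    ssh_parts = ["ssh", "-o", "StrictHostKeyChecking=no", "-p", str(port), ip, cmd]
    # Quote each piece so the remote command survives as one ssh argument.
    ssh_cmd = " ".join(shlex.quote(part) for part in ssh_parts)
    # Everything the window should run, including the trailing bash -i
    # that keeps the window open after ssh returns.
    window_cmd = ssh_cmd + "; bash -i"
    # Quote once more so the local shell passes window_cmd to tmux intact.
    return "tmux neww -t " + shlex.quote(session + ":") + " " + shlex.quote(window_cmd)
```

The returned string can be passed to subprocess.check_call(..., shell = True) in place of the hand-concatenated one. Nested quotes inside cmd are escaped by shlex.quote, so commands that contain single quotes keep working.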

Finally, run the modified launcher directly on the login node. Several new windows should then appear in the status bar at the bottom of tmux.

login$ python launch.py ...

RPC via MPI

As in the previous section, wrap the command: put tmux neww at the beginning and ; bash -i at the end.

tmux new -s mpi
mpirun -n 4 tmux neww -t mpi: "python ...; bash -i"

With a Debugger

Debugging a distributed program is easier when each remote process runs in its own window with its own debugger attached.

PDB

PDB is built into Python and easy to use. In particular, a single explicitly inserted statement drops the program into interactive debugging mode. For example, try executing the following code.

import pdb
pdb.set_trace()

Your Python program will then pause and show an interactive prompt similar to gdb's.
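
In a distributed run, every process reaches the same set_trace() call, so every window stops at once. If only one process is of interest, the breakpoint can be guarded by the process rank. The following is a rough sketch under my own assumptions: the helper name break_on_rank is invented, and rank stands for whatever rank variable your program already has.

```python
import pdb


def break_on_rank(rank, target_rank=0, trap=pdb.set_trace):
    """Enter the debugger only in the process whose rank equals target_rank.

    `trap` is injectable so the helper can be exercised without a terminal.
    """
    if rank == target_rank:
        # pdb stops inside this helper first; step with `n` until it
        # returns to reach the calling code.
        trap()
        return True
    return False
```

Calling break_on_rank(rank, target_rank=2) would pause only the window of rank 2 while the other processes keep running.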

PUDB

This is basically PDB equipped with a TUI (text-based user interface), and its usage is very similar to PDB's. However, you have to install it before using it.

conda install pudb # Install by conda
pip install pudb # Install by pip

import pudb
pudb.set_trace()

However, the TUI relies heavily on features of a pseudo-tty and cannot work correctly without one. By default, SSH does not allocate a pseudo-tty when it is given a remote command to run instead of opening an interactive shell. Thus, the launcher needs one more modification.

Specify the SSH option -t to force pseudo-tty allocation.

def execute_remote(cmd, ip, port, thread_list):
    """execute command line on remote machine via ssh"""
    cmd = 'tmux neww -t dgl: ssh -t -o StrictHostKeyChecking=no -p ' + str(port) + ' ' + ip + ' \'' + cmd + '\'' + ';bash -i'
    ...