This tutorial shows how to use the map-by option of mpirun to handle complex scenarios such as running a hybrid program (a mixture of MPI and OpenMP).
Before starting
The behavior of MPI varies significantly with the environment (including the MPI implementation and version, dependent libraries, and job schedulers). All the experiments in this article were conducted on OpenMPI 4.0.2, so you may encounter unexpected problems if you use a different implementation or version of MPI. For example, OpenMPI 2.1.1, the default version bundled with Ubuntu 18.04, behaves strangely and fails to control the number of OpenMP threads when running a hybrid program. I therefore strongly recommend installing the latest version of MPI.
Test Environment
On the test platform, each machine has 2 NUMA nodes, 36 physical cores, and 72 hardware threads in total. The hybrid test program can be downloaded from https://rcc.uchicago.edu/docs/running-jobs/hybrid/index.html, and OpenMPI 4.0.2 and GCC 7.3.0 were installed from Anaconda.
map-by unit
This is the most basic form. unit can be one of hwthread, core, L1cache, L2cache, L3cache, socket, numa, board, or node. Note that hwthread means a hardware thread, while core means a physical core. The numa option is commonly used.
The following examples illustrate the differences between these options. To keep the output readable, PE=1 is added to limit the number of threads; it is explained in the section map-by unit:pe=n. --report-bindings is an OpenMPI-specific option that prints the bindings; see the Appendix for the equivalent usage in other MPI implementations.
map-by numa
mpirun -n 4 --map-by numa:PE=1 --report-bindings ./a.out
[asialab-01:69587] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69587] MCW rank 1 bound to socket 1[core 18[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../../../../../..]
[asialab-01:69587] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69587] MCW rank 3 bound to socket 1[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][../BB/../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 3 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 3 out of 4 on asialab-01
map-by hwthread
mpirun -n 4 --map-by hwthread:PE=1 --report-bindings ./a.out
[asialab-01:69621] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B./../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69621] MCW rank 1 bound to socket 0[core 0[hwt 1]]: [.B/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69621] MCW rank 2 bound to socket 0[core 1[hwt 0]]: [../B./../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69621] MCW rank 3 bound to socket 0[core 1[hwt 1]]: [../.B/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 1 from process 3 out of 4 on asialab-01
Hello from thread 0 out of 1 from process 2 out of 4 on asialab-01
Hello from thread 0 out of 1 from process 0 out of 4 on asialab-01
Hello from thread 0 out of 1 from process 1 out of 4 on asialab-01
map-by core
mpirun -n 4 --map-by core:PE=1 --report-bindings ./a.out
[asialab-01:69653] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69653] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69653] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:69653] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 3 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 3 out of 4 on asialab-01
Observe that with map-by numa, MPI takes the NUMA architecture into account and distributes the ranks alternately across the two NUMA nodes. With map-by core or map-by hwthread, MPI ignores NUMA and places the ranks on consecutive cores or hardware threads, which all sit on the first NUMA node in these examples.
You may notice that when map-by is set to hwthread, each rank gets only one OpenMP thread. The section bind-to unit partly explains this.
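The placement pattern in the three outputs above can be reproduced with a small toy model. The sketch below is written in Python purely for illustration; the constants and function names are hypothetical, the core numbering (0-17 on the first NUMA node, 18-35 on the second) is taken from the test machine, and the code only mimics the observed results rather than OpenMPI's actual mapping algorithm.

```python
# Toy model of the rank placement seen in the --report-bindings output above.
# Assumption: 2 NUMA nodes with 18 cores each, numbered 0-17 and 18-35.
# This mirrors the observed results; it is not OpenMPI's implementation.
CORES_PER_NUMA = 18
NUMA_NODES = 2


def map_by_numa(n_ranks):
    """Deal ranks alternately across NUMA nodes, next free core on each."""
    placement = []
    for rank in range(n_ranks):
        numa = rank % NUMA_NODES           # which NUMA node gets this rank
        index_in_numa = rank // NUMA_NODES  # how many ranks it already holds
        placement.append(numa * CORES_PER_NUMA + index_in_numa)
    return placement


def map_by_core(n_ranks):
    """Place ranks on consecutive cores, ignoring NUMA boundaries."""
    return list(range(n_ranks))


print(map_by_numa(4))  # [0, 18, 1, 19]
print(map_by_core(4))  # [0, 1, 2, 3]
```

The two printed lists match the core numbers reported for ranks 0 to 3 in the map-by numa and map-by core runs.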
bind-to unit
If this option is not specified, the default is core. Although this option is less important than map-by, it introduces several useful concepts. You may have heard the word slot; you can think of a slot as something that holds at most one rank. My understanding is that it is the slot that gets bound to the specified unit, such as a hardware thread or a physical core. Let us go through some examples.
bind-to hwthread
mpirun -n 4 --bind-to hwthread --map-by numa --report-bindings ./a.out
[asialab-01:71905] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B./../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:71905] MCW rank 1 bound to socket 1[core 18[hwt 0]]: [../../../../../../../../../../../../../../../../../..][B./../../../../../../../../../../../../../../../../..]
[asialab-01:71905] MCW rank 2 bound to socket 0[core 0[hwt 1]]: [.B/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:71905] MCW rank 3 bound to socket 1[core 18[hwt 1]]: [../../../../../../../../../../../../../../../../../..][.B/../../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 1 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 1 from process 2 out of 4 on asialab-01
Hello from thread 0 out of 1 from process 3 out of 4 on asialab-01
Hello from thread 0 out of 1 from process 0 out of 4 on asialab-01
bind-to core
mpirun -n 4 --bind-to core --map-by numa --report-bindings ./a.out
[asialab-01:71922] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:71922] MCW rank 1 bound to socket 1[core 18[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../../../../../..]
[asialab-01:71922] MCW rank 2 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:71922] MCW rank 3 bound to socket 1[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][../BB/../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 3 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 3 out of 4 on asialab-01
If each slot is bound to a hardware thread, each rank gets only one OpenMP thread. If each slot is bound to a physical core, the rank can use all the hardware threads of that core. If each slot is bound to a NUMA node, the rank can use all the hardware threads of that NUMA node.
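A process can check for itself how many hardware threads its binding covers by reading its CPU affinity mask. The following is a minimal sketch in Python, assuming a Linux system where os.sched_getaffinity is available; the helper name usable_cpus is hypothetical. Launched under mpirun with different bind-to settings, one would expect the reported count to follow the "out of N" thread numbers in the outputs above, although that depends on the MPI and OpenMP runtimes in use.

```python
import os


def usable_cpus():
    """Count the logical CPUs (hardware threads) this process may run on."""
    if hasattr(os, "sched_getaffinity"):     # available on Linux
        return len(os.sched_getaffinity(0))  # 0 means "the calling process"
    # Fallback for other platforms: total CPU count, which ignores any binding.
    return os.cpu_count() or 1


print("usable hardware threads:", usable_cpus())
```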
map-by unit:pe=n
The previous section introduced the concept of a slot. By default, each slot is bound to one physical core. This section digs into pe, which, according to one website, may stand for processing element. The concept is ambiguous; my understanding is that pe=n determines the number of units that each slot occupies. Here are some examples.
bind-to core, PE=1
mpirun -n 2 --bind-to core --map-by numa:PE=1 --report-bindings ./a.out
[asialab-01:72668] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:72668] MCW rank 1 bound to socket 1[core 18[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 2 from process 0 out of 2 on asialab-01
Hello from thread 1 out of 2 from process 0 out of 2 on asialab-01
Hello from thread 0 out of 2 from process 1 out of 2 on asialab-01
Hello from thread 1 out of 2 from process 1 out of 2 on asialab-01
bind-to core, PE=2
mpirun -n 2 --bind-to core --map-by numa:PE=2 --report-bindings ./a.out
[asialab-01:72700] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]: [BB/BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:72700] MCW rank 1 bound to socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/BB/../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 4 from process 0 out of 2 on asialab-01
Hello from thread 2 out of 4 from process 0 out of 2 on asialab-01
Hello from thread 3 out of 4 from process 0 out of 2 on asialab-01
Hello from thread 1 out of 4 from process 0 out of 2 on asialab-01
Hello from thread 2 out of 4 from process 1 out of 2 on asialab-01
Hello from thread 0 out of 4 from process 1 out of 2 on asialab-01
Hello from thread 1 out of 4 from process 1 out of 2 on asialab-01
Hello from thread 3 out of 4 from process 1 out of 2 on asialab-01
bind-to hwthread, PE=1
mpirun -n 2 --bind-to hwthread --map-by numa:PE=1 --report-bindings ./a.out
[asialab-01:72729] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B./../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:72729] MCW rank 1 bound to socket 1[core 18[hwt 0]]: [../../../../../../../../../../../../../../../../../..][B./../../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 1 from process 1 out of 2 on asialab-01
Hello from thread 0 out of 1 from process 0 out of 2 on asialab-01
bind-to hwthread, PE=2
mpirun -n 2 --bind-to hwthread --map-by numa:PE=2 --report-bindings ./a.out
[asialab-01:72740] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:72740] MCW rank 1 bound to socket 1[core 18[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 2 from process 0 out of 2 on asialab-01
Hello from thread 1 out of 2 from process 0 out of 2 on asialab-01
Hello from thread 0 out of 2 from process 1 out of 2 on asialab-01
Hello from thread 1 out of 2 from process 1 out of 2 on asialab-01
However, I failed to launch with the setting -n 1 --bind-to numa --map-by node:PE=2, although I expected a single rank to consume all the resources (equivalent to bind-to core with PE=36, or bind-to hwthread with PE=72). In any case, pe=n works well in combination with bind-to core or bind-to hwthread.
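The four runs above fit a simple rule: the number of hardware threads a rank occupies is pe multiplied by the number of hardware threads in one bind-to unit. The Python sketch below encodes that reading; the function name is hypothetical, the value of 2 hardware threads per core is taken from the test machine, and the rule is only an interpretation of the observed output, not a statement about OpenMPI internals.

```python
HWTHREADS_PER_CORE = 2  # assumption: matches the test machine


def hwthreads_per_rank(bind_to, pe):
    """Hardware threads one rank occupies for a bind-to unit and a pe value."""
    per_unit = {"hwthread": 1, "core": HWTHREADS_PER_CORE}[bind_to]
    return pe * per_unit


# The four combinations tested above, in the same order.
for bind_to, pe in [("core", 1), ("core", 2), ("hwthread", 1), ("hwthread", 2)]:
    print(f"bind-to {bind_to}, PE={pe}: {hwthreads_per_rank(bind_to, pe)}")
```

The results (2, 4, 1, 2) agree with the "out of N" thread counts printed by the test program in each run.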
map-by ppr:n:unit
ppr is short for processes per resource. A process here is essentially an MPI rank. This option limits the maximum number of ranks (that is, the number of slots) that each unit can hold. Let us verify this.
bind-to core, ppr:4
mpirun -n 4 --bind-to core --map-by ppr:4:numa --report-bindings ./a.out
[asialab-01:00520] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:00520] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:00520] MCW rank 2 bound to socket 0[core 2[hwt 0-1]]: [../../BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:00520] MCW rank 3 bound to socket 0[core 3[hwt 0-1]]: [../../../BB/../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 3 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 3 out of 4 on asialab-01
bind-to core, ppr:2
mpirun -n 4 --bind-to core --map-by ppr:2:numa --report-bindings ./a.out
[asialab-01:00541] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:00541] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:00541] MCW rank 2 bound to socket 1[core 18[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/../../../../../../../../../../../../../../../../..]
[asialab-01:00541] MCW rank 3 bound to socket 1[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][../BB/../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 2 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 3 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 3 out of 4 on asialab-01
Hello from thread 0 out of 2 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 2 from process 0 out of 4 on asialab-01
bind-to core, ppr:1
mpirun -n 4 --bind-to core --map-by ppr:1:numa --report-bindings ./a.out
--------------------------------------------------------------------------
Your job has requested more processes than the ppr for this topology can support:

App: ./a.out
Number of procs: 4
PPR: 1:numa

Please revise the conflict and try again.
--------------------------------------------------------------------------
Since each NUMA node can hold at most one MPI process here, and there are only two NUMA nodes in total, it is expected that launching four ranks fails.
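The three ppr runs can be summarized with a toy model: each NUMA node is filled with up to n ranks before the next one is used, and the launch is rejected when the request exceeds n times the number of NUMA nodes. The Python sketch below is only an illustration of that reading; the function name and the error text are hypothetical, and two NUMA nodes are assumed as on the test machine.

```python
def map_by_ppr(n_ranks, per_unit, n_units=2):
    """Return the NUMA node index of each rank under ppr:<per_unit>:numa.

    Assumption: units are filled in order, up to `per_unit` ranks each.
    Raises ValueError when the request exceeds per_unit * n_units.
    """
    capacity = per_unit * n_units
    if n_ranks > capacity:
        raise ValueError(
            f"requested {n_ranks} ranks, but ppr:{per_unit} allows only {capacity}"
        )
    return [rank // per_unit for rank in range(n_ranks)]


print(map_by_ppr(4, 4))  # [0, 0, 0, 0]
print(map_by_ppr(4, 2))  # [0, 0, 1, 1]
try:
    map_by_ppr(4, 1)
except ValueError as err:
    print("error:", err)
```

The first two lists match the socket numbers in the ppr:4 and ppr:2 bindings, and the third call fails for the same reason as the ppr:1 run.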
map-by ppr:n:unit:pe=n
This is the complete form of map-by. It introduces nothing new, so you should be able to explain the following more complex example. Hint: -host hostname:-1 lets MPI detect the number of available slots on the remote machine automatically.
mpirun -n 4 -host asialab-01:-1,asialab-03:-1 --bind-to hwthread --map-by ppr:1:numa:pe=4 --report-bindings ./a.out
[asialab-01:00614] MCW rank 0 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]: [BB/BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-01:00614] MCW rank 1 bound to socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/BB/../../../../../../../../../../../../../../../..]
[asialab-03:73603] MCW rank 2 bound to socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]: [BB/BB/../../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../../../..]
[asialab-03:73603] MCW rank 3 bound to socket 1[core 18[hwt 0-1]], socket 1[core 19[hwt 0-1]]: [../../../../../../../../../../../../../../../../../..][BB/BB/../../../../../../../../../../../../../../../..]
Hello from thread 0 out of 4 from process 2 out of 4 on asialab-03
Hello from thread 2 out of 4 from process 2 out of 4 on asialab-03
Hello from thread 3 out of 4 from process 2 out of 4 on asialab-03
Hello from thread 0 out of 4 from process 3 out of 4 on asialab-03
Hello from thread 2 out of 4 from process 3 out of 4 on asialab-03
Hello from thread 1 out of 4 from process 3 out of 4 on asialab-03
Hello from thread 3 out of 4 from process 3 out of 4 on asialab-03
Hello from thread 1 out of 4 from process 2 out of 4 on asialab-03
Hello from thread 0 out of 4 from process 0 out of 4 on asialab-01
Hello from thread 3 out of 4 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 4 from process 0 out of 4 on asialab-01
Hello from thread 2 out of 4 from process 0 out of 4 on asialab-01
Hello from thread 1 out of 4 from process 1 out of 4 on asialab-01
Hello from thread 0 out of 4 from process 1 out of 4 on asialab-01
Hello from thread 2 out of 4 from process 1 out of 4 on asialab-01
Hello from thread 3 out of 4 from process 1 out of 4 on asialab-01
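One way to reason about this example is to split the map-by value into its parts and apply the rules from the earlier sections. The Python sketch below does that for ppr:1:numa:pe=4; the parser parse_map_by is a hypothetical helper written for this article's syntax only, and the host and NUMA counts are taken from the test setup (two machines with two NUMA nodes each).

```python
def parse_map_by(spec):
    """Split a map-by value such as 'ppr:1:numa:pe=4' into its parts."""
    result = {"ppr": None, "unit": None, "pe": None}
    fields = spec.split(":")
    if fields[0].lower() == "ppr":       # optional 'ppr:<n>' prefix
        result["ppr"] = int(fields[1])
        fields = fields[2:]
    result["unit"] = fields[0].lower()   # e.g. numa, core, hwthread
    for modifier in fields[1:]:          # optional modifiers such as pe=<n>
        key, _, value = modifier.partition("=")
        if key.lower() == "pe":
            result["pe"] = int(value)
    return result


spec = parse_map_by("ppr:1:numa:pe=4")
hosts, numa_per_host = 2, 2              # asialab-01 and asialab-03
max_ranks = spec["ppr"] * numa_per_host * hosts
threads_per_rank = spec["pe"] * 1        # bind-to hwthread: 1 hardware thread per unit

print(spec)
print("max ranks:", max_ranks)
print("hardware threads per rank:", threads_per_rank)
```

Under these assumptions the job has room for exactly 4 ranks, one per NUMA node on each host, and every rank occupies 4 hardware threads (two full cores), which is what the bindings and the "out of 4" thread counts show.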