MPI Hands-On Exercises

Purpose

This hands-on session consists of two parts. The first part guides you through logging in to ACF computers. The second part provides a set of MPI programming exercises that demonstrate the key features of the Message Passing Interface and will help you understand the basics of MPI parallel programming.

Contents

ACF Environment Setup

To start editing, compiling and running the programs you'll first need to log in to ACF. To do that you'll need the temporary account NAME, DUO login, or PASSCODE that will be given to you during the workshop. ACF can be accessed via the Secure Shell protocol (SSH) using One-Time Password (OTP) authentication or DUO login. The instructions on how to activate your OTP token will be given to you in the same envelope as your token.

Access to compute resources on ACF is managed by the Portable Batch System (PBS). Batch scripts are run on service nodes that have access to the home, project and software directories. To run your parallel programs on ACF you'll use the mpirun command. Executables launched with mpirun do not have access to these three directories; they have access only to the Lustre scratch directories. Therefore, after you log in to ACF you'll need to copy the workshop folder with all of the exercises to your Lustre scratch directory. Remember, all input and output files must reside in that directory.

For your convenience the workshop exercises have already been copied to the /lustre/haven/user/kwong/TRAINING/ folder on ACF. You can copy this folder to your Lustre scratch directory by entering the following commands:

cd /lustre/haven/user/$USER
mkdir ACF-TRAINING
cd ACF-TRAINING
cp /lustre/haven/user/kwong/TRAINING/MPI_WORKSHOP.tar .
tar xvf MPI_WORKSHOP.tar
ls

Note that when the workshop ends, access to this folder will no longer be available.

In the newly copied MPI_WORKSHOP folder you will find three sub-folders: C/, CPP/, and Fortran/.

These folders contain all the material necessary for editing, compiling and running the MPI exercises written in C, C++ or Fortran. Select the directory that suits your programming preferences. You are now ready to proceed into the world of MPI.

Back to Contents

List of Exercises

During this workshop you will use the most fundamental MPI calls needed to develop any MPI code, and learn how to compile and run MPI codes on ACF. Each of the three aforementioned folders (C/, CPP/, Fortran/) contains one sub-folder per exercise.

Folder name   Description
hello1        MPI Hello world!
hello2        Hello From..! MPI_COMM_RANK
pical         Pi Calculation
timing        MPI Communication Timing Test
sendrecv      Passing a message: MPI_Send; MPI_Recv
deadlock      Deadlock Situation
deadlockf     Deadlock Situation - Fixed
pingpong      Ping-Pong - Calculating MPI Transfer Rates
ringnb        Ring (Non-blocking Communication)
ringbl        Ring (Blocking Communication)
array         Simple Array Assignment
matmul        Matrix Multiplication
laplace       2D Laplace Equation
struct        MPI Struct Derived Data Type
collectives   MPI Collective Communication Functions
group         MPI Group and Communicator Functions
cartesian     MPI Virtual Topology Functions

During the workshop you will encounter the following exercises:

C               Fortran              C++                Description
hello1.c        hello1.f             hello1.cpp         MPI Hello world!
hello2.c        hello2.f             hello2.cpp         Hello From..! MPI_COMM_RANK
pical.c         pical.f              Not yet available  Pi Calculation
timing.c        timing.f             Not yet available  MPI Communication Timing Test
pingpong.c      pingpong.f.template  Not yet available  Ping-Pong - Calculating MPI Transfer Rates
sendrecv.c      sendrecv.f           sendrecv.cpp       Passing a message: MPI_Send; MPI_Recv
deadlock.c      deadlock.f           deadlock.cpp       Deadlock Situation
deadlockf.c     deadlockf.f          deadlockf.cpp      Deadlock Situation - Fixed
ringnb.c        ringnb.f             Not yet available  Ring (Non-blocking Communication)
ringbl.c        ringbl.f             Not yet available  Ring (Blocking Communication)
array.c         array.f              Not yet available  Simple Array Assignment
matmul.c        matmul.f             Not yet available  Matrix Multiplication
laplace.c       laplace.f            Not yet available  2D Laplace Equation
struct.c        struct.f             Not yet available  MPI Struct Derived Data Type
collectives.c   collectives.f        Not yet available  MPI Collective Communication Functions
group.c         group.f              Not yet available  MPI Group and Communicator Functions
cartesian.c     cartesian.f          Not yet available  MPI Virtual Topology Functions

Back to Contents

Exercise Information

During the workshop exercises you may be asked to write code; if possible, write it on your own. To do so, use the files with the .start extension that are provided in the directories where programming is required.

For those of you wishing to concentrate only on the message passing aspects of the code, files with .template extensions have been provided in the exercise directories that require programming. To modify a template, first copy filename.c.template, filename.cpp.template, or filename.f.template to filename.c, filename.cpp or filename.f, depending on whether you are using C, C++ or FORTRAN. For example:

cp filename.c.template filename.c
cp filename.cpp.template filename.cpp
cp filename.f.template filename.f

Next, invoke your favorite text editor and modify the template by replacing all of the XXXXX's with the appropriate MPI calls.

For your convenience, there are completed solutions to the programming exercises available as files with .soln extensions. Remember, the only way to learn how to program is by actually programming, SO LOOK AT THE SOLUTIONS ONLY AS A LAST RESORT.

Back to Contents

Hello World 1 - The Minimal MPI Program

This exercise DOES NOT require programming

The objective of this exercise is not to write an MPI code but to demonstrate the fundamentals of compiling an MPI program and submitting it via qsub.

First, examine the "Hello World!" program hello1.c/hello1.cpp/hello1.f. Notice that every process prints "Hello World!" and that the "Hello World!" program:

  1. Includes a header,
  2. Initializes MPI,
  3. Prints a "Hello World!" message, and
  4. Finalizes MPI

Next, compile "Hello World!". For the version of MPI used here there are several ways to compile a program. We will use the Intel compiler wrappers mpiicc, mpiicpc and mpiifort to compile our C, C++ and FORTRAN programs, respectively. These wrappers locate the MPI libraries and header files as needed. You can compile either at the command line or with the provided makefile; if you use the makefile, please read it first to understand what it does.

In general, to compile a program called filename.c/filename.cpp/filename.f, enter one of the following at the command line (the default programming environment is Intel):

mpiicc filename.c -o filename or
mpiicpc filename.cpp -o filename or
mpiifort filename.f -o filename

where filename is the resulting executable.

To compile a program called filename.c/filename.cpp/filename.f using the provided makefile, enter the following:

make

in the appropriate directory. The resulting executable will be filename.

For the "Hello World!" program, enter either

mpiicc hello1.c -o hello1 for C programs
mpiicpc hello1.cpp -o hello1 for C++ programs
mpiifort hello1.f -o hello1 for FORTRAN programs

or, enter

make

Again, if you use the provided makefile, first make sure you understand what it is doing.

Now we want to run the "Hello World!" program. This elementary program will run on 12 processes, each of which is assigned a rank.

Since a parallel program runs on several processors, there needs to be some method of starting the program on each of them. On ACF this is done with batch scripts. A batch script runs a set of commands on a system's compute partition; it is a shell script containing PBS flags and commands to be interpreted by a shell. Batch scripts are submitted to the batch manager, PBS, where they are parsed. Based on the parsed data, PBS places the script in the queue as a job. Once the job makes its way through the queue, the script is executed on the head node of the allocated resources. The example batch script below shows how jobs are submitted, along with common usage tips.

Example Batch Script:

1: #!/bin/bash
2: #PBS -A ACF-XXXXX
3: #PBS -N test
4: #PBS -j oe
5: #PBS -l nodes=1:ppn=12,walltime=5:00
6:
7: cd $PBS_O_WORKDIR
8: date
9: mpirun -np 12 ./executable

This batch script can be broken down into the following sections:

  • Shell interpreter
    • Line 1
    • Can be used to specify an interpreting shell.
  • PBS commands
    • The PBS options will be read and used by PBS upon submission.
    • Lines 2–5
      • 2: The job will be charged to the ACF-XXXXX project.
      • 3: The job will be named “test.”
      • 4: The job's standard output and error will be combined.
      • 5: The job will request 12 cores for 5 minutes.
    • Please see the PBS Options page for more options.
  • Shell commands
    • Once the requested resources have been allocated, the shell commands will be executed on the head node of the allocated nodes.
    • Lines 6–9
      • 6: This line is left blank, so it will be ignored.
      • 7: This command will change directory into the script's submission directory. We assume here that the job was submitted from a directory in /lustre/haven/.
      • 8: This command will run the date command.
      • 9: This command will run the executable executable on 12 cores.

Submitting Batch Jobs

Batch scripts can be submitted for execution using the qsub command on ACF. For example, the following will submit the batch script named test.pbs:

qsub test.pbs

If successfully submitted, a PBS job ID will be returned. This ID can be used to track the job.

For more information about qsub see the man pages.

Look at the running job page for more (and some redundant) information. In particular, look at the PBS commands for submitting jobs, removing jobs from the queue, etc.

We will use the provided pbssub PBS script to submit our job for the hello1 exercise. After compiling the source code you will find an executable file in the current hello1 directory. To submit the job to the ACF queue, use the provided pbssub file, but be sure to examine it first.

Submit the job with the command:

qsub pbssub

Did the job enter the batch queue? Check with the showq command. Where did the job run?

Running this program on 12 processes will produce 12 lines of output, one from each process; the master's line differs slightly from the workers'.

Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World!!! (from masternode)
Hello, World! (from worker node)
Hello, World! (from worker node)

Back to Contents

Hello World 2 - Hello Again!

This exercise requires programming

The objective of this exercise is to demonstrate the most fundamental MPI calls used in almost any MPI program. You are asked to write an SPMD (Single Program, Multiple Data) program where, again, each process checks its rank and decides whether it is the master (rank 0) or a worker (rank 1 or greater).

The SPMD program should:

  1. Include the header,
  2. Initialize MPI,
  3. Check its rank, and
    • if the process is the master, then send a "Hello World!" message, in characters, to each of the workers
    • if the process is a worker, then receive the "Hello World!" message and print it out
  4. Finalize MPI

Compile your program at the command line or via the makefile. Run the code on 12 processes using the qsub pbssub command. You can also run it on 24, 48, etc. processes, keeping in mind that the number of requested cores must be a multiple of 12.

When compiled and run correctly the master process will send a copy of the "Hello,World!" message to each of the worker processes, which will then print it to the standard output.
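A minimal sketch of this master/worker exchange in C is shown below. It is an illustrative outline of the pattern described above (the message length 13 includes the terminating NUL), not the contents of the provided .template file.

```c
#include <mpi.h>
#include <stdio.h>
#include <string.h>

/* Sketch: rank 0 sends "Hello,World!" as characters to every other rank;
 * each worker receives it and prints it with its own rank. */
int main(int argc, char *argv[])
{
    char msg[13];
    int rank, size, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                              /* master */
        strcpy(msg, "Hello,World!");
        for (i = 1; i < size; i++)
            MPI_Send(msg, 13, MPI_CHAR, i, 0, MPI_COMM_WORLD);
    } else {                                      /* worker */
        MPI_Recv(msg, 13, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d : %s\n", rank, msg);
    }

    MPI_Finalize();
    return 0;
}
```

Launched with mpirun -np 12, this produces one line per worker, in arbitrary order.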

Running this program on 12 processes will output 11 lines similar to the ones below.

Process 1 : Hello,World!
Process 4 : Hello,World!
Process 2 : Hello,World!
Process 3 : Hello,World!
Process 5 : Hello,World!
Process 6 : Hello,World!
Process 8 : Hello,World!
Process 7 : Hello,World!
Process 9 : Hello,World!
Process 10 : Hello,World!
Process 11 : Hello,World!

Back to Contents

Pi Calculation

This exercise DOES NOT require programming

This program calculates π by integrating f(x) = 4/(1+x^2) over [0,1]. The area under the curve is divided into rectangles, and the rectangles are distributed among the processors.

  • Each node:
    1. receives the number of rectangles used in the approximation,
    2. calculates the areas of its rectangles,
    3. synchronizes for a global summation.
  • Node 0 prints the result.
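The serial core of the computation is just a midpoint-rule sum; the sketch below shows the arithmetic each rank performs on its share of rectangles (the function name approx_pi is illustrative, not taken from the exercise code).

```c
/* Midpoint-rule approximation of pi = integral of 4/(1+x^2) over [0,1],
 * using n rectangles. In the MPI version each rank sums a subset of the
 * rectangles and the partial sums are combined with a global reduction. */
double approx_pi(int n)
{
    double h = 1.0 / n, sum = 0.0;
    for (int i = 0; i < n; i++) {
        double x = h * (i + 0.5);      /* midpoint of rectangle i */
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;
}
```

With n = 10000 the midpoint-rule error is h²/12 ≈ 8.3e-10, which matches the error printed in the sample output below.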

Your tasks are:

  • Review the pical.c/pical.cpp/pical.f code to get an idea of what it does.
  • Compile the code. A makefile has been provided for you; to compile with it, simply type:

make

  • Or compile at the command line, using the mpiicc compiler for the C code, the mpiicpc compiler for the C++ code, or the mpiifort compiler for the FORTRAN code:

mpiicc pical.c -o pical
mpiicpc pical.cpp -o pical
mpiifort pical.f -o pical

  • Run the resulting executable pical using the provided pbssub file.

Running this program using 12 processes will have an output similar to the following

Process 1 of 12 on nid11652
Process 11 of 12 on nid11652
Process 10 of 12 on nid11652
Process 3 of 12 on nid11652
Process 5 of 12 on nid11652
Process 0 of 12 on nid11652
Process 7 of 12 on nid11652
Process 8 of 12 on nid11652
Process 4 of 12 on nid11652
Process 9 of 12 on nid11652
Process 6 of 12 on nid11652
Process 2 of 12 on nid11652
pi is approximately 3.1415926544231252, Error is 0.0000000008333321
wall clock time = 0.000610

Back to Contents

MPI Communication Timing Test

This exercise DOES NOT require programming

The objective of this exercise is to investigate the amount of time required for message passing between two processes, i.e. an MPI communication timing test is performed.

In this exercise different size messages are sent back and forth between two processes a number of times. Timings are made for each message before it is sent and after it has been received. The difference is computed to obtain the actual communication time. Finally, the average communication time and the bandwidth are calculated and output to the screen.

For example, one can run this code on two nodes (one process on each node) passing messages of length 1, 100, 10,000, and 1,000,000 and record the results in a table like the one below.

Length      Communication Time (Sec)   Communication Bandwidth (Megabit/Sec)
1                 0.000001                      73.403187
100               0.000002                    3240.799903
10,000            0.000046                   13947.887101
1,000,000         0.003633                   17616.862072
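The bandwidth figures follow from the message size and the measured time; judging by the byte rates, the printed times are in seconds (despite the uSec label in the program output), and each round trip moves the message twice across the link. A small sketch of the arithmetic (names are illustrative):

```c
/* Bandwidth from a round-trip measurement: the message crosses the link
 * twice per round trip, so rate = 2 * message_bytes / round_trip_sec. */
double bandwidth_bytes_per_sec(double message_bytes, double round_trip_sec)
{
    return 2.0 * message_bytes / round_trip_sec;
}

/* Convert Byte/Sec to Megabit/Sec: 8 bits per byte, 1e6 bits per Megabit. */
double megabits_per_sec(double bytes_per_sec)
{
    return bytes_per_sec * 8.0 / 1.0e6;
}
```

For the last table row, bandwidth_bytes_per_sec(4.0e6, 0.003633) gives about 2.20e9 Byte/Sec, i.e. roughly 17,617 Megabit/Sec, consistent with the table up to rounding of the printed time.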

The makefile and pbssub files have been provided.

  1. Compile the code by typing make
  2. Submit the job by typing qsub pbssub

Running this program using 2 processes will have an output similar to the following

Doing round trip test for:
Message length = 1 integer value(s)
Message size   = 4 Bytes
Number of Reps = 1000
*** Round Trip Avg = 0.000001 uSec
*** Bandwidth = 9175398.414001 Byte/Sec
              = 73.403187 Megabit/Sec


Doing round trip test for:
Message length = 100 integer value(s)
Message size   = 400 Bytes
Number of Reps = 1000
*** Round Trip Avg = 0.000002 uSec
*** Bandwidth = 405099987.927080 Byte/Sec
              = 3240.799903 Megabit/Sec


Doing round trip test for:
Message length = 10000 integer value(s)
Message size   = 40000 Bytes
Number of Reps = 1000
*** Round Trip Avg = 0.000046 uSec
*** Bandwidth = 1743485887.683419 Byte/Sec
              = 13947.887101 Megabit/Sec


Doing round trip test for:
Message length = 1000000 integer value(s)
Message size   = 4000000 Bytes
Number of Reps = 1000
*** Round Trip Avg = 0.003633 uSec
*** Bandwidth = 2202107758.953355 Byte/Sec
              = 17616.862072 Megabit/Sec

Back to Contents

Passing a message: MPI_Send; MPI_Recv

This exercise requires programming

Now that you've learned the two basic MPI_Send and MPI_Recv communication calls, you are asked to write a simple program that performs communication between two processors. Processor 0 sends the message Hello, World! using the blocking MPI_Send to processor 1, which receives it using the blocking MPI_Recv.

Running this program on any number of processors should output 1 line similar to the one below

Process 1 : Hello, World!

Back to Contents

Deadlock Situation

This exercise DOES NOT require programming

This example shows improper use of blocking calls, resulting in deadlock when run on two nodes. All tasks are simply waiting for events that haven't been initiated. Try to understand why this is happening. Also, before proceeding to the next exercise, try to think of a possible fix to this deadlock situation.

Please do not modify the requested wall time (1 min) in the pbssub file; otherwise the job will idle and waste compute resources. After 1 minute your job will be killed automatically.

Running this program using 2 processes will have an output similar to the following

Task 1 has sent the message Task 0 has sent the message

Back to Contents

Deadlock Situation - Fixed

This exercise DOES NOT require programming

This program is the solution to the deadlock example above showing the use of a non-blocking send to eliminate deadlock.
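The core of the fix can be sketched in C as follows. This is a hedged outline of the pattern (a non-blocking send posted before the blocking receive), not the exercise code itself; error checking is omitted.

```c
#include <mpi.h>
#include <stdio.h>

/* Each task posts a non-blocking send first, so neither blocks waiting
 * for the other's receive; the matching blocking receives then complete. */
int main(int argc, char *argv[])
{
    int rank, msg, partner;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    partner = 1 - rank;                       /* run with exactly 2 tasks */

    MPI_Isend(&rank, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &req);
    printf("Task %d has sent the message\n", rank);
    MPI_Recv(&msg, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Task %d has received the message\n", rank);
    MPI_Wait(&req, MPI_STATUS_IGNORE);        /* complete the send before exit */

    MPI_Finalize();
    return 0;
}
```

Because MPI_Isend returns immediately, both tasks reach their receives, and the exchange completes regardless of whether the underlying send is buffered.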

Running this program using 2 processes will have an output similar to the following

Task 0 has sent the message
Task 1 has sent the message
Task 0 has received the message
Task 1 has received the message

Back to Contents

Ping-Pong - Calculating MPI Transfer Rates

This exercise requires programming

Before we ask you to write a code using blocking and non-blocking calls, let's write another simple program (pingpong) in which two processes send packets of information back and forth a number of times and record the amount of time required. The packets of information consist of an array of dummy floating point numbers whose length varies from one floating point number to 10,001 floating point numbers. These packets are sent back and forth between the two processes 100 times and the total time required is recorded. From this timing data the average time per send/receive is computed, as well as the transfer rate (bandwidth) in bytes/sec.

First of all, you'll need to use the MPI_Wtime() routine for timing the calls. This routine returns the time expressed in seconds (see man MPI_Wtime). In order to time something two MPI_Wtime() calls are needed and the difference between the returned values should be taken to obtain the total elapsed time (in wall clock seconds).

After you've inserted two MPI_Wtime() calls to estimate the average time taken for one message to complete a one way trip, calculate the transfer rate in bytes per second. What did you find?

Now, add a loop around the timing calls changing the length of the message (length varies from 1 to 10001 in increments of 1000) to investigate how the time taken varies with the size of the message.

Running this program using 2 processes should have the output similar to the following

length   time/message (sec)    transfer rate (byte/sec)
1             0.000002               1955386.480186
1001          0.000004               936642120.245399
2001          0.000007               1146557691.803279
3001          0.000010               1238126772.802164
4001          0.000012               1282859841.681797
5001          0.000014               1419796213.148320
6001          0.000015               1583891657.610320
7001          0.000018               1574600029.707085
8001          0.000021               1554719773.175817
9001          0.000022               1604459426.434339
10001         0.000023               1707427874.386893

Back to Contents

Ring (Non-blocking Communication)

This exercise requires programming

In this program each processor communicates its rank around a ring. The sum of all ranks is then accumulated and printed out by each processor.

Consider a set of processes arranged in a ring as shown below. Use a token passing method to compute the sum of the ranks of the processes.

   1
 /   \ 
0     2
 \   /
   3

Figure 1: Four processes arranged in a ring. Messages are sent from 0 to 1 to 2 to 3 to 0 again, sum of ranks is 6.

Each processor stores its rank in MPI_COMM_WORLD as an integer and sends this value to the processor on its right. It then receives an integer from its left neighbor. It keeps track of the sum of all the integers received. The processors continue passing on the values they receive until they get their own rank back. Each process should finish by printing out the sum of the values.

For this program use synchronous non-blocking send MPI_Issend(). Make sure that you do not overwrite information.

You are asked to use synchronous message passing because the standard send can be either buffered or synchronous, and you should learn to program for either possibility.
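Whichever communication scheme you implement, the final value is easy to verify: the sum of the ranks 0..P-1 is P(P-1)/2. A tiny helper for checking your output (illustrative, not part of the exercise code):

```c
/* Expected result of the ring exercise: sum of ranks 0..nprocs-1. */
int ring_sum(int nprocs)
{
    return nprocs * (nprocs - 1) / 2;
}
```

For 12 processes this gives 66, as in the sample output below; for the 4-process ring in Figure 1 it gives 6.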

Running this program using 12 processes should have an output similar to the following

Proc 10 sum = 66 
Proc 2 sum = 66 
Proc 0 sum = 66 
Proc 4 sum = 66 
Proc 7 sum = 66 
Proc 6 sum = 66 
Proc 1 sum = 66 
Proc 3 sum = 66 
Proc 8 sum = 66 
Proc 9 sum = 66 
Proc 11 sum = 66 
Proc 5 sum = 66 

Back to Contents

Ring (Blocking Communication)

This exercise requires programming

In this program each processor communicates its rank around a ring. The sum of all ranks is then accumulated and printed out by each processor.

Consider a set of processes arranged in a ring as shown below. Use a token passing method to compute the sum of the ranks of the processes.

   1
 /   \ 
0     2
 \   /
   3

Figure 1: Four processes arranged in a ring. Messages are sent from 0 to 1 to 2 to 3 to 0 again, sum of ranks is 6.

Each processor stores its rank in MPI_COMM_WORLD as an integer and sends this value to the processor on its right. It then receives an integer from its left neighbor. It keeps track of the sum of all the integers received. The processors continue passing on the values they receive until they get their own rank back. Each process should finish by printing out the sum of the values.

For this program use synchronous blocking send MPI_Ssend(). Watch out for deadlock situations.

You are asked to use synchronous message passing because the standard send can be either buffered or synchronous, and you should learn to program for either possibility.

Running this program using 12 processes should have an output similar to the following

Proc 10 sum = 66 
Proc 2 sum = 66 
Proc 0 sum = 66 
Proc 4 sum = 66 
Proc 7 sum = 66 
Proc 6 sum = 66 
Proc 1 sum = 66 
Proc 3 sum = 66 
Proc 8 sum = 66 
Proc 9 sum = 66 
Proc 11 sum = 66 
Proc 5 sum = 66 

Back to Contents

Simple Array Assignment

This exercise requires programming

This is a simple array assignment used to demonstrate the distribution of data among multiple tasks and the communications required to accomplish that distribution.

The master task initiates numtasks-1 worker tasks and then distributes an equal portion of the array to each worker. Each worker receives its portion of the array and performs a simple value assignment to each of its elements: the value assigned to each element is simply that element's index in the array plus 1. Each worker then sends its portion of the array back to the master, and as the master receives a portion from each worker, selected elements are displayed.

Note: For this example, the number of processes should be set to an odd number (e.g. mpirun -np 11) to ensure even distribution of the array to the numtasks-1 worker tasks.
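The even split behind the sample output below can be sketched as follows; ARRAYSIZE = 60000 and the function name are assumptions inferred from the printed indices (each of the 10 workers handles a 6000-element chunk).

```c
#define ARRAYSIZE 60000   /* assumption inferred from the sample indices below */

/* Starting offset of worker w's chunk, for workers ranked 1..numworkers,
 * when ARRAYSIZE elements are split evenly among them. */
int chunk_offset(int w, int numworkers)
{
    int chunksize = ARRAYSIZE / numworkers;
    return (w - 1) * chunksize;
}
```

With 10 workers this places worker 1 at index 0, worker 2 at 6000, and worker 10 at 54000, matching the sample results.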

Running the solution program using 11 processes will have an output similar to the following

MASTER: number of worker tasks will be= 10
Sending to worker task 1
Sending to worker task 2
Sending to worker task 3
Sending to worker task 4
Sending to worker task 5
Sending to worker task 6
Sending to worker task 7
Sending to worker task 8
Sending to worker task 9
Sending to worker task 10
---------------------------------------------------
MASTER: Sample results from worker task = 1
   result[0]=1.000000
   result[100]=101.000000
   result[1000]=1001.000000

---------------------------------------------------
MASTER: Sample results from worker task = 2
   result[6000]=6001.000000
   result[6100]=6101.000000
   result[7000]=7001.000000

---------------------------------------------------
MASTER: Sample results from worker task = 3
   result[12000]=12001.000000
   result[12100]=12101.000000
   result[13000]=13001.000000

---------------------------------------------------
MASTER: Sample results from worker task = 4
   result[18000]=18001.000000
   result[18100]=18101.000000
   result[19000]=19001.000000

---------------------------------------------------
MASTER: Sample results from worker task = 5
   result[24000]=24001.000000
   result[24100]=24101.000000
   result[25000]=25001.000000

---------------------------------------------------
MASTER: Sample results from worker task = 6
   result[30000]=30001.000000
   result[30100]=30101.000000
   result[31000]=31001.000000

---------------------------------------------------
MASTER: Sample results from worker task = 7
   result[36000]=36001.000000
   result[36100]=36101.000000
   result[37000]=37001.000000

---------------------------------------------------
MASTER: Sample results from worker task = 8
   result[42000]=42001.000000
   result[42100]=42101.000000
   result[43000]=43001.000000

---------------------------------------------------
MASTER: Sample results from worker task = 9
   result[48000]=48001.000000
   result[48100]=48101.000000
   result[49000]=49001.000000

---------------------------------------------------
MASTER: Sample results from worker task = 10
   result[54000]=54001.000000
   result[54100]=54101.000000
   result[55000]=55001.000000

MASTER: All Done!

Back to Contents

Matrix Multiplication

This exercise requires programming

This example is a simple matrix multiplication program, i.e. A x B = C. Matrix A is copied to every processor, while matrix B is divided into blocks and distributed among the processors. Each worker multiplies its block and sends its result back to the master.

Note: The C and FORTRAN versions of this code differ because of the way arrays are stored/passed. C arrays are stored in row-major order while FORTRAN arrays are stored in column-major order.
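The 10/9/9/... split in the sample output below is the usual remainder-handling pattern: with 100 rows (an assumption consistent with the output) and 11 workers, 100 = 9*11 + 1, so the first worker gets one extra row. A sketch of that logic (the names averow/extra are illustrative):

```c
/* Rows assigned to worker 'dest' (ranked 1..numworkers) when nra rows
 * are split among numworkers tasks: the first 'extra' workers each get
 * one extra row to absorb the remainder. */
int rows_for_worker(int dest, int nra, int numworkers)
{
    int averow = nra / numworkers;
    int extra  = nra % numworkers;
    return (dest <= extra) ? averow + 1 : averow;
}
```

Summing this over all 11 workers recovers the full 100 rows, so no row is dropped or duplicated.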

Running this program using 12 processes (one master and 11 workers) will have an output similar to the following

Number of worker tasks = 11
   sending 10 rows to task 1
   sending 9 rows to task 2
   sending 9 rows to task 3
   sending 9 rows to task 4
   sending 9 rows to task 5
   sending 9 rows to task 6
   sending 9 rows to task 7
   sending 9 rows to task 8
   sending 9 rows to task 9
   sending 9 rows to task 10
   sending 9 rows to task 11
Here are the first 30 rows of the result (C) matrix

0.00   1015.00   2030.00   3045.00   4060.00   5075.00   6090.00
0.00   1120.00   2240.00   3360.00   4480.00   5600.00   6720.00
0.00   1225.00   2450.00   3675.00   4900.00   6125.00   7350.00
0.00   1330.00   2660.00   3990.00   5320.00   6650.00   7980.00
0.00   1435.00   2870.00   4305.00   5740.00   7175.00   8610.00
0.00   1540.00   3080.00   4620.00   6160.00   7700.00   9240.00
0.00   1645.00   3290.00   4935.00   6580.00   8225.00   9870.00
0.00   1750.00   3500.00   5250.00   7000.00   8750.00  10500.00
0.00   1855.00   3710.00   5565.00   7420.00   9275.00  11130.00
0.00   1960.00   3920.00   5880.00   7840.00   9800.00  11760.00
0.00   2065.00   4130.00   6195.00   8260.00  10325.00  12390.00
0.00   2170.00   4340.00   6510.00   8680.00  10850.00  13020.00
0.00   2275.00   4550.00   6825.00   9100.00  11375.00  13650.00
0.00   2380.00   4760.00   7140.00   9520.00  11900.00  14280.00
0.00   2485.00   4970.00   7455.00   9940.00  12425.00  14910.00
0.00   2590.00   5180.00   7770.00  10360.00  12950.00  15540.00
0.00   2695.00   5390.00   8085.00  10780.00  13475.00  16170.00
0.00   2800.00   5600.00   8400.00  11200.00  14000.00  16800.00
0.00   2905.00   5810.00   8715.00  11620.00  14525.00  17430.00
0.00   3010.00   6020.00   9030.00  12040.00  15050.00  18060.00
0.00   3115.00   6230.00   9345.00  12460.00  15575.00  18690.00
0.00   3220.00   6440.00   9660.00  12880.00  16100.00  19320.00
0.00   3325.00   6650.00   9975.00  13300.00  16625.00  19950.00
0.00   3430.00   6860.00  10290.00  13720.00  17150.00  20580.00
0.00   3535.00   7070.00  10605.00  14140.00  17675.00  21210.00
0.00   3640.00   7280.00  10920.00  14560.00  18200.00  21840.00
0.00   3745.00   7490.00  11235.00  14980.00  18725.00  22470.00
0.00   3850.00   7700.00  11550.00  15400.00  19250.00  23100.00
0.00   3955.00   7910.00  11865.00  15820.00  19775.00  23730.00
0.00   4060.00   8120.00  12180.00  16240.00  20300.00  24360.00

Back to Contents

2D Laplace Equation

This exercise requires programming

This example solves a two-dimensional Laplace equation using the point Jacobi iteration method over a rectangular domain.

    2-D Laplace Equation
    T is initially 0.0 everywhere except at the boundaries where
    T=100.

            T=100        0,1000_______________1000,1000
           ________           |   |   |   |   |       
          |        |          |   |   |   |   |
          |        |          |   |   |   |   |
    T=100 | T=0.0  | T=100    | 0 | 1 | 2 | 3 |
          |        |          |   |   |   |   |
          |________|          |___|___|___|___|
            T=100           0,0               1000,0  


    Use the Central-Difference Method (5-point stencil).
    Each process operates on its subgrid (250x1000) only and then
    sends its boundaries to its nearest neighbors.

The initial guess value of the function is zero. The boundaries are held at 100 throughout the calculation. Domain decomposition is used for parallel implementation of the problem.
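A single point Jacobi update replaces each interior value with the average of its four neighbors. A serial sketch of one sweep (the grid size N is illustrative; the exercise grid is 1000x1000, split into 250x1000 subgrids):

```c
#define N 8   /* illustrative grid size; the exercise uses a 1000x1000 grid */

/* One Jacobi sweep of the 5-point stencil: each interior point becomes
 * the average of its four neighbors; boundary points are left untouched
 * (they are held at T=100 throughout the calculation). */
void jacobi_sweep(double t[N][N], double tnew[N][N])
{
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            tnew[i][j] = 0.25 * (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]);
}
```

In the parallel version each process sweeps only its own subgrid and exchanges the rows along the cut lines with its neighbors before the next sweep.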

The solution at point T(10,10), which belongs to process 0, is printed every 100 iterations

Run this program on 4 processors.

Running this program using 4 processes will have an output similar to the following:

Iter =  100: PE = 0: t[10][10] =          29.17421532
Iter =  200: PE = 0: t[10][10] =          53.50511932
Iter =  300: PE = 0: t[10][10] =          65.75184631
Iter =  400: PE = 0: t[10][10] =          72.95095062
Iter =  500: PE = 0: t[10][10] =          77.66552734
Iter =  600: PE = 0: t[10][10] =          80.98654175
Iter =  700: PE = 0: t[10][10] =          83.45037079
Iter =  800: PE = 0: t[10][10] =          85.35018921
Iter =  900: PE = 0: t[10][10] =          86.85943604
Iter = 1000: PE = 0: t[10][10] =          88.08715057

Back to Contents

MPI Struct Derived Data Type

This exercise DOES NOT require programming

This program demonstrates the proper use of the MPI struct derived data type. The code creates a data type that represents a particle and distributes an array of such particles to all processes. Each particle has the following six characteristics:

  1. X coordinate
  2. Y coordinate
  3. Z coordinate
  4. Velocity
  5. ID number
  6. Type specification

As a consequence, the data type consists of four floating point values followed by two integer values.
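The corresponding C struct looks like the following; the struct and field names are assumptions matching the list above. When building the MPI derived type, the displacement of each block must match the member offsets, which can be obtained with offsetof.

```c
#include <stddef.h>

/* Four floats followed by two ints, as described above. An MPI struct
 * derived type for this layout would use two blocks: 4 x MPI_FLOAT at
 * offsetof(Particle, x) and 2 x MPI_INT at offsetof(Particle, id). */
typedef struct {
    float x, y, z;    /* coordinates */
    float velocity;
    int   id;
    int   type;
} Particle;
```

On common ABIs all members are 4-byte aligned, so the struct has no internal padding and the integer block starts right after the four floats.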

Running this program using 12 processes will have an output similar to the following:

rank= 1   3.00 -3.00 3.00 0.25 3 1
rank= 4   3.00 -3.00 3.00 0.25 3 1
rank= 2   3.00 -3.00 3.00 0.25 3 1
rank= 5   3.00 -3.00 3.00 0.25 3 1
rank= 6   3.00 -3.00 3.00 0.25 3 1
rank= 8   3.00 -3.00 3.00 0.25 3 1
rank= 7   3.00 -3.00 3.00 0.25 3 1
rank= 9   3.00 -3.00 3.00 0.25 3 1
rank= 10   3.00 -3.00 3.00 0.25 3 1
rank= 0   3.00 -3.00 3.00 0.25 3 1
rank= 11   3.00 -3.00 3.00 0.25 3 1
rank= 3   3.00 -3.00 3.00 0.25 3 1

Back to Contents

MPI Collective Communication Functions

This exercise DOES NOT require programming

This program demonstrates the functionality of the most commonly used collective MPI functions, namely:

MPI_Gather
MPI_Gatherv
MPI_Allgather
MPI_Allgatherv
MPI_Scatter
MPI_Alltoall
MPI_Reduce
MPI_Allreduce
MPI_Bcast

Please note that the program is designed to run with at most 4 processes.

Running this program with 4 processes produces output similar to the following:

 Function  Proc  Sendbuf              Recvbuf
 --------  ----  -------              -------
MPI_Gather    :  0   a                          
MPI_Gather    :  3   d                          
MPI_Gather    :  2   c                          
MPI_Gather    :  1   b                a  b  c  d
MPI_Gatherv   :  3   g  h  i  j                                   
MPI_Gatherv   :  0   a                                            
MPI_Gatherv   :  1   b  c             a  b  c  d  e  f  g  h  i  j
MPI_Gatherv   :  2   d  e  f                                      
MPI_Allgather :  0   a                a  b  c  d
MPI_Allgather :  1   b  c             a  b  c  d
MPI_Allgather :  3   d  h  i  j       a  b  c  d
MPI_Allgather :  2   c  e  f          a  b  c  d
MPI_Allgatherv:  0   a                a  b  c  d  e  f  g  h  i  j
MPI_Allgatherv:  1   b  c             a  b  c  d  e  f  g  h  i  j
MPI_Allgatherv:  2   d  e  f          a  b  c  d  e  f  g  h  i  j
MPI_Allgatherv:  3   g  h  i  j       a  b  c  d  e  f  g  h  i  j
MPI_Scatter   :  1   e  f  g  h       f
MPI_Scatter   :  0   a  b  c  d       e
MPI_Scatter   :  3   m  n  o  p       h
MPI_Scatter   :  2   i  j  k  l       g
MPI_Alltoall  :  0   a  b  c  d       a  e  i  m
MPI_Alltoall  :  3   m  n  o  p       d  h  l  p
MPI_Alltoall  :  1   e  f  g  h       b  f  j  n
MPI_Alltoall  :  2   i  j  k  l       c  g  k  o
MPI_Reduce MAX:  2   i  j  k  l
MPI_Reduce MAX:  0   a  b  c  d
MPI_Reduce MAX:  3   m  n  o  p
MPI_Reduce MAX:  1   e  f  g  h       m  n  o  p
MPI_Allreduce :  1   e  f  g  h       m  n  o  p
MPI_Allreduce :  0   a  b  c  d       m  n  o  p
MPI_Allreduce :  3   m  n  o  p       m  n  o  p
MPI_Allreduce :  2   i  j  k  l       m  n  o  p
MPI_Bcast     :  1   b                b
MPI_Bcast     :  3                    b
MPI_Bcast     :  0                    b
MPI_Bcast     :  2                    b

Back to Contents

MPI Group and Communicator Functions

This exercise DOES NOT require programming

This program demonstrates the typical usage of MPI groups and communicators. The sample code creates two different process groups for separate collective communications exchange. This also requires creating new communicators.

The flow of the code can be summarized as follows:

  1. Extract handle of global group from MPI_COMM_WORLD using MPI_Comm_group
  2. Form new group as a subset of global group using MPI_Group_incl
  3. Create new communicator for new group using MPI_Comm_create
  4. Determine new rank in new communicator using MPI_Comm_rank
  5. Conduct communications using any MPI message passing routine
  6. When finished, free up new communicator and group (optional) using MPI_Comm_free and MPI_Group_free

Running this program with 12 processes produces output similar to the following:

rank= 1 newrank= 1 recvbuf= 15
rank= 3 newrank= 3 recvbuf= 15
rank= 0 newrank= 0 recvbuf= 15
rank= 2 newrank= 2 recvbuf= 15
rank= 8 newrank= 2 recvbuf= 51
rank= 5 newrank= 5 recvbuf= 15
rank= 7 newrank= 1 recvbuf= 51
rank= 10 newrank= 4 recvbuf= 51
rank= 6 newrank= 0 recvbuf= 51
rank= 9 newrank= 3 recvbuf= 51
rank= 11 newrank= 5 recvbuf= 51
rank= 4 newrank= 4 recvbuf= 15

Back to Contents

MPI Virtual Topology Functions

This exercise DOES NOT require programming

This program demonstrates the typical usage of an MPI Cartesian virtual topology. The sample code creates a 4 x 4 Cartesian topology from 16 processes and has each process exchange its rank with its four neighbors.

 _______________________________
|   0   |   1   |   2   |   3   |
|       |       |       |       |
| (0,0) | (0,1) | (0,2) | (0,3) |
|_______|_______|_______|_______|
|   4   |   5   |   6   |   7   |
|       |       |       |       |
| (1,0) | (1,1) | (1,2) | (1,3) |
|_______|_______|_______|_______|
|   8   |   9   |  10   |  11   |
|       |       |       |       |
| (2,0) | (2,1) | (2,2) | (2,3) |
|_______|_______|_______|_______|
|  12   |  13   |  14   |  15   |
|       |       |       |       |
| (3,0) | (3,1) | (3,2) | (3,3) |
|_______|_______|_______|_______|

Depending on the periodicity of the Cartesian group in the specified coordinate direction, MPI_CART_SHIFT provides the identifiers for a circular or an end-off shift. In the case of an end-off shift, the value MPI_PROC_NULL may be returned in rank_source or rank_dest, indicating that the source or the destination for the shift is out of range.

Running this program with 16 processes produces output similar to the following:

rank =  3 coords =  0 3 neighbors(u,d,l,r) = -1  7  2 -1
rank =  1 coords =  0 1 neighbors(u,d,l,r) = -1  5  0  2
rank =  1                   inbuf(u,d,l,r) = -1  5  0  2
rank =  2 coords =  0 2 neighbors(u,d,l,r) = -1  6  1  3
rank = 15 coords =  3 3 neighbors(u,d,l,r) = 11 -1 14 -1
rank = 14 coords =  3 2 neighbors(u,d,l,r) = 10 -1 13 15
rank = 15                   inbuf(u,d,l,r) = 11 -1 14 -1
rank = 13 coords =  3 1 neighbors(u,d,l,r) =  9 -1 12 14
rank = 14                   inbuf(u,d,l,r) = 10 -1 13 15
rank = 13                   inbuf(u,d,l,r) =  9 -1 12 14
rank =  9 coords =  2 1 neighbors(u,d,l,r) =  5 13  8 10
rank =  4 coords =  1 0 neighbors(u,d,l,r) =  0  8 -1  5
rank = 11 coords =  2 3 neighbors(u,d,l,r) =  7 15 10 -1
rank =  2                   inbuf(u,d,l,r) = -1  6  1  3
rank = 10 coords =  2 2 neighbors(u,d,l,r) =  6 14  9 11
rank = 10                   inbuf(u,d,l,r) =  6 14  9 11
rank =  8 coords =  2 0 neighbors(u,d,l,r) =  4 12 -1  9
rank =  0 coords =  0 0 neighbors(u,d,l,r) = -1  4 -1  1
rank =  5 coords =  1 1 neighbors(u,d,l,r) =  1  9  4  6
rank =  7 coords =  1 3 neighbors(u,d,l,r) =  3 11  6 -1
rank =  9                   inbuf(u,d,l,r) =  5 13  8 10
rank =  6 coords =  1 2 neighbors(u,d,l,r) =  2 10  5  7
rank =  4                   inbuf(u,d,l,r) =  0  8 -1  5
rank = 11                   inbuf(u,d,l,r) =  7 15 10 -1
rank =  0                   inbuf(u,d,l,r) = -1  4 -1  1
rank =  5                   inbuf(u,d,l,r) =  1  9  4  6
rank =  7                   inbuf(u,d,l,r) =  3 11  6 -1
rank =  6                   inbuf(u,d,l,r) =  2 10  5  7
rank =  8                   inbuf(u,d,l,r) =  4 12 -1  9
rank =  3                   inbuf(u,d,l,r) = -1  7  2 -1
rank = 12 coords =  3 0 neighbors(u,d,l,r) =  8 -1 -1 13
rank = 12                   inbuf(u,d,l,r) =  8 -1 -1 13

Back to Contents

Logging Off

Type exit to close the connection with the ACF supercomputer.

Back to Contents

Documentation for MPI and Additional Resources

Back to Contents

Acknowledgments

  • The original MPI training materials for workstations were developed under the Joint Information Systems Committee (JISC) New Technologies Initiative by the Training and Education Centre at Edinburgh Parallel Computing Centre (EPCC-TEC), University of Edinburgh, United Kingdom.
  • NICS staff at UTK/ORNL and their resources
  • All contributors to the art and science of parallel computing

Back to Contents