Purpose
This hands-on session consists of two parts. The first part guides you through logging in to ACF computers. The second part provides a set of MPI programming exercises designed to teach you the basic ideas of MPI parallel programming by demonstrating the key features of the Message Passing Interface through sample programs.
Contents
- Introduction
- Selected MPI Codes
- Hello World 1 - MPI Hello world!
- Hello World 2 - Hello From..! MPI_COMM_RANK
- Pi Calculation
- MPI Communication Timing Test
- Passing a message: MPI_Send; MPI_Recv
- Deadlock Situation
- Deadlock Situation - Fixed
- Ping-Pong - Calculating MPI Transfer Rates
- Ring (Non-blocking Communication)
- Ring (Blocking Communication)
- Simple Array Assignment
- Matrix Multiplication
- 2D Laplace Equation
- MPI Struct Derived Data Type
- MPI Collective Communication Functions
- MPI Group and Communicator Functions
- MPI Virtual Topology Functions
- Final Remarks
ACF Environment Setup
To start editing, compiling and running the programs, you'll first need to log in to a parallel computer such as one provided by XSEDE. To do that, you'll need the account name, DUO login, and passcode given to you during the workshop.
In the newly copied MPI_WORKSHOP folder you will find three sub-folders: C/, CPP/ and Fortran/.
These folders contain all the material necessary for editing, compiling and running the MPI exercises written in C, C++ or Fortran. Select the directory that suits your programming preferences. You are now ready to proceed into the world of MPI.
List of Exercises
During this workshop you will use the most fundamental MPI calls needed to develop any MPI code, and learn how to compile and run MPI codes. Each of the three aforementioned folders (C/, CPP/, Fortran/) contains a set of sub-folders, one per exercise.
Folder name | Description |
---|---|
hello1 | MPI Hello world! |
hello2 | Hello From..! MPI_COMM_RANK |
pical | Pi Calculation |
timing | MPI Communication Timing Test |
sendrecv | Passing a message: MPI_Send; MPI_Recv |
deadlock | Deadlock Situation |
deadlockf | Deadlock Situation - Fixed |
pingpong | Ping-Pong - Calculating MPI Transfer Rates |
ringnb | Ring (Non-blocking Communication) |
ringbl | Ring (Blocking Communication) |
array | Simple Array Assignment |
matmul | Matrix Multiplication |
laplace | 2D Laplace Equation |
struct | MPI Struct Derived Data Type |
collectives | MPI Collective Communication Functions |
group | MPI Group and Communicator Functions |
cartesian | MPI Virtual Topology Functions |
Exercise Information
During the workshop exercises you may be asked to write code; if possible, write it on your own. To do so, use the files with the .start extensions that are provided in the directories where programming is required.
For those of you wishing to concentrate only on the message passing aspects of the code, files with .template extensions have been provided in the exercise directories that require programming. To modify the template files, first copy filename.c.template, filename.cpp.template, or filename.f.template to filename.c, filename.cpp, or filename.f, depending on whether you are using C, C++ or FORTRAN. For example:
cp filename.c.template filename.c
cp filename.cpp.template filename.cpp
cp filename.f.template filename.f
Next, invoke your favorite text editor and modify the template by replacing all of the XXXXX's with the appropriate MPI calls.
For your convenience, there are completed solutions to the programming exercises available as files with .soln extensions. Remember, the only way to learn how to program is by actually programming, SO LOOK AT THE SOLUTIONS ONLY AS A LAST RESORT.
Hello World 1 - The Minimal MPI Program
This exercise DOES NOT require programming
The objective of this exercise is not to write an MPI code but to demonstrate the fundamentals of compiling an MPI program and submitting it via qsub.
First, examine the "Hello World!" program hello1.c/hello1.cpp/hello1.f. Notice that every process prints "Hello World!" and that the "Hello World!" program:
- Includes a header,
- Initializes MPI,
- Prints a "Hello World!" message, and
- Finalizes MPI
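The four steps above can be sketched as a minimal C program. This is an illustrative sketch, not necessarily identical to the provided hello1.c; it cannot be run without an MPI installation and a launcher such as mpirun:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);      /* initialize MPI */
    printf("Hello, World!\n");   /* every process prints this line */
    MPI_Finalize();              /* finalize MPI */
    return 0;
}
```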
Next, compile "Hello World!". For the version of MPI that we are using here there are several ways to compile a program. We will use the Intel compilers mpiicc, mpiicpc and mpiifort to compile our C, C++ and FORTRAN programs, respectively. These compilers locate the MPI libraries and header files as needed. To compile, you can either compile at the terminal's prompt or use the provided makefile. If you use the provided makefile, please read it first to understand what it is doing.
In general, to compile a program called filename.c/filename.cpp/filename.f, at the command line you need to enter the following. The default PE environment is Intel:
mpiicc filename.c -o filename
or
mpiicpc filename.cpp -o filename
or
mpiifort filename.f -o filename
where filename is the resulting executable.
To compile a program called filename.c/filename.cpp/filename.f using the provided makefile, enter the following:
make
in the appropriate directory. The resulting executable will be filename.
For the "Hello World!" program, enter either
mpiicc hello1.c -o hello1 (for C programs)
mpiicpc hello1.cpp -o hello1 (for C++ programs)
mpiifort hello1.f -o hello1 (for FORTRAN programs)
or, enter
make
Again, if you use the provided makefile, first make sure you understand what it is doing.
Now we want to run the "Hello World!" program. This elementary problem will use 12 processors and will assign a rank to each of them.
Since a parallel program runs on several processors, there needs to be some method of starting the program on each of the different processors. On ACF this is done using batch scripts. A batch script runs a set of commands on a system's compute partition; it is a shell script containing PBS flags and commands to be interpreted by a shell. Batch scripts are submitted to the batch manager, PBS, where they are parsed. Based on the parsed data, PBS places the script in the queue as a job. Once the job makes its way through the queue, the script is executed on the head node of the allocated resources. The batch script example below explains how to submit batch scripts and gives common usage tips.
Example Batch Script:
1: #!/bin/bash
2: #PBS -A ACF-XXXXX
3: #PBS -N test
4: #PBS -j oe
5: #PBS -l nodes=1:ppn=12,walltime=5:00
6:
7: cd $PBS_O_WORKDIR
8: date
9: mpirun -np 12 ./executable
This batch script can be broken down into the following sections:
- Shell interpreter
- Line 1
- Can be used to specify an interpreting shell.
- PBS commands
- The PBS options will be read and used by PBS upon submission.
- Lines 2–5
- 2: The job will be charged to the ACF-XXXXX project.
- 3: The job will be named “test.”
- 4: The job's standard output and error will be combined.
- 5: The job will request 12 cores for 5 minutes.
- Please see the PBS Options page for more options.
- Shell commands
- Once the requested resources have been allocated, the shell commands will be executed on the head node of the allocated nodes.
- Lines 6–9
- 6: This line is left blank, so it will be ignored.
- 7: This command will change directory into the script's submission directory. We assume here that the job was submitted from a directory in /lustre/haven/.
- 8: This command will run the date command.
- 9: This command will run the executable executable on 12 cores via mpirun.
Submitting Batch Jobs
Batch scripts can be submitted for execution using the qsub command on ACF. For example, the following will submit the batch script named test.pbs:
qsub test.pbs
If successfully submitted, a PBS job ID will be returned. This ID can be used to track the job.
For more information about qsub see the man pages.
Look at the running job page for more (and some redundant) information. In particular, look at the PBS commands for submitting jobs, removing jobs from the queue, etc.
We will use the PBS pbssub script to submit our job for the hello1 exercise. After compiling the source code you will find an executable file in the current hello1 directory. To submit the job to the ACF queue use the provided pbssub file, but first be sure to examine the file.
Submit the job with the command:
qsub pbssub
Did the job enter the batch queue? Check with the showq command. Where did the job run?
Running this program on 12 processes will output 12 lines, one per process, varying with each process's rank.
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello, World!!! (from masternode)
Hello, World! (from worker node)
Hello, World! (from worker node)
Hello World 2 - Hello Again!
This exercise requires programming
The objective of this exercise is to demonstrate the most fundamental MPI calls used in almost any MPI program. You are asked to write an SPMD (Single Program, Multiple Data) program where, again, each process checks its rank and decides whether it is the master (if its rank is 0) or a worker (if its rank is 1 or greater).
The SPMD programs should:
- Include the header,
- Initialize MPI,
- Check its rank, and
- if the process is the master, then send a "Hello World!"message, in characters, to each of the workers
- if the process is a worker, then receive the "Hello World!"message and print it out
- Finalize MPI
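The steps above can be sketched as follows. This is an illustrative outline of one possible solution, not the exercise's own .soln file; requires MPI to compile and run:

```c
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    char msg[13];                            /* "Hello,World!" + '\0' */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* master: send the message, in characters, to every worker */
        strcpy(msg, "Hello,World!");
        for (int dest = 1; dest < size; dest++)
            MPI_Send(msg, 13, MPI_CHAR, dest, 0, MPI_COMM_WORLD);
    } else {
        /* worker: receive the message and print it */
        MPI_Recv(msg, 13, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Process %d : %s\n", rank, msg);
    }
    MPI_Finalize();
    return 0;
}
```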
Compile your program at the command line or via the makefile. Run the code on 12 processes using the qsub pbssub command. You can also run it on 24, 48, etc. processes, keeping in mind that the number of requested cores must be a multiple of 12.
When compiled and run correctly the master process will send a copy of the "Hello,World!" message to each of the worker processes, which will then print it to the standard output.
Running this program on 12 processes will output 11 lines similar to the ones below.
Process 1 : Hello,World!
Process 4 : Hello,World!
Process 2 : Hello,World!
Process 3 : Hello,World!
Process 5 : Hello,World!
Process 6 : Hello,World!
Process 8 : Hello,World!
Process 7 : Hello,World!
Process 9 : Hello,World!
Process 10 : Hello,World!
Process 11 : Hello,World!
Pi Calculation
This exercise DOES NOT require programming
This program calculates π by integrating f(x) = 4/(1+x^2) over [0,1]. The area under the curve is divided into rectangles, and the rectangles are distributed to the processors.
- Each node:
- receives the number of rectangles used in the approximation,
- calculates the areas of its rectangles, and
- synchronizes for a global summation.
- Node 0 prints the result.
Your tasks are:
- Review the pical.c/pical.cpp/pical.f code to get an idea of what it does.
- Compile the code. A makefile has been provided for you. To compile with the makefile, simply type:
make
- Or, to compile at the command line, use the mpiicc compiler for the C code, the mpiicpc compiler for the C++ code, or the mpiifort compiler for the FORTRAN code:
mpiicc pical.c -o pical
mpiicpc pical.cpp -o pical
mpiifort pical.f -o pical
- Run the code pical using the provided pbssub file.
Running this program using 12 processes will have an output similar to the following
Process 1 of 12 on nid11652
Process 11 of 12 on nid11652
Process 10 of 12 on nid11652
Process 3 of 12 on nid11652
Process 5 of 12 on nid11652
Process 0 of 12 on nid11652
Process 7 of 12 on nid11652
Process 8 of 12 on nid11652
Process 4 of 12 on nid11652
Process 9 of 12 on nid11652
Process 6 of 12 on nid11652
Process 2 of 12 on nid11652
pi is approximately 3.1415926544231252, Error is 0.0000000008333321
wall clock time = 0.000610
MPI Communication Timing Test
This exercise DOES NOT require programming
The objective of this exercise is to investigate the amount of time required for message passing between two processes, i.e. an MPI communication timing test is performed.
In this exercise different size messages are sent back and forth between two processes a number of times. Timings are made for each message before it is sent and after it has been received. The difference is computed to obtain the actual communication time. Finally, the average communication time and the bandwidth are calculated and output to the screen.
For example, one can run this code on two nodes (one process on each node) passing messages of length 1, 100, 10,000, and 1,000,000 and record the results in a table like the one below.
Length | Communication Time (μSec) | Communication Bandwidth (Megabit/Sec) |
---|---|---|
1 | 0.000001 | 73.403187 |
100 | 0.000002 | 3240.799903 |
10,000 | 0.000046 | 13947.887101 |
1,000,000 | 0.003633 | 17616.862072 |
The makefile and pbssub files have been provided.
- Compile the code by typing make
- Submit the job by typing qsub pbssub
Running this program using 2 processes will have an output similar to the following
Doing round trip test for:
Message length = 1 integer value(s)
Message size = 4 Bytes
Number of Reps = 1000
*** Round Trip Avg = 0.000001 uSec
*** Bandwidth = 9175398.414001 Byte/Sec = 73.403187 Megabit/Sec

Doing round trip test for:
Message length = 100 integer value(s)
Message size = 400 Bytes
Number of Reps = 1000
*** Round Trip Avg = 0.000002 uSec
*** Bandwidth = 405099987.927080 Byte/Sec = 3240.799903 Megabit/Sec

Doing round trip test for:
Message length = 10000 integer value(s)
Message size = 40000 Bytes
Number of Reps = 1000
*** Round Trip Avg = 0.000046 uSec
*** Bandwidth = 1743485887.683419 Byte/Sec = 13947.887101 Megabit/Sec

Doing round trip test for:
Message length = 1000000 integer value(s)
Message size = 4000000 Bytes
Number of Reps = 1000
*** Round Trip Avg = 0.003633 uSec
*** Bandwidth = 2202107758.953355 Byte/Sec = 17616.862072 Megabit/Sec
Passing a message: MPI_Send; MPI_Recv
This exercise requires programming
Now that you've learned the two basic MPI_Send and MPI_Recv communication calls, you are asked to write a simple program that performs communication between two processors. Processor 0 sends a message Hello, World! using the blocking MPI_Send to processor 1, which receives this message using the blocking receive MPI_Recv.
Running this program on any number of processors should output 1 line similar to the one below
Process 1 : Hello, World!
Deadlock Situation
This exercise DOES NOT require programming
This example shows improper use of blocking calls, resulting in deadlock when run on two processes. All tasks are simply waiting for events that haven't been initiated. Try to understand why this is happening. Also, before proceeding to the next exercise, try to think of a possible fix to this deadlock situation.
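One classic pattern that produces this kind of deadlock is sketched below (illustrative only; the actual exercise code may differ, and the print wording here is informal). Both ranks issue a blocking synchronous send first, so each waits forever for a receive that is never posted:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, sendbuf, recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int other = 1 - rank;                 /* assumes exactly 2 processes */

    sendbuf = rank;
    printf("Task %d is sending the message\n", rank);
    /* Both ranks block here: a synchronous send cannot complete until
       the matching receive is posted, and neither rank ever reaches
       its MPI_Recv below -> deadlock. */
    MPI_Ssend(&sendbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
    MPI_Recv(&recvbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```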
Please do not modify the requested wall time (1 min) in the pbssub file; otherwise the job will idle and waste compute resources. After 1 minute your job will be killed automatically.
Running this program using 2 processes will have an output similar to the following
Task 1 has sent the message
Task 0 has sent the message
Deadlock Situation - Fixed
This exercise DOES NOT require programming
This program is the solution to the deadlock example above showing the use of a non-blocking send to eliminate deadlock.
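One common shape of the fix, sketched below (illustrative, not necessarily the provided solution code): the non-blocking send returns immediately, so both ranks go on to post their receives, and the transfer completes in MPI_Wait.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, sendbuf, recvbuf;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int other = 1 - rank;                 /* assumes exactly 2 processes */

    sendbuf = rank;
    /* non-blocking send: returns at once, transfer finishes later */
    MPI_Isend(&sendbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &req);
    printf("Task %d has sent the message\n", rank);
    MPI_Recv(&recvbuf, 1, MPI_INT, other, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    MPI_Wait(&req, MPI_STATUS_IGNORE);    /* complete the send */
    printf("Task %d has received the message\n", rank);

    MPI_Finalize();
    return 0;
}
```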
Running this program using 2 processes will have an output similar to the following
Task 0 has sent the message
Task 1 has sent the message
Task 0 has received the message
Task 1 has received the message
Ping-Pong - Calculating MPI Transfer Rates
This exercise requires programming
Before we ask you to write code using blocking and non-blocking calls, let's write another simple program (pingpong) in which two processes send packets of information back and forth a number of times and record the amount of time required. The packets consist of an array of dummy floating point numbers, varying in length from one floating point number to 10,001 floating point numbers. These packets are sent back and forth from one process to another 100 times and the total amount of time required is recorded. From this timing data, the average amount of time per send/receive is computed, as well as the transfer rate (bandwidth) in bytes/sec.
First of all, you'll need to use the MPI_Wtime() routine for timing the calls. This routine returns the time expressed in seconds (see man MPI_Wtime). In order to time something two MPI_Wtime() calls are needed and the difference between the returned values should be taken to obtain the total elapsed time (in wall clock seconds).
After you've inserted two MPI_Wtime() calls to estimate the average time taken for one message to complete a one way trip, calculate the transfer rate in bytes per second. What did you find?
Now, add a loop around the timing calls changing the length of the message (length varies from 1 to 10001 in increments of 1000) to investigate how the time taken varies with the size of the message.
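Putting the pieces together, the timing structure might look like the sketch below (illustrative; assumes exactly 2 processes, with rank 0 as the "ping" side). The two MPI_Wtime() calls bracket the repetition loop, and the outer loop varies the message length:

```c
#include <stdio.h>
#include <mpi.h>

#define REPS 100

int main(int argc, char *argv[])
{
    int rank;
    float buf[10001];                    /* dummy payload */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int length = 1; length <= 10001; length += 1000) {
        double t1 = MPI_Wtime();
        for (int rep = 0; rep < REPS; rep++) {
            if (rank == 0) {             /* ping */
                MPI_Send(buf, length, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, length, MPI_FLOAT, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {                     /* pong */
                MPI_Recv(buf, length, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, length, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t2 = MPI_Wtime();
        /* REPS round trips = 2*REPS one-way messages */
        double avg = (t2 - t1) / (2.0 * REPS);
        if (rank == 0)
            printf("%6d  %f  %f\n", length, avg,
                   length * sizeof(float) / avg);
    }
    MPI_Finalize();
    return 0;
}
```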
Running this program using 2 processes should have the output similar to the following
length  time/message (sec)  transfer rate (byte/sec)
     1  0.000002  1955386.480186
  1001  0.000004  936642120.245399
  2001  0.000007  1146557691.803279
  3001  0.000010  1238126772.802164
  4001  0.000012  1282859841.681797
  5001  0.000014  1419796213.148320
  6001  0.000015  1583891657.610320
  7001  0.000018  1574600029.707085
  8001  0.000021  1554719773.175817
  9001  0.000022  1604459426.434339
 10001  0.000023  1707427874.386893
Ring (Non-blocking Communication)
This exercise requires programming
This program allows a processor to communicate its rank around a ring. The sum of all ranks is then accumulated and printed out by each processor.
Consider a set of processes arranged in a ring as shown below. Use a token passing method to compute the sum of the ranks of the processes.
   1
  / \
 0   2
  \ /
   3
Figure 1: Four processes arranged in a ring. Messages are sent from 0 to 1 to 2 to 3 to 0 again, sum of ranks is 6.
Each processor stores its rank in MPI_COMM_WORLD as an integer and sends this value to the processor on its right. It then receives an integer from its left neighbor. It keeps track of the sum of all the integers received. The processors continue passing on the values they receive until they get their own rank back. Each process should finish by printing out the sum of the values.
For this program use synchronous non-blocking send MPI_Issend(). Make sure that you do not overwrite information.
You are asked to use synchronous message passing because the standard send can be either buffered or synchronous, and you should learn to program for either possibility.
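The token-passing loop described above might be sketched as follows (illustrative, not the .soln file). Separate send and receive buffers ensure the value being sent is not overwritten while the non-blocking send is still in flight:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;        /* neighbor to send to */
    int left  = (rank - 1 + size) % size; /* neighbor to receive from */
    int token = rank, incoming, sum = 0;

    /* pass tokens around until our own rank comes back: after `size`
       steps every rank has received every rank exactly once */
    for (int step = 0; step < size; step++) {
        MPI_Request req;
        MPI_Issend(&token, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &req);
        MPI_Recv(&incoming, 1, MPI_INT, left, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* send buffer reusable now */
        sum += incoming;
        token = incoming;                   /* forward what we received */
    }
    printf("Proc %d sum = %d\n", rank, sum);
    MPI_Finalize();
    return 0;
}
```

On 12 processes the sum of ranks 0 through 11 is 66, matching the sample output below.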
Running this program using 12 processes should have an output similar to the following
Proc 10 sum = 66
Proc 2 sum = 66
Proc 0 sum = 66
Proc 4 sum = 66
Proc 7 sum = 66
Proc 6 sum = 66
Proc 1 sum = 66
Proc 3 sum = 66
Proc 8 sum = 66
Proc 9 sum = 66
Proc 11 sum = 66
Proc 5 sum = 66
Ring (Blocking Communication)
This exercise requires programming
This program allows a processor to communicate its rank around a ring. The sum of all ranks is then accumulated and printed out by each processor.
Consider a set of processes arranged in a ring as shown below. Use a token passing method to compute the sum of the ranks of the processes.
   1
  / \
 0   2
  \ /
   3
Figure 1: Four processes arranged in a ring. Messages are sent from 0 to 1 to 2 to 3 to 0 again, sum of ranks is 6.
Each processor stores its rank in MPI_COMM_WORLD as an integer and sends this value to the processor on its right. It then receives an integer from its left neighbor. It keeps track of the sum of all the integers received. The processors continue passing on the values they receive until they get their own rank back. Each process should finish by printing out the sum of the values.
For this program use synchronous blocking send MPI_Ssend(). Watch out for deadlock situations.
You are asked to use synchronous message passing because the standard send can be either buffered or synchronous, and you should learn to program for either possibility.
Running this program using 12 processes should have an output similar to the following
Proc 10 sum = 66
Proc 2 sum = 66
Proc 0 sum = 66
Proc 4 sum = 66
Proc 7 sum = 66
Proc 6 sum = 66
Proc 1 sum = 66
Proc 3 sum = 66
Proc 8 sum = 66
Proc 9 sum = 66
Proc 11 sum = 66
Proc 5 sum = 66
Simple Array Assignment
This exercise requires programming
This is a simple array assignment used to demonstrate the distribution of data among multiple tasks and the communications required to accomplish that distribution.
The master task initiates numtasks-1 worker tasks and then distributes an equal portion of the array to each worker. Each worker receives its portion of the array and performs a simple value assignment on each of its elements: the value assigned to each element is simply that element's index in the array plus 1. Each worker then sends its portion of the array back to the master. As the master receives a portion of the array from each worker, selected elements are displayed.
Note: For this example, the number of processes should be set to an odd number (aprun -n 11), to ensure even distribution of the array to numtasks-1 worker tasks.
Running the solution program using 11 processes will have an output similar to the following
MASTER: number of worker tasks will be= 10
Sending to worker task 1
Sending to worker task 2
Sending to worker task 3
Sending to worker task 4
Sending to worker task 5
Sending to worker task 6
Sending to worker task 7
Sending to worker task 8
Sending to worker task 9
Sending to worker task 10
---------------------------------------------------
MASTER: Sample results from worker task = 1
result[0]=1.000000
result[100]=101.000000
result[1000]=1001.000000
---------------------------------------------------
MASTER: Sample results from worker task = 2
result[6000]=6001.000000
result[6100]=6101.000000
result[7000]=7001.000000
---------------------------------------------------
MASTER: Sample results from worker task = 3
result[12000]=12001.000000
result[12100]=12101.000000
result[13000]=13001.000000
---------------------------------------------------
MASTER: Sample results from worker task = 4
result[18000]=18001.000000
result[18100]=18101.000000
result[19000]=19001.000000
---------------------------------------------------
MASTER: Sample results from worker task = 5
result[24000]=24001.000000
result[24100]=24101.000000
result[25000]=25001.000000
---------------------------------------------------
MASTER: Sample results from worker task = 6
result[30000]=30001.000000
result[30100]=30101.000000
result[31000]=31001.000000
---------------------------------------------------
MASTER: Sample results from worker task = 7
result[36000]=36001.000000
result[36100]=36101.000000
result[37000]=37001.000000
---------------------------------------------------
MASTER: Sample results from worker task = 8
result[42000]=42001.000000
result[42100]=42101.000000
result[43000]=43001.000000
---------------------------------------------------
MASTER: Sample results from worker task = 9
result[48000]=48001.000000
result[48100]=48101.000000
result[49000]=49001.000000
---------------------------------------------------
MASTER: Sample results from worker task = 10
result[54000]=54001.000000
result[54100]=54101.000000
result[55000]=55001.000000
MASTER: All Done!
Matrix Multiplication
This exercise requires programming
This example is a simple matrix multiplication program, i.e. AxB=C. Matrix A is copied to every processor, while matrix B is divided into blocks and distributed among the workers. The workers perform the actual multiplication on their blocks and send their results back to the master.
Note: The C and FORTRAN versions of this code differ because of the way arrays are stored/passed. C arrays are stored in row-major order while FORTRAN arrays are stored in column-major order.
Running this program using 11 processes will have an output similar to the following
Number of worker tasks = 11
sending 10 rows to task 1
sending 9 rows to task 2
sending 9 rows to task 3
sending 9 rows to task 4
sending 9 rows to task 5
sending 9 rows to task 6
sending 9 rows to task 7
sending 9 rows to task 8
sending 9 rows to task 9
sending 9 rows to task 10
sending 9 rows to task 11
Here are the first 30 rows of the result (C) matrix
0.00 1015.00 2030.00 3045.00 4060.00 5075.00 6090.00
0.00 1120.00 2240.00 3360.00 4480.00 5600.00 6720.00
0.00 1225.00 2450.00 3675.00 4900.00 6125.00 7350.00
0.00 1330.00 2660.00 3990.00 5320.00 6650.00 7980.00
0.00 1435.00 2870.00 4305.00 5740.00 7175.00 8610.00
0.00 1540.00 3080.00 4620.00 6160.00 7700.00 9240.00
0.00 1645.00 3290.00 4935.00 6580.00 8225.00 9870.00
0.00 1750.00 3500.00 5250.00 7000.00 8750.00 10500.00
0.00 1855.00 3710.00 5565.00 7420.00 9275.00 11130.00
0.00 1960.00 3920.00 5880.00 7840.00 9800.00 11760.00
0.00 2065.00 4130.00 6195.00 8260.00 10325.00 12390.00
0.00 2170.00 4340.00 6510.00 8680.00 10850.00 13020.00
0.00 2275.00 4550.00 6825.00 9100.00 11375.00 13650.00
0.00 2380.00 4760.00 7140.00 9520.00 11900.00 14280.00
0.00 2485.00 4970.00 7455.00 9940.00 12425.00 14910.00
0.00 2590.00 5180.00 7770.00 10360.00 12950.00 15540.00
0.00 2695.00 5390.00 8085.00 10780.00 13475.00 16170.00
0.00 2800.00 5600.00 8400.00 11200.00 14000.00 16800.00
0.00 2905.00 5810.00 8715.00 11620.00 14525.00 17430.00
0.00 3010.00 6020.00 9030.00 12040.00 15050.00 18060.00
0.00 3115.00 6230.00 9345.00 12460.00 15575.00 18690.00
0.00 3220.00 6440.00 9660.00 12880.00 16100.00 19320.00
0.00 3325.00 6650.00 9975.00 13300.00 16625.00 19950.00
0.00 3430.00 6860.00 10290.00 13720.00 17150.00 20580.00
0.00 3535.00 7070.00 10605.00 14140.00 17675.00 21210.00
0.00 3640.00 7280.00 10920.00 14560.00 18200.00 21840.00
0.00 3745.00 7490.00 11235.00 14980.00 18725.00 22470.00
0.00 3850.00 7700.00 11550.00 15400.00 19250.00 23100.00
0.00 3955.00 7910.00 11865.00 15820.00 19775.00 23730.00
0.00 4060.00 8120.00 12180.00 16240.00 20300.00 24360.00
2D Laplace Equation
This exercise requires programming
This example solves a two-dimensional Laplace equation using the point Jacobi iteration method over a rectangular domain.
2-D Laplace Equation. T is initially 0.0 everywhere except at the boundaries, where T=100.

            T=100
0,1000 ________________ 1000,1000
      |                |
      |                |
T=100 |     T=0.0      | T=100
      |                |
      |________________|
  0,0       T=100        1000,0

The domain is decomposed into vertical strips, one per process:

 ___ ___ ___ ___
|   |   |   |   |
| 0 | 1 | 2 | 3 |
|___|___|___|___|

Use the Central-Difference Method (5-point stencil). Each process operates on its own subgrid (250x1000) only and then sends its boundaries to its nearest neighbors.
The initial guess value of the function is zero. The boundaries are held at 100 throughout the calculation. Domain decomposition is used for parallel implementation of the problem.
The solution at point T(10,10), which belongs to process 0, is printed every 100 iterations.
Run this program on 4 processors.
Running this program using 4 processes will have an output similar to the following:
Iter = 100: PE = 0: t[10][10] = 29.17421532
Iter = 200: PE = 0: t[10][10] = 53.50511932
Iter = 300: PE = 0: t[10][10] = 65.75184631
Iter = 400: PE = 0: t[10][10] = 72.95095062
Iter = 500: PE = 0: t[10][10] = 77.66552734
Iter = 600: PE = 0: t[10][10] = 80.98654175
Iter = 700: PE = 0: t[10][10] = 83.45037079
Iter = 800: PE = 0: t[10][10] = 85.35018921
Iter = 900: PE = 0: t[10][10] = 86.85943604
Iter = 1000: PE = 0: t[10][10] = 88.08715057
MPI Struct Derived Data Type
This exercise DOES NOT require programming
This program demonstrates the proper use of MPI struct derived data type. The code creates a data type that represents a particle and distributes an array of such particles to all processes. Each particle has the following six characteristics:
- X coordinate
- Y coordinate
- Z coordinate
- Velocity
- ID number
- Type specification
As a consequence, the data type consists of four floating point values followed by two integer values.
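A sketch of how such a type can be built and used is shown below. This is illustrative only: the struct layout matches the description above, but the exercise's own code may construct and distribute the particles differently (this sketch uses a broadcast for brevity):

```c
#include <stdio.h>
#include <stddef.h>
#include <mpi.h>

typedef struct {
    float x, y, z, velocity;   /* four floats ... */
    int n, type;               /* ... followed by two ints */
} Particle;

int main(int argc, char *argv[])
{
    int rank;
    Particle p = { 3.0f, -3.0f, 3.0f, 0.25f, 3, 1 };
    MPI_Datatype particletype;
    int blockcounts[2] = { 4, 2 };
    MPI_Datatype types[2] = { MPI_FLOAT, MPI_INT };
    MPI_Aint offsets[2] = { offsetof(Particle, x), offsetof(Particle, n) };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* describe the struct to MPI: 4 floats, then 2 ints */
    MPI_Type_create_struct(2, blockcounts, offsets, types, &particletype);
    MPI_Type_commit(&particletype);

    /* rank 0 sends one particle to everybody */
    MPI_Bcast(&p, 1, particletype, 0, MPI_COMM_WORLD);
    printf("rank= %d %.2f %.2f %.2f %.2f %d %d\n",
           rank, p.x, p.y, p.z, p.velocity, p.n, p.type);

    MPI_Type_free(&particletype);
    MPI_Finalize();
    return 0;
}
```

Using offsetof keeps the displacements correct even when the compiler inserts padding between the float and int blocks.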
Running this program using 12 processes will have an output similar to the following:
rank= 1 3.00 -3.00 3.00 0.25 3 1
rank= 4 3.00 -3.00 3.00 0.25 3 1
rank= 2 3.00 -3.00 3.00 0.25 3 1
rank= 5 3.00 -3.00 3.00 0.25 3 1
rank= 6 3.00 -3.00 3.00 0.25 3 1
rank= 8 3.00 -3.00 3.00 0.25 3 1
rank= 7 3.00 -3.00 3.00 0.25 3 1
rank= 9 3.00 -3.00 3.00 0.25 3 1
rank= 10 3.00 -3.00 3.00 0.25 3 1
rank= 0 3.00 -3.00 3.00 0.25 3 1
rank= 11 3.00 -3.00 3.00 0.25 3 1
rank= 3 3.00 -3.00 3.00 0.25 3 1
MPI Collective Communication Functions
This exercise DOES NOT require programming
This program demonstrates the functionality of the most commonly used collective MPI functions, namely
MPI_Gather
MPI_Gatherv
MPI_Allgather
MPI_Allgatherv
MPI_Scatter
MPI_Alltoall
MPI_Reduce
MPI_Allreduce
MPI_Bcast
Please note that the program is designed to run using up to 4 processes only.
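To give a flavor of the calls demonstrated, here is a sketch of just the MPI_Gather case (illustrative, not the exercise code; assumes 4 processes): each rank contributes one character and rank 0 collects them in rank order.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char sendbuf = 'a' + rank;   /* rank 0 -> 'a', rank 1 -> 'b', ... */
    char recvbuf[4] = { 0 };

    /* every rank contributes 1 char; only rank 0's recvbuf is filled */
    MPI_Gather(&sendbuf, 1, MPI_CHAR, recvbuf, 1, MPI_CHAR, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("MPI_Gather : %c %c %c %c\n",
               recvbuf[0], recvbuf[1], recvbuf[2], recvbuf[3]);

    MPI_Finalize();
    return 0;
}
```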
Running this program using 4 processes will have an output similar to the following
Function        Proc  Sendbuf    Recvbuf
--------        ----  -------    -------
MPI_Gather    :  0    a
MPI_Gather    :  3    d
MPI_Gather    :  2    c
MPI_Gather    :  1    b          a b c d
MPI_Gatherv   :  3    g h i j
MPI_Gatherv   :  0    a
MPI_Gatherv   :  1    b c        a b c d e f g h i j
MPI_Gatherv   :  2    d e f
MPI_Allgather :  0    a          a b c d
MPI_Allgather :  1    b c        a b c d
MPI_Allgather :  3    d h i j    a b c d
MPI_Allgather :  2    c e f      a b c d
MPI_Allgatherv:  0    a          a b c d e f g h i j
MPI_Allgatherv:  1    b c        a b c d e f g h i j
MPI_Allgatherv:  2    d e f      a b c d e f g h i j
MPI_Allgatherv:  3    g h i j    a b c d e f g h i j
MPI_Scatter   :  1    e f g h    f
MPI_Scatter   :  0    a b c d    e
MPI_Scatter   :  3    m n o p    h
MPI_Scatter   :  2    i j k l    g
MPI_Alltoall  :  0    a b c d    a e i m
MPI_Alltoall  :  3    m n o p    d h l p
MPI_Alltoall  :  1    e f g h    b f j n
MPI_Alltoall  :  2    i j k l    c g k o
MPI_Reduce MAX:  2    i j k l
MPI_Reduce MAX:  0    a b c d
MPI_Reduce MAX:  3    m n o p
MPI_Reduce MAX:  1    e f g h    m n o p
MPI_Allreduce :  1    e f g h    m n o p
MPI_Allreduce :  0    a b c d    m n o p
MPI_Allreduce :  3    m n o p    m n o p
MPI_Allreduce :  2    i j k l    m n o p
MPI_Bcast     :  1    b          b
MPI_Bcast     :  3               b
MPI_Bcast     :  0               b
MPI_Bcast     :  2               b
MPI Group and Communicator Functions
This exercise DOES NOT require programming
This program demonstrates the typical usage of MPI groups and communicators. The sample code creates two different process groups for separate collective communication exchanges, which also requires creating new communicators.
The flow of the code can be summarized as follows:
- Extract handle of global group from MPI_COMM_WORLD using MPI_Comm_group
- Form new group as a subset of global group using MPI_Group_incl
- Create new communicator for new group using MPI_Comm_create
- Determine new rank in new communicator using MPI_Comm_rank
- Conduct communications using any MPI message passing routine
- When finished, free up new communicator and group (optional) using MPI_Comm_free and MPI_Group_free
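The steps above can be sketched as follows (illustrative; assumes exactly 12 processes split into two groups of six, as in the sample output below):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, newrank, recvbuf;
    int ranks1[6] = { 0, 1, 2, 3, 4, 5 };
    int ranks2[6] = { 6, 7, 8, 9, 10, 11 };
    MPI_Group world_group, new_group;
    MPI_Comm new_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* 1. extract the global group, 2. form a 6-process subgroup */
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, 6, (rank < 6) ? ranks1 : ranks2,
                   &new_group);

    /* 3. create a communicator for the subgroup, 4-5. use it */
    MPI_Comm_create(MPI_COMM_WORLD, new_group, &new_comm);
    MPI_Allreduce(&rank, &recvbuf, 1, MPI_INT, MPI_SUM, new_comm);
    MPI_Comm_rank(new_comm, &newrank);
    printf("rank= %d newrank= %d recvbuf= %d\n", rank, newrank, recvbuf);

    /* 6. free the communicator and group */
    MPI_Comm_free(&new_comm);
    MPI_Group_free(&new_group);
    MPI_Finalize();
    return 0;
}
```

The sum over ranks 0-5 is 15 and over ranks 6-11 is 51, which is what the recvbuf values in the sample output show.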
Running this program using 12 processes will have an output similar to the following:
rank= 1 newrank= 1 recvbuf= 15
rank= 3 newrank= 3 recvbuf= 15
rank= 0 newrank= 0 recvbuf= 15
rank= 2 newrank= 2 recvbuf= 15
rank= 8 newrank= 2 recvbuf= 51
rank= 5 newrank= 5 recvbuf= 15
rank= 7 newrank= 1 recvbuf= 51
rank= 10 newrank= 4 recvbuf= 51
rank= 6 newrank= 0 recvbuf= 51
rank= 9 newrank= 3 recvbuf= 51
rank= 11 newrank= 5 recvbuf= 51
rank= 4 newrank= 4 recvbuf= 15
MPI Virtual Topology Functions
This exercise DOES NOT require programming
This program demonstrates the typical usage of an MPI Cartesian virtual topology. The sample code creates a 4 x 4 Cartesian topology from 16 processes and has each process exchange its rank with its four neighbors.
 _______________________________
|   0   |   1   |   2   |   3   |
|       |       |       |       |
| (0,0) | (0,1) | (0,2) | (0,3) |
|_______|_______|_______|_______|
|   4   |   5   |   6   |   7   |
|       |       |       |       |
| (1,0) | (1,1) | (1,2) | (1,3) |
|_______|_______|_______|_______|
|   8   |   9   |  10   |  11   |
|       |       |       |       |
| (2,0) | (2,1) | (2,2) | (2,3) |
|_______|_______|_______|_______|
|  12   |  13   |  14   |  15   |
|       |       |       |       |
| (3,0) | (3,1) | (3,2) | (3,3) |
|_______|_______|_______|_______|
Depending on the periodicity of the Cartesian group in the specified coordinate direction, MPI_CART_SHIFT provides the identifiers for a circular or an end-off shift. In the case of an end-off shift, the value MPI_PROC_NULL may be returned in rank_source or rank_dest, indicating that the source or the destination for the shift is out of range.
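The topology setup can be sketched as follows (illustrative; assumes 16 processes, non-periodic in both dimensions so edge processes get MPI_PROC_NULL neighbors, which many implementations print as -1):

```c
#include <stdio.h>
#include <mpi.h>

enum { UP, DOWN, LEFT, RIGHT };

int main(int argc, char *argv[])
{
    int rank, coords[2], nbrs[4];
    int dims[2] = { 4, 4 };      /* 4 x 4 grid, assumes 16 processes */
    int periods[2] = { 0, 0 };   /* non-periodic: end-off shifts */
    MPI_Comm cartcomm;

    MPI_Init(&argc, &argv);
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cartcomm);
    MPI_Comm_rank(cartcomm, &rank);
    MPI_Cart_coords(cartcomm, rank, 2, coords);

    /* shift along rows (dim 0) and columns (dim 1); out-of-range
       neighbors come back as MPI_PROC_NULL */
    MPI_Cart_shift(cartcomm, 0, 1, &nbrs[UP], &nbrs[DOWN]);
    MPI_Cart_shift(cartcomm, 1, 1, &nbrs[LEFT], &nbrs[RIGHT]);

    printf("rank = %2d coords = %d %d neighbors(u,d,l,r) = %2d %2d %2d %2d\n",
           rank, coords[0], coords[1],
           nbrs[UP], nbrs[DOWN], nbrs[LEFT], nbrs[RIGHT]);
    MPI_Finalize();
    return 0;
}
```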
Running this program with 16 processes produces output similar to the following (the order varies from run to run):

```
rank = 3 coords = 0 3 neighbors(u,d,l,r) = -1 7 2 -1
rank = 1 coords = 0 1 neighbors(u,d,l,r) = -1 5 0 2
rank = 1 inbuf(u,d,l,r) = -1 5 0 2
rank = 2 coords = 0 2 neighbors(u,d,l,r) = -1 6 1 3
rank = 15 coords = 3 3 neighbors(u,d,l,r) = 11 -1 14 -1
rank = 14 coords = 3 2 neighbors(u,d,l,r) = 10 -1 13 15
rank = 15 inbuf(u,d,l,r) = 11 -1 14 -1
rank = 13 coords = 3 1 neighbors(u,d,l,r) = 9 -1 12 14
rank = 14 inbuf(u,d,l,r) = 10 -1 13 15
rank = 13 inbuf(u,d,l,r) = 9 -1 12 14
rank = 9 coords = 2 1 neighbors(u,d,l,r) = 5 13 8 10
rank = 4 coords = 1 0 neighbors(u,d,l,r) = 0 8 -1 5
rank = 11 coords = 2 3 neighbors(u,d,l,r) = 7 15 10 -1
rank = 2 inbuf(u,d,l,r) = -1 6 1 3
rank = 10 coords = 2 2 neighbors(u,d,l,r) = 6 14 9 11
rank = 10 inbuf(u,d,l,r) = 6 14 9 11
rank = 8 coords = 2 0 neighbors(u,d,l,r) = 4 12 -1 9
rank = 0 coords = 0 0 neighbors(u,d,l,r) = -1 4 -1 1
rank = 5 coords = 1 1 neighbors(u,d,l,r) = 1 9 4 6
rank = 7 coords = 1 3 neighbors(u,d,l,r) = 3 11 6 -1
rank = 9 inbuf(u,d,l,r) = 5 13 8 10
rank = 6 coords = 1 2 neighbors(u,d,l,r) = 2 10 5 7
rank = 4 inbuf(u,d,l,r) = 0 8 -1 5
rank = 11 inbuf(u,d,l,r) = 7 15 10 -1
rank = 0 inbuf(u,d,l,r) = -1 4 -1 1
rank = 5 inbuf(u,d,l,r) = 1 9 4 6
rank = 7 inbuf(u,d,l,r) = 3 11 6 -1
rank = 6 inbuf(u,d,l,r) = 2 10 5 7
rank = 8 inbuf(u,d,l,r) = 4 12 -1 9
rank = 3 inbuf(u,d,l,r) = -1 7 2 -1
rank = 12 coords = 3 0 neighbors(u,d,l,r) = 8 -1 -1 13
rank = 12 inbuf(u,d,l,r) = 8 -1 -1 13
```
Logging Off
Type `exit` to close the connection to the ACF supercomputer.
Documentation for MPI and Additional Resources
- There are man pages available for MPI, which should be installed in your MANPATH. The following man pages contain some introductory information about MPI: `man MPI`, `man cc`, `man ftn`, `man qsub`, `man MPI_Init`, `man MPI_Finalize`
- You can also refer to NICS User Support Page for ACF specific MPI Implementation details.
- MPI man pages are also available online. http://www.mcs.anl.gov/mpi/www/
- Main MPI web page at Argonne National Laboratory http://www-unix.mcs.anl.gov/mpi
- Set of guided exercises http://www-unix.mcs.anl.gov/mpi/tutorial/mpiexmpl
- MPI Forum home page contains the official copies of the MPI standard. http://www.mpi-forum.org/
- Books on and about MPI:
- Using MPI, 2nd Edition by William Gropp, Ewing Lusk, and Anthony Skjellum, published by MIT Press ISBN 0-262-57132-3.
- The example programs from this book are available at ftp://ftp.mcs.anl.gov/pub/mpi/using/UsingMPI.tar.gz.
- Also of interest may be The LAM companion to ``Using MPI...'' by Zdzislaw Meglicki (gustav@arp.anu.edu.au).
- Designing and Building Parallel Programs is Ian Foster's online book that includes a chapter on MPI. It provides a succinct introduction to an MPI subset. (ISBN 0-201-57594-9; Published by Addison-Wesley)
- MPI: The Complete Reference, by Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra, published by The MIT Press.
- MPI: The Complete Reference - 2nd Edition: Volume 2 - The MPI-2 Extensions, by William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir, The MIT Press.
- Parallel Programming With MPI, by Peter S. Pacheco, published by Morgan Kaufmann.
- RS/6000 SP: Practical MPI Programming, by Yukiya Aoyama and Jun Nakano (IBM Japan), and available as an IBM Redbook.
- Supercomputing Simplified: The Bare Necessities for Parallel C Programming with MPI, by William B. Levy and Andrew G. Howe, ISBN 978-0-9802-4210-2.
Acknowledgments
- The original MPI training materials for workstations were developed under the Joint Information Systems Committee (JISC) New Technologies Initiative by the Training and Education Centre at Edinburgh Parallel Computing Centre (EPCC-TEC), University of Edinburgh, United Kingdom.
- NICS staff at UTK/ORNL and their resources
- All contributors to the art and science of parallel computing