I'm working on a project comparing the performance of PVM and MPI on GPU-accelerated clusters. Naturally, I turned to the de facto standard for cluster benchmarking, LINPACK. The problem I'm running into is finding a version of LINPACK for PVM. One must exist somewhere, because there are published performance comparisons of systems using LINPACK and PVM. I've already installed ScaLAPACK and all of the other libraries it requires, but I can't figure out how to use it as a benchmark. Additionally, I need the benchmark to be GPU accelerated. I've looked into HPL, which is GPU accelerated through CUDA, but there are no versions of it for PVM. Does anyone know of a version of LINPACK that runs on PVM and could be modified to be CUDA accelerated?
It is a part of virtual GPU memory access for the below definition of CUDA.
CUDA™ is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).
With millions of CUDA-enabled GPUs sold to date, software developers, scientists and researchers are finding broad-ranging uses for GPU computing with CUDA. Here are a few examples:
Identify hidden plaque in arteries: Heart attacks are the leading cause of death worldwide. Harvard Engineering, Harvard Medical School and Brigham & Women's Hospital have teamed up to use GPUs to simulate blood flow and identify hidden arterial plaque without invasive imaging techniques or exploratory surgery.
Analyze air traffic flow: The National Airspace System manages the nationwide coordination of air traffic flow. Computer models help identify new ways to alleviate congestion and keep airplane traffic moving efficiently. Using the computational power of GPUs, a team at NASA obtained a large performance gain, reducing analysis time from ten minutes to three seconds.
Visualize molecules: A molecular simulation called NAMD (nanoscale molecular dynamics) gets a large performance boost with GPUs. The speed-up is a result of the parallel architecture of GPUs, which enables NAMD developers to port compute-intensive portions of the application to the GPU using the CUDA Toolkit.
#include "pvm3.h" // loads the definitions (header) file specific to PVM
#include <pthread.h> // loads the definitions file for POSIX threads
#include <cuda.h> // loads a definitions file specific to CUDA
float *x, *y; // declares two pointers to floating point arrays
#define PROC (8) // number of threads (one per GPU) started by each process
float e_vec[PROC]; // defines a floating point vector, one entry per thread
int n_thread0, n_thread1; // defines integer variables
I don't think I'm communicating very effectively. What I'd like to know is what the purpose of the program is, what someone might use it for, and, more specifically, whether I could add timing code to it and use it for a comparison between the MPI and PVM variations given on the site. And by the way, thanks for finding it in the first place.
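On the timing question, here is a minimal sketch of one way to do it, not something taken from the site. The helper name best_wall_seconds is hypothetical and is not part of PVM, MPI or CUDA. It only measures wall-clock time around whatever region you hand it, so the comparison is fair only if the PVM and MPI variants time the same region (ghost-zone exchange, GPU work and final reduction). Kernel launches are asynchronous, so if the timed region ends right after a launch, call cudaDeviceSynchronize() before the clock stops.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (an assumption for illustration): runs `work`
// `repeats` times and returns the best wall-clock time in seconds.
template <typename Work>
double best_wall_seconds(Work work, int repeats) {
    assert(repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        work();  // the region to compare, identical in both variants
        auto t1 = std::chrono::steady_clock::now();
        double s = std::chrono::duration<double>(t1 - t0).count();
        if (best < 0.0 || s < best) best = s;
    }
    return best;
}
```

Taking the best of several runs reduces noise from the first run (GPU context creation, cold caches). In a multi-process run you would normally report the slowest process's time, since that is what the whole job waits for.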
// I don't know if this will help
#include "pvm3.h"
#include <pthread.h>
#include <cuda.h>
float *x, *y;
#define PROC (8)
float e_vec[PROC];
int n_thread0, n_thread1;
// kernel
__global__ void sub1(float* fx, float* fy, float* fe) {
#define BLOCK (512)
int t = threadIdx.x; // builtin
int b = blockIdx.x; // builtin
float e;
__shared__ float se[BLOCK];
__shared__ float sx[BLOCK];
__shared__ float sy[BLOCK+2];
// copy from device (global) memory to on-chip shared memory
sx[t] = fx[BLOCK*b+t];
sy[t] = fy[BLOCK*b+t];
if (t<2)
sy[t+BLOCK] = fy[BLOCK*b+t+BLOCK];
__syncthreads();
// do computation
sx[t] += ( sy[t+2] + sy[t] )*.5;
e = sy[t+1] * sy[t+1];
// copy to device memory
fx[BLOCK*b+t] = sx[t];
// reduction
se[t] = e;
__syncthreads();
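// __syncthreads() must be reached by every thread of the block,
// so it is placed outside the conditionals
if (t<256)
se[t] += se[t+256];
__syncthreads();
if (t<128)
se[t] += se[t+128];
__syncthreads();
if (t<64)
se[t] += se[t+64];
__syncthreads();
if (t<32) { // warp size
// volatile keeps the unsynchronized warp-level steps from being held in registers;
// on Volta and newer GPUs, also put __syncwarp() between the steps
volatile float *vse = se;
vse[t] += vse[t+32];
vse[t] += vse[t+16];
vse[t] += vse[t+8];
vse[t] += vse[t+4];
vse[t] += vse[t+2];
vse[t] += vse[t+1];
}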
if (t==0)
fe[b] = se[0];
}
void *thread1(void *arg) {
int p = (int)(long)arg; // go through long: a direct pointer-to-int cast fails on 64-bit systems
int n0 = n_thread0 + (p * (n_thread1-n_thread0)) / PROC;
int n1 = n_thread0 + ((p+1) * (n_thread1-n_thread0)) / PROC;
// pick GPU
cudaSetDevice(p);
// allocate GPU memory
float *fx, *fy, *fe;
cudaMalloc((void**)&fx, (n1-n0+2) * sizeof(float));
cudaMalloc((void**)&fy, (n1-n0+2) * sizeof(float));
cudaMalloc((void**)&fe, (n1-n0+2)/BLOCK * sizeof(float));
float *de = new float[(n1-n0+2)/BLOCK];
// copy to GPU memory
cudaMemcpy(fx+1, &x[n0],
(n1-n0) * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(fy, &y[n0-1],
(n1-n0+2) * sizeof(float), cudaMemcpyHostToDevice);
dim3 dimBlock(BLOCK, 1, 1);
dim3 dimGrid((n1-n0+2)/BLOCK, 1, 1);
float e = 0;
// call GPU
sub1<<<dimGrid, dimBlock>>>(fx, fy, fe);
// copy to host memory
// cudaMemcpy takes (destination, source, size, direction)
cudaMemcpy(&x[n0], fx+1, (n1-n0) * sizeof(float),
cudaMemcpyDeviceToHost);
cudaMemcpy(de, fe, (n1-n0+2)/BLOCK * sizeof(float),
cudaMemcpyDeviceToHost);
// release GPU memory
cudaFree(fe);
cudaFree(fy);
cudaFree(fx);
float e_local = 0;
for (int i=0; i<(n1-n0+2)/BLOCK; ++i)
e_local += de[i];
e += e_local;
delete[] de;
e_vec[p] = e;
return (void*) 0;
}
int main(int argc, char *argv[]) {
int n = ...;
if (pvm_parent() == PvmNoParent) {
#define N (4)
int tid[N];
pvm_spawn("program", argv, PvmTaskDefault, (char*)0, N, &tid[0]);
} else {
int mytid = pvm_mytid();
int *tids, me = -1;
int ntids = pvm_siblings(&tids);
for (int i=0; i<ntids; ++i)
if ( tids[i] == mytid) {
me = i;
break;
}
int p_left = -1, p_right = -1;
if (me > 0)
p_left = tids[me-1];
if (me < ntids-1)
p_right = tids[me+1];
int n_local0 = 1 + (me * (n-1)) / ntids;
int n_local1 = 1 + ((me+1) * (n-1)) / ntids;
pvm_joingroup("worker");
// allocate only local part + ghost zone of the arrays x,y
// (use the global x, y; a local redeclaration here would hide them from thread1)
x = new float[n_local1 - n_local0 + 2];
y = new float[n_local1 - n_local0 + 2];
x -= (n_local0 - 1);
y -= (n_local0 - 1);
... // fill x, y
// fill ghost zone
if (p_left != -1) {
pvm_initsend(PvmDataDefault);
pvm_pkfloat(&y[n_local0], 1, 1);
int msgtag = 1;
pvm_send(p_left, msgtag);
}
if (p_right != -1) {
int msgtag = 1;
pvm_recv(p_right, msgtag);
pvm_upkfloat(&y[n_local1], 1, 1);
pvm_initsend(PvmDataDefault);
pvm_pkfloat(&y[n_local1-1], 1, 1);
msgtag = 2;
pvm_send(p_right, msgtag);
}
if (p_left != -1) {
int msgtag = 2;
pvm_recv(p_left, msgtag);
pvm_upkfloat(&y[n_local0-1], 1, 1);
}
pthread_t threads[PROC];
pthread_attr_t attr;
pthread_attr_init(&attr);
n_thread0 = n_local0;
n_thread1 = n_local1;
float e = 0;
// start threads and wait for termination
for (int p=0; p<PROC; ++p)
pthread_create(&threads[p], &attr, thread1, (void *)(long)p);
for (int p=0; p<PROC; ++p) {
pthread_join(threads[p], NULL);
e += e_vec[p];
}
int msgtag = 3;
pvm_reduce(PvmSum, &e, 1, PVM_FLOAT, msgtag, "worker", tids[0]);
msgtag = 4;
if (me==0) {
pvm_initsend(PvmDataDefault);
pvm_pkfloat(&e, 1, 1);
pvm_bcast("worker", msgtag);
} else {
pvm_recv(tids[0], msgtag);
pvm_upkfloat(&e, 1, 1);
}
... // output x, e
x += (n_local0 - 1);
y += (n_local0 - 1);
delete[] x; // "delete[] x, y;" would free only x
delete[] y;
}
pvm_exit();
return 0;
}
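To make the domain decomposition in main() easier to follow, here is a small host-only sketch of the same index formula. The Range struct and the function name local_range are made up for illustration; only the arithmetic comes from the n_local0 / n_local1 lines in the code. Each task owns the half-open index range [begin, end), and the two extra array slots hold the ghost values received from the left and right neighbours.

```cpp
#include <cassert>

// Hypothetical helper type: half-open index range [begin, end).
struct Range { int begin; int end; };

// Same arithmetic as n_local0 / n_local1 in main():
// task `me` of `ntids` gets its share of the indices 1 .. n-1.
Range local_range(int me, int ntids, int n) {
    assert(ntids > 0 && me >= 0 && me < ntids);
    Range r;
    r.begin = 1 + (me * (n - 1)) / ntids;
    r.end   = 1 + ((me + 1) * (n - 1)) / ntids;
    return r;
}
```

With n = 10 and 4 tasks this gives [1,3), [3,5), [5,7) and [7,10): the ranges touch without overlapping, and the last one absorbs the remainder. The same formula is reused in thread1 to split a task's range over the PROC threads.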
if (t<32) { // warp size on binary values
se[t] += se[t+32];
se[t] += se[t+16];
se[t] += se[t+8];
se[t] += se[t+4];
se[t] += se[t+2];
se[t] += se[t+1];
}
Any more and we start a tutorial on the C language :-)
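For anyone puzzled by the halving pattern in that snippet (+32, +16, +8, ...), here is a rough host-side emulation of the same tree reduction. It is plain C++ with no GPU involved, the function name tree_reduce is made up, and the inner loop stands in for the threads that would run in parallel on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation (an illustration, not the kernel itself) of the
// shared-memory tree reduction: each outer pass halves the active range.
float tree_reduce(std::vector<float> se) {
    // the pattern assumes a power-of-two element count, like BLOCK = 512
    assert(!se.empty() && (se.size() & (se.size() - 1)) == 0);
    for (std::size_t stride = se.size() / 2; stride > 0; stride /= 2) {
        for (std::size_t t = 0; t < stride; ++t)  // "threads" with t < stride
            se[t] += se[t + stride];
        // on the GPU, all threads of the block would synchronize here
    }
    return se[0];  // the block's partial sum, which the kernel writes to fe[b]
}
```

With 512 elements this takes 9 passes instead of 511 sequential additions, which is the whole point of doing the sum this way on the GPU.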
A
running that code segment.
I quote:
"It is a part of virtual GPU memory access for the below definition
of CUDA."
I imagine a parallel GPU graphics program would run to a
minimum of 50,000 lines of code, maybe 200,000, to do molecular
depiction.
A