
LINPACK Benchmarking on PVM with CUDA? [Answered]

I'm working on a project comparing the performance of PVM and MPI in GPU-accelerated clusters. Naturally, I turned to the de facto standard for computer cluster benchmarking, LINPACK. The problem I'm running into is finding a version of LINPACK for PVM. It does exist somewhere, because there are published performance comparisons of systems using LINPACK and PVM. I've already installed ScaLAPACK and all of the other libraries required for it to work, but I can't figure out how to use it as a benchmark. Additionally, I need the benchmark to be GPU accelerated. I've looked into HPL, which is GPU accelerated through CUDA, but there are no versions for PVM. Does anyone know of a version of LINPACK that can be run on PVM and could be modified to be CUDA accelerated?



5 years ago

// I don't know if this will help

#include "pvm3.h"
#include <pthread.h>
#include <cuda.h>

float *x, *y;
#define PROC (8)
float e_vec[PROC];
int n_thread0, n_thread1;

// kernel: each block of BLOCK threads updates a slice of fx and
// produces one partial sum of squares per block in fe
#define BLOCK (512)
__global__ void sub1(float* fx, float* fy, float* fe) {
  int t = threadIdx.x; // built-in thread index
  int b = blockIdx.x;  // built-in block index
  float e;
  __shared__ float se[BLOCK];
  __shared__ float sx[BLOCK];
  __shared__ float sy[BLOCK+2];
  // copy from device (global) memory to shared memory
  sx[t] = fx[BLOCK*b+t];
  sy[t] = fy[BLOCK*b+t];
  if (t<2)
     sy[t+BLOCK] = fy[BLOCK*b+t+BLOCK];
  __syncthreads(); // all loads must finish before neighbors are read

  // do computation
  sx[t] += ( sy[t+2] + sy[t] )*.5f;
  e = sy[t+1] * sy[t+1];
  // copy back to device (global) memory
  fx[BLOCK*b+t] = sx[t];
  // block-wide tree reduction of e into se[0]
  se[t] = e;
  __syncthreads();
  if (t<256) se[t] += se[t+256];
  __syncthreads();
  if (t<128) se[t] += se[t+128];
  __syncthreads();
  if (t<64)  se[t] += se[t+64];
  __syncthreads();
  if (t<32) { // the final steps fit in one warp (32 threads)
     volatile float *v = se;
     v[t] += v[t+32];
     v[t] += v[t+16];
     v[t] += v[t+8];
     v[t] += v[t+4];
     v[t] += v[t+2];
     v[t] += v[t+1];
  }
  if (t==0)
     fe[b] = se[0];
}

void *thread1(void *arg) {
  int p = (int)(size_t)arg;
  int n0 = n_thread0 + (p * (n_thread1-n_thread0)) / PROC;
  int n1 = n_thread0 + ((p+1) * (n_thread1-n_thread0)) / PROC;
  // pick a GPU for this thread
  int ndev = 1;
  cudaGetDeviceCount(&ndev);
  cudaSetDevice(p % ndev);
  // allocate GPU memory
  float *fx, *fy, *fe;
  cudaMalloc((void**)&fx, (n1-n0+2) * sizeof(float));
  cudaMalloc((void**)&fy, (n1-n0+2) * sizeof(float));
  cudaMalloc((void**)&fe, (n1-n0+2)/BLOCK * sizeof(float));
  float *de = new float[(n1-n0+2)/BLOCK];
  // copy host data to GPU memory
  cudaMemcpy(fx+1, &x[n0],
   (n1-n0) * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(fy, &y[n0-1],
   (n1-n0+2) * sizeof(float), cudaMemcpyHostToDevice);
  dim3 dimBlock(BLOCK, 1, 1);
  dim3 dimGrid((n1-n0+2)/BLOCK, 1, 1);

  // launch the GPU kernel
  sub1<<<dimGrid, dimBlock>>>(fx, fy, fe);
  // copy results back to host memory
  cudaMemcpy(&x[n0], fx+1, (n1-n0) * sizeof(float),
   cudaMemcpyDeviceToHost);
  cudaMemcpy(de, fe, (n1-n0+2)/BLOCK * sizeof(float),
   cudaMemcpyDeviceToHost);
  // release GPU memory
  cudaFree(fx);
  cudaFree(fy);
  cudaFree(fe);
  // sum the per-block partial results
  float e_local = 0;
  for (int i=0; i<(n1-n0+2)/BLOCK; ++i)
   e_local += de[i];
  delete[] de;
  e_vec[p] = e_local;
  return (void*) 0;
}

int main(int argc, char *argv[]) {
  int n = ...; // problem size
  if (pvm_parent() == PvmNoParent) {
   // parent: spawn N worker tasks
#define N (4)
   int tid[N];
   pvm_spawn("program", argv, PvmTaskDefault, (char*)0, N, &tid[0]);
  } else {
   // worker task
   int mytid = pvm_mytid();
   int *tids, me = -1;
   int ntids = pvm_siblings(&tids);
   for (int i=0; i<ntids; ++i)
    if ( tids[i] == mytid)
     me = i;
   int p_left = -1, p_right = -1;
   if (me > 0)
    p_left = tids[me-1];
   if (me < ntids-1)
    p_right = tids[me+1];
   int n_local0 = 1 + (me * (n-1)) / ntids;
   int n_local1 = 1 + ((me+1) * (n-1)) / ntids;
   pvm_joingroup("worker");
   // allocate only the local part + ghost zone of the global arrays x, y
   x = new float[n_local1 - n_local0 + 2];
   y = new float[n_local1 - n_local0 + 2];
   // shift the pointers so the arrays can be addressed by global index
   x -= (n_local0 - 1);
   y -= (n_local0 - 1);

   ... // fill x, y

   // fill ghost zones by exchanging boundary values with the neighbors
   if (p_left != -1) {      // send first local value to the left
    pvm_initsend(PvmDataDefault);
    pvm_pkfloat(&y[n_local0], 1, 1);
    int msgtag = 1;
    pvm_send(p_left, msgtag);
   }
   if (p_right != -1) {     // receive right neighbor's boundary value
    int msgtag = 1;
    pvm_recv(p_right, msgtag);
    pvm_upkfloat(&y[n_local1], 1, 1);
    pvm_initsend(PvmDataDefault); // send last local value to the right
    pvm_pkfloat(&y[n_local1-1], 1, 1);
    msgtag = 2;
    pvm_send(p_right, msgtag);
   }
   if (p_left != -1) {      // receive left neighbor's boundary value
    int msgtag = 2;
    pvm_recv(p_left, msgtag);
    pvm_upkfloat(&y[n_local0-1], 1, 1);
   }

   pthread_t threads[PROC];
   pthread_attr_t attr;
   pthread_attr_init(&attr);
   n_thread0 = n_local0;
   n_thread1 = n_local1;
   float e = 0;
   // start threads and wait for their termination
   for (int p=0; p<PROC; ++p)
    pthread_create(&threads[p], &attr, thread1, (void *)(size_t)p);
   for (int p=0; p<PROC; ++p) {
    pthread_join(threads[p], NULL);
    e += e_vec[p];
   }

   // global sum of e across all tasks, result at group instance 0
   int msgtag = 3;
   pvm_reduce(PvmSum, &e, 1, PVM_FLOAT, msgtag, "worker", 0);
   msgtag = 4;
   if (me==0) {             // broadcast the global result to the group
    pvm_initsend(PvmDataDefault);
    pvm_pkfloat(&e, 1, 1);
    pvm_bcast("worker", msgtag);
   } else {
    pvm_recv(tids[0], msgtag);
    pvm_upkfloat(&e, 1, 1);
   }

   ... // output x, e

   // undo the pointer shift before freeing
   x += (n_local0 - 1);
   y += (n_local0 - 1);
   delete[] x;
   delete[] y;
  }
  pvm_exit();
  return 0;
}

This looks interesting, where did you find it?

Do you know what the function of the code is?

It is part of a virtual GPU memory access example; see the definition of CUDA below.

CUDA™ is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).

With millions of CUDA-enabled GPUs sold to date, software developers, scientists and researchers are finding broad-ranging uses for GPU computing with CUDA. Here are a few examples:

Identify hidden plaque in arteries:
Heart attacks are the leading cause of death worldwide. Harvard Engineering, Harvard Medical School and Brigham & Women's Hospital have teamed up to use GPUs to simulate blood flow and identify hidden arterial plaque without invasive imaging techniques or exploratory surgery.

Analyze air traffic flow:
The National Airspace System manages the nationwide coordination of air traffic flow. Computer models help identify new ways to alleviate congestion and keep airplane traffic moving efficiently. Using the computational power of GPUs, a team at NASA obtained a large performance gain, reducing analysis time from ten minutes to three seconds.

Visualize molecules:
A molecular simulation called NAMD (nanoscale molecular dynamics) gets a large performance boost with GPUs. The speed-up is a result of the parallel architecture of GPUs, which enables NAMD developers to port compute-intensive portions of the application to the GPU using the CUDA Toolkit.

I meant the specific code from the website. I've looked at it, but I can't tell what it does.

#include "pvm3.h"     // loads the definitions file specific to PVM
#include <pthread.h>  // loads the definitions file for POSIX threads
#include <cuda.h>     // loads the definitions file specific to CUDA

float *x, *y;             // defines global floating point array pointers
#define PROC (8)          // names a constant: the number of worker threads
float e_vec[PROC];        // defines a floating point vector,
                          //  one entry per thread
int n_thread0, n_thread1; // defines integer variables for the index range

  if (t<32) { // the last steps fit within a single warp (32 threads)
     se[t] += se[t+32];
     se[t] += se[t+16];
     se[t] += se[t+8];
     se[t] += se[t+4];
     se[t] += se[t+2];
     se[t] += se[t+1];
  }

Any more and we start a tutorial on the C language :-)


I don't think I'm communicating very effectively. What I'd like to know is what the purpose of the program is, what someone might use it for, and more specifically, if I were to add code to time it, whether it could be used for comparison between the MPI and PVM variations given on the site. And by the way, thanks for finding it in the first place.

I gave you the best answer I could, short of buying an NVIDIA CUDA card and running that code segment.

I quote:

"It is part of a virtual GPU memory access example; see the definition of CUDA below."

I would imagine a parallel GPU graphics application runs to a minimum of 50,000 lines of code, maybe 200,000 for something like the molecular work.


Thanks a lot for your help anyway. I probably never would have even found that on my own. At least now I have a starting point.