## Aparapi Java Matrix Multiplication Example

    import java.util.Random;
    import com.amd.aparapi.Kernel;

    /**
     * @author Vasanth Raja Chittampally
     */
    public class AparapiMatrixMultiplication {

        public static void main(String[] args) throws Exception {
            final int r = 1024;
            final int c1 = r;
            final int c2 = r;
            AparapiMatMul ap = new AparapiMatMul(r, c1, c2);
            try {
                long time1 = System.currentTimeMillis();
                //ap.setExecutionMode(Kernel.EXECUTION_MODE.JTP);
                //ap.setExecutionMode(Kernel.EXECUTION_MODE.GPU);
                //ap.setExecutionMode(Kernel.EXECUTION_MODE.CPU);
                ap.execute(r, c2);
                System.out.println("Time taken for kenel execution in " + ap.getExecutionMode() + " mode is :" + (System.currentTimeMillis() - time1));
            } catch (NullPointerException ne) {
                ne.printStackTrace();
            }
            //ap.printResults();
            long time1 = System.currentTimeMillis();
            ap.normalMatMulCalc();
            System.out.println("Time taken for kenel execution in Sequential CPU mode is :" + (System.currentTimeMillis() - time1));
            ap.compareResults();
            ap.dispose();
        }
    }

    class AparapiMatMul extends Kernel {

        float matA[];
        float matB[];
        float matC[];
        float C[];
        int rows;
        int cols1;
        int cols2;

        @Override
        public void run() {
            int i = getGlobalId();
            int j = getPassId();
            float value = 0;
            for (int k = 0; k < cols1; k++) {
                value += matA[k + i * cols1] * matB[k * cols2 + j];
            }
            matC[i * cols1 + j] = value;
        }

        public AparapiMatMul(int r, int c1, int c2) {
            rows = r;
            cols1 = c1;
            cols2 = c2;
            matA = new float[r * c1];
            matB = new float[c1 * c2];
            matC = new float[r * c2];
            C = new float[r * c2];
            //matC should be initialized with zeros
            for (int i = 0; i < r; i++) {
                for (int j = 0; j < c1; j++) {
                    matC[i * c1 + j] = 0;
                }
            }
            //Here matrix A is initialized with random numbers
            for (int i = 0; i < r; i++) {
                for (int j = 0; j < c1; j++) {
                    matA[i * c1 + j] = new Random().nextFloat();
                }
            }
            // Here matrix B is initialized with random numbers
            for (int i = 0; i < r; i++) {
                for (int j = 0; j < c1; j++) {
                    matB[i * c2 + j] = new Random().nextFloat();
                }
            }
        }

        public void printResults() {
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols2; j++) {
                    System.out.print(matC[i * cols2 + j] + " ");
                }
            }
        }

        public void normalMatMulCalc() {
            System.out.println();
            System.out.println("Sequential Execution on CPU");
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols2; j++) {
                    float sum = 0;
                    for (int k = 0; k < cols1; k++) {
                        sum += matA[i * cols1 + k] * matB[k * rows + j];
                    }
                    C[i * cols2 + j] = sum;
                }
            }
        }

        public void compareResults() {
            boolean equal = true;
            for (int i = 0; i < rows * cols2; i++) {
                if (matC[i] != C[i]) {
                    equal = false;
                    break;
                }
            }
            if (!equal)
                System.out.println("Results are not equal");
            else
                System.out.println("Results are equal.. Tested thoroughly!!!");
        }
    }

The code above performs a matrix multiplication. The overridden run() method is the kernel code, which runs in GPU, JTP (Java thread pool), or CPU mode. The class is first compiled to Java bytecode, and Aparapi then converts that bytecode to OpenCL code.

You can compare the ease of writing the code above with the OpenCL C code here, but we have to give up some optimizations.

Results are as follows:

**Output 1:**

Time taken for kenel execution in GPU mode is :8791

Sequential Execution on CPU

Time taken for kenel execution in Sequential CPU mode is :11580

Results are equal.. Tested thoroughly!!!

**Output 2:**

Time taken for kenel execution in JTP mode is :7765

Sequential Execution on CPU

Time taken for kenel execution in Sequential CPU mode is :12491

Results are equal.. Tested thoroughly!!!

Thanks to Gary Frost for his input on this program. Here I’m posting the changes I made to the program above. The call ap.execute(r,c2) runs the kernel c2 times, which is not the same as a two-dimensional clEnqueueNDRangeKernel() launch. The corrected code is as follows.

    import java.util.Random;
    import com.amd.aparapi.Kernel;

    /**
     * @author Vasanth Raja Chittampally
     */
    public class AparapiMatrixMultiplication {

        public static void main(String[] args) throws Exception {
            final int r = 1024;
            final int c1 = r;
            final int c2 = r;
            AparapiMatMul ap = new AparapiMatMul(r, c1, c2);
            try {
                ap.setExecutionMode(Kernel.EXECUTION_MODE.GPU);
                long time1 = System.currentTimeMillis();
                ap.execute(r * c2);
                System.out.println("Time taken for kenel execution in " + ap.getExecutionMode() + " mode is :" + (System.currentTimeMillis() - time1));
            } catch (NullPointerException ne) {
                ne.printStackTrace();
            }
            //ap.printResults();
            long time1 = System.currentTimeMillis();
            ap.normalMatMulCalc();
            System.out.println("Time taken for kenel execution in Sequential CPU mode is :" + (System.currentTimeMillis() - time1));
            ap.compareResults();
            ap.dispose();
        }
    }

    class AparapiMatMul extends Kernel {

        float matA[];
        float matB[];
        float matC[];
        float C[];
        int rows;
        int cols1;
        int cols2;

        @Override
        public void run() {
            int i = getGlobalId() / rows;
            int j = getGlobalId() % rows;
            float value = 0;
            for (int k = 0; k < cols1; k++) {
                value += matA[k + i * cols1] * matB[k * cols2 + j];
            }
            matC[i * cols1 + j] = value;
        }

        public AparapiMatMul(int r, int c1, int c2) {
            rows = r;
            cols1 = c1;
            cols2 = c2;
            matA = new float[r * c1];
            matB = new float[c1 * c2];
            matC = new float[r * c2];
            C = new float[r * c2];
            //matC should be initialized with zeros
            for (int i = 0; i < r; i++) {
                for (int j = 0; j < c1; j++) {
                    matC[i * c1 + j] = 0;
                }
            }
            //Here matrix A is initialized with random numbers
            for (int i = 0; i < r; i++) {
                for (int j = 0; j < c1; j++) {
                    matA[i * c1 + j] = new Random().nextFloat();
                }
            }
            // Here matrix B is initialized with random numbers
            for (int i = 0; i < r; i++) {
                for (int j = 0; j < c1; j++) {
                    matB[i * c2 + j] = new Random().nextFloat();
                }
            }
        }

        public void printResults() {
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols2; j++) {
                    System.out.print(matC[i * cols2 + j] + " ");
                }
            }
        }

        public void normalMatMulCalc() {
            System.out.println();
            System.out.println("Sequential Execution on CPU");
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols2; j++) {
                    float sum = 0;
                    for (int k = 0; k < cols1; k++) {
                        sum += matA[i * cols1 + k] * matB[k * rows + j];
                    }
                    C[i * cols2 + j] = sum;
                }
            }
        }

        public void compareResults() {
            boolean equal = true;
            for (int i = 0; i < rows * cols2; i++) {
                if (matC[i] != C[i]) {
                    equal = false;
                    break;
                }
            }
            if (!equal)
                System.out.println("Results are not equal");
            else
                System.out.println("Results are equal.. Tested thoroughly!!!");
        }
    }

The results I got are amazing. These are the timings (in milliseconds) on my PC, which has an AMD Radeon 5670 graphics card.

**Output:** GPU Mode

Time taken for kenel execution in GPU mode is : 838

Sequential Execution on CPU

Time taken for kenel execution in Sequential CPU mode is : 13335

Results are equal.. Tested thoroughly!!!

**Output:** JTP Mode

Time taken for kenel execution in JTP mode is :5671

Sequential Execution on CPU

Time taken for kenel execution in Sequential CPU mode is :13516

Results are equal.. Tested thoroughly!!!

Vasanth

Thanks for posting this and for your evaluation of Aparapi. I am the Aparapi tech lead/architect and it is great to see folks giving Aparapi a try.

The results are slightly lower than I would have expected.

When I looked at the code I discovered that you were using kernel.execute(c,r) (column by row). This is a reasonable choice given that you are familiar with OpenCL 😉, because you probably assumed this mapped to clEnqueueNDRangeKernel with two dimensions. Sadly, Aparapi does not support this mode. Instead, execute(c,r) essentially invokes the kernel r times, so we accumulate the kernel execution costs (though not the buffer transfer costs).
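To make the difference concrete, here is a plain-Java sketch that emulates the two launch styles with ordinary loops. It does not use Aparapi and is not how Aparapi is implemented; `runPasses`, `runFlat`, `dot`, and the `launches` counter are hypothetical names for illustration. The assumption, taken from the description above, is that a two-argument execute behaves like one launch per pass, while a one-argument execute is a single launch over all ids.

```java
import java.util.Arrays;

class LaunchEmulation {

    // Number of separate "kernel launches" used by the most recent run (illustrative only).
    static int launches;

    // Emulates the two-argument style: one launch per pass, n work-items per launch.
    // Each work-item derives its row from the global id and its column from the pass id.
    static float[] runPasses(float[] a, float[] b, int n) {
        float[] c = new float[n * n];
        launches = 0;
        for (int pass = 0; pass < n; pass++) {
            launches++; // every pass is assumed to cost a separate launch
            for (int gid = 0; gid < n; gid++) {
                c[gid * n + pass] = dot(a, b, n, gid, pass);
            }
        }
        return c;
    }

    // Emulates the one-argument style: a single launch over n * n work-items.
    // Each work-item derives both row and column from its global id.
    static float[] runFlat(float[] a, float[] b, int n) {
        float[] c = new float[n * n];
        launches = 1;
        for (int gid = 0; gid < n * n; gid++) {
            int i = gid / n;
            int j = gid % n;
            c[i * n + j] = dot(a, b, n, i, j);
        }
        return c;
    }

    // Dot product of row i of a and column j of b (both n x n, row-major).
    static float dot(float[] a, float[] b, int n, int i, int j) {
        float value = 0;
        for (int k = 0; k < n; k++) {
            value += a[i * n + k] * b[k * n + j];
        }
        return value;
    }

    public static void main(String[] args) {
        int n = 4;
        float[] a = new float[n * n];
        float[] b = new float[n * n];
        for (int x = 0; x < n * n; x++) {
            a[x] = x;
            b[x] = n * n - x;
        }
        float[] byPasses = runPasses(a, b, n);
        int passLaunches = launches;
        float[] flat = runFlat(a, b, n);
        int flatLaunches = launches;
        System.out.println("launches with passes: " + passLaunches + ", flat: " + flatLaunches);
        System.out.println("same result: " + Arrays.equals(byPasses, flat));
    }
}
```

Both variants compute the same matrix; only the number of launches differs, which is where the accumulated execution cost comes from under the stated assumption.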

I have included a slightly modified form of the code which calls execute(c*r) and a slightly modified Kernel.run method which is called once.

I like your example and would like to include it on aparapi.googlecode.com with your permission, perhaps even as a sample/example project. Let me know if this would be OK with you. I will obviously credit you as the originator and link to your blog.

Here is my modified run() method

    public void run() {
        int i = getGlobalId() / rows; // was getGlobalId()
        int j = getGlobalId() % rows; // was getPassId();
        float value = 0;
        for (int k = 0; k < cols1; k++) {
            value += matA[k + i * cols1] * matB[k * cols2 + j];
        }
        matC[i * cols1 + j] = value;
    }

And here is my modified execution.

ap.execute(r*c2);

For me (on a laptop) the numbers are:

GPU: 2688

JTP: 8376

REFERENCE: 19690

Would you mind trying this version to see if it performs better for you?

Gary

Thanks, Gary. Thank you very much for making changes to my code.

You can use my code for the samples; I have no problem with that.

I corrected the code:

    import java.util.Random;
    import com.amd.aparapi.Kernel;

    /**
     * @author Vasanth Raja Chittampally
     */
    public class AparapiMatrixMultiplication {

        public static void main(String[] args) throws Exception {
            final int r = 1024;
            final int c1 = r;
            final int c2 = r;
            AparapiMatMul ap = new AparapiMatMul(r, c1, c2);
            try {
                long time1 = System.currentTimeMillis();
                //ap.setExecutionMode(Kernel.EXECUTION_MODE.JTP);
                ap.execute(r * c2);
                System.out.println("Time taken for kenel execution in " + ap.getExecutionMode() + " mode is :" + (System.currentTimeMillis() - time1));
            } catch (NullPointerException ne) {
                ne.printStackTrace();
            }
            //ap.printResults();
            long time1 = System.currentTimeMillis();
            ap.normalMatMulCalc();
            System.out.println("Time taken for kenel execution in Sequential CPU mode is :" + (System.currentTimeMillis() - time1));
            ap.compareResults();
            ap.dispose();
        }
    }

    class AparapiMatMul extends Kernel {

        float matA[];
        float matB[];
        float matC[];
        float C[];
        int rows;
        int cols1;
        int cols2;

        @Override
        public void run() {
            int i = getGlobalId() / rows;
            int j = getGlobalId() % rows;
            float value = 0;
            for (int k = 0; k < cols1; k++) {
                value += matA[k + i * cols1] * matB[k * cols2 + j];
            }
            matC[i * cols1 + j] = value;
        }

        public AparapiMatMul(int r, int c1, int c2) {
            rows = r;
            cols1 = c1;
            cols2 = c2;
            matA = new float[r * c1];
            matB = new float[c1 * c2];
            matC = new float[r * c2];
            C = new float[r * c2];
            //matC should be initialized with zeros
            for (int i = 0; i < r; i++) {
                for (int j = 0; j < c1; j++) {
                    matC[i * c1 + j] = 0;
                }
            }
            //Here matrix A is initialized with random numbers
            for (int i = 0; i < r; i++) {
                for (int j = 0; j < c1; j++) {
                    matA[i * c1 + j] = new Random().nextFloat();
                }
            }
            // Here matrix B is initialized with random numbers
            for (int i = 0; i < r; i++) {
                for (int j = 0; j < c1; j++) {
                    matB[i * c2 + j] = new Random().nextFloat();
                }
            }
        }

        public void printResults() {
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols2; j++) {
                    System.out.print(matC[i * cols2 + j] + " ");
                }
            }
        }

        public void normalMatMulCalc() {
            System.out.println();
            System.out.println("Sequential Execution on CPU");
            for (int i = 0; i < rows; i++) {
                for (int j = 0; j < cols2; j++) {
                    float sum = 0;
                    for (int k = 0; k < cols1; k++) {
                        sum += matA[i * cols1 + k] * matB[k * rows + j];
                    }
                    C[i * cols2 + j] = sum;
                }
            }
        }

        public void compareResults() {
            boolean equal = true;
            for (int i = 0; i < rows * cols2; i++) {
                if (matC[i] != C[i]) {
                    equal = false;
                    break;
                }
            }
            if (!equal)
                System.out.println("Results are not equal");
            else
                System.out.println("Results are equal.. Tested thoroughly!!!");
        }
    }

The results I got are amazing:

Output I: GPU Mode

Time taken for kenel execution in GPU mode is :914

Sequential Execution on CPU

Time taken for kenel execution in Sequential CPU mode is :12389

Results are equal.. Tested thoroughly!!!

Output II: JTP Mode

Time taken for kenel execution in JTP mode is :6564

Sequential Execution on CPU

Time taken for kenel execution in Sequential CPU mode is :12285

Results are equal.. Tested thoroughly!!!

Hello,

Thanks a lot for your code for the matrix multiplication.

Actually, I don’t understand these two lines:

int i = getGlobalId()/rows; // was getGlobalId()

int j = getGlobalId()%rows; // was getPassId();

I don’t understand why you use division for i and the modulo operator for j.

Could you please explain it to me?

Thanks in advance

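Those two lines find the row and column of a particular element in the flattened array.

As an example, suppose a one-dimensional array represents a grid of 5 rows x 10 columns stored row by row. That’s a one-dimensional array of 5×10 = 50 elements.

Any linear id (say 13) can be converted to an x,y coordinate by dividing by the row width, which is the number of columns (10 here):

y = 13/10; // 1

x = 13%10; // 3

In the square case in the post, rows equals the number of columns, so dividing by rows gives the same result.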
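The conversion can be checked in plain Java. This is a minimal sketch with hypothetical helper names (`rowOf`, `colOf`, `linearId`), assuming the grid is stored row by row, so that the divisor is the number of columns in a row.

```java
class IndexMapping {

    // Row of a linear id in a row-major grid with the given number of columns.
    static int rowOf(int id, int cols) {
        return id / cols;
    }

    // Column of a linear id in a row-major grid with the given number of columns.
    static int colOf(int id, int cols) {
        return id % cols;
    }

    // Inverse mapping: (row, col) back to the linear id.
    static int linearId(int row, int col, int cols) {
        return row * cols + col;
    }

    public static void main(String[] args) {
        int cols = 10; // a 5 x 10 grid has 10 elements per row
        int id = 13;
        int row = rowOf(id, cols);
        int col = colOf(id, cols);
        System.out.println("id " + id + " -> row " + row + ", col " + col); // id 13 -> row 1, col 3
        System.out.println("back to id: " + linearId(row, col, cols));      // back to id: 13
    }
}
```

Division tells you how many complete rows come before the element, and the remainder is the position within its row, which is why the kernel uses `/` for i and `%` for j.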

gary

Your example works only if each matrix is (n,n). If you test your code with, say, matA (2,5) and matB (5,3), the result is not correct, because rows, cols1 and cols2 are not used in the right places.

Here is the correct code for your kernel:

    int i = getGlobalId() / cols2;
    int j = getGlobalId() % cols2;
    float value = 0;
    for (int k = 0; k < cols1; k++) {
        value += matA[k + i * cols1] * matB[k * cols2 + j];
    }
    matC[i * cols2 + j] = value;
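One way to check this indexing on a non-square case is to emulate it in plain Java, without Aparapi. The sketch below is an illustration only: `multiplyFlat` is a hypothetical helper that loops over the same linear ids the kernel would receive, `multiplyReference` is an ordinary triple loop, and `sequence` just fills a matrix with 1, 2, 3, ... for a 2×5 by 5×3 product.

```java
import java.util.Arrays;

class NonSquareCheck {

    // Emulates the cols2-based kernel body for every global id in [0, rows * cols2).
    static float[] multiplyFlat(float[] a, float[] b, int rows, int cols1, int cols2) {
        float[] c = new float[rows * cols2];
        for (int gid = 0; gid < rows * cols2; gid++) {
            int i = gid / cols2; // row of the result element
            int j = gid % cols2; // column of the result element
            float value = 0;
            for (int k = 0; k < cols1; k++) {
                value += a[i * cols1 + k] * b[k * cols2 + j];
            }
            c[i * cols2 + j] = value;
        }
        return c;
    }

    // Straightforward triple-loop reference with row-major indexing.
    static float[] multiplyReference(float[] a, float[] b, int rows, int cols1, int cols2) {
        float[] c = new float[rows * cols2];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols2; j++) {
                float sum = 0;
                for (int k = 0; k < cols1; k++) {
                    sum += a[i * cols1 + k] * b[k * cols2 + j];
                }
                c[i * cols2 + j] = sum;
            }
        }
        return c;
    }

    // Fills an array with 1, 2, 3, ... so the product is easy to verify by hand.
    static float[] sequence(int length) {
        float[] v = new float[length];
        for (int x = 0; x < length; x++) {
            v[x] = x + 1;
        }
        return v;
    }

    public static void main(String[] args) {
        int rows = 2, cols1 = 5, cols2 = 3;
        float[] a = sequence(rows * cols1);
        float[] b = sequence(cols1 * cols2);
        float[] flat = multiplyFlat(a, b, rows, cols1, cols2);
        float[] reference = multiplyReference(a, b, rows, cols1, cols2);
        System.out.println(Arrays.toString(flat));
        System.out.println("matches reference: " + Arrays.equals(flat, reference));
    }
}
```

With rows = 2 and cols2 = 3, dividing the global id by rows instead would produce a row index of 2 for ids 4 and 5, which is outside the 2-row matrix, so the choice of divisor matters as soon as the matrices are not square.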

I’m learning Aparapi and find this article very useful as a first step in learning how to hand heavy work (some tasks) to the GPU from Java.

Thanks for the comments; they help me understand what’s happening in the code, as in:

execute(r,c2)

execute(r *c2)

and the

int i = getGlobalId() /cols2;

int j = getGlobalId() % cols2;

I’m happy to have found this article! It’s really useful. By the way, I ran your code but I cannot get a GPU result, only a CPU result. Is there a problem with the code, or should I install something? My processor is an Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz.

Here’s my result:

Time taken for kenel execution in GPU mode is :1919

Sequential Execution on CPU

Time taken for kenel execution in Sequential CPU mode is :15103

Results are equal.. Tested thoroughly!!!

I hope you don’t mind replying. Thank you!

Hi,

You did not mention which GPU you have. The results might vary with the underlying hardware.