I am trying to implement matrix multiplication with multiple threads. Everything seems to work correctly; however, it works much slower than the usual algorithm. Here is my code:

```java
public class Main {
    private static int nRows = 500;                            // number of rows and columns in matrices
    private static int[][] matrix1 = new int[nRows][nRows];    // first matrix for multiplication
    private static int[][] matrix2 = new int[nRows][nRows];    // second matrix for multiplication
    private static int[][] result1 = new int[nRows][nRows];    // result from linear matrix multiplication
    private static int[][] result2 = new int[nRows][nRows];    // result from parallel matrix multiplication
    private static Thread[][] pool = new Thread[nRows][nRows]; // array of threads

    // method used for transposing a matrix to get its columns easily
    public static int[][] transpose(int[][] matrix) {
        int[][] newMatrix = new int[matrix[0].length][matrix.length];
        for (int i = 0; i < matrix[0].length; i++) {
            for (int j = 0; j < matrix.length; j++) {
                newMatrix[i][j] = matrix[j][i];
            }
        }
        return newMatrix;
    }

    public static void main(String[] args) {
        // initializing input matrices (setting all elements = 1)
        for (int i = 0; i < nRows; i++) {
            for (int j = 0; j < nRows; j++) {
                matrix1[i][j] = 1;
                matrix2[i][j] = 1;
            }
        }

        long start;
        long end;

        System.out.println("Linear algorithm");
        start = System.currentTimeMillis();

        // linear multiplication algorithm
        for (int i = 0; i < nRows; i++) {
            for (int j = 0; j < nRows; j++) {
                int temp = 0;
                for (int k = 0; k < nRows; k++) {
                    temp += matrix1[i][k] * matrix2[k][j];
                }
                result1[i][j] = temp;
            }
        }

        // show result
        // for (int i = 0; i < nRows; i++) {
        //     for (int j = 0; j < nRows; j++) {
        //         System.out.print(result1[i][j] + " ");
        //     }
        //     System.out.println();
        // }

        end = System.currentTimeMillis();
        System.out.println("Time with linear algorithm: " + (end - start));

        //--------------------

        System.out.println("Parallel algorithm");
        start = System.currentTimeMillis();

        int[][] matrix3 = transpose(matrix2); // get a transposed copy of the second matrix
        for (int i = 0; i < nRows; i++) {
            for (int j = 0; j < nRows; j++) {
                pool[i][j] = new myThread(matrix1[i], matrix3[j], i, j); // creating a thread for each element
                pool[i][j].start();                                      // starting a thread
            }
        }
        for (int i = 0; i < nRows; i++) {
            for (int j = 0; j < nRows; j++) {
                try {
                    pool[i][j].join(); // waiting for the thread to finish its job
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }
        }

        // show the result
        // for (int i = 0; i < nRows; i++) {
        //     for (int j = 0; j < nRows; j++) {
        //         System.out.print(result2[i][j] + " ");
        //     }
        //     System.out.println();
        // }

        end = System.currentTimeMillis();
        System.out.println("Time with parallel algorithm: " + (end - start));
    }

    // class where the parallel multiplication is implemented
    private static class myThread extends Thread {
        private int[] row; // row for multiplication
        private int[] col; // column for multiplication
        private int i;     // row index of the element in the resulting matrix
        private int j;     // column index of the element in the resulting matrix

        // constructor
        public myThread(int[] r, int[] c, int i, int j) {
            row = r;
            col = c;
            this.i = i;
            this.j = j;
        }

        public void run() {
            int temp = 0;
            for (int k = 0; k < nRows; k++) {
                temp += row[k] * col[k]; // getting the element by multiplying a row and a column of the two matrices
            }
            result2[i][j] = temp; // writing the resulting element to the resulting matrix
        }
    }
}
```

Here, I create a new thread for each element in the resulting matrix. I then write these threads to an array, start them and, finally, wait for them to finish working. I’ve seen some implementations where both whole input matrices are given as parameters to the thread. My task, however, is to come up with an algorithm where only the one row and one column that are necessary for this particular element are given.

After measuring the elapsed time I get the following results:

```
Linear algorithm
Time with linear algorithm: 557
Parallel algorithm
Time with parallel algorithm: 38262
```

What am I doing wrong? Thanks in advance!

## Answer

The code you’ve written will work fine on a GPU where the concept of threads is very different and the overhead is basically zero. On CPU-based systems, spawning threads is an exceptionally slow operation and it only makes sense if you can amortise this overhead over **a lot** of computational work.
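To see that spawn-and-join cost in isolation, here is a rough illustration (not a rigorous benchmark; the class and method names are mine, not from the question) that times threads whose bodies do essentially nothing, so the measurement is dominated by thread creation and scheduling:

```java
public class SpawnCostSketch {
    // Starts and joins n threads with trivial bodies, so the measured
    // time reflects thread-creation and scheduling overhead, not work.
    public static long spawnAndJoinNanos(int n) throws InterruptedException {
        long start = System.nanoTime();
        Thread[] threads = new Thread[n];
        for (int i = 0; i < n; i++) {
            threads[i] = new Thread(() -> { }); // trivial body
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        // A 500x500 result means 250,000 such spawn/join cycles;
        // even a few thousand is clearly non-trivial.
        System.out.println(spawnAndJoinNanos(1000) / 1_000_000 + " ms for 1000 threads");
    }
}
```

Scaling the per-thread overhead by 250,000 (one thread per element of a 500×500 result) accounts for most of the slowdown you measured.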

Here is some general advice that will help you write better parallel algorithms for CPUs:

- With computationally heavy tasks, use as many threads as there are physical execution units (cores). SMT techniques such as Hyper-Threading do not help much unless there is a lot of memory latency. For small matrices that fit in the L1 and L2 CPU caches, the latency is very low and there is nothing to gain from SMT. When more than one thread shares the same core, the OS has to context-switch between them, which adds overhead and may thrash the cache.
- Keep the parallelisation granularity as coarse as possible so as to maximise the work per thread. Instead of one row-times-column operation per thread, have each thread operate on contiguous blocks of rows / columns. You can try parallelising only the outer loop, i.e., only over the rows of the first matrix.
- Keep the number of threads dependent on the hardware properties (number of cores) and independent of the problem size. Spawning a separate thread for each element of the result scales the overhead linearly with the problem size, which is really bad from a performance point of view.
- Avoid false sharing. This happens when two or more threads running on different cores write to memory locations that fall in the same cache line. When one thread updates the cache of its core, the change propagates and invalidates the caches of the other cores that hold the same cache line, forcing them to refetch the data. In your case, 16 consecutive values of `result2` fall in the same cache line (cache lines on x86 and ARM are 64 bytes long, an `int` is 4 bytes) and are written by 16 different threads. The use of a temporary summation variable alleviates this problem somewhat; it is much more severe when false sharing happens repeatedly in an inner(-most) loop.
- Use thread pools for repeated tasks when the number of work items exceeds the number of threads and each thread will receive work multiple times. In your case, you give each thread a single work item, so this is not really pooling.

In summary, start as many threads as physical cores and have them work on big contiguous chunks of the input matrices.