Computer Science Code Task

Implement a tiled dense matrix multiplication routine using shared memory in CUDA.
Your code should be able to handle arbitrary sizes for square matrices.

Handle thread divergence when dealing with arbitrary matrix sizes.
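One possible shape for such a routine is sketched below. It is only an illustration under stated assumptions, not the NVIDIA sample and not a reference solution: the names tiledMatMulKernel, tiledMatMul and TILE are hypothetical, the matrices are assumed square, row-major and small enough to fit in device memory, and the tile width of 16 is an arbitrary choice. Out-of-range threads store zeros into the shared tiles instead of skipping the load, so every thread in a block reaches both barriers and any divergence is limited to the edge tiles.

```cuda
#include <cassert>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile width (hypothetical choice)

// C = A * B for square n x n row-major matrices; n need not be a multiple of TILE.
__global__ void tiledMatMulKernel(float *C, const float *A, const float *B, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C owned by this thread
    int col = blockIdx.x * TILE + threadIdx.x;  // column of C owned by this thread
    float acc = 0.0f;

    int numTiles = (n + TILE - 1) / TILE;       // round up to cover partial tiles
    for (int t = 0; t < numTiles; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        // Pad with zeros outside the matrix; zeros do not change the dot product.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();                        // both tiles fully loaded

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                        // finish reading before the next load
    }

    // Only threads that map to a real element of C write a result.
    if (row < n && col < n)
        C[row * n + col] = acc;
}

// Hypothetical host wrapper: copies inputs to the device, launches, copies C back.
cudaError_t tiledMatMul(const float *hA, const float *hB, float *hC, int n)
{
    assert(n > 0);
    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    cudaError_t err = cudaMalloc(&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dC, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(TILE, TILE);
        dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
        tiledMatMulKernel<<<grid, block>>>(dC, dA, dB, n);
        err = cudaGetLastError();               // catches launch configuration errors
    }
    if (err == cudaSuccess) err = cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return err;
}
```

Calling tiledMatMul(A, B, C, 37) with host arrays of 37 * 37 floats exercises the partial edge tiles, since 37 is not a multiple of 16. The sample code further down takes the tile width as a template parameter (BLOCK_SIZE) rather than a global constant; either approach should work with this pattern.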

Written Questions-

(1) How many floating-point operations are being performed in your dense matrix multiply kernel if the matrix size is N times N? Explain.
(2) How many global memory reads are being performed by your kernel? Explain.
(3) How many global memory writes are being performed by your kernel? Explain.
(4) Describe what further optimizations can be applied to your kernel to achieve a performance speedup.
(5) Suppose you have matrices with dimensions bigger than the max thread dimensions allowed in CUDA. Sketch an algorithm that would perform the matrix multiplication in this case.
(6) Suppose you have matrices that would not fit in global memory. Sketch an algorithm that would perform the multiplication out of place.
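As a starting point for questions (1) to (3), the counts can be worked out for one common kernel design. This is a sketch under assumptions, not the answer for every kernel: it assumes a tiled kernel with tile width T (a symbol introduced here), N divisible by T, one thread per element of C, and a multiply and an add counted as two separate operations.

```latex
\begin{align*}
\text{FLOPs} &= \underbrace{N^2}_{\text{elements of } C} \cdot \underbrace{2N}_{N \text{ multiplies} + N \text{ adds}} = 2N^3 \\
\text{global reads} &= \underbrace{N^2}_{\text{threads}} \cdot \underbrace{\frac{N}{T}}_{\text{tile phases}} \cdot \underbrace{2}_{\text{one element of } A,\ \text{one of } B} = \frac{2N^3}{T} \\
\text{global writes} &= N^2
\end{align*}
```

Under the same assumptions, a kernel without shared-memory tiling would read two values from global memory per multiply-add, about 2N^3 reads in total, so tiling reduces global read traffic by roughly a factor of T. When N is not a multiple of T, the number of tile phases becomes the ceiling of N/T and the edge tiles perform fewer useful loads.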

Special instruction-
The sample code is provided by NVIDIA. I have included the code with this document.
The code runs right now. What I need is the code task completed and the written questions above answered.

Code is as follows-
/**
* Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
*
* Please refer to the NVIDIA end user license agreement (EULA) associated
* with this source code for terms and conditions that govern your use of
* this software. Any use, reproduction, disclosure, or distribution of
* this software and related documentation outside the terms of the EULA
* is strictly prohibited.
*
*/

/**
* Matrix multiplication: C = A * B.
* Host code.
*
* This sample implements matrix multiplication as described in Chapter 3
* of the programming guide.
* It has been written for clarity of exposition to illustrate various CUDA
* programming principles, not with the goal of providing the most
* performant generic kernel for matrix multiplication.
*
* See also:
* V. Volkov and J. Demmel, “Benchmarking GPUs to tune dense linear algebra,”
* in Proc. 2008 ACM/IEEE Conf. on Supercomputing (SC ’08),
* Piscataway, NJ: IEEE Press, 2008, pp. Art. 31:1-11.
*/

// System includes
#include <stdio.h>
#include <assert.h>

// CUDA runtime
#include <cuda_runtime.h>

// Helper functions and utilities to work with CUDA
#include <helper_functions.h>
#include <helper_cuda.h>

/**
* Matrix multiplication (CUDA Kernel) on the device: C = A * B
* wA is A’s width and wB is B’s width
*/
template <int BLOCK_SIZE> __global__ void
matrixMulCUDA(float *C, float *A, float *B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;