Issue
I make following this original post : PyCuda code to invert a high number of 3×3 matrixes.
The code suggested as an answer is :
$ cat t14.py
import numpy as np
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.autoinit
# kernel
kernel = SourceModule("""
__device__ unsigned getoff(unsigned &off){
unsigned ret = off & 0x0F;
off >>= 4;
return ret;
}
// inplace is acceptable i.e. out == in)
// T = float or double only
const int block_size = 288;
typedef double T; // *** can set to float or double
__global__ void inv3x3(const T * __restrict__ in, T * __restrict__ out, const size_t n, const unsigned * __restrict__ pat){
__shared__ T si[block_size];
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
T det = 1;
if (idx < n*9)
det = in[idx];
unsigned sibase = (threadIdx.x / 9)*9;
unsigned lane = threadIdx.x  sibase; // cheaper modulo
si[threadIdx.x] = det;
__syncthreads();
unsigned off = pat[lane];
T a = si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
T b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
a = b;
__syncthreads();
if (lane == 0) si[sibase+3] = a;
if (lane == 3) si[sibase+4] = a;
if (lane == 6) si[sibase+5] = a;
__syncthreads();
det = si[sibase]*si[sibase+3]+si[sibase+1]*si[sibase+4]+si[sibase+2]*si[sibase+5];
if (idx < n*9)
out[idx] = a / det;
}
""")
# host code
def gpuinv3x3(inp, n):
# internal constants not to be modified
hpat = (0x07584, 0x08172, 0x04251, 0x08365, 0x06280, 0x05032, 0x06473, 0x07061, 0x03140)
# Convert parameters into numpy array
# *** change next line between float32 and float64 to match float or double
inpd = np.array(inp, dtype=np.float64)
hpatd = np.array(hpat, dtype=np.uint32)
# *** change next line between float32 and float64 to match float or double
output = np.empty((n*9), dtype= np.float64)
# Get kernel function
matinv3x3 = kernel.get_function("inv3x3")
# Define block, grid and compute
blockDim = (288,1,1) # do not change
gridDim = ((n/32)+1,1,1)
# Kernel function
matinv3x3 (
cuda.In(inpd), cuda.Out(output), np.uint64(n), cuda.In(hpatd),
block=blockDim, grid=gridDim)
return output
inp = (1.0, 1.0, 1.0, 0.0, 0.0, 3.0, 1.0, 2.0, 2.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0)
n = 2
result = gpuinv3x3(inp, n)
print(result.reshape(2,3,3))
The result gives, on an initial 1D array containing 18 values (so 2 matrixes 3×3), the right inverted matrixes, i.e :
[[[ 2. 0. 1. ]
[1. 0.33333333 1. ]
[0. 0.33333333 0. ]]
[[ 1. 0. 0. ]
[ 0. 1. 0. ]
[ 0. 0. 1. ]]]
Main issue : I would like to understand in detail the working of this algorithm, especially how the kernel allows to use shared memory for the initial 1D vector and brings then optimization when I execute this code on a large number of 3×3 matrixes.

I understand the line :
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
which gives the global index of current workitem identified by local threadIdx and blockIdx of the current workinggroup block. 
I understand that
__shared__ T si[block_size];
represents a share array, i.e associated to workgroup blocks : this is what we callLocal Memory
. 
On the other hand, I don’t understand this following part of kernel code :
__shared__ T si[block_size]; size_t idx = threadIdx.x+blockDim.x*blockIdx.x; T det = 1; if (idx < n*9) det = in[idx]; unsigned sibase = (threadIdx.x / 9)*9; unsigned lane = threadIdx.x  sibase; // cheaper modulo si[threadIdx.x] = det; __syncthreads(); unsigned off = pat[lane]; c __syncthreads(); if (lane == 0) si[sibase+3] = a; if (lane == 3) si[sibase+4] = a; if (lane == 6) si[sibase+5] = a; __syncthreads();
Indeed, what’s the role of sibase
index defined by unsigned sibase = (threadIdx.x / 9)*9;
and also, what is the utility of parameter lane
defined by : unsigned lane = threadIdx.x  sibase; // cheaper modulo
Finally, shifting are applied with :
T a = si[sibase + getoff(off)];
a *= si[sibase + getoff(off)];
T b = si[sibase + getoff(off)];
b *= si[sibase + getoff(off)];
a = b;
But I don’t see clearly the functionality.

Same problem for me about this part :
if (lane == 0) si[sibase+3] = a; if (lane == 3) si[sibase+4] = a; if (lane == 6) si[sibase+5] = a;

The determinant is calculated in a weird way that I can’t grasp, i.e :
det = si[sibase]*si[sibase+3]+si[sibase+1]*si[sibase+4]+si[sibase+2]*si[sibase+5];
I am not beginner in OpenCL, but I am not enough expert to understand fully this kernel code.
Solution
Preliminaries
First, its important to understand the arithmetic of a 3×3 matrix inversion, see here (and below).
The general methodology used for kernel design is to assign one matrix result element per thread. Therefore I will need 9 threads per matrix. Ultimately each thread will be responsible for computing one of the 9 numerical results, for each matrix. In order to compute two matrices, we then need 18 threads, 3 matrices require 27 threads.
An ancillary task is to decide threadblock/grid sizing. This follows typical methods (overall problem size determines total number of threads needed), but we will make a specific choice of 288 for threadblock size, as this is a convenient multiple of both 9 (number of threads per matrix) and 32 (number of threads per warp in CUDA), which gives us a certain measure of efficiency (no wasted threads, no gaps in data storage).
Since our thread strategy is one thread per matrix element, we must collectively solve the matrix inversion arithmetic using 9 threads. The major tasks are to compute the transposed matrix of cofactors, and then to compute the determinant, then do the final arithmetic (divide by the determinant) to compute each result element.
Computation of the cofactors
The first task is to compute the transposed matrix of cofactors of A
, called M
:
a b c
let A = d e f
g h i
eifh chbi bfce
M = fgdi aicg cdaf
dheg bgah aebd
We have 9 threads for this task, and nine elements of matrix M
to compute, so we will assign one thread to each element of M
. Each element of M
depends on multiple input values (a
, b
, c
, etc.) so we shall first load each input value (there are 9, one per thread), into shared memory:
// allocate enough shared memory for one element per thread in the block:
__shared__ T si[block_size];
// compute a globally unique thread index, so each thread has a unique number 0,1,2,etc.
size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
// establish a temporary variable that will use and reuse during thread processing
T det = 1;
// do a thread check to make sure that our next load will be inbounds for the input array in
if (idx < n*9)
// load one element per thread, 9 threads per matrix will load an entire matrix
det = in[idx];
// for a given matrix (9 threads) compute the base offset into shared memory, where this matrix data (9 elements) will be stored. All 9 threads have the same base offset
unsigned sibase = (threadIdx.x / 9)*9;
// for each group of 9 threads handling a matrix, compute for each thread in that group, a group offset or "lane" from 0..8, so each thread in the group has a unique identifier/assignment in the group
unsigned lane = threadIdx.x  sibase; // cheaper modulo
// let each thread place its matrix element a,b,c, etc. into shared memory
si[threadIdx.x] = det;
// shared memory is now loaded, make sure all threads have loaded before any calculations begin
__syncthreads();
now that each A
matrix element (a
, b
, c
, …) is loaded into shared memory, we can start to compute the cofactors in M
. Let’s focus on a particular thread (0) and its cofactor (eifh
). All of the needed matrix elements to compute this cofactor (e
, i
, f
, and h
) are now in shared memory. We need a method to load them in sequence, and perform the needed multiplications and subtractions.
At this point we observe two things:
 each
M
element (cofactor) has a different set of 4 needed elements ofA
 each
M
element (cofactor) follows the same general arithmetic, given four arbitrary elements ofA
, lets refer to them generically asX
,Y
,Z
andW
. The arithmetic is XYZW. I take the first element, multply it by the second, and then take the third and fourth element and multiply them together, then subtract the two products.
Since the general sequence of operations (2, above) is the same for all 9 cofactors, we only need a method to arrange the loading of the 4 needed matrix elements. This methodology is encoded into the load patterns that are hardcoded into the example:
hpat = (0x07584, 0x08172, 0x04251, 0x08365, 0x06280, 0x05032, 0x06473, 0x07061, 0x03140)
There are 9 load patterns, each occupying a hexadecimal quantity, one load pattern per thread, i.e. one load pattern per M
matrix element (cofactor). Within a particular A
matrix, the matrix elements a
, b
, c
etc. are (already) loaded into shared memory at group offsets of 0, 1, 2, etc. The load pattern for a given thread will allow us to generate the sequence of group offsets, needed to retrieve the matrix elements of A
from their locations in shared memory, to be used in sequence to compute the cofactor assigned to that thread. Considering thread 0, and its cofactor eifh
, how does the load pattern 0x7584
encode the needed pattern to select e
, then i
, then f
, then h
?
For this we have a helper function getoff
which takes a load pattern, and successively (each time it is called) strips off an index. The first time I call getoff
with an argument of 0x7584
, it "strips off" the index 4, returns that, and replaces the 0x7584
load pattern with 0x758
for the next usage. 4 corresponds to e
. The next time I call getoff
with 0x758
it "strips off" the index 8, returns that, and replaces 0x758
with 0x75
. 8 corresponds to i
. The next time produces the index 5, corresponding to f
, and the last time produces the index 7, corresponding to h
.
With that description then we will walk through the code, pretending we are thread 0, and describe the process of computing eifh
:
// get the load pattern for my matrix "lane"
unsigned off = pat[lane];
//load my temporary variable `a` with the first item indexed in the load pattern:
T a = si[sibase + getoff(off)];
// multiply my temporary variable `a` with the second item indexed in the load pattern
a *= si[sibase + getoff(off)];
//load my temporary variable `b` with the third item indexed in the load pattern
T b = si[sibase + getoff(off)];
// multiply my temporary variable `b` with the fourth item indexed in the load pattern
b *= si[sibase + getoff(off)];
// compute the cofactor by subtracting the 2 products
a = b;
sibase
, as already indicated in the first commented code section, is the base offset in shared memory where that A
matrix elements are stored. The getoff
function then adds to this base address to select the relevant input element.
Computation of the determinant
The numerical value of the determinant is given by:
det(A) = det = a(eifh)  b(difg) + c(dheg)
If we decompose this, we see that all terms are actually already computed:
a,b,c: these are input matrix elements, in shared locations (group offsets) 0, 1, 2
eifh: cofactor computed by thread 0
difg: cofactor computed by thread 3 (with sign reversed)
dheg: cofactor computed by thread 6
Now, every thread will need the value of the determinant because it will be used by each thread during computation of its final (result) element. Therefore we will have every thread in the matrix redundantly compute the same value (which is more efficient than computing it, say, in one thread, then broadcasting that value to the other threads). In order to facilitate this, we will need 3 of the already computed cofactors made available to all 9 threads. So we will select 3 (no longer needed) locations in shared memory to "publish" these values. We still need the values in locations 0, 1, 2 because we need the input matrix elements a
, b
, and c
for calculation of the determinant. But we no longer need the input elements in locations 3, 4, or 5 for the remainder of our work, so we will reuse those:
// we are about to change shared values, so wait until all previous usage is complete
__syncthreads();
// load cofactor computed by thread 0 into group offset 3 in shared
if (lane == 0) si[sibase+3] = a;
// load cofactor computed by thread 3 into group offset 4 in shared
if (lane == 3) si[sibase+4] = a;
// load cofactor computed by thread 6 into group offset 5 in shared
if (lane == 6) si[sibase+5] = a;
// make sure shared memory loads are complete
__syncthreads();
// let every thread compute the determinant (same for all threads)
// a * (eifh) + b * (fgdi) + c * (dheg)
det = si[sibase]*si[sibase+3]+si[sibase+1]*si[sibase+4]+si[sibase+2]*si[sibase+5];
Calculation of final result
This only involves (for each thread) dividing the previously computed cofactor for that thread, by the justcomputed determinant, and storing that result:
// another thread check: make sure this thread is actually doing useful work
if (idx < n*9)
// take previously computed cofactor, divide by determinant, store result
out[idx] = a / det;
Answered By – Robert Crovella
This Answer collected from stackoverflow, is licensed under cc bysa 2.5 , cc bysa 3.0 and cc bysa 4.0