From: Joachim Selke on
Tim Davis wrote:
> I can't think of a better way ... except to go to a
> mexFunction of course.
>
> The reason I say that is because a mexFunction could do
> Atr(:,rows_j)'*B(:j) without forming the Atr(:,rows_j)
> matrix as an intermediate step.

Wow, that really speeds it up. Thanks! :-)

Again, I did some experiments using random matrices (m = 50000,
n = 10000, k = 15). The total time needed for constructing the product
was around 1.3 seconds on my computer with a MEX function written in C
(compared to 2.3 seconds using my previous implementation).

Since this is my first MEX function (and since I did not even program in
C before) it would help me a lot if you please could have a quick look
at my implementation. It is straight-forward and I am quite sure that
there is still room for improvement (maybe by doing some computations in
parallel?). Here is the code:


-------------------------------------------------------------------------

#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray
*prhs[]) {

const mxArray *a, *b, *x;
a = prhs[0];
b = prhs[1];
x = prhs[2];

mwSize m, n, k;
m = mxGetM(x);
n = mxGetN(x);
k = mxGetN(a);

mxArray *ab;
ab = mxDuplicateArray(x);

mwIndex *rows_ab, *cols_ab;
rows_ab = mxGetIr(ab);
cols_ab = mxGetJc(ab);

double *data_a, *data_b, *data_ab;
data_a = mxGetPr(a);
data_b = mxGetPr(b);
data_ab = mxGetPr(ab);

int i, j, r;
int ind_a, ind_b, ind_ab, ind_ab_next;
ind_ab = 0;
for (j = 0; j < n; j++) {
ind_ab_next = ind_ab + cols_ab[j + 1] - cols_ab[j];
while (ind_ab < ind_ab_next) {
i = rows_ab[ind_ab];
data_ab[ind_ab] = 0;
for (r = 0; r < k; r++) {
ind_a = r * m + i;
ind_b = j * k + r;
data_ab[ind_ab] += data_a[ind_a] * data_b[ind_b];
}
ind_ab++;
}
}

plhs[0] = ab;
}

-------------------------------------------------------------------------


Joachim
From: Tim Davis on
Joachim Selke <j.selke(a)tu-bs.de> wrote in message
<688agnF2rcjvjU1(a)mid.dfncis.de>...
> Tim Davis wrote:
> > I can't think of a better way ... except to go to a
> > mexFunction of course.
> >
> > The reason I say that is because a mexFunction could do
> > Atr(:,rows_j)'*B(:j) without forming the Atr(:,rows_j)
> > matrix as an intermediate step.
>
> Wow, that really speeds it up. Thanks! :-)
>
> Again, I did some experiments using random matrices (m =
50000,
> n = 10000, k = 15). The total time needed for constructing
the product
> was around 1.3 seconds on my computer with a MEX function
written in C
> (compared to 2.3 seconds using my previous implementation).
>
> Since this is my first MEX function (and since I did not
even program in
> C before) it would help me a lot if you please could have
a quick look
> at my implementation. It is straight-forward and I am
quite sure that
> there is still room for improvement (maybe by doing some
computations in
> parallel?). Here is the code:
>
>
>
-------------------------------------------------------------------------
>
> #include "mex.h"
>
> void mexFunction(int nlhs, mxArray *plhs[], int nrhs,
const mxArray
> *prhs[]) {
>
> const mxArray *a, *b, *x;
> a = prhs[0];
> b = prhs[1];
> x = prhs[2];
>
> mwSize m, n, k;
> m = mxGetM(x);
> n = mxGetN(x);
> k = mxGetN(a);
>
> mxArray *ab;
> ab = mxDuplicateArray(x);
>
> mwIndex *rows_ab, *cols_ab;
> rows_ab = mxGetIr(ab);
> cols_ab = mxGetJc(ab);
>
> double *data_a, *data_b, *data_ab;
> data_a = mxGetPr(a);
> data_b = mxGetPr(b);
> data_ab = mxGetPr(ab);
>
> int i, j, r;
> int ind_a, ind_b, ind_ab, ind_ab_next;
> ind_ab = 0;
> for (j = 0; j < n; j++) {
> ind_ab_next = ind_ab + cols_ab[j + 1] - cols_ab[j];
> while (ind_ab < ind_ab_next) {
> i = rows_ab[ind_ab];
> data_ab[ind_ab] = 0;
> for (r = 0; r < k; r++) {
> ind_a = r * m + i;
> ind_b = j * k + r;
> data_ab[ind_ab] += data_a[ind_a] *
data_b[ind_b];
> }
> ind_ab++;
> }
> }
>
> plhs[0] = ab;
> }
>
>
-------------------------------------------------------------------------
>
>
> Joachim

Looks fine to me. I could suggest only one thing -
mxDuplicateArray is copying the numerical values of x into
the output ab, then later you overwrite them. It would be
faster to create ab with mxCreateSparse, then fill the
matrix with both the pattern and the values (the Jc, Ir, and
data arrays). But the improvement would be minor.
From: Joachim Selke on
Tim Davis wrote:
> Looks fine to me. I could suggest only one thing -
> mxDuplicateArray is copying the numerical values of x into
> the output ab, then later you overwrite them. It would be
> faster to create ab with mxCreateSparse, then fill the
> matrix with both the pattern and the values (the Jc, Ir, and
> data arrays). But the improvement would be minor.

I tried that and it really seems to be minor. In fact, my implementation
of the copying (mxCreateSparse followed by memcopy) was slightly slower
than using mxDuplicateArray.

But I found two additional ways to speed things up:
(1) Replace the k write accesses to data_ab[ind_ab] in the inner loop
by write accesses to a temporary variable z and do a single write
operation to data_ab[ind_ab] after the loop has finished
(2) Linearize accesses to matrix A by using A's transpose instead of A

The new code only needs 0.7 s in my example scenario, computing of A' in
MATLAB included (old code: 1.3 s, M code: 2.3 s).

Here is the code:

---------------------------------------------------------------------

#include "mex.h"

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray
*prhs[]) {

const mxArray *atr, *b, *x;
atr = prhs[0];
b = prhs[1];
x = prhs[2];

mwSize m, n, k;
m = mxGetM(x);
n = mxGetN(x);
k = mxGetM(atr);

mxArray *ab;
ab = mxDuplicateArray(x);

mwIndex *rows_ab, *cols_ab;
rows_ab = mxGetIr(ab);
cols_ab = mxGetJc(ab);

double *data_atr, *data_b, *data_ab;
data_atr = mxGetPr(atr);
data_b = mxGetPr(b);
data_ab = mxGetPr(ab);

double z;
int i, j, r;
int ind_atr, ind_b, ind_ab, rows_left;
ind_ab = 0;
for (j = 0; j < n; j++) {
rows_left = cols_ab[j + 1] - cols_ab[j];
while (rows_left != 0) {
i = rows_ab[ind_ab];
z = 0;
ind_atr = i * k;
ind_b = j * k;
for (r = 0; r < k; r++) {
z += data_atr[ind_atr] * data_b[ind_b];
ind_atr++;
ind_b++;
}
data_ab[ind_ab] = z;
ind_ab++;
rows_left--;
}
}

plhs[0] = ab;
}

---------------------------------------------------------------------


Now I think that this is all one can do to speed it up. Am I right? :-)

Joachim
From: Tim Davis on
Instead of "int" you should use mwSignedIndex. It's safer,
if you ever go to a 64-bit platform. I also recommend not
using mwIndex at all, but mwSignedIndex in its place. The
problemis that mwIndex is unsigned, so codes like this:

mwIndex i, n ;
....

for (i = n-1 ; i >= 0 ; i--) { do something }

fail in an infinite loop, since i is always >= 0.