like numbers

The numerical estimate of matrix rank

Friday, January 27th, 2012, 14:25 by Martin Andersson. Posted in Applied math, Chemometrics, Coding


The most commonly used function to calculate the matrix rank is probably the one in the MATLAB software package [1]. It is based on the calculation of the singular values of the matrix A, and the number of singular values that are greater than some tolerance defines the rank. The algorithm is simple:

s = svd(A);
if nargin==1
    tol = max(size(A)) * max(s) * eps;
end
r = sum(s > tol);

The rank calculation in the software package Octave is coded a little differently, but it is mathematically equivalent. The function "rank.m" works surprisingly well, but I wonder what the theoretical foundation behind it could be. Using SVD, the original data is modeled as A = U*S*V', and if only r factors are kept, this becomes A = U(:,1:r) * S(1:r,1:r) * V(:,1:r)' + E. The singular values of A are equal to the square roots of the variance captured by each factor (the square root of a variance is a standard deviation).
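The truncated SVD model is easy to try out. Here is a small sketch in Python/NumPy (an illustration rather than the MATLAB code above) for a matrix whose third column is the sum of the first two, so its true rank is 2:

```python
import numpy as np

# Exactly rank-2 matrix: the third column is col1 + col2.
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0],
              [7.0, 8.0, 15.0],
              [1.0, 0.0, 1.0]])

# Full SVD: A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only r factors: A = U(:,1:r) * S(1:r,1:r) * V(:,1:r)' + E
r = 2
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
E = A - A_r

print(np.round(s, 6))     # the third singular value is numerically ~0
print(np.linalg.norm(E))  # the two-factor residual is numerically ~0
```

The two printed values confirm that two factors capture essentially all of the variance of this matrix.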

If the singular values are scaled so that the first one equals 1, a natural cutoff appears: a factor is insignificant when its contribution, measured as added standard deviation, is smaller than the machine precision (eps = 2.2e-16 for double precision). However, when additional data is added to a matrix, the singular values will in practice always become larger. This is related to the fact that the sum of squares of a matrix equals the sum of all its singular values squared. The cutoff limit should therefore be scaled by something, and in the MATLAB rank function it is scaled by the factor max(size(A)).
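The same default cutoff, max(size(A)) * max(s) * eps, is also what NumPy's matrix_rank applies, so the rule can be reproduced outside MATLAB in a few lines:

```python
import numpy as np

# 3x3 matrix whose third column is the sum of the first two (true rank 2).
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 9.0],
              [7.0, 8.0, 15.0]])

s = np.linalg.svd(A, compute_uv=False)

# MATLAB-style tolerance: largest dimension * largest singular value * eps
tol = max(A.shape) * s.max() * np.finfo(A.dtype).eps
r = int(np.sum(s > tol))

print(r)  # 2: the third singular value falls below the tolerance
# numpy.linalg.matrix_rank uses this same default tolerance
print(np.linalg.matrix_rank(A))  # 2
```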

Again, this works well, but it is difficult to understand the principle behind it. Questions that come to mind are:

  • Why are standard deviations used for the cutoff?
  • Why not use the variance instead, which is related to the variance explained?
  • Why is it scaled with max(size(A))?
  • Why not look at the root mean square error (rms) of the residuals instead?
  • Can the performance be improved, and can it be made more intuitive?

To address these questions, and to create something that is easy for myself to understand, I made a new algorithm for calculating the rank, called RANKS, where the trailing letter s stands for search (as in rank-search). I don't know whether it is better, and it is not my intention to make the superior algorithm for rank estimation. The purpose is just to make something reasonable and understandable for myself:

[m,n] = size(X);
s = svd(X);
% Reversed cumulative sum: csq(i) is the residual sum of
% squares when the first i-1 factors are kept.
csq = flipud(cumsum(flipud(s(:).^2)));
maxval = csq(1);
r = 0;
if maxval > 0
    csq_nrm = csq/maxval;             % relative residual sum of squares
    df = max(m,n)*(min(m,n):-1:1)';   % residual degrees of freedom
    msq = csq_nrm./df;                % relative mean square residual
    r = sum(msq > rms_tol^2);
end
% Diagnostics, for inspecting the search:
% for i=1:length(msq)
%     fprintf('%5g %20e\n', i, sqrt(msq(i)));
% end
% semilogy(1:min(m,n), sqrt(msq), 'bo-', [1 min(m,n)], [rms_tol rms_tol], 'k-'); figure(gcf);

The algorithm can be simplified, but the version above makes it easy to understand the idea. The full algorithm, modified to avoid divisions and unnecessary operations, follows below.

function r = ranks(X, rms_tol)
%RANKS Matrix rank *search*.
%   RANKS(X,rms_tol) is the number of linearly independent
%   rows or columns of X, with the root mean square error of
%   the residuals larger than rms_tol; slightly different
%   from the standard RANK function in MATLAB.
%   RANKS(X) provides an estimate of the number of linearly
%   independent rows or columns of a matrix X using a
%   default root mean square error rms_tol = 2.2e-16 (the
%   machine precision).
%   RANKS searches over r for the SVD model with r factors,
%      U(:,1:r)*S(1:r,1:r)*V(:,1:r)'
%   where the relative root mean square of the residuals
%      E = X - U(:,1:r)*S(1:r,1:r)*V(:,1:r)'
%   is less than rms_tol.
%   See also RANK.
%   Martin Andersson, 2012
if nargin==1
    rms_tol = eps;
end
[m,n] = size(X);
s = svd(X);
% Reversed cumulative sum: csq(i) is the residual sum of
% squares when the first i-1 factors are kept.
csq = flipud(cumsum(flipud(s(:).^2)));
maxval = csq(1);
r = 0;
if maxval > 0
    df = max(m,n)*(min(m,n):-1:1)';   % residual degrees of freedom
    r = sum(csq > df.*(maxval*rms_tol^2));
end
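For comparison, the same search can be sketched in Python/NumPy. This is a rough translation of the function above, not a tested equivalent, and it assumes the reversed-cumulative-sum reading described in the comments:

```python
import numpy as np

def ranks(X, rms_tol=np.finfo(float).eps):
    """Rank search: count the factors needed before the relative
    root mean square of the SVD residuals drops below rms_tol."""
    m, n = X.shape
    s = np.linalg.svd(X, compute_uv=False)
    # csq[i] is the residual sum of squares when the first i factors are kept
    csq = np.cumsum(s[::-1] ** 2)[::-1]
    maxval = csq[0]  # total sum of squares of X
    if maxval == 0:
        return 0
    # residual degrees of freedom for each candidate rank
    df = max(m, n) * np.arange(min(m, n), 0, -1)
    return int(np.sum(csq > df * (maxval * rms_tol ** 2)))

print(ranks(np.eye(4)))                                        # 4
print(ranks(np.outer([1.0, 2.0, 3.0], [1.0, 1.0]), 1e-8))      # 1
```

On a full-rank identity matrix the search keeps every factor, while on an exactly rank-one outer product it stops after the first factor.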

Finally, a remark about the difference between the classic RANK algorithm implemented in MATLAB and the new RANKS algorithm presented here, and why nobody seems to have cared about it. The reason is probably that the topic is close to meaningless in practice: even if the calculated rank differs slightly between the two algorithms, it rarely changes the conclusions about the original data set.

[1] Dongarra, J.J., J.R. Bunch, C.B. Moler, and G.W. Stewart, LINPACK Users' Guide, SIAM, Philadelphia, 1979.
