Saturday, December 3, 2011

Non-PD Matrices in R, Cont.

Let me preface this post by saying I am getting frustrated with R.  The syntax is not intuitive and the performance for matrix operations is slow.  Using Octave, a free Matlab clone, I can get over 6 Gflops on things that R is doing at less than 2.  After this post, I will focus on the statistical functions of R and will do any matrix operations in Octave, or maybe NumPy.


First, let's look at a situation where you find your self with a non-positive semi-definite (PSD) matrix.  Remember that the default chol() requires a positive definite (PD) matrix and we created a function that can handle a PSD matrix.  Imagine if you have 3 time series with different sampling intervals -- in our example we will have one that samples every day, another that samples M-W-F-Sa-Su, and one that samples T-R (Thursday) - Sa.  Further we only have a limited set of data - 3 weeks in our sample.  To create a correlation matrix, we would be forced to either only consider Saturdays (where all 3 series intersect), or to compute the correlations pairwise.  With only 3 intersection points, we would probably choose the pairwise method to attempt to get a better handle on the true underlying process.


We will use SAS IML to construct the series and PROC CORR from SAS to calculate the pairwise correlations.

proc iml;
/*True covariance of the system*/
C1 = {1 .5 0,
     .5 1 .25,
     0 .25 1};

r = root(C1)`;n=21;
out = J(n,3,.);

/*Generate 21 days worth of data*/

do i=1 to n;
   hold = J(1,3);
   do j=1 to 3;
      hold[j] = rannor(2);
   end;

   hold = r * hold`;
   out[i,] = hold`;

end; 

create out from out;
append from out;
close out;
quit;

/*Filter out the unobserved values*/
data out;
format date date9.;
set out;
retain date "04DEC2011"d;
if _n_ ^=1 then   date = date + 1;

if ^(weekday(date) in (2, 4, 6, 7, 1)) then   col2 = .;

if ^(weekday(date) in (3, 5, 7)) then   col3 = .;
run;

title "Observed Data";
proc print data=out noobs;
run;

title "Pairwise Correlation, observed";
proc corr data=out(drop=date)
          out=corr_out(where=(_type_ = "CORR"))
          ;run;

data corr_out;
set corr_out(drop=_type_ _name_);
run;

title "Correlation Eigenvalues";
proc iml;
use corr_out;
read all into corr;
close corr_out; e = eigval(corr); print e;
quit;

Here is the output:


Observed Data

date
COL1
COL2
COL3
04DEC2011
1.31183
0.08083
.
05DEC2011
2.04198
2.26740
.
06DEC2011
0.55341
.
-0.86957
07DEC2011
-0.29515
0.40556
.
08DEC2011
-0.11603
.
0.24319
09DEC2011
1.24763
0.92012
.
10DEC2011
0.65947
-0.30911
-0.28241
11DEC2011
0.69131
0.72022
.
12DEC2011
1.83849
0.31602
.
13DEC2011
-0.35605
.
-0.16218
14DEC2011
-0.96497
-2.40601
.
15DEC2011
-0.42313
.
-0.16670
16DEC2011
0.36461
-0.44278
.
17DEC2011
-1.97964
-0.47143
1.70729
18DEC2011
0.56940
0.68279
.
19DEC2011
-0.72632
-1.30090
.
20DEC2011
0.51960
.
-0.15246
21DEC2011
-0.73427
-0.13240
.
22DEC2011
0.34385
.
-0.79128
23DEC2011
0.00112
-1.70289
.
24DEC2011
-0.38676
-0.74608
-0.16191


Pairwise Correlation, observed



The CORR Procedure
3 Variables:
COL1 COL2 COL3


Simple Statistics
Variable
N
Mean
Std Dev
Sum
Minimum
Maximum
COL1
21
0.19811
0.96281
4.16039
-1.97964
2.04198
COL2
15
-0.14124
1.14684
-2.11865
-2.40601
2.26740
COL3
9
-0.07067
0.74955
-0.63602
-0.86957
1.70729


Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations

COL1
COL2
COL3
COL1
1.00000

21
0.66712
0.0066
15
-0.87999
0.0017
9
COL2
0.66712
0.0066
15
1.00000

15
0.09317
0.9406
3
COL3
-0.87999
0.0017
9
0.09317
0.9406
3
1.00000

9




Correlation Eigenvalues


e
2.0606421
1.0896653
-0.150307



The Matrix package from R has a function called nearPD().  This function has some nice features such as requiring the diagonal to remain constant (useful for a covariance matrix) and a flag to make sure the output is still a correlation matrix.  Not sure what the difference between those two would be if I used a correlation matrix as the input...  The output also include the new eigenvalues, and the Frobenius Norm (a measure of closeness for matrices).


Rebonato and Jackel (1999) define 2 methods in their paper for finding a PSD matrix.  The first is rather involved and the second, the Spectral Decomposition, is very compact.  The Spectral Decomposition is what we will emulate.  Take a second to go read that section (pp 7 - 9).  

Here it is in R:
nearPSD = function(c){
   n = dim(c)[1]
   e = eigen(c,sym=TRUE)
   val = e$values * (e$values > 0)
   vec = e$vectors
   T = sqrt(diag( as.vector(1/(vec^2 %*% val)),n,n))
   B = T %*% vec %*% diag(as.vector(sqrt(val)),n,n)
   out = B %*% t(B)
   return(out)
}
 Using the example from the paper, we see this duplicates their results:

> c
     [,1] [,2] [,3]
[1,]  1.0  0.9  0.7
[2,]  0.9  1.0  0.3
[3,]  0.7  0.3  1.0

> nearPSD(c)
          [,1]      [,2]      [,3]
[1,] 1.0000000 0.8940244 0.6963191
[2,] 0.8940244 1.0000000 0.3009690
[3,] 0.6963191 0.3009690 1.0000000
The nearPD() function gives us something very close to the above:

> nearPD(c,corr=TRUE)
$mat
3 x 3 Matrix of class "dpoMatrix"
          [,1]      [,2]      [,3]
[1,] 1.0000000 0.8945753 0.6966207
[2,] 0.8945753 1.0000000 0.3025436
[3,] 0.6966207 0.3025436 1.0000000
$eigenvalues
[1] 2.291883e+00 7.081174e-01 2.291883e-08
$corr
[1] TRUE
$normF
[1] 0.009727988
$iterations
[1] 13
$rel.tol
[1] 7.667644e-08
$converged
[1] TRUE
attr(,"class")
[1] "nearPD"
So what is more efficient, called nearPD() and using the built in chol() function, or calling our nearPSD() and calling the my_chol() function from before?

n = 1000
c = matrix(.9,n,n)
for( i in 1:n){
  c[i,i] = 1
}
c[1,2] = 0
c[2,1] = 0
c[1:5,1:5]
     [,1] [,2] [,3] [,4] [,5]
[1,]  1.0  0.0  0.9  0.9  0.9
[2,]  0.0  1.0  0.9  0.9  0.9
[3,]  0.9  0.9  1.0  0.9  0.9
[4,]  0.9  0.9  0.9  1.0  0.9
[5,]  0.9  0.9  0.9  0.9  1.0

e = eigen(c,sym=TRUE)
min(e$values)
[1] -0.7982018

s = Sys.time()
pd = nearPD(c,corr=TRUE)
pd = as.matrix(pd$mat)
rpd = chol(pd)
e = Sys.time()
e - s
Time difference of 45.659 secs

pd[1:5,1:5]
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1.0000000 0.7934317 0.8984058 0.8984058 0.8984058
[2,] 0.7934317 1.0000000 0.8984058 0.8984058 0.8984058
[3,] 0.8984058 0.8984058 1.0000000 0.9000032 0.9000032
[4,] 0.8984058 0.8984058 0.9000032 1.0000000 0.9000032
[5,] 0.8984058 0.8984058 0.9000032 0.9000032 1.0000000

s = Sys.time()
psd = nearPSD(c)
rpsd = my_chol(psd,50)
e = Sys.time()
e - s
Time difference of 7.031 secs

psd[1:5,1:5]
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1.0000000 0.2848481 0.7604250 0.7604250 0.7604250
[2,] 0.2848481 1.0000000 0.7604250 0.7604250 0.7604250
[3,] 0.7604250 0.7604250 1.0000000 0.9000002 0.9000002
[4,] 0.7604250 0.7604250 0.9000002 1.0000000 0.9000002
[5,] 0.7604250 0.7604250 0.9000002 0.9000002 1.0000000

norm(c - pd ,type="F")
[1] 1.126598

norm(c - psd,type="F")
[1] 8.827865
So the nearPSD method is much much faster. This is due to the fact that the nearPD method is iterative.  However the "difference" between the output matrix is larger. Qualitatively, the nearPD drastically increases the correlation between 1 and 2 -- from 0 to .79. The nearPSD keeps that correlation lower, .285, at the expense of the correlation between 1 and 2 and the rest of the values.

1 comment:

  1. For a way to compute the nearest postitive definite matrix in SAS/IML, see http://blogs.sas.com/content/iml/2012/11/28/computing-the-nearest-correlation-matrix/

    ReplyDelete