Friday, August 24, 2012

Rolling Summary Stats in SAS

In a previous post, Rick Wicklin brought up the inefficiency of using a macro loop to collect summary statistics for different groups.  Rick has a blog post all about it here.

The problem with what I was doing was that A) I was computing more than basic statistics (PROC COPULA simulations) and B) I was operating on a rolling time-series window.  To generate distinct BY groups as in Rick's example, I would have to duplicate the data.  That is, create separate rows for each window, so that each overlapping data point appears in many windows, effectively multiplying my data by the window size.

But this got me thinking: what IS the best way to collect summary stats across a rolling window in SAS?  SAS being SAS, there are many ways to skin this cat.  I'm going to try four and see what works best.

  1. A macro loop with PROC SUMMARY, the approach Rick says not to use.
  2. A large data duplication, transposing into a group / variable name / value column table and using BY groups in PROC SUMMARY.
  3. The same as 2, but using PROC SQL.
  4. Something totally different: transposing the variables into a wide, horizontal table structure and using a DATA step with summary functions.
For the test, we will generate 1,000 observations of 100 variables and use a 100-observation rolling window.  We will calculate N (non-missing), MIN, MAX, MEAN, and STD.  The goal is for every method to output a data set with columns for GROUP, VARIABLE NAME, N, MIN, MAX, MEAN, and STD.

First up, the DATA STEP (and macro) to generate our sample data.  The macro will generate 100 random variable names.

%macro rannames(n,vars);
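/* Build &n random 8-character variable names and store them, space-delimited, in the global macro variable named by &vars */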
data _null_;
format outStr $%eval(&n*10).
      chars $52.
      s $8.;

chars = "abcdefghijklmnopqrstuvwxyz";
chars = catt(chars,upcase(chars));
do i=1 to &n;
      s = "";
      do j=1 to 8;
            s = catt(s,substr(chars,ceil(ranuni(0)*52),1));
      end;
      outStr = strip(outStr) || " " || strip(s);
end;
call symputx("&vars",outStr,"g");
run;
%mend;
%rannames(100,vars);
data test(drop=j);
format d date9.;
array arr[100] &vars;
do d=1 to 1000;
      do j=1 to 100;
            arr[j] = rannor(0);
      end;
      output;
end;
run;
Next, a macro for calculating the rolling stats.  For each window we filter the data to the window and transpose it.  Then we calculate the stats for each variable, outputting the values into appropriately named columns, add back in the group id (the variable d), and finally append the results to the accumulating output data set.
%macro method1();
proc datasets lib=work nolist;
delete method1;
run;

%do i=100 %to 1000;
proc transpose data=test(where=(%eval(&i-99)<=d<=&i)) out=temp;
by d;
var &vars;
run;

proc summary data=temp;
class _NAME_ ;
var col1;
output out=stats(where=(_NAME_ ^="")) n=n min=min max=max mean=mean std=std ;
run;

data stats;
set stats(drop=_TYPE_ _FREQ_);
d = &i;
run;

proc append base=method1 data=stats force;
run;
%end;
%mend;

%let start=%sysfunc(datetime());
options nonotes;
%method1;
%let elapse=%sysevalf(%sysfunc(datetime()) - &start);
options notes;
%put Took &elapse;
This takes 38.5 seconds on my computer.

Now let's introduce 3 macros that put the variables into groups.  The output is the format used directly by #4; however, I have not found a more efficient way to build the input for #2 and #3 either.  These macros transpose the values so that each variable becomes an array of 100 values in a single record, and each record is a group.  100 variables * 100 observations per window means we will have 10,000 variables per record.  Hard for an RDBMS, easy for SAS.
%macro rolling_col(var,n);
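/* Maintain a trailing window of &var as an array of &n values: fill the array on the first observations, then shift it left by one and append the newest value */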
array &var.arr[&n] &var.1 - &var&n;
retain &var.1 - &var&n;

if _N_ < &n then
      &var.arr[_N_] = &var;
else do;
      if _N_ > &n then do;
            do i=1 to %eval(&n-1);
                  &var.arr[i] = &var.arr[i+1];
            end;
      end;

      &var.arr[&n] = &var;
end;
drop &var;
%mend;

%macro rolling_cols(vars,n);
%local i nv var;
%let nv=%sysfunc(countw(&vars));
%do i=1 %to &nv;
      %let var = %scan(&vars,&i);
      %rolling_col(&var,&n);
%end;
%mend;

%macro rolling_table(data,vars,n,out=);
data &out;
set &data;
retain grp 0;

%rolling_cols(&vars,&n);

drop &vars i;
if _N_ >= &n then do;
   grp = grp + 1;
   output;
end;
run;
%mend;
The second method takes the output of the ROLLING_TABLE macro and transposes it from short and fat into long and skinny.  It then sorts the data for the BY groups and calls PROC SUMMARY to calculate the stats.  I use a DATA step for the transpose because it is faster than PROC TRANSPOSE, and I would need a second data step anyway to truncate the _NAME_ variable to the proper base names.
%let start=%sysfunc(datetime());
%rolling_table(test,&vars,100,out=temp);

data temp(keep=d _name_ col1);
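/* x and xx are dummy bookend variables: x is declared before the SET and xx after it, so the positional range x -- xx spans all 10,000 rolling columns; the loop skips x and xx themselves */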
format d date9. _NAME_ $8.  x 8.;
set temp(drop=grp);
format xx 8.;
array vars[*] x -- xx;
do i=2 to dim(vars)-1;
      _NAME_ = vname(vars[i]);
      COL1 = vars[i];
      output;
end;
run;

proc sort data=temp out=temp;
by d _NAME_;
run;

proc summary data=temp;
by d _NAME_;
var col1;
output out=Method2(drop=_TYPE_ _FREQ_) n=n min=min max=max mean=mean std=std ;
run;
%let elapse=%sysevalf(%sysfunc(datetime()) - &start);
%put Took &elapse;
This takes 4.5 seconds on my computer.  Note that creating the groups expands the table from 817KB to 210MB; not for the faint of heart, or for those with lots of columns.

The 3rd method looks exactly like the 2nd, except we skip the sort and use PROC SQL. 
%let start=%sysfunc(datetime());
%rolling_table(test,&vars,100,out=temp);

data temp(keep=d _name_ col1);
format d date9. _NAME_ $8.  x 8.;
set temp(drop=grp);
format xx 8.;
array vars[*] x -- xx;
do i=2 to dim(vars)-1;
      _NAME_ = vname(vars[i]);
      COL1 = vars[i];
      output;
end;
run;

proc sql;
create table Method3 as
select d,
       _NAME_,
         n(col1) as n,
         min(col1) as min,
         max(col1) as max,
         mean(col1) as mean,
         std(col1) as std
      from temp
      group by d , _name_;
quit;

%let elapse=%sysevalf(%sysfunc(datetime()) - &start);
%put Took &elapse;
This takes 3.8 seconds, a significant improvement over the SORT and PROC SUMMARY BY-group method.

The last method requires 3 more macros.  These call the appropriate summary functions on each variable's array in the temp table, so a single DATA step calculates all the results.
%macro summary_stats(vars,n,funcs=n min max mean);
%local i nv var;
%let nv=%sysfunc(countw(&vars));
format _name_ $8.;
%do i=1 %to &nv;
      %let var = %scan(&vars,&i);
      %calc_stats(&var,&n,&funcs);
%end;
%mend;

%macro calc_stats(var,n,funcs);
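/* For one base variable, emit one assignment per requested statistic over its rolling columns (for example, min = min(of x1-x100)), then output a row */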
_NAME_ = "&var";
%local i f;
%do i=1 %to %sysfunc(countw(&funcs));
%let f = %scan(&funcs,&i);
&f = &f(of &var.1-&var&n);
%end;
drop &var.1-&var&n;
output;
%mend;

%macro rolling_summary_stats(data,vars,n,
      out=,funcs=n min max mean std);
data &out;
set &data;
%summary_stats(&vars,&n,funcs=&funcs);
run;
%mend;

%let start=%sysfunc(datetime());
%rolling_table(test,&vars,100,out=temp);
%rolling_summary_stats(temp(drop=grp),&vars,100,out=Method4);
%let elapse=%sysevalf(%sysfunc(datetime()) - &start);
%put Took &elapse;
This method takes .62 seconds.  By FAR the fastest available.

My takeaway: if you have to use rolling windows, time is a factor, you can afford the disk overhead, and you can calculate all the statistics you need with DATA step functions, then do it in the DATA step.  The SAS procedures are convenient, especially for producing lots of data and output.  However, that convenience comes with a time cost.

Full SAS code is available here.  I also include a macro for a simple linear regression of many columns against a single column.  A good example would be calculating a market Alpha and Beta for a series of stocks over a rolling window.  (This is why I REALLY looked at this topic).
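For reference, the per-window alpha and beta are just the intercept and slope of a simple regression of each stock's returns y on the market's returns x, and both reduce to sums and means that can be computed inside the same rolling DATA step:

$$\hat{\beta} \;=\; \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\alpha} \;=\; \bar{y} - \hat{\beta}\,\bar{x}$$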

Saturday, July 14, 2012

Expected Shortfall Portfolio Optimization in R using nloptr

I have previously done examples of QP optimization in R for financial portfolios.  I am not a huge fan of variance optimization in finance.  Return distributions are not normal; they are often skewed and usually leptokurtic.  In plain speak, the distributions have fat tails, and while the mean may be 0, the median is shifted to one side or the other.

Because of the shape of return distributions, variance optimization fails to capture the true risk of a portfolio. The risk measure I prefer to use is Expected Shortfall.  Expected Shortfall is defined as the mean of the left tail of the distribution, below some given alpha.

Beyond VaR:

We call the value at that alpha quantile Value-at-Risk (VaR).  VaR has been much maligned over the last few years -- not always for the right reasons.  Some people assume all VaR measures require a normal distribution.  Not so.  The definition is that it is some left-tail alpha quantile of the expected P/L distribution.  Nowhere does it assume a normal distribution -- people just wrongly plug in a normal distribution when calculating VaR.

Jorion's definition of VaR from Value at Risk: The New Benchmark for Managing Financial Risk, 3rd Edition says (I'm paraphrasing) that VaR is the maximum expected loss at a given level of certainty during normal market conditions.  That is often misinterpreted by management as THE maximum possible loss.  Jorion's definition has a positive spin that encourages the misinterpretation.

I prefer to put a negative spin on it.  VaR is the minimum expected loss when you have a bad day.  If I have a day that is at least in the alpha percentile of crappiness, then I will lose MORE than VaR.

VaR is the floor, not the ceiling.  That's where it has been misused.

So why not optimize on VaR?  The short answer is that VaR is highly nonlinear.  Add nonlinear instruments to your portfolio and you can no longer guarantee the convexity of the objective function.

Enter ES:
Expected Shortfall (ES) is the average of the losses beyond the VaR value.  That is, when I'm having a really bad day, what is my expected loss?  If I say a 1-in-20 day is bad (alpha=.05), then I should have 12-13 bad days a year.  If my portfolio were static and the return distributions unchanging (not likely), then the average loss on those 12-13 days should center on ES.

ES is still not perfect, because it doesn't tell you how far into the left tail losses can go.  But it does factor in the fat, skewed tails that we see.  It is a good starting point for risk analysis.
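To make the two measures concrete, here is a minimal sketch in R (a hypothetical P/L vector, not the portfolio simulation used later): VaR is just the empirical alpha quantile of the P/L, and ES is the mean of everything at or below it.

#minimal illustration with a made-up daily P/L vector
set.seed(1)
pl    = rnorm(10000, mean = 0, sd = .01)
alpha = .05

VaR = unname(quantile(pl, alpha))   #the floor: the minimum loss on a bad day
ES  = mean(pl[pl <= VaR])           #the average loss on those bad days

c(VaR = VaR, ES = ES)               #ES sits further out in the tail than VaR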

Because ES is an average of tail losses, it is guaranteed to be a convex function of the portfolio weights (the standard reference is Rockafellar and Uryasev's work on CVaR optimization).  So ES is a better-behaved measure of risk than VaR and we can trust optimization results from it.  So let's build an optimization routine for ES.
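A rough sketch of why (Rockafellar and Uryasev's representation, stated for losses L(w) linear in the weights w and alpha as the tail probability):

$$\mathrm{ES}_\alpha(w) \;=\; \min_{\zeta}\left\{\, \zeta + \frac{1}{\alpha}\,\mathbb{E}\!\left[\big(L(w)-\zeta\big)^{+}\right] \right\}$$

The expression inside the braces is jointly convex in (w, zeta), and minimizing out zeta leaves a convex function of w.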

nloptr:
nloptr is a nonlinear optimization package in R that wraps the open-source NLopt library.  ES, while well behaved, is nonlinear, and nloptr provides a number of nonlinear solvers.  It's what we want.

For our test case, we will simulate 10,000 draws from a 4-variable normal distribution (correlation matrix given below).  We will build the ES function and a numerical gradient function for it.  We will constrain the portfolio weights to [0,1] and impose an equality constraint that the weights sum to 1.
require(MASS)
require(nloptr)

#Covariance structure for the simulation
#give everything a std=.1
n = 4
corr = c(  1,  .7,  .5,  .1,
          .7,   1,  .4, -.1,
          .5,  .4,   1,  .6,
          .1, -.1,  .6,   1)

corr = matrix(corr,n,n)

std = matrix(0,n,n)
for(i in 1:n){      
       std[i,i] = .1
}

cov = std %*% corr %*% std

#Simulate 10,000 draws
sim = mvrnorm(n=10000,rep(0,n),cov)

#feasible starting values of equal weights
w = rep(1/n,n)

#ES function: mean of the returns in the left tail at or below the alpha quantile,
#sign-flipped so the optimizer minimizes a positive expected loss
es = function(w,sim=NA,alpha=.05){
       ret = sort(sim %*% w)
      
       n = length(ret)
       i = alpha * n
      
       es = mean(ret[1:i])
      
       return(-es)  
}


#linear equality constraint
#note: nloptr requires all functions to have the same signature
eval_g0 <- function(w,sim=NA,alpha=NA) {
       return( sum(w) - 1 )
}

#numerical approximation of the gradient
des = function(w,sim=NA,alpha=.05){
       n = length(w)
       out = w;
       for (i in 1:n){
              up = w;
              dn = w;
              up[i] = up[i]+.0001
              dn[i] = dn[i]-.0001
              out[i] = (es(up,sim=sim,alpha=alpha) - es(dn,sim=sim,alpha=alpha))/.0002
       }
       return(out)
}

#use nloptr's check.derivatives to check our gradient
check.derivatives(w,es,des,sim=sim, alpha=.05)

#function to optimize -- a list of objective and gradient
toOpt = function(w,sim=NA,alpha=.05){
       list(objective=es(w,sim=sim,alpha=alpha),gradient=des(w,sim=sim,alpha=alpha))    
}

#equality constraint function.  The jacobian is 1 for all variables
eqCon = function(w,sim=NA,alpha=.05){
       list(constraints=eval_g0(w,sim=NA,alpha=.05),jacobian=rep(1,4))     
}

#optimization options
opts <- list( "algorithm" = "NLOPT_LD_SLSQP",
              "xtol_rel" = 1.0e-7,
              "maxeval" = 1000)

#run optimization and print results
nl = nloptr(w,toOpt,
              lb = rep(0,4),
              ub = rep(1,4),
              eval_g_eq=eqCon,
              opts=opts,
              sim=sim,alpha=.05)

print(nl)

s = nl$solution
obj = nl$objective
To confirm the results, I ran this 100 times and averaged the resulting weights and objective value.  I did the same in SAS IML.  UPDATE: SAS code located here.

From R we get
> apply(s,2,mean)
[1] 0.101697131193604 0.420428046895474 0.000000000004212 0.477874821906712
> mean(obj)
[1] 0.1371
In SAS we get
wgt
0.1015076 0.4218076 -4.75E-20 0.4766848
obj
0.1375891
I am comfortable with the results.

Final Thoughts:
I was only able to get the SLSQP routine to converge.  This routine uses a local quadratic approximation of the function, runs a QP to update the variables, and then iterates.  I am not sure why the others failed.

The number of routines available for problems with equality constraints is limited.  On top of that, the NLopt documentation lists a number of other routines that should accept linear constraints, but the R implementation throws an error.  A number of augmented Lagrangian routines are available for equality constraints, but again the only one that worked used SLSQP as the sub-optimizer and produced the same result as plain SLSQP (in about 5x the time).

The time to run the optimization in R is high.  During the 100-iteration sample it took from 1.5 to 10 seconds per loop, depending on the number of internal iterations needed by nloptr.  In all, it took about 10 minutes; the SAS code ran in 45 seconds.  I see the following as reasons:

  1. SLSQP is not as efficient as other routines.  It has to approximate a Hessian matrix numerically.  Finding another routine that reliably converges could be a big help.
  2. My gradient function is numeric.  Each time the gradient is computed it takes a large number of FLOPs (one possible fix is sketched at the end of this post).  This is the same in SAS, where the IML optimizer calculates numeric derivatives.
  3. As previously discussed, matrix multiplication in R is slow, and this code relies heavily on it.  SAS IML is optimized for matrix math.
If anyone has ideas on how to speed up the R processing, I would love to hear them.  Maybe there are other, more efficient, nonlinear optimization libraries in R.  I'm sure there are parts of my code that can be sped up as well.
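On point 2, one idea worth trying (a sketch against the es() defined above, not code from the original post): the empirical ES is piecewise linear in the weights, so a subgradient is available in closed form as the negated column means of the scenarios that land in the left tail.  That replaces the 2n extra calls to es() per gradient evaluation with a single matrix multiply and a sort.

#analytic subgradient of the empirical es() above: es(w) is minus the mean of
#the alpha*n smallest portfolio returns, so a subgradient with respect to w is
#minus the column means of those tail rows of sim
des_analytic = function(w,sim=NA,alpha=.05){
       ret = as.vector(sim %*% w)
       i   = floor(alpha * length(ret))
       tl  = order(ret)[1:i]               #scenario rows that form the left tail
       return(-colMeans(sim[tl, , drop=FALSE]))
}

#verify it the same way as the numeric gradient
#check.derivatives(w,es,des_analytic,sim=sim,alpha=.05)

How much of the 10 minutes this recovers depends on how much time SLSQP spends in gradient evaluations versus its own QP subproblems.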