Tuesday, May 29, 2012

SAS: PROC COPULA (Oh and I'm back)

It's been a while.  Work got hectic, then I got a new job, then I had to move across the country.  But now I have access to the most up-to-date SAS so I can post in both SAS and R.  Further, I have more free time to do things like write blog posts!

Today, I'm looking at one of the new procedures in SAS 9.3.  PROC COPULA is new to the SAS/ETS package and is still "experimental."  The SAS documentation describes it best:
"A copula is a function that combines marginal distributions of variables into a specific multivariate distribution. All of the one-dimensional marginals in the multivariate distribution are the cumulative distribution functions of the factors. Copulas help perform large-scale multivariate simulation from separate models, each of which can be fitted using different, even nonnormal, distributional specifications."
Previously, simulation through copulas was only available in PROC MODEL (also in ETS) and Risk Dimensions (SAS's high-end risk analysis framework).  However, there was no way to fit non-normal copulas out of the box.  The guys at SAS give it to us here.
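In symbols, and as the standard textbook statement rather than anything specific to the SAS implementation: if the factors have marginal CDFs F_1, ..., F_d and joint CDF F, the copula C is the function that ties them together (Sklar's theorem). The second line is the usual definition of the t copula, where t_{\nu,\Sigma} is the joint CDF of a multivariate t distribution with \nu degrees of freedom and correlation matrix \Sigma, and t_\nu^{-1} is the univariate t quantile function.

```latex
F(x_1,\dots,x_d) = C\bigl(F_1(x_1),\dots,F_d(x_d)\bigr)

C^{t}_{\nu,\Sigma}(u_1,\dots,u_d) = t_{\nu,\Sigma}\bigl(t_\nu^{-1}(u_1),\dots,t_\nu^{-1}(u_d)\bigr)
```

As \nu grows, the t copula approaches the Gaussian copula, so a small fitted degrees-of-freedom value signals heavier joint tails than a normal dependence structure would give.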

Simulation using copulas has been pushed by the finance industry.  So let's create a way to download a series of stock prices, fit a T copula to the log-returns, simulate from that, and look at the before and after distributional properties.  Along the way, we'll discover how PROC COPULA operates.

First, two macros for downloading stock prices from Yahoo! Finance:
%macro download(symbol,from,to);
/*Build URL for CSV from Yahoo! Finance*/
data _null_;
format s $128.;

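/*Parse the dates with INPUT so an empty macro argument never produces the invalid literal ""d*/
from = input("&from", ?? date9.);
if missing(from) then
      from = intnx('year',today(),-1,'same');

to = input("&to", ?? date9.);
if missing(to) then
      to = today()-1;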

put FROM= date9. TO= date9.;

to_d = day(to);
to_m = month(to)-1;
to_y = year(to);

from_d = day(from);
from_m = month(from)-1;
from_y = year(from);
s = catt("'http://ichart.finance.yahoo.com/table.csv?s=&symbol",
            '&d=',put(to_m,z2.),
            '&e=',to_d,
            '&f=',put(to_y,4.),
            '&g=d&a=',put(from_m,z2.),
            '&b=',from_d,
            '&c=',put(from_y,4.),
            '&ignore=.csv',
            "'");
call symput("s",s);
run;

%put NOTE: &s;
/*SAS Filename to point to the URL*/
filename in url &s;

/*Use PROC IMPORT to download and parse the CSV*/
proc import file=in dbms=csv out=&symbol(rename=(adj_close=&symbol)) replace;
run;

/*Clear the filename to the url*/
filename in clear;

/*Ensure data are sorted*/
proc sort data=&symbol(keep=date &symbol);
by date;
run;
%mend;

%macro get_stocks(stocks,from,to);
%local i n;
%let n= %sysfunc(countw(&stocks));
options nosource nonotes nosource2;
%do i=1 %to &n;
      %download(%scan(&stocks,&i),&from,&to);
%end;
options source notes source2;

data stocks;
merge &stocks;
by date;
run;

proc datasets lib=work nolist;
delete &stocks;
quit;
%mend;
Let's download the top ten holdings of the SPY (S&P 500 tracking) ETF and convert them into log-returns:
ods html;
%let stocks= aapl xom ibm msft cvx ge t jnj pg wfc;

%get_stocks(&stocks,25MAY2010,25MAY2011);

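data returns;
set stocks;
array s[*] &stocks;
/*One LAG() call inside a loop shares a single queue across all array elements,
  so keep the previous day's prices in a temporary (automatically retained) array instead*/
array prev[%sysfunc(countw(&stocks))] _temporary_;

do i=1 to dim(s);
      price = s[i];
      if price > 0 and prev[i] > 0 then
            s[i] = log(price/prev[i]);
      else
            s[i] = .;
      prev[i] = price;
end;
drop i price;
run;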
Now, let's fit a T copula to the returns and simulate from it:
ods select FitSummary
      ConvergenceStatus
      ParameterEstimates
      MatrixPlotUnif
      MatrixPlotOrig;
title "T-Copula Fitting";
proc copula data=returns;
var &stocks;
fit t /
      marginals=empirical
      method    = MLE         
      plots      = (data = both matrix);

simulate /
      ndraws=10000
      seed=54321
      out=sim;
run;

ods select default;
First, we use the ods select statement to specify which output objects we want to see.  PROC COPULA produces other output as well; these are just the pieces of interest here.

Second, the var statement specifies which variables to include.  Obviously, we want all our stock returns.

The fit statement says we want to fit a T copula.  (Why can the ETS guys not get the IDE to recognize this and turn it blue?  If STAT can do it on all their procedures, then ETS should be able to.  Rant from a former product manager over.)  The 'marginals=empirical' option says to use the empirical distribution of each variable to transform the data to uniforms.  The other option is to fit a distribution yourself and transform the values into uniforms before passing them to COPULA.  We estimate by maximum likelihood (method=MLE) and request matrix plots of both the original and the uniform-transformed data.

The simulate statement simulates from the fitted copula.  Because we fit with empirical marginals, the simulated values come back in the return space rather than as uniforms.  This has a drawback, as we will soon see.  We are simulating 10,000 draws and putting the output into a data set named sim.

Output is available here.  Notice the df parameter for the copula is ~8.  Definitely not normal.

So now, let's look at the moments of the observed distribution as well as the simulated distribution:
/*Original Distribution*/
title "Original Distribution";
proc means data=returns mean std skew kurt min max;
var &stocks;
run;

/*Simulated Distribution*/
title "Simulated Distribution";
proc means data=sim mean std skew kurt min max;
var &stocks;
run;

ods html close;
Go back to the output and look at the results.  The simulated distributions are nearly identical to the observed through the first 4 moments.  Notice that only AT&T (T) is approximately normal (skewness and excess kurtosis both near 0).

You will see that the max and min are identical between the observed and simulated data.  This makes sense because we are using empirical distributions in the copula.  There is no model for the tail of the distribution so there is no way for PROC COPULA to simulate beyond the largest and smallest observed values.  So the distributions are truncated at the bounds.  

I assume that fitting a copula using 'marginals=uniform' will allow simulation throughout the full range (0,1).  I attempted to do this in a small test and got a weird error.  There is nothing about the simulation specifics in the documentation.
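For reference, here is a minimal sketch of what such a test might look like. It is an untested assumption-laden outline, not a confirmed recipe: the data set names returns_complete and unif are made up for illustration, the uniforms are built with PROC RANK rather than from a fitted parametric distribution, and whether PROC COPULA accepts this input without the error mentioned above is not something the sketch can promise.

```sas
/* Keep only rows where every return is nonmissing (the first row is
   missing by construction). Hypothetical data set name. */
data returns_complete;
set returns;
if nmiss(of &stocks) = 0;
run;

/* Replace each return by rank/(n+1), which lies strictly inside (0,1).
   Without a RANKS statement the ranks overwrite the original variables. */
proc rank data=returns_complete out=unif nplus1;
var &stocks;
run;

/* Fit the T copula directly on the pre-transformed uniforms. */
proc copula data=unif;
var &stocks;
fit t /
      marginals=uniform
      method=MLE;

simulate /
      ndraws=10000
      seed=54321
      out=sim_unif;
run;
```

Dividing by n+1 instead of n keeps the largest observation away from exactly 1, where the T quantile function is infinite. Any values simulated this way would still need to be mapped back to returns through whatever marginal model you choose.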

Besides my inability to simulate with the uniforms, my open question is how values are selected from the empirical distribution when the simulated uniform falls between two observed values.  For example, say the empirical CDF at -0.01 is 0.40 and at the next largest value, -0.0095, it is 0.41.  My simulated uniform is 0.405.  Does COPULA split the difference?  Does it bucket into empirical values?  I will be looking into the simulation specifics as I go forward.  If you know, please let me know below!
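To make the two candidate behaviors concrete, here is a small DATA step using the numbers from the example. It only illustrates the question; which rule, if either, PROC COPULA actually applies is not known here. All variable names are made up for the illustration.

```sas
data _null_;
/* Two neighboring observed returns and their empirical CDF values */
x_lo = -0.01;    f_lo = 0.40;
x_hi = -0.0095;  f_hi = 0.41;

/* Simulated uniform that falls between the two CDF values */
u = 0.405;

/* Candidate 1: "split the difference" by linear interpolation */
x_interp = x_lo + (u - f_lo) / (f_hi - f_lo) * (x_hi - x_lo);

/* Candidate 2: "bucket" to the smallest observed value whose CDF is >= u */
x_step = x_hi;

put x_interp= x_step=;
run;
```

Under interpolation the simulated return lands about halfway between the two observations (roughly -0.00975); under bucketing it is exactly -0.0095. With 10,000 draws, bucketing would show up as simulated values that only ever repeat observed returns, which is easy to check with a frequency count on the sim data set.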

2 comments:

  1. this is cool. Is there a good resource (online) that you would recommend to learn about copulas?

    -Nick

    1. The online SAS doc has lots of details on what they do as well as references to papers. http://support.sas.com/documentation/cdl/en/etsug/63939/HTML/default/viewer.htm#etsug_copula_sect001.htm
