In a previous post, Rick Wicklin brought up the inefficiency of a macro loop for collecting summary statistics for different groups. Rick has a blog post talking all about it here.

The problem with what I was doing was that (a) I was doing more than basic statistics (PROC COPULA simulations) and (b) I was operating on a rolling time-series window. To generate the distinct BY groups as in Rick's example, I would have to duplicate the data: create distinct rows for each window, so that each overlapping data point would appear in many windows, effectively multiplying my data by the window size.
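To put a number on that blow-up, here is a quick back-of-the-envelope sketch in Python (illustrative only; the sizes match the test used later in this post):

```python
# Cost of materializing every rolling window as its own BY group:
# n observations with a window of length w produce (n - w + 1) full
# windows, and each window stores all w of its points as separate rows.
def duplicated_rows(n, w):
    n_windows = n - w + 1
    return n_windows * w

n, w = 1000, 100            # the sizes used in the tests below
rows = duplicated_rows(n, w)
print(rows)                 # 90100 rows versus the original 1000, ~90x
```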

But this got me thinking: what IS the best way to collect summary stats across a rolling window in SAS? SAS being SAS, there are many ways to skin this cat. I'm going to try four and see what works.

- A macro loop with PROC SUMMARY, like Rick says not to do.
- A large data duplication with transposition into a group/variable name/value column table, using BY groups in PROC SUMMARY.
- The same as 2, but using PROC SQL.
- Something totally different -- transposing the variables into a horizontal table structure and using a DATA STEP and summary functions.

For the test, we will generate 1,000 observations for 100 variables over a 100-observation rolling window. We will calculate N (non-missing), MIN, MAX, MEAN, and STD. The goal is to get all methods to output a data set with columns for GROUP, VARIABLE NAME, N, MIN, MAX, MEAN, and STD.
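As a language-neutral reference for what every method should produce, here are those five statistics computed for a single variable and window, sketched in Python (N counts non-missing values, and STD is the sample standard deviation, which is what PROC SUMMARY reports by default):

```python
import math

def window_stats(values):
    """N, MIN, MAX, MEAN, STD for one variable in one window.
    None plays the role of a SAS missing value."""
    xs = [v for v in values if v is not None]
    n = len(xs)
    mean = sum(xs) / n
    # sample (n-1 denominator) standard deviation, PROC SUMMARY's default
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return n, min(xs), max(xs), mean, std

print(window_stats([1.0, 2.0, None, 3.0]))   # (3, 1.0, 3.0, 2.0, 1.0)
```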

First up, the DATA STEP (and macro) to generate our sample data. The macro will generate 100 random variable names.

```sas
%macro rannames(n,vars);
    data _null_;
        format outStr $%eval(&n*10).
               chars $52.
               s $8.;
        chars = "abcdefghijklmnopqrstuvwxyz";
        chars = catt(chars,upcase(chars));
        do i=1 to &n;
            s = "";
            do j=1 to 8;
                s = catt(s,substr(chars,ceil(ranuni(0)*52),1));
            end;
            outStr = strip(outStr) || " " || strip(s);
        end;
        call symputx("&vars",outStr,"g");
    run;
%mend;

%rannames(100,vars);

data test(drop=j);
    format d date9.;
    array arr[100] &vars;
    do d=1 to 1000;
        do j=1 to 100;
            arr[j] = rannor(0);
        end;
        output;
    end;
run;
```

Next, a macro for calculating the rolling stats. First we filter the data to the window and transpose the filtered data. Next we calculate the stats for each variable, outputting the values into appropriately named columns. We then add back in the group id (the variable d). Finally, we append the results to the main data set.

```sas
%macro method1();
    proc datasets lib=work nolist;
        delete method1;
    run;

    %do i=100 %to 1000;
        proc transpose data=test(where=(%eval(&i-99)<=d<=&i)) out=temp;
            by d;
            var &vars;
        run;

        proc summary data=temp;
            class _NAME_;
            var col1;
            output out=stats(where=(_NAME_ ^= ""))
                   n=n min=min max=max mean=mean std=std;
        run;

        data stats;
            set stats(drop=_TYPE_ _FREQ_);
            d = &i;
        run;

        proc append base=method1 data=stats force;
        run;
    %end;
%mend;
```

```sas
%let start=%sysfunc(datetime());

options nonotes;

%method1;

%let elapse=%sysevalf(%sysfunc(datetime()) - &start);

options notes;

%put Took &elapse;
```

This takes 38.5 seconds on my computer.

Now let's introduce three macros that will put the variables into groups. The output is the format used directly by method 4. However, I have not found a more efficient way to do this for methods 2 and 3. These macros transpose the values such that each variable becomes an array of 100 values in a single record. Each record is a group. 100 variables * 100 observations in a window means we will have 10,000 variables per record. Hard for an RDBMS, easy for SAS.
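The shifting-buffer logic that ROLLING_COL generates can be sketched language-neutrally. In Python terms (an illustration of the idea, not a translation of the macro):

```python
def push(buffer, value, n):
    """Maintain the last n values seen: fill until full, then shift
    everything left one slot and put the newest value at the end.
    This is the array shuffle ROLLING_COL writes into the DATA step."""
    if len(buffer) < n:
        buffer.append(value)
    else:
        for i in range(n - 1):
            buffer[i] = buffer[i + 1]
        buffer[n - 1] = value
    return buffer

buf = []
for v in [10, 20, 30, 40, 50]:
    push(buf, v, 3)
print(buf)   # [30, 40, 50] - only the last 3 values survive
```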

```sas
%macro rolling_col(var,n);
    array &var.arr[&n] &var.1 - &var&n;
    retain &var.1 - &var&n;
    if _N_ < &n then
        &var.arr[_N_] = &var;
    else do;
        if _N_ > &n then do;
            do i=1 to %eval(&n-1);
                &var.arr[i] = &var.arr[i+1];
            end;
        end;
        &var.arr[&n] = &var;
    end;
    drop &var;
%mend;

%macro rolling_cols(vars,n);
    %local i nv var;
    %let nv=%sysfunc(countw(&vars));
    %do i=1 %to &nv;
        %let var = %scan(&vars,&i);
        %rolling_col(&var,&n);
    %end;
%mend;

%macro rolling_table(data,vars,n,out=);
    data &out;
        set &data;
        retain grp 0;
        %rolling_cols(&vars,&n);
        drop &vars i;
        if _N_ >= &n then do;
            grp = grp + 1;
            output;
        end;
    run;
%mend;
```

The second method uses the output of the ROLLING_TABLE macro, transposes it from short and fat into long and skinny. It sorts the data for use in the BY group and then calls PROC SUMMARY to calculate the stats. I use the DATA STEP for the transpose because it is faster than PROC TRANSPOSE. I would have to use a second data step anyway to truncate the _NAME_ variable to the proper values.

```sas
%let start=%sysfunc(datetime());

%rolling_table(test,&vars,100,out=temp);

data temp(keep=d _name_ col1);
    format d date9. _NAME_ $8. x 8.;
    set temp(drop=grp);
    format xx 8.;
    array vars[*] x -- xx;
    do i=2 to dim(vars)-1;
        _NAME_ = vname(vars[i]);
        COL1 = vars[i];
        output;
    end;
run;

proc sort data=temp out=temp;
    by d _NAME_;
run;

proc summary data=temp;
    by d _NAME_;
    var col1;
    output out=Method2(drop=_TYPE_ _FREQ_)
           n=n min=min max=max mean=mean std=std;
run;

%let elapse=%sysevalf(%sysfunc(datetime()) - &start);

%put Took &elapse;
```

This takes 4.5 seconds on my computer. Note that creating the groups expands the table from 817KB to 210MB; beware if you are faint of heart or have lots of columns.

The 3rd method looks exactly like the 2nd, except we skip the sort and use PROC SQL.

```sas
%let start=%sysfunc(datetime());

%rolling_table(test,&vars,100,out=temp);

data temp(keep=d _name_ col1);
    format d date9. _NAME_ $8. x 8.;
    set temp(drop=grp);
    format xx 8.;
    array vars[*] x -- xx;
    do i=2 to dim(vars)-1;
        _NAME_ = vname(vars[i]);
        COL1 = vars[i];
        output;
    end;
run;

proc sql;
    create table Method3 as
    select d,
           _NAME_,
           n(col1)    as n,
           min(col1)  as min,
           max(col1)  as max,
           mean(col1) as mean,
           std(col1)  as std
    from temp
    group by d, _name_;
quit;

%let elapse=%sysevalf(%sysfunc(datetime()) - &start);

%put Took &elapse;
```

This takes 3.8 seconds, a significant improvement over the SORT and SUMMARY/BY method.

The last method will require three more macros. These will be used to call the appropriate summary functions for each variable array in the temp table. We will use a single DATA STEP to calculate all the results.

```sas
%macro summary_stats(vars,n,funcs=n min max mean);
    %local i nv var;
    %let nv=%sysfunc(countw(&vars));
    format _name_ $8.;
    %do i=1 %to &nv;
        %let var = %scan(&vars,&i);
        %calc_stats(&var,&n,&funcs);
    %end;
%mend;

%macro calc_stats(var,n,funcs);
    %local i f;
    _NAME_ = "&var";
    %do i=1 %to %sysfunc(countw(&funcs));
        %let f = %scan(&funcs,&i);
        &f = &f(of &var.1-&var&n);
    %end;
    drop &var.1-&var&n;
    output;
%mend;

%macro rolling_summary_stats(data,vars,n,out=,funcs=n min max mean std);
    data &out;
        set &data;
        %summary_stats(&vars,&n,funcs=&funcs);
    run;
%mend;
```

```sas
%let start=%sysfunc(datetime());

%rolling_table(test,&vars,100,out=temp);
%rolling_summary_stats(temp(drop=grp),&vars,100,out=Method4);

%let elapse=%sysevalf(%sysfunc(datetime()) - &start);

%put Took &elapse;
```

This method takes 0.62 seconds. By FAR the fastest of the four.

My takeaway: if you have to use rolling windows, time is a factor, you can afford the disk overhead, and you can calculate all the statistics you need in the DATA step, then do it in the DATA step. The SAS procedures are convenient, especially for creating lots of data and output. However, that convenience comes with a time cost.

Full SAS code is available here. I also include a macro for a simple linear regression of many columns dependent on one. A good example would be calculating a market Alpha and Beta for a series of stocks over a rolling window. (This is why I REALLY looked at this topic.)