Sunday, June 17, 2012

Calling a 3rd party DLL from Base SAS and SAS IMLPlus

I recently finished Rick Wicklin's Statistical Programming with SAS/IML Software.  It is a great book and an excellent learning resource for SAS/IML.

One of the neat things about SAS IMLPlus is its ability to call 3rd party libraries.  These libraries can be Java classes or Windows DLLs.  Base SAS, since 9.2, has had the ability to link and call a C/C++ compiled library (DLL in Windows, or .so in *NIX).  So let's compare the two ways to write a function and use it in your SAS programs.

First, go read the great tutorial on using FCMP and Visual Studio to create a function for Base SAS.  Many thanks to the excellent folks at SAS Tech Support for writing that.  We'll use the same function for our example.
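
The tutorial's source is not reproduced here.  As a rough sketch only, assuming the exported function matches the `int myfactorial(int n)` prototype declared in PROC PROTO below, the C side of the DLL might look something like this.  The EXPORT and CALLCONV macros and the iterative body are illustrative assumptions, not the tutorial's actual code.

```c
/* Hypothetical sketch of the DLL's exported function.
   The real source lives in the SAS Tech Support tutorial. */
#ifdef _WIN32
#define EXPORT   __declspec(dllexport)  /* make the symbol visible to callers */
#define CALLCONV __stdcall              /* matches the STDCALL option in PROC PROTO */
#else
#define EXPORT
#define CALLCONV
#endif

/* Returns n! for small non-negative n.
   A 32-bit int overflows past 12!, so larger inputs give wrong answers. */
EXPORT int CALLCONV myfactorial(int n)
{
    int result = 1;
    int i;
    for (i = 2; i <= n; i++) {
        result *= i;
    }
    return result;
}
```

The stdcall calling convention is used in the sketch because the PROC PROTO statement below is given the STDCALL option.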

Let's start with the Base SAS example.  We are going to compute the factorial two ways.  The Tech Support example shows how to create an FCMP wrapper around a function linked in PROC PROTO.  FCMP also lets us write functions in SAS code that can be called from the Data Step.  So we will write a factorial function in SAS code, wrap the C function in a SAS function, and call both.

options cmplib=(sasuser.proto_ds sasuser.fcmp_ds);

proc proto stdcall package=sasuser.proto_ds.cfcns;
  link 'c:\users\pazzula\documents\visual studio 2010\Projects\SASExampleLib\Debug\SASExampleLib.dll';
  int myfactorial(int n) ;
run;


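proc fcmp inlib=sasuser.proto_ds outlib=sasuser.fcmp_ds.sasfcns;
   /* Thin SAS wrapper around the C function declared in PROC PROTO */
   function cfactorial(x);
      return (myfactorial(x));
   endsub;

   /* The same factorial written directly in SAS code */
   function sasFactorial(x);
      p = 1;
      do i=2 to x;
         p = p*i;
      end;
      return (p);
   endsub;
run;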

data test;
format test $24.;
n = 1e7;  /* assumed iteration count; the original value is not shown */

start = datetime();
Test = "cfactorial";
do i=1 to n;
   sp = cfactorial(10);
end;
end = datetime();
elapse = end - start;
ave = elapse / n;
output;  /* one row per timed function */

start = datetime();
Test = "sasFactorial";
do i=1 to n;
   sp = sasFactorial(10);
end;
end = datetime();
elapse = end - start;
ave = elapse / n;
output;

drop i;
format start end datetime.
       n comma16.
       elapse time12.4;
run;

On my laptop, the cFactorial function averages 1.1633E-7 seconds per call and the sasFactorial function averages 1.0204E-7 seconds.  The times are relatively close.  It has been my experience that a more complex, well-written C function can outperform a function written in FCMP.  There is overhead in calling into the DLL, though, which is why the simple SAS function runs faster here.

To call the function in IML Studio (using IMLPlus), we must declare and create a DllFunction object.  In this object we specify the path to the DLL, the function name, and the number of parameters it takes.  We pass the parameters in order using the NextArgIs*() methods, where * is the argument type.  We call the function with the Call_*() method, where * is the return type.

declare String sPathName;
sPathName = "c:\users\pazzula\documents\visual studio 2010\Projects\SASExampleLib\Debug\SASExampleLib.dll";

declare String sFuncName = "myfactorial";

declare DllFunction func = new DllFunction();
func.Init(sPathName, sFuncName, 1);

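s = datetime();

n = 10000;
do i=1 to n;
   /* Assumed: the argument-passing call is missing from the listing; it is
      reconstructed from the NextArgIs*() description above, so check the
      exact method name in the IML Studio documentation */
   func.NextArgIs_Int32(10);
   ret = func.Call_Int32();
end;
el = datetime() - s;
print "elapse: " el ;
print "Average call time: " (el/n) ;
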
The time elapsed on my laptop is 2.017 seconds, for an average of 2.017E-4 seconds per call.  MUCH slower than Base SAS.  Why?

The reason is that IML Studio is written in Java.  It uses the Java Native Interface to call the function.  On every call, the return value passes from the C DLL into Java, and then into SAS.  Modify the do loop like this and you see much higher throughput.

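declare int objRet;
n = 10000;
do i=1 to n;
   /* Assumed: argument-passing call reconstructed from the NextArgIs*()
      description; check the exact method name in the IML Studio documentation */
   func.NextArgIs_Int32(10);
   objRet = func.Call_Int32();
end;
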
Now the time is .967 seconds, or 9.67E-5 seconds per call.  Still not as fast as Base SAS, but pretty quick.  I actually chatted with the guys at SAS TS about this and they tell me the limiting factor is the jump from Java into SAS.  That is why declaring objRet in IMLPlus (where it is held on the client, not in the SAS session) is so much faster.  The takeaway is to limit the number of trips from the client objects into SAS IML variables.  If you have an array, fill it fully in IMLPlus, and then pass it to SAS.  Don't pass each element from IMLPlus into an IML matrix.

I hope this helps as people look to extend SAS and SAS IML.  Feel free to ask questions in the comments.

Wednesday, June 13, 2012

Performance with foreach, doSNOW, and snowfall

Is it just me, or does the performance of the foreach package with a doSNOW backend operating on a socket grid suck?

Here at work, I am helping to set up a cluster of Windows machines for distributed R processing.  We have lots of researchers running code that takes hours to complete; the programs are essentially large for loops with lots of analysis in between.  These guys and gals are not hard-core programmers, so there is lots of interest in foreach (as opposed to something like Rmpi).

I have successfully set up a POC grid between multiple machines using sockets and public key authentication.  Assuming we use this, I'll post a how-to, as there is not much on the web on how to get it working on Windows.

In the meantime, I am testing performance.  There is something going on with foreach that I do not understand.  Performance numbers are really bad.

Can anyone explain what is going on here?
> require(doSNOW)
Loading required package: doSNOW
Loading required package: foreach
foreach: simple, scalable parallel programming from Revolution Analytics
Use Revolution R for scalability, fault tolerance and more.
Loading required package: iterators
Loading required package: snow
> require(snowfall)
Loading required package: snowfall
> sfInit(parallel=TRUE,socketHosts=rep("localhost",3))
R Version:  R version 2.15.0 (2012-03-30)
snowfall 1.84 initialized (using snow 0.3-9): parallel execution on 3 CPUs.
> cl = sfGetCluster()
> f = function(x) {
+    sum = 0
+    for (i in seq(1,x)) sum = sum + i
+    return(sum)
+ }
> registerDoSNOW(cl)
> out = vector("logical",length=10000)
> system.time( (for (i in seq(1,10000)) out[i]=f(i) ))
   user  system elapsed
  25.99    0.00   25.99
> system.time( (out = lapply(seq(1,10000),f) ))
   user  system elapsed
  26.55    0.00   26.55
> system.time( (out = parLapply(cl,seq(1,10000),f) ))
   user  system elapsed
   0.02    0.00   15.85
> system.time( (out = foreach(i=seq(1,10000)) %dopar% f(i) ))
   user  system elapsed
   6.64    0.42   98.31
> getDoParWorkers()
[1] 3

EDIT: HA!  Figured it out.  foreach is not very efficient at communicating tasks compared to par*apply().  The time spent communicating each task overwhelmed the actual processing time.

When I change the code to this, it runs fast (about the same as parLapply()):

> system.time( (out = foreach(i=seq(0,9),.combine='c') %dopar% {
+    apply(as.array(seq(i*1000+1,(i+1)*1000)),1,f)
+ }))
   user  system elapsed
   0.00    0.00   14.03
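
As a rough back-of-the-envelope check on that explanation, the timings above can be turned into a per-task overhead estimate.  This assumes the whole gap between the original foreach run and the parLapply() run is dispatch overhead, which is a simplification rather than a measurement.

```latex
\[
t_{\text{overhead}} \approx \frac{98.31\,\text{s} - 15.85\,\text{s}}{10000\ \text{tasks}}
= \frac{82.46\,\text{s}}{10000\ \text{tasks}}
\approx 8.2\,\text{ms per task}
\]
```

With only 10 chunked tasks, the same per-task cost would add up to well under a second, which is consistent with the chunked run landing near the parLapply() time.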

Friday, June 1, 2012

Calling R from SAS IML Studio

I am playing around with SAS IML Studio 3.4.  For those that do not know, IML (Interactive Matrix Language) is the MATLAB-esque language from SAS.  It is available from normal SAS code through PROC IML.  A new (to me at least) UI has been developed for analysts called IML Studio.  IML Studio uses a superset of the IML language called IMLPlus.  I'll be digging into it (and the goodies like linked graphs, Java integration, and the ability to call 3rd party DLLs) later.

One of the more recent additions to the IML and IMLPlus languages is the ability to run SAS routines from within IML.  At the same time this functionality was added, SAS also added the ability to call R from within IML.  You can now pass data back and forth between IML matrices and R matrices, and between SAS Data Sets and R Data Frames (and other types).

Having never done this, I fired up IML Studio and set out to learn.

First, save the macros created in my last post into an autocall library.  You can modify the autocall libraries by editing the sasv9.cfg file and adding the path to the SASAUTOS list.  Mine looks like this:
        "C:\Users\pazzula\Documents\My SAS Files(32)\9.3\macros"

The SAS file for the macro can be downloaded here.

To submit SAS code, surround the code with "submit;" and "endsubmit;".  This piece will download data for the SPY ETF:

Next, let's create 2 vectors, X and Y.  Make Y a linear function of X.

x = (1:10)`;
y = 1 + 3*x;

/* Add normally distributed noise to each element of y */
e = j(10,1,0);
do i=1 to 10;
   e[i] = .5*rannor(12345);
end;

y = y + e;

Nothing hard about that.  Those new to IML will want to know that ` is the transpose operator and "j(n,m,value)" creates a matrix (n x m) filled with "value."

Exporting IML matrices and SAS Data Sets to R is straightforward.  Use the modules ExportMatrixToR() and ExportDataSetToR().

run ExportMatrixToR(x,"x");
run ExportMatrixToR(y,"y");
run ExportDataSetToR("returns","returns");

The second parameter to each module is the name to give the object in R.  To call R, we again use "submit" and "endsubmit," only this time we add "/ R" to the submit line.  So let's run a linear model on y~x, create an xts object from the returns Data Frame, chart the cumulative returns of SPY, and create a table of annualized returns.

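submit /R;
library(xts);
library(PerformanceAnalytics);

m = lm(y~x);
summary(m);

returns = xts(returns$spy,returns$Date);

colnames(returns) = c("SPY");
chart.CumReturns(returns[,"SPY"],main="Total Return");
table.AnnualizedReturns(returns);
endsubmit;
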

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max
-0.79782 -0.04944  0.04503  0.17198  0.33329

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.85948    0.22469   3.825  0.00505 **
x            3.02539    0.03621  83.548  4.7e-13 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3289 on 8 degrees of freedom
Multiple R-squared: 0.9989, Adjusted R-squared: 0.9987
F-statistic: 6980 on 1 and 8 DF, p-value: 4.698e-13

Annualized Return         0.1041
Annualized Std Dev        0.1956
Annualized Sharpe (Rf=0%) 0.5323

The SAS Data Set contained a column called Date that had a SAS Date format applied.  During the conversion to R Data Frame, SAS was nice enough to convert that column into an R date.  

That's pretty much it, and it's pretty straightforward.  Personally, I'm excited about this.  There are some things, like data manipulation, that SAS is way better than R at.  But then there are things that R gives me that I have to work to code in SAS (like easy functions for portfolio analytics).  Now I get the best of both worlds.