In the question, the user has two tables with about 80,000 matching records that need to be merged by a character key.
To start, let's create two tables in SAS. Each will have 150,000 records, 80,000 of which overlap. We will randomize the row order (the merge is obviously trivial when the tables are sorted with the matching records on top) to be more "real world."
data A;
  format address $32. a $8.;
  a = "1";
  do i = 1 to 150000;
    rand = ranuni(123);
    if i <= 80000 then do;
      address = put(i, z32.);
    end;
    else do;
      address = "AAA" || put(i, z29.);
    end;
    output;
  end;
run;

data B;
  format address $32. b $8.;
  b = "1";
  do i = 1 to 150000;
    rand = ranuni(234);
    if i <= 80000 then do;
      address = put(i, z32.);
    end;
    else do;
      address = "BBB" || put(i, z29.);
    end;
    output;
  end;
run;

proc sort data=a; by rand; run;
proc sort data=b; by rand; run;
By default, SAS pages tables to the hard drive, which is both good and bad here. Python and R will be starting with their tables in memory, so we use SASFILE to load both tables into main memory. Also note that SAS writes the result table to the hard drive; we will do the same in Python and R.
594  sasfile a load;
NOTE: The file WORK.A.DATA has been loaded into memory by the SASFILE statement.
595  sasfile b load;
NOTE: The file WORK.B.DATA has been loaded into memory by the SASFILE statement.
596
597  proc sql noprint;
598     create table temp.tableA_B as
599     select a.address,
600            a.a,
601            b.b
602     from a inner join
603          b
604     on a.address = b.address;
NOTE: Table TEMP.TABLEA_B created, with 80000 rows and 3 columns.
605  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.09 seconds
      cpu time            0.09 seconds
So SAS took 0.09 seconds. Much faster than the many minutes in Excel.
In R this is trivial. I wrote the tables to csv files (not shown). So we will read them in, do the merge, and then save the result as an .rda file.
> system.time({
+   a = read.delim('c://temp//a.csv', header=T, sep=',')
+   b = read.delim('c://temp//b.csv', header=T, sep=',')
+ })
   user  system elapsed
   9.03    0.06    9.09
> system.time({
+   m = merge(a[c("address","a")], b[c("address","b")])
+ })
   user  system elapsed
   2.15    0.02    2.17
> system.time({
+   save(m, file="c://temp//tableA_B.rda")
+ })
   user  system elapsed
   0.21    0.00    0.20
R took 2.17 seconds for the merge and 0.20 seconds to write. A total of 2.37 seconds.
I am least familiar with the optimal way to do this in Python, and I have a question out to my Python guru about the best approach. For the time being, my attempt is here. The basic idea is to read each file into a dictionary with the address string as the key and a basic object as the value, then iterate over the keys of one table and check whether each is in the second table; if so, add the merged data to a new dictionary.
Outputting the merged data to a list instead of a dictionary might be faster, since the dictionary builds an index hash on the result that neither SAS nor R is building.
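The merge script itself isn't pasted here, but a minimal sketch of the dictionary approach just described might look like the following (the CSV paths and column names are assumptions):

import csv
import pickle
import time

# Read a CSV into a dict keyed on the address string.
def load_table(path, value_col):
    table = {}
    with open(path) as f:
        for row in csv.DictReader(f):
            table[row["address"]] = row[value_col]
    return table

a = load_table("c:/temp/a.csv", "a")
b = load_table("c:/temp/b.csv", "b")

start = time.time()
# Iterate over one table's keys and keep the rows whose key also
# appears in the other table -- an inner join done by hand.
merged = {addr: (a_val, b[addr]) for addr, a_val in a.items() if addr in b}
merge_secs = time.time() - start

# Persist the result to disk, as SAS and R do.
with open("c:/temp/tableA_B.pkl", "wb") as f:
    pickle.dump(merged, f)
print("Took %s seconds to merge, %s total with pickle" % (merge_secs, time.time() - start))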
C:\Temp>c:\Python27\python.exe merge.py
2.4200000763 Starting Merge
Took 0.2660000324 seconds to merge, 1.1210000515 total with pickle
Python took 0.27 seconds for the merge and a total of 1.12 seconds for the merge and write.
A final note: here is a screen grab of the resulting file sizes. R wins hands down -- I assume there is some compression going on there (save() gzip-compresses .rda files by default).
This example is fairly trivial. Hopefully, someone will find it useful while trying to learn one of these languages.
Hi,

Thanks for this interesting work. But I think there is some kind of bias: you are not using the same techniques -- SQL for SAS, the native merge for R, and a homemade merge for Python. But you can do SQL in the other two as well (e.g., the sqldf package in R).
Hello mystery poster. I agree. As I said above, I have the least amount of experience in Python, so I am sure there are more efficient ways of doing this.
For Python, you can do the following with the pandas library (untested):
import pandas as pd
df_a = pd.read_csv("path_to_file_a")
df_b = pd.read_csv("path_to_file_b")
df_a_b = pd.merge(df_a, df_b, how='inner', on='address')
('how' and 'on' are the merge defaults, provided for reference)
Pandas will likely be faster than R (at least that's been my experience). That doesn't invalidate the point that anything is better than using Excel lookups.
Thanks!
C:\Temp>c:\Python27\python.exe merge2.py
0.8199999332 to Read. Starting Merge
Took 0.1800000668 seconds to merge, 0.9200000763 total with pickle
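The merge2.py script itself isn't shown; presumably it wraps the pandas snippet above with timing and a pickle write, along these lines (the file paths, and the use of to_pickle, are assumptions):

import time
import pandas as pd

start = time.time()
# Read both tables (paths assumed; see the snippet above).
df_a = pd.read_csv("c:/temp/a.csv")
df_b = pd.read_csv("c:/temp/b.csv")
read_done = time.time()
print("%s to Read. Starting Merge" % (read_done - start))

# Inner join on the shared address column.
df_a_b = pd.merge(df_a[["address", "a"]], df_b[["address", "b"]], how="inner", on="address")
merge_done = time.time()

# Pickle the merged frame to disk, as in the original benchmark.
df_a_b.to_pickle("c:/temp/tableA_B.pkl")
print("Took %s seconds to merge, %s total with pickle" % (merge_done - read_done, time.time() - read_done))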
As the "mystery poster" implies, "merge" is a SQL operation, whether a union or a join (of some sort). If you're a real business, that'll be SQL Server or DB2. If not, go with Postgres. And if you're not too picky about ANSI coverage and want to work in memory, then SQLite is your 7% solution.
SAS, many decades ago, attempted to sell itself as a "one stop data management shop", with some stats thrown in for good measure. With PL/R and Postgres, one can use a (nearly) industrial-strength RDBMS and R. Give it a look.
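For what it's worth, the in-memory SQLite route can be driven from Python's standard library; a rough sketch under the same assumed CSV layout as above:

import csv
import sqlite3

def load(con, path, name, value_col):
    # One key column plus one value column; PRIMARY KEY gives us the index.
    con.execute("CREATE TABLE %s (address TEXT PRIMARY KEY, %s TEXT)" % (name, value_col))
    with open(path) as f:
        rows = ((r["address"], r[value_col]) for r in csv.DictReader(f))
        con.executemany("INSERT INTO %s VALUES (?, ?)" % name, rows)

con = sqlite3.connect(":memory:")  # work entirely in memory, per the comment above
load(con, "c:/temp/a.csv", "a", "a")
load(con, "c:/temp/b.csv", "b", "b")

merged = con.execute(
    "SELECT a.address, a.a, b.b FROM a INNER JOIN b ON a.address = b.address"
).fetchall()
print(len(merged))  # expect 80000 matching rows
con.close()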
Try using the data.table package in R (which is much faster than the base merge function). See the timings below.
> system.time({
+   a = fread('C:\\Users\\k.roy.chowdhury\\Desktop\\Cm\\A.csv', header=T, sep=',')
+   b = fread('C:\\Users\\k.roy.chowdhury\\Desktop\\Cm\\B.csv', header=T, sep=',')
+   setkey(a, address)
+   setkey(b, address)
+   a[b, nomatch=0]
+ })
   user  system elapsed
   0.98    0.00    0.98