Statistical Analysis System (SAS) Transcompiler to SciPy
============================================================

.. image:: https://travis-ci.org/chappers/Stan.svg?branch=dev
   :target: https://travis-ci.org/chappers/Stan

The goal of this project is to transcompile a subset of SAS/Base to SciPy.
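
As a rough sketch of the intended workflow (this assumes ``transcompile`` accepts a
SAS/Base source string and returns the equivalent Python source, as the ``%%stan``
examples below suggest):

.. code:: python

    from pandas import DataFrame
    import numpy as np

    from stan.transcompile import transcompile  # assumed: SAS source in, Python source out

    df = DataFrame(np.random.randn(5, 2), columns=['a', 'b'])

    sas_code = """
    data test;
    set df (drop = a);
    run;
    """
    python_code = transcompile(sas_code)  # expected to look like "test=df.drop(['a'],1)\n"
    exec(python_code)                     # creates the new DataFrame `test`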

Testing
-------

The tests can be run directly inside your git clone (without having to install stan) by typing::

    nosetests stan

Differences
-----------

* ``data merge`` will not require the data to be sorted beforehand. Data will be implicitly sorted
  (similar to the SPDE engine); see the pandas sketch after this list.
* ``dates`` will be supported in a different manner (coming soon).
* ``format``, ``length``, ``informats`` will not be necessary (we shall use ``dtype`` in ``numpy``).
* Pandas supports column names with spaces in them. This may cause issues, since SAS automatically changes spaces to ``_``.
* Pandas is case-sensitive; SAS is not.
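
A minimal plain-pandas sketch of the merge and case-sensitivity points above
(nothing stan-specific here):

.. code:: python

    import pandas as pd

    # pandas merges do not require the inputs to be sorted on the key beforehand
    left = pd.DataFrame({'key': [3, 1, 2], 'lval': [30, 10, 20]})
    right = pd.DataFrame({'key': [2, 3, 1], 'rval': ['b', 'c', 'a']})
    merged = pd.merge(left, right, on='key')  # rows are matched regardless of input order

    # pandas column names are case sensitive, unlike SAS variable names
    df = pd.DataFrame({'value': [1, 2]})
    print('VALUE' in df.columns)  # False -- 'value' and 'VALUE' are different columns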

Known Issues
------------

Will not Support
----------------

* ``macro`` facility. It can be replicated (to a degree) using IPython; see the sketch below.
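
One way to approximate a simple iterative macro is an ordinary Python loop that
builds the SAS source and feeds it to ``transcompile`` (a sketch, assuming
``transcompile`` accepts a SAS/Base source string):

.. code:: python

    from pandas import DataFrame
    import numpy as np

    from stan.transcompile import transcompile  # assumed: SAS source in, Python source out

    df = DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'])

    # roughly what a SAS %do loop creating test1..test3 from df would do
    for i in range(1, 4):
        sas_code = "data test%d; set df; run;" % i
        exec(transcompile(sas_code))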

Example
-------

.. code:: python

    from stan.transcompile import transcompile
    import stan.stan_magic
    from pandas import DataFrame
    import numpy as np
    import pkgutil
    from numpy import nan

.. code:: python

    import stan.proc_functions as proc_func

    # build an import statement for every proc module shipped with stan, then execute them
    mod_name = ["from stan.proc_functions import %s" % name for _, name, _ in pkgutil.iter_modules(proc_func.__path__)]
    exec("\n".join(mod_name))

.. code:: python

    # create an example data frame
    df = DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
    df

.. parsed-literal::

              a         b         c         d         e
    0 -1.245481 -1.609963  0.442550 -0.056406 -0.213349
    1 -1.118754  0.116146 -0.032579 -0.556940  0.270678
    2  0.864960 -0.479118  2.370390  2.090656 -0.475426
    3  0.434934 -2.510176  0.122871  0.077915  0.597477
    4  0.689308  0.042817  0.217040 -1.424120 -0.214721
    5 -0.432170 -1.344882 -0.055934  1.921247  1.519922
    6 -0.837277  0.944802 -0.650114 -0.297314  1.432118
    7  1.488292 -1.236296  0.128023  2.886408 -0.560200
    8 -0.510566 -1.736577  0.066769 -0.735257  0.178167
    9  2.540022  0.034493 -0.521496 -2.189938  0.111702

.. code:: python

    %%stan
    data test;
    set df (drop = a);
    run;

.. parsed-literal::

    u"test=df.drop(['a'],1)\n"

.. code:: python

    # `_` holds the transcompiled Python source produced by the %%stan cell above
    exec(_)
    test

.. parsed-literal::

              b         c         d         e
    0 -1.609963  0.442550 -0.056406 -0.213349
    1  0.116146 -0.032579 -0.556940  0.270678
    2 -0.479118  2.370390  2.090656 -0.475426
    3 -2.510176  0.122871  0.077915  0.597477
    4  0.042817  0.217040 -1.424120 -0.214721
    5 -1.344882 -0.055934  1.921247  1.519922
    6  0.944802 -0.650114 -0.297314  1.432118
    7 -1.236296  0.128023  2.886408 -0.560200
    8 -1.736577  0.066769 -0.735257  0.178167
    9  0.034493 -0.521496 -2.189938  0.111702

.. code:: python

    %%stan
    data df_if;
    set df;
    if b < 0.3 then x = 0;
    else if b < 0.6 then x = 1;
    else x = 2;
    run;

.. parsed-literal::

    u"df_if=df\nfor el in ['x']:\n if el not in df_if.columns:\n df_if[el] = np.nan\ndf_if.ix[((df_if[u'b']<0.3)), 'x'] = (0)\nfor el in ['x']:\n if el not in df_if.columns:\n df_if[el] = np.nan\ndf_if.ix[((~((df_if[u'b']<0.3))) & (df_if[u'b']<0.6)), 'x'] = (1)\ndf_if.ix[((~((df_if[u'b']<0.6))) & (~((df_if[u'b']<0.3)))), 'x'] = (2)\n"

.. code:: python

    exec(_)
    df_if

.. parsed-literal::

              a         b         c         d         e  x
    0 -1.245481 -1.609963  0.442550 -0.056406 -0.213349  0
    1 -1.118754  0.116146 -0.032579 -0.556940  0.270678  0
    2  0.864960 -0.479118  2.370390  2.090656 -0.475426  0
    3  0.434934 -2.510176  0.122871  0.077915  0.597477  0
    4  0.689308  0.042817  0.217040 -1.424120 -0.214721  0
    5 -0.432170 -1.344882 -0.055934  1.921247  1.519922  0
    6 -0.837277  0.944802 -0.650114 -0.297314  1.432118  2
    7  1.488292 -1.236296  0.128023  2.886408 -0.560200  0
    8 -0.510566 -1.736577  0.066769 -0.735257  0.178167  0
    9  2.540022  0.034493 -0.521496 -2.189938  0.111702  0

--------------

.. code:: python

    # procs can be added manually; they can be thought of as Python functions.
    # You can define your own, though I need to work on the parser
    # to get it "smooth".

    df1 = DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
    df1

.. parsed-literal::

           a      b
    0   True  False
    1  False   True
    2   True   True

.. code:: python

    %%stan
    proc describe data = df1 out = df2;
    by a;
    run;

.. parsed-literal::

    u"df2=describe.describe(data=df1,by='a')"

.. code:: python

    exec(_)
    df2

.. parsed-literal::

                      a          b
    a
    False count       1          1
          mean        0          1
          std       NaN        NaN
          min     False       True
          25%     False       True
          50%         0          1
          75%     False       True
          max     False       True
    True  count       2          2
          mean        1        0.5
          std         0  0.7071068
          min      True      False
          25%         1       0.25
          50%         1        0.5
          75%         1       0.75
          max      True       True

The proc actually isn't difficult to write. For the above code it is actually just this:

::

    def describe(data, by):
        return data.groupby(by).describe()

This functionality allows you to handle most of the ``by`` and ``retain``
cases. For languages like Python and R, the normal way to handle data is
through the split-apply-combine methodology.
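
For reference, the ``proc describe`` call above is plain pandas under the hood,
so the same result can be produced directly:

.. code:: python

    from pandas import DataFrame

    df1 = DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)

    # split-apply-combine directly in pandas: group by 'a', then describe each group
    df2 = df1.groupby('a').describe()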

Merges can be achieved in a similar way, by creating a ``proc``:

.. code:: python

    %%stan
    proc merge out = df2;
    dt_left left;
    dt_right right;
    on = 'key';
    run;

.. parsed-literal::

    u"df2=merge.merge(dt_left=left,dt_right=right,on='key')"

.. code:: python

    left = DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
    right = DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

    exec(_)
    df2

.. parsed-literal::

       key  lval  rval
    0  foo     1     4
    1  foo     1     5
    2  foo     2     4
    3  foo     2     5
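
The ``merge`` proc itself could be as thin as the ``describe`` one. A hypothetical
sketch (the transcompiled output above only shows that it is called as
``merge.merge(dt_left=..., dt_right=..., on=...)``):

.. code:: python

    import pandas as pd

    def merge(dt_left, dt_right, on):
        # hypothetical proc body: join the two data sets on the given key column(s)
        return pd.merge(dt_left, dt_right, on=on)
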
Here's an example showing how you can define your own function and run it
(not a function that came with the package):

.. code:: python

    def sum_mean_by(data, by):
        return data.groupby(by).agg([np.sum, np.mean])

.. code:: python

    %%stan
    proc sum_mean_by data = df_if out = df_sum;
    by x;
    run;

.. parsed-literal::

    u"df_sum=sum_mean_by(data=df_if,by='x')"

.. code:: python

    exec(_)
    df_sum

.. parsed-literal::

               a                    b                    c                    d                    e
             sum      mean       sum      mean        sum      mean        sum      mean       sum      mean
    x
    0   2.710545  0.301172 -8.723557 -0.969284   2.737635  0.304182   2.013566  0.223730  1.214251  0.134917
    2  -0.837277 -0.837277  0.944802  0.944802  -0.650114 -0.650114  -0.297314 -0.297314  1.432118  1.432118

``proc sql`` is supported through the ``pandasql`` library, so the above
table could have been produced via SQL as well.

.. code:: python

    import pandasql

    q = """
    select
        sum(a) as sum_a,
        sum(b) as sum_b,
        sum(c) as sum_c,
        sum(d) as sum_d,
        sum(e) as sum_e,
        avg(a) as avg_a,
        avg(b) as avg_b,
        avg(c) as avg_c,
        avg(d) as avg_d,
        avg(e) as avg_e
    from
        df_if
    group by x
    """

    df_sum_sql = pandasql.sqldf(q, locals())
    df_sum_sql

.. parsed-literal::

          sum_a      sum_b     sum_c     sum_d     sum_e     avg_a     avg_b     avg_c     avg_d     avg_e
    0  2.710545  -8.723557  2.737635  2.013566  1.214251  0.301172 -0.969284  0.304182  0.223730  0.134917
    1 -0.837277   0.944802 -0.650114 -0.297314  1.432118 -0.837277  0.944802 -0.650114 -0.297314  1.432118