https://github.com/sourdoughcat/stan
Statistical Analysis System Transcompiler to SciPy
- Host: GitHub
- URL: https://github.com/sourdoughcat/stan
- Owner: SourdoughCat
- License: MIT
- Created: 2014-01-17T11:37:28.000Z (over 11 years ago)
- Default Branch: dev
- Last Pushed: 2016-05-22T23:39:40.000Z (about 9 years ago)
- Last Synced: 2024-12-09T00:10:21.825Z (6 months ago)
- Language: Python
- Size: 476 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 4
Metadata Files:
- Readme: README.rst
- Contributing: CONTRIBUTING.rst
- License: LICENSE
README
Statistical Analysis System (SAS) Transcompiler to SciPy
=========================================================

.. image:: https://travis-ci.org/chappers/Stan.svg?branch=dev
   :target: https://travis-ci.org/chappers/Stan
The goal of this project is to transcompile a subset of SAS/Base to SciPy.
Testing
-------

The tests can be run directly inside your git clone (without having to install stan) by typing::

    nosetests stan
Differences
-----------

* ``data merge`` will not require the data to be sorted beforehand. Data will be implicitly sorted
  (similar to the SPDE engine); see the sketch after this list.
* ``dates`` will be supported in a different manner (coming soon).
* ``format``, ``length``, ``informats`` will not be necessary (we shall use ``dtype`` in ``numpy``).
* Pandas supports column names with spaces in them. This may cause issues since SAS automatically changes spaces to ``_``.
* Pandas is case sensitive, SAS is not.
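As a quick illustration of the first point (plain pandas, not part of this package), a merge on unsorted keys needs no prior ``proc sort``:

.. code:: python

import pandas as pd

# keys are deliberately out of order; a SAS data-step merge would need
# a proc sort first, while pandas joins on the key directly
left = pd.DataFrame({'key': [3, 1, 2], 'lval': ['c', 'a', 'b']})
right = pd.DataFrame({'key': [2, 3, 1], 'rval': ['y', 'z', 'x']})

merged = left.merge(right, on='key').sort_values('key')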
Known Issues
------------

Will not Support
----------------

* ``macro`` facility. It can be replicated (to a degree) using IPython.
Example
-------

.. code:: python
from stan.transcompile import transcompile
import stan.stan_magic
from pandas import DataFrame
import numpy as np
import pkgutil
from numpy import nan
.. code:: python

import stan.proc_functions as proc_func
mod_name = ["from stan.proc_functions import %s" % name for _, name, _ in pkgutil.iter_modules(proc_func.__path__)]
exec("\n".join(mod_name))
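The loop above builds ``from stan.proc_functions import <name>`` statements for every module found by ``pkgutil`` and runs them with ``exec``. A rough equivalent (a sketch, not part of the package) that avoids ``exec`` uses ``importlib``:

.. code:: python

import importlib
import pkgutil

import stan.proc_functions as proc_func

# import every proc module and bind it under its own name,
# mirroring what the exec() loop above does
procs = {
    name: importlib.import_module("stan.proc_functions." + name)
    for _, name, _ in pkgutil.iter_modules(proc_func.__path__)
}
globals().update(procs)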
.. code:: python

# create an example data frame
df = DataFrame(np.random.randn(10, 5), columns = ['a','b','c','d','e'])
df

.. parsed-literal::

          a         b         c         d         e
0 -1.245481 -1.609963  0.442550 -0.056406 -0.213349
1 -1.118754  0.116146 -0.032579 -0.556940  0.270678
2  0.864960 -0.479118  2.370390  2.090656 -0.475426
3  0.434934 -2.510176  0.122871  0.077915  0.597477
4  0.689308  0.042817  0.217040 -1.424120 -0.214721
5 -0.432170 -1.344882 -0.055934  1.921247  1.519922
6 -0.837277  0.944802 -0.650114 -0.297314  1.432118
7  1.488292 -1.236296  0.128023  2.886408 -0.560200
8 -0.510566 -1.736577  0.066769 -0.735257  0.178167
9  2.540022  0.034493 -0.521496 -2.189938  0.111702
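The cells below use the ``%%stan`` cell magic (registered by ``import stan.stan_magic``), which translates the SAS source in the cell and leaves the generated Python in ``_`` so it can be run with ``exec(_)``. Assuming ``transcompile`` takes the SAS source as a string and returns the generated Python (an assumption based on how the magic is used here, not a documented signature), the same translation could be done without the magic:

.. code:: python

# hypothetical direct call; the %%stan cells below are the documented route
sas_code = """
data test;
    set df (drop = a);
run;
"""
py_code = transcompile(sas_code)
print(py_code)
exec(py_code)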
.. code:: python
%%stan
data test;
set df (drop = a);
run;

.. parsed-literal::
u"test=df.drop(['a'],1)\n"
.. code:: python
exec(_)
test

.. parsed-literal::

          b         c         d         e
0 -1.609963  0.442550 -0.056406 -0.213349
1  0.116146 -0.032579 -0.556940  0.270678
2 -0.479118  2.370390  2.090656 -0.475426
3 -2.510176  0.122871  0.077915  0.597477
4  0.042817  0.217040 -1.424120 -0.214721
5 -1.344882 -0.055934  1.921247  1.519922
6  0.944802 -0.650114 -0.297314  1.432118
7 -1.236296  0.128023  2.886408 -0.560200
8 -1.736577  0.066769 -0.735257  0.178167
9  0.034493 -0.521496 -2.189938  0.111702
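Note that the emitted code targets the pandas API of the time. Recent pandas releases deprecate the positional ``axis`` argument to ``drop`` in favour of keywords, so the generated ``test=df.drop(['a'],1)`` would nowadays be written as:

.. code:: python

# keyword form of the same drop; the resulting frame is identical to ``test`` above
test = df.drop(columns=['a'])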
.. code:: python
%%stan
data df_if;
set df;
if b < 0.3 then x = 0;
else if b < 0.6 then x = 1;
else x = 2;
run;

.. parsed-literal::
u"df_if=df\nfor el in ['x']:\n if el not in df_if.columns:\n df_if[el] = np.nan\ndf_if.ix[((df_if[u'b']<0.3)), 'x'] = (0)\nfor el in ['x']:\n if el not in df_if.columns:\n df_if[el] = np.nan\ndf_if.ix[((~((df_if[u'b']<0.3))) & (df_if[u'b']<0.6)), 'x'] = (1)\ndf_if.ix[((~((df_if[u'b']<0.6))) & (~((df_if[u'b']<0.3)))), 'x'] = (2)\n"
.. code:: python
exec(_)
df_if

.. parsed-literal::

          a         b         c         d         e  x
0 -1.245481 -1.609963  0.442550 -0.056406 -0.213349  0
1 -1.118754  0.116146 -0.032579 -0.556940  0.270678  0
2  0.864960 -0.479118  2.370390  2.090656 -0.475426  0
3  0.434934 -2.510176  0.122871  0.077915  0.597477  0
4  0.689308  0.042817  0.217040 -1.424120 -0.214721  0
5 -0.432170 -1.344882 -0.055934  1.921247  1.519922  0
6 -0.837277  0.944802 -0.650114 -0.297314  1.432118  2
7  1.488292 -1.236296  0.128023  2.886408 -0.560200  0
8 -0.510566 -1.736577  0.066769 -0.735257  0.178167  0
9  2.540022  0.034493 -0.521496 -2.189938  0.111702  0
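The generated code relies on the ``.ix`` indexer, which has since been removed from pandas. A present-day sketch of the same ``if``/``else if``/``else`` data step (using ``np.select``, not the package's own output) would be:

.. code:: python

df_if = df.copy()
df_if['x'] = np.select(
    [df_if['b'] < 0.3, df_if['b'] < 0.6],  # conditions, checked in order
    [0, 1],                                # value for each condition
    default=2,                             # the final else branch
)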
--------------
.. code:: python
# procs can be added manually; they can be thought of as Python functions.
# You can define your own, though I still need to work on the parser
# to make it "smooth".
df1 = DataFrame({'a' : [1, 0, 1], 'b' : [0, 1, 1] }, dtype=bool)
df1

.. parsed-literal::

       a      b
0   True  False
1  False   True
2   True   True
.. code:: python
%%stan
proc describe data = df1 out = df2;
by a;
run;

.. parsed-literal::
u"df2=describe.describe(data=df1,by='a')"
.. code:: python
exec(_)
df2

.. parsed-literal::

                  a          b
a
False count       1          1
      mean        0          1
      std       NaN        NaN
      min     False       True
      25%     False       True
      50%         0          1
      75%     False       True
      max     False       True
True  count       2          2
      mean        1        0.5
      std         0  0.7071068
      min      True      False
      25%         1       0.25
      50%         1        0.5
      75%         1       0.75
      max      True       True
The proc itself isn't difficult to write; for the above call it is simply:
::
def describe(data, by):
return data.groupby(by).describe()
This functionality allows you to handle most of the ``by`` and ``retain``
cases. For languages like Python and R, the normal way to handle data is
through the split-apply-combine methodology.
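For instance (a plain-pandas illustration, not package code), a SAS ``retain`` running total within ``by`` groups maps onto a grouped cumulative sum:

.. code:: python

sales = DataFrame({'region': ['east', 'east', 'west', 'west'],
                   'amount': [1.0, 2.0, 3.0, 4.0]})

# retain-style running total, computed per ``by`` group
sales['running_total'] = sales.groupby('region')['amount'].cumsum()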
Merges can be achieved in a similar way, by creating a ``proc``:
.. code:: python
%%stan
proc merge out = df2;
dt_left left;
dt_right right;
on = 'key';
run;

.. parsed-literal::
u"df2=merge.merge(dt_left=left,dt_right=right,on='key')"
.. code:: python
left = DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
exec(_)
df2

.. parsed-literal::

   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5
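The generated call dispatches to a ``merge`` proc. A minimal sketch of such a proc (my illustration; the packaged version may differ) is just a thin wrapper over the pandas merge:

.. code:: python

def merge(dt_left, dt_right, on):
    # wrapper so the transcompiled merge.merge(...) call has a target
    return dt_left.merge(dt_right, on=on)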
Here is an example showing how you can define your own function (one that did not come with the package) and run it:
.. code:: python
def sum_mean_by(data, by):
return data.groupby(by).agg([np.sum, np.mean])
.. code:: python

%%stan
proc sum_mean_by data = df_if out = df_sum;
by x;
run;

.. parsed-literal::
u"df_sum=sum_mean_by(data=df_if,by='x')"
.. code:: python
exec(_)
df_sum

.. parsed-literal::

          a                   b                   c                   d                   e
        sum      mean       sum      mean       sum      mean       sum      mean       sum      mean
x
0  2.710545  0.301172 -8.723557 -0.969284  2.737635  0.304182  2.013566  0.223730  1.214251  0.134917
2 -0.837277 -0.837277  0.944802  0.944802 -0.650114 -0.650114 -0.297314 -0.297314  1.432118  1.432118
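On newer pandas releases the same user-defined proc is usually written with string aggregation names, which sidestep the deprecation warning emitted for raw numpy functions (a cosmetic variant, nothing the package requires):

.. code:: python

def sum_mean_by(data, by):
    # 'sum'/'mean' strings instead of np.sum/np.mean
    return data.groupby(by).agg(['sum', 'mean'])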
``proc sql`` is supported through the ``pandasql`` library. So the above
table could have been produced via SQL as well.
.. code:: python
import pandasql
q = """
select
sum(a) as sum_a,
sum(b) as sum_b,
sum(c) as sum_c,
sum(d) as sum_d,
sum(e) as sum_e,
avg(a) as avg_a,
avg(b) as avg_b,
avg(c) as avg_c,
avg(d) as avg_d,
avg(e) as avg_e
from
df_if
group by x
"""
df_sum_sql = pandasql.sqldf(q, locals())
df_sum_sql
.. parsed-literal::

       sum_a      sum_b      sum_c      sum_d      sum_e      avg_a      avg_b      avg_c      avg_d      avg_e
0   2.710545  -8.723557   2.737635   2.013566   1.214251   0.301172  -0.969284   0.304182   0.223730   0.134917
1  -0.837277   0.944802  -0.650114  -0.297314   1.432118  -0.837277   0.944802  -0.650114  -0.297314   1.432118