Statistical Analysis System (SAS) Transcompiler to SciPy
============================================================

.. image:: https://travis-ci.org/chappers/Stan.svg?branch=dev
   :target: https://travis-ci.org/chappers/Stan

The goal of this project is to transcompile a subset of SAS/Base to SciPy.
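
As a rough sketch of the intended workflow (this assumes ``transcompile`` accepts a
SAS/Base source string and returns the equivalent Python source, as the ``%%stan``
examples below suggest):

.. code:: python

    from pandas import DataFrame
    import numpy as np

    from stan.transcompile import transcompile  # assumed: SAS source in, Python source out

    df = DataFrame(np.random.randn(5, 2), columns=['a', 'b'])

    sas_code = """
    data test;
    set df (drop = a);
    run;
    """
    python_code = transcompile(sas_code)  # expected to look like "test=df.drop(['a'],1)\n"
    exec(python_code)                     # creates the new DataFrame `test`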

Testing
-------

The tests can be run directly inside your git clone (without having to install stan) by typing::

    nosetests stan

Differences
-----------

* ``data merge`` will not require the data to be sorted beforehand. Data will be implicitly sorted
  (similar to the SPDE engine); see the pandas sketch after this list.
* ``dates`` will be supported in a different manner (coming soon).
* ``format``, ``length``, ``informats`` will not be necessary (we shall use ``dtype`` in ``numpy``).
* Pandas supports column names with spaces in them. This may cause issues, since SAS automatically changes spaces to ``_``.
* Pandas is case-sensitive; SAS is not.
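
A minimal plain-pandas sketch of the merge and case-sensitivity points above
(nothing stan-specific here):

.. code:: python

    import pandas as pd

    # pandas merges do not require the inputs to be sorted on the key beforehand
    left = pd.DataFrame({'key': [3, 1, 2], 'lval': [30, 10, 20]})
    right = pd.DataFrame({'key': [2, 3, 1], 'rval': ['b', 'c', 'a']})
    merged = pd.merge(left, right, on='key')  # rows are matched regardless of input order

    # pandas column names are case sensitive, unlike SAS variable names
    df = pd.DataFrame({'value': [1, 2]})
    print('VALUE' in df.columns)  # False -- 'value' and 'VALUE' are different columns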

Known Issues
------------

Will not Support
----------------

* ``macro`` facility. It can be replicated (to a degree) using IPython; see the sketch below.
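
One way to approximate a simple iterative macro is an ordinary Python loop that
builds the SAS source and feeds it to ``transcompile`` (a sketch, assuming
``transcompile`` accepts a SAS/Base source string):

.. code:: python

    from pandas import DataFrame
    import numpy as np

    from stan.transcompile import transcompile  # assumed: SAS source in, Python source out

    df = DataFrame(np.random.randn(5, 3), columns=['a', 'b', 'c'])

    # roughly what a SAS %do loop creating test1..test3 from df would do
    for i in range(1, 4):
        sas_code = "data test%d; set df; run;" % i
        exec(transcompile(sas_code))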

Example
-------

.. code:: python

    from stan.transcompile import transcompile
    import stan.stan_magic
    from pandas import DataFrame
    import numpy as np
    import pkgutil
    from numpy import nan

.. code:: python

    import stan.proc_functions as proc_func

    # build an import statement for every proc module shipped with stan, then execute them
    mod_name = ["from stan.proc_functions import %s" % name for _, name, _ in pkgutil.iter_modules(proc_func.__path__)]
    exec("\n".join(mod_name))

.. code:: python

    # create an example data frame
    df = DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
    df

.. parsed-literal::

              a         b         c         d         e
    0 -1.245481 -1.609963  0.442550 -0.056406 -0.213349
    1 -1.118754  0.116146 -0.032579 -0.556940  0.270678
    2  0.864960 -0.479118  2.370390  2.090656 -0.475426
    3  0.434934 -2.510176  0.122871  0.077915  0.597477
    4  0.689308  0.042817  0.217040 -1.424120 -0.214721
    5 -0.432170 -1.344882 -0.055934  1.921247  1.519922
    6 -0.837277  0.944802 -0.650114 -0.297314  1.432118
    7  1.488292 -1.236296  0.128023  2.886408 -0.560200
    8 -0.510566 -1.736577  0.066769 -0.735257  0.178167
    9  2.540022  0.034493 -0.521496 -2.189938  0.111702

.. code:: python

    %%stan
    data test;
    set df (drop = a);
    run;

.. parsed-literal::

    u"test=df.drop(['a'],1)\n"

.. code:: python

    # `_` holds the transcompiled Python source produced by the %%stan cell above
    exec(_)
    test

.. parsed-literal::

              b         c         d         e
    0 -1.609963  0.442550 -0.056406 -0.213349
    1  0.116146 -0.032579 -0.556940  0.270678
    2 -0.479118  2.370390  2.090656 -0.475426
    3 -2.510176  0.122871  0.077915  0.597477
    4  0.042817  0.217040 -1.424120 -0.214721
    5 -1.344882 -0.055934  1.921247  1.519922
    6  0.944802 -0.650114 -0.297314  1.432118
    7 -1.236296  0.128023  2.886408 -0.560200
    8 -1.736577  0.066769 -0.735257  0.178167
    9  0.034493 -0.521496 -2.189938  0.111702

.. code:: python

    %%stan
    data df_if;
    set df;
    if b < 0.3 then x = 0;
    else if b < 0.6 then x = 1;
    else x = 2;
    run;

.. parsed-literal::

    u"df_if=df\nfor el in ['x']:\n if el not in df_if.columns:\n df_if[el] = np.nan\ndf_if.ix[((df_if[u'b']<0.3)), 'x'] = (0)\nfor el in ['x']:\n if el not in df_if.columns:\n df_if[el] = np.nan\ndf_if.ix[((~((df_if[u'b']<0.3))) & (df_if[u'b']<0.6)), 'x'] = (1)\ndf_if.ix[((~((df_if[u'b']<0.6))) & (~((df_if[u'b']<0.3)))), 'x'] = (2)\n"

.. code:: python

    exec(_)
    df_if

.. parsed-literal::

              a         b         c         d         e  x
    0 -1.245481 -1.609963  0.442550 -0.056406 -0.213349  0
    1 -1.118754  0.116146 -0.032579 -0.556940  0.270678  0
    2  0.864960 -0.479118  2.370390  2.090656 -0.475426  0
    3  0.434934 -2.510176  0.122871  0.077915  0.597477  0
    4  0.689308  0.042817  0.217040 -1.424120 -0.214721  0
    5 -0.432170 -1.344882 -0.055934  1.921247  1.519922  0
    6 -0.837277  0.944802 -0.650114 -0.297314  1.432118  2
    7  1.488292 -1.236296  0.128023  2.886408 -0.560200  0
    8 -0.510566 -1.736577  0.066769 -0.735257  0.178167  0
    9  2.540022  0.034493 -0.521496 -2.189938  0.111702  0

--------------

.. code:: python

    # procs can be added manually; they can be thought of as Python functions.
    # You can define your own, though I need to work on the parser
    # to get it "smooth".

    df1 = DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)
    df1

.. parsed-literal::

           a      b
    0   True  False
    1  False   True
    2   True   True

.. code:: python

    %%stan
    proc describe data = df1 out = df2;
    by a;
    run;

.. parsed-literal::

    u"df2=describe.describe(data=df1,by='a')"

.. code:: python

    exec(_)
    df2

.. parsed-literal::

                      a          b
    a
    False count       1          1
          mean        0          1
          std       NaN        NaN
          min     False       True
          25%     False       True
          50%         0          1
          75%     False       True
          max     False       True
    True  count       2          2
          mean        1        0.5
          std         0  0.7071068
          min      True      False
          25%         1       0.25
          50%         1        0.5
          75%         1       0.75
          max      True       True

The proc actually isn't difficult to write. For the above code it is actually just this:

::

    def describe(data, by):
        return data.groupby(by).describe()

This functionality allows you to handle most of the ``by`` and ``retain``
cases. For languages like Python and R, the normal way to handle data is
through the split-apply-combine methodology.
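
For reference, the ``proc describe`` call above is plain pandas under the hood,
so the same result can be produced directly:

.. code:: python

    from pandas import DataFrame

    df1 = DataFrame({'a': [1, 0, 1], 'b': [0, 1, 1]}, dtype=bool)

    # split-apply-combine directly in pandas: group by 'a', then describe each group
    df2 = df1.groupby('a').describe()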

Merges can be achieved in a similar way, by creating a ``proc``:

.. code:: python

    %%stan
    proc merge out = df2;
    dt_left left;
    dt_right right;
    on = 'key';
    run;

.. parsed-literal::

    u"df2=merge.merge(dt_left=left,dt_right=right,on='key')"

.. code:: python

    left = DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
    right = DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

    exec(_)
    df2

.. parsed-literal::

       key  lval  rval
    0  foo     1     4
    1  foo     1     5
    2  foo     2     4
    3  foo     2     5
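
The ``merge`` proc itself could be as thin as the ``describe`` one. A hypothetical
sketch (the transcompiled output above only shows that it is called as
``merge.merge(dt_left=..., dt_right=..., on=...)``):

.. code:: python

    import pandas as pd

    def merge(dt_left, dt_right, on):
        # hypothetical proc body: join the two data sets on the given key column(s)
        return pd.merge(dt_left, dt_right, on=on)
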
Here's an example showing how you can define your own function and run it
(not a function that came with the package):

.. code:: python

    def sum_mean_by(data, by):
        return data.groupby(by).agg([np.sum, np.mean])

.. code:: python

    %%stan
    proc sum_mean_by data = df_if out = df_sum;
    by x;
    run;

.. parsed-literal::

    u"df_sum=sum_mean_by(data=df_if,by='x')"

.. code:: python

    exec(_)
    df_sum

.. parsed-literal::

               a                    b                    c                    d                    e
             sum      mean       sum      mean        sum      mean        sum      mean       sum      mean
    x
    0   2.710545  0.301172 -8.723557 -0.969284   2.737635  0.304182   2.013566  0.223730  1.214251  0.134917
    2  -0.837277 -0.837277  0.944802  0.944802  -0.650114 -0.650114  -0.297314 -0.297314  1.432118  1.432118

``proc sql`` is supported through the ``pandasql`` library, so the above
table could have been produced via SQL as well.

.. code:: python

    import pandasql

    q = """
    select
        sum(a) as sum_a,
        sum(b) as sum_b,
        sum(c) as sum_c,
        sum(d) as sum_d,
        sum(e) as sum_e,
        avg(a) as avg_a,
        avg(b) as avg_b,
        avg(c) as avg_c,
        avg(d) as avg_d,
        avg(e) as avg_e
    from
        df_if
    group by x
    """

    df_sum_sql = pandasql.sqldf(q, locals())
    df_sum_sql

.. parsed-literal::

          sum_a      sum_b     sum_c     sum_d     sum_e     avg_a     avg_b     avg_c     avg_d     avg_e
    0  2.710545  -8.723557  2.737635  2.013566  1.214251  0.301172 -0.969284  0.304182  0.223730  0.134917
    1 -0.837277   0.944802 -0.650114 -0.297314  1.432118 -0.837277  0.944802 -0.650114 -0.297314  1.432118