Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
Awesome Lists | Featured Topics | Projects
https://github.com/kwmsmith/runcython

compile and run cython in one line
https://github.com/kwmsmith/runcython
Last synced: 24 days ago
JSON representation
compile and run cython in one line
Host: GitHub
URL: https://github.com/kwmsmith/runcython
Owner: kwmsmith
License: mit
Created: 2014-11-20T19:32:30.000Z (almost 10 years ago)
Default Branch: master
Last Pushed: 2014-11-20T07:27:03.000Z (almost 10 years ago)
Last Synced: 2023-03-11T03:03:47.941Z (over 1 year ago)
Language: Shell
Homepage:
Size: 72.3 KB
Stars: 0
Watchers: 1
Forks: 25
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

        Cython is a superset of python that adds support for blazing fast c/c++ functions. Runcython lets you compile and run cython files in one line. OSX and Linux are supported.

Installation


    pip install runcython

For OSX, make sure that the file python.pc is on your PKG_CONFIG_PATH. I achieved this by adding `export PKG_CONFIG_PATH=/System/Library/Frameworks/Python.framework/Versions/2.7/lib/pkgconfig` to my ~/.bashrc.

    

Usage


    # hello.pyx

    print 'hello, world'



    $ runcython hello.pyx

    hello, world

    

  You can use `runcython file.pyx` just like you would use `python file.py`. The difference is that runcython will run a file with arbitrary cython code.

  

    # accum.pyx

    cdef int i, n, accum

    accum = 0

    n = 10**4

    for i in range(n):

        accum += i

    print i



    $ runcython accum.pyx

    49995000

    

  There's no need to muck around with distutils or intermediate files. Using cython the typical way would require creating 5 distinct files, `accum.pyx`, `accum.c`, `accum.so`, `setup.py`, and `use.py`. That's a lot of moving parts to keep track of. `runcython` keeps things simple so that you can just focus on writing fast code. If you want to output a module for use in your other python files, you can always use `makecython` instead:

  

    # primes.pyx

    def primes(int kmax):

        cdef int n, k, i

        cdef int p[1000000]

        result = []

        if kmax > 1000000:

            kmax = 1000000

        k = 0

        n = 2

        while k < kmax:

            i = 0

            while i < k and n % p[i] != 0:

                i = i + 1

            if i == k:

                p[k] = n

                k = k + 1

                result.append(n)

            n = n + 1

        return result

    

    def main():

        print primes(5)



    $ makecython primes.pyx

    $ ls

    primes.pyx primes.so

    $ python -c 'import primes; print primes.primes(10)'

    [2, 3, 5, 7, 11, 13, 17, 23, 29]

    

Note that the `main()` function above in primes.pyx is not called when we import the primes.so module. Unlike makecython, runcython will call the `main()` function in your *.pyx file if it exists. This is the equivalent of the `if __name__ == '__main__': main()` convention used in python programs:

    $ runcython primes.pyx

    [2, 3, 5, 7, 11]

 Advanced Usage 


  Of course, none of this would be much better than the pyximport tool if it didn't work for complex cython builds with lots of dependencies. But unlike pyximport, runcython doesn't force you to adopt an entirely new strategy for complex builds. You get 2 extra arguments to runcython, one for passing additional flags to `cython file.pyx ...`, and one for passing additional flags to `gcc a.c ...`. Lets see how this works for calling a c file:

  

    # square.c

    int square(int x) {

        return x * x;

    }

  

    # use_square.pyx

    cdef extern int square(int)

    print square(5)

  

  Now if we don't add any extra parameters, `runcython use_square.pyx` will first run `cython use_square.pyx` to produce use_square.c, and call `gcc -shared -fPIC use_square.c -o use_square.so` to produce use_square.so. But we need to tell gcc that it should also compile the square.c file. Doing this just requres tagging on square.c to the gcc command, giving `gcc -shared -fPIC use_square.c -o use_square.so square.c`. To tell runcython to do this, we just tell it to add the string "square.c" to the end of the gcc command:

  

    $ runcython square.pyx "" "square.c"

    25

  

  We can also tell runcython to pass special flags to the cython command. For example the `-a` flag tells cython to produce a nicely formatted html file with a summary of which lines in the input file were successfully optimized by cython. We can do that for the above `primes.pyx`:

  

    $ runcython primes.pyx "-a"

    $ ls

    primes.pyx primes.html

    $ firefox primes.html

    # firefox will show us the areas that have been sped up

 Binding cuda kernels with runcython 


  To convince you that `runcython` really does scale to rather complex build processes, here's a pipeline I built recently to call cuda kernels directly using `runcython++`. Note that runcython++ is just like runcython, but using g++ for compilation rather than gcc:

  

    // kernel.h

    #define N 16

    void cscan(float* hostArray);

    

  kernel.h declares a simple interface to a cuda kernel that will take an array of floats and return an array of the partial sums. This is a great parallel primitive that would be nice to do on a GPU if it were an important part of your computation. The kernel itself is defined with `kernel.cu`.

  

    // kernel.cu

    #include "kernel.h"

    __global__ void scan(float *g_odata, float *g_idata, int n)  {

        extern __shared__ float temp[]; // allocated on invocation  

        int thid = threadIdx.x;

        int pout = 0, pin = 1;

        // Load input into shared memory.  

        // This is exclusive scan, so shift right by one  

        // and set first element to 0  

        temp[pout*n + thid] = (thid > 0) ? g_idata[thid-1] : 0;

        __syncthreads();

        for (int offset = 1; offset < n; offset *= 2)

        {

            pout = 1 - pout; // swap double buffer indices

            pin = 1 - pout;

            if (thid >= offset) {

              temp[pout*n+thid] = temp[pin*n+thid - offset] + temp[pin*n+thid];

            } else {

              temp[pout*n+thid] = temp[pin*n+thid];

            }

            __syncthreads();

        }

        g_odata[thid] = temp[pout*n+thid]; // write output

    }

    

    void cscan(float* hostArray)

    {

        float* deviceArray;

        float* deviceArrayOut;

        const float zero = 0.0f;

        const int arrayLength = N;

        const unsigned int memSize = sizeof(float) * arrayLength;

    

        cudaMalloc((void**) &deviceArray, memSize);

        cudaMalloc((void**) &deviceArrayOut, memSize);

        cudaMemset( deviceArray, zero, memSize);

        cudaMemset( deviceArrayOut, zero, memSize);

    

        cudaMemcpy(deviceArray, hostArray, memSize, cudaMemcpyHostToDevice);

        scan <<< 1, N, 32 >>> (deviceArrayOut, deviceArray, N);

        cudaMemcpy(hostArray, deviceArrayOut, memSize, cudaMemcpyDeviceToHost);

    

    

        cudaFree(deviceArrayOut);

        cudaFree(deviceArray);

    

        return;

    }

  Finally, we're going to wrap this `cscan` function using cython with the following code:

  

    # use_kernel.pyx

    from libc.stdlib cimport malloc, free

    cdef extern from "kernel.h":

        void cscan(float*)

    

    def main():

        cdef int N = 16

        cdef float *my_array = malloc(N * sizeof(float))

        print "before"

        for x in range(N):

            my_array[x] = x**2

        print [my_array[x] for x in range(N)]

        cscan(my_array)

        print "after"

        print [my_array[x] for x in range(N)]

    

  To compile, first we need to turn the cuda code into c++ object code:

  

    $ nvcc -c kernel.cu  --shared --compiler-options '-fPIC' -o kernel.o

    

  The results are stored in `kernel.o`, which we would normally call from c++ with

  

    $ g++ use_kernel.cpp kernel.o -L/usr/local/cuda-5.5/lib64 -lcudart

    

  Hence, we just need to pass the equivalent flags to runcython++, and we'll get what we want

  

    $ runcython++ user_kernel.pyx "" "kernel.o -L/usr/local/cuda-5.5/lib64 -lcudart"

    before

    [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0, 64.0, 81.0, 100.0, 121.0, 144.0, 169.0, 196.0, 225.0]

    after

    [0.0, 0.0, 1.0, 5.0, 14.0, 30.0, 55.0, 91.0, 140.0, 204.0, 285.0, 385.0, 506.0, 650.0, 819.0, 1015.0]