Discussion:
[cython-users] Cython syntax for CUDA programming?
Hai Nguyen
2014-12-10 16:50:56 UTC
Hi there,

As far as I know, GPU support in Python is still limited, and the most
popular libraries are PyCUDA/PyOpenCL. To use them we still need to
write the kernels in C/C++ syntax
(http://documen.tician.de/pycuda/).
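
For example, here is roughly what the hello-world example I transliterate
below looks like with PyCUDA today -- the kernel body is still a raw C
string, which is exactly what I would like to avoid (a sketch; the file
name is made up):

hello_pycuda.py
==========
import numpy as np
import pycuda.autoinit          # creates a context on the first GPU
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# the kernel itself must still be written in CUDA C, as a string
mod = SourceModule("""
__global__ void hello(char *a, int *b)
{
    a[threadIdx.x] += b[threadIdx.x];
}
""")
hello = mod.get_function("hello")

a = np.frombuffer(b"Hello \0", dtype=np.int8).copy()
b = np.array([15, 10, 6, 0, -11, 1, 0], dtype=np.int32)
hello(drv.InOut(a), drv.In(b), block=(7, 1, 1), grid=(1, 1))
print(a.tobytes().rstrip(b"\0").decode())   # -> World!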

I am interested in writing CUDA code in clean Cython/Python syntax. Is
there any quick way to make the code below work?
(I know that we can write very clean syntax with NumbaPro, but
NumbaPro is not free.)

Thanks.

(This is a very simple hello-world example for running on the GPU, taken from
https://www.pdc.kth.se/resources/computers/zorn/how-to/how-to-compile-and-run-a-simple-cuda-hello-world)

hello.pyx
======
# pseudocode -- the syntax I would like; none of this compiles today
from libcuda cimport *

DEF N = 7
DEF blocksize = 7

# dummy cuda decorator to let Cython know the function is a CUDA kernel
cdef cuda(func, message):
    pass

@cuda('global')
cdef void hello(char* a, int* b, cudablockcode=None):
    # C equivalent: __global__ void hello(...)
    # cudablockcode would be translated to the launch configuration,
    # i.e. hello<<<dimGrid, dimBlock>>>(ad, bd)
    a[threadIdx.x] += b[threadIdx.x]

cdef main():
    cdef char[N] a = "Hello "
    cdef int[N] b = [15, 10, 6, 0, -11, 1, 0]
    cdef:
        char* ad
        int* bd
        int csize = N * sizeof(char)
        int isize = N * sizeof(int)
        int i = 1
        dim3 dimBlock(blocksize, i)
        dim3 dimGrid(i, i)

    print a
    # allocate device memory and copy both host arrays over
    cudaMalloc(<void**>&ad, csize)
    cudaMalloc(<void**>&bd, isize)
    cudaMemcpy(ad, a, csize, cudaMemcpyHostToDevice)
    cudaMemcpy(bd, b, isize, cudaMemcpyHostToDevice)

    # call the kernel on the device; in CUDA C this would be
    # hello<<<dimGrid, dimBlock>>>(ad, bd)
    hello(ad, bd, cudablockcode=(dimGrid, dimBlock))

    # copy the result back to the host
    cudaMemcpy(a, ad, csize, cudaMemcpyDeviceToHost)

    # free device memory
    cudaFree(ad)
    cudaFree(bd)


libcuda.pxd
========
# declarations for CUDA
cdef extern from "driver_types.h":
    ctypedef enum cudaError:
        pass
    ctypedef enum cudaMemcpyKind:
        cudaMemcpyHostToHost
        cudaMemcpyHostToDevice
        cudaMemcpyDeviceToHost
        cudaMemcpyDeviceToDevice
        cudaMemcpyDefault

cdef extern from "vector_types.h":
    ctypedef struct uint3:
        unsigned int x, y, z
    ctypedef struct dim3:
        pass

cdef extern from "cuda.h":
    # built-in kernel variables: threadIdx/blockIdx are uint3,
    # blockDim/gridDim are dim3
    uint3 threadIdx
    uint3 blockIdx
    dim3 blockDim
    dim3 gridDim

cdef extern from "device_functions.h":
    void __syncthreads()

ctypedef cudaError cudaError_t

cdef extern from "cuda_runtime_api.h":
    cudaError_t cudaMalloc(void** p, size_t s)
    cudaError_t cudaFree(void* p)
    cudaError_t cudaMemcpy(void* dst, const void* src, size_t count,
                           cudaMemcpyKind kind)
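
(As a side note, I believe the host-side declarations above already work
from Cython as-is when linking against libcudart; it is only the kernel
definition and the <<<...>>> launch that have no Cython equivalent. A
minimal round-trip sketch using this libcuda.pxd -- the file and function
names are made up:)

roundtrip.pyx
=========
# host-only CUDA runtime calls from Cython; link with -lcudart
from libcuda cimport *
from libc.stdlib cimport malloc, free

def roundtrip(bytes data):
    """Copy `data` to the device and straight back, as a sanity check."""
    cdef size_t n = len(data)
    cdef char* dev = NULL
    cdef char* host = <char*>malloc(n)
    cudaMalloc(<void**>&dev, n)
    cudaMemcpy(dev, <char*>data, n, cudaMemcpyHostToDevice)
    cudaMemcpy(host, dev, n, cudaMemcpyDeviceToHost)
    result = host[:n]           # slice the C buffer back into Python bytes
    free(host)
    cudaFree(dev)
    return result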

cheers,
Hai
Sturla Molden
2014-12-10 19:07:31 UTC
Post by Hai Nguyen
I am interested in writing CUDA code in clean Cython/Python syntax. Is
there any quick way to make the code below work?
I believe there is support for CUDA in Numba (or at least in Anaconda
Accelerate).
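
If it does, the kernel would look something like this -- just a sketch,
assuming a Numba build with working CUDA support and automatic
host/device transfer of NumPy arrays:

from numba import cuda
import numpy as np

@cuda.jit
def hello(a, b):
    i = cuda.threadIdx.x
    a[i] = a[i] + b[i]

a = np.frombuffer(b"Hello \0", dtype=np.int8).copy()
b = np.array([15, 10, 6, 0, -11, 1, 0], dtype=np.int32)
hello[1, 7](a, b)     # launch 1 block of 7 threads
print(a.tobytes().rstrip(b"\0").decode())   # -> World!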

Sturla
Hai Nguyen
2014-12-10 19:11:00 UTC
yes, I just wrote: "(I know that we can write very clean syntax with
NumbaPro, but NumbaPro is not free.)"

thanks.

Hai
Post by Sturla Molden
Post by Hai Nguyen
I am interested in writing CUDA code in clean Cython/Python syntax. Is
there any quick way to make the code below work?
I believe there is support for CUDA in Numba (or at least in Anaconda
Accelerate).
Sturla
Jérôme Kieffer
2014-12-10 19:29:22 UTC
Hello Hai,

I am a regular PyOpenCL (and Cython) programmer, and I don't see the point of what you are suggesting:

CUDA/OpenCL imply many threads running simultaneously per block/workgroup,
and redesigning an algorithm for that is much more work than any language
issue. That effort pays off, by the way: OpenCL is slightly but consistently
faster than Cython+gcc (thanks to JIT compilation and advanced SIMD
optimization).

Also, in my opinion, having the OpenCL code in C and in a separate file is
much cleaner than mixing the two à la CUDA.
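
To illustrate, the host side of the same hello-world stays short in
PyOpenCL even with the kernel kept as plain OpenCL C (a sketch; in
practice I would keep the kernel string in its own .cl file):

import numpy as np
import pyopencl as cl

# kernel source -- normally kept in a separate .cl file
KERNEL = """
__kernel void hello(__global char *a, __global const int *b)
{
    int i = get_global_id(0);
    a[i] += b[i];
}
"""

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, KERNEL).build()

a = np.frombuffer(b"Hello \0", dtype=np.int8).copy()
b = np.array([15, 10, 6, 0, -11, 1, 0], dtype=np.int32)
mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)

prg.hello(queue, a.shape, None, a_buf, b_buf)   # global size 7
cl.enqueue_copy(queue, a, a_buf)
print(a.tobytes().rstrip(b"\0").decode())       # -> World!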

Cheers,
--
Jérôme Kieffer <***@terre-adelie.org>