(def-gpu-kernel mulmatrix (m1 m2 m3) ...) ; m3 = m2*m1

(defun my-solver (equation-matrix)
  ...
  (with-gpu-allocated-arrays (a1 a2 a3)
    ...
    (call-gpu-kernel mulmatrix a1 a2 a3 N) ; only N is passed: the matrices are square
    ...)
  ) ; end of the solver
Let us explain the code a bit. mulmatrix is a GPU kernel, defined in a subset of CL (mostly math functions and array operations). It should be compiled to one of two GPU-specific assembly-like languages (see below). After that it can be launched transparently by call-gpu-kernel (transparency here means that call-gpu-kernel is an ordinary CL macro).
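To make the "ordinary macro" point concrete, here is a minimal CPU-only sketch. It assumes a kernel can be stood in for by a plain Lisp function kept in a registry; the names def-fake-kernel, call-fake-kernel, *kernels* and scale-in-place are hypothetical and are not part of SBCL or of any GPU SDK. Nothing is compiled for a GPU here.

```lisp
;; CPU-only stand-ins for def-gpu-kernel / call-gpu-kernel.
;; Every name in this block is hypothetical.

(defvar *kernels* (make-hash-table :test #'eq)
  "Registry: kernel name -> ordinary Lisp function.")

(defmacro def-fake-kernel (name lambda-list &body body)
  "Store a CPU function under NAME instead of compiling GPU code."
  `(progn
     (setf (gethash ',name *kernels*) (lambda ,lambda-list ,@body))
     ',name))

(defmacro call-fake-kernel (name &rest args)
  "An ordinary macro: expands into a registry lookup and a FUNCALL."
  `(funcall (or (gethash ',name *kernels*)
                (error "Unknown kernel: ~S" ',name))
            ,@args))

;; A toy kernel: multiply every element of V by FACTOR, in place.
(def-fake-kernel scale-in-place (v factor)
  (dotimes (i (length v) v)
    (setf (aref v i) (* factor (aref v i)))))
```

For example, (call-fake-kernel scale-in-place (vector 1 2 3) 10) returns #(10 20 30), and (macroexpand-1 '(call-fake-kernel scale-in-place v 2)) shows that the call site is plain CL. A real implementation would replace the FUNCALL with a launch through the vendor API.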
After the kernel is defined, we should be able to use it from plain (CPU-targeted) CL code. To achieve this, the following steps are required:
1. Allocate CL data arrays on the SBCL side (usually arrays of numbers).
2. Allocate arrays of the same size in GPU memory.
3. Copy the data from the CL arrays to the GPU arrays.
4. Run the kernels on the GPU data arrays using the GPU kernel launcher API (part of the ATI Stream / CUDA SDKs).
5. Copy the modified data from GPU arrays to CL data arrays.
6. Use the processed data in your CL application.
This is only one of the possible approaches, and the inputs are not restricted to arrays (but GPU parallelism only makes sense for fairly large data sets, which are normally represented by arrays).
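The six steps above can be sketched without any GPU at all. The following is a CPU-only simulation in which "device memory" is just another Lisp vector and the kernel is a reference matrix multiply. All function names (device-alloc, copy-to-device, copy-to-host, mulmatrix-reference, multiply-on-fake-gpu) are assumptions made for illustration; a real version would call the CUDA or ATI Stream launcher API in their place.

```lisp
;; CPU-only simulation of the six steps; no SDK call is made.
;; All names are hypothetical.

(defun device-alloc (n)
  "Step 2 stand-in: 'device' memory is just another Lisp vector."
  (make-array n :element-type 'single-float :initial-element 0.0))

(defun copy-to-device (host device)
  "Step 3 stand-in: host -> device copy."
  (replace device host))

(defun copy-to-host (device host)
  "Step 5 stand-in: device -> host copy."
  (replace host device))

(defun mulmatrix-reference (m1 m2 m3 n)
  "Step 4 stand-in: m3 = m2*m1 for row-major N x N matrices stored flat."
  (dotimes (i n m3)
    (dotimes (j n)
      (let ((sum 0.0))
        (dotimes (k n)
          (incf sum (* (aref m2 (+ (* i n) k))
                       (aref m1 (+ (* k n) j)))))
        (setf (aref m3 (+ (* i n) j)) sum)))))

(defun multiply-on-fake-gpu (h1 h2 n)
  "Run steps 1-6 for host matrices H1 and H2 (flat, N x N); returns H2*H1."
  (let ((h3 (make-array (* n n) :element-type 'single-float
                                :initial-element 0.0)) ; step 1 (result array)
        (d1 (device-alloc (* n n)))                    ; step 2
        (d2 (device-alloc (* n n)))
        (d3 (device-alloc (* n n))))
    (copy-to-device h1 d1)                             ; step 3
    (copy-to-device h2 d2)
    (mulmatrix-reference d1 d2 d3 n)                   ; step 4
    (copy-to-host d3 h3)                               ; step 5
    h3))                                               ; step 6: the caller uses it
```

The two copies in steps 3 and 5 are pure overhead here, which is one reason the approach only pays off when the kernel does much more work than the transfers cost.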
In fact, a very simple variant of the scheme described above already works well in a proprietary CL application that we have written here in Ukraine. It has been shown to run over 500x (yes, 500 times) faster on benchmarks of comparatively simple O(N*N) algorithms. With SBCL there will be no 500x (our proprietary application is extremely slow, for reasons which I will not list here), but I believe that 10-20x is possible for the O(N*N) class of algorithms.
The whole architecture for SBCL is as follows:

A subset of CL is translated to an intermediate representation, GPU-IR, which does not depend on the underlying device architecture. Then we produce GPU assembly for CUDA devices or ATI Stream devices, depending on the platform where our CL code is loaded and executed. This gives us our own home-made portable Lisp-like language for modern parallel computing. The SBCL-side part of the system should live in an SB-GPU contrib.
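The shape of such a pipeline can be illustrated with a toy. The post does not define GPU-IR, so everything below is invented for illustration: the IR is a list of three-address instructions (OP DEST ARG1 ARG2), the accepted subset is nested binary +, - and *, and the two "backends" are made-up mnemonic tables that are neither PTX nor ATI IL.

```lisp
;; Toy pipeline: CL subset -> device-independent IR -> backend text.
;; The IR format and both mnemonic tables are hypothetical.

(defun lower-to-ir (form)
  "Flatten nested binary +, -, * forms into three-address instructions.
Returns the instruction list and, as a second value, the result operand."
  (let ((code '())
        (counter 0))
    (labels ((walk (f)
               (if (atom f)
                   f                        ; a variable or a constant
                   (destructuring-bind (op a b) f
                     (assert (member op '(+ - *)))
                     (let* ((x (walk a))
                            (y (walk b))
                            (dest (intern (format nil "T~D" (incf counter)))))
                       (push (list op dest x y) code)
                       dest)))))
      (let ((result (walk form)))
        (values (nreverse code) result)))))

(defparameter *backend-a* '((+ . "a.add") (- . "a.sub") (* . "a.mul"))
  "Made-up mnemonics for a first device family.")

(defparameter *backend-b* '((+ . "b.add") (- . "b.sub") (* . "b.mul"))
  "Made-up mnemonics for a second device family.")

(defun emit (ir mnemonics)
  "Render IR as a list of text lines using a backend mnemonic table."
  (mapcar (lambda (insn)
            (destructuring-bind (op dest x y) insn
              (format nil "~A ~A, ~A, ~A"
                      (cdr (assoc op mnemonics)) dest x y)))
          ir))
```

With the default printer settings, (emit (lower-to-ir '(+ (* a b) c)) *backend-a*) gives ("a.mul T1, A, B" "a.add T2, T1, C"), and passing *backend-b* changes only the mnemonics. The point is that the lowering step is shared and only the last stage knows about the device.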
The bad news is the estimate: at least 6 man-months for the GPU-independent part and at least 1-2 man-years for the two GPU backends (6-12 months per backend). I can do this as a PhD thesis (because there will be a scientific part, connected with parallelism and optimisations), but I need to find an appropriate university and a professor who could be interested in this (not necessarily in Ukraine). The thing is that I want to continue my education, so I would like to combine that with SBCL development. Is this possible on Earth?
Well, the PhD goal seems to be too ambitious here, but the idea itself is not so bad. Just imagine the ton of different algorithms that would benefit from running their parallel variants on a GPU...
You know about cl-gpu, I assume? https://github.com/angavrilov/cl-gpu
Sure. I find that library great, but SB-GPU is slightly different: it is designed to incorporate a compiler for GPU targets as well. Blue-sky idealism, you see :)
Status update:
For now, I am not developing any SSA/GPU things for SBCL; I just send some trivial patches from time to time.
However, I plan to work on SBCL in the future. For now, I still have an excellent CL job, and I do not feel that I should prefer full-time open-source development. An SSA frontend cannot be written on weekends (well, to tell the truth, I am not a good enough engineer to write it on weekends); it needs a lot of time. But in case I am fired, I will definitely try to do something useful for SBCL, just to hone my compiler engineering skills. Currently I hone them at my day job, and I am happy with this situation.
Interesting approach. But for me, GPU programming alone is difficult enough. I mean, I don't want to see a speedup of only 2 or 3. And giving the compiler all the knowledge it needs to generate efficient code seems really tough. I once took an easier approach: I just called the GPU from the Lisp side via FFI and did the GPU stuff in C. The really bad thing was that the GPU wants to see C data, so there is a continuous conversion business from Lisp to C and the other way around ...