April 5, 2017

gr-clenabled: OpenCL GPU Blocks for GNU Radio

Yesterday Mike (ghostop14) emailed us a document giving an overview of his experiments in rewriting several GNU Radio blocks to take advantage of OpenCL GPU acceleration. High-end discrete gaming GPUs (graphics processing units) in PCs are very powerful parallel processors that can perform calculations significantly faster than a general-purpose CPU. But only algorithms that can be parallelized are worth running on the GPU, and passing data between the CPU and GPU adds overhead. This means that only some algorithms will actually run faster on the GPU. GPU acceleration could be part of the key to allowing very high bandwidth SDRs to run on PCs.
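To make the trade-off concrete, here is a rough cost model in Python. It is only a sketch: the function names and every number in it are hypothetical and are not taken from Mike's measurements. It assumes the GPU pays a fixed per-call overhead plus a transfer cost in each direction, and that both processors otherwise work at a constant rate per sample.

```python
def cpu_time(n, cpu_rate):
    """Seconds for the CPU to process a block of n samples."""
    return n / cpu_rate


def gpu_time(n, gpu_rate, fixed_overhead, bytes_per_sample, bus_bandwidth):
    """Seconds for the GPU path: fixed overhead + copy in and out + compute."""
    transfer = 2 * n * bytes_per_sample / bus_bandwidth  # to the card and back
    return fixed_overhead + transfer + n / gpu_rate


def break_even_block_size(cpu_rate, gpu_rate, fixed_overhead,
                          bytes_per_sample, bus_bandwidth):
    """Block size above which the GPU path wins, or None if it never does."""
    per_sample_cpu = 1.0 / cpu_rate
    per_sample_gpu = 1.0 / gpu_rate + 2 * bytes_per_sample / bus_bandwidth
    if per_sample_gpu >= per_sample_cpu:
        # The CPU finishes each sample before the data could even be moved.
        return None
    return fixed_overhead / (per_sample_cpu - per_sample_gpu)


# Hypothetical figures, chosen only for illustration.
PARAMS = dict(
    gpu_rate=5e9,          # samples/s on the GPU
    fixed_overhead=30e-6,  # seconds per kernel launch and sync
    bytes_per_sample=8,    # one complex float32 sample
    bus_bandwidth=8e9,     # bytes/s across the bus
)

slow_block = break_even_block_size(cpu_rate=200e6, **PARAMS)
fast_block = break_even_block_size(cpu_rate=2e9, **PARAMS)

print("break-even for a slow CPU block:", slow_block)
print("break-even for a fast CPU block:", fast_block)
```

Under these assumed numbers, a block that is slow on the CPU only benefits from the GPU above a certain block size, while a block that is already fast on the CPU never benefits, because moving the data costs more per sample than computing on it.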

In his experiments, Mike accordingly found that only some GNU Radio blocks could be accelerated by the GPU. Many blocks ran more slowly on the GPU because of the additional overhead. In the end, the blocks he tested that showed clear or at least mixed acceleration were:

Log10
Complex To Arg
Complex To Mag/Phase
A custom Signal To Noise Ratio Helper that executes a divide->Log10->Abs sequence
Mag/Phase To Complex (OpenCL performed better only for blocks above 8K for the 1070, and 18K for the 970 and 1000M)
Signal Source (OpenCL outperformed CPU only for the 1070 for 8K blocks and above)
Quadrature Demodulation (OpenCL performed better only for blocks above 10K)
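One item in the list above, the Signal To Noise Ratio Helper, chains three element-wise operations. The plain Python sketch below illustrates what a divide->Log10->Abs sequence looks like when fused into a single pass. It is not the gr-clenabled code: the function names are hypothetical, any scaling factor is left out, and the operand order is an assumption. The general appeal of fusing is that the data only has to be moved once for three operations.

```python
import math


def snr_helper_fused(numerator, denominator):
    """Apply divide -> log10 -> abs to each pair of samples in one pass.

    Assumes every ratio is positive so that log10 is defined.
    """
    return [abs(math.log10(n / d)) for n, d in zip(numerator, denominator)]


def snr_helper_unfused(numerator, denominator):
    """The same sequence as three separate passes, for comparison."""
    ratios = [n / d for n, d in zip(numerator, denominator)]
    logs = [math.log10(r) for r in ratios]
    return [abs(v) for v in logs]


# Hypothetical input values.
signal = [100.0, 1.0, 4.0]
noise = [1.0, 10.0, 4.0]

fused = snr_helper_fused(signal, noise)
unfused = snr_helper_unfused(signal, noise)
print(fused)
```

Both versions compute the same values. The fused form just avoids materializing the two intermediate lists, which on a GPU would correspond to avoiding extra kernel launches and buffer traffic.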

The project is called gr-clenabled, and its open source code is available on GitHub. A full study of the implementation and performance of the GPU GNU Radio blocks can be found here. Below is an excerpt from Mike’s overview document (if you want more information we suggest reading the overview first, and then the full study document):

About 4 months ago I decided to take on a project that I had wished existed for some time. With all of the code available for using graphics cards for signal processing why were there not a wealth of GPU-accelerated blocks for GNURadio? Really leveraging my new graphics card (an NVIDIA GTX 1070), couldn’t I drive 80 MSPS or higher through if I had hardware that could supply it? (I know USB 2.0 bus speeds, some decoders require hardware for speed, etc. but an SDR enthusiast can still dream)

My idea seemed simple enough. Why not develop OpenCL versions of the most common blocks used in digital data processing? I may not hit my throughput goal but I bet I can really accelerate my flowgraphs. And since I can dream up whatever I want before I have to actually make it, why not make it even more scalable? Why not be able to take full advantage of multiple graphics cards in a system by being able to assign different blocks to run on different cards?

I know, that’s a lot of questions, but sounds great if it existed right? What I didn’t realize was the scope of the box I was about to open. My first task at hand was to learn OpenCL and REALLY dig into the depths of the GNURadio code. Turns out not all signal processing algorithms lend themselves nicely to the way massively parallel processing works. And there’s a time price to pay to move data to a PCI card for processing then retrieve the results that has to be considered. Some native blocks take longer than this transfer time to run and can benefit from offloading, while others are so fast they’re done before a GPU even gets the data. But I’m getting ahead of myself here.

Throughput of the log10 GNU Radio block on various GPUs at different block sizes.