CUDA/OpenCL/C++ development, or conversion from MATLAB, producing highly efficient, readable, unit-tested code
All delivered code is unit-tested automatically using up-to-date testing infrastructure
Fast, efficient implementations on CPU and/or GPU
CPU optimizations using SIMD (SSE/AVX) and parallel programming, including atomic operations, lock-free data structures, etc. Finding hotspots, and detecting and analyzing various types of bottlenecks.
Designing, architecting and implementing new systems. GPU optimization of OpenCL code to meet high memory-bandwidth requirements and achieve high compute efficiency.