CUDA Support#
Vortex provides GPU-accelerated decompression and compute through the vortex-cuda crate. CUDA
kernels integrate with Vortex’s deferred execution model – expression trees can be shipped to
the GPU in bulk and fused into efficient kernel launches.
CudaSession#
The CudaSession maintains CUDA context, a registry of compiled kernels, and a pool of streams
for concurrent execution. It follows the same vtable registry pattern as other Vortex components:
encodings register their GPU kernels at session creation, and the session resolves them by
encoding ID at execution time.
```rust
let session = CudaSession::try_new()?;
let ctx = session.create_execution_ctx();
```
The session lazily loads PTX modules and caches compiled kernels. Multiple execution contexts can share a single session.
Execution Context#
CudaExecutionCtx manages kernel launches and tracks CUDA events for profiling. It acquires
streams from the session’s pool and records events around kernel launches to measure execution
time.
The execution context integrates with Vortex’s incremental execution model. When an array is
executed on the GPU, the same child-first optimization strategy applies – each encoding can
provide an optimized GPU kernel via execute_parent, or fall back to incremental execution.
Device Buffers#
CudaDeviceBuffer wraps GPU memory allocations with alignment guarantees matching Vortex’s
buffer conventions. Device buffers can be created by:
- Allocating directly on the GPU.
- Transferring from a host buffer (the allocator handles alignment and pinned memory).
- Memory-mapping a file segment directly to the GPU (with appropriate hardware support).
The buffer handle tracks the CUDA context and ensures proper cleanup on drop.
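As a rough illustration of the alignment guarantee, allocation sizes can be rounded up to the buffer alignment before the device allocation is made. The constant and helper below are assumptions for illustration, not the actual `CudaDeviceBuffer` internals:

```rust
// Illustrative alignment helper; the real alignment value and allocation
// path live in vortex-cuda's CudaDeviceBuffer (names here are assumptions).
const BUFFER_ALIGNMENT: usize = 128;

/// Round `len` bytes up to the next multiple of the buffer alignment.
fn aligned_len(len: usize) -> usize {
    len.div_ceil(BUFFER_ALIGNMENT) * BUFFER_ALIGNMENT
}

fn main() {
    assert_eq!(aligned_len(1), 128);
    assert_eq!(aligned_len(128), 128);
    assert_eq!(aligned_len(129), 256);
}
```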
Kernel Architecture#
CUDA kernels are written in C++ and compiled to PTX (NVIDIA’s intermediate representation)
at build time. The build script invokes nvcc to produce PTX files that are embedded in the
binary and loaded at runtime.
Kernels typically use a fixed launch configuration optimized for Vortex’s common array sizes:
- 64 threads per block (2 warps).
- 32 elements per thread.
- Grid dimensions computed from array length: `(len / 2048, 1, 1)`.
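Under these numbers each block covers 64 × 32 = 2048 elements, so the grid size is the array length divided by 2048. The sketch below rounds up so a trailing partial block is still covered; the rounding behavior is an assumption about how the actual launch macro handles lengths that are not a multiple of 2048:

```rust
const THREADS_PER_BLOCK: u32 = 64; // 2 warps
const ELEMENTS_PER_THREAD: u32 = 32;
const ELEMENTS_PER_BLOCK: u32 = THREADS_PER_BLOCK * ELEMENTS_PER_THREAD; // 2048

/// Grid dimensions for an array of `len` elements.
fn grid_dim(len: u32) -> (u32, u32, u32) {
    (len.div_ceil(ELEMENTS_PER_BLOCK), 1, 1)
}

fn main() {
    assert_eq!(grid_dim(2048), (1, 1, 1));
    assert_eq!(grid_dim(2049), (2, 1, 1));
}
```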
The launch_cuda_kernel! macro handles grid/block configuration, argument marshaling, and
event recording:
```rust
launch_cuda_kernel!(ctx, "dict_lookup_u32", len, |builder| {
    builder
        .arg(&codes_buffer)
        .arg(&values_buffer)
        .arg(&output_buffer)
});
```
Stream Pool#
The session maintains a pool of CUDA streams (default 4) for concurrent kernel execution. Streams are allocated round-robin to execution contexts, allowing multiple kernels to execute in parallel when they have no data dependencies.
Stream synchronization is handled automatically by the execution context – callers can treat execution as synchronous while the runtime pipelines independent operations.
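Round-robin allocation can be modeled with a simple wrapping counter. This toy sketch uses plain integer ids in place of real CUDA stream handles and is not the actual pool implementation:

```rust
// Toy model of round-robin stream allocation; the real pool hands out
// CUDA stream handles rather than integer ids.
struct StreamPool {
    size: usize,
    next: usize,
}

impl StreamPool {
    fn new(size: usize) -> Self {
        Self { size, next: 0 }
    }

    /// Returns the next stream id, wrapping around the pool.
    fn acquire(&mut self) -> usize {
        let id = self.next;
        self.next = (self.next + 1) % self.size;
        id
    }
}

fn main() {
    let mut pool = StreamPool::new(4); // default pool size
    let ids: Vec<usize> = (0..6).map(|_| pool.acquire()).collect();
    assert_eq!(ids, vec![0, 1, 2, 3, 0, 1]);
}
```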
Supported Encodings#
The following encodings have GPU-accelerated kernels:
| Encoding | Kernel |
|---|---|
| ALP | Floating-point decompression |
| BitPacked | Bit unpacking (8/16/32/64-bit variants) |
| Dictionary | Dictionary lookup |
| DecimalByteParts | Decimal reconstruction |
| Frame of Reference | FoR decompression |
| Sequence | Sequence expansion |
| ZigZag | ZigZag decoding |
| ZSTD | GPU-accelerated decompression via nvCOMP |
Additional kernels exist for filter, slice, and patch operations.
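For instance, the ZigZag kernel's per-element transform has a simple CPU reference. Assuming the standard ZigZag mapping (as used by Protobuf and similar formats), an unsigned code `n` maps back to the signed value `(n >> 1) ^ -(n & 1)`:

```rust
/// CPU reference for ZigZag decoding (assumes the standard mapping;
/// the GPU kernel applies this transform element-wise).
fn zigzag_decode(n: u32) -> i32 {
    ((n >> 1) as i32) ^ -((n & 1) as i32)
}

fn main() {
    assert_eq!(zigzag_decode(0), 0);
    assert_eq!(zigzag_decode(1), -1);
    assert_eq!(zigzag_decode(2), 1);
    assert_eq!(zigzag_decode(3), -2);
}
```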
External Libraries#
Vortex integrates with NVIDIA libraries for operations that benefit from highly optimized implementations:
- nvCOMP – NVIDIA’s compression library provides GPU-accelerated ZSTD decompression. The vortex-cuda/nvcomp crate provides Rust bindings that dynamically load libnvcomp.so at runtime. Currently Linux-only (x86_64 and ARM64).
- CUB – NVIDIA’s CUDA Unbound library provides GPU primitives. Vortex uses DeviceSelect for GPU-side filtering operations. The vortex-cuda/cub crate compiles a thin wrapper that is loaded at runtime.
Build Requirements#
Building with CUDA support requires:
- NVIDIA CUDA Toolkit with the nvcc compiler.
- CUDA 12.0 or later (builds target CUDA 12.0.80 for compatibility).
- Linux (nvCOMP bindings are Linux-only).
If nvcc is not available at build time, the crate compiles without PTX generation and GPU
operations will fail at runtime with an appropriate error.
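A build script can implement this fallback by probing for the compiler before attempting PTX generation. The sketch below is illustrative; the actual build.rs in vortex-cuda may differ:

```rust
// Sketch of an nvcc probe for a build script (illustrative). When the
// probe fails, PTX generation is skipped and GPU operations report an
// error at runtime instead.
use std::process::Command;

/// Returns true if `tool --version` runs successfully on this machine.
fn tool_available(tool: &str) -> bool {
    Command::new(tool)
        .arg("--version")
        .output()
        .map(|out| out.status.success())
        .unwrap_or(false)
}

fn main() {
    if tool_available("nvcc") {
        println!("nvcc found: compiling kernels to PTX");
    } else {
        println!("nvcc not found: building without PTX generation");
    }
}
```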
Integration with Deferred Execution#
GPU execution integrates with Vortex’s deferred execution model described in
Execution. When a ScalarFnArray tree is executed on the GPU:
1. The tree is traversed to identify operations with GPU kernels.
2. Compatible subtrees are batched into a single GPU execution plan.
3. The plan is executed with kernel fusion where possible, reducing memory traffic.
4. Results are returned as device buffers that can remain on GPU for further computation.
This batching is why deferral matters for GPU performance – eager execution would launch many small kernels with host-device synchronization between each, while deferred execution can fuse the entire expression tree into fewer, larger kernel launches.
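The launch-count difference can be made concrete with a toy expression-tree model (illustrative only, not the actual planner): eager execution pays one launch per operator node, while a fully fused plan covers the whole tree in a single launch.

```rust
// Toy model of plan batching: eager execution launches one kernel per
// operator node; a fully fused plan needs only one launch for the tree.
enum Expr {
    Leaf,          // input array, no kernel needed
    Op(Vec<Expr>), // operator with child expressions
}

/// Kernel launches under eager execution: one per operator.
fn eager_launches(e: &Expr) -> u32 {
    match e {
        Expr::Leaf => 0,
        Expr::Op(children) => 1 + children.iter().map(eager_launches).sum::<u32>(),
    }
}

fn main() {
    // (a + b) * c : two operators -> two eager launches, versus a
    // single launch if the whole tree fuses into one kernel.
    let tree = Expr::Op(vec![Expr::Op(vec![Expr::Leaf, Expr::Leaf]), Expr::Leaf]);
    assert_eq!(eager_launches(&tree), 2);
}
```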
Interoperability#
Vortex arrays with on-device buffer handles can be converted to Apache Arrow DeviceArray
format for interoperability with other GPU libraries. An ArrayStream of Vortex arrays
that have been executed on the GPU can be exported as a stream of Arrow DeviceArrays without
copying data back to the host.
Future work includes direct conversion to cuDF DataFrames, enabling zero-copy handoff to RAPIDS libraries for GPU-accelerated analytics.