cuda-oxide: Bridging Rust and CUDA for GPU Kernel Development

NVIDIA AI’s latest experimental project, cuda-oxide, opens a new path for GPU programming by allowing developers to write CUDA SIMT (Single Instruction, Multiple Threads) kernels in standard Rust and compile them directly to PTX—the low-level intermediate representation used by CUDA. This eliminates the need for C/C++ code, foreign function interfaces, or domain-specific languages. It integrates seamlessly with the existing Rust toolchain while preserving the familiar CUDA programming model. Below, we explore how cuda-oxide works, what makes it unique, and how it compares to other projects in the Rust GPU ecosystem.

What is cuda-oxide and what problem does it solve?

cuda-oxide is a custom codegen backend for the Rust compiler that targets NVIDIA GPUs. It compiles Rust code—specifically SIMT kernels—into PTX, the assembly-like IR that CUDA uses to execute on NVIDIA hardware. Traditionally, writing CUDA kernels meant using C++ and the CUDA API, or higher-level Python frameworks like Triton that generate CUDA under the hood. Rust developers who wanted GPU compute had to rely on projects like rust-cuda or CubeCL, which often introduced DSLs or foreign function interfaces. cuda-oxide solves this by letting you write a __global__-like function in safe Rust that compiles directly to PTX, preserving the SIMT execution model and CUDA intrinsics without leaving the Rust ecosystem. This brings the familiarity of CUDA programming into Rust, enabling low-level control over GPU threads while benefiting from Rust’s safety guarantees and modern tooling.
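To make the SIMT model concrete, here is a minimal sketch of the kind of indexing logic such a kernel performs. The intrinsic names a real cuda-oxide kernel would use (thread/block index accessors) are not shown in the article, so this example simulates the launch grid on the CPU purely to illustrate the flattened-index arithmetic; it is not the project's actual API.

```rust
// CPU-side sketch of CUDA-style SIMT indexing. On a GPU, block_idx and
// thread_idx would come from device intrinsics exposed by the backend;
// this simulation iterates over the whole launch grid instead.
fn vector_add(a: &[f32], b: &[f32], out: &mut [f32], block_dim: usize, grid_dim: usize) {
    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            // Equivalent of `blockIdx.x * blockDim.x + threadIdx.x` in CUDA C++.
            let i = block_idx * block_dim + thread_idx;
            if i < out.len() {
                out[i] = a[i] + b[i]; // one element per logical thread
            }
        }
    }
}

fn main() {
    let a = vec![1.0f32; 8];
    let b = vec![2.0f32; 8];
    let mut out = vec![0.0f32; 8];
    // "Launch configuration": 2 blocks of 4 threads covers all 8 elements.
    vector_add(&a, &b, &mut out, 4, 2);
    println!("{:?}", out);
}
```

The bounds check (`i < out.len()`) mirrors the guard every CUDA kernel needs when the grid size does not divide the data size evenly.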

Source: www.marktechpost.com

How does cuda-oxide differ from other Rust GPU projects?

The Rust GPU ecosystem includes several approaches: Rust-GPU compiles to SPIR-V for Vulkan/graphics compute, rust-cuda uses an NVVM IR backend to run Rust on GPUs, CubeCL provides an embedded DSL with JIT cross-compilation to CUDA/ROCm/WGPU, and std::offload leverages LLVM’s implicit offload path. cuda-oxide occupies a distinct niche: it aims to “bring CUDA into Rust” rather than “bring Rust to GPUs.” This means kernel authoring in cuda-oxide feels like writing a native CUDA __global__ function, with explicit thread indexing and device intrinsics, instead of a generic Rust function that happens to run on a GPU. In contrast, rust-cuda abstracts CUDA concepts with Rust ergonomics like async/.await and aims to run parts of the standard library on-device. The two are complementary—cuda-oxide targets developers who want direct CUDA semantics, while rust-cuda appeals to those who prefer a more Rust-idiomatic GPU programming model.

What is the compilation pipeline of cuda-oxide?

cuda-oxide integrates as a custom rustc codegen backend. Instead of producing CPU machine code, it intercepts the compiler at the CodegenBackend::codegen_crate() entry point and runs a separate pipeline for device code. The flow is: Rust source → rustc frontend → rustc_public (Stable MIR) → dialect-mir (a custom Pliron dialect modeling MIR semantics) → memory-to-register promotion → dialect-llvm → LLVM IR (.ll) → PTX (.ptx). The pipeline uses Pliron, a Rust-native MLIR-like IR framework, for the middle stages. This design avoids dependence on C++ tooling (no CMake, tablegen, or LLVM’s MLIR) and allows the entire compiler to build with just cargo. The use of Stable MIR ensures that the backend works across different Rust nightly versions without breaking.
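Out-of-tree rustc codegen backends are normally selected with the unstable `-Zcodegen-backend` flag on a nightly toolchain (the same mechanism rustc_codegen_cranelift uses). The article does not give cuda-oxide's exact invocation, so the library path below is a placeholder assumption, shown only to illustrate how such a backend plugs into a cargo build:

```toml
# .cargo/config.toml — sketch of wiring a custom codegen backend into cargo.
# The library name/path is hypothetical; requires a nightly toolchain.
[build]
rustflags = ["-Zcodegen-backend=/path/to/libcuda_oxide_backend.so"]
```

With this in place, `cargo build` hands codegen to the custom backend, which is where the Stable MIR → Pliron → LLVM IR → PTX pipeline described above would run instead of the default CPU codegen.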

Why does cuda-oxide use Stable MIR (rustc_public)?

The raw internal MIR (Mid-level IR) in rustc is unstable—it changes between nightly releases with no backward compatibility guarantees. This would make any backend relying on it fragile and hard to maintain. cuda-oxide solves this by using rustc_public, also known as Stable MIR. This is a versioned, stable API that rustc exposes over its internal representations. By basing the device code pipeline on Stable MIR, cuda-oxide can read MIR constructs without breaking every time the compiler updates. This stability is crucial for production use and for keeping the project maintainable. It also means developers can upgrade their Rust toolchain without worrying about cuda-oxide compatibility. The choice reflects the project’s emphasis on practical, long-term viability within the Rust ecosystem.


What is Pliron and why was it chosen over MLIR?

Pliron is a Rust-native IR framework inspired by MLIR (Multi-Level Intermediate Representation). The key advantage is that Pliron is written entirely in Rust and compiles with cargo, eliminating any dependency on C++ build systems like CMake or tools like tablegen. By contrast, using upstream MLIR would require integrating a C++ codebase into the Rust toolchain, adding complexity and build overhead. cuda-oxide defines three custom Pliron dialects: dialect-mir, which maps Rust MIR concepts (places, projections, etc.); dialect-llvm, which bridges to LLVM IR; and an intermediate dialect for memory operations. Pliron’s extensibility allows these dialects to be fine-tuned for GPU-specific optimizations. This choice keeps the entire cuda-oxide backend self-contained and easy to build, which aligns with the Rust philosophy of minimal dependencies and cross-platform consistency.

How does cuda-oxide compare to rust-cuda specifically?

While both projects target NVIDIA GPUs from Rust, their philosophies differ. cuda-oxide is designed to “bring CUDA into Rust”—it stays close to the CUDA programming model, with explicit kernel launches, device intrinsics, and thread-block management that feels like writing C++ CUDA code. In contrast, rust-cuda focuses on “bringing Rust to NVIDIA GPUs”—it abstracts CUDA concepts behind Rust idioms like async/.await and aims to run standard library features on device. For example, rust-cuda might let you use Rust’s Iterator on the GPU, while cuda-oxide expects you to write explicit grid-stride loops. The NVlabs team has been coordinating with rust-cuda maintainers and views the projects as complementary rather than competing. Developers could even use cuda-oxide for low-level kernel development and rust-cuda for higher-level asynchronous orchestration. Together, they cover a broad spectrum of GPU programming needs within the Rust ecosystem.
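The grid-stride loop mentioned above is the standard CUDA idiom for covering an input larger than the launch grid: each thread processes elements `i`, `i + stride`, `i + 2*stride`, where `stride` is the total thread count. The sketch below simulates that pattern on the CPU (real device intrinsics would supply the indices); it is an illustration of the idiom, not cuda-oxide's actual API.

```rust
// Grid-stride loop pattern, simulated on the CPU. In a real SIMT kernel,
// each GPU thread runs the inner `while` loop with its own starting index;
// here we iterate over every logical thread of the launch grid.
fn scale_grid_stride(data: &mut [f32], factor: f32, block_dim: usize, grid_dim: usize) {
    let stride = block_dim * grid_dim; // total threads in the launch
    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            // Each thread starts at its flattened index and strides forward.
            let mut i = block_idx * block_dim + thread_idx;
            while i < data.len() {
                data[i] *= factor;
                i += stride;
            }
        }
    }
}

fn main() {
    // 2 blocks x 4 threads = 8 threads cover 10 elements via striding.
    let mut data: Vec<f32> = (0..10).map(|x| x as f32).collect();
    scale_grid_stride(&mut data, 2.0, 4, 2);
    println!("{:?}", data);
}
```

The fixed-size grid decouples the launch configuration from the data size, which is exactly the explicit control cuda-oxide's CUDA-first model asks the programmer to write, versus the iterator-style abstraction rust-cuda aims for.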

What are the practical implications for developers?

For Rust developers working on GPU compute, cuda-oxide opens the door to writing high-performance kernels without leaving the Rust language or sacrificing safety. It eliminates the need for C++ wrappers or FFI calls, reducing boilerplate and potential bugs. The direct PTX output allows fine-grained optimization, and the use of Stable MIR ensures long-term compiler compatibility. However, cuda-oxide is still experimental—it targets nightly Rust and may lack full coverage of CUDA intrinsics or edge cases. Developers interested in prototyping or exploring SIMT in Rust can start experimenting now, but production use will require further testing. The project also encourages contributions to extend device library support. For teams already using CUDA C++, cuda-oxide offers a path to incrementally adopt Rust for kernel development, leveraging Rust’s memory safety without retraining on a new abstraction layer. It’s a promising step toward making Rust a first-class language for GPU programming.
