this might be an extremely stupid question, but is this just a demo project of https://github.com/ffmpegwasm/ffmpeg.wasm? or is this bringing forth some other utility that im not seeing?
Hey, I'm Arpad. I have a background in signal processing, deep learning (mainly computer vision), and embedded systems. I have 5+ years of research experience and end-to-end edge AI development. My strong suite comes in GPU + embedded development, optimization, and architecture design.
Best fit: fast-moving teams, end-to-end ownership, collaborative.
looks like a cool project, but id say keep working on it since there seems to be some confusion on why someone would want to use this: no benchmarks and overall pretty vibe-codey (which id personally be very hesitant to use in production)
another comment already mentioned comparison to vortex, which is the same compression ratio and same speeds as youre claiming - but your compression is half of parquet. and if speed is the main goal youre going for, python is an interesting choice. no hate, but def keep working on it, and would love to see more concrete benchmarks with various columnar store types
That phrase and "Who designed this command line interface" are probably written for attention and clicks. The feedback content is useful and I agree with most of it but using such phrases diminishes the value of that feedback and invites defensiveness. I find uv's command line interface cumbersome for me too but I understand why it was written this way.
1. use-after-free, drop semantics vs manual cudaFree
2. kernel args enforced using `cuda_launch!` whereas CPP void* args is just an array of pointers, validating count only
3. alias mutable writes. e.g. CPP can have more than one thread writing out[i] with same i and this will compile. but DisjointSlice<T> with ThreadIndex doesnt have any public constructor (see: https://github.com/NVlabs/cuda-oxide/blob/2a03dfd9d5f3ecba52...) and only using API of `index_1d` `index_2d` and `index_2d_runtime`
This is amazing.. ive been working with custom CUDA kernels and https://crates.io/crates/cudarc for a long time, and this honestly looks like it could be a near drop-in replacement.
im especially curious how build times would compare? Most Rust CUDA crates obv rely on calling CMake or nvcc, which can make compilation painfully slow. coincidentally, just last week i was profiling build times and found that tools like sccache can dramatically reduce rebuild times by caching artifacts - but you still end up paying for expensive custom nvcc invocations (e.g. candle by hugging face calls custom nvcc command in their kernel compilation): https://arpadvoros.com/posts/2026/05/05/speeding-up-rust-whi...
> Most Rust CUDA crates obv rely on calling CMake or nvcc, which can make compilation painfully slow.
I anecdotally haven't hit this; see the `cuda_setup` crate I made to handle the build scripts; it is a simple `build.rs` which only recompiles if the file changes, and it's a tiny compile time (compared to the rust CPU-side code)
perhaps not drop-in, but all my workflows with cudarc have always been "i make cuda kernel, i use cudarc for ffi to said kernels, i call via rust" - which for this case is pretty analogous
briefly looking at the repo, looks like the main workflow is using rustc-codegen-cuda to convert rust -> MIR -> pliron IR -> LLVM IR -> PTX, which is embedded in the host binary, where then cuda-core loads embedded PTX at runtime onto the GPU
but, if you arent directly making cuda kernels and just want cudarc for either calling existing kernels or other cuda driver api access then cudarc is lighter-weight option? or just use one of the sub-crates in this repo like cuda-core for those apis
Hi, author of cuda-oxide here. Yes, I think that’s basically the right framing: cudarc and cuda-oxide sit at different points in the stack.
cudarc is a host-side CUDA API for Rust: loading modules, managing contexts/streams/events/memory, launching kernels, and accessing CUDA libraries/driver APIs. If your workflow is “I already have CUDA C++/PTX/CUBIN kernels and want to call them from Rust”, cudarc is a very natural fit.
cuda-oxide is focused on the other side of the problem: writing the GPU kernel itself in Rust and compiling it through rustc/MIR into GPU code. The generated PTX is then embedded in the host binary and loaded at runtime by our host-side pieces.
We include cuda-core/cuda-host because we need an end-to-end path for “write Rust kernel, build it, launch it”, but that doesn’t mean the generated PTX is tied forever to our launcher. We’d like the PTX from cuda-oxide to be usable from other host-side CUDA APIs too, including cudarc, and we’re exploring ways to make that interop smoother.
So the short version is: cudarc is about driving CUDA from Rust; cuda-oxide is about generating CUDA device code from Rust. They’re complementary rather than replacements for each other.
I am observing the same from the article... is it heavily inspired by Cudarc, i.e. is this intentional, or are we reading too much into this, given Cudarc is a light abstraction over the CUDA api?
i've had my shot at sycamore a number of times. IMO leptos (leptos.dev) has far more fine-grained capabilities, and dioxus (dioxuslabs.com) is overall more hand-holdy but also powerful. comes with tradeoff for speed. wasm still isnt there yet (yet..) but a lot more web frameworks (including smaller rust ones) can be tracked here: https://krausest.github.io/js-framework-benchmark/current.ht...
reply