cuRobo

CUDA + PyTorch as the front-end + Warp

Motion Generation as Optimization

world representation

N.B. To efficiently leverage GPU compute for motion generation, we found the following to be important:

Reducing reads and writes to global memory to avoid hitting memory access latency.
Skip writing zeros for gradients by keeping track of sparsity. As the optimization converges, most gradients will go to zero and skipping rewrites of zeros greatly reduces the memory access bottlenecks in our cost terms.
Reducing number of kernels called in an optimization iteration by combining small kernels into a single larger kernel that does all the work from the small kernels.
Using shared memory to share data across a thread block, enabling for-loops to be run in parallel.
Leveraging warp-wide operations for computing values across small thread groups such as reductions and finding the maximum.