Discussion (14 Comments)
I can't read good ;)
Both approaches still need NVIDIA’s cuda-checkpoint for the GPU side, because CUDA/GPU memory and driver state are not something a normal process checkpointing tool can handle on its own.
So are there any more resources the team could point to, or any plans to open-source it at some point? I would love a deeper dive into the internals!
https://docs.cloud.google.com/kubernetes-engine/docs/concept...
gVisor's `runsc checkpoint` subcommand supports a `--save-restore-exec-argv` flag, which lets you specify a program to execute before gVisor starts taking the process snapshot.
You can fill in the blanks from there.
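One way to fill in those blanks, as a rough sketch rather than anyone's actual setup: use the hook to run `cuda-checkpoint --toggle --pid <PID>` so the CUDA state is suspended before gVisor snapshots the process. The snippet only builds and prints the command line. The container ID, image path, and PID are made-up placeholders, and passing the hook as a single space-separated string is an assumption about how the flag parses its value.

```python
import shlex


def build_checkpoint_command(container_id, image_path, gpu_pid):
    """Build (but do not run) a `runsc checkpoint` command line.

    Assumption: the hook argv is passed as one space-separated string.
    All argument values are hypothetical placeholders.
    """
    # Hook executed before the snapshot: toggle CUDA state for the GPU process.
    hook = ["cuda-checkpoint", "--toggle", "--pid", str(gpu_pid)]
    return [
        "runsc",
        "checkpoint",
        f"--image-path={image_path}",
        f"--save-restore-exec-argv={shlex.join(hook)}",
        container_id,
    ]


# Placeholder values for illustration only.
cmd = build_checkpoint_command("my-container", "/tmp/ckpt", 1234)
print(shlex.join(cmd))
```

How the hook learns the right PID inside the sandbox is left open here; that part depends on the runtime.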
They run their snapshot agent as a Kubernetes DaemonSet, whereas our implementation runs as part of the Cerebrium container runtime path. Under the hood, both approaches rely on cuda-checkpoint, since cuda-checkpoint is currently the main primitive NVIDIA exposes for interacting with GPU memory during checkpoint/restore.
One difference is how KV cache handling is exposed. NVIDIA’s approach appears to automatically handle KV cache allocation/deallocation, whereas today we expose that choice to users (vLLM and SGLang expose primitives to do this). In some cases, users may want to discard the KV cache to reduce checkpoint size and restore time; in others, preserving it may be useful.
Their DaemonSet approach is also nice because it can be more portable across Kubernetes environments and clouds. Our approach is more deeply integrated into the node/runtime layer, which gives us tighter control over the serverless startup path, but also means it depends on custom node VM images, which not every provider supports equally.
The optimizations they mention around parallel memfd restore and Linux native AIO for anonymous memory could also be applied to our architecture if we find them stable and beneficial. That said, our current results are already pretty close. For example, they report restoring Qwen3-8B in 4.7s with those changes, while we currently restore it in 6.49s.
The biggest thing we are excited for is multi-GPU restore, which is not supported yet. That would unlock a much broader set of workloads.
Also, in our benchmarks we seem to outperform Modal by ~20% in 4 of the 6 workloads we tested, with a lower spread of results, meaning you get more consistent results. However, the same fundamental question still applies: how can you move data from storage into memory as quickly as possible?
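To make the "storage into memory" point concrete, here is a minimal sketch of one common trick: reading a checkpoint file in parallel chunks into a single preallocated buffer. This is not how Cerebrium, Modal, or NVIDIA implement restore; the function name, chunk size, and worker count are all illustrative assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_load(path, chunk_size=4 << 20, workers=8):
    """Read a file into one bytearray using several threads.

    Each worker fills its own non-overlapping slice of the buffer with
    positional reads, so no shared file offset or locking is needed.
    """
    size = os.path.getsize(path)
    buf = bytearray(size)
    view = memoryview(buf)
    fd = os.open(path, os.O_RDONLY)

    def read_chunk(offset):
        end = min(offset + chunk_size, size)
        pos = offset
        while pos < end:
            # os.pread may return fewer bytes than requested, so loop.
            data = os.pread(fd, end - pos, pos)
            if not data:
                raise EOFError(f"unexpected end of file at offset {pos}")
            view[pos:pos + len(data)] = data
            pos += len(data)

    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() forces evaluation so worker exceptions surface here.
            list(pool.map(read_chunk, range(0, size, chunk_size)))
    finally:
        os.close(fd)
    return buf
```

Whether this actually beats a single sequential read depends on the storage backend; on fast NVMe or network volumes, several reads in flight usually help, but it is worth measuring.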