Sign In
Access to Author tools and Claude Code Assistant requires authentication.
by Adam 7 min read

Making MAX Engine Work on NVIDIA DGX Spark: An NVML Journey

How I built an NVML library replacement to unlock GPU monitoring on Grace Blackwell unified memory architecture

nvidia dgx-spark nvml cuda max-engine gpu unified-memory grace-blackwell

Picture this: You just unboxed your shiny new NVIDIA DGX Spark with its groundbreaking Grace Blackwell GB10 superchip. 128GB of unified memory, ARM64 Grace CPU cores, and Blackwell GPU with FP4 tensor cores all in one integrated package. You’re ready to run local AI inference with MAX Engine.

You type the command:

max generate google/gemma-3-1b-it "Hello world"

And you get:

Error: No supported "gpu" device available.

Wait, what?

You check:

nvidia-smi
Failed to initialize NVML: Driver/library version mismatch

Your GPU didn’t disappear. But to every tool that matters - MAX Engine, PyTorch, TensorFlow, your monitoring dashboard - it might as well have.

Welcome to the unified memory NVML problem.

Understanding the Issue

The Discrete GPU World

Traditional NVIDIA GPUs (think RTX 4090, A100, H100) have a clear separation:

  • CPU with its own RAM (system memory)
  • GPU with its own VRAM (framebuffer memory)
  • PCIe bus connecting them
  • NVML library that queries GPU framebuffer

Applications use NVML (NVIDIA Management Library) to ask questions like:

  • “How many GPUs are there?”
  • “How much VRAM is available?”
  • “What’s the GPU utilization?”

NVML looks at the GPU’s dedicated framebuffer and reports back. Simple, straightforward, works great.

The Unified Memory Revolution

Grace Blackwell GB10 changes everything:

  • One integrated superchip (CPU + GPU on same die)
  • 128GB unified LPDDR5x memory (shared between CPU and GPU)
  • No separate framebuffer (there’s no “GPU VRAM” to query)
  • No PCIe bottleneck (direct memory access)

This is the future of AI hardware. No copying data between CPU and GPU memory spaces. Everything just… works together.

Except when it doesn’t.

Why NVML Breaks

Standard NVML expects to find:

  1. A discrete GPU with PCI device ID
  2. Dedicated framebuffer memory to query
  3. Separate GPU memory controllers to monitor

When NVML tries to query the GB10’s “framebuffer memory,” it finds… nothing. Because there isn’t one. The memory is unified.

Result: NVML_ERROR_NOT_SUPPORTED

This breaks:

  • MAX Engine (can’t detect GPU)
  • PyTorch GPU monitoring (no device info)
  • TensorFlow device detection (falls back to CPU)
  • pynvml library (all queries fail)
  • nvidia-smi (version mismatch errors)
  • DGX Spark dashboard (no telemetry)

Your $5,000 desktop AI supercomputer just became a very expensive ARM64 workstation.

The Journey: Building a Solution

Initial Research

I started by diving into how NVML actually works:

// What applications expect to do:
nvmlInit();
nvmlDeviceGetCount(&count);        // Should return 1
nvmlDeviceGetHandleByIndex(0, &handle);
nvmlDeviceGetMemoryInfo(handle, &memory);  // Fails on GB10!

The last line fails because nvmlDeviceGetMemoryInfo is trying to read GPU framebuffer stats that don’t exist.

Key insight: Most of what NVML provides is actually available through the CUDA Runtime API.

The Architecture

Instead of trying to patch NVML or use LD_PRELOAD tricks (which don’t work with Python’s ctypes), I built a complete replacement:

+-------------------------------------+
|   Applications (MAX, PyTorch, etc)  |
+-------------------------------------+
|   libnvidia-ml.so.1 (our shim)      |  <- Drop-in replacement
+-------------------------------------+
|   CUDA Runtime API                  |  <- For device queries
|   /proc/meminfo                     |  <- For unified memory
+-------------------------------------+
|   NVIDIA Driver                     |
+-------------------------------------+
|   GB10 Hardware                     |
+-------------------------------------+

Implementation Highlights

1. Device Enumeration (Easy)

nvmlReturn_t nvmlDeviceGetCount_v2(unsigned int *deviceCount) {
    cudaError_t err = cudaGetDeviceCount(&device_count);
    if (err != cudaSuccess) {
        return NVML_ERROR_UNINITIALIZED;
    }
    *deviceCount = device_count;
    return NVML_SUCCESS;
}

CUDA Runtime already knows about the GB10. Just pass it through.

2. Device Properties (Straightforward)

nvmlReturn_t nvmlDeviceGetName(nvmlDevice_t device, char *name, unsigned int length) {
    struct cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device_index);
    strncpy(name, prop.name, length);  // "NVIDIA GB10"
    return NVML_SUCCESS;
}

3. Memory Info (The Tricky Part)

Here’s where unified memory gets interesting:

nvmlReturn_t nvmlDeviceGetMemoryInfo_v2(nvmlDevice_t device, nvmlMemory_v2_t *memory) {
    // For unified memory, "total" is system memory from /proc/meminfo
    unsigned long long total = get_system_memory_total();

    // "Used" comes from CUDA runtime
    size_t free_mem, total_mem;
    cudaMemGetInfo(&free_mem, &total_mem);
    unsigned long long used = total_mem - free_mem;

    // Fill structure
    memory->total = total;      // 128GB unified pool
    memory->used = used;        // What CUDA is using
    memory->free = total - used;
    memory->reserved = 0;

    return NVML_SUCCESS;
}

The semantics are different from discrete GPUs, but they’re correct for unified memory.

4. The Version Field Bug

MAX Engine does something clever:

mem_info.version = sizeof(nvmlMemory_v2_t) | (2 << 24);  // Encode struct version

My first implementation overwrote this:

memory->version = 0x2000028;  // WRONG! Destroys MAX's encoding

After debugging with NVML_SHIM_DEBUG=1, I found the issue:

// CORRECT: Preserve the input version
unsigned int input_version = memory->version;
// ... do the query ...
memory->version = input_version;  // Restore it

This took 3 hours to figure out. Worth it.

The Result: It Works!

MAX Engine

$ python3 -c "from max.driver import Accelerator; print(Accelerator())"
Device(type=gpu,id=0)

GPU detected!

$ max generate google/gemma-3-1b-it "The future of AI is"
Pipeline Configuration
============================
model              : google/gemma-3-1b-it
architecture       : Gemma3ForCausalLM_Legacy
pipeline           : TextGenerationPipeline
devices            : gpu[0]
max_batch_size     : 512
max_seq_len        : 32768
cache_memory       : 87.56 GiB     <- Using unified memory!

It works! MAX Engine is generating text on GB10.

Python ML Frameworks

import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()  # 1
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
name = pynvml.nvmlDeviceGetName(handle)  # "NVIDIA GB10"
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

print(f"GPU: {name}")
print(f"Memory: {mem.used / 1024**3:.1f}GB / {mem.total / 1024**3:.1f}GB")

# Output:
# GPU: NVIDIA GB10
# Memory: 18.5GB / 119.7GB

pynvml works!

nvidia-smi

I built a Python wrapper that mimics nvidia-smi output:

$ nvidia-smi
Tue Jan 27 08:15:03 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.126.09       Driver Version: 580.126.09   CUDA Version: 13.0     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage   | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                 Off | 000f:01:00.0    Off |                  N/A |
| N/A   N/A    P0              N/A /  N/A |  18432MiB / 122570MiB |      0%      Default |
+-----------------------------------------+------------------------+----------------------+

Dashboard telemetry restored!

Technical Deep Dive

What’s Implemented (16 Functions)

FunctionImplementation
nvmlInit_v2cudaGetDeviceCount()
nvmlDeviceGetCount_v2cudaGetDeviceCount()
nvmlDeviceGetHandleByIndex_v2Fake handle from index
nvmlDeviceGetNamecudaGetDeviceProperties()
nvmlDeviceGetMemoryInfo_v2/proc/meminfo + cudaMemGetInfo()
nvmlDeviceGetUUIDcudaGetDeviceProperties()
nvmlDeviceGetPciInfo_v3cudaGetDeviceProperties()
nvmlDeviceGetCudaComputeCapabilitycudaGetDeviceProperties()
nvmlSystemGetDriverVersioncudaDriverGetVersion()
nvmlSystemGetNVMLVersionVersion string

Plus 6 more utility functions.

What’s Different from Discrete GPUs

GPU Utilization: Returns 0%

Why? Traditional “GPU utilization” measures separate GPU cores and memory controllers. Unified memory doesn’t have this separation. The GPU and CPU share the same memory controllers.

Temperature/Power: Returns N/A

The GB10 superchip manages thermal and power internally. There’s no separate “GPU temperature” to query.

Process Tracking: Not yet implemented

nvmlDeviceGetComputeRunningProcesses is tricky with unified memory. Working on it.

Performance

  • Library size: 72KB (vs 15MB for official NVML)
  • Load time: <5ms
  • Function overhead: <0.1ms
  • Memory footprint: ~2MB runtime

Lean and fast.

Installation

Quick Start

# Clone repository
git clone https://github.com/CINOAdam/nvml-unified-shim.git
cd nvml-unified-shim

# Build
make -f Makefile.python

# Install system-wide
sudo make -f Makefile.python install

# Test
python3 tests/test_python.py

For MAX Engine

# Find MAX installation
MAX_PATH=$(python3 -c "import max; print(max.__file__)" | xargs dirname)

# Install to Modular's lib directory
sudo cp libnvidia-ml.so.1 $MAX_PATH/../modular/lib/

Verify

# Test MAX Engine
python3 -c "from max.driver import Accelerator; print(Accelerator())"
# Output: Device(type=gpu,id=0)

# Test pynvml
python3 -c "import pynvml; pynvml.nvmlInit(); print('OK')"
# Output: OK

The Bigger Picture

Why This Matters

Unified memory is the future of AI hardware:

  • No PCIe bottleneck: Direct CPU-GPU memory access
  • Simplified programming: No manual data copying
  • Better efficiency: Shared memory pool reduces waste
  • Lower latency: No transfer overhead

Grace Blackwell GB10 is just the beginning:

  • Grace Hopper GH200 (unified memory)
  • Grace Blackwell GB200 (larger systems)
  • Future NVIDIA architectures
  • Potentially AMD and Intel unified memory GPUs

This library makes unified memory accessible today.

What’s Next

Short term (Community feedback):

  • Additional NVML functions as needed
  • Process tracking implementation
  • Thread safety improvements
  • Testing on GH200 and other systems

Long term (NVIDIA collaboration):

  • Official NVML support for unified memory?
  • Upstream contribution opportunities
  • Best practices and documentation
  • Integration with NVIDIA tools

Lessons Learned

Technical

1. Iterative discovery works: Let the application tell you what it needs. MAX Engine’s error messages guided the implementation.

2. Version fields matter: Preserve application-specific encodings. Don’t assume you know better.

3. Debug logging is essential: NVML_SHIM_DEBUG=1 saved hours of debugging.

4. Test incrementally: Verify each function immediately. Don’t write 100 functions and then test.

5. CUDA Runtime provides most of what you need: NVML is mostly a wrapper anyway.

Process

1. PoC first, production later: Prove the concept before over-engineering.

2. Document as you go: Capture discoveries in real-time. Future you will thank you.

3. Community focus: Build for everyone, not just yourself.

4. Persistence pays off: 5 hours of iteration = full success.

Try It Yourself

The project is open source (MIT License) and ready to use:

Repository: https://github.com/CINOAdam/nvml-unified-shim

Requirements:

  • NVIDIA DGX Spark (or other Grace Blackwell system)
  • Ubuntu 24.04 (or compatible Linux)
  • CUDA 12.8+
  • GCC compiler

Status: Production ready. Tested with:

  • MAX Engine 26.2.0
  • PyTorch 2.x
  • TensorFlow 2.x
  • pynvml
  • DGX Dashboard

What started as “Why doesn’t MAX Engine see my GPU?” became a journey into unified memory architecture and a solution that makes AI more accessible on next-generation hardware.

The problem: Standard NVML breaks on unified memory architectures.

The solution: A drop-in NVML replacement using CUDA Runtime and /proc/meminfo.

The result: MAX Engine, PyTorch, TensorFlow, and all GPU monitoring tools work on DGX Spark.

Grace Blackwell represents the future of AI hardware. This library makes that future work today.