by Adam

12 Hours, 7 Layers, 1 Hard Wall: Debugging Memory on DGX Spark

A deep technical investigation into why MAX Engine allocates 117GB regardless of limits on NVIDIA DGX Spark's unified memory architecture - and why the solution doesn't exist in userspace

nvidia dgx-spark max-engine cuda unified-memory grace-blackwell memory-management debugging cgroups kernel

Yesterday I spent 12 consecutive hours debugging why MAX Engine allocates 117GB of RAM on my DGX Spark, regardless of what limits I set. I traced the problem through seven system layers, wrote six patches, and ultimately hit a hard wall in the kernel.

The conclusion: On unified memory architectures, there is no userspace solution. The NVIDIA kernel module allocates directly from physical memory, bypassing every control mechanism Linux provides.

This is that story.

The Problem

After building the NVML unified shim to make MAX Engine detect my GPU, I encountered a new issue:

max generate --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --prompt "Hello" --max-new-tokens 10

Result: System memory spikes to 117-119GB. Always. Every time.

It didn’t matter if I:

  • Used a 1.1B or 8B parameter model
  • Set --device-memory-utilization 0.2 (should be 20%!)
  • Had other applications running
  • Explicitly capped memory through environment variables

The behavior was deterministic and frustrating: MAX claims nearly all 128GB of unified memory, leaving under 10GB for the OS and everything else.
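For scale, here is a back-of-envelope sketch in plain Python (using the figures from this post; none of this is MAX's code) of what the utilization flag should have meant versus what the system showed:

```python
# Hypothetical sanity check: what --device-memory-utilization 0.2 *should*
# mean on a 128 GB unified-memory machine, vs. what was observed.

USABLE_GB = 112      # roughly what NVML reports as available
UTILIZATION = 0.2    # the requested fraction

expected_gb = USABLE_GB * UTILIZATION
observed_gb = 117    # what actually gets allocated, every time

print(f"expected ~{expected_gb:.0f} GB, observed = {observed_gb} GB")
# -> expected ~22 GB, observed = 117 GB
```

Five times the requested budget, regardless of the flag.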

Why This Matters

On discrete GPU systems, this isn’t a problem. Your RTX 4090 has 24GB of VRAM. When an AI framework “allocates all GPU memory,” it’s just the 24GB on the card. Your system RAM stays separate.

On unified memory (DGX Spark, Grace Hopper), CPU and GPU share the same physical RAM. When MAX “allocates 90% of GPU memory,” it’s allocating 90% of all memory. Your desktop environment, your browser, your other services - they’re all competing for the remaining 10GB.

The Working Hypothesis

My initial theory was simple:

MAX Engine queries available GPU memory, sees 112GB free, and allocates ~90% for workspace and KV cache.

If I could intercept that query and return a smaller number (say, 40GB), MAX should allocate less.

What followed was a systematic investigation through every layer where memory queries could occur.

Layer 1: NVML (Working, But Irrelevant)

I already had the NVML unified shim working. Adding memory capping was straightforward:

nvmlReturn_t nvmlDeviceGetMemoryInfo(nvmlDevice_t device, nvmlMemory_t *memory) {
    unsigned long long total = get_system_memory_total();

    // Apply memory cap if set
    const char *cap_str = getenv("NVML_UNIFIED_MEMORY_CAP_GB");
    if (cap_str) {
        unsigned long long cap_bytes = strtoull(cap_str, NULL, 10) * 1024ULL * 1024ULL * 1024ULL;
        if (cap_bytes < total) {
            total = cap_bytes;
        }
    }

    memory->total = total;
    memory->free = get_system_memory_available();
    if (memory->free > total) {
        memory->free = total;  // free can never exceed the (possibly capped) total
    }
    return NVML_SUCCESS;
}
}

Test results:

NVML_UNIFIED_MEMORY_CAP_GB=40 max generate --model TinyLlama/...
[NVML-UNIFIED] Capping memory: 40 GB (real: 112 GB)
Device stats: free_memory = 42,949,672,960 bytes (40.00 GB)

The cap worked perfectly. NVML reported 40GB.

System memory: Still spiked to 117GB.

Discovery: MAX Engine doesn’t use NVML for memory allocation decisions.

Layer 2: Python Memory Estimator

MAX has a Python layer that calculates KV cache sizes. I found it in max/pipelines/lib/memory_estimation.py:

@classmethod
def free_memory(cls, devices: list[Device]) -> int:
    """Return total free memory with optional cap."""
    detected_free = int(sum(d.stats["free_memory"] for d in devices))

    # Check for memory limit override
    memory_limit_gb = os.environ.get("MAX_MEMORY_LIMIT_GB")
    if memory_limit_gb:
        limit_bytes = int(float(memory_limit_gb) * 1024**3)
        if detected_free > limit_bytes:
            logger.info(f"Capping detected free memory from "
                       f"{detected_free / (1024**3):.2f} GB to "
                       f"{float(memory_limit_gb):.2f} GB")
            return limit_bytes
    return detected_free

Test results:

INFO: Capping detected free memory from 111.86 GB to 40.00 GB
INFO: cache_memory: 22.00 GiB

KV cache planning respected the limit. The Python layer was working.

System memory: Still spiked to 117GB.

Discovery: Python plans the allocation, but C++/Mojo performs it. The C++ layer queries memory independently.
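For a sense of where a number like 22 GiB comes from, here is a hedged sketch of the KV-cache arithmetic such an estimator performs (not MAX's actual implementation), using TinyLlama-1.1B's published dimensions: 22 layers, 4 KV heads (GQA), head_dim 64, fp16 weights.

```python
# Sketch of KV-cache sizing arithmetic (hypothetical helper, not MAX's code).

def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes for one token's K and V tensors across all layers (fp16 default)."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

per_token = kv_cache_bytes_per_token(num_layers=22, num_kv_heads=4, head_dim=64)
budget = 22 * 1024**3          # the 22 GiB cache_memory the estimator reported
max_tokens = budget // per_token

print(f"{per_token} bytes/token -> room for ~{max_tokens:,} cached tokens")
# -> 22528 bytes/token -> room for ~1,048,576 cached tokens
```

The cache is sized to fill whatever memory the estimator believes is free, which is exactly why lying to the Python layer changed the plan but not the allocation.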

Layer 3: Python Device.stats Monkey Patch

Maybe the Python Device class was the query point. I intercepted its stats property:

from max._core.driver import Device

_original_stats = Device.stats.fget

def _patched_stats(self):
    """Patched stats property that caps memory values."""
    stats = _original_stats(self)

    memory_limit_gb = os.environ.get("MAX_MEMORY_LIMIT_GB")
    if memory_limit_gb:
        limit_bytes = int(float(memory_limit_gb) * 1024**3)
        if stats.get("free_memory", 0) > limit_bytes:
            print(f"[PATCH] Capping device free_memory: "
                  f"{stats['free_memory'] / (1024**3):.2f}GB -> "
                  f"{limit_bytes / (1024**3):.2f}GB")
            stats["free_memory"] = limit_bytes
            stats["total_memory"] = limit_bytes
    return stats

Device.stats = property(_patched_stats)

Test results:

[PATCH] Capping device free_memory: 107.29GB -> 40.00GB

Python saw 40GB. Memory estimator saw 40GB.

System memory: Still 117GB.

Discovery: The Device Python class is a wrapper around C++ _core.driver.Device. The C++ layer makes its own CUDA calls, completely bypassing Python.

Layer 4: CUDA Runtime API

Maybe MAX’s C++ uses cudaMemGetInfo() from the CUDA Runtime API. I built an LD_PRELOAD shim:

static cudaError_t (*real_cudaMemGetInfo)(size_t *free, size_t *total) = NULL;

cudaError_t cudaMemGetInfo(size_t *free, size_t *total) {
    if (!real_cudaMemGetInfo) {
        real_cudaMemGetInfo = dlsym(RTLD_NEXT, "cudaMemGetInfo");
    }

    cudaError_t ret = real_cudaMemGetInfo(free, total);

    if (ret == cudaSuccess && memory_cap_bytes > 0 && *total > memory_cap_bytes) {
        size_t orig_total = *total;
        LOG_DEBUG("cudaMemGetInfo() capped: %zu MB -> %zu MB",
                 orig_total / (1024*1024), memory_cap_bytes / (1024*1024));
        *total = memory_cap_bytes;
        *free = (*free * memory_cap_bytes) / orig_total;  // Scale against the original, not the capped, total
    }

    return ret;
}

Test results: No cudaMemGetInfo() calls logged.

The shim loaded, initialized, and… nothing. MAX never called cudaMemGetInfo.

Discovery: MAX doesn’t use CUDA Runtime API. It uses the lower-level CUDA Driver API.

Layer 5: CUDA Driver API

The Driver API is below the Runtime API. I extended the shim for cuMemGetInfo_v2:

static CUresult (*real_cuMemGetInfo_v2)(size_t *free, size_t *total) = NULL;

CUresult cuMemGetInfo_v2(size_t *free, size_t *total) {
    if (!real_cuMemGetInfo_v2) {
        real_cuMemGetInfo_v2 = dlsym(RTLD_NEXT, "cuMemGetInfo_v2");
    }

    if (!real_cuMemGetInfo_v2) {
        fprintf(stderr, "[CUDA-SHIM] WARNING: Could not find cuMemGetInfo_v2\n");
        return CUDA_ERROR_UNKNOWN;
    }

    // ... capping logic ...
}

Test results:

[CUDA-SHIM] WARNING: Could not find cuMemGetInfo_v2 (Driver API)

dlsym couldn’t find the symbol. I checked MAX’s binary:

nm -D max/_core.cpython-313-aarch64-linux-gnu.so | grep -i "mem.*info\|malloc"

Output:

U PyMem_Malloc
U PyObject_Malloc
U malloc@GLIBC_2.17

Discovery: MAX uses standard malloc() for its control structures, not CUDA memory query APIs.

This was getting interesting.
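The nm check can also be scripted. Here is a sketch using Python's ctypes, which asks the dynamic linker whether a shared object resolves a given symbol (exports_symbol is a hypothetical helper, and libm stands in for MAX's .so; on a non-CUDA system the Driver API symbol should come back unresolved):

```python
import ctypes
import ctypes.util

def exports_symbol(library_path: str, symbol: str) -> bool:
    """True if the dynamic linker can resolve `symbol` through this library."""
    try:
        lib = ctypes.CDLL(library_path)
        getattr(lib, symbol)      # raises AttributeError if dlsym fails
        return True
    except (OSError, AttributeError):
        return False

libm = ctypes.util.find_library("m")   # stand-in; point this at the real .so
print(exports_symbol(libm, "cos"))              # True
print(exports_symbol(libm, "cuMemGetInfo_v2"))  # False without libcuda
```

Note that dlsym also searches a library's dependencies, so this tests "resolvable through", not strictly "exported by"; the same caveat applies to nm -D output.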

Layer 6: cgroups v2

Linux cgroups should be the ultimate authority. If I put MAX in a cgroup with a 50GB memory limit, the kernel should enforce it regardless of what MAX does internally.

# Create cgroup
sudo mkdir -p /sys/fs/cgroup/max-engine

# Enable memory controller
echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control

# Set 50GB limit
echo "53687091200" | sudo tee /sys/fs/cgroup/max-engine/memory.max

I wrote a wrapper to run MAX inside the cgroup:

#!/bin/bash
max "$@" &
MAX_PID=$!
echo $MAX_PID > /sys/fs/cgroup/max-engine/cgroup.procs

# Monitor
while kill -0 $MAX_PID 2>/dev/null; do
    CURRENT=$(cat /sys/fs/cgroup/max-engine/memory.current)
    PEAK=$(cat /sys/fs/cgroup/max-engine/memory.peak)
    printf "\r[%s] Current: %2dGB | Peak: %2dGB | Limit: 50GB" \
           "$(date +%H:%M:%S)" "$((CURRENT/1024/1024/1024))" "$((PEAK/1024/1024/1024))"
    sleep 1
done

Test results:

Metric               Value
cgroup memory.peak   1.4 GB
System memory used   118 GB
MAX exit code        CUDA_ERROR_OUT_OF_MEMORY

The cgroup tracked only 1.4GB of allocations. But the system showed 118GB used.

Checking kernel logs:

sudo dmesg | tail -10
[2754.617886] NVRM: nvCheckOkFailedNoLog: Check failed: Out of memory
               [NV_ERR_NO_MEMORY] (0x00000051) returned from
               _memdescAllocInternal(pMemDesc) @ mem_desc.c:1359

Discovery: cgroups only track userspace allocations. The NVIDIA kernel module allocates memory directly, outside cgroup accounting.
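The gap itself is easy to quantify. A sketch (hypothetical helper; the /proc/meminfo-style sample values mirror the numbers above) that computes how much in-use memory the cgroup's accounting cannot explain:

```python
# Quantify memory the cgroup controller never saw: system-wide usage from
# /proc/meminfo minus what the cgroup itself accounted for in memory.current.

def unaccounted_gb(meminfo_text: str, cgroup_current_bytes: int) -> float:
    """In-use system memory that the cgroup's accounting cannot explain."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields[key] = int(rest.split()[0]) * 1024   # /proc/meminfo is in kB
    used = fields["MemTotal"] - fields["MemAvailable"]
    return (used - cgroup_current_bytes) / 1024**3

# Sample values: 128 GiB total, ~10 GiB available, 1.4 GiB tracked by cgroup
sample = "MemTotal:       134217728 kB\nMemAvailable:    10485760 kB"
print(f"{unaccounted_gb(sample, int(1.4 * 1024**3)):.1f} GB untracked")
# -> 116.6 GB untracked
```

On the live system, meminfo_text would come from open("/proc/meminfo").read() and the cgroup value from /sys/fs/cgroup/max-engine/memory.current.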

Layer 7: The Kernel Module (Hard Wall)

The allocation path became clear:

MAX C++/Mojo layer
    ↓
malloc() → allocates small control structures (~1GB, tracked by cgroup)
    ↓
CUDA Driver API (cuMemAlloc, cuCtxCreate)
    ↓
ioctl() to /dev/nvidia0
    ↓
nvidia.ko kernel module
    ↓
_memdescAllocInternal() → allocates 117GB directly from physical memory
    ↓
⚠️ NOT tracked by cgroups

On unified memory systems, the NVIDIA kernel module:

  1. Queries total physical memory (128GB)
  2. Allocates workspace based on that total
  3. Allocates from kernel space, invisible to cgroup memory controller
  4. Returns tiny handles to userspace (a few bytes)

There is no userspace interception point.

The 1.4GB that cgroups saw was MAX’s Python/C++ control structures. The 117GB was allocated by the kernel module, completely bypassing Linux’s memory accounting.

Why This Happens

On discrete GPU systems:

CPU RAM (128GB) ←→ PCIe Bus ←→ GPU VRAM (24GB)

The GPU VRAM is separate. CUDA allocations come from VRAM. System memory stays untouched.

On unified memory:

Unified Physical Memory (128GB)
├── CPU Region (OS, apps) - cgroups can limit
└── GPU Region (CUDA) - cgroups CANNOT limit

When CUDA context creation happens on unified memory, the kernel module sees 128GB of shared RAM and allocates a large workspace. This allocation happens in kernel space. cgroups v2 memory controller only tracks:

  • Anonymous pages (userspace malloc, mmap)
  • Page cache
  • Kernel stack

It does not track kernel module allocations.

Implications

For MAX Engine Users

Approach                Works?   Limitation
Python memory limits    No       C++ bypasses Python
NVML shim cap           No       MAX doesn’t use NVML
CUDA API interception   No       MAX uses kernel ioctl directly
cgroups v2              No       Kernel allocations invisible
Dedicated machine       Yes      Can’t share hardware

For All Unified Memory Systems

This isn’t MAX-specific. Any framework that uses CUDA Driver API on unified memory will exhibit similar behavior:

  • PyTorch
  • TensorFlow
  • JAX
  • Custom CUDA applications

For Cluster Operators

Traditional Kubernetes resource limits don’t work here. Memory requests and limits become advisory only: the OOM conditions arise in kernel space, outside anything containerd or the kubelet can see or enforce.

The Path Forward

Short Term (Modular Can Implement)

  1. Respect environment variables in C++/Mojo layer:
fn init_cuda(memory_limit_gb: Int):
    let limit = get_env("MAX_MEMORY_LIMIT_GB")
    if limit:
        cuda_set_internal_limit(limit * 1024 * 1024 * 1024)
  2. Use Python config values in C++ allocation decisions

  3. Document unified memory limitations in DGX Spark/Grace Hopper docs

Long Term (Requires NVIDIA)

  1. Kernel module memory accounting: Expose allocations to cgroup v2
  2. New CUDA API: cuCtxCreateWithMemoryLimit()
  3. Configurable default allocation: Module parameter for max workspace

Reproduction

For anyone wanting to verify:

# 1. Clone and build NVML shim
git clone https://github.com/CINOAdam/nvml-unified-shim
cd nvml-unified-shim && make

# 2. Monitor memory in another terminal
watch -n1 'free -h'

# 3. Run MAX (observe 117GB spike)
max generate --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --prompt "Test" --max-new-tokens 10

# 4. Try cgroup (observe it doesn't help)
sudo mkdir -p /sys/fs/cgroup/max-engine
echo "53687091200" | sudo tee /sys/fs/cgroup/max-engine/memory.max
# ... still spikes to 117GB, then OOMs

Lessons Learned

Technical:

  1. Debug logging at each layer saved hours
  2. nm -D on binaries reveals actual symbol usage
  3. cgroups have fundamental limits with kernel module allocations
  4. Unified memory requires fundamentally different resource management

Process:

  1. Systematic layer-by-layer investigation beats guessing
  2. Sometimes the answer is “this can’t be done from userspace”
  3. Document failures - they help others avoid the same rabbit holes

Current Status

  • Investigation: Complete
  • Hard wall: Reached (kernel module allocations)
  • Userspace solution: None exists
  • Vendor response: Pending

For now, running MAX Engine on DGX Spark means dedicating the entire machine to it, or accepting memory pressure on other workloads. Not ideal for homelab multi-tenancy.

The unified memory architecture is genuinely revolutionary for AI performance - no PCIe bottleneck, no memory copying, massive effective VRAM. But the software ecosystem hasn’t caught up with the memory management implications.


Repository: nvml-unified-shim

Hardware: NVIDIA DGX Spark (Grace Blackwell GB10)

Investigation time: 12 hours continuous

System layers traced: 7

Patches written: 6

Hard walls hit: 1

Sometimes the most valuable debugging sessions are the ones that prove something can’t be done.