GRID Middleware GPU Discovery
- GRID middleware GPU discovery is a set of mechanisms that identify, enumerate, and advertise GPU resources across distributed computing sites.
- It uses methods like scheduler probes, vendor API integration, and microbenchmark-driven tools to acquire detailed hardware topology and performance metrics.
- The approach enables efficient job scheduling and workload allocation in multi-vendor, high-performance computing environments, crucial for applications in HPC, high-energy physics, and AI.
GRID middleware GPU discovery denotes the set of mechanisms, interfaces, and schemas by which distributed resource management systems in scientific GRID computing environments identify, enumerate, and advertise the presence and properties of GPU resources at each participating site. Efficient and accurate discovery of GPU nodes is foundational for enabling GPGPU-based job scheduling, optimizing performance through hardware-aware job placement, and supporting advanced analytical workflows in domains such as high-energy physics, HPC, and AI. Solutions span command-line probing in schedulers, vendor-specific API integration, schema extension, abstraction layers, and—more recently—portable, benchmarking-driven topology tools.
1. Architectural Patterns for GPU Discovery in GRID Middleware
Classic GRID middleware relies on site-local resource probes, typically interfacing with the cluster’s Local Resource Management System (LRMS). The ARC framework utilizes CEinfo.pl, a Perl script, which delegates LRMS-specific discovery to modules such as SLURMmod.pm under SLURM. The GPU discovery extension inserts the slurm_read_gresinfo() routine, which invokes:
sinfo -a -h -o "gresinfo=%G"
This command outputs SLURM’s notion of Generic Resources (GRES, e.g. gpu:k80ce:4,mps:no_consume:1), which is pre-filtered and relayed as an array to ARC1ClusterInfo.pm, then serialized into an additional <GeneralResources> XML element for consumption by GLUE2-aware clients. No GPU-specific API (CUDA, NVML) is required, making this approach backend-agnostic within the constraints of LRMS support (Isacson et al., 2019).
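The relaying step amounts to splitting the GRES string into per-resource records. A minimal sketch in Python (not ARC's actual Perl; the field layout assumed here is name:modifiers:count, with the count taken as the trailing numeric field):

```python
# Parse a SLURM GRES string such as "gpu:k80ce:4,mps:no_consume:1"
# into a list of per-resource records.

def parse_gres(gres: str):
    """Split a comma-separated SLURM GRES string into dicts."""
    records = []
    for entry in gres.split(","):
        parts = entry.split(":")
        name = parts[0]
        # The last field is the count when numeric; anything in between
        # is treated as a type/modifier (e.g. "k80ce", "no_consume").
        count = int(parts[-1]) if parts[-1].isdigit() else None
        modifiers = parts[1:-1] if count is not None else parts[1:]
        records.append({"name": name, "modifiers": modifiers, "count": count})
    return records

print(parse_gres("gpu:k80ce:4,mps:no_consume:1"))
```

Real GRES grammars admit further variants (per-socket counts, type-less entries), so a production parser would need to track SLURM's own syntax rather than this simplified split.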
Advanced frameworks, such as Grid for Lattice QCD, provide vendor-neutral, cross-platform abstraction interfaces. These subsystems perform backend-agnostic device enumeration at program initialization (via acceleratorInit()), acquiring detailed properties from the appropriate backend: NVIDIA CUDA, AMD HIP, or the SYCL standard. Discovered devices are stored in a global vector of DeviceInfo records exposing compute major/minor version, total memory, warp size, PCI IDs, and device name, among other parameters, providing subsequent accelerator and memory-management logic with a unified interface (Boyle et al., 2022).
2. Information Model, Schema Extensions, and Data Structures
ARC’s approach augments, but does not modify, the GLUE2 schema, embedding a <GeneralResources> block under each <ComputingManager>. Each <Resource> entry is a free-form string exactly as emitted by SLURM (gresinfo). For example:
gpu:k80ce:4,mps:no_consume:1   (4 K80 GPUs, modifiers: mps)
hbm:16G                        (high-bandwidth memory, 16 GB)
This encoding captures GPU card type, count, optional modifiers, and memory, but omits fine-grained metrics, e.g. streaming multiprocessor (SM) count, memory bandwidth, occupancy, or utilization. Clients ingest these strings, and display within the “General resources:” block, enabling job attribute and broker parsing (Isacson et al., 2019).
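The embedding itself is a straightforward augmentation: a <GeneralResources> block is attached under <ComputingManager>, with one <Resource> child per string exactly as emitted by sinfo. A sketch of that serialization step in Python (element names follow the extension described above; the surrounding GLUE2 document structure is elided):

```python
# Attach a <GeneralResources> block of free-form <Resource> strings
# under a <ComputingManager> element, mirroring the ARC extension.
import xml.etree.ElementTree as ET

def add_general_resources(manager: ET.Element, gres_strings):
    block = ET.SubElement(manager, "GeneralResources")
    for s in gres_strings:
        res = ET.SubElement(block, "Resource")
        res.text = s  # stored verbatim, exactly as emitted by the LRMS
    return block

mgr = ET.Element("ComputingManager")
add_general_resources(mgr, ["gpu:k80ce:4,mps:no_consume:1", "hbm:16G"])
xml_text = ET.tostring(mgr, encoding="unicode")
print(xml_text)
```

Because the <Resource> payload is an opaque string, clients must re-parse it; a typed schema (see Section 6) would push that structure into the XML itself.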
The Grid library normalizes all enumeration into a single DeviceInfo record:
struct DeviceInfo {
    int backend;              // CUDA, HIP, SYCL
    int ordinal;              // device index
    int computeMajor, computeMinor;
    uint64_t globalMemBytes;
    int warpSize;
    int maxThreadsPerBlock;
    int pciBusID, pciDeviceID;
    std::string name;
};
Reporting and scheduling uniformly reference these objects, enabling seamless multi-vendor, multi-GPU operation across diverse backend APIs (Boyle et al., 2022).
MT4G introduces a vendor-agnostic JSON topology structure, reporting not only direct API attributes, but also microbenchmark-inferred cache sizes, memory hierarchy, bandwidth, segment counts, and topology (interconnects, sharing). A sample fragment includes compute, memory hierarchy (L1, L2, DRAM), and interconnect metrics (Vanecek et al., 8 Nov 2025):
{
"device_id": 0,
"vendor": "NVIDIA",
"model": "H100-80GB",
"compute": { "sm_count": 144, ... },
"memory_hierarchy": [
{ "name": "L1", "size": 238000, "line_size": 128, ... }
],
"interconnects": [
{ "peer_device_id": 1, "latency_us": 2.3, ... }
]
}
3. Probing Methods, API Integration, and Benchmark Layers
SLURM-based sites rely exclusively on scheduler commands (sinfo), avoiding direct GPU queries, which simplifies compatibility but restricts information to what the scheduler and its GRES subsystem expose. No direct use of CUDA, NVML, HIP, or ROCm is made in the original ARC extension. Other LRMS backends (PBS, LSF) require analogous probes, representing a notable limitation in non-SLURM environments (Isacson et al., 2019).
The Grid library directly calls vendor APIs:
- CUDA: cudaGetDeviceCount, cudaGetDeviceProperties, cudaDeviceGetAttribute
- HIP: hipGetDeviceCount, hipGetDeviceProperties, hipDeviceGetAttribute
- SYCL: platform::get_devices, device.get_info<...>
This interface provides broader hardware introspection, including PCI topology, compute capability, and other device attributes, but does not dynamically update device inventory after initialization (Boyle et al., 2022).
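As a toy illustration of the enumerate-and-normalize pattern (not Grid's actual C++, which selects its backend at compile time — runtime probing here merely stands in for that choice, and the probe functions are hypothetical stand-ins for the vendor calls listed above):

```python
# Try each backend's probe in order and normalize results into
# uniform device records, in the spirit of acceleratorInit().

def probe_cuda():
    raise RuntimeError("no CUDA runtime")   # stand-in for cudaGetDeviceCount failing

def probe_hip():
    # Stand-in for hipGetDeviceCount / hipGetDeviceProperties.
    return [{"backend": "HIP", "ordinal": 0, "name": "example-amd-gpu"}]

def enumerate_devices(probes=(probe_cuda, probe_hip)):
    for probe in probes:
        try:
            devices = probe()
        except RuntimeError:
            continue                         # backend unavailable, try the next
        if devices:
            return devices
    return []                                # no devices: fall back to CPU-only

print(enumerate_devices())
```

The key property mirrored here is that everything downstream sees one record shape regardless of which backend answered.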
MT4G advances the discovery model through the combination of API-level queries and a suite of over 50 custom microbenchmarks, implementing pointer-chase, bandwidth streaming, fetch-granularity, and cache-line detection kernels. Statistical analysis, primarily via offline Kolmogorov–Smirnov change-point detection, enables reliable identification of topological attributes unavailable programmatically. The microbenchmark layer detects cache sizes, bandwidths, line sizes, latency profiles, and segment counts across both NVIDIA and AMD devices (Vanecek et al., 8 Nov 2025).
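The change-point idea can be illustrated on a pointer-chase latency sweep: as the working set grows past a cache level's capacity, the latency distribution shifts, and the split maximizing a two-sample Kolmogorov–Smirnov statistic locates that shift. A simplified sketch with synthetic data (this is not MT4G's implementation; real sweeps are noisier and need significance thresholds):

```python
# Locate a latency step in a working-set sweep via an exhaustive
# two-sample Kolmogorov-Smirnov split search.

def ks_stat(a, b):
    """Two-sample KS statistic: max gap between empirical CDFs."""
    xs = sorted(set(a) | set(b))
    d = 0.0
    for x in xs:
        f1 = sum(v <= x for v in a) / len(a)
        f2 = sum(v <= x for v in b) / len(b)
        d = max(d, abs(f1 - f2))
    return d

def change_point(latencies, min_seg=3):
    """Index splitting the series into maximally dissimilar halves."""
    best_i, best_d = None, -1.0
    for i in range(min_seg, len(latencies) - min_seg):
        d = ks_stat(latencies[:i], latencies[i:])
        if d > best_d:
            best_i, best_d = i, d
    return best_i

sizes = [16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]  # KiB working sets
lat = [80, 81, 79, 80, 82, 81, 240, 242, 239, 241]           # cycles (synthetic)
i = change_point(lat)
print(f"latency step at working set ~ {sizes[i]} KiB")  # cache-capacity estimate
```

In this synthetic trace the step sits between 512 KiB and 1024 KiB, so the detected index bounds the cache capacity from above; MT4G applies the same principle offline across many benchmark kernels.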
4. Reporting, Advertisement, and Job Submission Workflows
Discovered GPU resources are presented to site-level consumers via extended XML, JSON, or directly accessible structures. For ARC, once <GeneralResources> is populated, the arcinfo client displays the new resource block, and XRSL job submission scripts reference a site-provided runtime environment (RTE, e.g. 00gpu), which is mapped internally to the corresponding SLURM --gres directive (Isacson et al., 2019):
description = "Request 1 K80 GPU"
environment =
joboption_nodeproperty_# = "%{joboption_nodeproperty_#} --gres=gpu:k80:1"
Example XRSL for consuming a GPU resource:
&(executable=$(which nvidia-smi))
 (stdout=job.out)(stderr=job.err)
 (runtimeenvironment=00gpu)
Grid sets GPU affinity for each MPI rank, binding to a logical device by mapping local ranks to indexes of gDeviceList. Device selection is finalized upon initialization; subsequent accelerator logic references the chosen DeviceInfo (Boyle et al., 2022).
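The rank-to-device mapping reduces to indexing the enumerated device list by node-local rank, wrapping when ranks outnumber devices. A hedged sketch (gDeviceList in Grid is a C++ vector; a plain list stands in here):

```python
# Bind each node-local MPI rank to a logical device by taking the
# rank modulo the number of enumerated devices.

def select_device(local_rank: int, device_list):
    if not device_list:              # no devices found: CPU-only fallback
        return None
    return device_list[local_rank % len(device_list)]

devices = [0, 1, 2, 3]               # device ordinals from enumeration
for rank in range(6):
    print(rank, "->", select_device(rank, devices))
```

The modulo wrap means oversubscribed nodes share devices round-robin rather than failing; the selection is fixed once at initialization, matching the static-enumeration model noted in Section 6.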
MT4G runs as a daemon or service, emitting topology in JSON for integration with HPC schedulers, data-placement engines, and performance modeling systems. Full discovery is performed at node boot or GPU reset (10–15 min) with incremental API-only updates possible (subsecond latency) (Vanecek et al., 8 Nov 2025).
5. Integration with Resource Selection, Scheduling, and Performance Modeling
ARC’s integration enables brokers or job dispatchers to parse the “General resources” block, identifying which Computing Elements (CEs) advertise GPUs and directing GPU-dependent workloads accordingly. The user simply sets an RTE name in XRSL, triggering the required resource mapping within the middleware (Isacson et al., 2019).
Grid’s normalized DeviceInfo structure abstracts device selection, affinity, and multi-GPU operation. Post-enumeration, resource usage proceeds uniformly regardless of backend, supporting applications with MPI, OpenMP threading, and accelerator dispatch (Boyle et al., 2022).
MT4G’s topology-centric output enables middleware schedulers to incorporate cache sizes, bandwidths, and inter-GPU fabrics into resource selection heuristics. Data-intensive jobs may be placed onto GPUs with high L2 cache, co-scheduled kernels may be arranged to benefit from physically shared caches, and performance estimators incorporate empirical bandwidth and latency measures. This suggests the evolution of scheduling towards topological and performance-aware paradigms (Vanecek et al., 8 Nov 2025).
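One such heuristic, sketched against an MT4G-style JSON topology (field names mirror the fragment in Section 2; the device values are synthetic, and a real scheduler would weigh many attributes, not L2 size alone):

```python
# Rank devices in an MT4G-style topology by L2 cache size, as a
# data-intensive placement heuristic might.
import json

topology = json.loads("""
[
  {"device_id": 0, "model": "gpu-a",
   "memory_hierarchy": [{"name": "L2", "size": 41943040}]},
  {"device_id": 1, "model": "gpu-b",
   "memory_hierarchy": [{"name": "L2", "size": 52428800}]}
]
""")

def l2_size(dev):
    """Return the L2 cache size in bytes, or 0 if not reported."""
    for level in dev["memory_hierarchy"]:
        if level["name"] == "L2":
            return level["size"]
    return 0

best = max(topology, key=l2_size)
print("place data-intensive job on device", best["device_id"])
```

Because the topology is plain JSON, the same selection logic can run inside a scheduler plugin, a broker, or an offline performance model without linking vendor libraries.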
6. Limitations, Challenges, and Future Directions
ARC’s original extension is limited to SLURM support, exposes only scheduler-level resource strings, and lacks dynamic utilization or fine-grained performance metrics. The current schema is amenable to extension but is string-typed and would benefit from discrete typed attributes, such as NumberOfGPUs:Int or GPUType:String. No benchmarking, throughput, or capacity formulas are defined, although conceptual expressions such as deviceMemoryBandwidth (GB/s) or peakFLOPS could be incorporated (Isacson et al., 2019).
The Grid library’s device enumeration is static; no dynamic device discovery is performed after initialization. PCI bus fields may be missing in SYCL if Level-Zero interop is unavailable. All vendor queries are required to pass error checking, with fallback to CPU-only operation if no devices are discovered (Boyle et al., 2022).
MT4G delivers the most detailed, portable topological discovery currently documented, but incurs significant time overhead for full microbenchmark sweeps (10–15 minutes per high-end GPU) and requires direct node access for service deployment. Busy or exclusive GPUs may skip benchmarking and report API-only attributes with reduced confidence. A plausible implication is that future GRID middleware may integrate dynamic utilization metrics (NVML/CUDA APIs) and finer-grained, typed schema extensions to enable optimized, performance-aware resource brokering (Vanecek et al., 8 Nov 2025).
7. Comparative Summary of Approaches
| Method | Data Acquisition | Granularity |
|---|---|---|
| ARC/SLURMmod | Scheduler output | Card, count, modifiers |
| Grid (OneCode) | Vendor APIs | Compute, memory, PCI |
| MT4G | APIs + microbenchmarks | Full topology, latency, cache, bandwidth |
ARC offers rapid, config-free resource advertising but lacks performance metrics and dynamic updates. Grid’s platform abstraction provides robust, cross-vendor enumeration for single-source codebases, suitable for scientific workflows requiring multi-GPU, multi-architecture deployment. MT4G’s benchmark-driven approach yields comprehensive hardware topologies, supporting sophisticated scheduling and modeling at the cost of increased discovery latency and operational complexity.
GRID middleware GPU discovery continues to evolve, with the convergence of scheduler-level abstraction, vendor API interrogation, and benchmarking-driven topology tools paving the way for performance-aware resource management, dynamic hardware introspection, and advanced scheduling in distributed scientific computing environments (Isacson et al., 2019, Boyle et al., 2022, Vanecek et al., 8 Nov 2025).