IJRTI
International Journal for Research Trends and Innovation
International Peer Reviewed & Refereed Journals, Open Access Journal
ISSN Approved Journal No: 2456-3315 | Impact factor: 8.14 | ESTD Year: 2016
Scholarly open access, peer-reviewed, refereed journal. Impact factor 8.14 (calculated by Google Scholar and Semantic Scholar | AI-Powered Research Tool). Multidisciplinary, monthly. Indexing in all major databases and metadata, citation generator, Digital Object Identifier (DOI).

Licence

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License
Published Paper Details
Paper Title: Enhancing GPU Cluster Efficiency in HPC with Kubernetes and Custom Monitoring Agents
Authors Name: Ratan Raj Anandeshi
Author Reg. ID: IJRTI_212240
Published Paper Id: IJRTI2605166
Published In: Volume 11 Issue 5, May-2026
DOI: https://doi.org/10.56975/ijrti.v11i5.212240
Abstract: GPU clusters are increasingly important in high-performance computing (HPC) environments for large-scale simulation, data analytics, and deep learning workloads. Meanwhile, Kubernetes has moved beyond cloud-native service orchestration and is increasingly discussed in research and practice as a platform for scientific computing, largely because of its declarative control paradigm, interoperability, and extensive ecosystem. The key challenge is to reconcile Kubernetes' flexibility with the strict efficiency requirements of GPU-based HPC systems, where latency sensitivity, topology information, accelerator utilization, and fine-grained observability strongly affect scientific throughput. This review examines peer-reviewed literature on GPU resource management, container orchestration, telemetry, and monitoring-driven optimization, with particular attention to Kubernetes-based implementations and custom monitoring agents. The major themes are scheduling under heterogeneous accelerator constraints, container overhead for scientific workloads, node-level and pod-level GPU observability, fairness and isolation, feedback control, and performance and energy optimization. The literature reports that orchestration alone rarely delivers maximum efficiency; quantifiable benefits are more often associated with scheduler extensions, topology-aware placement, and monitoring pipelines that expose memory pressure, streaming multiprocessor occupancy, I/O contention, and thermal or power behaviour. Persistent gaps include the lack of cross-layer metrics, limited support for managing multi-tenant GPU fragmentation, and insufficient validation in production-scale HPC environments. The topic matters because exascale and AI-driven next-generation systems will need operating models that combine portability, policy control, and accelerator-aware observability.
Cite Article: "Enhancing GPU Cluster Efficiency in HPC with Kubernetes and Custom Monitoring Agents", International Journal for Research Trends and Innovation (www.ijrti.org), ISSN: 2456-3315, Vol. 11, Issue 5, pp. b585-b603, May 2026. Available: http://www.ijrti.org/papers/IJRTI2605166.pdf
Publication Details:
Page No: b585-b603
Country: India
Research Area: Arts
Publisher: IJ Publication
Published Paper URL: https://www.ijrti.org/viewpaperforall?paper=IJRTI2605166
Published Paper PDF: https://www.ijrti.org/papers/IJRTI2605166