Enhancing GPU Cluster Efficiency in HPC with Kubernetes and Custom Monitoring Agents

doi:https://doi.org/10.56975/ijrti.v11i5.212240

IJRTI

International Journal for Research Trends and Innovation

International Peer Reviewed & Refereed Journals, Open Access Journal

ISSN Approved Journal No: 2456-3315 | Impact factor: 8.14 | ESTD Year: 2016

Licence

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License

Published Paper Details

Paper Title:	Enhancing GPU Cluster Efficiency in HPC with Kubernetes and Custom Monitoring Agents
Authors Name:	Ratan Raj Anandeshi
Author Reg. ID:	IJRTI_212240
Published Paper Id:	IJRTI2605166
Published In:	Volume 11 Issue 5, May-2026
DOI:	https://doi.org/10.56975/ijrti.v11i5.212240
Abstract:	GPU clusters are increasingly important in high-performance computing (HPC) environments for large-scale simulation, data analytics, and deep learning workloads. Meanwhile, Kubernetes has moved beyond cloud-native service orchestration and is increasingly discussed in research and practice as a platform for scientific computing, largely because of its declarative control paradigm, interoperability, and comprehensive ecosystem. The central challenge is to reconcile the flexibility of Kubernetes with the strict efficiency requirements of GPU-based HPC systems, where latency sensitivity, topology information, accelerator utilization, and fine-grained observability strongly affect scientific throughput. This review examines peer-reviewed literature on GPU resource management, container orchestration, telemetry, and monitoring-driven optimization, with particular attention to Kubernetes-based implementations and custom monitoring agents. Major themes include scheduling under heterogeneous accelerator constraints, container overhead for scientific workloads, node-level and pod-level GPU observability, fairness and isolation, feedback control, and performance and energy optimization. The literature reports that orchestration alone rarely delivers maximum efficiency; measurable benefits are more often associated with scheduler extensions, topology-aware placement, and monitoring pipelines that expose memory pressure, streaming multiprocessor occupancy, I/O contention, and thermal or power behaviour. Persistent gaps include the lack of cross-layer metrics, limited support for managing multi-tenant GPU fragmentation, and insufficient validation in production-scale HPC environments. The topic matters because exascale and AI-driven next-generation systems will need operating models that combine portability, policy control, and accelerator-aware observability.
Cite Article:	"Enhancing GPU Cluster Efficiency in HPC with Kubernetes and Custom Monitoring Agents", International Journal for Research Trends and Innovation (www.ijrti.org), ISSN: 2456-3315, Vol. 11, Issue 5, page no. b585-b603, May-2026. Available: http://www.ijrti.org/papers/IJRTI2605166.pdf
Publication Details:	Published Paper ID: IJRTI2605166; Registration ID: 212240; Published In: Volume 11 Issue 5, May-2026; DOI (Digital Object Identifier): https://doi.org/10.56975/ijrti.v11i5.212240; Page No: b585-b603; Country: India; Research Area: Arts; Publisher: IJ Publication; Published Paper URL: https://www.ijrti.org/viewpaperforall?paper=IJRTI2605166; Published Paper PDF: https://www.ijrti.org/papers/IJRTI2605166