Scholarly open access journals, Peer-reviewed, and Refereed Journals, Impact factor 8.14 (Calculate by google scholar and Semantic Scholar | AI-Powered Research Tool) , Multidisciplinary, Monthly, Indexing in all major database & Metadata, Citation Generator, Digital Object Identifier(DOI)
The complexity and heterogeneity of cloud-native systems have outpaced conventional incident response methods, which rely on static rules, manual tasks, and predefined automation to maintain operational continuity and meet service-level agreements (SLAs). This paper introduces a framework that uses reinforcement learning (RL) to autonomously detect and resolve incidents in Amazon Web Services (AWS) infrastructures. The cloud infrastructure is modeled as a partially observable, dynamic decision-making environment, and RL agents are trained to optimize their policies using real-time telemetry, cloud event data, and simulated fault scenarios. The agents assess the current system state and learn corrective responses through trial-and-error interaction, guided by reward functions that emphasize availability, stability, and latency improvement.
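As an illustration of a reward signal that combines availability, stability, and latency, the following is a minimal sketch and not the paper's actual reward function. The `Telemetry` fields, the weights, and the 200 ms latency target are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Telemetry:
    """Hypothetical snapshot of system health observed by the agent."""
    availability: float        # fraction of successful health checks, 0..1
    p95_latency_ms: float      # 95th-percentile request latency
    actions_last_window: int   # recent remediation actions (proxy for instability)


def reward(t: Telemetry,
           latency_slo_ms: float = 200.0,   # assumed latency target
           w_avail: float = 1.0,            # assumed weights, not from the paper
           w_latency: float = 0.5,
           w_stability: float = 0.1) -> float:
    """Reward high availability; penalize latency above target and action churn."""
    latency_penalty = max(0.0, t.p95_latency_ms - latency_slo_ms) / latency_slo_ms
    return (w_avail * t.availability
            - w_latency * latency_penalty
            - w_stability * t.actions_last_window)


healthy = Telemetry(availability=0.999, p95_latency_ms=120.0, actions_last_window=0)
degraded = Telemetry(availability=0.90, p95_latency_ms=450.0, actions_last_window=3)
print(reward(healthy), reward(degraded))
```

Under these assumed weights, a healthy snapshot scores higher than a degraded one, so an agent maximizing this signal is pushed toward restoring availability without excessive intervention.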
The method uses Amazon CloudWatch and AWS Config for monitoring, together with fault injection mechanisms, to train RL agents under realistic failure conditions. We evaluate several deep reinforcement learning techniques, including Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO), on different failure types such as resource exhaustion, instance termination, and misconfiguration. The experiments indicate that RL-based approaches outperform rule-based systems in both recovery time and intervention effectiveness.
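The idea of training on injected faults can be sketched with a toy simulation. This is a simplified stand-in under stated assumptions: it uses a tabular Q-learning update in place of DQN or PPO, episodes last a single step (so it behaves like a contextual bandit), and it makes no AWS calls. The action names and the fault-to-fix mapping in `FIX` are hypothetical.

```python
import random

FAULTS = ["resource_exhaustion", "instance_termination", "misconfiguration"]
ACTIONS = ["scale_out", "relaunch_instance", "rollback_config"]
# Hypothetical ground truth used only by the simulator: which action fixes which fault.
FIX = dict(zip(FAULTS, ACTIONS))


def step(fault, action):
    """Simulate one remediation attempt; return (reward, resolved)."""
    if FIX[fault] == action:
        return 1.0, True
    return -1.0, False  # a wrong action wastes recovery time


def train(episodes=2000, alpha=0.5, epsilon=0.2, seed=0):
    """Learn Q-values for (fault, action) pairs from injected faults."""
    rng = random.Random(seed)
    q = {(f, a): 0.0 for f in FAULTS for a in ACTIONS}
    for _ in range(episodes):
        fault = rng.choice(FAULTS)  # simulated fault injection
        if rng.random() < epsilon:  # explore
            action = rng.choice(ACTIONS)
        else:                       # exploit the current estimate
            action = max(ACTIONS, key=lambda a: q[(fault, a)])
        r, _ = step(fault, action)
        # One-step episodes: the Q-learning target reduces to the immediate reward.
        q[(fault, action)] += alpha * (r - q[(fault, action)])
    return q


def greedy_policy(q):
    """Pick the highest-valued action for each fault type."""
    return {f: max(ACTIONS, key=lambda a: q[(f, a)]) for f in FAULTS}


policy = greedy_policy(train())
print(policy)
```

A rule-based baseline would hard-code the `FIX` table; the learned policy recovers the same mapping from reward feedback alone, which is the property the comparison in the text depends on.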
The results demonstrate that RL can serve as a core component of autonomous, self-healing cloud systems. The study advances the development of intelligent cloud computing systems that learn from operational data and require minimal human supervision during disruptive events.
"Intelligent Incident Management in AWS Cloud Architectures: An AI-Centric Approach", International Journal of Science & Engineering Development Research (www.ijrti.org), ISSN: 2455-2631, Vol. 7, Issue 2, pp. 135-147, February 2022. Available: http://www.ijrti.org/papers/IJRTI2201022.pdf
ISSN: 2456-3315 | IMPACT FACTOR: 8.14 Calculated By Google Scholar | ESTD YEAR: 2016