Large language models (LLMs) require substantial computational power and memory, which has largely confined their execution to cloud-based environments. Reliance on remote infrastructure, however, introduces challenges such as latency, privacy risks, and dependence on stable connectivity. This work examines the feasibility of performing LLM inference directly on consumer-grade edge devices by combining model compression through quantization with parallel inference on available hardware accelerators. Using the TinyLlama-1.1B model in Q4_K_M format, performance is evaluated on two representative platforms: a MacBook Air M2 with GPU acceleration via Metal, and an Android smartphone using CPU NEON vectorization. Results show that quantization substantially reduces memory requirements, while parallel execution enables interactive throughput of approximately 96 tokens per second on the laptop and 46 tokens per second on the smartphone. The analysis further indicates that the autoregressive decoding stage remains the dominant factor governing overall performance. In contrast to earlier studies that explored quantization and parallelism as separate strategies, this study provides empirical evidence of their combined impact on edge hardware. The findings demonstrate a practical approach to efficient, privacy-preserving, and scalable LLM applications at the edge.
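The measurement setup described in the abstract can be approximated with a short script. The sketch below is illustrative only, assuming the llama-cpp-python bindings and a locally downloaded TinyLlama-1.1B Q4_K_M GGUF file; the file path, prompt, and parameter values are assumptions, not the authors' exact configuration.

```python
# Minimal sketch: quantized TinyLlama inference with GPU offload and a
# simple tokens-per-second measurement. Assumes llama-cpp-python is
# installed and a Q4_K_M GGUF file is available locally (assumed path below).
import time
from llama_cpp import Llama

llm = Llama(
    model_path="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",  # assumed local file
    n_gpu_layers=-1,   # offload all layers to Metal on Apple Silicon; 0 = CPU only
    n_threads=4,       # CPU threads for any layers kept on the CPU
    n_ctx=2048,        # context window size
)

prompt = "Explain edge computing in one paragraph."
start = time.perf_counter()
out = llm(prompt, max_tokens=128)
elapsed = time.perf_counter() - start

generated = out["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.2f} s -> {generated / elapsed:.1f} tok/s")
```

On the smartphone side, the same GGUF file would typically be run through a llama.cpp build with NEON support enabled, with decode throughput governed chiefly by memory bandwidth during the autoregressive stage.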
Keywords:
Large Language Models, Edge Computing, Model Quantization, Parallel Inference, Inference Optimization
Cite Article:
"Quantization and Parallel Inference for Large Language Models on Edge Devices", International Journal for Research Trends and Innovation (www.ijrti.org), ISSN:2455-2631, Vol.10, Issue 11, page no.a1-a7, November-2025, Available :http://www.ijrti.org/papers/IJRTI2511001.pdf
Downloads:
000285
ISSN:
2456-3315 | Impact Factor: 8.14 (calculated by Google Scholar) | ESTD Year: 2016