IJRTI
International Journal for Research Trends and Innovation
International Peer Reviewed & Refereed Journals, Open Access Journal
ISSN Approved Journal No: 2456-3315 | Impact factor: 8.14 | ESTD Year: 2016
Scholarly open access, peer-reviewed, and refereed journal; impact factor 8.14 (calculated by Google Scholar and Semantic Scholar); multidisciplinary; monthly; indexing in all major databases and metadata; citation generator; Digital Object Identifier (DOI)


Licence

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License
Published Paper Details
Paper Title: OCR-Integrated Retrieval-Augmented Generation for Intelligent PDF Processing
Authors Name: Eshani Patel, Yash Deep Singh Bais, Liza Patel
Author Reg. ID: IJRTI_212689
Published Paper Id: IJRTI2605099
Published In: Volume 11 Issue 5, May-2026
DOI:
Abstract: The pervasive digitization of institutional records has produced a large corpus of scanned, image-encoded PDF documents whose content remains inaccessible to conventional text-processing infrastructure. This paper presents and evaluates an integrated pipeline that combines multi-engine Optical Character Recognition (OCR) with semantic retrieval and large language model (LLM) generation to enable natural language querying over arbitrary PDF documents. The proposed architecture runs three OCR engines in parallel (Tesseract 5.3, EasyOCR 1.7, and PaddleOCR 2.7) with quantitative output selection, followed by LangChain-orchestrated semantic chunking, 384-dimensional dense embedding via the all-MiniLM-L6-v2 sentence transformer, FAISS-indexed vector storage, and Retrieval-Augmented Generation (RAG) inference over both locally deployed and API-backed LLMs. Evaluation on a heterogeneous corpus of 42 documents (318 pages), stratified into three document quality categories, yields a FAISS Precision@5 of 0.78 and a Mean Reciprocal Rank of 0.86, representing a 17-percentage-point improvement over TF-IDF baselines. Human-assessed response correctness reached 74% with GPT-3.5-Turbo and 58% with the locally deployed Falcon-RW-1B model, at a mean query latency of 9.5 seconds. PaddleOCR achieved the lowest Word Error Rate (9.7%) on degraded documents, while Tesseract retained a marginal advantage on clean typeset material. The work provides a framework for OCR engine selection within retrieval-augmented pipelines, a design dimension absent from the prior integrated document intelligence literature.
Keywords: Optical Character Recognition; Retrieval-Augmented Generation; Document Intelligence; FAISS; Semantic Embeddings; LangChain; PDF Processing; Large Language Models
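The retrieval metrics quoted in the abstract, Precision@5 and Mean Reciprocal Rank, can be illustrated with a minimal Python sketch. It assumes the common textbook definitions (hits in the top k divided by k, and the mean of 1/rank of the first relevant result); the paper's exact evaluation protocol is not given on this page. The function names and chunk identifiers below are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are relevant (assumed definition)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for chunk_id in ranked_ids[:k] if chunk_id in relevant_ids)
    return hits / k


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Illustrative data only: hypothetical chunk ids, not results from the paper.
ranked = ["c1", "c2", "c3", "c4", "c5"]
relevant = {"c1", "c3", "c9"}
print(precision_at_k(ranked, relevant, k=5))  # 2 hits out of 5 -> 0.4
print(mean_reciprocal_rank([(["c1"], {"c1"}), (["c2", "c1"], {"c1"})]))  # (1 + 0.5) / 2 -> 0.75
```

The same two functions apply unchanged to a TF-IDF ranking and to a FAISS ranking, since each only needs the ordered list of retrieved chunk ids per query.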
Cite Article: "OCR-Integrated Retrieval-Augmented Generation for Intelligent PDF Processing", International Journal for Research Trends and Innovation (www.ijrti.org), ISSN: 2456-3315, Vol. 11, Issue 5, page no. a796-a806, May-2026, Available: http://www.ijrti.org/papers/IJRTI2605099.pdf
Downloads: 000106
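The abstract compares OCR engines by Word Error Rate and mentions "quantitative output selection" among the three engines. The sketch below shows a word-level WER computation and one possible selection rule. The page does not state the selection criterion, so mean per-word confidence is used here purely as an assumption. `OcrResult`, `select_best_output`, and the sample values are hypothetical, and no real OCR engine is called.

```python
from dataclasses import dataclass
from typing import List, Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


@dataclass
class OcrResult:
    """Hypothetical container for one engine's output on one page."""
    engine: str
    text: str
    word_confidences: List[float]  # assumed to be normalised to [0, 1]


def mean_confidence(result: OcrResult) -> float:
    """Average per-word confidence; an empty output scores 0.0."""
    if not result.word_confidences:
        return 0.0
    return sum(result.word_confidences) / len(result.word_confidences)


def select_best_output(results: Sequence[OcrResult]) -> OcrResult:
    """Assumed selection rule: keep the engine with the highest mean confidence."""
    if not results:
        raise ValueError("at least one OCR result is required")
    return max(results, key=mean_confidence)


# Illustrative values only, not measurements from the paper.
candidates = [
    OcrResult("tesseract", "invoice totl 42", [0.9, 0.7, 0.8]),
    OcrResult("paddleocr", "invoice total 42", [0.95, 0.9, 0.85]),
]
best = select_best_output(candidates)
print(best.engine, word_error_rate("invoice total 42", best.text))
```

WER needs a ground-truth transcript, so it suits offline evaluation; a reference-free score such as confidence is one plausible option for choosing among engine outputs at query time.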
Publication Details
Published Paper ID: IJRTI2605099
Registration ID: 212689
Published In: Volume 11 Issue 5, May-2026
DOI (Digital Object Identifier):
Page No: a796-a806
Country: Raipur, Chhattisgarh, India
Research Area: Computer Science & Technology 
Publisher: IJ Publication
Published Paper URL: https://www.ijrti.org/viewpaperforall?paper=IJRTI2605099
Published Paper PDF: https://www.ijrti.org/papers/IJRTI2605099
