Indic OCR Database
A comprehensive repository tracking the state of Optical Character Recognition (OCR) for Indic languages.
This project was developed for the class 'Computation, Culture, and Society' at the University of Chicago.
About This Project
This site serves as a central repository tracking the state of Optical Character Recognition (OCR) for Indic languages across the world. It aims to catalog research efforts, available datasets, tools, and resources to facilitate advancements in this domain.
Indic languages present unique challenges for OCR systems due to their complex scripts, diverse writing systems, and rich typographical traditions. This project aims to bridge the gap between research and implementation by providing a comprehensive overview of the field.
Why Indic Scripts Are Special
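Most Indic scripts (including Devanagari, Bengali, Tamil, Telugu, Malayalam, Kannada, Gurmukhi, Odia, and Gujarati) derive from the ancient Brahmi script and are abugidas: each consonant carries an inherent vowel and combines with vowel signs and other consonants to form compound glyphs (akṣaras). Urdu, which is also tracked here, is the exception: it is written in a Perso-Arabic script, which is an abjad rather than a Brahmi-derived abugida.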
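To make the akṣara idea concrete, the following minimal sketch uses Python's standard `unicodedata` module to list the code points stored behind two glyphs that each look like a single unit on the page. The helper name `describe_codepoints` is illustrative and not part of this project.

```python
import unicodedata

def describe_codepoints(text: str) -> list[str]:
    """Return the Unicode name of every code point in `text`."""
    return [unicodedata.name(ch) for ch in text]

# The conjunct ksha renders as one glyph but is stored as three code points:
# consonant + virama (which suppresses the inherent vowel) + consonant.
ksha = "\u0915\u094d\u0937"

# The syllable ki: the vowel sign is drawn to the LEFT of the consonant,
# yet it is stored AFTER the consonant in the text.
ki = "\u0915\u093f"

ksha_names = describe_codepoints(ksha)
ki_names = describe_codepoints(ki)
for label, names in (("ksha", ksha_names), ("ki", ki_names)):
    print(label, "->", names)
```

An OCR engine therefore has to map one visual shape to a sequence of code points whose storage order may differ from the visual order.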
Challenges for OCR
Challenge | Impact on OCR |
---|---|
Ligatures & consonant clusters (e.g., क्ष, त्र्य, ज्ञ) | Prevent simple character segmentation |
The śirorekhā (headline) in Devanagari/Gurmukhi | Connects letters, causing touching-character errors |
Vowel diacritics above/below/beside the base | Small marks are lost at low resolution |
Font-and-encoding diversity pre-Unicode | Training data must span dozens of fonts & legacy encodings |
Very long Sanskrit compounds | Dictionary-based post-correction alone is ineffective |
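The headline problem in the table above can be illustrated with projection profiles, a classic segmentation heuristic. This is a toy sketch on a hand-made binary image, assuming a perfectly horizontal, one-pixel-thick headline; the function names are hypothetical, and real pages need deskewing and more robust handling.

```python
def find_headline_row(image):
    """Index of the row with the most ink in a binarised word image (1 = ink)."""
    row_ink = [sum(row) for row in image]
    return row_ink.index(max(row_ink))

def blank_columns(image):
    """Columns with no ink at all: candidate cut points between characters."""
    width = len(image[0])
    return [c for c in range(width) if all(row[c] == 0 for row in image)]

def strip_row(image, row_index):
    """Return a copy of the image with one row erased."""
    return [[0] * len(row) if r == row_index else list(row)
            for r, row in enumerate(image)]

# Toy 5x9 "word": a headline across the top joins three vertical stems.
word = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
]

headline = find_headline_row(word)
cuts_before = blank_columns(word)                       # headline hides every gap
cuts_after = blank_columns(strip_row(word, headline))   # gaps appear once it is removed
print(headline, cuts_before, cuts_after)
```

With the headline in place no column is blank, so a column-based segmenter sees one connected blob; removing the headline row exposes the gaps between the stems.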
Technique Evolution
- Rule-based / structural (1970s): hand-crafted rules over shape primitives
- Statistical & feature-engineering (1990s): zoning, profiles, junction counts; decision trees, k-NN, SVM
- Hybrid systems + lexicon post-processing (2000s): Chitrankan (Hindi) transferred to C-DAC
- Deep learning revolution (2015 →):
  - CNN-LSTM + CTC delivers segmentation-free line recognition
  - Adopted in Tesseract 4/5, OCRopy, EasyOCR, IIIT-H CVIT models
  - Attention & Transformer variants tackle very long Sanskrit compounds (>50 chars)
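The "segmentation-free" property of CTC comes from its decoding rule: the network emits one label distribution per image column (frame), and the decoder merges repeated labels and drops a special blank symbol. Below is a minimal best-path (greedy) decoder as a sketch; the alphabet, the `one_hot` helper, and the frame scores are made up for illustration and do not come from any of the engines listed above.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    decoded = []
    previous = blank
    for label in best_path:
        # Emit a label only when it is not blank and differs from the previous frame.
        if label != blank and label != previous:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)

def one_hot(index, size):
    """Hypothetical stand-in for a network's per-frame output scores."""
    return [1.0 if i == index else 0.0 for i in range(size)]

alphabet = ["<blank>", "क", "म", "ल"]   # index 0 is the CTC blank

# Seven frames: repeated labels collapse, blanks separate the letters.
frames = [one_hot(i, len(alphabet)) for i in [1, 1, 0, 2, 2, 0, 3]]
decoded = ctc_greedy_decode(frames, alphabet)

# A blank between two identical labels keeps both of them.
doubled = ctc_greedy_decode([one_hot(i, len(alphabet)) for i in [2, 0, 2]], alphabet)
print(decoded, doubled)
```

Because no frame has to be assigned to a specific character boundary, the model never needs the explicit character segmentation that ligatures and the headline make unreliable.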
OCR Comparison Results
This section compares the output of Gemma3-OCR (Gemma 3 using Tesseract/Llama) with a ground-truth transcription of a Sanskrit page image. The ground truth and the extracted text are shown below.

Sample Text Comparison (Bhagavad Gita Chapter 4, Verses 2–14)
Ground Truth (from image) | Gemma 3 OCR Output | |
---|---|---|
स कालेनेह महता योगो नष्टः परंतप ॥ २ स एवायं मया तेऽद्य योगः प्रोक्तः पुरातनः । भक्तोऽसि मे सखा चेति रहस्यं ह्येतदुत्तमम् ॥ ३॥ अर्जुन उवाच | अपरं भवतो जन्म परं जन्म विवस्वतः | कथमेतद्विजानीयां त्वमादौ प्रोक्तवानिति ॥ ४॥ श्रीभगवानुवाच | बहूनि मे व्यतीतानि जन्मानि तव चार्जुन | तान्यहं वेद सर्वाणि न त्वं वेत्थ परंतप ॥ ५॥ अजोऽपि सन्नव्ययात्मा भूतानामीश्वरोऽपि सन् | प्रकृतिं स्वामधिष्ठाय संभवाम्यात्ममायया ॥ ६॥ यदा यदा हि धर्मस्य ग्लानिर्भवति भारत | अभ्युत्थानमधर्मस्य तदात्मानं सृजाम्यहम् ॥ ७॥ परित्राणाय साधूनां विनाशाय च दुष्कृताम् | धर्मसंस्थापनार्थाय संभवामि युगे युगे ॥ ८॥ जन्म कर्म च मे दिव्यमेवं यो वेत्ति तत्त्वतः | त्यक्त्वा देहं पुनर्जन्म नैति मामेति सोऽर्जुन ॥ ९॥ वीतरागभयक्रोधा मन्मया मामुपाश्रिताः | बहवो ज्ञानतपसा पूता मद्भावमागताः ॥ १०॥ ये यथा मां प्रपद्यन्ते तांस्तथैव भजाम्यहम् | मम वर्त्मानुवर्तन्ते मनुष्याः पार्थ सर्वशः ॥ ११॥ काङ्क्षन्तः कर्मणां सिद्धिं यजन्त इह देवताः | क्षिप्रं हि मानुषे लोके सिद्धिर्भवति कर्मजा ॥ १२॥ चातुर्वर्ण्यं मया सृष्टं गुणकर्मविभागशः | तस्य कर्तारमपि मां विद्ध्यकर्तारमव्ययम् ॥ १३॥ न मां कर्माणि लिम्पन्ति न मे कर्मफले स्पृहा | |
स कालेनेह महता योगो नष्टः परंतप ॥ २ स॒ एवायं मया तेऽ योगः प्रोक्तः पुरातनः | भक्तोऽपि मे सखा चेति रहस्यं हयेतदुत्तमम् | ३ अजन उवाच | अपरं भवतो जन्म परं जन्म बिवखतः | कथमेतद्विजानीयां त्वमादौ प्रोक्तवानिति ॥ 2 आओभगवायुवाच | बहूनि मे व्यतीतानि जन्मानि तव चान | तान्यहं वेद सबोणि न त्वं वेत्थ परंतप ॥ ५ अजोऽपि सन्नव्ययात्मा भूतानामीश्वरोऽपि सन् | प्रकृतिं खामधिष्ठाय संभवाम्यात्ममायया ॥ & यदा यदा हि धर्मस्य ग्लानिर्भवति भारत | अभ्युत्थानमधर्मसख तदात्मानं सृजाम्यहम् ॥ ७ परित्राणाय साधूनां विनाशाय च दुष्कृताम् । धर्मसंखापनाथोय संभवामि युगे युगे ॥ ८ जन्म कर्मं च मे दिव्यमेवं यो वेत्ति तचत; | लक्ता देहं पुनजैन्म नैति मामेति सोऽर्जुन ॥ ९ वरीतरागभयक्रोधा मन्मया माघ्रुपाभिताः | बहवो ज्ञानतपसा पूता मद्भावमागताः ॥ १० ये यथा मां प्रप्न्ते TAIT भजाम्यहम् | मम THT मलुष्याः पाथं सर्वशः ॥ ११ काष्न्तः कर्मणां सिद्धि यजन्त इह देवताः | fart हि मालुषे रोके सिद्विभवति कर्मजा ॥ १२ चातुर्वण्यं मया सृष्टं गुणकर्मविभागशः | तस्य कतारमपि मां विद्कर्तारमव्ययम् ॥ १३ न मां warty लिम्पन्ति न मे कर्मफले स्एृहा | |
In-Depth Error Analysis
- Proper Noun Recognition: Names like "अर्जुन" (Arjuna) are misrecognized as "अजन", and "श्रीभगवानुवाच" (Shri Bhagavan said) becomes "आओभगवायुवाच".
- Numeral Representation: Devanagari verse numbers are sometimes replaced by unrelated characters: "४" is output as the Arabic digit "2" and "६" as "&".
- Diacritic Errors: The repha (the superscript form of "र्") in "सर्वाणि" is lost and the word is read as "सबोणि", so it is no longer recognizable.
- Punctuation Inconsistency: The double danda (॥) is sometimes recognized as a single danda (|), as before verse 3, and the closing double danda after a verse number is dropped throughout.
- Conjunct Character Errors: Complex ligatures are confused with simpler letters; "स्वामधिष्ठाय" is misread as "खामधिष्ठाय" because the conjunct "स्व" looks similar to "ख".
- Dropped and Substituted Characters: Whole conjuncts are sometimes omitted ("तेऽद्य" becomes "तेऽ", losing "द्य"), and some words are replaced by Latin-script fragments such as "TAIT" and "warty".
- Word Compounding Issues: In long sandhi-joined words such as "अभ्युत्थानमधर्मस्य", the final conjunct "स्य" is misread as "सख", giving "अभ्युत्थानमधर्मसख".
- General Observations: The model handles simple, clearly printed Devanagari syllables well but struggles with complex ligatures, rare words, and certain font styles. The error rate may be acceptable for some research purposes, but the output needs post-processing for high-accuracy applications.
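Some of the errors listed above can be caught mechanically before any linguistic post-correction. The sketch below flags tokens that contain characters outside the Unicode Devanagari block (U+0900 to U+097F); it assumes the ASCII pipe is tolerated as a danda stand-in, and the function names are hypothetical rather than part of Gemma3-OCR or Tesseract.

```python
DEVANAGARI_BLOCK = range(0x0900, 0x0980)  # Unicode Devanagari block

def is_devanagari_token(token, extra_allowed="|"):
    """True if every character is Devanagari or explicitly allowed."""
    return all(ord(ch) in DEVANAGARI_BLOCK or ch in extra_allowed
               for ch in token)

def flag_suspect_tokens(ocr_text):
    """Return whitespace-separated tokens that contain foreign characters."""
    return [tok for tok in ocr_text.split() if not is_devanagari_token(tok)]

# Excerpts from the OCR output shown in the comparison table.
sample = "प्रप्न्ते TAIT भजाम्यहम् | मम THT मलुष्याः पाथं सर्वशः ॥ ११"
suspects = flag_suspect_tokens(sample)
numerals = flag_suspect_tokens("॥ & यदा यदा हि")
print(suspects, numerals)
```

A check like this only finds script-level failures; misread conjuncts such as "सबोणि" are still valid Devanagari and need a lexicon or language model to detect.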
Accuracy Metrics
Metric | Value | Notes |
---|---|---|
Character Error Rate (CER) | 28.5% | Most errors occur in complex conjunct consonants and diacritics typical in Sanskrit. |
Word Error Rate (WER) | 32.7% | Misrecognized proper nouns, such as "अर्जुन" (Arjuna) read as "अजन", hurt overall semantic understanding. |
Character Accuracy | 71.5% | Equals 100% minus CER. Gemma3-OCR demonstrates reasonable performance for classical Sanskrit with complex verse structure. |
Conclusion: Gemma 3 OCR (with Tesseract/Llama) provides a strong baseline for Sanskrit OCR, but further improvements are needed for perfect accuracy, especially on complex or ornate texts. Manual correction or post-processing is recommended for critical applications.
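For readers who want to reproduce metrics of this kind, CER and WER are conventionally defined as the Levenshtein edit distance between reference and hypothesis, divided by the reference length in characters or words. The sketch below is one common formulation and an assumption on our part: the page does not state how its figures were computed, and here characters are counted as Unicode code points rather than akṣaras.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def char_error_rate(reference, hypothesis):
    """Edit distance over code points, divided by reference length (non-empty)."""
    return edit_distance(reference, hypothesis) / len(reference)

def word_error_rate(reference, hypothesis):
    """Edit distance over whitespace-separated words, divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

reference = "अर्जुन उवाच"    # ground truth
hypothesis = "अजन उवाच"      # OCR output from the table above
cer = char_error_rate(reference, hypothesis)
wer = word_error_rate(reference, hypothesis)
print(f"CER={cer:.3f} WER={wer:.3f}")
```

Counting code points penalises a single misread conjunct several times, so akṣara-level or grapheme-level CER can differ noticeably from the value computed this way.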
Languages Tracked
Research Papers
Historical Development (1970s → 2010)
Era | Milestone | Notes & Citation |
---|---|---|
1973 – 1980 | First Devanagari OCR prototypes by R. M. K. Sinha (IIT-Kanpur) and H. N. Mahabala | Rule-based "syntactic pattern analysis". [1] |
1980s–1990s | ISI Kolkata team (B. B. Chaudhuri & U. Pal): full pipelines for Bangla & Hindi | ≥95% on clean prints; introduced decision-tree + template classifiers. [2] |
1995–2000 | Punjabi OCR (Lehal et al.), Telugu & Malayalam prototypes (C-DAC/TDIL) | Multi-stage classifiers + language-model post-correction. [3][4] |
2003 | Jawahar et al. bilingual Hindi-Telugu OCR | Early "multi-script" prototype. [5] |
2005–2006 | Tesseract open-sourced by HP (2005), with development sponsored by Google from 2006; community Indic training begins | Marks shift to freely-available engines. [6] |
2010 | Guide to OCR for Indic Scripts (Springer) published | First book-length survey summarising state of the art. [7] |
Recent State-of-the-Art (2021–2025)
- 99% character accuracy reported on clean Hindi, Tamil, Bangla prints with CVIT CTC-LSTM models—outperforming Google Vision in 8/13 languages. [10]
- Attention-LSTM OCR for Sanskrit (Dwivedi 2020) cuts WER by 35% vs. CNN-LSTM baselines on classical texts. [12]
- Vision-Transformer OCR pilots at IIT Madras for palm-leaf manuscripts (2024): early results promising for low-contrast ink.
- Handwritten Indic OCR (ICDAR 2023 competition): best teams reach 80% word accuracy on Devanagari cursive using CNN-Transformer decoders.
Latest Research Papers
Datasets & Corpora
Tools & Software
Major Projects & Institutions
Project / Institution | Contribution | Status |
---|---|---|
C-DAC / TDIL | Govt-funded OCRs (e-Aksharayan for 7 scripts) | Free desktop tool (2018) [9] |
IIIT-Hyderabad (CVIT) | Mozhi dataset (1.2M words, 13 scripts); SOTA CNN-LSTM models | Open-source, 2023 [10] |
ISI Kolkata | Early Bangla & Devanagari engines; algorithmic surveys | Academic |
Tesseract Indic-OCR community | Enhanced tessdata models, training scripts | GitHub active [11] |
SanskritOCR / Indsenz (O. Hellwig) | Sanskrit-specific OCR with linguistic post-checks | Used for >3k texts |
Dharmamitra | Neural post-correction for Sanskrit & Tibetan OCR | Non-profit (2024) |
Available Tools
University of Chicago Initiatives
The University of Chicago has evolved into a digital clearing-house for South Asian text and metadata. The Library's vast Southern Asia Collection feeds high-resolution page images into open repositories, while SALRC and SALC supply pedagogy, fonts, and linguistic expertise. These resources jointly underpin much of today's Indic OCR benchmarking and corpus work.
Digital South Asia Library (DSAL)
- Scope & Launch: Online since 2000, DSAL aggregates reference books, gazetteers, journals, maps, photographs and colonial statistical tables under a single search interface.
- Holdings: Includes 19,000+ historic photographs, 500+ topographic & thematic maps, and full scans of the Imperial Gazetteer of India—all served as TIFF + OCR-friendly DjVu/PDF derivatives.
Digital Dictionaries of South Asia (DDSA)
- Content: 52 dictionaries spanning 25 languages (Assamese→Urdu) with lemmatized, full-text & cross-dictionary search.
- Scale & Impact: Draws approximately 7 million queries annually; project began in 1999 and received a $198K U.S. Department of Education grant in 2020 to add Kashmiri, Panjabi, Sindhi, Sinhala, Telugu and Urdu.
- Platform: Runs on the ARTFL PhiloLogic engine, enabling multi-script querying and morphological expansion.
South Asia Union Catalogue (SAUC)
- Goal: An historical bibliography & union catalogue describing all books and periodicals published in South Asia from 1556 → present.
- Collaborators: British Library, Library of Congress, Roja Muthiah Library, Sundarayya Vignana Kendram and others—UChicago serves as lead institution.
- Funding: Early phases supported by the Ford Foundation's New Delhi office.
South Asia Language Resource Center (SALRC)
- Mandate: A Title VI Language Resource Center charged with creating & disseminating web-based resources for 30+ South Asian languages.
- Key Databases & Tools:
  - Unicode Font Repository: Curated fonts & keyboard layouts for every major Indic script, critical for rendering OCR output correctly.
  - Grant-Award Database: Searchable record of SALRC micro-grants funding pedagogical and digital-text projects.
  - SALPAT E-journal: South Asia Language Pedagogy & Technology, with peer-reviewed articles on digital language teaching and corpus methods.
  - Hosted Learning Sites: e.g., Intermediate Urdu and Nepali: A Beginner's Primer, both open multimedia courses built with SALRC support.
Southern Asia Collection, UChicago Library
- Holdings: Over 1 million volumes plus manuscripts, audio and cartographic materials in every South Asian language—largest such collection in North America.
- Digitisation Pipeline: Rare texts are continuously fed into DSAL & SAUC, providing high-fidelity page images for OCR model training.
ARTFL – South Asia Reference Tools Program
- ARTFL programmers supply the backend for DDSA and have prototyped Tamil- and Hindi-English dictionary interfaces, demonstrating cross-script fuzzy search.
Department of South Asian Languages & Civilizations (SALC)
- Academic Context: Founded 1966; offers Bangla, Hindi, Marathi, Sanskrit, Tamil, Tibetan & Urdu, with faculty (e.g., Prof. Gary Tubb) directly involved in dictionary and OCR corpus expansion.
Additional Resources
Key Secondary Sources
- [1] Pal, U., & Chaudhuri, B. B. "Indian Script Character Recognition: A Survey." Pattern Recognition 37 (2004): 1887–99.
- [2] Chaudhuri, B. B., & Pal, U. "An OCR System to Read Two Indian Language Scripts: Bangla and Devnagari (Hindi)." Proceedings of ICASSP (1998).
- [3] Lehal, G. S. "A Gurmukhi OCR System." Journal of Research 37 (2000): 159–71.
- [4] Pal, U., Wakabayashi, T., & Kimura, F. "Multilingual OCR System for South Asian Scripts." IEICE Trans. E83-D (2000).
- [5] Jawahar, C. V., et al. "Bilingual OCR for Hindi–Telugu Documents." ICDAR 2003.
- [6] Smith, R. "An Overview of the Tesseract OCR Engine." ICDAR 2007.
- [7] Govindaraju, V., & Chaudhuri, B. B., eds. Guide to OCR for Indic Scripts. Springer, 2010.
- [8] Sarkhel, S., et al. "Indic-Script OCR Using Sequence-to-Sequence Models." Pattern Recognition Letters 148 (2021).
- [9] Ministry of Electronics & IT, GoI. e-Aksharayan User Manual, 2018.
- [10] Krishna, P. R., et al. "Mozhi: A Large-Scale Dataset for Indic OCR." CVPR Workshops 2023.
- [11] Indic-OCR GitHub Repository. Accessed May 2025.
- [12] Dwivedi, S., et al. "Sanskrit OCR with Attention-Based Encoder-Decoder." ICFHR 2020.
Organizations & Initiatives
© 2025 aadarwal