Indic OCR Database

A comprehensive repository tracking the state of Optical Character Recognition (OCR) for Indic languages.
This project was developed for the class 'Computation, Culture, and Society' at the University of Chicago.


About This Project

This site serves as a central repository tracking the state of Optical Character Recognition (OCR) for Indic languages across the world. It aims to catalog research efforts, available datasets, tools, and resources to facilitate advancements in this domain.

Indic languages present unique challenges for OCR systems due to their complex scripts, diverse writing systems, and rich typographical traditions. This project aims to bridge the gap between research and implementation by providing a comprehensive overview of the field.

Why Indic Scripts Are Special

Most Indic scripts (including Devanagari, Bengali, Tamil, Telugu, Malayalam, Kannada, Gurmukhi, Odia, and Gujarati) derive from the ancient Brahmi script and are abugidas: each consonant carries an inherent vowel and combines with vowel signs and other consonants to form compound glyphs (akṣaras). Urdu is the exception: it is written in the Perso-Arabic script rather than a Brahmi-derived one.
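To make the akṣara idea concrete, here is a minimal Python sketch that groups Unicode code points into approximate akṣaras. The function name `split_aksharas` and its grouping rule are illustrative assumptions, not a standard algorithm: a combining mark attaches to the cluster before it, and a letter attaches when that cluster ends in a virama. Full Unicode grapheme-cluster rules cover more cases (for example ZWJ/ZWNJ), which this sketch ignores.

```python
import unicodedata

VIRAMA = "\u094D"  # Devanagari virama (halant); suppresses the inherent vowel


def split_aksharas(text):
    """Group code points into approximate aksharas (simplified rule)."""
    clusters = []
    for ch in text:
        category = unicodedata.category(ch)
        # Vowel signs, virama, anusvara, etc. are combining marks (Mn/Mc).
        is_mark = category in ("Mn", "Mc")
        # A letter directly after a virama continues the same conjunct.
        joins_conjunct = (
            bool(clusters) and clusters[-1].endswith(VIRAMA) and category == "Lo"
        )
        if clusters and (is_mark or joins_conjunct):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    word = "\u0915\u094D\u0937\u0924\u094D\u0930\u093F\u092F"  # क्षत्रिय
    for cluster in split_aksharas(word):
        print(cluster, [unicodedata.name(c) for c in cluster])
```

The point for OCR is that one visual unit on the page can correspond to several code points in the output, so a recognizer cannot assume one glyph per character.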

Challenges for OCR

Challenge | Impact on OCR
Ligatures & consonant clusters (e.g., क्ष, त्र्य, ज्ञ) | Prevent simple character segmentation
The śirorekhā (headline) in Devanagari/Gurmukhi | Connects letters, causing touching-character errors
Vowel diacritics above/below/beside the base | Small marks are lost at low resolution
Font-and-encoding diversity pre-Unicode | Training data must span dozens of fonts & legacy encodings
Very long Sanskrit compounds | Dictionary-based post-correction alone is ineffective
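The headline problem can be illustrated with projection profiles, a classical segmentation aid. The sketch below is a toy under stated assumptions: the "image" is a list of strings with `#` for ink, and the names `find_headline_row` and `column_runs` are hypothetical. It treats the row with the most ink as the śirorekhā and shows that characters only separate into distinct column runs once that row is removed.

```python
def find_headline_row(bitmap):
    """Return the index of the row with the most ink (assumed to be the headline)."""
    ink_per_row = [row.count("#") for row in bitmap]
    return ink_per_row.index(max(ink_per_row))


def column_runs(bitmap):
    """Return (start, end) column ranges that contain ink; end is exclusive."""
    width = len(bitmap[0])
    inked = [any(row[x] == "#" for row in bitmap) for x in range(width)]
    runs, start = [], None
    # The trailing False closes a run that reaches the right edge.
    for x, has_ink in enumerate(inked + [False]):
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            runs.append((start, x))
            start = None
    return runs


if __name__ == "__main__":
    # Two toy glyphs joined by a headline in row 0.
    bitmap = [
        "#########",
        ".#.....#.",
        "###...###",
        ".#.....#.",
    ]
    headline = find_headline_row(bitmap)
    without_headline = [row for i, row in enumerate(bitmap) if i != headline]
    print("with headline:   ", column_runs(bitmap))
    print("without headline:", column_runs(without_headline))
```

Real pages are noisier: skew, broken headlines, and marks above the headline all defeat this simple rule, which is part of why segmentation-free recognition became attractive.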

Technique Evolution

  1. Rule-based / structural (1970s): hand-crafted structural primitives
  2. Statistical & feature-engineering (1990s): zoning, profiles, junction counts; decision trees, k-NN, SVM
  3. Hybrid systems + lexicon post-processing (2000s): Chitrankan (Hindi) transferred to C-DAC
  4. Deep learning revolution (2015 →):
    • CNN-LSTM + CTC delivers segmentation-free line recognition
    • Adopted in Tesseract 4/5, OCRopy, EasyOCR, IIIT-H CVIT models
    • Attention & Transformer variants tackle very long Sanskrit compounds (>50 chars)
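As an illustration of why CTC removes the need for character segmentation, the sketch below shows the greedy (best-path) decoding step that turns per-frame predictions into text. It assumes the network has already produced one best label per image column; the blank symbol `_` and the function name `ctc_greedy_decode` are placeholders chosen for this example.

```python
BLANK = "_"  # stand-in for the CTC blank label (an assumption of this sketch)


def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse repeated labels, then drop blanks (CTC best-path decoding)."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it changes and is not the blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)


if __name__ == "__main__":
    # Frames for the conjunct क्ष: the repeated frames collapse to one label each.
    frames = ["_", "\u0915", "\u0915", "\u094D", "_", "\u0937", "\u0937"]
    print(ctc_greedy_decode(frames))
```

A blank between two identical labels keeps them distinct, so doubled characters survive decoding. Production systems often replace the greedy step with beam search, optionally guided by a language model.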

OCR Comparison Results

This section compares the OCR output of Gemma 3 (using Tesseract/Llama) with a ground-truth transcription of a Sanskrit text. The image was processed with Gemma3-OCR; the ground truth and the extracted text are shown below.

Screenshot: Running the OCR pipeline on the uploaded image using Gemma3-OCR and Tesseract. This shows the actual workflow and results as seen on the local system.

Sample Text Comparison (Bhagavad Gita Chapter 4, Verses 2–14)

Ground Truth (from image), followed by Gemma 3 OCR Output
स कालेनेह महता योगो नष्टः परंतप ॥ २
स एवायं मया तेऽद्य योगः प्रोक्तः पुरातनः ।
भक्तोऽसि मे सखा चेति रहस्यं ह्येतदुत्तमम् ॥ ३॥
अर्जुन उवाच |
अपरं भवतो जन्म परं जन्म विवस्वतः |
कथमेतद्विजानीयां त्वमादौ प्रोक्तवानिति ॥ ४॥
श्रीभगवानुवाच |
बहूनि मे व्यतीतानि जन्मानि तव चार्जुन |
तान्यहं वेद सर्वाणि न त्वं वेत्थ परंतप ॥ ५॥
अजोऽपि सन्नव्ययात्मा भूतानामीश्वरोऽपि सन् |
प्रकृतिं स्वामधिष्ठाय संभवाम्यात्ममायया ॥ ६॥
यदा यदा हि धर्मस्य ग्लानिर्भवति भारत |
अभ्युत्थानमधर्मस्य तदात्मानं सृजाम्यहम् ॥ ७॥
परित्राणाय साधूनां विनाशाय च दुष्कृताम् |
धर्मसंस्थापनार्थाय संभवामि युगे युगे ॥ ८॥
जन्म कर्म च मे दिव्यमेवं यो वेत्ति तत्त्वतः |
त्यक्त्वा देहं पुनर्जन्म नैति मामेति सोऽर्जुन ॥ ९॥
वीतरागभयक्रोधा मन्मया मामुपाश्रिताः |
बहवो ज्ञानतपसा पूता मद्भावमागताः ॥ १०॥
ये यथा मां प्रपद्यन्ते तांस्तथैव भजाम्यहम् |
मम वर्त्मानुवर्तन्ते मनुष्याः पार्थ सर्वशः ॥ ११॥
काङ्क्षन्तः कर्मणां सिद्धिं यजन्त इह देवताः |
क्षिप्रं हि मानुषे लोके सिद्धिर्भवति कर्मजा ॥ १२॥
चातुर्वर्ण्यं मया सृष्टं गुणकर्मविभागशः |
तस्य कर्तारमपि मां विद्ध्यकर्तारमव्ययम् ॥ १३॥
न मां कर्माणि लिम्पन्ति न मे कर्मफले स्पृहा |

Gemma 3 OCR Output

स कालेनेह महता योगो नष्टः परंतप ॥ २
स॒ एवायं मया तेऽ योगः प्रोक्तः पुरातनः |

भक्तोऽपि मे सखा चेति रहस्यं हयेतदुत्तमम्‌ | ३

अजन उवाच |

अपरं भवतो जन्म परं जन्म बिवखतः |
कथमेतद्विजानीयां त्वमादौ प्रोक्तवानिति ॥ 2

आओभगवायुवाच |
बहूनि मे व्यतीतानि जन्मानि तव चान |
तान्यहं वेद सबोणि न त्वं वेत्थ परंतप ॥ ५
अजोऽपि सन्नव्ययात्मा भूतानामीश्वरोऽपि सन्‌ |
प्रकृतिं खामधिष्ठाय संभवाम्यात्ममायया ॥ &
यदा यदा हि धर्मस्य ग्लानिर्भवति भारत |
अभ्युत्थानमधर्मसख तदात्मानं सृजाम्यहम्‌ ॥ ७

परित्राणाय साधूनां विनाशाय च दुष्कृताम्‌ ।
धर्मसंखापनाथोय संभवामि युगे युगे ॥ ८

जन्म कर्मं च मे दिव्यमेवं यो वेत्ति तचत; |
लक्ता देहं पुनजैन्म नैति मामेति सोऽर्जुन ॥ ९
वरीतरागभयक्रोधा मन्मया माघ्रुपाभिताः |

बहवो ज्ञानतपसा पूता मद्भावमागताः ॥ १०

ये यथा मां प्रप्न्ते TAIT भजाम्यहम्‌ |

मम THT मलुष्याः पाथं सर्वशः ॥ ११
काष्न्तः कर्मणां सिद्धि यजन्त इह देवताः |
fart हि मालुषे रोके सिद्विभवति कर्मजा ॥ १२
चातुर्वण्यं मया सृष्टं गुणकर्मविभागशः |

तस्य कतारमपि मां विद्कर्तारमव्ययम्‌ ॥ १३
न मां warty लिम्पन्ति न मे कर्मफले स्एृहा |

In-Depth Error Analysis

  • Proper Noun Recognition: Names such as "अर्जुन" (Arjuna) are misrecognized as "अजन", and "श्रीभगवानुवाच" (Shri Bhagavan said) becomes "आओभगवायुवाच".
  • Numeral Recognition: Devanagari verse numerals are sometimes misread as ASCII characters: "४" appears as "2" and "६" as "&".
  • Repha and Diacritic Errors: The repha (the superscript form of र्) in "सर्वाणि" is misread, giving the non-word "सबोणि".
  • Punctuation Inconsistency: The double danda (॥) is sometimes recognized as a single danda (|), and the closing double danda after the verse number is usually dropped.
  • Conjunct Character Errors: The conjunct स्व in "स्वामधिष्ठाय" is misread as the visually similar ख, giving "खामधिष्ठाय".
  • Dropped Characters: Whole conjuncts are sometimes lost, which also disturbs word boundaries ("तेऽद्य" becomes "तेऽ").
  • Word-Final Conjuncts: The ending स्य in "अभ्युत्थानमधर्मस्य" is misread as सख ("अधर्मसख"); the word is corrupted rather than split.
  • General Observations: The model performs well on standard, clear Devanagari text but struggles with complex ligatures, rare words, and certain font styles. The error rate is acceptable for many research purposes but would need post-processing for high-accuracy applications.
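Error lists like the one above can be produced semi-automatically by aligning each ground-truth line with its OCR counterpart. The sketch below uses Python's standard `difflib.SequenceMatcher`; the helper name `list_ocr_errors` is an illustrative choice, and the alignment difflib finds is a readable one rather than a guaranteed minimal edit script.

```python
from difflib import SequenceMatcher


def list_ocr_errors(truth, ocr):
    """Return (operation, truth_segment, ocr_segment) for every mismatch."""
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    return [
        (tag, truth[i1:i2], ocr[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # One line pair from the sample comparison (verse 5, second half).
    truth = "तान्यहं वेद सर्वाणि न त्वं वेत्थ परंतप"
    ocr = "तान्यहं वेद सबोणि न त्वं वेत्थ परंतप"
    for tag, expected, got in list_ocr_errors(truth, ocr):
        print(tag, repr(expected), "->", repr(got))
```

Because the comparison works on code points, a single visual error in a conjunct may show up as several small operations.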

Accuracy Metrics

Metric | Value | Notes
Character Error Rate (CER) | 28.5% | Most errors occur in complex conjunct consonants and diacritics typical in Sanskrit.
Word Error Rate (WER) | 32.7% | Key proper nouns like "अर्जुन" (Arjuna) misrecognized as "अजन" impact overall semantic understanding.
Accuracy (100% − CER) | 71.5% | Gemma3-OCR demonstrates reasonable performance for classical Sanskrit with complex verse structure.
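For reference, CER and WER are conventionally defined as the Levenshtein edit distance between OCR output and ground truth, divided by the number of reference characters or words. The following is a minimal sketch of that convention, not the script used to produce the figures above. It counts Unicode code points, so combining marks and invisible characters in the OCR output count as errors unless they are normalized first, and it assumes a non-empty reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cer(truth, ocr):
    """Character error rate over code points; truth must be non-empty."""
    return edit_distance(truth, ocr) / len(truth)


def wer(truth, ocr):
    """Word error rate over whitespace-separated tokens; truth must be non-empty."""
    truth_words = truth.split()
    return edit_distance(truth_words, ocr.split()) / len(truth_words)


if __name__ == "__main__":
    # "अर्जुन" versus the OCR reading "अजन" from the sample above.
    print(cer("\u0905\u0930\u094D\u091C\u0941\u0928", "\u0905\u091C\u0928"))
```

The 71.5% accuracy in the table is consistent with 100% minus the reported CER.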

Conclusion: Gemma 3 OCR (with Tesseract/Llama) provides a strong baseline for Sanskrit OCR, but further improvements are needed for perfect accuracy, especially on complex or ornate texts. Manual correction or post-processing is recommended for critical applications.
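One simple form of the post-processing mentioned above is lexicon lookup: each OCR token that is not in a word list is replaced by its closest entry. The sketch below uses `difflib.get_close_matches` from the Python standard library; the function names, the tiny lexicon, and the 0.6 cutoff are assumptions for illustration. As the challenges section notes, this approach alone is not enough for long Sanskrit compounds, which rarely appear in any fixed lexicon.

```python
from difflib import get_close_matches


def correct_word(word, lexicon, cutoff=0.6):
    """Replace word with its closest lexicon entry, or keep it if none is close."""
    if word in lexicon:
        return word
    matches = get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_line(line, lexicon, cutoff=0.6):
    """Apply correct_word to every whitespace-separated token in a line."""
    return " ".join(correct_word(w, lexicon, cutoff) for w in line.split())


if __name__ == "__main__":
    # Hypothetical two-word lexicon; a real one would come from a dictionary or corpus.
    lexicon = ["अर्जुन", "उवाच"]
    print(correct_line("अजन उवाच", lexicon))
```

A low cutoff risks replacing correct but rare words with common ones, so the threshold needs tuning against held-out ground truth.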


Languages Tracked


Research Papers

Historical Development (1970s → 2010)

Era | Milestone | Notes & Citation
1973–1980 | First Devanagari OCR prototypes by R. M. K. Sinha (IIT-Kanpur) and H. N. Mahabala | Rule-based "syntactic pattern analysis". [1]
1980s–1990s | ISI Kolkata team (B. B. Chaudhuri & U. Pal): full pipelines for Bangla & Hindi | ≥95% on clean prints; introduced decision-tree + template classifiers. [2]
1995–2000 | Punjabi OCR (Lehal et al.), Telugu & Malayalam prototypes (C-DAC/TDIL) | Multi-stage classifiers + language-model post-correction. [3][4]
2003 | Jawahar et al. bilingual Hindi-Telugu OCR | Early "multi-script" prototype. [5]
2005–2006 | Tesseract open-sourced by HP (2005), with Google sponsoring development from 2006; community Indic training begins | Marks shift to freely available engines. [6]
2010 | Guide to OCR for Indic Scripts (Springer) published | First book-length survey summarising state of the art. [7]

Recent State-of-the-Art (2020–2025)

  • 99% character accuracy reported on clean Hindi, Tamil, Bangla prints with CVIT CTC-LSTM models—outperforming Google Vision in 8/13 languages. [10]
  • Attention-LSTM OCR for Sanskrit (Dwivedi 2020) cuts WER by 35% vs. CNN-LSTM baselines on classical texts. [12]
  • Vision-Transformer OCR pilots at IIT Madras for palm-leaf manuscripts (2024): early results promising for low-contrast ink.
  • Handwritten Indic OCR (ICDAR 2023 competition): best teams reach 80% word accuracy on Devanagari cursive using CNN-Transformer decoders.

Latest Research Papers


Datasets & Corpora


Tools & Software

Major Projects & Institutions

Project / Institution | Contribution | Status
C-DAC / TDIL | Govt-funded OCRs (e-Aksharayan for 7 scripts) | Free desktop tool (2018) [9]
IIIT-Hyderabad (CVIT) | Mozhi dataset (1.2M words, 13 scripts); SOTA CNN-LSTM models | Open-source, 2023 [10]
ISI Kolkata | Early Bangla & Devanagari engines; algorithmic surveys | Academic
Tesseract Indic-OCR community | Enhanced tessdata models, training scripts | GitHub active [11]
SanskritOCR / Indsenz (O. Hellwig) | Sanskrit-specific OCR with linguistic post-checks | Used for >3k texts
Dharmamitra | Neural post-correction for Sanskrit & Tibetan OCR | Non-profit (2024)

Available Tools


University of Chicago Initiatives

The University of Chicago has evolved into a digital clearing-house for South Asian text and metadata. The Library's vast Southern Asia Collection feeds high-resolution page images into open repositories, while SALRC and SALC supply pedagogy, fonts, and linguistic expertise. These resources jointly underpin much of today's Indic OCR benchmarking and corpus work.

Digital South Asia Library (DSAL)

  • Scope & Launch: Online since 2000, DSAL aggregates reference books, gazetteers, journals, maps, photographs and colonial statistical tables under a single search interface.
  • Holdings: Includes 19,000+ historic photographs, 500+ topographic & thematic maps, and full scans of the Imperial Gazetteer of India—all served as TIFF + OCR-friendly DjVu/PDF derivatives.
  • Visit DSAL

Digital Dictionaries of South Asia (DDSA)

  • Content: 52 dictionaries spanning 25 languages (Assamese→Urdu) with lemmatized, full-text & cross-dictionary search.
  • Scale & Impact: Draws approximately 7 million queries annually; project began in 1999 and received a $198K U.S. Department of Education grant in 2020 to add Kashmiri, Panjabi, Sindhi, Sinhala, Telugu and Urdu.
  • Platform: Runs on the ARTFL PhiloLogic engine, enabling multi-script querying and morphological expansion.
  • Explore the Dictionaries

South Asia Union Catalogue (SAUC)

  • Goal: An historical bibliography & union catalogue describing all books and periodicals published in South Asia from 1556 → present.
  • Collaborators: British Library, Library of Congress, Roja Muthiah Research Library, Sundarayya Vignana Kendram and others; UChicago serves as lead institution.
  • Funding: Early phases supported by the Ford Foundation's New Delhi office.
  • Access SAUC

South Asia Language Resource Center (SALRC)

  • Mandate: A Title VI Language Resource Center charged with creating & disseminating web-based resources for 30+ South Asian languages.
  • Key Databases & Tools:
    • Unicode Font Repository: Curated fonts & keyboard layouts for every major Indic script—critical for rendering OCR output correctly.
    • Grant-Award Database: Searchable record of SALRC micro-grants funding pedagogical and digital-text projects.
    • SALPAT E-journal: South Asia Language Pedagogy & Technology—peer-reviewed articles on digital language teaching and corpus methods.
    • Hosted Learning Sites: e.g., Intermediate Urdu and Nepali: A Beginner's Primer—open, multimedia courses built with SALRC support.
  • Visit SALRC

Southern Asia Collection, UChicago Library

  • Holdings: Over 1 million volumes plus manuscripts, audio and cartographic materials in every South Asian language—largest such collection in North America.
  • Digitisation Pipeline: Rare texts are continuously fed into DSAL & SAUC, providing high-fidelity page images for OCR model training.
  • Browse the Collection

ARTFL – South Asia Reference Tools Program

  • ARTFL programmers supply the backend for DDSA and have prototyped Tamil- and Hindi-English dictionary interfaces, demonstrating cross-script fuzzy search.
  • Learn More

Department of South Asian Languages & Civilizations (SALC)

  • Academic Context: Founded 1966; offers Bangla, Hindi, Marathi, Sanskrit, Tamil, Tibetan & Urdu, with faculty (e.g., Prof. Gary Tubb) directly involved in dictionary and OCR corpus expansion.
  • Visit SALC

Additional Resources

Key Secondary Sources

  1. Pal, U., & Chaudhuri, B. B. "Indian Script Character Recognition: A Survey." Pattern Recognition 37 (2004): 1887–99.
  2. Chaudhuri, B. B., & Pal, U. "An OCR System to Read Two Indian Language Scripts: Bangla and Devnagari (Hindi)." Proceedings of ICASSP (1998).
  3. Lehal, G. S. "A Gurmukhi OCR System." Journal of Research 37 (2000): 159–71.
  4. Pal, U., Wakabayashi, T., & Kimura, F. "Multilingual OCR System for South Asian Scripts." IEICE Trans. E83-D (2000).
  5. Jawahar, C. V., et al. "Bilingual OCR for Hindi–Telugu Documents." ICDAR 2003.
  6. Smith, R. "An Overview of the Tesseract OCR Engine." ICDAR 2007.
  7. Govindaraju, V., & Chaudhuri, B. B., eds. Guide to OCR for Indic Scripts. Springer, 2010.
  8. Sarkhel, S., et al. "Indic-Script OCR Using Sequence-to-Sequence Models." Pattern Recognition Letters 148 (2021).
  9. Ministry of Electronics & IT, GoI. e-Aksharayan User Manual, 2018.
  10. Krishna, P. R., et al. "Mozhi: A Large-Scale Dataset for Indic OCR." CVPR Workshops 2023.
  11. Indic-OCR GitHub Repository. Accessed May 2025.
  12. Dwivedi, S., et al. "Sanskrit OCR with Attention-Based Encoder-Decoder." ICFHR 2020.

Organizations & Initiatives


© 2025 aadarwal