Models & Research | language data | indian languages | iit patna | bhashini

IIT Patna launches COIL-D to build Indian language corpora

May 11, 2026 | By LDS Team

Relevance Score: 6.8


Per a Technology Innovation Hub (TIH) IIT Patna call for proposals, the Centre of Indian Language Data (COIL-D) initiative is soliciting partners to build multilingual language resources for AI and NLP applications. The TIH document, tied to an MoU with the COIL-D chief principal investigator, lists tasks including text and voice corpora development, parallel corpora, and technology innovation to improve translations from Hindi to 17 Indian languages and Tamil to 3 Dravidian languages. The initiative is sponsored by the Ministry of Electronics and Information Technology (MeitY), according to the TIH call, and the Bhashini project pages identify Dr. Asif Ekbal (IIT Patna) as COIL-D chief investigator. TIH IIT Patna invited startups and corporates to submit expressions of interest, with an initial deadline listed as 15 March 2025, per the TIH announcement.

What happened

Per the TIH IIT Patna call for technology development, the Centre of Indian Language Data (COIL-D) project has an MoU with TIH IIT Patna and is soliciting expressions of interest to build language datasets and related services. The TIH notice lists deliverables that include corpora (text and voice) development, parallel corpora, and technology innovation for text processing and speech recognition. The TIH document states the project aims to improve translations from Hindi to 17 Indian languages and Tamil to 3 Dravidian languages, and identifies the initiative as sponsored by MeitY. The Bhashini project pages identify Dr. Asif Ekbal (IIT Patna) as the COIL-D chief investigator. The TIH announcement invited startups and corporates to participate and cited a first-cycle expression-of-interest deadline of 15 March 2025.

Editorial analysis - technical context

Public materials present COIL-D as a centralized repository and resource-creation effort for Indian languages, with stated deliverables covering both monolingual and parallel corpora. As a general industry pattern, large government-backed corpus projects tend to emphasize annotation standards, licensing clarity, and benchmarked evaluation sets, since these determine how useful the data is for multilingual models and machine translation. For practitioners, harmonized corpora and aligned parallel datasets reduce duplicated effort and can materially accelerate training of translation and ASR systems for low-resource Indian languages.
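To make "harmonized" and "aligned" concrete, the following is a minimal Python sketch of how a practitioner might bring sentence pairs from different sources into one schema before training. Everything in it is an assumption for illustration: the `SentencePair` fields, the `harmonize` helper, and the raw record keys are hypothetical and do not reflect any published COIL-D or Bhashini format.

```python
from dataclasses import dataclass
import unicodedata


@dataclass(frozen=True)
class SentencePair:
    """One aligned sentence pair. Field names are illustrative, not a COIL-D schema."""
    src_lang: str   # language code, e.g. "hi" for Hindi
    tgt_lang: str   # language code, e.g. "bn" for Bengali
    src_text: str
    tgt_text: str
    license: str    # placeholder; actual licensing terms are not yet known


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def harmonize(records: list[dict]) -> list[SentencePair]:
    """Map raw dict records to SentencePair, dropping empty and duplicate pairs."""
    seen = set()
    pairs = []
    for rec in records:
        pair = SentencePair(
            src_lang=rec["src_lang"].lower(),
            tgt_lang=rec["tgt_lang"].lower(),
            src_text=normalize(rec["src_text"]),
            tgt_text=normalize(rec["tgt_text"]),
            license=rec.get("license", "unknown"),
        )
        # Skip pairs where either side is empty after normalization.
        if not pair.src_text or not pair.tgt_text:
            continue
        # Deduplicate on languages and text, ignoring license metadata.
        key = (pair.src_lang, pair.tgt_lang, pair.src_text, pair.tgt_text)
        if key in seen:
            continue
        seen.add(key)
        pairs.append(pair)
    return pairs
```

Normalizing to NFC before deduplication matters for Indic scripts, where the same visible text can be encoded with different code point sequences. Real pipelines would likely add script checks and length-ratio filters on top of this.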

Industry context

Public reporting and government platforms place COIL-D within the broader Bhashini ecosystem, which the Government of India frames as a national language technology initiative. Ministry-backed national-language repositories often open access to vetted datasets and create procurement pathways for startups and research groups, which increases data availability for academic and commercial model development.

What to watch

Observers should track formal data licensing terms, annotation schemas, and the scope of languages and dialects included. Also watch for published benchmarks or leaderboards that would enable like-for-like comparisons across models trained or evaluated on COIL-D resources. If TIH or Bhashini publishes data access APIs or tooling, that will shape how practitioners integrate the resources into training and evaluation pipelines.

Key Points

1. A government-backed COIL-D repository centralizes Indian language data, reducing fragmentation and accelerating multilingual model training.
2. The TIH call lists parallel corpora and speech/text corpora, which directly support machine translation and ASR development for low-resource languages.
3. Public-sector datasets often hinge on licensing and annotation standards, so access terms will determine practitioner adoption and model reproducibility.

Scoring Rationale

This is a notable, government-backed dataset initiative that can materially affect multilingual NLP research and engineering for Indian languages. The impact depends on data scope, licensing, and tooling; those details will determine practitioner uptake.

Sources

Public references used for this report.


iitp.ac.in: Indian Institute of Technology Patna

tih-iitp.com: Call for Technology Development Phase -4 | FY 2025-2026

bhashini.gov.in: Anusandhan Mitra Translation Service Providers - Bhashini

