IIT Patna launches COIL-D to build Indian language corpora

Per a Technology Innovation Hub (TIH) IIT Patna call for proposals, the Centre of Indian Language Data (COIL-D) initiative is soliciting partners to build multilingual language resources for AI and NLP applications. The TIH document, tied to an MoU with the COIL-D chief principal investigator, lists tasks including text and voice corpora development, parallel corpora, and technology innovation to improve translations from Hindi to 17 Indian languages and from Tamil to 3 Dravidian languages. The initiative is sponsored by the Ministry of Electronics and Information Technology (MeitY), according to the TIH call, and the Bhashini project pages identify Dr. Asif Ekbal (IIT Patna) as COIL-D chief investigator. TIH IIT Patna invited startups and corporates to submit expressions of interest, with an initial deadline listed as 15 March 2025, per the TIH announcement.
What happened
Per the TIH IIT Patna call for technology development, the Centre of Indian Language Data (COIL-D) project has an MoU with TIH IIT Patna and is soliciting expressions of interest to build language datasets and related services. The TIH notice lists deliverables that include corpora (text and voice) development, parallel corpora, and technology innovation for text processing and speech recognition. The TIH document states that the project aims to improve translations from Hindi to 17 Indian languages and from Tamil to 3 Dravidian languages, and identifies the initiative as sponsored by MeitY. The Bhashini project pages identify Dr. Asif Ekbal (IIT Patna) as the COIL-D chief investigator. The TIH announcement invited startups and corporates to participate and cited a first-cycle expression-of-interest deadline of 15 March 2025.
Editorial analysis - technical context
COIL-D is presented in public materials as a centralized repository and resource-creation effort for Indian languages, with stated deliverables focused on both monolingual and parallel corpora. As a general industry pattern, large government-backed corpus projects typically emphasize annotation standards, licensing clarity, and benchmarked evaluation sets to maximize downstream utility for multilingual models and machine translation. For practitioners, harmonized corpora and aligned parallel datasets reduce duplicated effort and can materially accelerate training of translation and automatic speech recognition (ASR) systems for low-resource Indian languages.
Industry context
Public reporting and government platforms place COIL-D within the broader Bhashini ecosystem, which the Government of India frames as a national language technology initiative. More broadly, national-language repositories backed by ministries often open access to vetted datasets and create procurement pathways for startups and research groups, increasing data availability for academic and commercial model development.
What to watch
Observers should track formal data licensing terms, annotation schemas, and the scope of languages and dialects included. Also watch for published benchmarks or leaderboards that would enable like-for-like comparisons across models using COIL-D resources. If TIH or Bhashini publishes data access APIs or tooling, that will affect how practitioners integrate the resources into training and evaluation pipelines.
Scoring Rationale
This is a notable, government-backed dataset initiative that can materially affect multilingual NLP research and engineering for Indian languages. The impact depends on data scope, licensing, and tooling; those details will determine practitioner uptake.