Microsoft has announced a Speech Corpus covering three Indian languages. The company wants to help researchers by offering speech training and test data for Telugu, Tamil, and Gujarati. The dataset includes audio recordings and their corresponding transcripts.
The Speech Corpus is provided through Microsoft Research’s Open Data initiative. The initiative will continue to offer a collection of free datasets to advance research in areas such as natural language processing, AI, domain-specific sciences, speech recognition, and computer vision.
According to the software giant, digital text, speech, and language resources are scarce for many languages. Microsoft Research wants to support the development of large machine learning models for as many vernacular languages as possible. The tech firm is working across the world to support a number of local languages.
By addressing the lack of data, Microsoft aims to catalyze the development of machine learning models. Making data available for low-resource languages is meant to support the ecosystem of researchers, companies, and academia working on Indian language models. Microsoft expects the effort to accelerate progress that benefits Indian end users.
The Indian language Speech Corpus was analyzed at ‘Interspeech 2018’, the largest conference on the technology of spoken language processing. Microsoft’s Automated Speech Recognition (ASR) system was among the top performers in a low-resource speech recognition challenge. Microsoft wants to support India’s growing digital literacy needs by enabling a multilingual digital world.
The company is among the pioneers in supporting Indian languages. Microsoft launched Project Bhasha in 1998, allowing users to enter localized text with its Indian language input tool. The Indian language Speech Corpus is driven by a push to reduce language barriers. The company wants to enable Indians to use the full potential of the internet.