The task of automatic speech recognition (ASR) and spoken language understanding embodies almost all the elements of artificial intelligence (AI). When ubiquitously available, reliable ASR will be a key enabler of robust intelligence research in: spoken dialog systems for human-computer interactions; information integration research in content-based multimedia; and search and access to oral history archives. ASR will also be a fundamental component of speech science and technology to enable research in children's cognitive development, linguistics, smart health, elderly care, education, and (broadly) the machine-aided study of behavioral and social dynamics.
This project, developed after extensive consultations with the speech and language research community, is substantially revising the Kaldi open-source toolkit to (a) make speech recognition more accessible both for beginners in speech recognition and for researchers in other fields, (b) leverage existing deep learning frameworks (primarily PyTorch) to increase its flexibility, (c) create new user training materials, and (d) continue to enhance the toolkit, so as to support the growth of and cooperation within the community.
The project implements all core Kaldi functions (e.g., the lattice-free maximum mutual information training objective) natively in generic AI/deep learning frameworks, primarily PyTorch, so that associated advances in deep learning (e.g., novel optimization algorithms) can be seamlessly leveraged. Furthermore, the project incorporates automatic differentiation through finite state transducers, a core Kaldi feature responsible for its state-of-the-art performance, permitting true end-to-end training of ASR systems. These and other enhancements will make it possible to achieve two currently incompatible goals: incorporating structured external knowledge (e.g., dialog flow models, finite state grammars, pronunciation lexicons) into fully neural ASR systems, and end-to-end training of a hybrid ASR system via backpropagation. Other goals of this proposal include the provision of efficient yet user-friendly data preparation and model management tools for large-scale training of ASR systems, in addition to developing capabilities for robust conversation analysis and speaker diarization needed by researchers who use ASR as a tool for other scientific inquiries.
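To make the idea of differentiating through a finite state structure concrete, the following is a minimal sketch in PyTorch, not the project's implementation. It assumes a toy acyclic acceptor whose arcs carry learnable log-domain scores; the names `ARCS`, `NUM_STATES`, `FINAL_STATE`, and `total_log_score` are hypothetical and chosen for illustration only. The forward recursion sums over all paths in the log semiring, and autograd then returns the gradient of the total score with respect to each arc score, which equals that arc's posterior probability.

```python
import torch

# Hypothetical toy acceptor: states 0..3, state 0 is the start, state 3 is final.
# Each arc is (source_state, destination_state). State numbering is assumed to be
# topological, i.e. every arc goes from a lower-numbered to a higher-numbered state.
ARCS = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
NUM_STATES = 4
FINAL_STATE = 3


def total_log_score(arc_scores: torch.Tensor) -> torch.Tensor:
    """Return the log-sum of the scores of all paths from state 0 to FINAL_STATE."""
    alpha = [None] * NUM_STATES
    alpha[0] = arc_scores.new_zeros(())  # log(1) for the start state
    for state in range(1, NUM_STATES):
        # Forward recursion: combine every arc that enters this state.
        incoming = [
            alpha[src] + arc_scores[i]
            for i, (src, dst) in enumerate(ARCS)
            if dst == state
        ]
        alpha[state] = torch.logsumexp(torch.stack(incoming), dim=0)
    return alpha[FINAL_STATE]


# One learnable log-domain score per arc (all zeros, so all paths are equally likely).
scores = torch.zeros(len(ARCS), requires_grad=True)
total = total_log_score(scores)
total.backward()

# d(total)/d(score_i) is the posterior probability that a path uses arc i.
posteriors = scores.grad
print(total.item(), posteriors.tolist())
```

In a real system the arc scores would be produced by a neural network rather than stored as a free parameter, so the same backward pass would carry the gradient into the network's weights.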