Abstract:
Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models; however, it is not designed for real-time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation for Whisper-like models. Whisper-Streaming uses a local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3-second latency on an unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in a live-transcription service at a multilingual conference.
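The local agreement policy mentioned above can be illustrated with a short sketch: the recognizer is re-run on a growing audio buffer, and only the prefix on which two successive hypotheses agree is committed to the output. The code below is a minimal, hypothetical illustration of that idea (the class name, the plain word-string tokens, and the update interface are our assumptions, not the paper's actual implementation):

```python
def common_prefix(a, b):
    """Longest common prefix of two token lists."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


class LocalAgreement:
    """Illustrative sketch: commit only the prefix on which two
    successive transcription hypotheses agree."""

    def __init__(self):
        self.previous = []   # hypothesis from the previous update
        self.committed = []  # tokens already confirmed (never retracted)

    def update(self, hypothesis):
        # Confirm the agreed prefix beyond what is already committed.
        agreed = common_prefix(self.previous, hypothesis)
        newly_confirmed = agreed[len(self.committed):]
        self.committed.extend(newly_confirmed)
        self.previous = hypothesis
        return newly_confirmed


# Example: the second update confirms "hello world" because both
# hypotheses agree on it; the still-unstable tail is held back.
policy = LocalAgreement()
policy.update(["hello", "world", "thi"])
print(policy.update(["hello", "world", "this", "is"]))  # ['hello', 'world']
```

Because unstable tails are held back until a later hypothesis confirms them, the latency adapts itself to how quickly the model's output stabilizes rather than being fixed in advance.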
Introduction:
Whisper (Radford et al., 2022) is a recent state-of-the-art system for automatic speech recognition (ASR) in 97 languages and for translation from 96 languages into English. Whisper models are publicly available under the MIT license. However, the current public implementations of Whisper inference usually allow only offline processing of audio documents that are completely available at the time of processing, without any processing-time constraints. A real-time streaming mode is useful in certain situations, e.g. for live captioning: the source speech audio has to be processed while it is being recorded, and the transcripts or translations have to be delivered with a short additive latency of a few seconds. There are some implementations of Whisper for streaming, but their approach is rather naive: first record a 30-second audio segment, then process it. The latency of these methods is large, and the quality at segment boundaries is low, because simple content-unaware segmentation can split a word in the middle.
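To make the naive baseline concrete, the sketch below cuts a recording into fixed 30-second windows and transcribes each window independently with the openai-whisper package. The file name, model size, and language setting are assumptions for illustration; the point is that the window boundaries ignore the content, so a word straddling a boundary gets cut in half:

```python
import numpy as np
import whisper  # openai-whisper package (assumed to be installed)

SAMPLE_RATE = 16000
CHUNK_SECONDS = 30  # fixed, content-unaware window size

model = whisper.load_model("base")
audio = whisper.load_audio("talk.wav")  # hypothetical file, resampled to 16 kHz mono

# Naive streaming: transcribe each fixed 30-second segment in isolation.
# Nothing aligns the cuts with pauses or word boundaries.
chunk = SAMPLE_RATE * CHUNK_SECONDS
for start in range(0, len(audio), chunk):
    segment = audio[start:start + chunk].astype(np.float32)
    result = model.transcribe(segment, language="en")
    print(result["text"])
```

In a true streaming setting this also forces the listener to wait until each 30-second buffer is full before any text appears, which is the large latency criticized above.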