This report presents the technical details of the OxfordVGG team's submission to the EGO4D Audio-Visual (AV) Automatic Speech Recognition Challenge 2023. We present WhisperX, a system for efficient speech transcription of long-form audio with word-level time alignment, along with two publicly available text normalisers. Our final submission obtained a Word Error Rate (WER) of 56.0% on the challenge test set, ranking 1st on the leaderboard.
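WER, the metric used for the challenge leaderboard, is the word-level edit distance between the hypothesis and the reference transcript, divided by the reference length. A minimal illustrative sketch (not the challenge's official scorer, which additionally applies text normalisation before scoring):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling 1-D dynamic program over the standard edit-distance table.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(prev + (r != h),  # substitution (cost 0 on a match)
                      d[j] + 1,         # deletion
                      d[j - 1] + 1)     # insertion
            prev, d[j] = d[j], cur
    return d[len(hyp)] / max(len(ref), 1)
```

A WER of 56.0% thus means the total number of word substitutions, insertions, and deletions equals 56% of the reference word count.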

Speech recognition has long been a fundamental challenge in audio processing, aiming to convert speech waveforms into textual representations. In recent years, Deep Neural Networks (DNNs) have significantly advanced the field by improving the performance of speech recognition systems. The availability of web-scale datasets and advances in semi- or unsupervised learning techniques have further propelled speech recognisers to new heights. Notably, Whisper [14] shows that even a simple encoder-decoder architecture can generalise well when trained on 680,000 hours of data. However, Whisper's input window is limited to audio segments of only 30 seconds, so it faces challenges when transcribing longer audio files. Additionally, due to its sequential decoding approach, Whisper is susceptible to issues such as hallucinations and repetitive outputs, challenges similar to those encountered in auto-regressive language generation tasks.
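The 30-second constraint means a long recording must be cut into windows before transcription. A naive fixed-stride chunker (an illustrative sketch, not Whisper's own preprocessing) makes the limitation concrete: boundaries fall at arbitrary points and can split words mid-utterance, which motivates the VAD-based segmentation described next.

```python
def fixed_windows(num_samples: int, sample_rate: int = 16_000,
                  window_s: float = 30.0) -> list[tuple[int, int]]:
    """Split an audio stream into consecutive 30-second windows of
    (start, end) sample indices; the last window may be shorter.
    Cuts at arbitrary sample positions, ignoring speech content."""
    step = int(window_s * sample_rate)
    return [(s, min(s + step, num_samples)) for s in range(0, num_samples, step)]
```

For example, a 75-second recording at 16 kHz yields three windows, with the last only 15 seconds long and both interior boundaries placed blindly.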
WhisperX [4] proposes a method to improve both the accuracy and efficiency of Whisper when transcribing long audio. It uses a voice activity detection (VAD) model to pre-segment the input audio with a cut-and-merge scheme, allowing long-form audio to be transcribed in parallel via batched transcription of the pre-segmented audio chunks. It also performs forced phoneme alignment with an off-the-shelf model such as Wav2Vec2 [3] to generate the word-level timestamps required by the EGO4D transcription challenge. All baseline code and models are publicly available.
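The core of the cut-and-merge scheme can be sketched as follows, assuming the VAD model has already produced sorted, non-overlapping (start, end) speech regions in seconds (the function and its signature are illustrative, not the WhisperX API):

```python
def cut_and_merge(vad_segments: list[tuple[float, float]],
                  max_len: float = 30.0) -> list[tuple[float, float]]:
    """Merge adjacent VAD speech regions into chunks no longer than
    Whisper's 30-second input window, so cuts land only in detected
    silences. Single regions longer than max_len would additionally
    need to be cut (WhisperX cuts at the lowest-confidence VAD point);
    that case is omitted in this sketch."""
    chunks: list[tuple[float, float]] = []
    cur_start, cur_end = vad_segments[0]
    for start, end in vad_segments[1:]:
        if end - cur_start <= max_len:
            cur_end = end  # still fits in one window: absorb this region
        else:
            chunks.append((cur_start, cur_end))  # emit chunk, start a new one
            cur_start, cur_end = start, end
    chunks.append((cur_start, cur_end))
    return chunks
```

The resulting chunks are independent of one another, so they can be batched and transcribed in parallel rather than decoded sequentially.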

OXFORDVGG SUBMISSION TO THE EGO4D AV TRANSCRIPTION CHALLENGE