Transcribe Streams of Audio Data in Real-Time with Python

I developed a web app with Django as the backend and a frontend library.

I use django-channels for WebSockets. I can record the audio stream on the front end and send it to Django over a WebSocket, and Django then forwards it to the group.

So live audio calls (so to speak) already work, but I also need to transcribe the audio on the backend, which is the main goal of the project.

I plan to use the SpeechRecognition 3.8.1 package for the transcription.

Every second, the front end sends Django a base64-encoded string containing one second of Opus-encoded microphone audio.

My questions:

If I play the audio chunks independently, only the first one plays. The 2nd, 3rd, ... chunks do not play on their own (padding issues, or maybe something else I'm not aware of), so I use MediaSource on the front end to buffer the chunks and play them. Can the 2nd, 3rd, ... chunks be converted to text with the package mentioned above, or will I have to do something else? (I'm looking for ideas on how this could work.)
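To illustrate one idea for this first question: if only the first chunk carries the container header (my guess, not something I have confirmed), the backend could mirror what MediaSource does and keep appending every decoded chunk to one growing buffer, so whatever is decoded later always starts with that first chunk. A minimal sketch, where `ChunkBuffer` and its methods are hypothetical names and stripping a `data:...;base64,` prefix is an assumption about how the front end encodes the string:

```python
import base64


class ChunkBuffer:
    """Accumulates the base64-encoded audio chunks of one client (hypothetical helper)."""

    def __init__(self):
        self._data = bytearray()

    def add_chunk(self, b64_chunk: str) -> int:
        """Decode one chunk, append it, and return the total number of bytes held."""
        # Assumption: the string may arrive as a data URL ("data:...;base64,<payload>").
        # The base64 alphabet contains no comma, so splitting on the first one is safe.
        if "," in b64_chunk:
            b64_chunk = b64_chunk.split(",", 1)[1]
        self._data.extend(base64.b64decode(b64_chunk))
        return len(self._data)

    def snapshot(self) -> bytes:
        """Return everything received so far, beginning with the first chunk."""
        return bytes(self._data)
```

In a django-channels consumer, one such buffer per connection could be filled in the receive handler, and `snapshot()` handed to whatever decodes the Opus data.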

Also, the package mentioned above expects WAV audio, so how can I convert my base64-encoded string into WAV on the fly? I have seen many examples that work with files on disk, but I want to convert the audio format in memory and then save the transcribed text to a file or database.
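For the in-memory part of this second question, here is a minimal sketch of what I have in mind, using only the standard library. It assumes the Opus data has already been decoded to raw 16-bit PCM by some external decoder (for example ffmpeg, not shown here); `pcm_to_wav_bytes` is a hypothetical helper name, and the default sample rate and channel count are placeholders rather than values from my app:

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 16000,
                     channels: int = 1, sample_width: int = 2) -> bytes:
    """Wrap raw PCM samples in a WAV container without touching the disk."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width)  # bytes per sample; 2 means 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    # wave does not close a file object it did not open, so buf is still readable.
    return buf.getvalue()
```

As far as I can tell, SpeechRecognition's `AudioFile` accepts a file-like object as well as a path, so wrapping the result in `io.BytesIO(...)` might be enough to skip temporary files entirely.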

I can provide any code needed to better understand the question.

I am also open to changing my current workflow if that makes transcription easier.

Thanks!
