[Where do we stand] Transcribe audio using AI cloud services.

6 min read · Aug 17, 2022

The AI/ML/DL stack of major cloud services such as AWS, Azure, IBM & Google

One of the well-known services in the AI/ML (artificial intelligence & machine learning) stack of cloud computing is transcribing audio (speech to text) and analyzing it.

While cleaning up my phone, I realised that it had piled up a lot of voice recordings that I made at various conferences & seminars I attended. Recording the speeches/discussions directly on the phone is a quick way of taking notes that you can refer to later, & if you label them & make a transcript, it’s even better for indexing. Thus, my experiments with transcription services on the cloud started! (I wonder whether Evernote or OneNote has a transcribe feature that converts speech to text on the go on mobile.)

I had to trim a 45-minute audio file down to a 1-minute sample & convert it to .mp3 (using VLC Player). To keep things simple, there’s only one speaker talking in the audio file, although the job is still not simple for machines. You can download and listen to the sample audio below (and try to transcribe it in the comment section!)
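The trimming step can also be scripted instead of done by hand. Below is a minimal sketch, assuming ffmpeg is installed and on the PATH. I used VLC for the actual sample, so this is an alternative route, not what produced the clip. The file names and the `build_trim_command` helper are hypothetical.

```python
import shlex


def build_trim_command(src, dst, start_seconds=0, duration_seconds=60):
    """Return an ffmpeg argument list that cuts a clip and encodes it as MP3."""
    return [
        "ffmpeg",
        "-ss", str(start_seconds),    # seek to where the clip should start
        "-t", str(duration_seconds),  # keep only this many seconds
        "-i", src,                    # input recording (any format ffmpeg can read)
        "-vn",                        # ignore any video stream
        "-codec:a", "libmp3lame",     # encode the audio as MP3
        dst,
    ]


# Hypothetical file names, for illustration only.
cmd = build_trim_command("conference_talk.m4a", "sample_1min.mp3")
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list (rather than one long string) avoids quoting problems when file names contain spaces.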

Google Cloud pleased me as it supports a long list of input languages for the audio source, so I could choose Indian English. On the downside, the free demo only supports a 1-minute audio file. (For longer files, you have to call the APIs & write some code.) Here’s the result from Google:

150
the world's most interesting thing and I think you are the advantage of building for 
India and building the world have a domestic economy is big enough for you to grow 
as a starter and global market also I don't need to both the option that I want to 
talk to someone is very difficult to the reason is these words between is true that 
was open find out the world after appoint the world can give the field of India 
can eat Universal pain

But it gives different results if you select US English as the source language, and much better ones if the selected model is ‘video’ (the other options are phone call, search/command and default). The video model is not available for Indian English (en-IN).

We are around 250 in India.  to the point that
the world could have come in and they want to be locked up all over the top again, 
but more importantly I think is huge opportunity for startups the third and most 
interesting thing and I think you can speak about the relevant is after the DND 
Festival we spend some time at Israel is India has this unique Vantage of building
 for India and building for the world. We have a domestic economy, which is big 
enough for you to grow as a start-up and it has a and a global market. 
Also, I think we need a bill for both the opportunity that I only focus on one is 
very defensive because the reason is these words between these two that what 
have different kind navigation for the world to start learning after a point 
because India will be the world's and usually right so the because the scene of 
India can be Universal game.
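For files longer than the demo limit, the language and model choices described above become fields in an API request. Below is a rough sketch of the JSON body for the Speech-to-Text REST method `speech:longrunningrecognize`, under these assumptions: the audio has been converted to FLAC and uploaded to a Cloud Storage bucket (the `gs://` path here is hypothetical), and the helper name is my own. I did not go this route in my test, so treat it as a starting point and check the current API reference before relying on it.

```python
import json


def build_long_running_request(gcs_uri, language_code="en-US", model=None):
    """Build the JSON body for a long-running recognition request."""
    config = {
        "encoding": "FLAC",             # assumes the file was converted to FLAC
        "languageCode": language_code,  # e.g. "en-IN" or "en-US"
    }
    if model:
        # e.g. "video", "phone_call", "command_and_search" or "default"
        config["model"] = model
    return {"config": config, "audio": {"uri": gcs_uri}}


# Indian English: no model is set, since 'video' was not offered for en-IN.
body_in = build_long_running_request("gs://my-hypothetical-bucket/sample.flac", "en-IN")

# US English with the 'video' model, the combination that did better for me.
body_us = build_long_running_request(
    "gs://my-hypothetical-bucket/sample.flac", "en-US", model="video"
)

print(json.dumps(body_us, indent=2))
```

The body would be sent as an authenticated POST to `https://speech.googleapis.com/v1/speech:longrunningrecognize`; the call returns an operation that you poll for the final transcript.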

If you check out the AWS AI/ML/DL stack (Artificial Intelligence, Machine Learning, Deep Learning), it gives you ready-to-use tools such as Rekognition, Translate and Polly, which can process the input without requiring any coding. What we need in our case is AWS Transcribe, which reads audio files stored in S3 and exports the text in JSON format. The best part is that you do not need to write any code for this, nor is there a 1-minute restriction on the audio. But it understands only US English and Spanish, and the result was not great:

Get it. So with that the ones they want. 
So  we have all but one party is huge over to you first. The most impressive thing. 
And i think you could spend some time. Is there? He has a unique case beginning for
the world. You have a domestic big enough to stop it and the opportunity only 
focused on one very defensive because of easiness. These words that was a different
world will start learning after a point because and you two guys with big feet,
you know what?
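The JSON that AWS Transcribe exports holds more than the plain text: each word comes with timestamps and a confidence score. Below is a minimal sketch of reading such a file in Python. The sample is a tiny hand-made stand-in that follows the shape of the Transcribe output as I understand it (`results.transcripts` and `results.items`); the confidence values are invented for illustration and are not from my actual job.

```python
import json

# Hand-made stand-in for a Transcribe output file (values are illustrative).
SAMPLE_OUTPUT = """
{
  "jobName": "hypothetical-job",
  "status": "COMPLETED",
  "results": {
    "transcripts": [{"transcript": "Get it."}],
    "items": [
      {"type": "pronunciation", "start_time": "0.04", "end_time": "0.31",
       "alternatives": [{"confidence": "0.42", "content": "Get"}]},
      {"type": "pronunciation", "start_time": "0.31", "end_time": "0.50",
       "alternatives": [{"confidence": "0.91", "content": "it"}]},
      {"type": "punctuation",
       "alternatives": [{"confidence": "0.0", "content": "."}]}
    ]
  }
}
"""


def full_transcript(job_output):
    """Return the complete transcript text."""
    return job_output["results"]["transcripts"][0]["transcript"]


def low_confidence_words(job_output, threshold=0.6):
    """Return the words whose confidence is below the threshold."""
    words = []
    for item in job_output["results"]["items"]:
        if item["type"] != "pronunciation":
            continue  # punctuation items carry no meaningful confidence
        best = item["alternatives"][0]
        if float(best["confidence"]) < threshold:
            words.append(best["content"])
    return words


job = json.loads(SAMPLE_OUTPUT)
print(full_transcript(job))       # Get it.
print(low_confidence_words(job))  # ['Get']
```

Listing the low-confidence words is a quick way to find the spots a human should re-listen to first.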

IBM Watson is the dark horse here; it showed relatively fewer errors. But when I chose British English, the result was full of ‘yeah’, which is funny (and stupid!)

We get 150. This. For the one that. The one. The what we love to help all those 
online games at 148 is huge opportunity for stocks the most interesting thing 
and I think you would be more than enough for the effects of the sense of time 
and was there. Is media has unique blockades probability for India and before 
the war. Yeah domestic economy big enough for you to go in the stock and it 
has a. And over the hallways like the need to clear the board the all the Jews 
that only focus on one is very different too because the reason is these lawyers
 or differently from the way the stock offer appointed with nearly the word off 
and you could be like the because of being in the op he or university.

Lastly, Microsoft’s Azure Cognitive Services gave me a hard time. Unlike the other cloud service providers mentioned above, it doesn’t have a quick demo page where I could simply upload a minute-long file to test. So I had to sign up for Azure services using a credit card, email, mobile number & OTP verification, only to find that I was not eligible for their 30-day free trial. I eventually got the free trial after contacting customer support. After much fooling around with the Azure Cognitive Services API documentation, I realised that the API only accepts 10 seconds of audio (certainly not the tool we are looking for; this one is more suited to command/search). Apparently cris.ai and customspeech.ai do batch transcription, which is again a lot of effort. After a few REST API experiments, I gave up! I may be missing something on @Azure for such a simple task. (Please add a comment below if you can guide me to the right path, or if you know an easier way that I missed.) But I think they should have made it a simple, ready-to-use service, at least for the trial demo.

Finally, I decided to transcribe it myself, as none of the above cloud players was up to the mark.
Human interpretation (by someone who knows the context)
I am attempting a verbatim transcript here. I must confess, this audio was very tough to crack even for someone who was present in that discussion and recorded it. I can imagine more errors if this were assigned to a layman.

...Twenty hundred accelerators. And we are around hundred and fifteen. so almost… 
So to the point that, the world’s gonna come in. and we gonna...  
..upper game but more importantly, I think it’s huge opportunity for startups.
The third and most interesting thing and I think you’ll speak about it, 
Amey, after the DLD festival where we spent some time in Israel, is 
India has this unique advantage of building for India and building for world. 
We have a domestic economy big enough for you to grow as a startup and it has a 
and a global market also I think we need to build for both.
The opportunity that I wanna focus is very different because the reason is these
worlds between the what is different for India different for world is start
blurring after a point because India will be the world. And mutually so bigger the 
theme for India is Universal theme.

Surely, the above results from the machine transcription services have quite a few errors! I know the audio clip was not clear enough, but that’s your practical scenario; that’s what you get in real life: actual footage recorded from a distance in a conference room with loudspeakers.
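If you want a number instead of an impression, a common yardstick for transcripts is the word error rate (WER): the word-level edit distance between the machine output and a human reference, divided by the number of words in the reference. I did not compute WER for the clips in this post; the sketch below only shows the calculation on a short phrase, and the function names are my own.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9'’]+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Two of five words differ, so the error rate is 2 / 5.
print(word_error_rate("India will be the world", "India can be the word"))  # 0.4
```

A WER of 0 means a perfect match, and it can exceed 1 when the machine inserts many extra words.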
Conclusion: Are we there yet? What do you think?

Clearly, you still need human intervention and cannot rely fully (in this case, not at all) on these automated services. In translation jobs these machines at least help with 50–70% of the work, but for transcription we still have a lot to cover (& miles to go 🙂 ).

Notes: I did this exercise in November 2018. It’s been 3+ years, but the 2018 results should be documented somewhere, and thus we are publishing them. I plan to run a similar test this month, in 2022, to check and compare the current status. [Part 2 with updates shall come next month.]


Written by Staying Relevant
