How to add speech-to-text to an HTML page

Adding voice input to your web applications can significantly enhance accessibility and convenience. One of the tools that makes this possible is the SpeechRecognition API, which allows developers to capture spoken input directly within HTML forms. This article explores how to use the SpeechRecognition API, discusses some of its limitations, and looks at how well it is supported across the leading browsers.

What is the SpeechRecognition API anyway?

Simply put, the SpeechRecognition API provides an interface that lets us recognize voice input and convert it into text without relying on a costly third-party API. This allows users, especially those with disabilities, to interact with forms, search bars, and other input fields. Enabling voice input can also speed up form completion on mobile devices, where typing tends to be slower.

The main issue with this solution is that the SpeechRecognition API does not work across all browsers. As of now (August 2024) it works primarily in Google Chrome and some versions of Microsoft Edge. Speech-to-text also raises questions about data security and privacy, because your voice must be transferred to the cloud for processing before the written text is returned.

Supported Browsers

Google Chrome: Full support on desktop and mobile
Microsoft Edge: Support available in Chromium-based versions
Apple Safari: Partial support
Mozilla Firefox: No support, but ongoing discussions about future implementation

The accuracy of speech recognition can also vary depending on your accent, background noise, and how clearly you speak, which could result in a frustrating user experience for some. Since processing is done in the cloud, an active internet connection is required. Remote locations or areas with poor connectivity could therefore present challenges when working with the SpeechRecognition API.

To interact with the API, we need to use JavaScript.

First we create a new SpeechRecognition object. The code checks whether the browser supports the SpeechRecognition API or the webkit-prefixed version of it; if one of these is available, it proceeds with listening. Note that recognition.lang can be explicitly set to other languages such as Spanish (es-ES) or French (fr-FR); if omitted, the browser's default language is used.
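As a minimal sketch, the constructor check and language setting might look like the following (the Spanish locale here is just an illustration):

<script>
    // Prefer the standard constructor, falling back to the webkit-prefixed version
    const SpeechRecognitionCtor = window.SpeechRecognition || window.webkitSpeechRecognition;

    if (SpeechRecognitionCtor) {
        const recognition = new SpeechRecognitionCtor();
        recognition.lang = 'es-ES'; // e.g. Spanish; omit to use the default language
    } else {
        console.warn('SpeechRecognition is not supported in this browser.');
    }
</script>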

Setting interimResults to false tells the object not to return partial results while you are still speaking. This means it will wait until you are done talking before delivering the final result.
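For comparison, a sketch with interimResults set to true would receive partial transcripts while you speak (the console logging here is purely illustrative):

<script>
    const recognition = new (window.SpeechRecognition || window.webkitSpeechRecognition)();
    recognition.lang = 'en-US';
    recognition.interimResults = true; // deliver partial results while still speaking

    recognition.onresult = (event) => {
        // Each result in the list may be interim (still changing) or final
        const latest = event.results[event.results.length - 1];
        console.log(latest.isFinal ? 'Final:' : 'Interim:', latest[0].transcript);
    };

    recognition.start();
</script>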

Finally, setting maxAlternatives to one (1) limits the number of alternative results that the SpeechRecognition object will return for what is being said: one means it will only return its most confident result. Here’s what the full JavaScript would look like.

<script>
    function startRecognition() {
        // Use the standard constructor if available, otherwise the webkit-prefixed one
        const SpeechRecognitionCtor = window.SpeechRecognition || window.webkitSpeechRecognition;
        if (!SpeechRecognitionCtor) {
            alert('Speech recognition is not supported in this browser.');
            return;
        }

        const recognition = new SpeechRecognitionCtor();
        recognition.lang = 'en-US';           // recognition language
        recognition.interimResults = false;   // only deliver the final result
        recognition.maxAlternatives = 1;      // return only the most confident result

        recognition.onresult = (event) => {
            // Use the transcript of the first (most confident) alternative
            const result = event.results[0][0].transcript;
            document.getElementById('voiceInput').value = result;
        };

        recognition.start();
    }
</script>
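Inside startRecognition, you could also attach handlers for the recognition object's onerror and onend events to surface problems such as denied microphone permissions; a minimal sketch (the messages are illustrative):

        // Optional: surface errors (e.g. 'not-allowed', 'no-speech') to the user
        recognition.onerror = (event) => {
            console.error('Speech recognition error:', event.error);
        };

        // Fired when the recognition session ends, whether or not a result was returned
        recognition.onend = () => {
            console.log('Recognition session ended.');
        };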

Now we need a form with a button and an input field so that we can update it with the returned text. Let’s create a form and set the button's onclick handler to startRecognition. Here’s what the full code should look like.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Speech Recognition Test</title>
</head>
<body>
    <h1>Voice Input Test</h1>
    <form>
        <label for="voiceInput">Speak something:</label>
        <input type="text" id="voiceInput" placeholder="Your speech will appear here">
        <button type="button" onclick="startRecognition()">Start Recognition</button>
    </form>

    <script>
        function startRecognition() {
            // Use the standard constructor if available, otherwise the webkit-prefixed one
            const SpeechRecognitionCtor = window.SpeechRecognition || window.webkitSpeechRecognition;
            if (!SpeechRecognitionCtor) {
                alert('Speech recognition is not supported in this browser.');
                return;
            }

            const recognition = new SpeechRecognitionCtor();
            recognition.lang = 'en-US';           // recognition language
            recognition.interimResults = false;   // only deliver the final result
            recognition.maxAlternatives = 1;      // return only the most confident result

            recognition.onresult = (event) => {
                // Use the transcript of the first (most confident) alternative
                const result = event.results[0][0].transcript;
                document.getElementById('voiceInput').value = result;
            };

            recognition.start();
        }
    </script>
</body>
</html>

As you can see, the SpeechRecognition API gives you an easy and rapid way to enhance the user experience of your web application, but it has drawbacks, namely the unsupported browsers. If you’re targeting users who primarily use Google Chrome or Microsoft Edge, then this solution can be a great addition to your website.

Given the pace of development, it is reasonable to expect broader support in Safari and Firefox over the next few years, once certain issues with permissions and privacy are worked out. If you need wider support in the interim, you can provide fallback options for unsupported browsers. However, don’t forget to obtain your end users' approval, to address the privacy and data security implications, when rolling out this technology.
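One simple fallback, sketched below under the assumption that the page uses the button and input field from the example above, is to hide the voice button entirely when the API is unavailable, leaving the plain text input for users to type into:

<script>
    // Hide the "Start Recognition" button when speech recognition is unavailable,
    // so unsupported browsers fall back to ordinary typing
    if (!(window.SpeechRecognition || window.webkitSpeechRecognition)) {
        const button = document.querySelector('button[onclick="startRecognition()"]');
        if (button) {
            button.hidden = true;
        }
    }
</script>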


