Explore AI speech
In this exercise, you’ll use a chat playground to interact with a generative AI model using speech. You’ll explore speech-to-text (STT) and text-to-speech (TTS) functionality.
This exercise should take approximately 15 minutes to complete.
Chat with a model
Let’s start by chatting with a generative AI model. In this exercise, we’ll use the Microsoft Phi 3 Mini model, a small language model that is useful for general chat solutions in low-bandwidth scenarios. The solution also uses Web Speech APIs for speech recognition and synthesis.
Note: The model will run in your browser, on your local computer. Performance may vary depending on the available memory in your computer and your network bandwidth to download the model.
- In a web browser, open the Chat Playground at https://aka.ms/chat-playground.
- Wait for the model to download and initialize.
    Tip: The first time you open the chat playground, it may take a few minutes for the model to download. Subsequent downloads will be faster.
- When the model is ready, enter a prompt, such as “What is the capital of England?”, and review the response.
Use speech-to-text for voice recognition
Speech-to-text (STT) is an AI technique that transforms audible speech into text. A common use for this is to implement voice recognition solutions in which a user can interact with an AI application by talking to it.
- At the top-right of the Chat history pane, use the Settings (⚙) button to view the chat capabilities options.
- In the Speech section, enable Speech to text. Then save the changes.
    Under the chat interface, a Voice input (🎙) button is enabled.
- Click the Voice input button and, if prompted, allow access to your computer’s microphone. Then, after the tone, say something like “What should I consider when choosing a hotel in London?”
    Your speech should be converted to text and entered as a prompt, to which the model responds.
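The exercise does not show how the playground implements voice input. As a rough sketch of how a browser page might capture a single spoken prompt with the Web Speech API, the TypeScript below wraps `SpeechRecognition` (or the prefixed `webkitSpeechRecognition`) in a hypothetical `listenOnce` helper. The helper names, the `en-GB` language setting, and the `scope` parameter (added so the sketch can be exercised outside a browser) are assumptions for illustration, not part of the playground.

```typescript
// Minimal shape of the Web Speech API members this sketch relies on.
interface RecognitionLike {
  lang: string;
  interimResults: boolean;
  onresult: ((event: { results: ArrayLike<ArrayLike<{ transcript: string }>> }) => void) | null;
  onerror: ((event: { error: string }) => void) | null;
  start(): void;
}

type RecognitionCtor = new () => RecognitionLike;
type Scope = Record<string, unknown>;

const globalScope = globalThis as unknown as Scope;

// Feature detection: some browsers expose only the webkit-prefixed constructor.
function getRecognitionCtor(scope: Scope = globalScope): RecognitionCtor | null {
  const ctor = scope["SpeechRecognition"] ?? scope["webkitSpeechRecognition"];
  return typeof ctor === "function" ? (ctor as RecognitionCtor) : null;
}

// Hypothetical helper: listens for one utterance and passes the transcript on.
// Returns false when speech recognition is not available in this environment.
function listenOnce(
  onPrompt: (text: string) => void,
  onError: (reason: string) => void = () => {},
  scope: Scope = globalScope,
): boolean {
  const Recognition = getRecognitionCtor(scope);
  if (Recognition === null) return false;

  const recognition = new Recognition();
  recognition.lang = "en-GB";          // assumed language for this example
  recognition.interimResults = false;  // deliver only the final transcript
  recognition.onresult = (event) => {
    // The first alternative of the first result is the most likely transcript.
    const text = (event.results[0]?.[0]?.transcript ?? "").trim();
    if (text !== "") onPrompt(text);
  };
  recognition.onerror = (event) => onError(event.error);
  recognition.start();                 // the browser may ask for microphone access here
  return true;
}
```

In a page, a microphone button's click handler could call `listenOnce` and submit the transcript as the chat prompt. A `false` return value is a cue to hide or disable the button.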
Use text-to-speech for voice synthesis
For a fully functional speech-capable AI agent, the conversation should flow in both directions. You can use text-to-speech (TTS) to have the agent vocalize its responses.
- At the top-right of the Chat history pane, use the Settings (⚙) button to re-open the chat capabilities options.
- In the Speech section, enable Text to speech. Then explore the available voices, trying them out with the voice sample. When you have chosen a voice for your agent, save the changes.
    The chat restarts after enabling text to speech.
- Use the Voice input button to speak to the agent (try something like “Suggest three tourist activities in London”).
    Note: The agent may take longer to respond when text to speech is enabled.
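To illustrate the synthesis side, here is a minimal TypeScript sketch of how a page might speak a model response with the Web Speech API's `speechSynthesis` object and `SpeechSynthesisUtterance` constructor. The `pickVoice` and `speakResponse` helpers, the language-prefix matching rule, and the `scope` parameter are assumptions made for this example. The playground's actual voice selection logic is not documented here.

```typescript
// Minimal shapes of the Web Speech API members this sketch relies on.
interface VoiceLike { name: string; lang: string; }
interface UtteranceLike { text: string; voice: VoiceLike | null; rate: number; }
interface SynthLike {
  getVoices(): VoiceLike[];
  speak(utterance: UtteranceLike): void;
  cancel(): void;
}
type UtteranceCtor = new (text: string) => UtteranceLike;
type SpeechScope = Record<string, unknown>;

// Hypothetical rule: prefer the first voice whose language tag starts with the
// requested prefix (for example "en"), otherwise fall back to the first voice.
function pickVoice(voices: VoiceLike[], langPrefix: string): VoiceLike | null {
  const wanted = langPrefix.toLowerCase();
  const match = voices.find((v) => v.lang.toLowerCase().startsWith(wanted));
  return match ?? voices[0] ?? null;
}

// Hypothetical helper: speaks the text aloud and returns false when speech
// synthesis is not available in this environment.
function speakResponse(
  text: string,
  langPrefix = "en",
  scope: SpeechScope = globalThis as unknown as SpeechScope,
): boolean {
  const synth = scope["speechSynthesis"] as SynthLike | undefined;
  const Utterance = scope["SpeechSynthesisUtterance"];
  if (!synth || typeof Utterance !== "function") return false;

  synth.cancel(); // stop any response that is still being read out
  const utterance = new (Utterance as UtteranceCtor)(text);
  // Note: in some browsers getVoices() returns an empty list until the
  // "voiceschanged" event has fired; a null voice means the browser default.
  utterance.voice = pickVoice(synth.getVoices(), langPrefix);
  synth.speak(utterance);
  return true;
}
```

Calling `speakResponse(reply)` after each model response would read it aloud. Passing a different prefix such as `"fr"` selects a voice for another language when one is installed.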
Summary
In this exercise, you explored the use of speech-to-text and text-to-speech with a generative AI model in a chat playground.
Some models are multimodal and natively support audio input, while others rely on speech-to-text and text-to-speech for spoken conversations. Azure AI Foundry supports multimodal models and also provides Azure AI Speech, which includes a wide range of services you can use to build speech-based AI apps and agents.
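For a text-only model, a spoken conversation amounts to a three-stage pipeline: recognize speech, generate a reply, then synthesize the reply. The TypeScript sketch below shows that flow for a single turn. The `voiceTurn` function and its three injected callbacks are hypothetical names chosen for illustration. In practice they could be backed by the browser's Web Speech APIs, Azure AI Speech, or any other STT and TTS service.

```typescript
// Each stage is injected so the pipeline does not depend on a specific service.
type Listen = () => Promise<string>;                  // speech-to-text
type Generate = (prompt: string) => Promise<string>;  // text-only language model
type Speak = (text: string) => Promise<void>;         // text-to-speech

interface TurnResult {
  prompt: string;
  reply: string;
}

// Runs one conversational turn: listen, generate, speak.
// Returns null when nothing intelligible was recognized.
async function voiceTurn(
  listen: Listen,
  generate: Generate,
  speak: Speak,
): Promise<TurnResult | null> {
  const prompt = (await listen()).trim();
  if (prompt === "") return null; // skip the model call for empty input

  const reply = await generate(prompt);
  await speak(reply);             // extra synthesis step that adds to response time
  return { prompt, reply };
}
```

A multimodal model that accepts audio directly would replace the `listen` and `generate` stages with a single call, which is the main structural difference between the two approaches.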