Utterly Voice

Recognizers

Utterly Voice depends on third-party software for speech recognition. You can currently configure the application to use one of the following recognizers, each described below: Vosk, Microsoft Azure, Deepgram, Google Cloud Speech-to-Text V1, or Whisper.

Recognizer General Settings

General application settings are in the config\settings.yaml file in the application directory. This file has the following recognizer settings:

recognizer (String): The recognizer to use for speech recognition. This value can be "vosk", "azure", "deepgram", "google_v1", or "whispercpp".
voskRecognizerConfig (Map Collection): Settings specific to the Vosk recognizer. See Vosk below.
azureRecognizerConfig (Map Collection): Settings specific to the Microsoft Azure recognizer. See Microsoft Azure below.
deepgramRecognizerConfig (Map Collection): Settings specific to the Deepgram recognizer. See Deepgram below.
googleV1RecognizerConfig (Map Collection): Settings specific to the Google Speech-to-Text V1 recognizer. See Google Cloud Speech-to-Text V1 below.
whisperCppRecognizerConfig (Map Collection): Settings specific to the Whisper recognizer. See Whisper below.
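As a sketch, the top of config\settings.yaml might select a recognizer like this (the value shown is only one of the allowed options; the per-recognizer map keys are filled in under the sections below):

```yaml
recognizer: "vosk"   # or "azure", "deepgram", "google_v1", "whispercpp"
# Then fill in the matching map, e.g. voskRecognizerConfig, azureRecognizerConfig, ...
```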

Vosk

Vosk is a free offline speech recognizer, and the default settings are configured for it. It runs on your local machine, so it does not require an internet connection.

If you use this speech recognizer, Utterly Voice will use about 5 GB of memory. Your machine should have at least 8 GB of memory, ideally 16 GB or more.

Here are the configuration settings:

model (String): The path to the model directory, relative to the application directory. Utterly Voice provides a recommended model, but you can try other models. Download a model that you want to try, unzip the file, and update this setting to the path of the model directory that contains a README file.
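A minimal sketch of the corresponding fragment of config\settings.yaml, assuming the setting above nests under the voskRecognizerConfig map (the directory name is a hypothetical example; use the path of the model you actually downloaded):

```yaml
recognizer: "vosk"
voskRecognizerConfig:
  model: "models/vosk-model-en-us-0.22"   # hypothetical path; the directory should contain a README file
```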

Microsoft Azure

Microsoft Azure Speech to Text is a paid online speech recognition service. This recognizer runs on Microsoft servers, so you need to have an active internet connection to use this option.

If you have an accent, this is currently the best recognizer we support. With the language setting below, you can select a locale-specific English variant.

You can optionally create a custom speech model, using your voice and jargon. Instructions are below.

On rare occasions, requests sent to this service receive error responses. If this happens, turn the microphone off for one minute, which gives the service time to recover.

To set this up, you need to create an account, create a speech resource, select a region, and copy a secret key for authentication:

  1. Sign up for an Azure account.
  2. Go to the Azure Portal.
  3. Click Create a resource.
  4. Enter "speech" in the search field and press enter.
  5. Click the Speech resource for Microsoft Azure Service.
  6. Click Create.
  7. Use the default subscription or create a new one.
  8. If you haven't already, create a resource group with a name such as "PrimaryResourceGroup".
  9. Select the region that is closest to your location.
  10. Choose a resource name that will be unique across all Azure services. You can use random letters.
  11. Select the Standard S0 pricing tier. For more information, see Azure Speech Pricing.
  12. Click Review + create, then confirm creation. The resource is now created.
  13. Click Go to resource.
  14. Expand the Resource Management menu on the left.
  15. Click Keys and Endpoint.
  16. Click the copy button next to KEY 1. You will need this secret key below. Keep this key private and secure. If somebody else has your key, they can send requests to the service that you will be billed for.
  17. Note the identifier for the region you selected in the endpoint. For example, eastus is the region identifier for the https://eastus.api.cognitive.microsoft.com/ endpoint. You will need this region identifier below.

If you would like to create a custom speech model, follow these instructions:

  1. Overview: Familiarize yourself with Azure custom speech models by reading the overview.
  2. Prepare data: You need to create a directory with many WAV audio files and a single text transcription file that contains the text spoken in the audio files. Once you have created these files, zip the directory.

    Each audio file should be a recording of your voice speaking an utterance. You can use Utterly Voice to create these files by updating the audioFiles setting in the general settings file to be the number of utterances you want to capture. The text transcription file should contain one line for each utterance. Each line contains the audio file name, followed by a tab character, followed by the utterance text. For example: utterance-25.wav eurasian wigeon.

    For each utterance, you can use jargon phrases on their own or complete sentences containing your jargon. The more data you provide, the better the recognition will be.

    For more details, see the human labeled transcriptions guide.
  3. Create a project: You need to create an Azure project. The documentation states that you can either use AI Foundry or Speech Studio, but we found that AI Foundry was not producing usable models in February 2025, so we recommend Speech Studio.
  4. Upload dataset: Follow steps to upload your dataset that you prepared above (zip file).
  5. Train your model: Follow steps to train your model.
  6. Deploy your model: Follow steps to deploy your model.
  7. Copy the Endpoint ID: Once your model is deployed, open the deployment details and copy the value of Endpoint ID. You will use this value for the customModelIdentifier setting described below.
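A malformed line in the transcription file can cause a dataset upload to fail, so it can help to sanity-check the file before zipping. This is a minimal sketch (the helper name and checks are our own, not part of Utterly Voice or Azure) that verifies each line is a WAV file name, a tab character, and the utterance text:

```python
# Hypothetical helper: validate custom speech transcript lines of the
# form "<name>.wav<TAB><utterance text>" before zipping the dataset.
def check_transcript_lines(lines):
    """Return (file_name, text) pairs, raising ValueError on malformed lines."""
    entries = []
    for i, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2 or not parts[0].endswith(".wav") or not parts[1]:
            raise ValueError(f"line {i} is not '<name>.wav<TAB><text>': {line!r}")
        entries.append((parts[0], parts[1]))
    return entries

sample = [
    "utterance-24.wav\tselect next word",
    "utterance-25.wav\teurasian wigeon",
]
print(check_transcript_lines(sample)[1][1])  # prints "eurasian wigeon"
```

To check a real file, pass it in with something like `check_transcript_lines(open("trans.txt", encoding="utf-8"))`.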

Here are the Utterly Voice configuration settings:

secretKey (String): The resource key you acquired above.
region (String): The region identifier you acquired above.
language (String): The language used for recognition. This must be an English-based language. There are many locale-specific English options for improved recognition with accents ("en-AU", "en-GB", "en-HK", etc.). See the Azure Speech language options.
customModelIdentifier (String): The custom speech model identifier. Azure calls this the Endpoint ID. If you are not using a custom speech model, set this to an empty string.
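A sketch of the corresponding fragment of config\settings.yaml, assuming the settings above nest under the azureRecognizerConfig map (all values shown are placeholders; paste your own key, region, and locale):

```yaml
recognizer: "azure"
azureRecognizerConfig:
  secretKey: "PASTE-KEY-1-HERE"   # from Keys and Endpoint; keep private
  region: "eastus"                # region identifier from your endpoint URL
  language: "en-GB"               # any locale-specific English option
  customModelIdentifier: ""       # Endpoint ID, or "" if no custom model
```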

Whisper

Whisper is a free offline speech recognizer. It runs on your local machine, so it does not require an internet connection.

Whisper does not provide any effective means to control speech recognition bias. This means that the bias settings you apply in your settings files do not have any effect with this recognizer.

Whisper is not designed for realtime streaming audio processing; it is primarily designed for processing audio files. As a result, its recognition latency is higher than that of the other recognizers.

If you use this speech recognizer, Utterly Voice will use less than 1 GB of memory.

Here are the configuration settings:

model (String): The path to the model file, relative to the application directory. Utterly Voice provides a recommended model, but you can try other models. Download a model that you want to try (a *.bin file) and update this setting to the path of the model file.
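A sketch of the corresponding fragment of config\settings.yaml, assuming the setting above nests under the whisperCppRecognizerConfig map (the file name is a hypothetical example; use the *.bin file you actually downloaded):

```yaml
recognizer: "whispercpp"
whisperCppRecognizerConfig:
  model: "models/ggml-base.en.bin"   # hypothetical *.bin model path
```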

Google Cloud Speech-to-Text V1

Google Cloud Speech-to-Text V1 is a paid online speech recognition service. This recognizer runs on Google servers, so you need to have an active internet connection to use this option. To set this up, see Google's setup instructions and complete the following steps:

  1. Create a Google account if you do not already have one.
  2. Create a Google Cloud project if you do not already have one.
  3. Provide billing information.
  4. Enable the Speech-to-Text API.
  5. Choose your preferred data logging setting.
  6. Create a service account. We recommend providing the service account with the cloud speech editor role.
  7. Download a private service account key. Utterly Voice uses this file to authenticate calls to the Google Cloud service. Move this file to the Utterly Voice application directory and rename it to secret_google_credentials.json. Be sure to keep this file in a secure location. If anyone copies this file, they can make calls to Google Cloud services that you will be billed for.
  8. You can skip other steps from Google's documentation.

If you work in an organization that prevents downloading of service account keys, you can use the gcloud command line tool instead. Follow the steps to install this tool. Once installed, run the gcloud auth application-default login command with no command line flags to authenticate calls from Utterly Voice. The command walks you through creating a credentials file on your computer in a location that Utterly Voice can find. Unless you delete this file, you only need to run this command once.

Here are the Utterly Voice configuration settings:

model (String): The recognizer model. The recommended model is provided in the default settings, but you are free to experiment with other models.
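A sketch of the corresponding fragment of config\settings.yaml, assuming the setting above nests under the googleV1RecognizerConfig map (the model name is illustrative; keep the shipped default unless you are experimenting):

```yaml
recognizer: "google_v1"
googleV1RecognizerConfig:
  model: "latest_short"   # illustrative; see Google's model list for options
```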

Deepgram

Deepgram is a paid online speech recognition service. This recognizer runs on Deepgram servers, so you need to have an active internet connection to use this option. To set this up:

  1. Create a Deepgram account.
  2. Create an API key. Keep this key private and secure. If somebody else has your key, they can send requests to the service that you will be billed for.

Here are the Utterly Voice configuration settings:

secretKey (String): The API key you acquired above.
model (String): The recognizer model. The recommended model is provided in the default settings, but you are free to experiment with other models. You must choose a model that supports streaming (Whisper does not support streaming).
alternatives (Number): The number of transcript alternatives returned by the service. These alternatives are used to apply bias when selecting the best transcript. Some Deepgram models only work if this is set to 1, which Deepgram does not document. The "enhanced-general" model can be set to 3 alternatives, but the "nova-2" model must be set to 1. For all other models, experiment with either 1 (Utterly Voice bias settings have no effect) or 3 (Utterly Voice bias settings work as expected).
minConfidence (Number): This recognizer frequently interprets noise as text. Fortunately, the service usually returns a low confidence score in these cases. This setting specifies a minimum confidence score: if the score of recognized text is below this value, the text is ignored. You can see the confidence scores for recognized text in the log.txt file in the application directory.
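A sketch of the corresponding fragment of config\settings.yaml, assuming the settings above nest under the deepgramRecognizerConfig map (the key is a placeholder and the minConfidence value is illustrative; tune it using the scores in log.txt):

```yaml
recognizer: "deepgram"
deepgramRecognizerConfig:
  secretKey: "PASTE-API-KEY-HERE"   # keep private
  model: "enhanced-general"         # must support streaming
  alternatives: 3                   # 3 for "enhanced-general"; 1 for "nova-2"
  minConfidence: 0.5                # illustrative threshold; adjust from log.txt
```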