Utterly Voice

Recognizers

Utterly Voice depends on third-party software for speech recognition. You can currently configure the application to use one of the following recognizers: Vosk, Whisper, Google Cloud Speech-to-Text V1, or Deepgram.

Recognizer General Settings

General application settings are in the config\settings.yaml file in the application directory. This file has the following recognizer settings:

| Setting | Type | Description |
| --- | --- | --- |
| recognizer | String | The recognizer to use for speech recognition. This value can be "vosk", "whispercpp", "google_v1" or "deepgram". |
| voskRecognizerConfig | Map Collection | Settings specific for the Vosk recognizer. See Vosk below. |
| whisperCppRecognizerConfig | Map Collection | Settings specific for the Whisper recognizer. See Whisper below. |
| googleV1RecognizerConfig | Map Collection | Settings specific for the Google Speech-to-Text V1 recognizer. See Google Cloud Speech-to-Text V1 below. |
| deepgramRecognizerConfig | Map Collection | Settings specific for the Deepgram recognizer. See Deepgram below. |

Vosk

Vosk is a free offline speech recognizer. The default settings are configured for this recognizer. It runs on your local machine, so it does not require an internet connection.

If you use this speech recognizer, Utterly Voice will use about 5 GB of memory. Your machine should have at least 8 GB of memory, ideally 16 GB or more.

Here are the configuration settings:

| Setting | Type | Description |
| --- | --- | --- |
| model | String | The path to the model directory, relative to the application directory. Utterly Voice provides a recommended model, but you can try other models. Download a model that you want to try, unzip the file, and set this value to the path of the unzipped model directory (the one that contains a README file). |
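
As a minimal sketch of how the general and Vosk settings might look together in config\settings.yaml: this assumes that the setting names above are top-level keys and that each recognizer config is a nested map. The model path is a hypothetical placeholder, not the model shipped with the application.

```yaml
# Select the active recognizer: "vosk", "whispercpp", "google_v1" or "deepgram".
recognizer: "vosk"

voskRecognizerConfig:
  # Hypothetical path to an unzipped model directory, relative to the
  # application directory. Single quotes keep the backslash literal in YAML.
  model: 'models\example-vosk-model'
```

Single-quoted strings are used here because YAML treats backslashes in double-quoted strings as escape characters.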

Whisper

Whisper is a free offline speech recognizer. It runs on your local machine, so it does not require an internet connection.

Whisper does not provide any effective means to control speech recognition bias. This means that the bias settings you apply in your settings files do not have any effect with this recognizer.

Whisper is not designed for realtime streaming audio processing. It is primarily designed for processing audio files. As a result, recognition latency is higher than with the other recognizers.

If you use this speech recognizer, Utterly Voice will use less than 1 GB of memory.

Here are the configuration settings:

| Setting | Type | Description |
| --- | --- | --- |
| model | String | The path to the model file, relative to the application directory. Utterly Voice provides a recommended model, but you can try other models. Download a model that you want to try (a *.bin file) and set this value to the path of the model file. |
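
A sketch of switching to Whisper, under the same assumption that these settings are top-level keys in config\settings.yaml. The file name is hypothetical and stands in for whichever *.bin model you downloaded.

```yaml
recognizer: "whispercpp"

whisperCppRecognizerConfig:
  # Hypothetical path to a downloaded *.bin model file, relative to the
  # application directory. Unlike Vosk, this points to a file, not a directory.
  model: 'models\example-whisper-model.bin'
```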

Google Cloud Speech-to-Text V1

Google Cloud Speech-to-Text is a paid online speech recognition service. This recognizer runs on Google servers, so you need to have an active internet connection to use this option. To set this up, see Google's setup instructions and complete the following steps:

  1. Create a Google account if you do not already have one.
  2. Create a Google Cloud project if you do not already have one.
  3. Provide billing information.
  4. Enable the Speech-to-Text API.
  5. Choose your preferred data logging setting.
  6. Create a service account. We recommend granting the service account the Cloud Speech Editor role.
  7. Download a private service account key. Utterly Voice uses this file to authenticate calls to the Google Cloud service. Move this file to the Utterly Voice application directory and rename it to secret_google_credentials.json. Be sure to keep this file in a secure location. If anyone copies this file, they can make calls to Google Cloud services that you will be billed for.
  8. You can skip other steps from Google's documentation.

If you work in an organization that prevents downloading of service account keys, you can use the gcloud command line tool instead. Follow the steps to install this tool. Once installed, you can use the gcloud auth application-default login command with no command line flags to authenticate calls from Utterly Voice. This command will have you follow some steps to create a credentials file on your computer in a location that Utterly Voice can find. Unless you delete this file, you only need to call this command once.

Here are the Utterly Voice configuration settings:

| Setting | Type | Description |
| --- | --- | --- |
| model | String | The recognizer model. The recommended model is provided in the default settings, but you are free to experiment with other models. |
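
A sketch of the corresponding settings, assuming the same top-level layout in config\settings.yaml. The model value "latest_long" is one of Google's Speech-to-Text V1 model names and is used here only for illustration. It is not necessarily the default that Utterly Voice ships with.

```yaml
recognizer: "google_v1"

googleV1RecognizerConfig:
  # Illustrative Google model name; check the default settings for the
  # recommended value before changing it.
  model: "latest_long"
```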

Deepgram

Deepgram is a paid online speech recognition service. This recognizer runs on Deepgram servers, so you need to have an active internet connection to use this option. To set this up:

  1. Create a Deepgram account.
  2. Create an API key. Keep this key private and secure. If somebody else has your key, they can send requests to the service that you will be billed for.

Here are the Utterly Voice configuration settings:

| Setting | Type | Description |
| --- | --- | --- |
| secretKey | String | The API key you acquired above. |
| model | String | The recognizer model. The recommended model is provided in the default settings, but you are free to experiment with other models. You must choose a model that supports streaming (the Whisper models do not support streaming). |
| alternatives | Number | The number of transcript alternatives sent by the service. These alternatives are used to apply bias to select the best transcript. Some Deepgram models only work if this is set to 1, which Deepgram does not document. The "enhanced-general" model can be set to 3 alternatives. The "nova-2" model must be set to 1 alternative. For all other models, you should experiment with either 1 (Utterly Voice bias settings have no effect) or 3 (Utterly Voice bias settings work as expected). |
| minConfidence | Number | This recognizer frequently interprets noise as text. Fortunately, the service usually returns a low confidence score in these cases. This setting lets you set a minimum confidence score. If the score is below this value, the recognized text is ignored. You can see the confidence scores for recognized text in the log.txt file in the application directory. |
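
A sketch of a Deepgram configuration using the settings above, again assuming they are top-level keys in config\settings.yaml. The key is a placeholder. The minConfidence value is a hypothetical starting point that assumes scores are reported on a 0 to 1 scale, so check log.txt for the actual range before relying on it.

```yaml
recognizer: "deepgram"

deepgramRecognizerConfig:
  # Placeholder; replace with your own API key and keep this file private.
  secretKey: "YOUR_DEEPGRAM_API_KEY"
  model: "nova-2"
  # "nova-2" must be set to 1 alternative (see the table above).
  alternatives: 1
  # Hypothetical threshold; tune it against the scores logged in log.txt.
  minConfidence: 0.5
```

To try the "enhanced-general" model instead, you would change model and set alternatives to 3 so that the Utterly Voice bias settings can take effect.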