Utterly Voice

Recognizers

Utterly Voice depends on third-party software for speech recognition. You can currently configure the application to use one of: Vosk, Google Cloud Speech-to-Text V1, or Deepgram.

Recognizer General Settings

General application settings are in the config\settings.yaml file in the application directory. This file has the following recognizer settings:

| Setting | Type | Description |
| --- | --- | --- |
| recognizer | String | The recognizer to use for speech recognition. This value can be "vosk", "google_v1", or "deepgram". |
| voskRecognizerConfig | Map Collection | Settings specific to the Vosk recognizer. See Vosk below. |
| googleV1RecognizerConfig | Map Collection | Settings specific to the Google Speech-to-Text V1 recognizer. See Google Cloud Speech-to-Text V1 below. |
| deepgramRecognizerConfig | Map Collection | Settings specific to the Deepgram recognizer. See Deepgram below. |
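
As a rough illustration, the recognizer-related part of config\settings.yaml might look like the sketch below. The key names come from the settings listed above; the file's exact layout and the rest of its contents are assumptions, so compare against the settings.yaml that ships with the application before editing.

```yaml
# Sketch only: selects which speech recognizer Utterly Voice uses.
# Allowed values per the documentation: "vosk", "google_v1", "deepgram".
recognizer: "vosk"

# Each recognizer's own options live in a separate map, assumed here to
# sit at the same level as "recognizer":
#   voskRecognizerConfig, googleV1RecognizerConfig, deepgramRecognizerConfig
```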

Vosk

Vosk is a free offline speech recognizer. The default settings are configured for this recognizer. It runs on your local machine, so it does not require an internet connection.

If you use this speech recognizer, Utterly Voice will use about 5 GB of memory. Your machine should have at least 8 GB of memory, ideally 16 GB or more.

Here are the configuration settings:

| Setting | Type | Description |
| --- | --- | --- |
| model | String | The path to the model directory, relative to the application directory. Utterly Voice provides a recommended model, but you can try other models. Download a model that you want to try, unzip the file, and update this setting to the path of the model directory that contains a README file. |
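
For example, pointing Utterly Voice at a different Vosk model might look like the following sketch. The directory name is hypothetical, standing in for whichever model you downloaded and unzipped, and the nesting under voskRecognizerConfig is assumed from the settings described in this document.

```yaml
recognizer: "vosk"

voskRecognizerConfig:
  # Hypothetical path, relative to the application directory. It should
  # point at the unzipped model directory that contains the README file.
  # Single quotes keep the backslash literal in YAML; inside double
  # quotes a backslash would start an escape sequence.
  model: 'models\my-downloaded-vosk-model'
```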

Google Cloud Speech-to-Text V1

Google Cloud Speech-to-Text is a paid online speech recognition service. This recognizer runs on Google servers, so you need to have an active internet connection to use this option. To set this up, see Google's setup instructions and complete the following steps:

  1. Create a Google account if you do not already have one.
  2. Create a Google Cloud project if you do not already have one.
  3. Provide billing information.
  4. Enable the Speech-to-Text API.
  5. Choose your preferred data logging setting.
  6. Create a service account. We recommend granting the service account the Cloud Speech Editor role.
  7. Download a private service account key. Utterly Voice uses this file to authenticate calls to the Google Cloud service. Move this file to the Utterly Voice application directory and rename it to secret_google_credentials.json. Be sure to keep this file in a secure location. If anyone copies this file, they can make calls to Google Cloud services that you will be billed for.
  8. You can skip other steps from Google's documentation.

If you work in an organization that prevents downloading of service account keys, you can use the gcloud command line tool instead. Follow the steps to install this tool. Once installed, you can use the gcloud auth application-default login command with no command line flags to authenticate calls from Utterly Voice. This command will have you follow some steps to create a credentials file on your computer in a location that Utterly Voice can find. Unless you delete this file, you only need to call this command once.

Here are the Utterly Voice configuration settings:

| Setting | Type | Description |
| --- | --- | --- |
| model | String | The recognizer model. The recommended model is provided in the default settings, but you are free to experiment with other models. |
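
A minimal sketch of switching to this recognizer follows. The model value is illustrative rather than the shipped default, and the nesting under googleV1RecognizerConfig is assumed from the settings described in this document. No credentials appear here because, as described above, authentication comes from the secret_google_credentials.json file or from the gcloud login.

```yaml
recognizer: "google_v1"

googleV1RecognizerConfig:
  # Illustrative value only, not necessarily the shipped default.
  # "latest_long" is one of the model names Google documents for V1.
  model: "latest_long"
```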

Deepgram

Deepgram is a paid online speech recognition service. This recognizer runs on Deepgram servers, so you need to have an active internet connection to use this option. To set this up:

  1. Create a Deepgram account.
  2. Create an API key. Keep this key private and secure. If somebody else has your key, they can send requests to the service that you will be billed for.

Here are the Utterly Voice configuration settings:

| Setting | Type | Description |
| --- | --- | --- |
| secretKey | String | The API key you acquired above. |
| model | String | The recognizer model. The recommended model is provided in the default settings, but you are free to experiment with other models. You must choose a model and tier that supports streaming. |
| tier | String | The recognizer tier. The recommended tier is provided in the default settings, but you are free to experiment with other tiers. You must choose a model and tier that supports streaming. |
| minConfidence | Number | This recognizer frequently interprets noise as text. Fortunately, the service usually returns a low confidence score in these cases. This setting allows you to set a minimum confidence score. If the score is below this value, the recognized text will be ignored. You can see the confidence scores for recognized text in the log.txt file in the application directory. |
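
Put together, a Deepgram configuration might look like the sketch below. Every value is a placeholder or an illustrative choice rather than the shipped default, and the 0-to-1 confidence scale is an assumption to verify against the scores you see in log.txt.

```yaml
recognizer: "deepgram"

deepgramRecognizerConfig:
  # Placeholder: paste the API key you created in your Deepgram account.
  secretKey: "YOUR_DEEPGRAM_API_KEY"
  # Illustrative model and tier, not necessarily the shipped defaults.
  # Whatever you choose must support streaming.
  model: "general"
  tier: "enhanced"
  # Assumed 0-to-1 scale: text scoring below 0.5 would be ignored.
  minConfidence: 0.5
```

If noise is still being transcribed, one approach is to look up the confidence scores of the unwanted text in log.txt and raise minConfidence just above them. Because this file holds your secret key, treat it with the same care as the key itself.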