Utterly Voice depends on third-party software for speech recognition. You can currently configure the application to use one of the following recognizers: Vosk, Whisper, Google Cloud Speech-to-Text V1, or Deepgram.
General application settings are found in the config\settings.yaml file in the application directory.
This file has the following recognizer settings:
Setting | Type | Description |
---|---|---|
recognizer | String | The recognizer to use for speech recognition. This value can be "vosk", "whispercpp", "google_v1" or "deepgram". |
voskRecognizerConfig | Map Collection | Settings specific for the Vosk recognizer. See Vosk below. |
whisperCppRecognizerConfig | Map Collection | Settings specific for the Whisper recognizer. See Whisper below. |
googleV1RecognizerConfig | Map Collection | Settings specific for the Google Speech-to-Text V1 recognizer. See Google Cloud Speech-to-Text V1 below. |
deepgramRecognizerConfig | Map Collection | Settings specific for the Deepgram recognizer. See Deepgram below. |
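
As a rough sketch, the relationship between the recognizer value and the per-recognizer maps can be summarized as below. The pairing is inferred from the setting names in the table; treat the layout as an assumption and check the config\settings.yaml file shipped with the application for the actual structure.

```yaml
# Sketch only: pairing inferred from the setting names in the table above.
#
# Value of "recognizer"   Map that holds its settings
# "vosk"                  voskRecognizerConfig
# "whispercpp"            whisperCppRecognizerConfig
# "google_v1"             googleV1RecognizerConfig
# "deepgram"              deepgramRecognizerConfig
recognizer: "vosk"
```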
Vosk is a free offline speech recognizer. The default settings are configured for this recognizer. It runs on your local machine, so it does not require an internet connection.
If you use this speech recognizer, Utterly Voice will use about 5 GB of memory. Your machine should have at least 8 GB of memory, ideally 16 GB or more.
Here are the configuration settings:
Setting | Type | Description |
---|---|---|
model | String | The path to the model directory relative to the application directory. Utterly Voice provides a recommended model, but you can try other models. Download a model that you want to try, unzip the file, and update this setting to be the path to the model directory that contains a README file in it. |
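
For illustration, a Vosk setup might look like the sketch below. The directory name is hypothetical, and the nesting of model under voskRecognizerConfig is assumed from the settings table at the top of this page rather than copied from the shipped file.

```yaml
recognizer: "vosk"
voskRecognizerConfig:
  # Hypothetical path, relative to the application directory.
  # It should point at the unzipped model directory that contains the README file.
  # Single quotes keep the backslash literal in YAML.
  model: 'models\my-vosk-model'
```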
Whisper is a free offline speech recognizer. It runs on your local machine, so it does not require an internet connection.
Whisper does not provide any effective means to control speech recognition bias. This means that the bias settings you apply in your settings files do not have any effect with this recognizer.
Whisper is not designed for real-time streaming audio processing. It is primarily designed for processing audio files. Because of this, recognition latency is higher than it is for the other recognizers.
If you use this speech recognizer, Utterly Voice will use less than 1 GB of memory.
Here are the configuration settings:
Setting | Type | Description |
---|---|---|
model | String | The path to the model file relative to the application directory. Utterly Voice provides a recommended model, but you can try other models. Download a model that you want to try (*.bin files) and update this setting to be the path to the model file. |
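
A Whisper setup could be sketched the same way. The file name below is hypothetical, and the nesting under whisperCppRecognizerConfig is an assumption based on the settings table at the top of this page.

```yaml
recognizer: "whispercpp"
whisperCppRecognizerConfig:
  # Hypothetical path to a *.bin model file, relative to the application directory.
  # Single quotes keep the backslash literal in YAML.
  model: 'models\my-whisper-model.bin'
```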
Google Cloud Speech-to-Text is a paid online speech recognition service. This recognizer runs on Google servers, so you need to have an active internet connection to use this option. To set this up, see Google's setup instructions and complete the following steps:
secret_google_credentials.json.
Be sure to keep this file in a secure location. If anyone copies this file, they can make calls to Google Cloud services that you will be billed for.
If you work in an organization that prevents downloading of service account keys, you can use the gcloud command line tool instead. Follow the steps to install this tool. Once installed, you can use the gcloud auth application-default login command with no command line flags to authenticate calls from Utterly Voice. This command will have you follow some steps to create a credentials file on your computer in a location that Utterly Voice can find. Unless you delete this file, you only need to call this command once.
Here are the Utterly Voice configuration settings:
Setting | Type | Description |
---|---|---|
model | String | The recognizer model. The recommended model is provided in the default settings, but you are free to experiment with other models. |
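
A minimal sketch of the Google settings follows. The nesting under googleV1RecognizerConfig is assumed from the settings table at the top of this page. The value "latest_long" is one of the model names Google documents for Speech-to-Text V1; it is used here only as an example and is not necessarily the model in the Utterly Voice default settings.

```yaml
recognizer: "google_v1"
googleV1RecognizerConfig:
  # Example value only: keep the model from the default settings unless you
  # want to experiment with another model name from Google's documentation.
  model: "latest_long"
```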
Deepgram is a paid online speech recognition service. This recognizer runs on Deepgram servers, so you need to have an active internet connection to use this option. To set this up:
Here are the Utterly Voice configuration settings:
Setting | Type | Description |
---|---|---|
secretKey | String | The API key you acquired above. |
model | String | The recognizer model. The recommended model is provided in the default settings, but you are free to experiment with other models. You must choose a model that supports streaming (Whisper does not support streaming). |
alternatives | Number | The number of transcript alternatives sent by the service. These alternatives are used to apply bias to select the best transcript. Some Deepgram models only work if this is set to 1; this is not documented by Deepgram. The "enhanced-general" model can be set to 3 alternatives. The "nova-2" model must be set to 1 alternative. For all other models, you should experiment with either 1 (Utterly Voice bias settings have no effect) or 3 (Utterly Voice bias settings work as expected). |
minConfidence | Number | This recognizer frequently interprets noise as text. Fortunately, the service usually returns a low confidence score in these cases. This setting allows you to set a minimum confidence score. If the score is below this value, the recognized text will be ignored. You can see the confidence scores for recognized text in the log.txt file in the application directory. |
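
Putting the Deepgram settings together, a configuration might look like the sketch below. The key is a placeholder, the nesting under deepgramRecognizerConfig is assumed from the settings table at the top of this page, and the minConfidence value is purely illustrative (it assumes confidence scores fall between 0 and 1).

```yaml
recognizer: "deepgram"
deepgramRecognizerConfig:
  secretKey: "YOUR_DEEPGRAM_API_KEY"  # placeholder, replace with your own key
  model: "nova-2"
  alternatives: 1      # "nova-2" must use 1, per the table above
  minConfidence: 0.5   # illustrative threshold, assuming scores range from 0 to 1
```

A reasonable way to tune minConfidence is to compare the scores logged in log.txt for noise against those for real speech, then pick a value between the two.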