Simple, hackable offline speech to text - using the VOSK-API.

Overview

Nerd Dictation

Offline Speech to Text for Desktop Linux.

This is a utility that provides simple access to speech-to-text for use in Linux without being tied to a desktop environment.

Simple
This is a single file Python script with minimal dependencies.
Hackable
User configuration lets you manipulate text using Python string operations.
Zero Overhead
As this relies on manual activation, there are no background processes.

Dictation is accessed manually with begin/end commands.

This uses the excellent vosk-api.

Usage

It is suggested to bind begin/end/cancel to shortcut keys.

nerd-dictation begin
nerd-dictation end

For details on how this can be used, see: nerd-dictation --help and nerd-dictation begin --help.

Features

Specific features include:

Numbers as Digits

Optional conversion from numbers to digits.

So Three million five hundred and sixty second becomes 3,000,562nd.

A series of numbers (such as reciting a phone number) is also supported.

So Two four six eight becomes 2,468.

Time Out
Optionally end speech-to-text early when no speech is detected for a given number of seconds (avoiding the explicit call to end that is otherwise required).
Output Type
Output can simulate keystroke events (default) or simply print to the standard output.
User Configuration Script
User configuration is just a Python script which can be used to manipulate text using Python's full feature set.

See nerd-dictation begin --help for details on how to access these options.

Dependencies

  • Python 3.
  • The VOSK-API.
  • parec command (for recording from PulseAudio).
  • xdotool command to simulate keyboard input.

Install

pip3 install vosk
git clone https://github.com/ideasman42/nerd-dictation.git
cd nerd-dictation
wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip
mv vosk-model-small-en-us-0.15 model

To test dictation:

./nerd-dictation begin --vosk-model-dir=./model &
# Start speaking.
./nerd-dictation end
  • Reminder that it's up to you to bind begin/end/cancel to actions you can easily access (typically key shortcuts).

  • To avoid having to pass the --vosk-model-dir argument, copy the model to the default path:

    mkdir -p ~/.config/nerd-dictation
    mv ./model ~/.config/nerd-dictation

Hint

Once this is working properly you may wish to download one of the larger language models for more accurate dictation. They are available from the VOSK models page: https://alphacephei.com/vosk/models

Configuration

This is an example of a trivial configuration file which simply makes the input text uppercase.

# ~/.config/nerd-dictation/nerd-dictation.py
def nerd_dictation_process(text):
    return text.upper()

A more comprehensive configuration is included in the examples/ directory.
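As a further illustration, here is a sketch of a configuration that capitalizes the pronoun "i" and converts a few spoken punctuation words into symbols. The WORD_REPLACE table and its entries are hypothetical examples, not part of nerd-dictation; the only contract is that nerd_dictation_process takes a string and returns a string.

```python
# ~/.config/nerd-dictation/nerd-dictation.py
# Hypothetical word table: adjust entries to taste.
WORD_REPLACE = {
    "i": "I",
    "comma": ",",
    "period": ".",
}

def nerd_dictation_process(text):
    # Replace whole words only; splitting on spaces keeps
    # substrings such as "in" or "icon" untouched.
    words = text.split(" ")
    words = [WORD_REPLACE.get(w, w) for w in words]
    return " ".join(words)
```

Note this naive version leaves a space before punctuation ("hello ,"); the more comprehensive configuration in examples/ is a better starting point for real use.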

Hints

  • The processing function can be used to implement your own actions using keywords of your choice. Simply return a blank string if you have implemented your own text handling.
  • Context sensitive actions can be implemented using command line utilities to access the active window.
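For instance, a keyword-dispatch configuration might look like the following sketch. The COMMANDS table and the open_browser helper are hypothetical; only the nerd_dictation_process entry point is nerd-dictation's actual interface, and returning a blank string prevents any text from being typed.

```python
# ~/.config/nerd-dictation/nerd-dictation.py
import subprocess

def open_browser():
    # Assumption: xdg-open is available on this system.
    subprocess.Popen(["xdg-open", "https://example.com"])

# Hypothetical table mapping spoken phrases to actions.
COMMANDS = {
    "browser": open_browser,
}

def nerd_dictation_process(text):
    action = COMMANDS.get(text.strip())
    if action is not None:
        action()
        return ""  # blank string: the text was handled as a command
    return text  # anything else is typed out as usual
```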

Paths

Local Configuration
~/.config/nerd-dictation/nerd-dictation.py
Language Model

~/.config/nerd-dictation/model

Note that --vosk-model-dir=PATH can be used to override the default.

Details

  • Typed results will never press enter/return.
  • PulseAudio is used for recording.
  • Recording and speech-to-text are performed in parallel.

Examples

Store the result of speech to text as a variable in the shell:

SPEECH="$(nerd-dictation begin --timeout=1.0 --output=STDOUT)"

Limitations

  • Text from VOSK is all lower-case; the user configuration can be used to set the case of common words such as I, but this isn't very convenient (see the example configuration for details).

  • For some users the start-up delay may be noticeable on systems with slower hard disks, especially when running for the first time (a cold start).

    This is a limitation of the choice not to use a background service. To mitigate the problem, recording begins before any of the speech-to-text components are loaded.

Further Work

  • Add a general solution to capitalize words (proper nouns, for example).
  • Preview output while dictating.
  • Wayland support (this should be quite simple to support and mainly relies on a replacement for xdotool).
  • Add a setup.py for easy installation on users' systems.
  • Possibly other speech to text engines (only if they provide some significant benefits).
  • Possibly support Windows & macOS.
Comments
  • Packaging

    Packaging

    Hello, I have the idea to package nerd-dictation for Pypi.org. I tested adding a setup.py and setup.cfg file, treating the nerd-dictation file as a module with a console-script entry. At this step I'm facing a problem: the name nerd-dictation is not allowed because of the dash, as `import nerd-dictation` generates a syntax error. Could the name be changed to nerd_dictation instead of nerd-dictation? I haven't yet explored another way that installs the nerd-dictation script directly rather than as a module/console script. What do you think about that? The background idea is to distribute it for easy installation with pip install, and also so that elograf can require it as a dependency.

    opened by papoteur-mga 8
  • xdotool: freezes the OS

    xdotool: freezes the OS

    When I run the program by assigning the command "nerd-dictation begin --timeout 1 --numbers-as-digits --numbers-use-separator" to a custom Keyboard Shortcut on Ubuntu 20.04 it seems to be freezing every single time. Any fixes for this? It seems to behave like a memory leak, It completely crashes the OS.

    opened by 52617365 8
  • Add shell.nix and package vosk

    Add shell.nix and package vosk

    Greetings, started playing around with this the other day.. I run NixOS so I had to lay some groundwork first.. Thought others might appreciate it too.

    It just drops you into a nix-shell with the required packages so you can run nerd-dictation. It packages a couple of the English models. Easy enough to copy for other language models though. :)

    opened by mankyKitty 6
  • Is using nerd-dictation to control software a solved problem?

    Is using nerd-dictation to control software a solved problem?

    I want to use nerd-dictation for processing my photos, basically:

    • show photo
    • wait for command (next previous delete promote)
    • if command is detected: show what was detected (or produce sound feedback?), execute action

    I am not entirely sure what would be the best way to implement this - has anyone done something like that already? It seems a relatively obvious use of actually working voice-to-text.

    (maybe using nerd-dictation is a mistake and I should be using vosk API directly?)

    question 
    opened by matkoniecz 6
  • No keystrokes appear in LibreOffice Writer

    No keystrokes appear in LibreOffice Writer

    With some sort of recent upgrade of either Ubuntu or LibreOffice, I have noticed that I cannot use nerd-dictation in LibreOffice Writer. No text appears. nerd-dictation works fine with Chrome or Thunderbird windows. It did not used to be this way. I have upgraded from Ubuntu 18 to 21.10 recently, so perhaps there was some sort of change in that period; maybe there's some sort of security policy that prevents simulated keystrokes? Just a guess. LibreOffice is 7.2.3.2.

    opened by xenotropic 5
  • Russian input lags entire interface

    Russian input lags entire interface

    Russian input lags entire interface. But some programs (Blender for example) don't lag at all (also Blender usually launched in fullscreen). English input works fine. Model: "vosk-model-small-ru-0.22"

    opened by scaledteam 5
  • What is the correct format for --pulse-device-name?

    What is the correct format for --pulse-device-name?

    First off - thank you. This is precisely what I have been looking for. Great work here!

    I want to ensure that the program is using the right microphone - I want to make sure it uses the external one, not the one on my laptop. Running pactl list gives me a WHOLE slew of stuff, but I think this is the chunk I'm most interested in, since it lists my external microphone:

    Card #2
    	Name: alsa_card.usb-BLUE_MICROPHONE_Blue_Snowball_201603-00
    	Driver: module-alsa-card.c
    	Owner Module: 28
    	Properties:
    		alsa.card = "1"
    		alsa.card_name = "Blue Snowball"
    		alsa.long_card_name = "BLUE MICROPHONE Blue Snowball at usb-0000:00:14.0-3, full speed"
    		alsa.driver_name = "snd_usb_audio"
    		device.bus_path = "pci-0000:00:14.0-usb-0:3:1.0"
    		sysfs.path = "/devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3:1.0/sound/card1"
    		udev.id = "usb-BLUE_MICROPHONE_Blue_Snowball_201603-00"
    		device.bus = "usb"
    		device.vendor.id = "0d8c"
    		device.vendor.name = "C-Media Electronics, Inc."
    		device.product.id = "0005"
    		device.product.name = "Blue Snowball"
    		device.serial = "BLUE_MICROPHONE_Blue_Snowball_201603"
    		device.string = "1"
    		device.description = "Blue Snowball"
    		module-udev-detect.discovered = "1"
    		device.icon_name = "audio-card-usb"
    	Profiles:
    		input:mono-fallback: Mono Input (sinks: 0, sources: 1, priority: 1, available: yes)
    		input:multichannel-input: Multichannel Input (sinks: 0, sources: 1, priority: 1, available: yes)
    		off: Off (sinks: 0, sources: 0, priority: 0, available: yes)
    	Active Profile: input:mono-fallback
    	Ports:
    		analog-input-mic: Microphone (priority: 8700, latency offset: 0 usec)
    			Properties:
    				device.icon_name = "audio-input-microphone"
    			Part of profile(s): input:mono-fallback
    		multichannel-input: Multichannel Input (priority: 0, latency offset: 0 usec)
    			Part of profile(s): input:multichannel-input
    

    I have tried feeding the "Name" value (alsa_card.usb-BLUE_MICROPHONE_Blue_Snowball_201603-00), the udev.id, and the device.icon_name (longshot) into the CLI, each time getting the error Stream error: No such entity. If I don't include the --pulse-device-name, dictation works fine, but I want to ensure it's getting the best input possible.

    Which of the values from the pactl list output should we use for that flag? Or is there another value further up in the stream - i.e. not "Card #2' - that I should be looking at?

    Thanks!

    opened by vrrobz 5
  • pa_context_connect() failed: Connection refused

    pa_context_connect() failed: Connection refused

    Hi, I'm trying to run nerd-dictation on Kubuntu 20.04. I created a virtualenv, activated it and installed vosk with pip3.

    I'm running nerd-dictation as the root user and I get

    ./nerd-dictation begin --vosk-model-dir=./model &
    pa_context_connect() failed: Connection refused
    

    (the process still runs in the background). What is causing this error? Am I missing something?

    If I try to run it as a normal user, I get a permission error:

      File "./nerd-dictation", line 1188, in <module>
        main()
      File "./nerd-dictation", line 1184, in main
        args.func(args)
      File "./nerd-dictation", line 1107, in <lambda>
        func=lambda args: main_begin(
      File "./nerd-dictation", line 747, in main_begin
        touch(path_to_cookie)
      File "./nerd-dictation", line 65, in touch
        os.utime(filepath, None)
    PermissionError: [Errno 13] Permission denied
    

    I tried to change the ownership of the main folder and the model/ folder so they belong to my current user, but I still get the error. I notice the error mentions a "path_to_cookie" but I have no idea what path that could be.

    opened by sirio81 5
  • Lots of numbers being spit out

    Lots of numbers being spit out

    Thank you for writing this interesting project. It's running, but it's spitting out a lot of garbage along with the text.

    ❯ ./nerd-dictation begin 0.09997663497924805 0.09870014190673829 0.09955344200134278 0.09974346160888672 0.09971175193786622 0.0929502010345459 0.09946784973144532 0.09947595596313477 0.0925527572631836 0.09944138526916504 0.09245476722717286 0.09949836730957032 0.09236202239990235 0.09945592880249024 0.09939346313476563 0.0923090934753418 0.09901008605957032 THIS0.09907612800598145 IS0.039521551132202154 0.09932670593261719 0.09929046630859376 ANOTHER0.07741460800170899 0.09929213523864747 0.09936389923095704 TERRORIST0.015120840072631841 0.09926352500915528 ST0.09925565719604493 0.0896986484527588 0.09934697151184083 0.09947404861450196 0.09257588386535645 0.09938035011291504 0.09136066436767579 0.09934458732604981 0.06850967407226563 0.09943637847900391 0.09936747550964356 0.09154710769653321 0.09944114685058594 0.09195122718811036 0.09947142601013184

    How do I suppress all these logits?

    opened by MikeyBeez 5
  • 'huh' outputted after exiting

    'huh' outputted after exiting

    Hello, thanks for creating this project! Very cool.

    I've noticed that huh is outputted after I stop nerd-dictation from running. Maybe outputted twice? I'm using the small English model from the install instructions.


    opened by makeworld-the-better-one 4
  • English text is out of order and includes extra characters

    English text is out of order and includes extra characters

    The characters are strangely out of order. Using the vosk-model-en-us-0.22-lgraph.zip model. Saying "This is a test of the emergency broadcast system" multiple times:

    $ ./nerd-dictation begin --vosk-model-dir=./model --timeout=1.0
    this i tstfesa  o the mycnegeer broactdas ysstem
    $ ./nerd-dictation begin --vosk-model-dir=./model --timeout=1.0
    tihs is  atesoft  theem ergencbortsy adca systme
    $ ./nerd-dictation begin --vosk-model-dir=./model --timeout=1.0
    this is a se oftt the mereg aorbnecscdyta system
    

    The Vosk API test_microphone.py works correctly:

    $ python3 test_microphone.py
    LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=13 max-active=7000 lattice-beam=6
    LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:11:12:13:14:15
    LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
    LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
    LOG (VoskAPI:CompileLooped():nnet-compile-looped.cc:345) Spent 0.089 seconds in looped compilation.
    LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from model/ivector/final.ie
    LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
    LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
    LOG (VoskAPI:ReadDataFiles():model.cc:281) Loading HCL and G from model/graph/HCLr.fst model/graph/Gr.fst
    LOG (VoskAPI:ReadDataFiles():model.cc:302) Loading winfo model/graph/phones/word_boundary.int
    ################################################################################
    Press Ctrl+C to stop the recording
    ################################################################################
    {
      "partial" : ""
    }
    <SNIP DUPLICATES>
    {
      "partial" : "this"
    }
    {
      "partial" : "this"
    }
    {
      "partial" : "this is"
    }
    {
      "partial" : "this is a"
    }
    {
      "partial" : "this is a test of"
    }
    {
      "partial" : "this is a test of"
    }
    {
      "partial" : "this is a test of the"
    }
    {
      "partial" : "this is a test of the emergency"
    }
    {
      "partial" : "this is a test of the emergency broadcast"
    }
    <SNIP DUPLICATES>
    {
      "partial" : "this is a test of the emergency broadcast system"
    }
    <SNIP DUPLICATES>
    {
      "text" : "this is a test of the emergency broadcast system"
    }
    {
      "partial" : ""
    }
    <SNIP DUPLICATES>
    ^C
    Done
    
    opened by 13rac1 4
  • How to capitalize the proper names ?

    How to capitalize the proper names ?

    Hi,

    First, I'd like to congratulate Campbell Barton. Thank you very much for this wonderful script !

    Melbourne, Berlin, John, etc. are recognized with lower case first letter. If possible, who could write a script to add to nerd-dictation.py ? Unfortunately, I can't do it ! Thanks to you.

    opened by Lume6 3
  • punctuate-from-previous-timeout not punctuating

    punctuate-from-previous-timeout not punctuating

    The current documentation makes it seem like using this command should result in fully punctuated sentences:

    nerd-dictation begin --full-sentence  --continuous --punctuate-from-previous-timeout=2 --timeout=4 
    

    But instead I'm getting something like this: "Sentence oneSentence two"

    opened by jonulrich 3
  • Is it possible to output ctrl, shift... key strokes?

    Is it possible to output ctrl, shift... key strokes?

    Hi there, First of all, congrats on this tool, it's light-weight, simple, customizable, can be executed from emacs, just perfect. I was just wondering whether it was possible to convert a spoken command into a command with modifier keys (C-c C-c typically...!)? Cheers, Vian

    opened by myravian 4
  • Fix writing text immediately with --output STDOUT when --continuous is also enabled

    Fix writing text immediately with --output STDOUT when --continuous is also enabled

    Seems to me that --output STDOUT and --continuous should not defer writing text, this patch fixes that case.

    This allows stdout text to go to a pipe or named pipe immediately. Backspaces are also passed, when the text is corrected by Vosk.

    Here's a simple example of using a named pipe:

    mkfifo /tmp/nerdpipe
    nerd-dictation begin --output STDOUT --continuous >/tmp/nerdpipe
    

    In another terminal:

    while true; do 
        read -n 1000 -t 0.5 input </tmp/nerdpipe
        [[ -n "$input" ]] && echo "nerdpipe says: $input" 
    done
    

    Demo results:

    nerdpipe says: hello
    nerdpipe says: world
    nerdpipe says: this
    nerdpipe says: is a
    nerdpipe says: longer sentence
    nerdpipe says: goodbye
    

    I included a flush() on the existing handler, but I also had another version that checked for this condition and only then use flush(). Flushing stdout on every write shouldn't cause much harm, since we can only speak so fast :-)

    diff --git a/nerd-dictation b/nerd-dictation
    index 1d6b626..77e51d4 100755
    --- a/nerd-dictation
    +++ b/nerd-dictation
    @@ -1055,6 +1055,15 @@ def main_begin(
                     run_xdotool("key", ["BackSpace"] * delete_prev_chars)
                 run_xdotool("type", ["--", text])
    
    +    elif output == "STDOUT" and progressive:
    +
    +        def handle_fn(text: str, delete_prev_chars: int) -> None:
    +            if delete_prev_chars:
    +                sys.stdout.write("\x08" * delete_prev_chars)
    +            sys.stdout.write(text)
    +            sys.stdout.flush()
    +
    +  
         elif output == "STDOUT":
    
             def handle_fn(text: str, delete_prev_chars: int) -> None:
    
    opened by tpoindex 2
  • New possible output method and tts doubts

    New possible output method and tts doubts

    So I want to respeak my live recorded speech. That means: mic -> text -> sound. Or in other words: speech-to-text and then text-to-speech.

    The part for converting sounds from the microphone to text I achieve it thanks to nerd-dictation. The part for converting text to sound again I want to implement it thanks to festival.

    1 - I have sort of added a new output method to nerd-dictation. I call it file because it's meant to go into a file. My current work can be found at https://github.com/ruckard/nerd-dictation/tree/speech_to_file_v2 . As you can see I have not added a new option for this mode because I'm not sure if it's worth it.

    The current way that I run nerd-dictation is like this: ./nerd-dictation begin --vosk-model-dir=/home/playg/vosk-models/vosk-model-small-es-0.22 --full-sentence --punctuate-from-previous-timeout 1 --idle-time 0.5 --continuous --timeout 0.5 --output=STDOUT > /tmp/output_test_file.txt.

    Then I just tail -f /tmp/output_test_file.txt.

    2 - The current changes ( https://github.com/ruckard/nerd-dictation/commit/5acbd5468a294657b14ee5d832cc266afeb03c63 ) abuse the timeout option so that instead of exiting the program it processes the audio again and gives me another sentence. It also makes sure not to output new text if nothing else was said.

    The idea is to read every line (after \n is issued) and reproduce it thanks to festival.

    3 - Anyways in the end I have three questions for you:

    • Do you want me to send you a new file output mode which works as described pull request so that it gets added in the upstream project?
    • Would you accept a pull request about a new functionality that converts the text back to sound thanks to festival (or espeak-ng or similar tool)?
    • Do you know any other project that already does what I'm trying to do?

    Thank you very much for your feedback.

    opened by ruckard 2
Owner
Campbell Barton