Simple, hackable offline speech to text - using the VOSK-API.

Overview

Nerd Dictation

Offline Speech to Text for Desktop Linux.

This is a utility that provides simple access to speech-to-text for use in Linux without being tied to a desktop environment.

Simple
This is a single file Python script with minimal dependencies.
Hackable
User configuration lets you manipulate text using Python string operations.
Zero Overhead
As this relies on manual activation, there are no background processes.

Dictation is accessed manually with begin/end commands.

This uses the excellent vosk-api.

Usage

It is suggested to bind begin/end/cancel to shortcut keys.

nerd-dictation begin
nerd-dictation end

For details on how this can be used, see: nerd-dictation --help and nerd-dictation begin --help.

Features

Specific features include:

Numbers as Digits

Optional conversion from numbers to digits.

So Three million five hundred and sixty second becomes 3,000,562nd.

A series of numbers (such as reciting a phone number) is also supported.

So Two four six eight becomes 2,468.

Time Out
Optionally end speech-to-text early when no speech is detected for a given number of seconds (avoiding the explicit call to end that is otherwise required).
Output Type
Output can simulate keystroke events (default) or simply print to the standard output.
User Configuration Script
User configuration is just a Python script which can be used to manipulate text using Python's full feature set.

See nerd-dictation begin --help for details on how to access these options.

Dependencies

  • Python 3.
  • The VOSK-API.
  • parec command (for recording from PulseAudio).
  • xdotool command to simulate keyboard input.

Install

pip3 install vosk
git clone https://github.com/ideasman42/nerd-dictation.git
cd nerd-dictation
wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.15.zip
unzip vosk-model-small-en-us-0.15.zip
mv vosk-model-small-en-us-0.15 model

To test dictation:

./nerd-dictation begin --vosk-model-dir=./model &
# Start speaking.
./nerd-dictation end
  • Reminder that it's up to you to bind begin/end/cancel to actions you can easily access (typically key shortcuts).

  • To avoid having to pass the --vosk-model-dir argument, copy the model to the default path:

    mkdir -p ~/.config/nerd-dictation
    mv ./model ~/.config/nerd-dictation

Hint

Once this is working properly you may wish to download one of the larger language models for more accurate dictation. They are available from the VOSK models page: https://alphacephei.com/vosk/models

Configuration

This is an example of a trivial configuration file which simply makes the input text uppercase.

# ~/.config/nerd-dictation/nerd-dictation.py
def nerd_dictation_process(text):
    return text.upper()

A more comprehensive configuration is included in the examples/ directory.
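As a further illustration, here is a sketch of a configuration that capitalizes the pronoun "i" and converts a few spoken punctuation words into symbols. The WORD_REPLACE table and its entries are hypothetical examples, not part of nerd-dictation; the only contract is that nerd_dictation_process takes a string and returns a string.

```python
# ~/.config/nerd-dictation/nerd-dictation.py
# Hypothetical word table: adjust entries to taste.
WORD_REPLACE = {
    "i": "I",
    "comma": ",",
    "period": ".",
}

def nerd_dictation_process(text):
    # Replace whole words only; splitting on spaces keeps
    # substrings such as "in" or "icon" untouched.
    words = text.split(" ")
    words = [WORD_REPLACE.get(w, w) for w in words]
    return " ".join(words)
```

Note this naive version leaves a space before punctuation ("hello ,"); the more comprehensive configuration in examples/ is a better starting point for real use.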

Hints

  • The processing function can be used to implement your own actions using keywords of your choice. Simply return a blank string if you have implemented your own text handling.
  • Context sensitive actions can be implemented using command line utilities to access the active window.
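For instance, a keyword-dispatch configuration might look like the following sketch. The COMMANDS table and the open_browser helper are hypothetical; only the nerd_dictation_process entry point is nerd-dictation's actual interface, and returning a blank string prevents any text from being typed.

```python
# ~/.config/nerd-dictation/nerd-dictation.py
import subprocess

def open_browser():
    # Assumption: xdg-open is available on this system.
    subprocess.Popen(["xdg-open", "https://example.com"])

# Hypothetical table mapping spoken phrases to actions.
COMMANDS = {
    "browser": open_browser,
}

def nerd_dictation_process(text):
    action = COMMANDS.get(text.strip())
    if action is not None:
        action()
        return ""  # blank string: the text was handled as a command
    return text  # anything else is typed out as usual
```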

Paths

Local Configuration
~/.config/nerd-dictation/nerd-dictation.py
Language Model

~/.config/nerd-dictation/model

Note that --vosk-model-dir=PATH can be used to override the default.

Details

  • Typed results will never press enter/return.
  • PulseAudio is used for recording.
  • Recording and speech-to-text are performed in parallel.

Examples

Store the result of speech to text as a variable in the shell:

SPEECH="$(nerd-dictation begin --timeout=1.0 --output=STDOUT)"

Limitations

  • Text from VOSK is all lower-case; the user configuration can be used to set the case of common words such as I, but this isn't very convenient (see the example configuration for details).

  • For some users the start-up delay may be noticeable on systems with slower hard disks, especially when running for the first time (a cold start).

    This is a limitation of the choice not to use a background service. To mitigate the problem, recording begins before any of the speech-to-text components are loaded.

Further Work

  • Add a general solution to capitalize words (proper nouns, for example).
  • Preview output while dictating.
  • Wayland support (this should be quite simple to support and mainly relies on a replacement for xdotool).
  • Add a setup.py for easy installation on users' systems.
  • Possibly other speech to text engines (only if they provide some significant benefits).
  • Possibly support Windows & macOS.
Comments
  • Packaging

    Packaging

    Hello, I have the idea to package nerd-dictation for Pypi.org. I tested adding a setup.py and setup.cfg file, treating the nerd-dictation file as a module with a console-script entry. At this step I'm facing a problem: the name nerd-dictation is not allowed because of the dash, as `import nerd-dictation` generates a syntax error. Could the name be changed to nerd_dictation instead of nerd-dictation? I haven't yet explored another way that installs the nerd-dictation script directly rather than as a module/console script. What do you think about that? The background idea is to distribute it for easy installation with pip install, and also so that elograf can require it as a dependency.

    opened by papoteur-mga 8
  • xdotool: freezes the OS

    xdotool: freezes the OS

    When I run the program by assigning the command "nerd-dictation begin --timeout 1 --numbers-as-digits --numbers-use-separator" to a custom Keyboard Shortcut on Ubuntu 20.04 it seems to be freezing every single time. Any fixes for this? It seems to behave like a memory leak, It completely crashes the OS.

    opened by 52617365 8
  • Add shell.nix and package vosk

    Add shell.nix and package vosk

    Greetings, started playing around with this the other day.. I run NixOS so I had to lay some groundwork first.. Thought others might appreciate it too.

    It just drops you into a nix-shell with the required packages so you can run nerd-dictation. It packages a couple of the English models. Easy enough to copy for other language models though. :)

    opened by mankyKitty 6
  • Is using nerd-dictation to control software a solved problem?

    Is using nerd-dictation to control software a solved problem?

    I want to use nerd-dictation for processing my photos, basically:

    • show photo
    • wait for command (next previous delete promote)
    • if command is detected: show what was detected (or produce sound feedback?), execute action

    I am not entirely sure what would be the best way to implement this - has anyone done something like that already? It seems a relatively obvious use of actually working voice-to-text.

    (maybe using nerd-dictation is a mistake and I should be using vosk API directly?)

    question 
    opened by matkoniecz 6
  • No keystrokes appear in LibreOffice Writer

    No keystrokes appear in LibreOffice Writer

    With some sort of recent upgrade of either Ubuntu or LibreOffice, I have noticed that I cannot use nerd-dictation in LibreOffice Writer. No text appears. nerd-dictation works fine with Chrome or Thunderbird windows. It did not used to be this way. I have upgraded from Ubuntu 18 to 21.10 recently, so perhaps there was some sort of change in that period; maybe there's some sort of security policy that prevents simulated keystrokes? Just a guess. LibreOffice is 7.2.3.2.

    opened by xenotropic 5
  • Russian input lags entire interface

    Russian input lags entire interface

    Russian input lags entire interface. But some programs (Blender for example) don't lag at all (also Blender usually launched in fullscreen). English input works fine. Model: "vosk-model-small-ru-0.22"

    opened by scaledteam 5
  • What is the correct format for --pulse-device-name?

    What is the correct format for --pulse-device-name?

    First off - thank you. This is precisely what I have been looking for. Great work here!

    I want to ensure that the program is using the right microphone - I want to make sure it uses the external one, not the one on my laptop. Running pactl list gives me a WHOLE slew of stuff, but I think this is the chunk I'm most interested in, since it lists my external microphone:

    Card #2
    	Name: alsa_card.usb-BLUE_MICROPHONE_Blue_Snowball_201603-00
    	Driver: module-alsa-card.c
    	Owner Module: 28
    	Properties:
    		alsa.card = "1"
    		alsa.card_name = "Blue Snowball"
    		alsa.long_card_name = "BLUE MICROPHONE Blue Snowball at usb-0000:00:14.0-3, full speed"
    		alsa.driver_name = "snd_usb_audio"
    		device.bus_path = "pci-0000:00:14.0-usb-0:3:1.0"
    		sysfs.path = "/devices/pci0000:00/0000:00:14.0/usb1/1-3/1-3:1.0/sound/card1"
    		udev.id = "usb-BLUE_MICROPHONE_Blue_Snowball_201603-00"
    		device.bus = "usb"
    		device.vendor.id = "0d8c"
    		device.vendor.name = "C-Media Electronics, Inc."
    		device.product.id = "0005"
    		device.product.name = "Blue Snowball"
    		device.serial = "BLUE_MICROPHONE_Blue_Snowball_201603"
    		device.string = "1"
    		device.description = "Blue Snowball"
    		module-udev-detect.discovered = "1"
    		device.icon_name = "audio-card-usb"
    	Profiles:
    		input:mono-fallback: Mono Input (sinks: 0, sources: 1, priority: 1, available: yes)
    		input:multichannel-input: Multichannel Input (sinks: 0, sources: 1, priority: 1, available: yes)
    		off: Off (sinks: 0, sources: 0, priority: 0, available: yes)
    	Active Profile: input:mono-fallback
    	Ports:
    		analog-input-mic: Microphone (priority: 8700, latency offset: 0 usec)
    			Properties:
    				device.icon_name = "audio-input-microphone"
    			Part of profile(s): input:mono-fallback
    		multichannel-input: Multichannel Input (priority: 0, latency offset: 0 usec)
    			Part of profile(s): input:multichannel-input
    

    I have tried feeding the "Name" value (alsa_card.usb-BLUE_MICROPHONE_Blue_Snowball_201603-00), the udev.id, and the device.icon_name (longshot) into the CLI, each time getting the error Stream error: No such entity. If I don't include the --pulse-device-name, dictation works fine, but I want to ensure it's getting the best input possible.

    Which of the values from the pactl list output should we use for that flag? Or is there another value further up in the stream - i.e. not "Card #2' - that I should be looking at?

    Thanks!

    opened by vrrobz 5
  • pa_context_connect() failed: Connection refused

    pa_context_connect() failed: Connection refused

    Hi, I'm trying to run nerd-dictation on Kubuntu 20.04. I created a virtualenv, activated it and installed vosk with pip3.

    I'm running nerd-dictation as the root user and I get

    ./nerd-dictation begin --vosk-model-dir=./model &
    pa_context_connect() failed: Connection refused
    

    (the process still runs in the background). What is causing this error? Am I missing something?

    If I try to run it as a normal user, I get a permission error:

      File "./nerd-dictation", line 1188, in <module>
        main()
      File "./nerd-dictation", line 1184, in main
        args.func(args)
      File "./nerd-dictation", line 1107, in <lambda>
        func=lambda args: main_begin(
      File "./nerd-dictation", line 747, in main_begin
        touch(path_to_cookie)
      File "./nerd-dictation", line 65, in touch
        os.utime(filepath, None)
    PermissionError: [Errno 13] Permission denied
    

    I tried to change the ownership of the main folder and the model/ folder so they belong to my current user, but I still get the error. I notice the error mentions a "path_to_cookie" but I have no idea what path that could be.

    opened by sirio81 5
  • Lots of numbers being spit out

    Lots of numbers being spit out

    Thank you for writing this interesting project. It's running, but it's spitting out a lot of garbage along with the text.

    ❯ ./nerd-dictation begin 0.09997663497924805 0.09870014190673829 0.09955344200134278 0.09974346160888672 0.09971175193786622 0.0929502010345459 0.09946784973144532 0.09947595596313477 0.0925527572631836 0.09944138526916504 0.09245476722717286 0.09949836730957032 0.09236202239990235 0.09945592880249024 0.09939346313476563 0.0923090934753418 0.09901008605957032 THIS0.09907612800598145 IS0.039521551132202154 0.09932670593261719 0.09929046630859376 ANOTHER0.07741460800170899 0.09929213523864747 0.09936389923095704 TERRORIST0.015120840072631841 0.09926352500915528 ST0.09925565719604493 0.0896986484527588 0.09934697151184083 0.09947404861450196 0.09257588386535645 0.09938035011291504 0.09136066436767579 0.09934458732604981 0.06850967407226563 0.09943637847900391 0.09936747550964356 0.09154710769653321 0.09944114685058594 0.09195122718811036 0.09947142601013184

    How do I suppress all these logits?

    opened by MikeyBeez 5
  • 'huh' outputted after exiting

    'huh' outputted after exiting

    Hello, thanks for creating this project! Very cool.

    I've noticed that huh is outputted after I stop nerd-dictation from running. Maybe outputted twice? I'm using the small English model from the install instructions.


    opened by makeworld-the-better-one 4
  • English text is out of order and includes extra characters

    English text is out of order and includes extra characters

    The characters are strangely out of order. Using the vosk-model-en-us-0.22-lgraph.zip model. Saying "This is a test of the emergency broadcast system" multiple times:

    $ ./nerd-dictation begin --vosk-model-dir=./model --timeout=1.0
    this i tstfesa  o the mycnegeer broactdas ysstem
    $ ./nerd-dictation begin --vosk-model-dir=./model --timeout=1.0
    tihs is  atesoft  theem ergencbortsy adca systme
    $ ./nerd-dictation begin --vosk-model-dir=./model --timeout=1.0
    this is a se oftt the mereg aorbnecscdyta system
    

    The Vosk API test_microphone.py works correctly:

    $ python3 test_microphone.py
    LOG (VoskAPI:ReadDataFiles():model.cc:213) Decoding params beam=13 max-active=7000 lattice-beam=6
    LOG (VoskAPI:ReadDataFiles():model.cc:216) Silence phones 1:2:3:4:5:11:12:13:14:15
    LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
    LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
    LOG (VoskAPI:CompileLooped():nnet-compile-looped.cc:345) Spent 0.089 seconds in looped compilation.
    LOG (VoskAPI:ReadDataFiles():model.cc:248) Loading i-vector extractor from model/ivector/final.ie
    LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
    LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
    LOG (VoskAPI:ReadDataFiles():model.cc:281) Loading HCL and G from model/graph/HCLr.fst model/graph/Gr.fst
    LOG (VoskAPI:ReadDataFiles():model.cc:302) Loading winfo model/graph/phones/word_boundary.int
    ################################################################################
    Press Ctrl+C to stop the recording
    ################################################################################
    {
      "partial" : ""
    }
    <SNIP DUPLICATES>
    {
      "partial" : "this"
    }
    {
      "partial" : "this"
    }
    {
      "partial" : "this is"
    }
    {
      "partial" : "this is a"
    }
    {
      "partial" : "this is a test of"
    }
    {
      "partial" : "this is a test of"
    }
    {
      "partial" : "this is a test of the"
    }
    {
      "partial" : "this is a test of the emergency"
    }
    {
      "partial" : "this is a test of the emergency broadcast"
    }
    <SNIP DUPLICATES>
    {
      "partial" : "this is a test of the emergency broadcast system"
    }
    <SNIP DUPLICATES>
    {
      "text" : "this is a test of the emergency broadcast system"
    }
    {
      "partial" : ""
    }
    <SNIP DUPLICATES>
    ^C
    Done
    
    opened by 13rac1 4
  • How to capitalize the proper names ?

    How to capitalize the proper names ?

    Hi,

    First, I'd like to congratulate Campbell Barton. Thank you very much for this wonderful script !

    Melbourne, Berlin, John, etc. are recognized with lower case first letter. If possible, who could write a script to add to nerd-dictation.py ? Unfortunately, I can't do it ! Thanks to you.

    opened by Lume6 3
  • punctuate-from-previous-timeout not punctuating

    punctuate-from-previous-timeout not punctuating

    The current documentation makes it seem like using this command should result in fully punctuated sentences:

    nerd-dictation begin --full-sentence  --continuous --punctuate-from-previous-timeout=2 --timeout=4 
    

    But instead I'm getting something like this: "Sentence oneSentence two"

    opened by jonulrich 3
  • Is it possible to output ctrl, shift... key strokes?

    Is it possible to output ctrl, shift... key strokes?

    Hi there, First of all, congrats on this tool, it's light-weight, simple, customizable, can be executed from emacs, just perfect. I was just wondering whether it was possible to convert a spoken command into a command with modifier keys (C-c C-c typically...!)? Cheers, Vian

    opened by myravian 4
  • Fix writing text immediately with --output STDOUT when --continuous is also enabled

    Fix writing text immediately with --output STDOUT when --continuous is also enabled

    Seems to me that --output STDOUT and --continuous should not defer writing text, this patch fixes that case.

    This allows stdout text to go to a pipe or named pipe immediately. Backspaces are also passed, when the text is corrected by Vosk.

    Here's a simple example of using a named pipe:

    mkfifo /tmp/nerdpipe
    nerd-dictation begin --output STDOUT --continuous >/tmp/nerdpipe
    

    In another terminal:

    while true; do 
        read -n 1000 -t 0.5 input </tmp/nerdpipe
        [[ -n "$input" ]] && echo "nerdpipe says: $input" 
    done
    

    Demo results:

    nerdpipe says: hello
    nerdpipe says: world
    nerdpipe says: this
    nerdpipe says: is a
    nerdpipe says: longer sentence
    nerdpipe says: goodbye
    

    I included a flush() on the existing handler, but I also had another version that checked for this condition and only then use flush(). Flushing stdout on every write shouldn't cause much harm, since we can only speak so fast :-)

    diff --git a/nerd-dictation b/nerd-dictation
    index 1d6b626..77e51d4 100755
    --- a/nerd-dictation
    +++ b/nerd-dictation
    @@ -1055,6 +1055,15 @@ def main_begin(
                     run_xdotool("key", ["BackSpace"] * delete_prev_chars)
                 run_xdotool("type", ["--", text])
    
    +    elif output == "STDOUT" and progressive:
    +
    +        def handle_fn(text: str, delete_prev_chars: int) -> None:
    +            if delete_prev_chars:
    +                sys.stdout.write("\x08" * delete_prev_chars)
    +            sys.stdout.write(text)
    +            sys.stdout.flush()
    +
    +  
         elif output == "STDOUT":
    
             def handle_fn(text: str, delete_prev_chars: int) -> None:
    
    opened by tpoindex 2
  • New possible output method and tts doubts

    New possible output method and tts doubts

    So I want to respeak my live recorded speech. That means: mic -> text -> sound. Or in other words: speech-to-text and then text-to-speech.

    The part for converting sounds from the microphone to text I achieve it thanks to nerd-dictation. The part for converting text to sound again I want to implement it thanks to festival.

    1 - I have sort of added a new output method to nerd-dictation. I call it file because it's meant to go into a file. My current work can be found at https://github.com/ruckard/nerd-dictation/tree/speech_to_file_v2 . As you can see I have not added a new option for this mode because I'm not sure if it's worth it.

    The current way that I run nerd-dictation is like this: ./nerd-dictation begin --vosk-model-dir=/home/playg/vosk-models/vosk-model-small-es-0.22 --full-sentence --punctuate-from-previous-timeout 1 --idle-time 0.5 --continuous --timeout 0.5 --output=STDOUT > /tmp/output_test_file.txt.

    Then I just tail -f /tmp/output_test_file.txt.

    2 - The current changes ( https://github.com/ruckard/nerd-dictation/commit/5acbd5468a294657b14ee5d832cc266afeb03c63 ) abuse the timeout option so that instead of exiting the program it processes the audio again and gives me another sentence. It also makes sure not to output new text if nothing else was said.

    The idea is to read every line (after \n is issued) and reproduce it thanks to festival.

    3 - Anyways in the end I have three questions for you:

    • Do you want me to send you a new file output mode which works as described pull request so that it gets added in the upstream project?
    • Would you accept a pull request about a new functionality that converts the text back to sound thanks to festival (or espeak-ng or similar tool)?
    • Do you know any other project that already does what I'm trying to do?

    Thank you very much for your feedback.

    opened by ruckard 2
Owner
Campbell Barton