Home Assistant speech to text with Asterisk and DeepSpeech

I live outside city limits, so we don't have a municipal garbage service. Instead we get to choose from a handful of private trash collection companies. The one we chose is fine as far as garbage collection goes, but their website is stuck in the dark ages. Their contact email is an AOL.com address, if that tells you something. When I want to find out if garbage collection has been delayed due to a holiday, I have to call their phone number and listen to a voice message. It's terribly inconvenient and a lot more difficult to integrate with my Home Assistant installation. But I can't let that stop me, now can I?

The general concept is:

  1. Asterisk starts a phone call and records it as a WAV file.
  2. This file is processed through DeepSpeech to create a text version.
  3. A script parses the text and posts the result to an MQTT server.

For convenience, I have uploaded all of the files below to GitHub: https://github.com/heytensai/homeassistant-asterisk-deepspeech.

The Asterisk Bit

We'll be making use of call files to initiate the call. Substitute your SIP dial string on the first line. Save this file somewhere on the Asterisk server.

Channel: SIP/trunk-name/number-to-dial
Context: garbageman-answer
Extension: s
Priority: 1

Next we add an Asterisk dialplan context which controls the flow of the call after it is answered. In this case it enables call recording, waits 20 seconds, and then hangs up. Also of note, I'm passing the "o" option to Monitor so that each leg is recorded separately; there's no point in keeping our outgoing audio.

exten => s,1,Verbose(1,Garbageman Answer)
 same => n,Wait(1)
 same => n,Set(FILENAME=garbageman)
 same => n,Monitor(wav,${FILENAME},o)
 same => n,Wait(20)
 same => n,Hangup()

And finally a script to copy the call file to the Asterisk spool and initiate the process. I'm calling this from a cron job once a week. After the call completes, the WAV file is copied to the web root of my Asterisk server. That's because I couldn't get DeepSpeech to install on a Raspberry Pi 4 with Python 3.10, so I'm running it on an x86 server elsewhere. If not for this, the process would be simpler.


#!/bin/sh
# Paths here are examples; adjust for your setup.
CALL=/path/to/garbageman.call
WAV=/var/spool/asterisk/monitor/garbageman-in.wav

cp -p "${CALL}" /var/spool/asterisk/outgoing

sleep 60

if [ -f "${WAV}" ]; then
	mv "${WAV}" /var/www/html/audio
fi
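For reference, the weekly crontab entry could look something like this (the day, time, and script path are placeholders for whatever you use):

```
# min hour dom mon dow  command
0     6    *   *   1    /usr/local/bin/garbageman-call.sh
```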

The DeepSpeech Bit

First, install DeepSpeech in a Python virtual environment. I put it in my home directory for convenience. As mentioned above, at the time of install there weren't wheels for Python 3.10 on ARM. I used Python 3.7 on x86 instead.

$ python3 -m venv deepspeech
$ cd deepspeech
$ source bin/activate
$ pip install deepspeech
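As a sanity check, you can run DeepSpeech directly against a WAV file before scripting anything. This assumes you've fetched the pre-trained English model and scorer from the project's 0.9.3 release (filenames as published by Mozilla):

```
$ curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.pbmm
$ curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-models.scorer
$ deepspeech --model deepspeech-0.9.3-models.pbmm \
             --scorer deepspeech-0.9.3-models.scorer \
             --audio garbageman-in.wav
```

The scorer is optional but noticeably improves accuracy. Note that the models expect 16-bit mono audio.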

Now you can call deepspeech from a script to convert audio to text and post it to an MQTT server. In my case, I'm specifically looking for the word "delay" in the text, which will indicate that there's a delay in the pickup schedule.



#!/bin/sh
# Hosts, paths, and the topic prefix are examples; adjust for your setup.
URL=http://asterisk.example.com/audio/garbageman-in.wav
WAV=/tmp/garbageman.wav
DSDIR=${HOME}/deepspeech
MQTT_HOST=mqtt.example.com
PREFIX=garbageman

wget -q -O "${WAV}" "${URL}"

if [ -s "${WAV}" ]; then
	TEXT=$(${DSDIR}/bin/deepspeech_text "${WAV}" | cut -c -200)
	mosquitto_pub -h "${MQTT_HOST}" -t "${PREFIX}/text" -r -m "${TEXT}"
	DELAY=0
	echo "${TEXT}" | grep -q "delay" && DELAY=1
	mosquitto_pub -h "${MQTT_HOST}" -t "${PREFIX}/delay" -r -m "${DELAY}"
fi

rm -f "${WAV}"
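On the Home Assistant side, a couple of MQTT entities can pick these topics up. A minimal sketch, assuming the topic prefix "garbageman" used above:

```yaml
mqtt:
  sensor:
    - name: "Garbageman Message"
      state_topic: "garbageman/text"
  binary_sensor:
    - name: "Garbageman Delay"
      state_topic: "garbageman/delay"
      payload_on: "1"
      payload_off: "0"
```

Since the script publishes with the retain flag (-r), Home Assistant will see the latest values even after a restart.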


I've been running this setup for about a month now and it's working adequately. I would rate the accuracy of DeepSpeech at about 70%. It's quite a bit less accurate than Google or others, but sufficient for my needs and worth the price. Speed is not quite realtime but not bad either.

The hardest part was having to use two separate systems due to the different CPU architectures. If not for that, it would have been almost simple.

