Speech Controlled Home Automation – Part 1 (Raspberry Pi/Windows IoT/C#)

“You know, however, with the Xbox One, I can control my entire entertainment system using voice commands. Up until now, I’ve had to use Leonard.” – Sheldon Cooper.

That quote is of particular relevance to me – much of the impetus for creating my home automation system was so that when my wife asked me to turn on the air conditioner, I could have the system do it instead.

I was quite pleased when I found out that I could use Windows 10 on the Raspberry Pi. In the past, I’d been using Arch Linux on my Raspberry Pi devices, with a Jabra Speak 510 UC Speakerphone. Through a Mono C# daemon, I listened to any audio coming in from the microphone, converted it to a stream and sent it to a C# Windows service. That Windows service would then take the data and feed it into the Microsoft Speech Platform. Assuming the audio matched the speech patterns I was looking for, my application would perform the appropriate task, synthesize a voice response, and stream the audio back to the Raspberry Pi, which would promptly play it out through the speakerphone. Thus I had a speech controlled home automation system.

[Diagram: Speech Overview]

This worked – and it worked reasonably well. Unfortunately, due to the age of the Microsoft speech runtime (2011), I was having some voice recognition quality issues that were difficult to fix quickly – quickly in the sense of the time you would reasonably wait for a response after speaking.

What did I try? Well – once the Microsoft Speech Runtime generated a speech match result, I’d capture the relevant piece of audio as a wave file. I could then amplify it, feed it back through the same speech engine – and if the two results matched then I’d take action. However, the system still got confused at times when it picked up conversation coming from the television.

As such, when I learned there would be a Windows 10 version, it meant I could try the new Windows 10 speech platform directly. This series of articles is about the Windows 10 IoT speech sensor application and the backend control service. The goal now is for all speech processing to be performed locally on the Raspberry Pi, with only the text-based command sent over the network via an API call to the backend control service – far more efficient than constantly streaming all microphone data over the network for processing. The backend control service then interfaces with the rest of the home automation system through a message bus – but more on the overall system later.

[Diagram: Speech Overview Updated]
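To give a sense of just how small that network hop is, here’s a minimal sketch of what posting the matched command to the backend might look like. The CommandClient class, the endpoint URL and the JSON shape are all placeholders I’ve invented for illustration – the real control service and its API are covered in later posts.

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

namespace SpeechTest
{
	// Hypothetical helper - only the matched tag and recognised text ever
	// cross the network; no audio leaves the Raspberry Pi.
	static class CommandClient
	{
		private static readonly HttpClient _httpClient = new HttpClient();

		public static async Task SendCommandAsync(string strTag, string strText)
		{
			// Placeholder endpoint and payload shape - not the real service API.
			string strJson = "{ \"tag\": \"" + strTag + "\", \"text\": \"" + strText + "\" }";

			using (StringContent content = new StringContent(strJson, Encoding.UTF8, "application/json"))
			{
				await _httpClient.PostAsync("http://homecontrol.local/api/command", content);
			}
		}
	}
}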

In a previous post, I walked through the basics of creating a Background Task application to run on the Windows 10 IoT Core operating system – on a Raspberry Pi 3. In this post, I’ll go through the basics of using the speech recognition and speech synthesis capabilities of the Windows 10 Universal Windows Platform. Essentially, we need to initialise the speech recognition engine, tell it which phrases it needs to listen for, and then process the results. All of this is done using the system default microphone and speaker devices.

There is one significant caveat, however. With the initial build of Windows 10 IoT, I was able to use my Jabra Speak 410 USB conference speakerphone. I’d been using this device back when I was running Linux for the speech capture system. It’s a good device – USB, no external power needed, and designed for listening to a room (being a speakerphone). However, the initial build of the IoT OS was unable to use it to generate audio. Great – I can listen, but I can’t talk – not overly handy for a device that you want to talk to.

I waited until the next build of Windows 10 IoT. The good news is – it fixed the audio generation issue – I could now generate and play audio through the speakerphone. The bad news is that I could no longer perform speech recognition. It seems the newer IoT builds have become a little more precious, er, specific regarding which microphone devices they are happy to use. The build still recognised my speakerphone as the system default microphone device, but the speech recognition engine would simply fail – it would immediately stop listening once I called the start function.

It turns out that others had been seeing something similar. I took some advice from this post, and bought a small Kinobo – Mini Akiro USB Microphone from Amazon and used that in conjunction with the Jabra device. That actually worked – it used the Kinobo microphone for speech recognition, and the Jabra speakerphone for the speech synthesis. That’s good for now, but I’m really hoping that as the IoT OS builds settle down a bit – my original device starts working again. Either that, or I need to find a new combo speakerphone that does work. This new pencil microphone isn’t really well suited to listening to the room.

Anyway, the code works – but only with specific microphone devices. All I can suggest is to keep trying the new builds to see if there’s any improvement in the device support.

I’ve created a new Background Task application, and used the following code.

The StartupTask.cs file looks fairly similar to the previous post – although I’ve removed logging and error handling commands for brevity.

using System;
using Windows.ApplicationModel.Background;

namespace SpeechTest
{
	public sealed class StartupTask : IBackgroundTask
	{
		private BackgroundTaskDeferral _deferral;

		public async void Run(IBackgroundTaskInstance taskInstance)
		{
			_deferral = taskInstance.GetDeferral();
			taskInstance.Canceled += TaskInstance_Canceled;

			Speech.Initialize();

			await Speech.LoadGrammar();

			await Speech.StartRecognition();
		}

		private async void TaskInstance_Canceled(IBackgroundTaskInstance sender, BackgroundTaskCancellationReason reason)
		{
			await Speech.StopRecognition();

			_deferral.Complete();
		}
	}
}

The Speech.cs file template is below; I’ll walk through the key functions separately.

using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Windows.Media.Playback;
using Windows.Media.SpeechRecognition;
using Windows.Media.SpeechSynthesis;

namespace SpeechTest
{
	class Speech
	{
		private static SpeechRecognizer _speechRecognizer;
		private static bool bSpeechRunning = false;

		public static void Initialize()
		{
			_speechRecognizer = new SpeechRecognizer();
		}

		// public static async Task<bool> LoadGrammar()
		// public static async Task<bool> StartRecognition()
		// public static async Task<bool> StopRecognition()
		// private async static void ContinuousRecognitionSession_ResultGenerated(SpeechContinuousRecognitionSession sender, SpeechContinuousRecognitionResultGeneratedEventArgs args)
		// public static async Task GenerateSpeech(string strSpeech)
	}
}

The speech recognition framework requires specific phrases to be registered – informing the engine which phrases it should attempt to match. You can specify multiple phrases in a group, and then associate them with a tag. This tag makes it easier to process the result later. Phrases can be supplied either in the format shown below, or through several other options (e.g. file-based input – a sketch follows the code).

public static async Task<bool> LoadGrammar()
{
	SpeechRecognitionCompilationResult resultCompilation;

	_speechRecognizer.Constraints.Add(new SpeechRecognitionListConstraint(new List<string>() { "can you hear me?", "can you hear what i'm saying?" }, "question_hear_me" ));
	_speechRecognizer.Constraints.Add(new SpeechRecognitionListConstraint(new List<string>() { "other phrases" }, "other_tag" ));
							
	resultCompilation = await _speechRecognizer.CompileConstraintsAsync();

	if (resultCompilation.Status == SpeechRecognitionResultStatus.Success)
		return true;
	else
	{
		// Handle Error
		return false;
	}
}
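For completeness, this is roughly what the file-based option mentioned above could look like – loading an SRGS grammar file that has been added to the application package. This is a sketch only: Grammar.xml and the file_grammar tag are made-up names, and my own system sticks with the list constraints shown above.

public static async Task<bool> LoadGrammarFromFile()
{
	// Load an SRGS grammar file packaged with the application (the file name is a placeholder).
	Windows.Storage.StorageFile fileGrammar =
		await Windows.Storage.StorageFile.GetFileFromApplicationUriAsync(new Uri("ms-appx:///Grammar.xml"));

	_speechRecognizer.Constraints.Add(new SpeechRecognitionGrammarFileConstraint(fileGrammar, "file_grammar"));

	SpeechRecognitionCompilationResult resultCompilation = await _speechRecognizer.CompileConstraintsAsync();

	return resultCompilation.Status == SpeechRecognitionResultStatus.Success;
}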

The StartRecognition function associates the ResultGenerated event handler, and asynchronously starts the recognition engine. The StopRecognition function stops the engine and clears the reference.

public static async Task<bool> StartRecognition()
{
	if (!bSpeechRunning)
	{
		try
		{
			_speechRecognizer.ContinuousRecognitionSession.ResultGenerated += ContinuousRecognitionSession_ResultGenerated;

			await _speechRecognizer.ContinuousRecognitionSession.StartAsync();

			bSpeechRunning = true;
		}
		catch (Exception eException)
		{
			// Handle Error
			return false;
		}				
	}

	return true;
}

public static async Task<bool> StopRecognition()
{
	if (bSpeechRunning)
	{
		try
		{
			await _speechRecognizer.ContinuousRecognitionSession.StopAsync();

			_speechRecognizer.Dispose();
			_speechRecognizer = null;

			bSpeechRunning = false;
		}
		catch (Exception eException)
		{
			// Handle Error
			return false;
		}
	}

	return true;
}

This function is the event handler invoked when the engine matches a phrase. Have a good read through the MSDN help for the topic – there’s more information available through the args parameter, e.g. confidence level or alternate matches (see the sketch after the handler). I then use the tag to determine which set of phrases was matched – and generate the response audio using the GenerateSpeech function shown further down.

private async static void ContinuousRecognitionSession_ResultGenerated(SpeechContinuousRecognitionSession sender, SpeechContinuousRecognitionResultGeneratedEventArgs args)
{
	if (args.Result.Confidence == SpeechRecognitionConfidence.Rejected)
		return;

	try
	{
		if (args.Result.Constraint.Tag == "question_hear_me")
			await GenerateSpeech("yes, I can hear you.");
	}
	catch (Exception eException)
	{
		// Handle Error
		await GenerateSpeech("I'm sorry, I was unable to process that request.");
	}
}
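If you want to dig further into a result before acting on it, the confidence level and alternate interpretations mentioned above are both available. The helper below is just a sketch – InspectResult is a name I’ve made up, and the choice of three alternates is arbitrary.

private static void InspectResult(SpeechRecognitionResult result)
{
	// Confidence is an enum: High, Medium, Low or Rejected.
	if (result.Confidence == SpeechRecognitionConfidence.Low)
	{
		// Ask the engine for up to three alternate matches it considered.
		foreach (SpeechRecognitionResult alternate in result.GetAlternates(3))
		{
			System.Diagnostics.Debug.WriteLine(alternate.Text + " (" + alternate.RawConfidence + ")");
		}
	}
}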

This function uses the speech synthesis engine to generate an audio response. I then use the BackgroundMediaPlayer to play the audio to the default system speaker. You can select which voice you want the system to use when generating speech – I went with Zira (the sketch after the code shows how to list the voices available on your device).

public static async Task GenerateSpeech(string strSpeech)
{
	SpeechSynthesizer synthesizer;
	SpeechSynthesisStream synthesisStream;

	try
	{
		synthesizer = new SpeechSynthesizer();

		foreach (VoiceInformation voice in SpeechSynthesizer.AllVoices)
		{
			if (voice.DisplayName == "Microsoft Zira Mobile")
				synthesizer.Voice = voice;
		}

		synthesisStream = await synthesizer.SynthesizeTextToStreamAsync(strSpeech);
				
		BackgroundMediaPlayer.Current.AutoPlay = true;
		BackgroundMediaPlayer.Current.SetStreamSource(synthesisStream);
		BackgroundMediaPlayer.Current.Play();
	}
	catch (Exception eException)
	{
		// Handle Error
	}
}
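If the requested voice name isn’t found, the synthesizer simply stays on the device default. A quick way to check what is actually installed on a given build is to enumerate SpeechSynthesizer.AllVoices – the diagnostic helper below is a sketch, and ListInstalledVoices is just a name I’ve picked.

public static void ListInstalledVoices()
{
	// Dump the display name, language and gender of every installed voice.
	foreach (VoiceInformation voice in SpeechSynthesizer.AllVoices)
	{
		System.Diagnostics.Debug.WriteLine(voice.DisplayName + " - " + voice.Language + " - " + voice.Gender);
	}
}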

There’s a lot more you can do to improve upon this – but it’s a good starting point. In future posts, I’ll go into more detail on the API and central service that I use for actually processing the conversations. The sample above has that code excluded for simplicity – it simply shows the speech recognition and speech synthesis.

Good luck, and I’m keen to hear what microphones you’ve found to work with this system.

~ Mike

