Editor’s note: The following post comes from Mark Vilrokx (@mvilrokx), one of my colleagues in Apps UX. Mark has been working on the Voice project that Ultan (@ultan) mentioned in his recap of the OUAB meeting last week, and as part of his work on Voice, Mark has done some serious hacking that I wanted to publicize. Happily, he agreed, and here we are.
Mark blogs at All Things Rails, and maybe he’ll reignite my Rails interest, which has waned since we stopped working on Connect. Anyway, enjoy Mark’s adventure, Build your own Siri Application … in the browser! and find the comments here on his original post.
I have immersed myself recently in voice-driven applications and was asked to knock up a quick prototype of something “that looks and acts like Siri”. That’s a pretty tall order, I thought, but after some research I came up with the following…
The first problem we have to solve is speech recognition, i.e. converting the voice data into text. The data would have to be streamed to a server, which then performs the actual recognition and sends back a string of what it thinks you said. That’s some complicated stuff right there. Voice recognition is a science in itself, and I also did not want to have to deal with the server setup. Luckily for me, it turns out that Google has already built all of this into the Chrome browser, courtesy of the HTML 5 Speech Input API. All you have to do is add a special attribute to an <input> and it will allow users to simply:
“click on an icon and then speak into your computer’s microphone. The recorded audio is sent to speech servers for transcription, after which the text is typed out for you.”
Sounds about right to me, first problem solved!
The second challenge is to extract meaningful information from the text to understand what the user wants you to do. When the user says “What is the weather forecast for tomorrow,” you have to figure out from this string that the user … well … wants to see the weather forecast for tomorrow. If this is the only case your application has to handle, it’s pretty easy:
unless utterance =~ /weather forecast/i
  return "I do not know what you mean, try asking again (e.g. what is the weather forecast for tomorrow)"
end
# otherwise, carry on and look up the weather forecast
But also pretty useless.
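To see how quickly this approach runs out of steam, here is a minimal Ruby sketch of pattern-based matching. The INTENTS table and the intent_for helper are made up for illustration; they are not part of the prototype.

```ruby
# Hypothetical intent table: each regular expression maps to an intent name.
INTENTS = {
  /weather forecast/i        => :weather_forecast,
  /forecast of the weather/i => :weather_forecast
}.freeze

# Return the first intent whose pattern matches the utterance, or nil.
def intent_for(utterance)
  INTENTS.each do |pattern, intent|
    return intent if utterance =~ pattern
  end
  nil
end

intent_for("What is the weather forecast for tomorrow") # => :weather_forecast
intent_for("Show me the forecast of the weather")       # => :weather_forecast
intent_for("Is it going to rain tomorrow?")             # => nil
```

Every new phrasing needs another pattern, and a question that never mentions the word “forecast” slips through entirely.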
Clearly you could not write a case statement big enough to handle all possible scenarios, or even a fairly limited one, e.g. what would happen if the user asks “Show me the forecast of the weather,” not to mention “Is it going to rain tomorrow?”. You can see that this processing of natural language can get fairly complicated very quickly. As it turns out, this is another field of science (Natural Language Processing, or NLP) that people much smarter than myself have worked on for decades. One example of a website that uses NLP to answer questions is WolframAlpha. And guess who uses WolframAlpha … that’s right: Siri. So if it is good enough for Apple, it’s certainly good enough for my prototype, so I registered for a developer’s licence with them and that was it (I suggest you do the same if you want to follow this article). Now I just needed to hook everything up, and I’m going to create a Rails application to do just that.
It will be a very simple application with one page that has one form on it. This form in turn will have one field on it that the user can use to “enter” their question. To support voice entry, I will add the required attribute (“x-webkit-speech”) to this input field. To further emphasize the fact that this is a voice-driven application, I am going to style the input field:
Using the following CSS:
Furthermore, that same page will have an area that displays the data: what the user says and what WolframAlpha returns as the answer. We call this the stream and represent it as an ordered list, which gives us the following (using HAML):
Incredibly simple! The user presses on the microphone and starts talking. When he stops talking, Google processes the voice data and returns a text representation (actually it returns several, ranked in decreasing order of “correctness”; we just always use the top result). It inserts the text into the text field on which the voice input was triggered; essentially, Chrome fills in the form for us with the transcribed voice data. This is all handled by Chrome, so we do not have to do anything for this to work.
When the result comes back from Google, Chrome also raises a JavaScript event that we can listen for. We will use this to trigger an AJAX call to WolframAlpha, passing in the received text; that is, we automatically submit the form to process_speech, a controller method that handles the call to WolframAlpha (I am using the Faraday gem). When we receive an answer from WolframAlpha, we append it to the stream (in CoffeeScript):
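# Function called when a speech recognition result is received.
speechChange = (e) ->
  # 1. pass received text to service that can interpret the text (using WolframAlpha right now)
  # 2. when this service returns, show results of this service in the stream
  if e.type == 'webkitspeechchange' && e.originalEvent.results
    topResult = e.originalEvent.results[0]
    adjustStream(topResult.utterance) # show what the user said (assumption)
    # submit the form to the proxy service
    # (assumption: the form is the one wrapping the speech input)
    form = $(e.target).closest('form')
    # (data, textStatus, jqXHR) -> adjustStream(data.queryresult.pod.subpod.plaintext),
    $.getJSON form.attr('action'), form.serialize(), (data, textStatus, jqXHR) ->
      if data.queryresult.success == 'true'
        subpod = data.queryresult.pod.subpod
        if subpod instanceof Array
          # several subpods came back: show the first one (assumption)
          adjustStream(subpod[0].plaintext)
        else
          adjustStream(subpod.plaintext)
      else
        adjustStream("I'm sorry, I didn't understand, please try again")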
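On the server side, process_speech has two small jobs around the HTTP call: building the query for WolframAlpha and picking the text to show out of the response. The Ruby sketch below covers only those two steps and leaves out the HTTP request itself (Faraday in my case). The helper names are hypothetical, and the endpoint and parameter names are assumptions based on the public v2 query API, so check them against your own developer account.

```ruby
require "uri"

# Hypothetical helper: build the WolframAlpha query URL for an utterance.
# Assumption: the v2 query endpoint with "input", "appid" and "format" parameters.
def wolfram_query_url(utterance, app_id)
  query = URI.encode_www_form(input: utterance, appid: app_id, format: "plaintext")
  "https://api.wolframalpha.com/v2/query?#{query}"
end

# Hypothetical helper: extract the answer text from a response that has already
# been parsed into nested hashes shaped like the data the CoffeeScript handler
# reads (queryresult -> pod -> subpod -> plaintext).
def answer_from(response)
  result = response["queryresult"] || {}
  return nil unless result["success"] == "true"

  # pod and subpod may each be a single hash or an array of hashes;
  # in the array case this sketch simply takes the first entry.
  pod = result["pod"]
  pod = pod.first if pod.is_a?(Array)
  subpod = pod && pod["subpod"]
  subpod = subpod.first if subpod.is_a?(Array)
  subpod && subpod["plaintext"]
end
```

A controller action could call wolfram_query_url with the submitted text, fetch that URL, parse the body and hand the result to answer_from; when it returns nil, the page falls back to the “please try again” message.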
And that is it, really: add some more CSS and some more CoffeeScript to make it look pretty and you are good to go. Siri in the browser in less than 150 lines of code. I haven’t had a chance to clean up the code yet, so it’s not public on GitHub, but here’s a video showing the end result.