Let’s follow Mark as he tampers with Siri. You can find the comments here or on his original post.
A few months ago, just after Siri was released to the world, I got the uncontrollable urge to see what Siri could do for me. I was planning to leverage the Siri APIs and slap some Ruby code on them to scratch my itch. But my enthusiasm was quickly doused when I realized that Apple hadn’t released any public APIs and probably won’t be doing so for the foreseeable future.
Siri Proxy to the rescue!
Siri Proxy is a proxy server for Siri, written in Ruby. It sits between Apple’s servers and your iPhone, allowing you to intercept all traffic to and from the phone, no jailbreak required. It also has a handy plugin framework using standard Ruby gems, which means I could use it for my own nefarious purposes.
Setting up Siri Proxy is well documented in the README of the git repo, which includes several YouTube videos, so that is outside the scope of this post. Instead, I will delve deeper into a plugin I wrote for Siri Proxy and what I learned in the process about Siri and a voice-driven user experience.
The plan is to control my Logitech Squeezebox Radio from Siri, using just my voice. Given that the Squeezebox Server which controls the radio can be connected to with Telnet and queried from the command line, this shouldn’t be too hard. I will write a plugin for Siri Proxy that will listen for certain words and trigger calls to the radio. On with the show.
I first create an object that represents my radio. It will allow me to connect and talk to the radio.
The constructor connects to the default server and port of the radio (this can be configured) using the Telnet protocol. Once the object is instantiated, we can call any method we want on it. These calls get caught by the method_missing method, which takes the method name and parameters and passes them over Telnet to the radio. This simple construct allows us to call any known Squeezebox Server command on the Squeezebox object with very little code. As long as Squeezebox Server understands the command, i.e. as long as the method we call on the object is a known command, this will work. Now that we have this object, writing the actual plugin is peanuts.
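To make the description above concrete, here is a minimal sketch of such an object. It is a reconstruction under assumptions, not the original listing: the keyword arguments, the default port 9090, and the injectable connection are all illustrative choices, and a plain TCPSocket stands in for a Telnet client. A real server may also expect a player id in front of each command, which this sketch leaves out.

```ruby
require "socket"

# Illustrative sketch only: names and defaults are assumptions.
class Squeezebox
  # host and port are assumed defaults for the server's command-line interface.
  # Passing any object that responds to #puts and #gets as `connection`
  # lets the class be exercised without a live server.
  def initialize(host: "localhost", port: 9090, connection: nil)
    @connection = connection || TCPSocket.new(host, port)
  end

  # Forward any unknown method as a text command: the method name followed
  # by its arguments, separated by spaces. The server's reply is returned.
  def method_missing(name, *args)
    # Leave Ruby's implicit conversion hooks (to_ary, to_str, ...) alone so
    # they are never sent to the radio by accident.
    return super if name.to_s.start_with?("to_")

    @connection.puts([name, *args].join(" "))
    @connection.gets.to_s.chomp
  end

  def respond_to_missing?(name, include_private = false)
    !name.to_s.start_with?("to_") || super
  end
end
```

With this in place, a call such as radio.power(1) would send the line "power 1" and return whatever the server answers. Whether the server accepts a given command is entirely up to the server; the object itself knows nothing about the command set.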
When the plugin framework initializes, we create our Squeezebox object, which connects us to the radio. We then listen for certain phrases that come from Siri and trigger the appropriate commands on the radio object, which in turn passes them to the radio itself for execution. The plugin only supports three types of commands (radio on, radio off, and playing music from a particular artist), but you get the gist of it. It could easily be extended with many more commands, like forward, backward, etc.
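The listen-and-dispatch idea can be sketched without Siri Proxy itself. The VoicePlugin base class below is a hypothetical stand-in written for this illustration, not Siri Proxy's actual plugin API; the phrases, the reply strings, and the play_artist method on the radio object are likewise assumptions.

```ruby
# Hypothetical stand-in for a plugin base class; NOT Siri Proxy's real API.
class VoicePlugin
  def self.listeners
    @listeners ||= []
  end

  # Register a handler to run when the recognized text matches the pattern.
  def self.listen_for(pattern, &handler)
    listeners << [pattern, handler]
  end

  # Run the first handler whose pattern matches and return its reply,
  # or nil when no pattern matches.
  def process(text)
    self.class.listeners.each do |pattern, handler|
      match = pattern.match(text)
      return instance_exec(*match.captures, &handler) if match
    end
    nil
  end
end

class RadioPlugin < VoicePlugin
  # `radio` is any object that accepts the calls made below.
  def initialize(radio)
    @radio = radio
  end

  listen_for(/radio on/i) do
    @radio.power(1)
    "Turning the radio on"
  end

  listen_for(/radio off/i) do
    @radio.power(0)
    "Turning the radio off"
  end

  # The captured group is handed to the block as the artist name.
  listen_for(/play (?:some )?music (?:by|from) (.+)/i) do |artist|
    @radio.play_artist(artist) # assumed method name on the radio object
    "Playing #{artist}"
  end
end
```

Calling RadioPlugin.new(radio).process("Radio on") would invoke radio.power(1) and return the reply text. Adding a command means adding one more listen_for block, which is why extending the vocabulary is cheap in code even though matching what users actually say is not.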
So, what have we learned? Well, it turns out that it is ridiculously easy to listen in on traffic to and from Apple’s servers, borderline dangerous I’d say. If you have control over the DNS server, you can listen in on ALL Siri traffic. You can also issue commands to the phone (e.g. send an SMS on the user’s behalf) and the user would be completely unaware. Fun for you, not so much for the user. In an enterprise setting, this would obviously be completely unacceptable and I can therefore not recommend this approach to anybody trying to build a business around this idea (are you listening in for?). However, for home use, this is a lot of fun and quite useful.
As for User Experience, I think voice has a bright future, but the example I created exposes the Achilles’ heel of the whole concept: understanding exactly what the user is trying to do. “Radio on” and “Radio off” are simple commands, but even those could be expressed by users in an infinite number of ways. A polite user might ask “Could you please turn the radio off?”, while a not so polite user might shout “Shut up!”. Some defensive coding and clever regex’ing might help here, but you can easily see that as the vocabulary expands, this will become very difficult very quickly indeed. What makes for a good User Experience is not the conversion of voice into text, but extracting intent from that text. If you cannot do that, your application will fail, no matter how good it is at converting voice to text. I will delve a bit deeper into this processing of (natural) language in a future post.
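As a rough illustration of the regex approach and its limits, the sketch below maps a few phrasings onto two intents. The pattern table and the extract_intent helper are hypothetical and written only for this example; they are not part of the plugin.

```ruby
# Hypothetical pattern table: each intent lists the phrasings it recognizes.
INTENT_PATTERNS = {
  radio_off: [
    /\bradio\b.*\boff\b/i,
    /\bturn\b.*\boff\b.*\bradio\b/i,
    /\bshut up\b/i
  ],
  radio_on: [
    /\bradio\b.*\bon\b/i,
    /\bturn\b.*\bon\b.*\bradio\b/i
  ]
}.freeze

# Return the first intent with a matching pattern, or :unknown.
def extract_intent(text)
  INTENT_PATTERNS.each do |intent, patterns|
    return intent if patterns.any? { |pattern| pattern.match?(text) }
  end
  :unknown
end
```

The polite and the rude request from above both resolve to :radio_off, yet an equally reasonable "Switch on the radio" falls through to :unknown because no pattern anticipates it. Every new phrasing needs another pattern, which is the scaling problem described above.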
Here is a video of the plugin in action on my own radio.
All the code, including installation instructions, is available on GitHub.