As the old saying goes, “If Mohamed won’t come to the mountain, then the mountain will come to Mohamed.” English is an ambiguous language. Many words are pronounced the same yet spelt differently, certain letters are not pronounced at all, and then there is a whole host of accents.
The leader in voice recognition is Google, and while its recognition is remarkably good, voice recognition for the most part remains too unreliable to be used consistently. Can we blame Google for this failing? Yes and no.
Google have improved recognition reliability considerably by ditching voice personalization and opting instead for a cloud solution, powered by an algorithm that matches a voice pattern against popular inputs and selects the closest match. This has proved to be a far more accurate way to handle recognition and requires no voice training. However, it has its shortfalls.
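To make the "closest match against popular inputs" idea concrete, here is a rough sketch. It is only an illustration working on text, not audio, and it is not how Google's service is actually built. The phrase list, the function name closest_popular_phrase and the 0.6 cutoff are all made up for the example; the matching itself uses Python's standard difflib module.

```python
import difflib

# Hypothetical stand-in for the "popular inputs" a cloud service might hold.
# A real system would compare audio features against a vastly larger corpus.
POPULAR_PHRASES = [
    "what is the weather today",
    "what is the time in london",
    "where is the nearest station",
]


def closest_popular_phrase(heard, phrases=POPULAR_PHRASES, cutoff=0.6):
    """Return the popular phrase most similar to `heard`, or None.

    `cutoff` is an assumed similarity threshold between 0 and 1; anything
    scoring below it is treated as "no confident match".
    """
    matches = difflib.get_close_matches(heard.lower(), phrases, n=1, cutoff=cutoff)
    return matches[0] if matches else None


if __name__ == "__main__":
    # A slightly garbled input is pulled towards the nearest popular phrase.
    print(closest_popular_phrase("what is the whether today"))
```

The sketch also hints at the shortfall: an unusual but correct input that happens to sit close to a popular phrase gets pulled towards the popular phrase.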
The older method, used in Microsoft Speech Recognition or Dragon Software, gets a general idea of what you are trying to say, then attempts to improve accuracy by allowing you to train the software to suit your voice and accent. Unfortunately, this method also has its limitations.
A marriage of these systems could be the way forward. Just today I was trying to get Google to recognize the sentence “I am going cycling in the mountains”, which kept being transcribed as “I am going fighting to the mountains”. When I spoke the single word “cycling”, amazingly, Google recognized it. In a sentence, though, the cloud-based algorithms inadvertently break the recognition. I was disappointed. If only I could train Google to learn that one distinct-sounding word, and others, boy would this software be good!
No such luck.
Which got me thinking about the problem in a new way, and about the long-term future of recognition. Clearly what we all want is Star Trek quality recognition, so the question is: how are we going to get there? We are either going to need vastly more complex algorithms to handle the ambiguous input, or we need to strip away the ambiguity.
Teenagers long ago pirated the English language when it comes to text-based phone messages and Facebook posts. They’ve essentially created a shorthand version of the language, which they use to communicate quickly with their thumbs. Although I don’t condone destroying the English language, the approach is a smart one, and it results in A) quicker communication, B) cheaper communication and C) simpler, more comfortable communication.
Which got me thinking: why aren’t we doing the same for voice recognition? Is the situation beckoning for a digital Shakespeare to join us in our plight and invent new forms of “Puke” that are easier for less intelligent computer systems to recognize? Do we need a whole new set of synonyms that we can slot in and out interchangeably, which would ensure a computer actually gets it right 99.9% of the time?
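As a thought experiment, the synonym idea could be sketched like this. Everything here is hypothetical: the SUBSTITUTES table, the function name expand_substitutes, and the assumption that "biking" would be heard more reliably than "cycling" are invented purely for illustration, and no real recognizer is involved. The speaker says an agreed easy-to-recognize word, and the software swaps the intended word back in afterwards.

```python
# Hypothetical table: easy-to-recognize spoken word -> word the speaker means.
# Which words are actually "easy" would have to be measured per recognizer.
SUBSTITUTES = {
    "biking": "cycling",
}


def expand_substitutes(transcript, substitutes=SUBSTITUTES):
    """Replace each agreed spoken substitute with the word it stands for.

    Words not in the table pass through unchanged. Splitting on whitespace
    is a simplification that ignores punctuation and capitalisation.
    """
    return " ".join(substitutes.get(word, word) for word in transcript.split())


if __name__ == "__main__":
    print(expand_substitutes("i am going biking in the mountains"))
```

The table is deliberately one-way and tiny. A real vocabulary would need substitutes that do not collide with ordinary uses of the same words.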
This is definitely food for thought.