User experience design as a career came of age largely in the era of the GUI. Thus most people in my profession are visual thinkers, if not by birth then by experience. When it comes to presenting information, we think visualization. Times are changing, and with that we are challenged to present information verbally. This is where text analytics meets UX. I have worked on only a handful of projects that are about text, and with only a handful of text technologies, but the experience has been worth mentioning.
Text analytics, more or less meaning the same as text mining, is “devising of patterns and trends from text through means such as statistics…” (Oh, Wikipedia!)
There are many areas of text analytics: text summarization, information retrieval, sentiment analysis, named entity recognition, and so on. The tools and techniques are constantly getting better, which is exciting. I get the impression that text mining companies are intoxicated with the coolness of the technologies they build, so they think of the technology first and of possible industry applications later. As I am conditioned to think in the opposite direction, it was interesting for me to see how the same technique can be so useful in one case and completely irrelevant in another.
Here is my use case inventory. Take a brand manager versus a sales representative. A brand manager might like daily sentiment analysis of her brands and those of her competitors. On the other hand, the sales representatives we have interviewed are not at all into sentiment analysis. What they look for is highly tuned searches that would brief them daily on what’s happening with their top clients. They also search for industry news that they can retweet in the hope of influencing those clients. A money manager might need text analytics to put a jump in a stock price into context, while a marketer would rather have a predictive text mining tool to target customers for a purchasing recommendation. I often research different design topics and am interested in text analytics that would let me see at a glance what a collection of papers or articles is about. I also like to see daily summaries of trending topics in design and technology.
So the first lesson I’ve learned is how all text analytics use cases are different.
The second lesson is that the devil is in the details.
For one of my projects, I wanted a condensed representation of press coverage for the new release of HCM applications, specifically of its user experience. For my purposes, I wanted it as a cloud of words. I collected a number of press releases and reviews, and fed them through the four text analytics tools I could get my hands on, namely Semantria, OpenCalais, TagCrowd, and Oracle’s own Social Relationship Management (SRM) Listen and Analyze.
Here are the results.
To keep the comparison fair, I have stripped the lists of their original formatting (the products have drastically different interfaces) and limited the results to 20 items. Moreover, some packages categorize the results into “themes,” “entities,” etc., so I had to either pick one category or merge them. SRM doesn’t let me feed it a corpus of text to analyze, so I had to create a search query about OAUX instead.
You can see that the differences are dramatic. I believe some differences are the result of subtle choices made by the product designers: frequency thresholds, the parts of speech included, the choice of one-, two-, or three-word phrases, etc. Other differences are the result of the actual algorithms underneath: bag of words, word vectors, neural nets, skip-grams, chaining, deep learning, and so on. At first, I was determined to figure them all out. I quickly realized that there was no way I could get through the math of it. So I decided to approach it the way I would a chocolate tasting. If I like the taste, I’ll make an effort to read the ingredients.
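To make the "subtle choices" concrete, here is a minimal Python sketch of how phrase length and a frequency threshold alone can change a top-terms list. It is not how any of the four products works; the function name top_phrases, its parameters, and the sample text are all made up for illustration.

```python
from collections import Counter
import re

def top_phrases(text, n=1, min_count=1, limit=20):
    """Count n-word phrases and keep those seen at least min_count times.

    n, min_count and limit are hypothetical knobs standing in for the
    phrase-length, frequency-threshold and list-size choices a designer makes.
    """
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n words across the text to build candidate phrases.
    phrases = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    counts = Counter(phrases)
    return [(p, c) for p, c in counts.most_common(limit) if c >= min_count]

# Made-up sample text, not taken from the actual press coverage.
sample = (
    "The new user experience is simple. Reviewers praised the user experience "
    "and the simple mobile design. The mobile design feels modern."
)

for n in (1, 2):
    print(n, top_phrases(sample, n=n, min_count=2))
```

With single words, "the" tops the list; with two-word phrases, "user experience" and "mobile design" surface instead. Same text, same counting, very different cloud.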
Semantria I liked the most. I liked the combination of themes and entities; I thought the length of the phrases was well balanced. I read the ingredients. Instead of plain word frequencies, Semantria uses something called “lexical chaining” to score themes. “The algorithm takes context and noun-phrase placement into account when scoring themes.” I put “lexical chaining” high on my list of likes.
OpenCalais looked totally solid, though heavy on terms and nouns, and light on themes and adjectives. This is no surprise, as Named Entity Recognition is OpenCalais’ core competency, and there it is unsurpassed. The new “generic relations” feature, in the shape of a “subject-predicate-object,” is amazing.
TagCrowd’s result was definitely too plain to represent what the collection is about. This is a very simple, well-meaning word frequency tool, with stop-word removal (“the,” “a,” and the like) being its only “lexical analysis” feature. From TagCrowd I’ve learned that word frequencies can take you only so far.
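For the curious, a word frequency tool of this kind can be sketched in a few lines of Python. This is my own guess at the general idea, not TagCrowd's actual code; the stop-word list, the function name word_cloud_weights, and the sample sentence are invented for the example.

```python
from collections import Counter
import re

# A tiny illustrative stop-word list; real tools ship much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}

def word_cloud_weights(text, stop_words=STOP_WORDS, limit=20):
    """Return the most frequent words after dropping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(limit)

# Made-up review text for illustration.
review = "The design is clean and the design is fast. A clean release."
print(word_cloud_weights(review))
```

Every word is counted on its own, with no sense of context or phrase, which is exactly why the resulting cloud feels so plain.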
Finally, there is SRM. SRM uses latent semantic analysis, which is a type of vector-space technique.
And what’s your favorite?