What’s the Best Way to Collect Ratings?

Photo by anne.oeldorfhirsch from Flickr used under Creative Commons

YouTube published (h/t TechCrunch) an interesting graph of its video ratings earlier in the week.

YouTube uses a five-star scale for rating videos, and according to them, rating a video one star means you “loathe” it, while rating a video five stars means you love it.

The data show that an overwhelming majority of the total ratings are five-star, with one-star a very distant second. Use of two-, three-, and four-star ratings was negligible, rounding-error small.

These data expose a few problems with using a five-star system for rating artifacts. First off, it seems counter-intuitive to give something any stars if I “loathe” it, and more broadly, YouTube’s users seem motivated only to reward excellence, making the rating system essentially worthless.

These behaviors point to a Digg/Bury (vote up/down) model working better, and this is likely due to a fair amount of overlap between heavy YouTube and Digg users. I wonder if YouTube will change models, or try to motivate users to be less binary.

We’ve had similar frustrations with voting models with Connect.

The IdeaFactory initially launched with a five-star system for rating ideas, and we noticed issues with the ratings. However, the experience was the opposite of what YouTube reports; we saw the majority of ideas getting four and five stars, which seemed to indicate that people wanted to show support for colleagues.

Very few ideas were rated lower than three, I suppose for fear of insulting the creator, even though ratings were anonymous.

So, what we had was a virtual heap of scads of ideas, all rated between four and five stars.

I have this same issue with Netflix. They use a five-star system, and I constantly run into problems choosing between three and four stars.

I don’t think these problems would be solved by adding more values, e.g. more stars or half-stars. Maybe that star model is broken.

We considered this when we built Oracle Mix back in late 2007 and decided to go with a Digg-style voting model, minus the buries. So, people could agree (+1) with any number of ideas.

When we moved Connect to the Mix codeline, we converted the star ratings to votes. Basically, if you rated an idea, we turned that into a +1 vote. Not very scientific, but the stars were pretty arbitrary. So, we figured it would be fine.

We decided to avoid buries (or down votes) to promote harmony among voters. A vote against something seems pretty harsh; not voting for something means you don’t think it’s a good idea. Voting against it carries some extra payload, like you actively don’t want it.

I’m not sure if it mattered, and a few people on Mix had legitimate use cases for down voting ideas. Still, it seemed like the best course of action.

Incidentally, this model was in the news this week too, when the Google Reader team opened up Product Ideas for Google Reader, where you can submit product ideas and vote for or against them.

I love Reader and embrace this opportunity to contribute to its direction. So, I submitted a couple myself (open source and add a proxy setting), but so far, the ideas have more votes against than for them. I find myself wondering why people would take the time to vote against an idea.

What’s the harm in open sourcing the Reader engine so I can install it behind my firewall and read internal feeds? Similarly, what’s the harm in adding a proxy setting to read feeds behind a firewall?

If anything, this makes me want to vote competitively against other ideas.

This is exactly the behavior we wanted to avoid.

Still, the voting model is diluted by the fact that there is no limit on the number of votes that can be cast. No scarcity of votes means everything is highly voted, causing essentially the same issue that the star system has.

In the latest revision to Connect, we converted all votes to likes. Likes are more in line with what voting had become, but there was some blowback around the conversion. Nothing major though; I suspect everyone felt that voting was pretty diluted anyway.

Obviously, likes don’t really rank artifacts. So, the question remains; if you really want to rate artifacts, what’s the best method?

My experience tells me that star and +/-1 systems don’t work well enough.

What might work? I think you need something scarce, like a currency, to create a market and prevent excessive spending on any artifact (idea) that looks decent. I’m not convinced that down voting is necessary in a market, although I suppose there are cases where you’d want to stop an idea that might do harm. Those cases are rare though.
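The scarce-currency idea could be sketched as a simple vote-budget model. This is a hypothetical illustration, not anything Connect or Mix implements; the class name and idea slugs are made up.

```python
# Hypothetical sketch of scarcity-based voting: each user gets a fixed
# budget of vote credits, so spending on one idea leaves less for others.

class VoteBudget:
    def __init__(self, credits=10):
        self.credits = credits

    def vote(self, tallies, idea, amount=1):
        """Spend credits on an idea; refuse the vote once the budget runs out."""
        if amount > self.credits:
            return False
        self.credits -= amount
        tallies[idea] = tallies.get(idea, 0) + amount
        return True

tallies = {}
voter = VoteBudget(credits=3)
voter.vote(tallies, "open-source-reader", 2)
voter.vote(tallies, "proxy-setting", 1)
print(voter.vote(tallies, "another-idea"))  # → False: budget exhausted
print(tallies)  # → {'open-source-reader': 2, 'proxy-setting': 1}
```

Because credits are finite, a +1 is no longer free, which is the scarcity a market needs.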

What do you think works best? Did I miss something obvious about stars and +/-1 systems?

Find the comments.

About Jake

a.k.a.: jkuramot

9 comments

  1. Those “stop an idea that might do harm” cases might be rare, but they make for awesome flame wars on the OTN forums. 😉

    I think star systems have their place, but are mostly relevant for personal ranking systems (e.g. iTunes) or systems that correlate preferences between users with similar tastes/tendencies (e.g. Netflix). Not sure what to think about downvotes. The problem I see with treating no vote as a proxy for dislike is that no vote could also connote “no interest.” Maybe that's not important; I don't claim to have thought this through too far.

    The lack of vote scarcity could be addressed with a reputation system, where users are allocated a certain number of votes based upon both time and level of community participation. One could even work downvotes in there, at a higher level of scarcity, and require a certain rep level before downvoting is even possible. IIRC, that's similar to how Stack Overflow operates. I can't imagine how the karma/rep thing would scale to Digg or Youtube levels, though; it's probably better suited to more bounded communities, like Connect or Mix.

  2. The problem with stars, even if they're only for personal rating, is that they still tend to migrate too high, at least for me. On Netflix, I have too many movies rated 3 or 4 stars for that system to be useful anymore.

    Down-voting makes sense sometimes; as you say, not voting and voting against aren't necessarily the same thing.

    I nearly mentioned reputation, which is my new kick, because it's the best way to establish credit in a community. Glad you came to that conclusion without me. It wasn't a test; it's just good that other smart people think the same way we do. We've code-named our little experiment in reputation 1up. Hoping to get time to build it.

  3. First off, great blog. I was immediately drawn to a few of your articles, which is a rarity for me as I'm not a heavy reader of blogs.

    I'm a developer at a fairly popular ratings website, and I've run across some of the problems you mentioned with 5-star ratings (and rating systems in general). I share your frustration with Netflix's system as well. There is one idea I've been kicking around in my head, and I'm sure it's not a new one. What are your thoughts on a ranking system in lieu of star ratings? People love to make top 10 and top 5 lists, so why not let them do that for the products and services they're interested in? I'm not familiar with your Connect product, so I don't know if it works for you… but hear me out on this, and tell me what you think.

    The list of choices will be low (say, 3 to 5), and the selected choices will either be randomly chosen by the system or chosen by the user, depending on the nature of the product they are ranking.

    Here are a couple of examples of what I mean. Say you have a site on which you can rate people's ideas (like halfbakery.com; is that what Connect is?). Raters (rankers) would click through to view the idea, and at the bottom or side of the page, be presented with the option to rank that idea against, say, two other ideas. The number of options has to be kept low in order to encourage higher participation. The results are then aggregated and weighted according to the rankings to produce the final outcome. For example, user Jack ranks ideas in this order: A-B-C. User Jill ranks the same ideas B-C-A. User Joe ranks those same ideas C-B-A. If you weighted the results as first place: 3 points, second place: 2 points, third place: 1 point, the final results are that B is the best with 7, then C with 6, and A with 5. Very simple concept; it just took me a while to explain. 🙂
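    The weighting scheme in the Jack/Jill/Joe example is essentially a Borda count, and can be sketched in a few lines of Python (the function name and rankings are just illustrative):

```python
# Borda-count aggregation of ranked lists: with n items, first place
# earns n points, second place n-1, and so on; totals decide the outcome.
from collections import defaultdict

def borda_count(rankings):
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for place, item in enumerate(ranking):
            scores[item] += n - place  # 1st: n pts, 2nd: n-1, ...
    return dict(scores)

# Jack: A-B-C, Jill: B-C-A, Joe: C-B-A
print(borda_count([["A", "B", "C"], ["B", "C", "A"], ["C", "B", "A"]]))
# → {'A': 5, 'B': 7, 'C': 6}
```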

    Similarly, on, say, an apartment review website, users can rank their own apartments instead of rating them. But this time, they get to choose which other apartments to rank against, since the system does not necessarily know where else they have lived. (Or if the user has input their rental history, then it CAN display the full, or a limited, list to pit the current apartment against.)

    So I think the advantage of this system over a traditional star rating system is that it eliminates the faults you outlined in the article. The advantage over a simple vote up/down system is that, for slightly more effort on the user's part, you gain several times more information from the user.

    This is what I mean. Say user Mike has lived at 5 apartments, and he ranks them in order of favorite to least favorite. The system gains 10 data points (i.e. A&gt;B, A&gt;C, A&gt;D, A&gt;E, B&gt;C, etc.). In the star rating system, you only get 5 data points (albeit finer detail per point), fraught with inaccuracies and bias, as you say. Anecdotally, I find it easier to rank things than to rate them, so I'm more inclined to do a ranking than a rating. Another advantage is that you get finer-quality data; movies A and B may both be 3 stars, but one may actually be a little better than the other.
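    The data-point arithmetic above follows from expanding one ranked list into all the pairwise preferences it implies, which a short sketch makes concrete (names illustrative):

```python
# One ranked list of n items implies n*(n-1)/2 pairwise preferences,
# since every item is preferred to everything ranked below it.
from itertools import combinations

def pairwise_preferences(ranking):
    # combinations preserves input order, so each pair keeps the
    # higher-ranked item first: (better, worse)
    return list(combinations(ranking, 2))

pairs = pairwise_preferences(["A", "B", "C", "D", "E"])
print(len(pairs))   # → 10 data points from ranking 5 apartments
print(pairs[:3])    # → [('A', 'B'), ('A', 'C'), ('A', 'D')]
```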

    The disadvantage is that, at the individual (user) level, the outcomes of a ranking are forced to be evenly spaced. What I mean is that there is no way a user can indicate that, for example, restaurant A was just so much better than the other 4 restaurants on the list. In a ratings system, a user could convey that, but not in a ranking system. What I'm hoping is that, with the advantage of more data points, the aggregated results of the ranking system overcome that disadvantage.

    A way to overcome the aforementioned disadvantage would be a 'limited point reward' system, like uservoice.com uses. So a user can give 9 votes to idea A, and 1 vote to their second-favorite idea B. The only disadvantage I see in this system is that it's slightly more involved for a user to cast their vote than to simply arrange a drag-and-droppable list.

    I dunno.. just throwing some thoughts out there.. hope I'm making sense. And sorry for the long comment posting.

  4. Thanks for stopping by, glad you like (some) of the content. There's a lot to digest in your comment, which is great. I'll work top-down with my thoughts.

    I'm not a fan of ranking in a finite list unless there are concrete data to use, but I think I might be in the minority there. The logic of your plan is sound; call it the eye doctor system: show artifact A, then ask whether B is better or worse, then C, better or worse.

    You'll run into issues with coverage though, i.e. if the other options are shown randomly, you may not get enough exposure for all your artifacts to make the comparison valid. For example, A gets ranked 10 times and B 100 times. Not a good data set for comparing.

    I also wonder if users would do that without a game; reminds me of that game where you try to match descriptive adjectives of a picture with an anonymous partner. The game is built around creating metadata for photos. Can't recall the name offhand, but that would make rating much less like work.

    When you get into complex artifacts like restaurants and apartments, I think you need more granular attributes that roll up into an overall rating. There are too many pieces that make up how you rank restaurants against each other for a single number. But again, you run into user laziness with more input.

    I do like the scarce voting model you mention. It creates a better focus for voting, and wrapping the votes around a reputation system (like John mentions) seems like the logical way to get more votes, a la Stackoverflow.

    Good stuff. I appreciate the perspective. This is an interesting problem that's fun to discuss, IMO. Stop by anytime.

  5. Nice article. You mentioned, “First off, it seems counter-intuitive to give something any stars if I ‘loathe’ it.”

    I would like to point out that the stars are just an image. Use a different image and you can change the entire experience of rating. I would suggest you take a look at blippr. See it in action here – http://mashable.com/2009/10/06/gmail-accounts-e… You will see small blue icons having the text “=D” alongside site names like Google and Gmail. Hover your mouse on the blue icon and wait a second or so. It is a rating system, but the use of images is quite innovative.

    I think the 5-point rating system is just perfect. Many studies of the human mind have shown that our brains are most comfortable organizing things into 3 or 5 categories; they get confused by other numbers. In some cases, 2 categories (a simple Yes/No) can be used, but wherever gradings are required and absolute answers do not work, 3 or 5 categories is best.

  6. The problem with any 1-5 or any numerical rating system is that not rating (indifference) seems to carry more weight than the worst rating I can give, 1. Giving something 1 star, even if that equates to a failing grade, is completely counter-intuitive.

    It doesn't matter to me what the images are, but stars are the most common, and I think subconsciously we tend to overrate objects with a star system b/c stars carry a positive connotation.

    At least with a Yes/No system, my dislike is correctly weighted.

    I have noticed those icons on Mashable's site, but I avoided them.

