Parallel Dimension: Netflix Challenge

Soooo... I was talking with Robert, a friend I work with, when he told me about a challenge that Netflix put out about a year ago (http://www.netflixprize.com). The basic idea is that they want a better way to recommend movies, and they want it to the tune of $1,000,000. Here is the catch: you have to make it at least 10% better than their existing recommendation system.

So what's 10% better, you may ask? Well, it's all about predicting what rating a user will give a movie. They score entries with the RMSE, or Root Mean Squared Error, which measures how close your predicted ratings were to the actual ratings over a large data set; lower is better. When all is said and done you have to score an RMSE of 0.8572 or lower just to qualify. This is not an easy task: as of this writing no one has qualified, and the lowest score is 0.8782 by team Bellkor.
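To make the scoring concrete, here is a rough sketch of how RMSE is computed. It is written in Python just to keep it short, the function name `rmse` and the sample ratings are made up for illustration, and it is not the contest's official scoring code.

```python
import math

def rmse(predicted, actual):
    """Root Mean Squared Error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of the same length")
    # Square each prediction error so big misses are punished more than small ones.
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    # Average the squared errors, then take the square root to get back to "stars".
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical example: three predicted ratings versus what the users really gave.
print(rmse([3.8, 2.5, 4.9], [4, 3, 5]))
```

Because the errors are squared before averaging, one prediction that is off by two stars hurts more than two predictions that are each off by one.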

My first reaction to this problem was that the data set was way too narrow to make anything more than some sophisticated sorting algorithm. In my experience with Netflix, all they really take away from your experience with a DVD is a simple 5-star rating. I mean, sure, all of the basic links are there: actors, directors, writers, the date the movie was made, who produced it, the general theme of the movie, and so on. However, there is so much more to a movie than simply what it is about and who made or starred in it. To me it is like hearing "I like Pepsi" and suggesting Coke, without really knowing my motivation for liking Pepsi.

Motivation, feelings, past experiences, perception, and current mood are all individual to each customer and play an important role in how they react to movies. Sometimes it only takes one scene triggering a painful memory to turn a 5-star movie into a 1-star movie. Or the person could simply be in a bad mood that day, which can turn a good movie into a bad one. There is a lot of "randomness" in reactions to movies that depends on a lot of hard-to-track variables. With enough bad data points due to this "randomness", the whole recommendation system is thrown off.

So at this point I signed up for the challenge and decided that it would be worth taking a look at the data they amassed for the example set. There is a basic set of information here: a MovieID, date of release, CustomerID, date of rating, and rating. Not much to go on...

The data is split into three main sections. The first is information about the movies: an internal MovieID, the date of release (this can be either the DVD or the theatrical release, so it is not much use in identifying the time period the movie came from), and of course the title of the movie as it is stored by Netflix. The next part is the training data, which is split into one file per MovieID. Each file starts with a MovieID, followed by the CustomerID, date, and rating for every rating of that movie. The last section is a file that contains a list of MovieIDs along with the customers who rated them and on what day; the ratings themselves are left out, since those are what you have to predict.
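Here is a minimal sketch of reading one of those per-movie training files, in Python for brevity. The exact layout is an assumption on my part: a first line holding the MovieID followed by a colon, then one `CustomerID,Rating,Date` line per rating with the date as YYYY-MM-DD. If your files put the columns in a different order, swap the unpacking line. The names `Rating` and `parse_movie_file` are hypothetical.

```python
from datetime import date
from typing import Iterable, List, NamedTuple

class Rating(NamedTuple):
    movie_id: int
    customer_id: int
    stars: int
    rated_on: date

def parse_movie_file(lines: Iterable[str]) -> List[Rating]:
    """Parse one training file: a 'MovieID:' header, then one rating per line."""
    it = iter(lines)
    # Assumed header format: the MovieID followed by a colon, e.g. "1:".
    movie_id = int(next(it).strip().rstrip(":"))
    ratings = []
    for line in it:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        # Assumed column order: CustomerID, Rating, Date.
        customer, stars, day = line.split(",")
        ratings.append(
            Rating(movie_id, int(customer), int(stars), date.fromisoformat(day))
        )
    return ratings

# Made-up sample content, only to show the shape of the input.
sample = ["1:", "1001,3,2005-09-06", "1002,5,2005-05-13"]
for r in parse_movie_file(sample):
    print(r)
```

Passing an open file object works the same way as the list above, since the function only needs something it can iterate over line by line.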

From this it looks like the best you could do is single out individuals who seem to have the same taste in movies, and recommend movies that some people in the group have not seen but that others in the group rated highly. This clustering of tastes does not seem like a great solution, because in my experience just because I agree with someone on one movie does not mean I will on the next. It is kind of like a book club where each month the group reads a book and then, instead of discussing it at the end of the month, everyone just shows up and assumes that they all liked the book for the same reasons.
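For what it's worth, the "same taste" idea can be sketched in a few lines of Python. This is only a toy version under my own assumptions: each customer is a dictionary of MovieID to stars, similarity is just how closely two people agree on the movies they share, and the thresholds `min_similarity` and `min_stars` are numbers I picked out of the air. Real entries surely do something smarter.

```python
def similarity(a, b):
    """Agreement between two {movie_id: stars} dicts, from 0.0 to 1.0."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0  # nothing in common, so no evidence of shared taste
    mean_gap = sum(abs(a[m] - b[m]) for m in shared) / len(shared)
    return 1.0 - mean_gap / 4.0  # 4 is the largest possible gap on a 1-5 scale

def recommend(target, others, min_similarity=0.75, min_stars=4):
    """Suggest unseen movies that like-minded customers rated highly."""
    scores = {}
    for ratings in others:
        if similarity(target, ratings) < min_similarity:
            continue  # skip people whose taste does not match closely enough
        for movie, stars in ratings.items():
            if movie not in target and stars >= min_stars:
                scores.setdefault(movie, []).append(stars)
    # Best average rating first; ties broken by movie ID for a stable order.
    return sorted(scores, key=lambda m: (-sum(scores[m]) / len(scores[m]), m))

# Made-up customers: the first and third agree with me, the second does not.
me = {1: 5, 2: 1}
others = [{1: 5, 2: 1, 3: 5}, {1: 1, 2: 5, 4: 5}, {1: 4, 2: 2, 5: 4}]
print(recommend(me, others))
```

Notice that the sketch has exactly the weakness described above: it assumes that agreeing on movies 1 and 2 means we will agree on movie 3.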

Anyways, for now I am going to load this data into a database and then start messing around with some C# to see what kind of information I can get out of it... I will post more on this subject once I have some results from working with the data.

Game Over,
Chris Kincanon

Friday, June 15, 2007