Collaborative Filtering, Hadoop and the Hazards of Copy-Paste

I’ve been working on a new App idea lately – a recommender for Android programs. Basically, it looks at what you have installed (and possibly ratings) and recommends other applications you might like by using the recommendations of other people in the same way as Amazon or the various music services – in a word – collaborative filtering.

There are different ways to do collaborative filtering, but they all get expensive once there are a lot of records to sort through. Two common approaches are 1) calculate the similarity of users, and recommend apps liked by similar users, or 2) calculate the similarity of apps, and recommend apps similar to the ones the user likes. I am trying the second way, known as item-based collaborative filtering or the model-based approach, which allows for fast queries at the cost of an expensive offline step that re-computes the item similarities every once in a while.
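A minimal Python sketch of the item-based approach. The similarity measure (Jaccard over the sets of users who installed each app), the function names, and the sample data are all illustrative assumptions, not necessarily what the real recommender uses:

```python
from collections import defaultdict
from itertools import combinations


def item_similarities(installs):
    """Offline step: Jaccard similarity for every pair of apps sharing a user."""
    users_by_app = defaultdict(set)
    for user, apps in installs.items():
        for app in apps:
            users_by_app[app].add(user)

    sims = defaultdict(dict)
    for a, b in combinations(sorted(users_by_app), 2):
        shared = len(users_by_app[a] & users_by_app[b])
        if shared:
            union = len(users_by_app[a] | users_by_app[b])
            sims[a][b] = sims[b][a] = shared / union
    return sims


def recommend(user_apps, sims, top_n=3):
    """Online step: score uninstalled apps by summed similarity to installed ones."""
    scores = defaultdict(float)
    for app in user_apps:
        for other, sim in sims.get(app, {}).items():
            if other not in user_apps:
                scores[other] += sim
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda app: (-scores[app], app))[:top_n]


# Hypothetical sample data: user -> set of installed apps.
installs = {
    "alice": {"maps", "mail", "chess"},
    "bob": {"maps", "mail"},
    "carol": {"mail", "chess", "sudoku"},
}
sims = item_similarities(installs)
print(recommend({"maps"}, sims))
```

The pairwise loop in `item_similarities` is the expensive offline part; `recommend` only does dictionary lookups, which is why queries stay fast.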

My initial tests in Python, based on the very interesting book “Programming Collective Intelligence”, quickly became too slow with just a few thousand users and apps. Because there are already around 5,000 apps and a few million users of Android (with many more every day), there’s no way the script would be able to handle the future growth of the platform.

Enter MapReduce and Hadoop. The explanation is better left to the pros, but simply put, MapReduce is a way of parallelizing certain types of computations across many computers and then merging the final results. With the availability of Amazon Web Services, which allows you to rent a cluster of computers by the hour, it becomes possible to run a prohibitively expensive computation once every few days for just a couple of dollars. There are several different MapReduce frameworks out there, but I chose to try Hadoop, which is available on Amazon’s services and used heavily by Yahoo and many others.
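To make the map, shuffle, and reduce steps concrete, here is a toy single-process version in Python. It only imitates the programming model (nothing here is distributed, and `run_mapreduce` is a made-up helper, not a Hadoop API):

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Toy MapReduce: map each record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": collect values per key
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def word_mapper(line):
    # Emit (word, 1) for every word on the line.
    for word in line.split():
        yield word.lower(), 1


def count_reducer(key, values):
    # Sum the 1s emitted for this word.
    return sum(values)


counts = run_mapreduce(
    ["the quick fox", "The lazy dog", "the end"], word_mapper, count_reducer
)
print(counts)
```

On a real cluster each mapper call and each reducer call can run on a different machine, because neither depends on any state outside its own inputs.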

There will be a lot more to say about Hadoop as I gain more experience. But all-in-all, it is pretty fun to re-think an algorithm, even just a little bit, to make it suitable for MapReduce. I *think* I have a correct implementation of Item-Based Collaborative Filtering running on my tiny 2-node cluster and it’s pretty cool!
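As an example of the kind of re-thinking involved, counting how many users have both of two apps installed (one possible ingredient of item similarity) splits naturally into a map step per user and a reduce step per app pair. This is only a guess at a workable decomposition, written in plain Python with hypothetical names and data, not the implementation running on the cluster:

```python
from collections import defaultdict
from itertools import combinations


def map_user(user, apps):
    """Mapper: for one user's install list, emit each unordered app pair once."""
    for a, b in combinations(sorted(apps), 2):
        yield (a, b), 1


def reduce_pair(pair, counts):
    """Reducer: number of users who have both apps in the pair installed."""
    return sum(counts)


def cooccurrence(installs):
    grouped = defaultdict(list)  # stands in for the framework's shuffle/sort
    for user, apps in installs.items():
        for pair, one in map_user(user, apps):
            grouped[pair].append(one)
    return {pair: reduce_pair(pair, ones) for pair, ones in grouped.items()}


# Hypothetical sample data: user -> installed apps.
installs = {
    "alice": ["maps", "mail", "chess"],
    "bob": ["mail", "maps"],
    "carol": ["chess", "mail"],
}
pairs = cooccurrence(installs)
print(pairs)
```

The mapper only ever sees one user at a time, which is what lets the work spread across machines.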

I ran into one snag while trying to get my cluster running using the ubiquitous WordCount example for Hadoop. Like most people, I copy-pasted the source from the Hadoop tutorial and tried to run it. It ran, great! So then instead of reading the rest of the documentation, I immediately tried to modify it. Eventually, I ended up trying to make the simplest change – to return Text instead of IntWritable from the Map operation – and WTF!?! I spent HOURS trying to figure out why there was a ClassCastException. So for other poor souls trying to modify the WordCount example, there are 3 things you need to do:

First, get the method signatures right. The Mapper has to output Text and the Reducer has to consume Text (Eclipse will help with that, of course).

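Second, add the lines “conf.setMapOutputKeyClass(Text.class);” and “conf.setMapOutputValueClass(Text.class);” to the main() method. These tell Hadoop what the Mapper really emits. If you leave them out, Hadoop assumes the Mapper's output classes are the same as the job's final output classes, which in the example means an IntWritable value.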

Third, and crucially important, remove the line “conf.setCombinerClass(Reduce.class);”. Discovering that I needed to remove that single line took me about half a day, digging through the logs and Googling everything I could think of until I discovered this thread. Because it was part of the example, I assumed it was Hadoop boilerplate that was essential. It’s not; it’s an optimization. The Combiner is kind of like a pre-Reduce phase that runs on the Mapper's output before it is sent on to the Reducer, so less data has to be written to disk and combined later. Because its output still has to look like Mapper output, the Combiner needs a method signature that accepts the output of the Mapper and emits those same types, so that the result is still suitable as input to the Reducer. A Reducer can double as a Combiner only when it emits the same types it consumes. Otherwise, it chokes.
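The type constraint is easier to see in a toy model. The Python sketch below is not Hadoop and does not reproduce its internals; `run_job` is a made-up single-process stand-in that checks whatever leaves the map side against a declared map output type, and the `TypeError` plays the role of the ClassCastException:

```python
from collections import defaultdict


def run_job(lines, mapper, reducer, map_value_type, combiner=None):
    """Toy single-process job: map -> optional combine -> shuffle -> reduce."""
    shuffled = defaultdict(list)
    for line in lines:  # each line plays the role of one map task
        local = defaultdict(list)
        for key, value in mapper(line):
            local[key].append(value)
        for key, values in local.items():
            if combiner is not None:
                values = [combiner(key, values)]
            for value in values:
                # Whatever leaves the map side must match the declared type.
                if not isinstance(value, map_value_type):
                    raise TypeError(
                        f"expected {map_value_type.__name__}, "
                        f"got {type(value).__name__}"
                    )
                shuffled[key].append(value)
    return {key: reducer(key, values) for key, values in shuffled.items()}


def text_mapper(line):
    for word in line.split():
        yield word, "x"  # emits text values, like the modified WordCount


def count_reducer(key, values):
    return len(values)  # consumes text values, produces an int


lines = ["a b a", "b a"]
print(run_job(lines, text_mapper, count_reducer, str))  # no combiner: works

try:
    # Reusing the reducer as combiner puts an int where text is expected.
    run_job(lines, text_mapper, count_reducer, str, combiner=count_reducer)
except TypeError as err:
    print("combiner rejected:", err)
```

Dropping the combiner costs some efficiency but nothing in correctness, which is why removing that one line is a safe fix.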

Such is the peril of the copy-paster who runs code without really understanding all of it.

A tale of woe, a database disaster averted

I know I’m not a great coder, but I like to think I’m at least decent or even good on occasion. The events of the past week illustrate that even that estimation may be a stretch.

When I came to work on Monday morning, I discovered that one of my database tables was exceedingly empty. For about an hour, I was convinced I had been hacked. However, after checking every last thing, I finally discovered the culprit – a leftover setup script had dropped and re-created the databases. There are so many things that went wrong, I am inclined to enumerate them:

  • I LEFT A SETUP SCRIPT ON THE SERVER
  • There was no logging in place, so it took ages to diagnose the problem and rule out hacking
  • I had left unnecessary CREATE/DROP privileges turned on when only INSERT was needed
  • I found a SQL Injection vulnerability (that had been added in a last minute tweak)
  • I didn’t have an automated backup in place
  • I almost made a manual backup on Friday afternoon but then thought “What could go wrong?”
  • I had looked at the data several times in the last week in a spreadsheet, but didn’t save the spreadsheet and it was gone from the local cache
  • The only other person who had been downloading daily had deleted the one column I needed – every single day.

So it was pretty much a conspiracy of every single possible thing going wrong. At least, I learned a good number of lessons about the value of logging and automated backups – in the future, I will be much less cavalier. Mercifully, the server administrators (who are more prudent than your author) take nightly snapshots. After prostrating myself at the altar of the Unix Gods, my database was restored after 3 days with minimal lossage.
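Of the items on that list, the SQL injection hole is the easiest to show in code. Here is a small Python sketch of the standard fix, a parameterized query. It uses the standard-library sqlite3 module and a made-up `signups` table purely as stand-ins; the real site's database and schema are not described here:

```python
import sqlite3

# In-memory database as a stand-in for the real one (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signups (email TEXT)")


def add_signup(conn, email):
    # The ? placeholder passes the value separately from the SQL text,
    # so quotes and semicolons in the input are stored, not executed.
    conn.execute("INSERT INTO signups (email) VALUES (?)", (email,))
    conn.commit()


hostile = "x'); DROP TABLE signups; --"
add_signup(conn, "a@example.com")
add_signup(conn, hostile)

rows = [row[0] for row in conn.execute("SELECT email FROM signups ORDER BY rowid")]
print(rows)
```

The hostile string ends up as an ordinary row, and the table is still there afterwards.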

In an unrelated matter, I am updating a website whose data access must be completely changed to implement a new protocol. Looking back at my old code, I am frankly unimpressed. The separation of data, logic, and UI is downright bad. If I had designed this well the first time (about 18 months ago), it would be a relatively simple matter of swapping out the data layer. Instead, I am now forced to go back and redo the whole thing, only better this time.
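For what "swapping out the data layer" could look like, here is a small Python sketch. Every name in it is hypothetical and it says nothing about the real site; the point is only that the logic depends on a narrow interface rather than on any one data source:

```python
from typing import Callable, Protocol


class RecordStore(Protocol):
    """The only thing the logic layer is allowed to know about data access."""

    def fetch(self, record_id: str) -> dict: ...


class LegacyStore:
    """Hypothetical old data layer backed by an in-memory mapping."""

    def __init__(self, rows: dict):
        self._rows = rows

    def fetch(self, record_id: str) -> dict:
        return dict(self._rows[record_id])


class NewProtocolStore:
    """Hypothetical new data layer; `transport` stands in for the new protocol."""

    def __init__(self, transport: Callable[[str], dict]):
        self._transport = transport

    def fetch(self, record_id: str) -> dict:
        return {"id": record_id, **self._transport(record_id)}


def render_title(store: RecordStore, record_id: str) -> str:
    # Logic/UI code: unchanged no matter which store is plugged in.
    return store.fetch(record_id)["title"].upper()


legacy = LegacyStore({"42": {"title": "hello"}})
modern = NewProtocolStore(lambda record_id: {"title": "hello"})
print(render_title(legacy, "42"), render_title(modern, "42"))
```

With this shape, moving to the new protocol means writing one new class instead of touching every page.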

So it’s been a big week for humble pie. I know it’s healthy for us humans to get frequent reminders of our fallibility and to learn from our mistakes, but that doesn’t make it fun.