Can Clojure Find Me An Apartment?

February 12th, 2010

This post was going to be about how I spent the better part of a day trying to get clojure and emacs and slime and the java classpath all working together.

The gist of it is this: I am an idiot sometimes. I spent most of an afternoon trying to figure out why it is an error to (use ‘clojure.contrib). Earlier in the day, my classpath was setup wrong, so (use ‘clojure.contrib.duck-streams) didn’t work. At some point, I stopped typing the whole thing, thinking that if ‘clojure.contrib.duck-streams works, then so should the parent package ‘clojure.contrib. A-ha! Save myself a bit of typing! Nope. That never works.. so, when I finally did get my classpath working, I didn’t know it because I was typing something that’s just plain wrong. Hilarious and Awesome, huh?

So, with everything finally working, I made my first little half-way real Clojure program.

Our current lease runs out in about a 6 weeks, so me and my roommate need to find a new place to live – sounds like a job for Craigslist. There’s a problem though: in big cities, Craigslist is absolutely flooded with apartments and the search functions just aren’t that good. I have no interest in skimming hundreds or thousands of posts looking for that perfect combination of price/location/amenities (well, mostly price and location, actually), so why not let the computer do the work instead? Usually this would be a job for Python/BeautifulSoup, but in the interest of learning Clojure, here goes..

Following is what I’ve come up with so far for scraping apartments off Craiglist as gently as possible by filtering out links that don’t meet my criteria. Right now, this code only generates the list of matching links, it doesn’t actually follow them. If I continue further with this program, that will be Step 2, probably using http://lethain.com/entry/2009/nov/24/scalable-scraping-in-clojure/ for inspiration.

This is based on the Enlive library, which provides a very usable syntax for ripping through HTML (though I don’t quite understand it all yet). As I’m still a complete beginner with Clojure and functional programming in general, the following code is probably far from idiomatic and may look sloppy to you pros out there. Comments and suggestions are welcome!

;; import enlive
(use 'net.cgrand.enlive-html)
 
;; html helper
(defn fetch-url [url]
  (html-resource (java.net.URL. url)))
 
;; pulls link from paragraph
;; ie, (map get-link (select *cl* [:p]))
(defn get-link [p]
  (:href (:attrs (first (:content p)))))
 
;; pulls text of link from paragraph
(defn get-link-text [p]
  (:content (first (:content p))))
 
;; pulls text of parens following link
;; usually this is zipcode/location info
;; "", if absent
(defn get-paren-text [p]
  (let [content (:content p)]
    (if (< 2 (count content))
      (:content (nth content 2))
      "")))
 
;; pulls link/text/location into a map
(defn get-all [p]
  {:link (get-link p)
   :text (str (get-link-text p)
	      (get-paren-text p))})
 
;; some helpers to remove links we don't care about 
 
;; (affordable "$800" 600 1000) #t
;; (affordable "$1500" 600 1000) #f
(defn affordable? [text min max]
  (let [price (second (re-find #"\$(\d+)" text))]
    (if price
      (let [price (Integer/parseInt price)]
	(and (<= min price)
	     (>= max price))))))
 
;; (has-kword "downtown" (list "down")) #t
;; (has-kword "down" (list "downtown")) #f
(defn has-kword? [text kwords]
  (let [vals (map #(re-find (re-matcher (re-pattern %) text)) kwords)]
    (some #(not (= nil %)) vals)))
 
;; parameterizes a function to decide if a link is worth retrieving
;; this would be cooler if the criteria functions
;; came in as a list too.. but that makes my head
;; spin.. maybe later
(defn keep-link? [min max areas beds]
  (fn [{link :link text :text}]
    (let [text (.toLowerCase text)]
      (and link
	   (re-find #"/apa/" link)
	   (affordable? text min max)
	   (has-kword? text areas)
	   (has-kword? text beds)))))
 
;; some top level definitions
;; you may need to change these to get non-empty results
(def *url* "http://losangeles.craigslist.org/apa/")
(def *min-price* 100)
(def *max-price* 10000)
;; I kinda like it in the South Bay, but whatever..
(def *areas* (list "hollywood" "weho"))
(def *beds* (list "2br" "3br"))
(def my-keep-link? (keep-link? *min-price* *max-price* *areas* *beds*))
 
;; actually do the work
(filter my-keep-link? (map get-all (select (fetch-url *url*) [:p])))
 
;; References
;; 1) http://wiki.github.com/cgrand/enlive/
;; 2) http://github.com/swannodette/enlive-tutorial/
;; 3) Programming Clojure, Stuart Halloway
;; 4) lots and lots of Googling

On the whole, I’m liking Clojure a lot, but there is also a lot to learn.

(Shocking conclusion, I know!)

A few cool videos from Google Tech Talks

January 9th, 2010

I keep meaning to find some interesting podcasts and online lectures. There’s a ton of material out there, but so much of it sucks. Anyway, browsing the topic “What are the best Google Tech Talks” on Stackoverflow, I found the following, which I now link for your viewing pleasure:

XKCD visits Google – Very funny and interesting, but perhaps less enjoyable unless you’re an xkcd fanboy like me. Jump to 21:30 where xkcd answers a joking question from Donald Knuth.

PolyWorld: Using Evolution to Design Artificial Intelligence – An interesting A-Life experiment/visualization. Jump to 5:35 for some really neat video of an older program that evolves different body morphologies for efficient movement in a simulated physical environment. (I think this is the original work the speaker is citing)

The Next Generation of Neural Networks – The speaker flies through the intro material much too fast for me to understand with only a rudimentary knowledge of NN. Nevertheless, the demo at 21:35 is cool, as is the discussion around 31:40 of using these layered NN for document clustering and classification.

Batch Extracting MP3s from YouTube Videos

January 9th, 2010

Last night I wanted to extract audio tracks from a number of YouTube videos that I’d downloaded using youtube-dl. Being only a so-so shell scripter, I’ve always resorted to ugly for-loops when manipulating multiple files. This invariably ends badly when my loop improperly handles whitespace and mangles the filenames.

No more! Skimming a tutorial last night I stumbled on something that heavy shell users already know: the -exec parameter for the find command. This allows you to specify a command to run on everything that find finds. In the case of extracing audio from MP3s, it works like this:

find . -name '*.flv' -exec ffmpeg -i '{}' '{}.mp3' ';'

This command looks in the current directory for flv files and uses ffmpeg to extract the audio to another file with the same name, plus the .mp3 extension. The funny brackets {} are substituted for the file name.

A downside to this approach – your files end up with names like .flv.mp3 instead of .mp3. If that bothers you, you can fix it with the rename command which uses regexes to rename files:

rename 's/\.flv\.mp3/\.mp3/' *.flv.mp3

Ubuntu users like myself will need to install ffmpeg and ubuntu-restricted-extras to get the necessary encoder.

There are certainly lots of other ways to encode a directory worth of files, but I think this one is pretty cool.

Collaborative Filtering, Hadoop and the Hazards of Copy-Paste

August 29th, 2009

I’ve been working on a new App idea lately – a recommender for Android programs. Basically, it looks at what you have installed (and possibly ratings) and recommends other applications you might like by using the recommendations of other people in the same way as Amazon or the various music services – in a word – collaborative filtering.

There are different ways to do collaborative filtering, but they are all expensive when you get a lot of records to sort through. Two common approaches are 1) Calculate the similarity of users, and recommend apps liked by similar users, or 2) Calculate the similarity of apps, and recommend apps similar to ones the user likes. I am trying the second way, known as item-based collaborative filtering or the model-based approach, which allows for fast queries at the cost of an expensive offline step that re-computes the item similarities every once in awhile.

My initial tests in Python, based on the very interesting book “Programming Collective Intelligence” quickly became too slow with just a few thousand users and apps. Because there are already around 5,000 apps and a few million users of Android (with many more every day), there’s no way the script would be able to handle the future growth of the platform.

Enter MapReduce and Hadoop. The explanation is better left to the pros, but simply, MapReduce is a way of parallelizing certain types of computations across many computers and then merging the final results. With the availability of Amazon Web Services, which allows you to rent a cluster of computers by the hour, it becomes possible to run a prohibitively expensive computation once every few days for just a couple of dollars. There are several different MapReduce frameworks out there, but I choose to try Hadoop, which is available on Amazon’s services and used heavily by Yahoo and many others.

There will be a lot more to say about Hadoop as I gain more experience. But all-in-all, it is pretty fun to re-think an algorithm, even just a little bit, to make it suitable for MapReduce. I *think* I have a correct implementation of Item-Based Collaborative Filtering running on my tiny 2-node cluster and it’s pretty cool!

One snag I ran into while trying to get my cluster running using the ubiquitous WordCount example for Hadoop. Like most people, I copy-pasted the source from the Hadoop tutorial and tried to run it. It ran, great! So then instead of reading the rest of the documentation, I immediately tried to modify it. Eventually, I ended up trying to make the simplest change – to return Text instead of IntWritables from the Map operation and — WTF!?! I spent HOURS trying to figure out why there was a ClassCastException. So for other poor souls trying to modify the WordCount example, there are 3 things you need to do:

First, get the method signatures right. The Mapper has to output Text and the Reducer has to consume Text (Eclipse will help with that, of course)

Second, add the lines: “conf.setMapOutputKeyClass(Text.class);” and “conf.setMapOutputValueClass(Text.class);” to the main() method. These tell Hadoop that the Mapper is not using the default, IntWritable, for output

Third, and crucially important, remove the line “conf.setCombinerClass(Reduce.class);”. Discovering that I needed to remove that single line took me about half a day, digging through the logs and Googling everything I could think of until I discovered this thread. Because it was part of the example, I assumed it was Hadoop boiler-plate that was essential — it’s not, it’s an optimization. The Combiner is kind of like a pre-Reduce phase that saves time by combining in-memory results instead of writing them to disk and combining them later. The Combiner needs a method signature that accepts the output of the Mapper and is still suitable as input to the Reducer. Otherwise, it chokes.

So is the peril of the copy-paster who runs code without really understanding all of it ~~

A tale of woe, a database disaster averted

August 1st, 2009

I know I’m not a great coder, but I like to think I’m at least decent or even good on occasion. The events of the past week illustrate that even that estimation may be a stretch.

When I came to work on Monday morning, I discovered that one of my database tables was exceedingly empty. For about an hour, I was convinced I had been hacked. However, after checking every last thing, I finally discovered the culprit – a leftover setup script had dropped and re-created the databases. There are so many things that went wrong, I am inclined to enumerate them:

  • I LEFT A SETUP SCRIPT ON THE SERVER
  • There was no logging in place, so it took ages to diagnose the problem and rule out hacking
  • I had left unneccesary CREATE/DROP privileges turned on when only INSERT was needed
  • I found a SQL Injection vulnerability (that had been added in a last minute tweak)
  • I didn’t have an automated backup in place
  • I almost made a manual backup on Friday afternoon but then thought “What could go wrong?”
  • I had looked at the data several times in the last week in a spreadsheet, but didn’t save the spreadsheet and it was gone from the local cache
  • The only other person who had been downloading daily had deleted the one column I needed – every single day.

So it was pretty much a conspiracy of every single possible thing going wrong. At least, I learned a good number of lessons about the value of logging and automated backups – in the future, I will be much less cavalier. Mercifully, the server administrators (who are more prudent than your author) take nightly snapshots. After prostrating myself at the altar of the Unix Gods, my database was restored after 3 days with minimal lossage.

In an unrelated matter, I am updating a website whose data access must be completely changed to implement a new protocol. Looking back at my old code, I am frankly unimpressed. The separation of data, logic, and UI is downright bad. If I had designed this well the first time (about 18 months ago), it would be a relatively simple matter of swapping out the data layer. Instead, I am now forced to go back and re-do the whole thing over, only better this time.

So it’s been a big week for humble pie. I know it’s healthy for us humans to get frequent reminders of our fallability and to learn from our mistakes, but that doesn’t make it fun.

Some Tricky Things About Creating a High Scores List

July 18th, 2009

My project lately has been adding a Net-based high scores list to a game on the Android marketplace. While the game has a decent number of downloads, I think it gets kinda boring after a few plays through. Hopefully allowing players to compete with each other for the highest score will boost the fun/challenge (and thus the ratings and sales)

Allowing players to submit high scores to a global ranking raises quite a few issues that are absent in a non-networked high scores list. Following are some of the ideas I’ve come up with. If you’ve tried something similar, I’d love to hear your approach.

A couple of interesting issues:

1. WebView or Android UI
2. Routine networking challenges
3. Serialization
4. Storing the high scores on the server
5. Letting players choose names
6. Preventing hacks

There are two main ways to display high scores – either using a WebView to display an HTML high score list or building a regular Android UI. For now I’ve opted for the regular UI because it integrates better with the look of the app. However, that decision comes at a price — with a WebView/HTML UI, you can make changes on the webserver and it immediately affects every client. With the regular UI, if there’s a problem, you can’t push the fix. All you can do is publish an update and hope everyone downloads it.

There are two important things to remember about networking for a high scores list (or just in general) – first, the connection may be unavailable for a variety of reasons (the device may not be connected to the network, or the server could be down, etc). So the code needs to detect being unable to connect and handle it appropriately. Secondly, a connection could be slow or time-out. Together this means the connection should be handled in a separate thread or the UI will hang and Android will kill the app.

There are several reasonable ways to transfer the data. I found JSON to be the simplest encoding because the Java API is straightforward and it is only requires 1 line of PHP on the server to unpack a JSON structure into either an Object or an associate array. Nice!

Storing the high scores on the server is pretty easy – just drop them in a SQL database. I’m still struggling with the best way to handle the growth of the table. Depending on the number of players, the scores table may grow quite substantially (though, the growth should be somewhat bounded as the top scores trend higher and closer to the absolute limit). One possibility is to setup cron or some automated task to periodically wipe out scores below a certain threshold. Another possibility is to check each time a score is submitted whether anything needs to be deleted. Yet another possibility is to just keep all the scores and delete them manually from time to time, since there will be thousands or tens-of-thousands, not millions or billions. Fortunately, deciding what to keep is a server-only issue, so it is easy to keep everything now and postpone the decision based on actual usage and growth.

Another tricky issue is letting players choose names for the high score list. Since my game is appropriate for all ages, I don’t really want some bozo submitting filthy or offensive names. One approach is to not worry about it. Another approach is to try filtering the submitted names. Inevitably, any filtering is bound to miss something that’s offensive to someone. Furthermore, if the filtering is noticed, that turns it into a challenge for the player, who may now wonder: “what’s the most awful thing i can sneak past the filter”? While unlikely, that could get bad fast and having a high score list that requires daily moderation is no good. Another idea would be allowing only 3 letters for the names. That eliminates the 4-letter words, but initials-only makes a high score list less fun. Right now, I plan to wait and see if there is actually a problem before trying to fix it.

And saving the best for last — how to prevent someone from hacking the high scores list? Alas, it can’t really be done. No matter how you arrange the code, ultimately the player is responsible for truthfully submitting his own score. Because it is pretty easy to watch the data in-transit, the best you can do is to make it more complicated. One option is a checksum – send an additional parameter that is a salted hash of, say, the name and score. This is not bad and should deter half-hearted attempts, however the salt may be discoverable and any encryption or checksum is doomed to fail since the player controls the client. Another interesting option is submitting snapshot data of the game in-progress. However, that requires the server to know quite a bit about the game and what a valid state looks like. A super-crazy option would be storing all the game state on the server and only transferring player input (keypresses, screen touches) over the wire. That could actually work – if the game has an element of randomness, the only way to cheat would be to write a semi-intelligent bot. (If the game is deterministic, the player could just replay a winning series of moves) But.. the overhead of running all the logic and serving hundreds or thousands of simultaneous clients makes it unrealistic for a typical game. Since there is no monetary incentive to winning my game, I’m not too worried about this, but my optimism may be misplaced. Much more on this here.

Finally, an idea which I’ve dropped for the moment is a weekly high scores list (in addition to all-time high scores). This would make a high score achievable for a new or less determined player and prevents that inevitable stagnation as the all-time high scores get closer and closer to the maximum possible. One tricky part about a weekly list is that once again you need to rely on the client to be truthful about when each score was achieved. The current plan is to see how things go with the regular high score list, and then evaluate whether a weekly/daily/monthly list makes sense.

Eh, well. That’s what’s been on my mind lately. For something that sounds simple at first, there really are quite a lot of issues to consider in every part of the code. For now, the implementation is almost complete, so I can’t wait to see how it goes. If it works out well, I’ll try to follow up with actual code instead of random musings.

Another interesting python snippet

July 2nd, 2009

Well, I think it’s interesting anyway..

So today I was trying to express the idea “do something to all these files, unless the filename matches the list of things we don’t care about”. I have to do a BUNCH of find-and-replace kinda stuff in the next week on a couple thousand webpages, so my plan is to write a little script to make sure I don’t miss anything. Sometimes the script will turn up a false positive that I know I want to ignore..

EDIT – The following snippet doesn’t do quite what I thought it did.. more later..

There are at least three distinct and reasonable ways to do this in Python:

# skip if name matches any of these
ignore = ["thing1", "thing2"]
 
# Iterate over the strings, see if a substring matches
# This is reasonably clear, but it seems long and requires a flag
for filepath in dirwalk(path):
    found = False
    for item in ignore:
        if filepath.find(item) &gt; 0:
            found = True
            break
    if found:
        continue
    # do stuff..
 
# The next two ways build a list of matches
# If there are any matches then skip this file
 
# This is a more functional style
for filepath in dirwalk(path):
    if filter(lambda item: filepath.find(item) != -1, ignore):
        continue
    # do stuff..
 
# Another way to do the same thing
for filepath in dirwalk(path):
    if [item for item in ignore if filepath.find(item) != -1]:
        continue
    # do stuff..

Interestingly, the last two use the same number of characters. I’m not sure which I prefer. While I suspect the final one would be considered the most Pythonic, I do have a soft spot for lambda. Eh, maybe someday when I’m not lazy I will see which is the fastest.. though that doesn’t really matter for my purposes.

Really? ‘ is not valid HTML?

June 16th, 2009

Well, I learned something today. &apos; is not valid in HTML. Incredibly, IE gets it right and FF/etc gets it wrong by interpreting it as an apostrophe character. I had no idea.. the simplest correct thing to do is use: &#39; instead.

More at: http://stackoverflow.com/questions/419718/html-apostrophe and http://fishbowl.pastiche.org/2003/07/01/the_curse_of_apos/ and http://www.alistapart.com/articles/emen.

The A List Apart article is interesting, basically saying that everybody is doing everything wrong – technically. Yet I can’t help but feel that what people actually do trumps something written in a style guide. It’s kinda like when people say you can’t end a sentence with a preposition… but you can, and people do. Is it really bad grammar when everybody does it? Or does that just mean our rules of grammar are incorrect for modern usage? Well, linguistic and philosophical excursions aside, &apos; is bad. Who knew?

Extending Android’s Chronometer to get Elapsed Time

June 7th, 2009

In revamping one of my Games on the Android Market, I wanted to add an onscreen timer. The built-in Chronometer class does 95% of what I needed, but getting that extra 5% was annoying enough that I decided to post my solution.

The OOTB Chronometer keeps tracks of time, but it can’t tell you the number of seconds elapsed since it started. This is easily remedied, as I discovered from the fine people on StackOverflow. However, simply getting the elapsed time is not enough because the Chronometer will reset itself everytime the app is Paused or Killed. The two cases are quite different. When the app is Paused, it is still alive and may resume (ex. an incoming call). When the app is Killed, it needs to store the elapsed time somewhere so it can be recovered when the app is Restored (ex. orientation change, or b/c the OS needs memory).

So here is a stripped down solution that seems to do the job:

import android.app.Activity;
import android.content.Context;
import android.os.Bundle;
import android.os.SystemClock;
import android.util.Log;
import android.view.Menu;
import android.widget.Chronometer;
 
public class CustomChronometerActivity extends Activity {
	private static final String TAG = "CustomChronometerActivity";
	private static final String MS_ELAPSED = "com.etc.etc.MsElapsed";
 
	private MyChronometer chrono;
 
	@Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
 
        //start the chronometer
        chrono = new MyChronometer(this);
        chrono.start();
        setContentView(chrono);
    }
 
	@Override
	protected void onPause() {
		Log.i(TAG, "onPause()");
		super.onPause();
		chrono.stop();
	}
 
	@Override
	protected void onResume() {
		Log.i(TAG, "onResume()");
		super.onResume();
		chrono.start();
	}
 
	@Override
	protected void onSaveInstanceState(Bundle outState) {
		super.onSaveInstanceState(outState);
		Log.i(TAG, "onSaveInstanceState()");
		chrono.stop();
		outState.putInt(MS_ELAPSED, chrono.getMsElapsed());
	}
 
	@Override
	protected void onRestoreInstanceState(Bundle savedInstanceState) {
		super.onRestoreInstanceState(savedInstanceState);
		Log.i(TAG, "onRestoreInstanceState()");
		int ms = savedInstanceState.getInt(MS_ELAPSED);
		chrono.setMsElapsed(ms);
		chrono.start();
	}
 
	class MyChronometer extends Chronometer {
 
		public int msElapsed;
		public boolean isRunning = false;
 
		public MyChronometer(Context context) {
			super(context);
		}
 
		public int getMsElapsed() {
			return msElapsed;
		}
 
		public void setMsElapsed(int ms) {
			setBase(getBase() - ms);
			msElapsed  = ms;
		}
 
		@Override
		public void start() {
			super.start();
			setBase(SystemClock.elapsedRealtime() - msElapsed);
			isRunning = true;
		}
 
		@Override
		public void stop() {
			super.stop();
			if(isRunning) {
				msElapsed = (int)(SystemClock.elapsedRealtime() - this.getBase());
			}
			isRunning = false;
		}
	}
}

Certainly there are other ways to do this – either by implementing your own timer using Threads or Handlers, or perhaps by implementing an OnChronometerTickListener and subscribing to events. I rather like this solution, but if you’re the clever sort and see some situation where this doesn’t work or some reason why it might be a bad idea, please let me know.

More Reasons I Love Python..

May 19th, 2009

I was cleaning up some folders the other day at work where the files had been named using one of several naming schemes (or a few with no particular scheme at all). After brief consideration, I decided to do the legwork of renaming all the files with a naming scheme that actually makes sense:

Category_YYYY-MM-DD

That way, the files will stay grouped together if they get copied around to other folders, and they sort alphabetically by date. Then there’s the task for regenerating all the HTML for these baddies. Happily, Python was up to the task:

import os
from datetime import date
files = os.listdir("Path\\To\\File")
files.sort()
files.reverse()
for file in files:
    # chop the prefix, chop the suffix, split into (year, month, date), convert to int
    x = [int(x) for x in file.split("_")[-1][:-4].split('-')]
    print "<li><a href=\"/path/to/%s\">%s</a></li>" % (file, date(x[0], x[1], x[2]).strftime('%B %d, %Y'))

Well, it’s nothing like the real pros can do. But you gotta love a few links of code that save your fingers from a repetitive and typo-prone task like manually editing hundreds of links.