A Little Python To Put Chartbeat in the Console

I can’t freakin’ stop watching Chartbeat. So.. to reclaim a little bit of my day, I’ve thrown together a little python script so I can cast an eye on the two important numbers (total users and page load time) without keeping a browser window open.

This just uses the Chartbeat JSON API to grab the data and Python’s curses module to keep it all on the same screen.

import json
import urllib
import curses
import time
 
host = 'HOSTNAME' # your .com here, sans http://www.
key = 'APIKEY' # you gotta register an API key with Chartbeat first
url = 'http://api.chartbeat.com/summize?host=%s&apikey=%s' % (host, key)
 
screen = curses.initscr()
curses.noecho()
curses.curs_set(0)
screen.nodelay(1)
screen.keypad(1)
 
try:
	while True:
		x = screen.getch()
		if x == ord('q'): break
		page = urllib.urlopen(url).read()
		data = json.loads(page)
		screen.clear()
		screen.addstr(0,0,"Visitors: %d"%data['people'])
		screen.addstr(1,0,"Pageload: %.1f"%(data['domload']/1000.0))
		screen.refresh()
		time.sleep(600) # sleep for 10 minutes
except KeyboardInterrupt:
	pass # Ctrl-C exits quietly; any other error still shows up after cleanup
finally:
	curses.nocbreak()
	screen.keypad(0)
	curses.echo()
	curses.endwin()

Quick Note: Getting WordPress Post ID from URL

Took a bit of googling to find this handy function to convert a url into the corresponding WordPress Post ID, so I’m writing it down so I don’t forget. Not a ton of cases where this is useful because you’re usually inside the Loop and have the ID readily available. I’m using it as part of a little http://chartbeat.com plugin. Since the API can only give back the url to the page, you need this little gem to get the ID.

You know this functionality had to exist somewhere in WP because that’s what it does internally – route urls to post_ids, but sometimes it is tricky to figure out what it’s called..

$post_id = url_to_postid('http://example.com/path/to/page');

Thanks to http://www.tech-otaku.com/blogging/posts-id-posts-url-wordpress/ for posting it first.

Mass Re-Indexing for the WordPress YARPP Plugin

On my buddy’s WordPress blog, we wanted to add a Related Posts feature to get people hooked after they read the article they came for. A quick search of the available plugins turned up YARPP (Yet Another Related Posts Plugin). The plugin is hugely configurable and it seems to do a great job of selecting relevant posts to display.

The one downside? It takes about 4 seconds to index a post.

Now, that’s not much, but when you have ~8,000 posts, that’s about 9 hours! Digging in a bit, the plugin is very smart: caching aggressively and only doing the expensive calculation when a page is saved or the first time that someone looks at it (in the case of old posts). However, it would be even nicer to do all the calculations offline and just upload a full cache.

So here’s a script that does that. Create a new file in the wp-content/plugins/yet-another-related-posts-plugin/ folder. I run this through PHP on the command line; otherwise, you risk problems with timeouts:

(Note: I am using YARPP 3.1.8, your mileage may vary, use at your own risk, etc..)

<?php
// Hook into WP so we can access the DB
include '../../../wp-blog-header.php';
// Load all the YARPP functions
include 'yarpp.php';
// Let YARPP create tables, if they don't exist already
yarpp_activate();
// Track timing so we can log progress and a rough estimate of time remaining
$time_start = time();
$time_elapsed = 0;
$time_remaining = 0;
$sql = "SELECT ID FROM $wpdb->posts WHERE post_type='post' and post_status='publish' ORDER BY ID desc";
$ids = $wpdb->get_col($sql,0);
$c = count($ids);
for($i=0; $i<$c; $i++) {
    $id = $ids[$i];
    printf("%d/%d\tID: %d\tELAPSED: %d\tREMAINING: %d\n",$i,$c,$id,$time_elapsed,$time_remaining);
    flush();
    // this fn causes yarpp to compute relatedness for the post
    yarpp_related(array('post'),array(),false,$id,'website');
    $time_elapsed = (time() - $time_start);
    // rough estimate; skip the first pass to avoid dividing by zero
    if ($i > 0) {
        $time_remaining = (($c-$i)-1) * ($time_elapsed/$i);
    }
}

If it finishes too early, try turning on error_reporting(E_ALL). I found that I was running out of memory until adding ini_set('memory_limit', '512M').

And that’s about it.. Since it takes a looong time to run, I’ve added some logging so you can be sure it doesn’t get stuck, etc..

...
3/10	ID: 45647	ELAPSED: 10	REMAINING: 35
4/10	ID: 45625	ELAPSED: 14	REMAINING: 28
5/10	ID: 45601	ELAPSED: 19	REMAINING: 23
6/10	ID: 45593	ELAPSED: 22	REMAINING: 17
7/10	ID: 45572	ELAPSED: 29	REMAINING: 14
8/10	ID: 45571	ELAPSED: 31	REMAINING: 8
9/10	ID: 45570	ELAPSED: 34	REMAINING: 4
...

If you run this on your live server, then just activate the plugin and you are done. Otherwise, activate the plugin and upload the two tables (_yarpp_keyword_cache and _yarpp_related_cache) to your live server.

Collaborative Filtering in Clojure, First Try

For reasons that are unclear, I would apparently rather spend the day screwing around with Clojure than working on my Android apps or finding freelance gigs. I *really* want to love Clojure because the blogs make it sound so cool once you know what you’re doing.

But.. I have no idea what I’m doing.. Maybe I can learn?

The trouble with learning programming languages is that you need to find a problem that is just the right difficulty. If you choose a toy problem, you solve it too quickly without learning anything. If you choose a real nasty one, you’ll get hung up and become frustrated.

With that in mind, I’m gonna try implementing some algorithms from the rather delightful “Programming Collective Intelligence” in Clojure (the book uses Python; it is also on Google Books, and the source code is available online). The examples in the book are well explained and include sample data, so you can be sure your implementation is at least getting the right answers. My hope is that by solving the problems, learning some more and then revisiting a few weeks later, I will eventually start writing better Clojure code.

So Chapter 3 of the book covers Collaborative Filtering. Briefly, it is a technique of using ratings to make suggestions. The notion makes intuitive sense: If I like “A, B and C” and you like “A and B”, there’s a good chance you’re gonna like “C” as well.

Ratings take the form of a Map of Maps:

{"Person A": {"Item 1": 1.5, "Item 2": 2.5}, 
 "Person B": {"Item 2": 3.5}}

Additionally, the algorithm employs a similarity metric, which sets a numerical value on how closely two people’s ratings agree. Intuitively, you’d be more interested in my recommendations if we like the same things, so the similarity metric provides a way to gauge that.

similarity(ratings_a,ratings_b) = 0.5
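To make that concrete, here is a minimal Python 3 sketch of one possible metric: the inverse Euclidean distance over the items both people have rated. The function name and the choice of metric are only for illustration (the Clojure version further down computes the same thing), and the two ratings maps are the toy ones from above.

```python
from math import sqrt

def sim_distance(ratings_a, ratings_b):
    """Similarity in (0, 1]; 0.0 when the two people share no rated items."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    # Sum of squared differences over the co-rated items only
    sum_sq = sum((ratings_a[item] - ratings_b[item]) ** 2 for item in shared)
    return 1.0 / (1.0 + sqrt(sum_sq))

ratings_a = {"Item 1": 1.5, "Item 2": 2.5}
ratings_b = {"Item 2": 3.5}
print(sim_distance(ratings_a, ratings_b))  # 0.5
```

Identical ratings give 1.0, and the value shrinks toward 0 as the ratings drift apart.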

Recommendations are the combination of all ratings, weighted by the similarity of the user who made the rating to the user we are making recommendations for.
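A rough Python 3 sketch of that weighting, assuming the Map of Maps layout from above. The names (`recommend`, `made_up_sims`) and the similarity values are invented for the example; any similarity function could be plugged in.

```python
def recommend(ratings, me, similarity):
    """Return [(score, item)] for items `me` has not rated, best first."""
    totals = {}    # item -> sum of rating * similarity
    sim_sums = {}  # item -> sum of similarity
    for other, their_ratings in ratings.items():
        if other == me:
            continue
        sim = similarity(me, other)
        if sim <= 0:
            continue  # ignore people we have nothing in common with
        for item, rating in their_ratings.items():
            if item in ratings[me]:
                continue  # only score items I have not rated yet
            totals[item] = totals.get(item, 0.0) + rating * sim
            sim_sums[item] = sim_sums.get(item, 0.0) + sim
    # Dividing by the summed similarity turns the total into a weighted average
    return sorted(((totals[i] / sim_sums[i], i) for i in totals), reverse=True)

# Toy data; the similarity values are made up for the example.
ratings = {
    "me": {"A": 5.0},
    "u1": {"A": 5.0, "C": 4.0},
    "u2": {"A": 1.0, "C": 2.0},
}
made_up_sims = {"u1": 0.75, "u2": 0.25}

result = recommend(ratings, "me", lambda me, other: made_up_sims[other])
print(result)  # [(3.5, 'C')]
```

Here u1 counts three times as much as u2, so the predicted rating for "C" lands closer to u1's 4.0 than to u2's 2.0.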

At this point, it should be clear that I’m terrible at explaining things. Just check out Chapter 3 of Programming Collective Intelligence and all will become clear.

Anyhow, here is my first attempt at an implementation:

(use 'clojure.set)
 
;;
;; Some utility functions
;;
 
;; sum of squares of differences
(defn sum-of-squares [a b]
  (apply + (map (fn [x y] (Math/pow (- x y) 2)) a b)))
 
;; book has typo in definition here, see errata
(defn inverse-sum-of-squares [a b]
  (/ 1 (+ 1 (Math/sqrt (sum-of-squares a b)))))
 
;; items this person has rated 
(defn seen-by [db person]
  (into #{} (keys (get db person))))
 
;; items person hasn't rated
(defn unseen-by [db name items]
  (difference items (set (map first (get db name)))))
 
;; items both people have rated
(defn co-rated-items [db name-1 name-2]
  (intersection (set (keys (get db name-1))) (set (keys (get db name-2)))))
 
;; ratings by name of co-rated items in alphabetical order
(defn co-ratings [db name corated]
  (vals (filter (fn [v] (contains? corated (first v))) (sort (get db name)))))
 
;;
;; Distance metrics
;;
 
;; Euclidean distance-based similarity
(defn sim-distance [db name-1 name-2]
  (let [corated (co-rated-items db name-1 name-2)]
    (if (empty? corated) 0
	(inverse-sum-of-squares (co-ratings db name-1 corated)
				(co-ratings db name-2 corated)))))
 
;; Pearson correlation
;; book has a typo causing incorrect float division
(defn sim-pearson [db name-1 name-2]
  (let [corated (co-rated-items db name-1 name-2)]
    (if (empty? corated) 0
	(let [n (count corated)
	      ratings-1 (co-ratings db name-1 corated)
	      ratings-2 (co-ratings db name-2 corated)
	      sum-1 (apply + ratings-1)
	      sum-2 (apply + ratings-2)
	      sum-1-sq (apply + (map (fn [x] (* x x)) ratings-1))
	      sum-2-sq (apply + (map (fn [x] (* x x)) ratings-2))
	      psum (apply + (map (fn [x y] (* x y)) ratings-1 ratings-2))
	      num (- psum (/ (* sum-1 sum-2) n))
	      den (Math/sqrt (* (- sum-1-sq (/ (* sum-1 sum-1) n))
				(- sum-2-sq (/ (* sum-2 sum-2) n))))]
	  (if (zero? den) 0
	      (/ num den))))))
 
;;
;; Recommendation algorithm
;;
 
(defn total-sums [db other items sim]
  (reduce (fn [m item] (assoc m item (* (get (get db other) item) sim))) {} items))
 
(defn sim-sums [items sim]
  (reduce (fn [m item] (assoc m item sim)) {} items))
 
;; combines sum(rating*similarity) and sum(similarity) of all raters
(defn loop-totals [db others me metric]
  (loop [others others t {} s {}]
    (if (= (count others) 0) [t s]
	(let [other (first others)
	      unseen-items (unseen-by db me (seen-by db other))
	      sim (metric db me other)]
	  (if (<= sim 0)
	    (recur (rest others) t s)
	    (recur (rest others)
		   (merge-with + t (total-sums db other unseen-items sim))
		   (merge-with + s (sim-sums unseen-items sim))))))))
 
;; generates recommendations as a sequence of [rating item] pairs, best first
(defn recommend [db me metric]
  (let [others (disj (set (map first db)) me)
	[my-totals my-sims] (loop-totals db others me metric)]
    (reverse (sort (map (fn [item] [(/ (get my-totals item) (get my-sims item)) item]) (set (keys my-totals)))))))
 
;;
;; functions dealing with creation of DB
;;
 
;; returns DB with a new rating added
(defn add-rating [db name item rating]
  (assoc db name (conj (get db name {item rating}) {item rating})))
 
;; returns DB with a list of new ratings added
(defn add-ratings [db ratings]
  (reduce
   (fn [db [name item rating]]
     (add-rating db name item rating))
   db ratings))
 
;; creates the DB used for examples in the book
(defn init-db []
  (add-ratings {}
	       [["Lisa Rose" "Lady in the Water" 2.5]
		["Lisa Rose" "Snakes on a Plane" 3.5]
		["Lisa Rose" "Just My Luck" 3.0]
		["Lisa Rose" "Superman Returns" 3.5]
		["Lisa Rose" "You, Me and Dupree" 2.5]
		["Lisa Rose" "The Night Listener" 3.0]
		["Gene Seymour" "Lady in the Water" 3.0]
		["Gene Seymour" "Snakes on a Plane" 3.5]
		["Gene Seymour" "Just My Luck" 1.5]
		["Gene Seymour" "Superman Returns" 5.0]
		["Gene Seymour" "The Night Listener" 3.0]
		["Gene Seymour" "You, Me and Dupree" 3.5]
		["Michael Phillips" "Lady in the Water" 2.5]
		["Michael Phillips" "Snakes on a Plane" 3.0]
		["Michael Phillips" "Superman Returns" 3.5]
		["Michael Phillips" "The Night Listener" 4.0]
		["Claudia Puig" "Snakes on a Plane" 3.5]
		["Claudia Puig" "Just My Luck" 3.0]
		["Claudia Puig" "The Night Listener" 4.5]
		["Claudia Puig" "Superman Returns" 4.0]
		["Claudia Puig" "You, Me and Dupree" 2.5]
		["Mick LaSalle" "Lady in the Water" 3.0]
		["Mick LaSalle" "Snakes on a Plane" 4.0]
		["Mick LaSalle" "Just My Luck" 2.0]
		["Mick LaSalle" "Superman Returns" 3.0]
		["Mick LaSalle" "The Night Listener" 3.0]
		["Mick LaSalle" "You, Me and Dupree" 2.0]
		["Jack Matthews" "Lady in the Water" 3.0]
		["Jack Matthews" "Snakes on a Plane" 4.0]
		["Jack Matthews" "The Night Listener" 3.0]
		["Jack Matthews" "Superman Returns" 5.0]
		["Jack Matthews" "You, Me and Dupree" 3.5]
		["Toby" "Snakes on a Plane" 4.5]
		["Toby" "Superman Returns" 4.0]
		["Toby" "You, Me and Dupree" 1.0]]))

Then to use it, you’d do something like this:

(def *db* (init-db))
(recommend *db* "Toby" sim-distance)

So it works, and that makes me reasonably happy.. though I hope to improve it in the future. Learning more of the core library should help.

Comments and suggestions are very welcome.

A few notes:

* Having never made a *serious* commitment to a REPL before, it was interesting to build up functions piece-at-a-time from the inside out.. I can dig that.

* I’m not too happy with the way computing sum-of-squares works: I figure out which items are rated by both people and then have to sort each person’s ratings by item name to make sure they’re in the same order. sum-of-squares itself is cool, but the way I prepare the data feels weird. Not sure what would be better though..

* Also, there are a couple of typos in the version of the book I have that cause some of the numbers to come out differently. Not sure if they’ve been corrected in later editions. More info: Unconfirmed Errata

Making My Own WordPress Chartbeat Plugin

Instead of doing something useful this morning, I made my own little plugin using the Chartbeat API to display the most popular posts on a WordPress blog.

Note: There is really no reason to do this. The Chartbeat Plugin does this exact same thing and more. However, it was an entertaining exercise for me to practice writing WordPress plugins.

Also Note: This only works if you have signed up for Chartbeat and have an API key.

The reason this is cool? Well, most of your “most popular posts” plugins need to make an extra call to the database to get/set a counter because WordPress doesn’t track page views by default. But if you’re using Chartbeat to track your blog’s performance, you can save some effort by using their numbers instead.

And with no further ado, here’s the code:

<?php
/*
Plugin Name: Ct Most Popular
Plugin URI: http://www.craiget.com
Description: Display most viewed posts using the Chartbeat API, exposes one function: ct_most_popular_plugin_widget(); 
Version: 0.1
Author: Craige
Author URI: http://craiget.com
License: For example and testing purposes. Not suggested for use on a real site.
*/
 
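// declared global explicitly: during activation WordPress includes this file
// from inside a function, so plain top-level variables would not be global
global $ct_most_popular_plugin_version, $ct_most_popular_plugin_data;
$ct_most_popular_plugin_version = "0.1";
 
$ct_most_popular_plugin_data = array();
 
// create a most_popular option
register_activation_hook(__FILE__, 'ct_most_popular_plugin_install');
function ct_most_popular_plugin_install()
{
	global $ct_most_popular_plugin_version, $ct_most_popular_plugin_data;
	add_option("ct_most_popular_plugin_data", $ct_most_popular_plugin_data);
	add_option("ct_most_popular_plugin_version", $ct_most_popular_plugin_version);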
	// schedule hourly update
	wp_schedule_event(time(), 'hourly', 'ct_most_popular_plugin_update_event');
}
 
// delete the most_popular option
register_deactivation_hook(__FILE__, 'ct_most_popular_plugin_uninstall');
function ct_most_popular_plugin_uninstall()
{
	delete_option("ct_most_popular_plugin_data");
	delete_option("ct_most_popular_plugin_version");
	// un-schedule hourly update
	wp_clear_scheduled_hook('ct_most_popular_plugin_update_event');
}
 
// appear under "Settings" on the admin page
add_action('admin_menu', 'ct_most_popular_plugin_menu');
function ct_most_popular_plugin_menu() {
	// the 4th argument is the menu slug; it should not be empty
	add_options_page('Ct Most Popular', 'Ct Most Popular', 'manage_options', 'ct-most-popular', 'ct_most_popular_plugin_options');
}
 
// init option values in db
add_action('admin_init', 'ct_most_popular_plugin_options_init' );
function ct_most_popular_plugin_options_init(){
	register_setting('ct_most_popular_plugin_options', 'ct_most_popular_plugin', 'ct_most_popular_plugin_validate' );
}
 
// sanitize and validate input
function ct_most_popular_plugin_validate($input) {
	$input['host'] =  wp_filter_nohtml_kses($input['host']);
	$input['chartbeat_api_key'] =  wp_filter_nohtml_kses($input['chartbeat_api_key']);
	$input['limit'] =  (int)($input['limit']);
	if($input['limit'] == 0) $input['limit'] = 10;
	return $input;
}
 
// display options page html
function ct_most_popular_plugin_options() {
	if (!current_user_can('manage_options'))  {
		wp_die(__('You do not have sufficient permissions to access this page.') );
	}
?>
<div class="wrap">
	<h2>Ct Most Popular Plugin Options Title</h2>
	<form method="post" action="options.php">
		<?php settings_fields('ct_most_popular_plugin_options'); ?>
		<?php $options = get_option('ct_most_popular_plugin'); ?>
		<table class="form-table">
		<tr valign="top">
			<th scope="row">Host</th>
			<td><input type="text" name="ct_most_popular_plugin[host]" value="<?php echo esc_attr($options['host']); ?>" /></td>
			<td><i>ie, example.com</i></td>
		</tr>
		<tr valign="top">
			<th scope="row">Chartbeat API Key</th>
			<td><input type="text" name="ct_most_popular_plugin[chartbeat_api_key]" value="<?php echo esc_attr($options['chartbeat_api_key']); ?>" /></td>
			<td><i><a href="http://chartbeat.com/apikeys/">http://chartbeat.com/apikeys/</a></i></td>
		</tr>
		<tr valign="top">
			<th scope="row">Limit</th>
			<td><input type="text" name="ct_most_popular_plugin[limit]" value="<?php echo esc_attr($options['limit']); ?>" /></td>
			<td><i>number of items to show, 10</i></td>
		</tr>
		</table>
		<p class="submit">
			<input type="submit" class="button-primary" value="<?php _e('Save Changes') ?>" />
		</p>
		<p>
		This plugin uses the <a href="http://chartbeat.pbworks.com/">Chartbeat API</a> to show the most popular pages on your site, updated hourly.
		</p>
		<p>
		This plugin was created for my own amusement and to practice creating WordPress plugins; it is <strong>NOT RECOMMENDED</strong> for use.
		</p>
		<p>
		Chartbeat has released a perfectly good plugin that does this and more: <a href="http://wordpress.org/extend/plugins/chartbeat/">http://wordpress.org/extend/plugins/chartbeat/</a>
		</p>
		<p>
		This plugin fetches new data once every hour using WordPress's built-in <a href="http://codex.wordpress.org/Function_Reference/wp_schedule_event">scheduling hooks</a> to update the list of popular posts.
		This keeps things self-contained, but doesn't provide much flexibility. You may want to use cron instead, which would require a little hacking.
		</p>
	</form>
</div>
<?php
}
 
// get popularity data from chartbeat, store in db
add_action('ct_most_popular_plugin_update_event', 'ct_most_popular_plugin_update_chartbeat');
function ct_most_popular_plugin_update_chartbeat() {
	// construct chartbeat call
	$options = get_option('ct_most_popular_plugin');
	$host = $options['host'];
	$apikey = $options['chartbeat_api_key'];
	$limit = $options['limit'];
	// build url
	$url = 'http://api.chartbeat.com/toppages/?host=HOST&limit=LIMIT&apikey=APIKEY';
	$url = str_replace('HOST', $host, $url);
	$url = str_replace('APIKEY', $apikey, $url);
	$url = str_replace('LIMIT', $limit, $url);
	// fetch data
	$data = file_get_contents($url);
	if($data === false)
		return;
	$data = json_decode($data, true);
	// exit if the response is malformed or not enough results came back
	if(!is_array($data) || count($data) < $limit)
		return;
	$result = array();
	for($i=0; $i<count($data); $i++) {
		if($data[$i]['path'] == "/")
			continue;
		$result[] = $data[$i];
	}
	$result = array_slice($result, 0, $limit);
	// store in db
	update_option("ct_most_popular_plugin_data", $result);
}
 
// add this function in your sidebar
function ct_most_popular_plugin_widget() {
	$data = get_option("ct_most_popular_plugin_data");
	echo('<ul>');
	foreach ($data as $post) {
		echo('<li>');
		echo('<a href="'.esc_url($post['path']).'">'.esc_html($post['visitors'].'-'.$post['i']).'</a>');
		echo('</li>');
	}
	echo('</ul>');
}

Go to “Settings” > “Ct Most Popular” to set your API Key and other options.

Updates occur once each hour.

You’ll almost certainly want to tweak the way the posts are displayed in the ct_most_popular_plugin_widget() function.

Anyway.. just fooling around.. For all the frustration it has caused me.. Still gotta say, WordPress is pretty friggin’ cool.

Is AdMob worth it? Maybe..

I admit having mixed feelings about internet advertising. I guess at present, I view it as a necessary evil in which I happen to participate. As I can hardly muster my thoughts into coherence on THAT subject, instead, here’s a bit on how it is working out for me.

So “paid” apps on the Android Market are not making money for most developers. There may be a host of reasons, but I suspect it has to do with iPhone users being very comfortable with the $0.99 music purchases from iTunes and applying the same mentality to the App Store, while Android users simply aren’t used to pulling the trigger. Free apps, on the other hand, do just fine.

The conversion rate from free to paid on my two popular apps?

1/10,000 and 1/1,000. No joke. 400,000 free downloads. 40 paid.

Since that sucks, like many folks, I’ve resorted to using Ads to make some money from free apps.

After showing ads for a while, I thought it would be interesting to try promoting my newest app by BUYING some ads as well.

First, I set up 2 House Ads, which are free, but only show up in your own applications. (So you can advertise your own stuff, but you don’t make any money). Over a week, these have had a rather good click-thru rate of 4.44%.

For the experiment, I spent $50 (the minimum allowed) for one day of regular advertising on AdMob, creating two ads identical to my House Ads. Surprisingly, these had a much lower click-thru rate of 0.47% (from 353,690 impressions).

I’m not sure what to make of this, but the most obvious conclusion seems to be that my purchased ads were badly targeted. Oddly, when creating ads, you can choose some basic demographic information like location, age and gender, but you don’t get to target specific keywords. However, as an app publisher, you *do* get to target keywords. How does that work? I can only guess that they match the keywords against the 35 (max) letters of the ad text. But that can’t possibly be reliable in the same way as matching a long webpage body text. Maybe they do it manually? That would explain the 24 hour-ish ad approval period.. Hmm…

Anyway, kinda guessing a bit, since there’s not a good way to know when each download occurred and whether it was an ad click-thru or a normal download, it looks like the $50 netted about 1000 downloads. Or, about $0.05 per download.

That’s on a free app, by the way. Paid apps will likely have a MUCH lower conversion, resulting in a higher cost per download.

So, is it worth it? Well, that’s hard to say. My current feeling is that it might be worth it initially to bootstrap a new app with a couple thousand downloads. When people download an app, they see a range indicating the approximate number of downloads (1-50, 50-250, etc..). I think it inspires confidence to see that an app has been downloaded 10,000 times. Also, more downloads seem to mean higher rankings in the “Top Free” section, more or less..

However, my most popular app, which was never advertised, has nearly 400,000 downloads. That would cost $20,000!! (yeah, I know that estimate makes *tons* of assumptions) So paid advertising is certainly not a viable way to get all the way to the top of the “Top Free” section.
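For anyone who wants to check the back-of-the-envelope math, here is a small Python 3 sketch using only the figures quoted in this post. The 1000 downloads is the rough guess from above, not a measured number, and the variable names are just for illustration.

```python
spend = 50.00          # dollars spent on the one-day campaign
impressions = 353690   # paid ad impressions
ctr = 0.0047           # 0.47% click-thru rate
downloads = 1000       # rough guess at downloads attributable to the ads

clicks = impressions * ctr                 # about 1662 clicks
cost_per_click = spend / clicks            # about $0.03, the minimum bid
cost_per_download = spend / downloads      # $0.05
projected = 400000 * cost_per_download     # cost of buying 400,000 downloads

print("clicks: %d" % round(clicks))
print("cost per click: $%.2f" % cost_per_click)
print("cost per download: $%.2f" % cost_per_download)
print("400,000 downloads: $%d" % round(projected))
```

The click count times the $0.03 minimum bid lands right around the $50 spent, so the numbers at least hang together.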

Anyway, I would be interested in anyone else’s experiences with advertising free apps on AdMob.

One point to note, the AdMob Help seems to indicate that you need to pay a higher CPC rate if you want your ads to be shown. In my experience, that was not the case. Even with the lowest $0.03, my ads still got shown over 350k times. So don’t pay $0.20 for each click! Furthermore, as an app publisher, I rarely see a 100% fill rate. While the reality may be more complicated, that seems to indicate that there are too many publishers and not enough advertisers.

Fetching Android Market Stats with Selenium RC

Finally.. I’ve got a reasonably decent way to pull Android Market stats. For some reason I keep coming back to this topic (see here and here). This time, the way forward is to use Selenium RC, part of the Selenium browser testing suite.

My example will be in Python, but Selenium has bindings for several languages.

First of all, you gotta download Selenium RC from here: http://seleniumhq.org/download/

Then, extract it someplace you can remember. I’ve been putting things in ~/opt lately.

Okay, now create a new Python script, comme ça:

import sys
sys.path.append('/the/path/to/selenium-python-client-driver-1.0.1')
 
from selenium import selenium
 
email = 'YOUR_GOOGLE_LOGIN'
passwd = 'YOUR_PASSWORD'
 
s = selenium("localhost", 4444, "*firefox", "http://market.android.com")
s.start()
s.open("/publish/Home")
s.type("Email", email)
s.type("Passwd", passwd)
s.click("signIn")
s.wait_for_page_to_load("30000")
 
n = int(s.get_xpath_count("//div[@class='listingRow']"))
for i in range(3,n):
  try:
    title = s.get_text("xpath=(//div[@class='listingRow'])[%s]/div[1]/div[1]" % i)
    downloaded = s.get_text("xpath=(//div[@class='listingRow'])[%s]/div[2]/div[1]/span[1]" % i)
    installed = s.get_text("xpath=(//div[@class='listingRow'])[%s]/div[2]/div[2]/span[1]" % i)
    comments = s.get_text("xpath=(//div[@class='listingRow'])[%s]/table" % i)[1:-1]
    print title, downloaded, installed, comments
  except:
    pass

* Be sure to fill in YOUR_GOOGLE_LOGIN with your email (or whatever login) and the matching password.

This script is a bit of a trainwreck.. but it works and I don’t feel like screwing with it..

* Working with xpath in selenium-rc’s python binding feels really weird.. doesn’t seem to behave quite the way you would expect.

* Why does the iteration start at 3? I dunno.. there are some empty rows at the beginning I guess..

* Why is it wrapped in a try-except block? I dunno.. some empty rows at the end?

* It works on Ubuntu 10.04 / FF 3.6.3. Your mileage may vary. I wouldn’t be surprised if those xpath selectors needed more tweaking in some cases.

To run the script, you need to start the Selenium RC server. Go to the place you downloaded it:

cd /path/to/selenium
java -jar selenium-server.jar

Then, you should be able to run this script from a terminal and it will start firefox, log you in to the Android Developer Console, wait a few seconds til the Ajax all loads, then use xpath to scrape each row of data from the table and print it to the terminal.

From there it should be pretty simple to export the results into a CSV file or make pretty charts or whatever it is you wanna do.
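As a starting point for the CSV route, here is a minimal sketch using the standard csv module. It is written for Python 3 (the scraper above is Python 2), and the rows are hypothetical stand-ins for the title/downloaded/installed/comments values the scraper prints; the real strings may need cleaning first.

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize (title, downloaded, installed, comments) tuples as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title", "downloaded", "installed", "comments"])
    writer.writerows(rows)
    return out.getvalue()

# Hypothetical rows standing in for what the scraper prints.
rows = [
    ("Example App", "1200", "800", "15"),
    ("Other App, Pro", "300", "120", "2"),
]
csv_text = rows_to_csv(rows)
print(csv_text, end="")
```

The csv module takes care of quoting, so a title with a comma in it does not break the columns.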

It does pop up a window on the screen, which is kinda annoying. Cooler to run firefox headless, maybe some other time..

Integer overflow, a first time for everything

Somehow until today I had avoided the bite of an integer overflow bug.

I wanted to get a series of date strings in the format “yyyymmdd” for fetching some resources from a website. So the following seemed like it should work:

//bad - don't do this
long TS0 = 1272690000000L; //may 1st, 2010
long millis = System.currentTimeMillis() + 1000*60*60*24*365; //one year from today
while(millis > TS0) {
  String date = DateFormat.format(yyyyMMdd, millis).toString();
  millis -= 1000*60*60*24;
}

That method worked.. kind of.. doing the next 10 days worked just fine. But doing 365 days didn’t! Huh?!

After bashing on it in place for waaay too long and beginning to question my sanity, I decided to write a separate little program to isolate the problem.

public class WaitWhat {
  public static void main(String args[]) {
    long millis = 1000*60*60*24*365; // should be 31,536,000,000
    System.out.println(millis);
  }
}

And the result?

1471228928

Well, that came as something of a surprise..

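So what’s going on? Well, all of those literals are ints, so Java does the whole multiplication in 32-bit int arithmetic and only converts the result to a long at the assignment. Since 31,536,000,000 is larger than the maximum int of 2,147,483,647, it wraps around.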

I guess I assumed that multiplication would automatically use longs if it needed to. Apparently not the case!
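To see where 1471228928 comes from, here is a small Python 3 sketch that imitates 32-bit wraparound. Python integers never overflow on their own, so the wrapping is done by hand; the helper names are made up for this example.

```python
def wrap_int32(n):
    """Reduce n to the signed 32-bit range, the way Java int arithmetic does."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

def java_int_product(*factors):
    """Multiply left to right, wrapping after every step like Java ints."""
    result = 1
    for f in factors:
        result = wrap_int32(result * f)
    return result

print(java_int_product(1000, 60, 60, 24, 365))  # 1471228928
print(1000 * 60 * 60 * 24 * 365)                # 31536000000
```

The first four factors fit comfortably (86,400,000 for one day); it is only the final multiplication by 365 that goes past the int range and wraps.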

So what’s the fix? Force long multiplication so it doesn’t overflow, like this:

//note the "L"
long millis = 1000L*60*60*24*365;

Well, an interesting little lesson.. Stupid bugs like that are always humbling.. I wonder how many little gems like that are buried in my code, just waiting for their day..

Can Clojure Find Me An Apartment?

This post was going to be about how I spent the better part of a day trying to get clojure and emacs and slime and the java classpath all working together.

The gist of it is this: I am an idiot sometimes. I spent most of an afternoon trying to figure out why it is an error to (use 'clojure.contrib). Earlier in the day, my classpath was set up wrong, so (use 'clojure.contrib.duck-streams) didn’t work. At some point, I stopped typing the whole thing, thinking that if 'clojure.contrib.duck-streams works, then so should the parent package 'clojure.contrib. A-ha! Save myself a bit of typing! Nope. That never works.. so, when I finally did get my classpath working, I didn’t know it because I was typing something that’s just plain wrong. Hilarious and Awesome, huh?

So, with everything finally working, I made my first little half-way real Clojure program.

Our current lease runs out in about 6 weeks, so me and my roommate need to find a new place to live – sounds like a job for Craigslist. There’s a problem though: in big cities, Craigslist is absolutely flooded with apartments and the search functions just aren’t that good. I have no interest in skimming hundreds or thousands of posts looking for that perfect combination of price/location/amenities (well, mostly price and location, actually), so why not let the computer do the work instead? Usually this would be a job for Python/BeautifulSoup, but in the interest of learning Clojure, here goes..

Following is what I’ve come up with so far for scraping apartments off Craigslist as gently as possible by filtering out links that don’t meet my criteria. Right now, this code only generates the list of matching links; it doesn’t actually follow them. If I continue further with this program, that will be Step 2, probably using http://lethain.com/entry/2009/nov/24/scalable-scraping-in-clojure/ for inspiration.

This is based on the Enlive library, which provides a very usable syntax for ripping through HTML (though I don’t quite understand it all yet). As I’m still a complete beginner with Clojure and functional programming in general, the following code is probably far from idiomatic and may look sloppy to you pros out there. Comments and suggestions are welcome!

;; import enlive
(use 'net.cgrand.enlive-html)
 
;; html helper
(defn fetch-url [url]
  (html-resource (java.net.URL. url)))
 
;; pulls link from paragraph
;; ie, (map get-link (select *cl* [:p]))
(defn get-link [p]
  (:href (:attrs (first (:content p)))))
 
;; pulls text of link from paragraph
(defn get-link-text [p]
  (:content (first (:content p))))
 
;; pulls text of parens following link
;; usually this is zipcode/location info
;; "", if absent
(defn get-paren-text [p]
  (let [content (:content p)]
    (if (< 2 (count content))
      (:content (nth content 2))
      "")))
 
;; pulls link/text/location into a map
(defn get-all [p]
  {:link (get-link p)
   :text (str (get-link-text p)
	      (get-paren-text p))})
 
;; some helpers to remove links we don't care about 
 
;; (affordable? "$800" 600 1000) => true
;; (affordable? "$1500" 600 1000) => false
(defn affordable? [text min max]
  (let [price (second (re-find #"\$(\d+)" text))]
    (if price
      (let [price (Integer/parseInt price)]
	(and (<= min price)
	     (>= max price))))))
 
;; (has-kword? "downtown" (list "down")) => true
;; (has-kword? "down" (list "downtown")) => nil
(defn has-kword? [text kwords]
  (let [vals (map #(re-find (re-matcher (re-pattern %) text)) kwords)]
    (some #(not (= nil %)) vals)))
 
;; parameterizes a function to decide if a link is worth retrieving
;; this would be cooler if the criteria functions
;; came in as a list too.. but that makes my head
;; spin.. maybe later
(defn keep-link? [min max areas beds]
  (fn [{link :link text :text}]
    (let [text (.toLowerCase text)]
      (and link
	   (re-find #"/apa/" link)
	   (affordable? text min max)
	   (has-kword? text areas)
	   (has-kword? text beds)))))
 
;; some top level definitions
;; you may need to change these to get non-empty results
(def *url* "http://losangeles.craigslist.org/apa/")
(def *min-price* 100)
(def *max-price* 10000)
;; I kinda like it in the South Bay, but whatever..
(def *areas* (list "hollywood" "weho"))
(def *beds* (list "2br" "3br"))
(def my-keep-link? (keep-link? *min-price* *max-price* *areas* *beds*))
 
;; actually do the work
(filter my-keep-link? (map get-all (select (fetch-url *url*) [:p])))
 
;; References
;; 1) http://wiki.github.com/cgrand/enlive/
;; 2) http://github.com/swannodette/enlive-tutorial/
;; 3) Programming Clojure, Stuart Halloway
;; 4) lots and lots of Googling

On the whole, I’m liking Clojure a lot, but there is also a lot to learn.

(Shocking conclusion, I know!)

Collaborative Filtering, Hadoop and the Hazards of Copy-Paste

I’ve been working on a new App idea lately – a recommender for Android programs. Basically, it looks at what you have installed (and possibly ratings) and recommends other applications you might like by using the recommendations of other people in the same way as Amazon or the various music services – in a word – collaborative filtering.

There are different ways to do collaborative filtering, but they all get expensive once you have a lot of records to sort through. Two common approaches are 1) calculate the similarity of users, and recommend apps liked by similar users, or 2) calculate the similarity of apps, and recommend apps similar to ones the user likes. I am trying the second way, known as item-based collaborative filtering or the model-based approach, which allows for fast queries at the cost of an expensive offline step that re-computes the item similarities every once in a while.
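To make the second approach concrete, here is a minimal sketch in Python. It is not the code I am actually running: the install data is made up, Jaccard overlap of user sets is just one possible similarity measure, and the names item_similarities and recommend are invented for the illustration.

```python
from collections import defaultdict
from itertools import combinations


def item_similarities(installs):
    """Offline step: Jaccard similarity for every pair of apps that share a user."""
    users_by_app = defaultdict(set)
    for user, apps in installs.items():
        for app in apps:
            users_by_app[app].add(user)

    sims = defaultdict(dict)
    for a, b in combinations(sorted(users_by_app), 2):
        shared = len(users_by_app[a] & users_by_app[b])
        if shared:
            score = shared / len(users_by_app[a] | users_by_app[b])
            sims[a][b] = score
            sims[b][a] = score
    return sims


def recommend(user_apps, sims, top_n=3):
    """Online step: score unseen apps by summed similarity to the user's apps."""
    scores = defaultdict(float)
    for app in user_apps:
        for other, score in sims.get(app, {}).items():
            if other not in user_apps:
                scores[other] += score
    # Highest score first; ties broken alphabetically so the result is stable.
    return sorted(scores, key=lambda a: (-scores[a], a))[:top_n]


# Hypothetical install data: user -> set of installed apps.
installs = {
    "alice": {"maps", "mail", "notes"},
    "bob": {"maps", "mail"},
    "carol": {"mail", "notes", "chess"},
    "dave": {"chess", "notes"},
}
sims = item_similarities(installs)
print(recommend({"maps"}, sims))
```

The pairwise loop in item_similarities is the expensive part, and it is the piece that has to move offline. The recommend step only does dictionary lookups, which is why queries stay fast.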

My initial tests in Python, based on the very interesting book “Programming Collective Intelligence”, quickly became too slow with just a few thousand users and apps. Because there are already around 5,000 apps and a few million users of Android (with many more every day), there’s no way the script would be able to handle the future growth of the platform.

Enter MapReduce and Hadoop. The explanation is better left to the pros, but simply, MapReduce is a way of parallelizing certain types of computations across many computers and then merging the final results. With the availability of Amazon Web Services, which allows you to rent a cluster of computers by the hour, it becomes possible to run a prohibitively expensive computation once every few days for just a couple of dollars. There are several different MapReduce frameworks out there, but I chose to try Hadoop, which is available on Amazon’s services and used heavily by Yahoo and many others.
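For a rough feel of the model, here is a single-process toy in Python. It only imitates the map, shuffle and reduce phases on one machine; it is not Hadoop's API, and the function names are made up for the sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster each line (or block of lines) could go to a different machine.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["the quick fox", "the lazy dog", "The end"])
print(counts["the"])
```

The parallelism comes from the fact that map_phase looks at one line at a time and reduce_phase looks at one key at a time, so neither needs to know about the rest of the data.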

There will be a lot more to say about Hadoop as I gain more experience. But all-in-all, it is pretty fun to re-think an algorithm, even just a little bit, to make it suitable for MapReduce. I *think* I have a correct implementation of Item-Based Collaborative Filtering running on my tiny 2-node cluster and it’s pretty cool!

I ran into one snag while trying to get my cluster running with the ubiquitous WordCount example for Hadoop. Like most people, I copy-pasted the source from the Hadoop tutorial and tried to run it. It ran, great! So then instead of reading the rest of the documentation, I immediately tried to modify it. Eventually, I ended up trying to make the simplest change, returning Text instead of IntWritables from the Map operation, and... WTF!?! I spent HOURS trying to figure out why there was a ClassCastException. So for other poor souls trying to modify the WordCount example, there are 3 things you need to do:

First, get the method signatures right. The Mapper has to output Text and the Reducer has to consume Text (Eclipse will help with that, of course).

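Second, add the lines “conf.setMapOutputKeyClass(Text.class);” and “conf.setMapOutputValueClass(Text.class);” to the main() method. Unless told otherwise, Hadoop assumes the Mapper emits the same types as the job's final output, which in the example means IntWritable values. These lines tell it that the Mapper now emits Text.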

Third, and crucially important, remove the line “conf.setCombinerClass(Reduce.class);”. Discovering that I needed to remove that single line took me about half a day, digging through the logs and Googling everything I could think of until I discovered this thread. Because it was part of the example, I assumed it was Hadoop boilerplate that was essential. It’s not, it’s an optimization. The Combiner is kind of like a pre-Reduce phase that saves time by combining in-memory results instead of writing them to disk and combining them later. The Combiner’s input and output types both have to match the Mapper’s output types, so that whatever it emits is still valid input to the Reducer. Otherwise, it chokes.
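The combiner problem is easier to see in a toy simulation than in Hadoop logs. The Python sketch below is not Hadoop code: run_job, group and text_combiner are names invented for the illustration, strings stand in for Text, ints stand in for IntWritable, and a TypeError plays the role of the ClassCastException.

```python
from collections import defaultdict


def mapper(line):
    """Emits text values, like the modified WordCount Mapper."""
    return [(word, "1") for word in line.split()]


def reducer(key, values):
    """Consumes text values and emits an int count."""
    for value in values:
        if not isinstance(value, str):
            # Stand-in for Hadoop's ClassCastException.
            raise TypeError(f"expected str value, got {type(value).__name__}")
    return key, sum(int(v) for v in values)


def text_combiner(key, values):
    """A combiner that keeps the Mapper's output type (text in, text out)."""
    return key, str(sum(int(v) for v in values))


def group(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_job(splits, combiner=None):
    """Run the mapper per split, optionally combine per split, then reduce."""
    map_output = []
    for split in splits:
        pairs = [p for line in split for p in mapper(line)]
        if combiner is not None:
            pairs = [combiner(k, v) for k, v in group(pairs).items()]
        map_output.extend(pairs)
    return dict(reducer(k, v) for k, v in group(map_output).items())


splits = [["a b a"], ["a c"]]
print(run_job(splits))                          # no combiner: works
print(run_job(splits, combiner=text_combiner))  # type-preserving combiner: works
try:
    run_job(splits, combiner=reducer)           # reducer reused as combiner
except TypeError as err:
    print("combiner failed:", err)
```

Reusing the reducer as the combiner only works when the reducer's output types happen to equal the mapper's output types, which is true in the stock WordCount and stops being true the moment you change one side.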

Such is the peril of the copy-paster who runs code without really understanding all of it ~~