0

More Reasons I Love Python..

I was cleaning up some folders the other day at work where the files had been named using one of several naming schemes (or a few with no particular scheme at all). After brief consideration, I decided to do the legwork of renaming all the files with a naming scheme that actually makes sense:

Category_YYYY-MM-DD

That way, the files will stay grouped together if they get copied around to other folders, and they sort alphabetically by date. Then there’s the task for regenerating all the HTML for these baddies. Happily, Python was up to the task:

import os
from datetime import date
files = os.listdir("Path\\To\\File")
files.sort()
files.reverse()
for file in files:
    # chop the prefix, chop the suffix, split into (year, month, date), convert to int
    x = [int(x) for x in file.split("_")[-1][:-4].split('-')]
    print "<li><a href=\"/path/to/%s\">%s</a></li>" % (file, date(x[0], x[1], x[2]).strftime('%B %d, %Y'))

Well, it’s nothing like the real pros can do. But you gotta love a few links of code that save your fingers from a repetitive and typo-prone task like manually editing hundreds of links.

2

Getting your stats from the Android Marketplace with PHP/cURL

A few weeks ago I mentioned one way to get your developer stats off the Android Developer Console automatically in the post “Fetching Android Market Stats with Python, MozRepl, and BeautifulSoup“. Unfortunately, despite being very awesome, Firefox + MozRepl is not super-great for this task. When a plugin is updated, Firefox hangs on startup. That’s fine, but it kinda sucks for scripting. I’m sure there’s a way around it, but that difficulty makes a good excuse for coming back to solve this problem the right way.

Following is a PHP script that uses cURL to login to the Developer Console and grab the market stats. Unfortunately, Google’s app is written in GWT and the its Javascript is completely obfuscated. The market stats are fetched as JSON data and then somehow parsed, but I haven’t been able to figure out how exactly. If you run this script (or just look using Firebug), you’ll see that the JSON is a gigantic array. While the data of interest are clearly present in this array (total downloads, current installed base, rating, etc..), I haven’t been able to figure out how to parse it reliably. If you’ve tried this and figured it out, I’d love to know!

This script was assembled from a bunch of random PHP/cURL tutorials and may contain redundancy, unnecessary cURL settings, etc. Python fans, see the comments of my other post on this topic where a kind soul has demonstrated the same thing in Python using mechanize.

<?php
//setup a temp file to store cookies
$ckfile = tempnam ("/tmp", "CURLCOOKIE");
 
//do google authorization
$data = array('accountType' => 'GOOGLE',
          'Email' => 'YOUR_ACCOUNT_EMAIL_HERE',
          'Passwd' => 'YOUR_ACCOUNT_PW_HERE',
          'source' => '',
          'service' => 'androiddeveloper');  
 
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://www.google.com/accounts/ClientLogin");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
$output = curl_exec($ch);
$info = curl_getinfo($ch);
curl_close($ch);
 
//grab the AUTH token for later
$auth = '';
if($info['http_code'] == 200) {
    preg_match('/Auth=(.*)/', $output, $matches);
    if(isset($matches[1])) {
        $auth = $matches[1];
    }
}
 
//login to Android Market
//this results in a 302
//I think this is necessary for a cookie to be set
$ch = curl_init ("http://market.android.com/publish?auth=$auth");
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
 
//go to the Developer Console
$ch = curl_init ("http://market.android.com/publish/Home");
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
 
//grab the JSON data
//perm and postdata seem to have changed in the last 6 months
//if the script isn't working, try using firebug to inspect the Request when
//http://market.android.com/publish/editapp gets fetched
$perm = "81E29277804F7729E9B743A43B2EFD07";
$headers = array(
    "Content-Type: text/x-gwt-rpc; charset=utf-8",
    "X-GWT-Permutation: $perm",
    "Referer: http://market.android.com/publish/gwt/$perm.cache.html");
//not sure what x-gwt-permutation means, I think it may have to do with which version of GWT they serve based on your browser
$postdata = "5|0|4|http://market.android.com/publish/gwt/|09C42EAE15B55219550B2D800FAC1644|com.google.wireless.android.vending.developer.shared.AppEditorService|getFullAssetInfosForUser|1|2|3|4|0|";
$ch = curl_init ("http://market.android.com/publish/editapp");
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postdata);
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
 
//now what?!?
echo('<pre>');
$output = json_decode(substr($output, 4));
print_r($output);

If you run this script and are willing to send me your stats, that would be super-helpful. Maybe I’ll be able to get enough data to figure out why some apps have more fields than others. With only 3 apps currently on the market, I don’t have much to go on. Feel free to obscure your data, but please make the changes obvious and note whether the app is free/paid and what part of the market it appears on (games/apps and sub-category). Here is a link to my best guesses so far in an Excel worksheet: market-json

0

Finding Anagrams Is Harder Than It Should Be

So my most recent Android program is an anagram finder called Unanagram. For some reason I really like writing progams to solve word puzzles like Yahoo Word Racer and Scrabble. Anyway, while it seems like finding anagrams is a really easy thing to do, it turned out to be a little tricky. Following are some notes on the complications encountered in writing for Android and some solutions.

There is an easy algorithm to determine if two words are anagrams of each other: just sort the letters of each word alphabetically. If they match, they are anagrams. For example: “cats” and “acts” both sort alphabetically as “acst”.

Trickier though is finding ALL the anagrams for a particular set of letters. You could sort every word in the dictionary as a “key” and keep the unsorted word as the value. The trouble is that this obligates you to check every key and requires a fair amount of space as well. Also, it doesn’t really allow spaces which is vital for multi-word results. With only 16mb per program, memory-intensive algorithms are not viable.

After a fair amount of experimentation with different data structures, I decided to use my reliable friend, the Trie. This has the dual benefit of being space-efficient and having fast lookups. (A DAG might be more appropriate, but the Trie worked so well I didn’t bother investigating)

Nevertheless, a few complications arose.

First, you have to load the data structure from disk. Usually files like a dictionary would go into /assets. However, files in that directory get compressed, so they take longer to load. For a large file the time is unacceptable. For faster loading they can go in /res/raw.

Then there is actually building the data structure. A Trie is built through successive insertions which add branches to the tree as necessary. While adding 60k words takes less than a second on my laptop, it took 5 MINUTES on Android. That wasn’t gonna work. Since the application only needs lookups, and not insertions, I decided to build the Trie offline, serialize it into a pseudo-binary format and then load that in Android. While the file size increased from 600k to 1.5Mb, the loading time dropped to 6 seconds. This is still slow, but much more acceptable. By beginning the loading in a background thread as soon as the application starts, it becomes unnoticeable. Someone more clever than myself may be able to get both better compression and better load time, but that was “good enough” for my purposes.

Now because you only get 16Mb of RAM, it is necessary to build the Trie using as little memory as possible. I was able to get down to about 4.5Mb by building the Trie as a series of nodes, each with a Child pointer, a Next pointer and a Character indicating the node’s letter. This is a bit different from your typical Trie, which stores a list of child nodes. An uppercase letter was used to indicate a terminal node, rather than adding a boolean flag.

That’s great for the in-memory size of the tree, however, with a branching factor of 2, this tree has ENORMOUS depth. While again my laptop has no problem, Android hits a StackOverflow error when the recursion gets too deep. Happily, this is fixable by converting the naturally recursive traversal into an iterative algorithm with an external Stack. Yuck. But it works and requires an insignificant amount of memory. It’s almost certainly faster as well, but I didn’t check.

A final complication is that Android programs must always be ready to be Paused or Killed. Since generating ALL the anagrams for a long word can easily take over a minute, it would be bad news to re-start the search everytime the program gets killed (which happens a LOT – incoming calls, switching the screen orientation, and other programs contending for memory can all kill an app). Since the lookup is stack based, it was fairly straightforward to build a resume function which rebuilds the stack using the letters of the last word found.