Tuesday, 18 January 2011

desktop data structures

Most humans when sorting—ordering a deck of cards, for example—use a method that is similar to insertion sort. —Wikipedia - Insertion Sort
The obvious lessons:
  • Insertion sort is a simple and intuitive algorithm
  • The obvious algorithm is sometimes slow
  • Humans optimise for human-sized problems

Hang on though, when I sort a deck of cards, I do this:

  1. Sort 5 cards through some intuitive process.
  2. Grab another card (if there is one)
  3. Find where it goes
  4. Insert it
  5. Go to step 2

But I certainly don't lay all the sorted cards out in a line, and move them all across by one when I insert one into the list. I just make a gap and stick the card in. This doesn't take longer if i have more cards!

What does take longer is finding the right place to insert a card, but when this gets hard I end up using a modified binary search. (Modified because I know something about the distribution of cards, and because it's easy to look at (and cache) a few consecutive cards). So this is log(n), and the overall algorithm is O(n log n), or as we say in Germany, 'fast'[1].

The ringbinder

We get away from it because real life has a 'data structure' that's half array, half linked list:

  • 'Get the nth item' is constant time (ish. You get pretty good at estimating how thick 10 cards are)
  • Insert an item is constant time.
  • Remove an item is constant time

I'm going to call this structure the 'Ringbinder' because 'Stack' is already taken by something that behaves nothing like any real-life stack I've ever seen. For sufficiently large values of N, your program will run faster using a USB robotic ring-binder operator rather than an ArrayList. You heard it here first.

Divserion: quantum bogosort

Cheating at algorithms using real-life's weirdnesses isn't new. qfork() is a version of fork() that returns a different value in each quantum universe. So we sort in O(n) like so:

srand(qfork());
shuffle(data);
if(!sorted(data))
   exit(1); // todo cleanup world

I mention this only because it amuses me.

So, seriously?

I guess at some point physically moving the cards to make a gap doesn't actually take constant time. If our robot can exert some fixed force and has to move n items just enough to make room for one more, then this takes O(√n) time[2]. Revised performance:

  • Go to index: O(√n)
  • Get next/prev: O(1)
  • Insert/delete: O(1)

This makes our insertion sort O(n^1.5 log n) which is no longer state of the art. But surely there are currently-optimal algorithms where using this data structure would give better asymptotic performance than we get on our fancy-pants von Neumann machines[3]. Or maybe I'm missing something. I'm probably missing something.

Any others?

What other data structures can you build from common household items that put our silicon masters to shame when it comes to data processing?

(Using people to demonstrate the power of parallelism doesn't count, It's been done).

[1] working for American software companies is a lousy way to learn German.

[2] F=ma, s = at^2, etc.

[3] and way worse real-world performance, as we're shoveling around avocado-sized chunks of matter instead of deftly handling individual atoms. Still.

Thursday, 25 March 2010

the old man and the cspan

I've been watching the USA healthcare fiasco (surely, the appropriate word) with a sort of morbid curiosity.

Although comparisons to the American Civil Rights movement seem a little strained, there is one element that draws the two together in my mind. It's hard to understand, looking back, how reasonable people could have found segregation to be just; similarly I can't find the right cultural lens to see a medical safety net to be unneccessary or even immoral, in a country that can manifestly afford it.

Of course the difference is that this America is here now, with few cultural barriers from where I sit, and so you can watch it tick in real-time and even converse with it. And sometimes marvel at the schizophrenic behaviour that loves teams but hates collectivism, and values the Constitutional means over the moral ends.

The high-stakes game theory involved in passing controversial legislation has been fascinating recently. There is a disconnect between the centrist reality[1] and the "left-wing" label that the bill is receiving from left and right alike. Ben Hyde writes:

If you map out the left-right spectrum of the nation and the legislatures the reform we got falls at the center. That was exactly what Obama signaled he would aspire to deliver; going all the way back to the earliest days of his campaign. As a practical matter this bill is about exactly as far to the left as anybody should have hoped for. The party on the left doesn’t get to pass a bill that is at the center of it’s party member’s opinions. You only get to pass a bill that gets you the vote of that last necessary right most legislator.

That's only the case if those on the right are willing to go without reform. The Republicans clearly are (for this electoral cycle), having made the calculation that an ineffectual government with a Democratic majority will give them a win at the next election. However, the right flank of the Democrats don't have this strategy available to them. If they truly wanted healthcare reform, Ben's logic would equally well suggest that a bill would have to cater to the left of the party. Simple symmetry says that with Republicans out of the picture as potential votes, the bill should indeed be written by the median Democrat if as Ben says:

But yet [the] old system was on the fast track to making American industry entirely unable to compete. And hence even those who think that Government’s only function is to help large economic actors were largely in support of reform.

One would hope so, and the refrain from the holdouts has been "Everyone agrees we need reform, but". The evidence is thin on the ground.


[1] By US standards. From my perspective, comprehensive government-run healthcare like in NZ or the UK would be favoured by the left, private healthcare funded by taxes (e.g. Canada) or public health insurance would be considered centrist, and the regulated private healthcare just passed would be a right-wing proposal.

Tuesday, 17 November 2009

what encoding is that?

A few times in a project I'm working on, I've got some binary data that represents encoded text in some unknown encoding. Sometimes I know what it should decode to, and sometimes not.

Encoding Tools is a small application that decodes a byte string in a bunch of ways, and tries to guess which is correct.

It runs on Mac OS X 10.5 or later, and the source is available from the link above.

Hadn't used "real" Cocoa (as opposed to Touch) in a while, so this was a good excuse to play with Bindings and NSTokenField.
The former is very cool, but not quite as seamless as I thought - you get your UI and model synced without writing any extra code, as long as you do things in a KVO-compliant way, which requires writing non-idiomatic extra code.
NSTokenField would be nice if it wasn't broken by default.

Sunday, 30 August 2009

in which I actually write some software

BioCalc main screen

Biostats Calculator appeared on the iTunes App Store the other day, after an approval process that felt like it took years (but actually was only a couple of weeks). It's my first iPhone app so it's kind of exciting, and some have been sold already (Hi, Colombia!), which is really exciting in the paying-my-rent kind of way.

The application performs various biostatistical significance tests - relative risks, t-tests, chi-squared etc - and a couple of other calculations. (Full details are on the website). It aims to provide the most common statistical calculations a medical researcher might need, without having to go back to their computer and deal with a full statistics package. Also, it's totally awesome and you should buy it and tell your friends.

</plug>

Other notes:

  • Objective C is awesome. Garbage collection is disabled on the iPhone, which I thought would bother me but is actually hardly a problem at all. It really gives you a sense of how simple the language as an extension of C - basically just objc_msgSend and a little bit of preprocessor magic. And knowing how the OO constructs work allows me to feel completely comfortable mixing in plain C instead when it's more natural. (I'm still not sure how autorelease works. I'm pretty sure it's magic.)
  • Apple did a really good job on Cocoa Touch. The simple things seem to just work, and the hard things aren't all that hard. Except where they're impossible by design, per the next point.
  • They're serious about their Human interface guidelines. The first submitted version of the application used an 'add contact' button icon (rendered as a blue plus-sign) to add data to a data set, and they rejected it for misusing a standard icon. I took a screenshot of the button, tinted the image green, and they had no problem with that.
  • The approval process is frustating. They have an actual human test your application, make sure it doesn't crash, follows guidelines, and does what it says on the box. I think this is a really good thing that drives up application quality and platform consistency, but it means waiting a couple of weeks between finishing the app and it hitting the store. Plus you get to repeat the whole process when you didn't read the HIG and used the wrong button icon. Ahem.
  • Despite not being quite as streamlined a process as I would have assumed (lots of forms to fill out, certificates to download, having to get a credit card in my name to open an account, etc) the App Store is an amazing service. You develop the software, describe it, and decide how much it should cost, and that's it - you don't have to work out how to sell it, just what the next app will do.

Thursday, 14 May 2009

Enjoy - a JoyToKey clone for mac

JoyToKey lets you map joysticks to keyboard buttons to control programs that aren't joystick-aware. Gamepad Companion is a similar thing for Mac but it's nagware and it annoyed me. Also I wanted the ability to automatically switch configurations depending on the active application. So: Enjoy.

Somewhat related: half-working hack to get the tilt-sensor on a generic PS3 guitar hero controller to work with Frets on Fire.

Tuesday, 3 March 2009

poker paradox?

hold'em in a nutshell

Each player has 2 secret cards, and later 5 shared cards are revealed. Players bet on who has the best combination of cards both before and after the shared cards are revealed. Most secret cards are not worth betting on, good players will play between 10% and 30% of their hands.

three common principles

  • Deception (fundamental theorem of poker): if your opponent has less (or wrong) information, you make more money than if they have more information
  • Hand strength: all else equal, you make more money with higher cards than lower cards. (E.g. 9♣8♣ vs 8♣7♣)
  • Style: a range of styles can be equally profitable, for example you may be able to maximise your winnings by playing anything from 10% to 30% of hands.

the problem

These principles seem axiomatic (except perhaps the third, but empirically it seems true). But they conflict:

  • Take two optimal strategies: a tight 10%-of-hands game "A" and a loose 30% game "B".
  • At the start of each hand, flip a coin, and play either strategy "A" or "B" accordingly. Call this compound strategy "C".
  • A player who's played against either A or B for long enough knows they play 10% or 30% of hands respectively, and they're winning so it's probably the best 10% or 30% of hands. (Strictly, a best 10%. Hand strength is only a partial order, it doesn't compare 9♣8♣ to 2♢2♠).
  • A player who's played against C knows that he plays 20% of hands. If the player is very observant he might know about the coin-flipping, but if the coin is hidden he doesn't know whether he's playing against A or B for a particular hand. In either case, he's playing against an optimal strategy with less information than last time, so when the coin comes up heads C makes more money than A, and when it comes up tails C makes more money than B. So C is more profitable than A and B (which were supposed to be equal and optimal).
  • Strategy C involves sometimes playing hands in the 20-30th percentile, and sometimes not playing hands in the 10-20th percentile. According to hand strength, we'd make more money by replacing some worse hands we're playing with some better ones we're not playing. This gives a strategy "D" where we flip a coin play the top 20% of hands according to strategy A or B.

So D > C > A and B, but A and B were supposedly optimal.

resolutions

The first weakness I can see is the assumption that it's always incorrect to throw away a more profitable and play a less profitable one. A hand like 5♡7♡ that rarely wins may be profitable if played occasionally because when it does win your opponent will discount the possibility that you have it. However you have to fold it most of the time, even though playing it would be profitable for that individual hand, to maintain this value.

I'm surprised that the benefit from playing a probabilistic strategy is big enough to counteract the 'paradox'. Most analysis online tends to focus on opponent ranges - the range of cards they might have based on their actions in the hand - and it's assumed in calculations that the opponent would play their cards that way every time. Against good opponents, the edges of the ranges will be significantly fuzzy, and not accounting for this makes for flawed analysis that loses money (fundamental theorem again).

The second weakness (thanks Kyle!) is the assumption that C gives our opponent less information than A or B. This is intuitively true if A and B are deterministic, in the same way that a coin-flip is more uncertain than 'heads' or 'tails' individually; but if they A and B are probabilistic then C may be no more uncertain - the average of two random numbers is not 'more random'.

Monday, 26 January 2009

guessing games 1

One of the first programs I ever wrote was a number guessing game:

I'M THINKING OF A NUMBER BETWEEN 1 AND 20
> 7
HIGHER
> 15
HIGHER
> 18
LOWER
> 17
YOU WIN

(The Commodore 64's green was brighter of course, but I wouldn't do that to you). The best strategy for beating the game, of course, is to pick the middle number of the range first in order to break the range in half. If that didn't happen to be the right number, you have a new possible range that's half the size, and you repeat the process. This is binary search at its purest. It's the best strategy in two ways: it always finishes in at most 5 guesses (worst-case performance), and the average number of guesses needed if you pick randomly is 3.7 (average-case performance).

I tried to write a program to do the guessing, but I couldn't work out how to let the user type "higher" or "lower" instead of numbers, and I gave up.

twenty questions

It's the same game, except instead of a number you can pick anything, and the guesser is allowed twenty yes-or-no questions to try to work out what it is.

I'M THINKING OF A SOMETHING
> IS IT ORGANIC?
YES
> IS IT ALIVE?
YES
> IS IT A PLANT OR AN ANIMAL?
YES
> LOL, IS IT A PLANT XOR AN ANIMAL?
NO
> IS IT ROBERT PLANT?
YOU WIN

Obviously it wouldn't be an easy program to write, but in theory it's possible if we formalise the game. First we need a dictionary, a list of all the possible objects the computer can be thinking about. There can only be a finite number if we want the game to be winnable.

Next we need a set of questions. Every yes-or-no question breaks the dictionary up into two sets: things that answer 'yes' and things that answer 'no'. We can characterise all questions by the sets of things that answer 'yes' (the objects they accept). Two questions that accept the same objects are really the same question.

Suppose our dictionary is {barrel, cat, dog, ball, fish, box, building}. The question "is it an animal?" accepts the objects {cat, dog, fish}. It's the same question, for our purposes, as "is it a cat or a dog or a fish" or "does it not begin with b".

If we allow any question (that is, any set of objects) to be asked, then the game becomes trivial: we can reduce it to the number guessing game. Just number all the objects, break them into ranges, and ask questions like: "Is it barrel or cat or dog". This is the optimal strategy, which implies that you can't always win at twenty questions if there are more than a million possible answers. But it makes for long and boring questions, so in the informal game we might say you're only allowed twenty words per question. In the formal game we'll say you're provided with a list of questions you're allowed to ask.

So our dictionary might be: {barrel, cat, dog, ball, fish, box, building}. And our questions might be:

  • {cat, dog, fish} ("is it an animal?")
  • {cat, dog, ball, box} ("can you play with it?")
  • {barrel, ball, box, building} ("does it begin with b?")
  • {barrel} ("is it a barrel?")
  • {cat} ("is it a cat?)
  • etc...

The "is it a X" questions aren't strictly required, you could say you 'win' as soon as you know exactly what the object is. They do guarantee it's possible to win though.

A strategy for 20 questions is a tree flowchart like the one for the guess-the-numbers game. Here's one for a poor strategy: just guessing.

The height of the tree is 6 which indicates it may take us six guesses to work out the object in the worst case. This next one is better (height 3). In fact it's optimal: since it is as good as binary search we know there is no better strategy. Usually the optimal strategy won't be as good as binary search.

The question is: how do we efficiently calculate the optimal strategy? Either on the fly - i.e. given the information we have so far, determine the next question to ask - or by pre-computing the entire flowchart.

More formally: given a set S and a collection of subsets T ⊆ ℘(S), construct a binary tree with leaves S and internal vertices T of minimal height, such that every left-descendant of t is in t, and no right-descendant is.

An obvious heuristic, reminiscent of binary search, is to always ask a question that splits the remaining possibilities as close to evenly as possible. While we might expect that this 'greedy algorithm' is optimal, it isn't due to interactions between questions. In the diagram below, question A divides the space of objects neatly in half, but whatever question we ask next we can be left with around a third of the objects afterward. Starting with questions B and C guarantees only a little more than a quarter of the possibilities remain after two questions.

A naive but approach is to recursively find the best left and right subtrees for the subsets induced by each first question, and choose the question that gives minimal max-height. Ultimately this runs takes exponential time since we find the tree for all 2n subsets of the dictionary. It has the feel of a hard problem, I wonder if exponential time is the best possible.