## Six months of Windows Phone Development — Tips, tricks, and performance considerations

TL;DR: I’ve taken plenty of notes while developing my YiXue Chinese dictionary for Windows Phone. Topics covered in this article include performance tips, best practices, and plenty of code snippets for Windows Phone.

You can skip the DOs and DON’Ts and jump directly to the tips and tricks section, or even straight to the performance tips section, if you feel really confident with Windows Phone development.

## Introduction

This article presents notes and remarks that I gathered while working on a Chinese Dictionary App for Windows Phone, YiXue Chinese Dictionary: mistakes I made, fun tips I wrote down, and so on.
I initially didn’t really intend to create a full blog post out of these notes, but their increasing number, and my app recently placing second in Microsoft France’s App Awards contest, gave me enough motivation to share them with the community. Along with various tips and tricks and answers to often-asked (but seldom answered) questions, I will discuss a number of performance improvements that specifically apply to Windows Phone apps.

## App development tips: DOs and DON’Ts

### DO keep the size of your XAP package (reasonably) small

XAP files are just ZIP files with a fancy file extension. Shown above: the contents of the YiXue Dictionary XAP package.

As of March 2013, the limit for users to be able to download your app over the air is 20MB. Above this limit, users will have to connect to a WiFi network before they can download your app. Below 20MB, you’re safe.
That said, keeping an eye on your XAP size is definitely a good idea. PC World ran two studies in 2012, concluding that the average 3G download speed in the United States is about 2.5 Mbps. That’s about 3.2 seconds per MB of XAP, or about 1 minute for a 20MB app download over 3G. In areas with poorer coverage (say, 1 Mbps), this will probably climb to 2 or 3 minutes.
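As a quick sanity check on those figures, here is the arithmetic (a hypothetical helper written for this post, not part of any app):

```python
def download_seconds(xap_size_mb, speed_mbps):
    """Time (in seconds) to download xap_size_mb megabytes at speed_mbps megabits per second."""
    return xap_size_mb * 8 / speed_mbps  # 8 bits per byte

# 20MB at the 2.5 Mbps average 3G speed: 64 seconds, about a minute.
print(download_seconds(20, 2.5))  # → 64.0
# The same download at 1 Mbps: 160 seconds, between 2 and 3 minutes.
print(download_seconds(20, 1))    # → 160.0
```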

### DO use the Silverlight Toolkit for Windows Phone

The Silverlight Toolkit for Windows Phone is a library distributed by Microsoft, which includes a number of controls that didn’t make it into the official SDK. It’s frequently updated, it contains a number of gems that can greatly improve your app, and you can download it using NuGet straight from Visual Studio (more information here).

Though I was initially reluctant to increase my download size by bundling a 500kB dll with my app, I quickly realized it can dramatically improve the user experience. I’ll focus on just two things here: the PerformanceProgressBar, and page transitions (the Windows Phone Geek website has covered pretty much every control in the toolkit in great detail).

#### The PerformanceProgressBar

It’s been said pretty much everywhere, but using the default Windows Phone ProgressBar in Indeterminate mode is a bad idea. The animation is based on custom-themed sliders, which eat up a lot of processing power to display the animation (more on that below). The PerformanceProgressBar included in the Silverlight Toolkit fixes that.

Load the toolkit by adding the following to the phone:PhoneApplicationPage tag:

xmlns:tk="clr-namespace:Microsoft.Phone.Controls;assembly=Microsoft.Phone.Controls.Toolkit"

and create a progress bar using the following:

<tk:PerformanceProgressBar IsIndeterminate="true" />

#### Page transitions (toolkit:TransitionService.NavigationInTransition)

It’s pretty standard in all system apps on Windows Phone; when you tap a button or a link, the current page will flip out, and a new page will flip in.

Apart from the eye candy, the main advantage of this animation is that it vastly improves the perceived speed of your app, by virtually suppressing the delay that separates user actions (tapping a button, clicking a link, picking a ListBox item) from the corresponding reaction; the animation starts immediately, and the constructor and OnNavigatedTo method of your new page are called while the animation runs.

This way, your users have direct feedback that their tap was taken into account. I initially implemented that in YiXue because the pretty complex layout of the Dictionary page would take a split second to render, and users would tap two or three times on the “Dictionary” button, thinking the tap hadn’t registered.

To improve the visual consistency of your app, I recommend using the animation on all your pages. One neat thing is that this animation can be bundled in a style, and reused on all pages without repeating the same code on each. That’s something I didn’t know at first, and that was another reason I was reluctant to generalize the use of that animation to all my pages. But it’s actually rather simple:

First declare the toolkit namespace in your App.xaml using the following line:

xmlns:toolkit="clr-namespace:Microsoft.Phone.Controls;assembly=Microsoft.Phone.Controls.Toolkit"

Then declare the following style in your App.xaml file:

	<Style x:Key="TransitionPageStyle" TargetType="phone:PhoneApplicationPage">
		<Setter Property="toolkit:TransitionService.NavigationInTransition">
			<Setter.Value>
				<toolkit:NavigationInTransition>
					<toolkit:NavigationInTransition.Backward>
						<toolkit:TurnstileTransition Mode="BackwardIn"/>
					</toolkit:NavigationInTransition.Backward>
					<toolkit:NavigationInTransition.Forward>
						<toolkit:TurnstileTransition Mode="ForwardIn"/>
					</toolkit:NavigationInTransition.Forward>
				</toolkit:NavigationInTransition>
			</Setter.Value>
		</Setter>
		<Setter Property="toolkit:TransitionService.NavigationOutTransition">
			<Setter.Value>
				<toolkit:NavigationOutTransition>
					<toolkit:NavigationOutTransition.Backward>
						<toolkit:TurnstileTransition Mode="BackwardOut"/>
					</toolkit:NavigationOutTransition.Backward>
					<toolkit:NavigationOutTransition.Forward>
						<toolkit:TurnstileTransition Mode="ForwardOut"/>
					</toolkit:NavigationOutTransition.Forward>
				</toolkit:NavigationOutTransition>
			</Setter.Value>
		</Setter>
	</Style>


Finally, reuse your new style on all pages like this:

<phone:PhoneApplicationPage
(...)
Style="{StaticResource TransitionPageStyle}">


### DON’T show a minimized application bar on a page with TextBoxes

Trying to hit the “change language” button (繁) of the keyboard when the application bar is minimized generally results in a lot of frustration.

That’s a usability tip which plenty of apps on the marketplace fail to follow, especially translation / dictionary apps. If you have TextBoxes on a page showing an ApplicationBar, make sure it is not minimized. Otherwise, your international users will have the frustration of hitting the application bar instead of the “change language” button on the keyboard, thereby expanding the application bar above the keyboard. See the screenshot if you’re not sure what I mean by that.

The solution is pretty simple: either hide the application bar when your TextBox gets the focus, or just set the ApplicationBar‘s Mode to Default instead of Minimized.

DO:

<shell:ApplicationBar Mode="Default">

DON’T:

<shell:ApplicationBar Mode="Minimized">

### DON’T use the GestureService from the Silverlight toolkit

However awesome it might be, the Silverlight Toolkit still has a few rough edges. In particular, the GestureService (which lets you catch touch events like Flick, Slide, and so on) doesn’t unregister its listeners. This will bite you if you later use an InkSurface for drawing, because a GestureService loaded on another page will keep firing touch events even though the page it came from should have been disposed. Worse, using the GestureService prevents your pages from being properly disposed at all.
Fortunately, there’s a pretty simple workaround: use the XNA TouchPanel instead.

First, enable the listener (this can be done in your app initialization code):

	Microsoft.Xna.Framework.Input.Touch.TouchPanel.EnabledGestures =
		Microsoft.Xna.Framework.Input.Touch.GestureType.Flick;


Then, register the ManipulationCompleted event on a control:

	private void Control_ManipulationCompleted(object sender, ManipulationCompletedEventArgs e) {
		if (!TouchPanel.IsGestureAvailable)
			return;

		GestureSample gesture = TouchPanel.ReadGesture();
		float delta_x = gesture.Delta.X, delta_y = gesture.Delta.Y;

		if (gesture.GestureType == GestureType.Flick
		    && Math.Abs(delta_x) > 1.5 * Math.Abs(delta_y)) {
			// User flicked horizontally (delta_x < 0: flicked left; otherwise right)
		}
	}


That works well, but not perfectly. Can you guess why? — The catch is that the TouchPanel will record all gestures made on the page; even if only part of your page is touch-enabled using ManipulationCompleted, the call to TouchPanel.ReadGesture will also report gestures made outside of that area.

The problem, luckily, is easily fixed by discarding all gestures but the last when ManipulationCompleted fires; add the following snippet to the previous function:

	GestureSample? sample = null;
	while (TouchPanel.IsGestureAvailable) // Discard all gestures but the last
		sample = TouchPanel.ReadGesture();

	if (sample.HasValue) {
		GestureSample gesture = sample.Value;
		float delta_x = gesture.Delta.X, delta_y = gesture.Delta.Y;

		if (gesture.GestureType == GestureType.Flick
		    && Math.Abs(delta_x) > 1.5 * Math.Abs(delta_y)) {
			// User flicked horizontally (delta_x < 0: flicked left; otherwise right)
		}
	}


## Cool development tips and code snippets

Windows Phone users can choose between light and dark themes, which at the core are just white text on a dark background, or black text on a light background.

### Override system-default colors and brushes if your app uses a custom background

If your app includes a background image (see my tip on that below), you’ll probably want to force one of the color themes (light / dark). If you do, I’d advise going for a rather dark background, since that uses much less battery; in that case you will want to enforce the dark theme in your app. I did that in YiXue Dictionary, including a piece of famous Chinese calligraphy in the background, and users told me they really liked it.

First, note that you can’t directly replace an existing value in the App.Current.Resources dictionary. Removing the key and adding it again, however, does work (took me a while to find out):

	private static void SetAppResource(object key, object value) {
		// Cannot replace an existing value directly:
		// using 'App.Current.Resources[key] = value' won't work.
		if (App.Current.Resources.Contains(key))
			App.Current.Resources.Remove(key);

		App.Current.Resources.Add(key, value);
	}


You can also override properties of already existing objects, like system brushes. For your convenience, here is a quick snippet setting most of the common color settings to their default Light Theme values.

//Code from http://pit-claudel.fr/clement/blog/ ; attribution much appreciated!
(App.Current.Resources["PhoneForegroundBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneBackgroundBrush"] as SolidColorBrush).Color = Colors.Black;

(App.Current.Resources["PhoneDisabledBrush"] as SolidColorBrush).Color = Color.FromArgb(102, 255, 255, 255);
(App.Current.Resources["PhoneInactiveBrush"] as SolidColorBrush).Color = Color.FromArgb(51, 255, 255, 255);

(App.Current.Resources["PhoneContrastBackgroundBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneContrastForegroundBrush"] as SolidColorBrush).Color = Colors.Black;

(App.Current.Resources["PhoneTextBoxBrush"] as SolidColorBrush).Color = Color.FromArgb(191, 255, 255, 255);
(App.Current.Resources["PhoneTextBoxForegroundBrush"] as SolidColorBrush).Color = Colors.Black;
(App.Current.Resources["PhoneTextBoxSelectionForegroundBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneTextBoxEditBackgroundBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneTextBoxEditBorderBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneTextBoxReadOnlyBrush"] as SolidColorBrush).Color = Color.FromArgb(119, 0, 0, 0);

(App.Current.Resources["PhoneRadioCheckBoxBrush"] as SolidColorBrush).Color = Color.FromArgb(191, 255, 255, 255);
(App.Current.Resources["PhoneRadioCheckBoxCheckDisabledBrush"] as SolidColorBrush).Color = Color.FromArgb(102, 0, 0, 0);
(App.Current.Resources["PhoneRadioCheckBoxDisabledBrush"] as SolidColorBrush).Color = Color.FromArgb(102, 255, 255, 255);

(App.Current.Resources["PhoneBorderBrush"] as SolidColorBrush).Color = Color.FromArgb(191, 255, 255, 255);
(App.Current.Resources["PhoneSubtleBrush"] as SolidColorBrush).Color = Color.FromArgb(153, 255, 255, 255);


As a side note, remember that you can’t set the system tray’s foreground color to pure white. Just set it to something like #FEFEFE instead, and it will look just as if it were all white.

DO:

shell:SystemTray.ForegroundColor="#FEFEFE"

DON’T:

shell:SystemTray.ForegroundColor="#FFFFFF"

### Let users select and copy text

A TextBox styled like a TextBlock, with some text highlighted

Since text can’t be directly copied from TextBlocks, most websites advise using a read-only TextBox instead, styled to look just like a TextBlock. That’s pretty sweet, but it fails to mention that users need to give input focus to a TextBox before they can select text. This means they’ll have to tap your modified TextBox twice: once to give it focus (but the caret won’t appear, since the TextBox is read-only), and a second time to actually select text. As a user, this drives me nuts.

Luckily there’s a pretty simple workaround: if it’s a very small block of text, intercept the Tap event and select all text immediately; otherwise, use the GotFocus event. Users can then refine their selection if they want, but they can also copy the whole contents straight away.

	private void CopyableTextArea_Tap(object sender, GestureEventArgs e) {
(sender as TextBox).SelectAll();
}


### Share a background on all your pages

Pages sharing a common background, with the Background property of their LayoutRoot set to Transparent.

If you want to set an app-level background image (perhaps the same as your splash screen, although I’d advise using something a bit darker to improve the legibility of text over your background), manually setting the background image on every page is not ideal, especially if you use page-turning animations.
Indeed, if you set the background at the page level, you’ll quickly notice that the background flips along with the rest of the page content when you change pages. It would look much better if it stayed in place, with only the page contents flipping. That’s actually pretty easy to do if you know the trick (which, too, took me a while to find):

In App.xaml:

<ImageBrush x:Key="MainBackground" ImageSource="/resources/MainBackground.jpg" />

In App.xaml.cs

(App.Current as App).RootFrame.Background = App.Current.Resources["MainBackground"] as Brush;

Now the catch is, if some of your pages show an application bar and others don’t, your background will be resized to fit only the available space: the application bar will force it to shrink, which is pretty ugly. Instead, what I do is add a 72-pixel-high row to the LayoutRoot grid, and set the ApplicationBar’s Opacity to .99 (choosing a value other than 1 draws the app bar above your page instead of shrinking it). That way, the application bar is drawn above your page, and nothing is hidden by it because there’s an empty row of just the right height below:

DO:
In your Grid‘s RowDefinitions:

<RowDefinition Height="72" />

Further below:

	<phone:PhoneApplicationPage.ApplicationBar>
<shell:ApplicationBar IsVisible="True" Opacity="0.99" Mode="Default">
(...)
</shell:ApplicationBar>
</phone:PhoneApplicationPage.ApplicationBar>


## Performance tips

Abandoning your ProgressBars in favour of text-only loading messages can yield substantial performance improvements.

Really, that’s pretty universal advice. If you’ve never tried, you’ll probably be surprised how much performance you can gain by simply disabling your “loading” progress bar and replacing it with a Loading... message, possibly including a percentage at the end, as in Loading... 10%. In YiXue, where the initial loading of the dictionary is an area I’ve worked on a lot, disabling the Loading... progress bar reduced the overall loading time from 3.2 seconds to 2.7s. That’s a sizable 0.5 seconds, or 15% performance improvement :)

Another common mistake when displaying progress is to use a timer running in the background, firing every x milliseconds to update your progress bar. I did this in YiXue, because the animation of the progress bar I used back then was smoother that way. The loading time was 3.8 seconds then, with the UI timer firing every 20ms. Increasing this delay to 100ms took the loading time down to 3.5 seconds. Removing the UI timer altogether and firing a ReportProgress event after each significant loading step took that down to 3.2 seconds… and getting rid of the progress bar entirely brought it down to 2.7s.

XAML complexity is, in my experience, the main cause of slow-loading pages. Complex bindings (or even simple ones, really, if you’re populating a list) are also very time-consuming. If you’re after the last bit of performance, consider populating your list manually, possibly by removing your ListBox altogether and using a vertical StackPanel to which you directly add your items in code-behind. I know that doesn’t really fit the MVVM pattern, but when I had to display a grid of 214 elements, binding a ListBox using a WrapPanel as its ItemsPanel took close to 2 seconds, while populating a StackPanel only took 0.3 seconds.

To the best of my knowledge, the toolkit’s WrapPanel doesn’t do any kind of UI virtualization, making it useless for displaying moderately large amounts of information.

#### Don’t use the PivotControl for what it’s not made for

The PivotControl has to render all its pages before it’s displayed. That’s a huge performance killer when you use it to display multiple data sets (plus, it’s against the design guidelines: throughout the system, Pivots are used to show multiple visualisations of the same data set: all mail, unread, starred, and so on, for example).

In YiXue Dictionary I have study lists (roughly equivalent to stacks of flash cards, only better ;), and I initially displayed them using a Pivot control, one list per pivot page. When I tested that page with 12 study lists containing about 40 words each, not only did it take 2 seconds to render, but the memory profiler reported 130MB of memory usage (!). I’ve since replaced it with a single list, letting users flip between pages using gestures, which doesn’t use more than a few MBs.

### Get better I/O performance

The WP emulator is really neat, but there’s one thing it doesn’t mimic realistically: I/O performance. Although it’s much better on my Lumia than on my old HTC, I/O is still much slower on a phone than on your development computer. That’s really something to keep in mind when testing your app, especially if you don’t have a device to test on. Very roughly, my testing indicates that an I/O operation that takes 1 second on your computer will take 2 or 3 seconds on the Lumia 920, 3 to 5 seconds on a Lumia 800, and 5 to 7 seconds on the Trophy. Take these timings with a pinch of salt, since they are really rough approximations, but the orders of magnitude are about right. Similarly, garbage collections will be much slower on your phone than in the emulator; if your loading procedure involves a lot of text parsing, it’ll suffer a lot from that on your phone.

So how do you improve that loading performance? It turns out there are a number of simple tricks you can use. I won’t go into too much detail here, because there’s so much to say, and because you should obviously benchmark your own micro-optimisations, but the following general principles will help:

#### Ditch String.Split

Screenshot from MSDN’s remarks section regarding String.Split.

String.Split is nice, but it’s a garbage collector nightmare. It allocates plenty of intermediate buffers (the documentation is pretty straightforward about that, with a section called “Performance Considerations” well worth reading), and it’s too generic to be as fast as it could be. If you’re really after the last bit of performance, re-code the split function by hand. It’s really not hard:

	int start = -1, end = -1;
	List<string> parts = new List<string>();

	while (end + 1 < string_to_split.Length) { // Assumes that the string to split ends with a '\t'
		start = ++end;
		while (string_to_split[end] != '\t')
			++end;
		parts.Add(string_to_split.Substring(start, end - start));
	}


#### Set images as Content, not Resource

Changing the build action of your images from Resource to Content will neatly improve the splash screen time, because your images will only be loaded when needed, instead of being loaded when your app starts.

#### Disable whitespace in your XML output

Left: efficient representation. Right: human-readable version of the same data. Smaller is better.

Most apps store user data as XML, using an XmlSerializer. That’s really neat, but it can get slow if you don’t take a few precautions. Two things really helped me in YiXue. The first is disabling whitespace in the XML output; your files won’t be as legible to a human as they were before, but it easily saved me 20% off the size of my user data files, and the parser won’t see the difference.

	XmlWriterSettings settings = new XmlWriterSettings { Indent = false, NewLineHandling = NewLineHandling.None };


#### Use defaults in your XML serialization attributes

The second is specifying defaults. In YiXue, plenty of info can be associated with each word in your study lists, but most words don’t have all of it filled in. You can get the XML serializer to omit empty properties simply by specifying the default value for each attribute, this way:

	[DefaultValue(0)] public int CorrectAnswersGiven;


I’d also advise choosing very short aliases for your attribute and element names; for example:

	[XmlAttribute(AttributeName = "f")] public string foo;


This will further shrink your data files, and will make it much easier to rename your variables should you want to.

#### Improve perceived performance by showing a message with actual content while the app loads

If you need more than a split second to load app resources, entertain your users while your app is loading by including tips and fun facts: they won’t notice the delay if they have something else to focus on. Game developers know this, and always include tips in their loading screens; the principle easily extends to other apps.
In YiXue, I show tips and fun facts about learning Chinese while the app loads.

## Conclusion

That’s pretty much all for now. Hopefully that helps you create even better apps!

You can follow me on Twitter here (me) and here (my YiXue Dictionary app). Of course, links to http://pit-claudel.fr/clement/yixue/ and to this blog are most appreciated. And since I’m sure I forgot plenty:

Happy WP development!
Clément.

## Modeling and measuring string comparison performance in C, C++, C# and Python.

Comparing strings is often — erroneously — said to be a costly process. In this article I derive the theoretical asymptotic cost of comparing random strings of arbitrary length, and measure it in C, C++, C# and Python.

## Theoretical amortized cost

### Infinite strings

Suppose one draws two random strings $$A = “a_0a_1a_2…”$$ and $$B = “b_0b_1b_2…”$$ of infinite length, with $$a_n$$ and $$b_n$$ respectively denoting the $$n$$th character of $$A$$ and $$B$$. We will assume that characters are independent, and uniformly drawn over the set $$\{a, b, c, …, z\}$$. For now, we will consider all strings to be of infinite length — we’ll study bounded-length strings in the next section.

Given that characters making up $$A$$ and $$B$$ are independent, the probability of $$A$$ differing from $$B$$ at index $$n$$ is $$25/26$$. Conversely, the probability of $$A$$ matching string $$B$$ at index $$n$$ is $$1/26$$.

Consequently, the probability of $$A$$ matching $$B$$ up to index $$n – 1$$, and differing at index $$n$$, is $$(1/26)^n ⋅ (25/26)$$. Since comparing strings up to and including index $$n$$ requires $$(n+1)$$ character comparisons, the expected number of characters to compare to discriminate $$A$$ and $$B$$ is $$C = \sum_{n=0}^{\infty} (n+1) ⋅ (1/26)^n ⋅ (25/26) = 26/25$$

Indeed, defining $$C(x) = (1-x) ⋅ \sum_{n=0}^{\infty} (n+1) x^n$$, the expected number of comparisons $$C$$ required to discriminate $$A$$ and $$B$$ is $$C(1/26)$$. Now $$C(x) = (1-x) ⋅ \sum_{n=0}^{\infty} \frac{∂ x^{n+1}}{∂ x} = (1-x) ⋅ \frac{∂}{∂ x} \sum_{n=0}^{\infty} x^{n+1} = (1-x) ⋅ (1-x)^{-2} = 1/(1-x)$$. Plugging in $$x=1/26$$ yields $$C = 26/25 \approx 1.04$$.

In other words, it takes an average of $$1.04$$ character comparisons to discriminate two random strings of infinite length. That’s rather cheap.
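If you’d rather trust a computer than a pen-and-paper derivation, a few lines of Python confirm the value (truncating the series once its terms become negligible):

```python
p_match = 1 / 26  # probability that two random lowercase characters match

# Expected number of comparisons: sum of (n+1) * P(first mismatch at index n).
C = sum((n + 1) * p_match**n * (1 - p_match) for n in range(50))

print(C)  # ≈ 1.04, i.e. 26/25
```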

### Finite strings

In this section we assume that $$A$$ and $$B$$ are bounded in length by a constant $$L$$. This implies that the probability of the comparison routine reaching any index greater than $$(L-1)$$ is $$0$$; hence the expected number of comparisons $$C_L$$ is $$C_L = \sum_{n=0}^{L-1} (n+1) ⋅ (1/26)^n ⋅ (25/26) = (26/25) ⋅ (1 – \frac{25L + 26}{26^{L+1}})$$
For any $$L > 3$$, this is virtually indistinguishable from the previously calculated value $$C = 1.04$$.
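The closed form above can be double-checked against the truncated sum directly:

```python
def C_closed(L):
    """Closed form: (26/25) * (1 - (25L + 26) / 26^(L+1))."""
    return (26/25) * (1 - (25*L + 26) / 26**(L + 1))

def C_direct(L):
    """Truncated series: sum over possible mismatch indices 0 .. L-1."""
    return sum((n + 1) * (1/26)**n * (25/26) for n in range(L))

# The two expressions agree for every bound L.
for L in (1, 2, 3, 5, 10):
    assert abs(C_closed(L) - C_direct(L)) < 1e-12

# For L = 4 the gap to the infinite-string value 26/25 is already ~1e-5.
print(26/25 - C_closed(4))
```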

### Conclusion

The theoretical amortized time required to compare two strings is constant — independent of their length. Of course, there may be a sizeable overhead associated with comparing strings instead of ints: experimental measurements of this overhead are presented below.

## Experimental measurements

The following charts present the time it took to perform 100’000 string comparisons on my laptop, compared to a baseline of 100’000 integer comparisons, in 3 different programming languages, for string lengths varying from 1 to 500 characters. Despite large variations for small string lengths, measured times quickly converge to a fixed limit as strings lengthen [1].

Time required to compare 100’000 pairs of random strings, as a function of string length, in Python (3.2.3)

Time required to compare 100’000 pairs of random strings, as a function of string length, in C# (csc 4.0.30319.17929)

Time required to compare 100’000 pairs of random strings, as a function of string length, in C++ (gcc 4.3.4).
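As a rough illustration of the protocol, here is a Python sketch of the same experiment (not the original benchmark code; the helper names are mine, and the pair count is reduced from 100’000 to keep the run short):

```python
import random
import string
import timeit

def random_pairs(length, pairs):
    """Build `pairs` pairs of independent random lowercase strings of the given length."""
    rnd = lambda: ''.join(random.choices(string.ascii_lowercase, k=length))
    return [(rnd(), rnd()) for _ in range(pairs)]

def time_comparisons(data):
    """Time a single pass of '<' comparisons over every pair in `data`."""
    return timeit.timeit(lambda: [a < b for a, b in data], number=1)

PAIRS = 10_000
ints = [(random.randrange(10**6), random.randrange(10**6)) for _ in range(PAIRS)]
baseline = time_comparisons(ints)

# The ratio to the integer baseline flattens out quickly as strings lengthen.
for length in (1, 10, 100, 500):
    ratio = time_comparisons(random_pairs(length, PAIRS)) / baseline
    print("length {:3}: {:.1f}x the integer baseline".format(length, ratio))
```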

### Remarks

Though large strings (over ~200 characters) experimentally match our statistical model, small strings exhibit a much more complex behaviour. I can think of two causes for such variations:

• String interning: Python interns some strings — exactly which depends on the implementation. On my laptop, this turns small (< 64 characters) string comparisons into direct pointer comparisons, which are about as fast as integer comparisons.
• Locality: In C++ in particular, under certain conditions, applying a function to a `char*` gets slower as the referred string lengthens, even if the function only accesses the first few elements. The effect is barely noticeable for strings over 200 characters.

## Conclusion

Comparing strings, especially small strings, is very fast. The amortized time required to compare two strings is bounded by a constant, which is experimentally about twice the time required to compare two integers in high-level languages like Python or C#, and 5 to 7 times that in low-level languages like C++.

Do you usually try to avoid comparing strings? Do you have a more detailed explanation of the irregularities observed for lengths < 200 characters? Tell us in the comments!

1. If you want to experiment with the tests, you can download the full Python, C++, or C# source code.

## Generating uniformly random data from skewed input: biased coins, loaded dice, skew correction, and the Von Neumann extractor

In a famous article published in 1951, John Von Neumann presented a way of skew-correcting a stream of random digits so as to ensure that 0s and 1s appear with equal probability. This article introduces a simple and mentally workable generalisation of his technique to loaded dice, so that a loaded die can be used to uniformly draw numbers from the set $$\{1, 2, 3, 4, 5, 6\}$$, with reasonable success.

## Von Neumann skew-correction algorithm

### Biased coins

Von Neumann originally proposed the following technique for getting an unbiased result from a biased coin:

If independence of successive tosses is assumed, we can reconstruct a 50-50 chance out of even a badly biased coin by tossing twice. If we get heads-heads or tails-tails, we reject the tosses and try again. If we get heads-tails (or tails-heads), we accept the result as heads (or tails).

This algorithm is illustrated below:

Flowchart recapitulating Von Neumann’s bias-suppressing algorithm

This amazingly simple strategy does yield heads and tails with equal probabilities, because coin tosses, though biased, are assumed to be independent. Therefore, if heads come out with probability $$p$$, and tails with probability $$1-p$$, then both (head, tails) and (tails, heads) sequences occur with probability $$p(1-p)$$.

### Removing the bias from a stream of random digits

Generalising this technique to a stream of random digits is straightforward: pack digits two-by-two, discard any pair of two equal digits (00 or 11), and finally convert 01 to 0 and 10 to 1.

Illustration of Von Neumann’s bias-suppressing algorithm on a stream of random digits
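Here’s a minimal Python sketch of this stream version (the function name is mine):

```python
def von_neumann_extract(bits):
    """Pair bits two-by-two, discard 00 and 11, map 01 -> 0 and 10 -> 1
    (equivalently: keep the first bit of each surviving pair)."""
    return [b1 for b1, b2 in zip(bits[::2], bits[1::2]) if b1 != b2]

# Pairs: (0,0) dropped, (1,0) -> 1, (1,1) dropped, (0,1) -> 0, (1,0) -> 1.
print(von_neumann_extract([0, 0, 1, 0, 1, 1, 0, 1, 1, 0]))  # → [1, 0, 1]
```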

### Probabilistic remarks

#### Expected rejection rate

Continuing the previous calculations: since each coin toss is taken to be independent of the others, both (heads, tails) and (tails, heads) sequences will occur with probability $$p(1-p)$$, while (heads, heads) and (tails, tails) sequences will occur with respective probabilities $$p^2$$ and $$(1-p)^2$$.
Define the expected rejection rate $$r$$ to be the probability of obtaining (heads, heads) or (tails, tails) (and thus of discarding the sequence). Then $$r = p^2 + (1-p)^2 = 1 – 2 ⋅ p(1 – p) = (1/2) + 2 ⋅ (p – 1/2)^2$$, which implies that $$r ≥ 1/2$$. In other words, even in the best case (a perfectly balanced coin), half of the coin sequences will be discarded.

Rejection rate as a function of heads probability
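Both closed forms of the rejection rate are easy to check numerically:

```python
def rejection_rate(p):
    """Probability of discarding a two-toss sequence, for heads probability p."""
    return p**2 + (1 - p)**2

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    # The two closed forms agree, and the rate never drops below 1/2.
    assert abs(rejection_rate(p) - (0.5 + 2 * (p - 0.5)**2)) < 1e-12
    assert rejection_rate(p) >= 0.5

print(rejection_rate(0.5))  # → 0.5, the best case (fair coin)
```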

#### Expected number of tosses

Since the rejection rate is $$r = 1 – 2 ⋅ p(1 – p)$$, the odds of discarding exactly $$n$$ two-coins sequences before obtaining an acceptable sequence are $$r^n ⋅ (1-r)$$, and the expected number of two-coins sequences to be discarded is $$d = \sum_{k=0}^\infty k ⋅ r^k ⋅ (1-r) = \frac{r}{1-r}$$.

Indeed, $$\sum_{k=0}^\infty k ⋅ r^k ⋅ (1-r) = r(1-r) ⋅ \sum_{k=0}^\infty k r^{k-1} = r(1-r) ⋅ \frac{∂}{∂r} \sum_{k=0}^\infty r^k = r(1-r) ⋅ \frac{∂}{∂r} \frac{1}{1-r} = \frac{r}{1-r}$$

Now, if $$d$$ two-coins sequences are discarded on average before an acceptable sequence is obtained, then the expected number of coin tosses required is $$2 ⋅ (1 + d) = \frac{1}{p(1-p)} ≥ 4$$. In other words, it takes, at best, an average of 4 single-coin tosses to obtain an acceptable sequence. Or, as Von Neumann puts it,

The resulting process is rigorously unbiased, although the amended process is at most 25 percent as efficient as ordinary coin-tossing.

Expected number of tosses before an acceptable two-coins sequence is found, plotted as a function of heads probability
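The same bookkeeping in code, confirming the 4-toss best case:

```python
def expected_tosses(p):
    """Expected number of single-coin tosses before an acceptable pair is found."""
    r = 1 - 2 * p * (1 - p)  # rejection rate of a two-toss sequence
    d = r / (1 - r)          # expected number of discarded two-toss sequences
    return 2 * (1 + d)       # simplifies to 1 / (p * (1 - p))

for p in (0.5, 0.3, 0.1):
    assert abs(expected_tosses(p) - 1 / (p * (1 - p))) < 1e-9

print(expected_tosses(0.5))  # → 4.0: 25 percent as efficient as ordinary coin-tossing
```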

## A generalisation to loaded dice

The general idea behind Von Neumann-style skew-correction techniques is to consider sequences of coin tosses instead of isolated ones, and to pick a sequence length long enough that an even number of possible outcomes have equal probabilities. These outcomes can then be split into two groups, one representing heads and the other tails, while all other outcomes are discarded.

Generalising this strategy to loaded dice excludes picking a sequence length $$n ≤ 2$$, since for $$n=1$$ or $$n=2$$ no subset of equally probable outcomes has a cardinality divisible by six. For $$n=3$$, on the other hand, the outcomes of any die (crooked or not) can be partitioned into six categories of equal probability according to the relative ordering of the successive numbers rolled, provided these numbers are all different — that is, sequences can be grouped according to whether the second number is greater (↑) or smaller (↓) than the first, and whether the third number is greater (↑↑↑, ↓↑↑) or smaller (↑↓↓, ↓↓↓) than both the first and the second, or between them (↑↑↓, ↓↓↑).

This strategy is illustrated in the diagram below, which provides a reference table for interpreting sequences of three rolls of a loaded die to produce unbiased results. Should the same number occur twice, all three rolls should be discarded and the process started over.

Decision table for suppressing dice bias. Given a sequence of three throws, chart the three throws so as to form an arrow pattern, and use this table to obtain the corresponding result. Examples are given in green for each pattern. Using this table, rolling $$1 2 5$$ would yield a $$1$$, while $$1 5 2$$ would yield a $$2$$, and $$5 2 1$$, a $$4$$.

All six possible orderings will occur with equal probability, because each of the sequences belonging to any of these orderings matches exactly one sequence of equal probability in every other ordering: for example, $$1 2 4$$ matches the ordering ↑↑↑, whereas $$2 1 4$$, which is equally likely, matches ↓↑↑.

### Implementation

Given a sequence of three independent rolls of a possibly biased die, this function returns a fair die roll using the technique described above:

def translate(*rolls):
    """Returns False if 'rolls' contains duplicates, and a fair die roll otherwise."""
    # Possible relative orderings of the three rolls, encoded as the truth
    # values of (r2 > r1, r3 > r1, r3 > r2). 1 2 3 is ↑↑↑; 2 1 3 is ↓↑↑.
    deltas = [(True, True, True),   (True, True, False),
              (True, False, False), (False, False, False),
              (False, False, True), (False, True, True)]  # True stands for ↑

    (r1, r2, r3) = rolls
    unique = len(set(rolls)) == 3
    return unique and 1 + deltas.index((r2 > r1, r3 > r1, r3 > r2))

Here’s a test function. The bias parameter measures how unbalanced the die is: the higher the bias, the higher the probability of rolling a 6.

import random

def unfair_roll(bias):
    # The 6 face appears 'bias' times among the equally likely outcomes.
    return random.choice([1, 2, 3, 4, 5] + [6] * bias)

def test(number_of_rolls, bias):
    count = [0] * 7
    bias = max(bias, 1)

    for roll_id in range(0, number_of_rolls):
        rolls = unfair_roll(bias), unfair_roll(bias), unfair_roll(bias)
        count[translate(*rolls) or 0] += 1  # rejected sequences land in count[0]

    print("Rejection rate: {:.2f}%".format(count[0] / number_of_rolls * 100))
    print("Relative digit frequency:")
    for digit in range(1, 6 + 1):
        print("  {}: {:.2f}%".format(digit, count[digit] / number_of_rolls * 100))
    print("Before correction, 6 was {} times more likely to occur"
          " than other digits.".format(bias))

Here’s a single run of 50,000 sequences of three rolls of a die ten times more likely to give a 6:

>>> test(50*1000, 10)
Rejection rate: 80.52%
Relative digit frequency:
  1: 3.24%
  2: 3.30%
  3: 3.26%
  4: 3.18%
  5: 3.17%
  6: 3.32%
Before correction, 6 was 10 times more likely to occur than other digits.

## Conclusion and further remarks

Though easy to work out mentally, this bias-correction technique for dice is far from optimal. Indeed, the minimal rejection rate (obtained for a perfectly balanced die) is about 44%, and the rejection rate quickly increases as the bias gets stronger, reaching 64% when 6 is 5 times more likely to occur than any other individual digit.
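The 44% figure for a fair die is easy to check by enumerating all $$6^3$$ equally likely three-roll sequences and counting those containing a repeated value:

```python
from itertools import product

# Count three-roll sequences of a fair die that contain a repeated value.
rejected = sum(len(set(seq)) < 3 for seq in product(range(1, 7), repeat=3))
print(rejected / 6 ** 3)  # 96 / 216, about 0.444
```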

More subtle techniques, which I will discuss in a future article, take the rejection rate down to 1.5% for perfectly balanced dice, and keep it under 7% for dice with a 6 to 1 chance of getting a 6 over any other digit.

Do you know other strategies to even the odds when playing with a loaded die? Sound off in the comments!

1. J. von Neumann. Various techniques used in connection with random digits. National Bureau of Standards Applied Mathematics Series, 12:36-38, 1951. This article does not seem to be available online, though it is widely cited. It is reprinted in pages 768-770 of Von Neumann’s collected works, Vol. 5, Pergamon Press, 1961.

## The cafeteria paradox: stop using the water dispenser while someone else does!

Most cafeteria water dispensers will let two (sometimes more) people fill a jug at the same time. This article uses simple maths to prove that it’s a waste of time. In other words, two people should never use the same water dispenser at the same time: I call this the cafeteria paradox.

## An intuitive presentation

Alice and Bob are at the cafeteria, sitting at different tables. Alice stands up to refill her table’s water jug. A little later, Bob stands up with his own table’s jug, heading to the same water dispenser. That water dispenser is a perfectly standard one, with two taps, and Bob finds himself standing next to Alice. After a moment’s hesitation, Bob starts using the second tap to fill his own jug, thereby diverting part of the output previously devoted to Alice.

This grants him an exasperated and somewhat puzzled look from Alice. Why?

The reason is quite simple: though most people will act just like Bob, using the water dispenser while someone else does is, in most cases, a useless waste of time. Indeed, starting to fill his own jug before Alice is done will not save Bob any time, but it will waste some of Alice’s. That’s the cafeteria paradox.

Fortunately, this paradox is readily explained: assuming Alice and Bob use similar jugs, and both start with an empty one, Alice will finish before Bob does; by the time Bob leaves, therefore, the water dispenser will have dispensed precisely the contents of two cafeteria jugs.
Now, since the combined output of the two taps equals that of a single tap running alone, dispensing this amount of water takes a constant time (precisely twice the time needed to fill a single jug). Hence Bob gains no time at all by starting to fill his jug right away, while some of Alice’s time is wasted.

Not convinced? Imagine Bob and Alice are waiting for some pizza at a take-away restaurant. Given that neither will start eating before they get to their respective homes, should Bob claim half of Alice’s pizza when it arrives, or let Alice get the full pizza first and wait for his own? If Bob asks for half of Alice’s, both will wait for two pizzas to be ready; on the other hand, if Bob lets Alice go first, Alice will only have waited for one, while Bob will have waited for two. Clearly the second solution is better for Alice, and neutral for Bob; and the situation is exactly analogous to that of the water dispenser.

Still not convinced? Read on for a full mathematical description of this situation. Otherwise, skip to the conclusion (and the sticker!).

## A mathematically accurate description

Let $$t_A$$ and $$t_B$$ be the respective times when Alice and Bob reach the water dispenser, and $$t_A', t_B'$$ the respective times when they leave. Let $$t^*$$ be the time when Bob starts using the water dispenser. The water dispenser has a flow rate $$f$$, and each jug has a capacity $$c$$.

From $$t_A$$ to $$t^*$$, Alice’s jug fills at rate $$f$$; from $$t^*$$ to $$t_A'$$, at rate $$f/2$$. Hence Alice’s departure time $$t_A'$$ satisfies the following equation: $$(t^* - t_A) \cdot f + (t_A' - t^*) \cdot (f/2) = c$$
which yields $$t_A' = t^* + \frac{c}{f/2} - 2 \cdot (t^* - t_A)$$, so $$t_A' = 2 t_A - t^* + \frac{c}{f/2}$$

Similarly, Bob’s departure time $$t_B'$$ satisfies the following equation: $$(t_A' - t^*) \cdot (f/2) + (t_B' - t_A') \cdot f = c$$
which yields $$t_B' = t_A' + \frac{c}{f} - (1/2) \cdot (t_A' - t^*)$$, so $$t_B' = (1/2) \cdot (t_A' + t^*) + \frac{c}{f}$$

Finally, substituting $$2 t_A - t^* + \frac{c}{f/2}$$ for $$t_A'$$ yields $$t_B' = t_A + (1/2) \cdot \frac{c}{f/2} + \frac{c}{f}$$; that is, $$t_B' = t_A + 2 \cdot \frac{c}{f}$$

To summarize, Alice will leave at $$t_A' = 2 t_A - t^* + \frac{c}{f/2}$$ (the later Bob starts, the earlier Alice leaves), while Bob will leave at $$t_B' = t_A + 2 \cdot \frac{c}{f}$$, which depends on neither $$t_B$$ nor $$t^*$$.
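These closed forms can be sanity-checked numerically. The sketch below is a simple fixed-step simulation of the two-tap dispenser (the parameter values are arbitrary, chosen only for illustration):

```python
def simulate(t_a, t_star, c, f, dt=1e-4):
    """Simulate the dispenser: Alice arrives at t_a, Bob opens his tap at
    t_star; each jug holds c, and the dispenser's total flow rate is f."""
    t, alice, bob = t_a, 0.0, 0.0
    alice_done = bob_done = None
    while bob_done is None:
        alice_filling = alice < c
        bob_filling = t >= t_star and bob < c
        # The taps share the dispenser's output while both jugs are filling.
        rate = f / 2 if (alice_filling and bob_filling) else f
        if alice_filling:
            alice += rate * dt
            if alice >= c:
                alice_done = t
        if bob_filling:
            bob += rate * dt
            if bob >= c:
                bob_done = t
        t += dt
    return alice_done, bob_done

t_a, t_star, c, f = 0.0, 3.0, 10.0, 1.0  # arbitrary illustrative values
alice_done, bob_done = simulate(t_a, t_star, c, f)
# Matches t_A' = 2*t_A - t* + c/(f/2) = 17 and t_B' = t_A + 2*c/f = 20.
assert abs(alice_done - 17.0) < 1e-2
assert abs(bob_done - 20.0) < 1e-2
```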

## Conclusion

The socially optimal solution is that Bob waits until Alice is done before starting to use the water dispenser; otherwise, some of Alice’s time will be wasted, though Bob will not benefit from it. The key to understanding this is that Bob will not leave the water dispenser before both jugs are full, and filling both jugs takes a constant time (the time needed for the water dispenser to deliver twice the volume of a jug, spanning from Alice’s arrival to Bob’s departure). Thus using the water dispenser while someone else does is not only selfish: it’s downright stupid.


## Willy Wonka’s golden tickets: certainly the most profitable marketing campaign of all times

There’s something that has troubled me since childhood. In Charlie and the Chocolate Factory, one of the lucky children (Veruca Salt) gets her golden ticket thanks to her father repurposing his peanut-shelling factory into a chocolate-bar-unwrapping factory.

Now, there are only five golden tickets, and chocolate bars in the book seem quite popular. How many bars would one need to buy (and unwrap) just to have a sizeable chance of finding one of the tickets? Most likely a lot. After doing the maths, I estimate the number of chocolate bars that Mr. Salt had to buy at somewhere between 12 and 40 million; which means this promotional campaign was most certainly one of the most profitable in history. Details below.

## Background: Charlie and the chocolate factory

In Charlie and the Chocolate Factory, Roald Dahl pictured the archetype of the spoiled little child in the character of Veruca Salt (left). When Willy Wonka, an eccentric, genius chocolate maker, announces that five golden tickets have been hidden underneath the wrapping of ordinary chocolate bars, granting their lucky discoverers a tour of his chocolate factory and a lifetime supply of sweets, the world goes into a chocolate-devouring frenzy. It’s only a few days before the first and second tickets are found, the latter by a girl named Veruca Salt. Here is how her father explains the discovery:

‘You see, boys,’ he had said, ‘as soon as my little girl told me that she simply had to have one of those Golden Tickets, I went out into the town and started buying up all the Wonka bars I could lay my hands on. Thousands of them, I must have bought. Hundreds of thousands!

## So how many chocolate bars did Mr. Salt actually buy?

Assume Mr. Salt wants a reasonable chance of finding one of the tickets, perhaps 70 or 80%. Let $$\alpha$$ be the acceptable risk level, so that the probability of actually finding a ticket is $$1 - \alpha$$. Since televisions seem rather common in the book, and it was published in 1964, assume the action takes place in the early 1960s. Say the children population of England at the time is $$P$$, and assume Mr. Wonka makes most of his business in this country, and only with children (including other countries, or adults, would further reduce Mr. Salt’s chances of success). There are $$T = 5$$ golden tickets in play, and we can estimate the number of bars sold per child at $$b = 3$$ or more, given that Charlie Bucket, though extremely poor, buys three.

Setting aside Mr. Salt’s compulsive buying, this generates a total static demand $$D = P \times b$$. Let $$N$$ be the number of chocolate bars sold in total, and $$n$$ the number of bars bought by the Salt family, so $$N = D + n$$ — the underlying assumption being that chocolate production is profitable at any level.

The chance of these $$n$$ bars yielding no golden ticket is $$\binom{N-T}{n} / \binom{N}{n}$$; conversely, the probability that at least one of these $$n$$ bars contains a golden ticket is $$1 - \binom{N-T}{n} / \binom{N}{n}$$. Assuming the family is ready to accept a probability of failure of $$\alpha$$, $$n$$ will be chosen so that $$1 - \binom{N-T}{n} / \binom{N}{n} > 1 - \alpha$$, or equivalently so that $$\binom{N-T}{n} < \alpha \binom{N}{n}$$

Expanding this expression and substituting $$N - D$$ for $$n$$ yields $$(N - T)! / (D - T)! < \alpha (N! / D!)$$, which is the same as $$\frac{D!}{(D - T)!} < \alpha \frac{N!}{(N - T)!}$$

The children population of England in the ’60s was about $$12.2$$ million kids1. Assuming each bought 3 chocolate bars, the static demand was $$36.6$$ million bars. Finally, setting $$T = 5$$ and numerically solving the inequation above yields $$N \approx 49.6$$ million chocolate bars, and thus $$n = 12.9$$ million chocolate bars.

In other words, Mr. Salt would need to buy 12.9 million chocolate bars to reach a 70% chance of securing a golden ticket. Taking this probability up to 95% would imply buying an extra 28 million chocolate bars, bringing the grand total to about 41 million chocolate bars.2

### Code

Here’s the Octave/Matlab code used to numerically solve the inequation above. The key is to notice that $$\frac{D!}{(D - T)!}$$ and $$\frac{N!}{(N - T)!}$$ are both values taken by the monotonically increasing polynomial $$f(X) = \frac{X!}{(X-T)!} = \prod_{k=0}^{T-1} (X - k)$$. Thus the inequation can be rewritten as $$g(N) > 0$$, where $$g(N) = (f(N)/f(D)) - (1/\alpha)$$.

function y = f(x, T)
  % f(x) = x! / (x - T)! = x * (x - 1) * ... * (x - T + 1)
  y = prod(x - (0:(T-1)));
endfunction

function y = g(N, D, T, alpha)
  y = f(N, T)/f(D, T) - 1/alpha;
endfunction

N = fzero(@(x) g(x, 36.6e6, 5, 0.3), [10e6, 10e20]);


Illustration by Quentin Blake. Title image from wikimedia.

1. Source: World dataBank, indicators SP.POP.0014.TO.ZS and SP.POP.TOTL; the population of the time was 52.4 million people, and the percentage of children aged less than 15 was 23.3%.
2. If you include English adults in this funny little buying game, the static demand reaches 76.9 million chocolate bars, pushing the 70% probability threshold to $$N = 104$$ million chocolate bars ($$n = 27$$ million), and the 95% threshold to $$N = 162$$ million chocolate bars ($$n = 85.7$$ million).

## How random is pseudo-random? Testing pseudo-random number generators and measuring randomness

After introducing true and pseudo-random number generators, and presenting the methods used to measure randomness, this article details a number of common statistical tests used to evaluate the quality of random number generators.

## Introduction: true and pseudo-random number generators

Obtaining long sequences of random numbers, as required by some cryptographic algorithms, is a delicate problem. There are basically two types of random number generators: true random number generators, and pseudo-random number generators.

### True random number generators (TRNGs)

True random number generators rely on a physical source of randomness, and measure a quantity that is either theoretically unpredictable (quantum), or practically unpredictable (chaotic: so hard to predict that it appears random).

I’ve compiled a list of sources of true randomness (capable of sufficiently fast output); please let me know in the comments if I forgot any.

• Thermal noise, due to the thermal agitation of charge carriers in an electronic component, and used by some Intel RNGs
• Imaging of random processes, used by LavaRnd
• Atmospheric noise, mostly due to thunder storms happening around the globe, and used by random.org
• Cosmic noise, due to distant stars and background radiation
• Driver noise, often used by /dev/random in Unix-like operating systems1
• Transmission by a semi-transparent mirror, used by randomnumbers.info
• Nuclear decay, used by HotBits
• Coupled inverters, used by Intel’s new random generators.

The coupled-inverters technique devised by Greg Taylor and George Cox at Intel is truly the most fascinating one: it relies on collapsing the wave function of two inverters put in a superposed state.

### Pseudo-random number generators (PRNGs)

Pseudo-random number generators are very different: they act as a black box which takes one number (called the seed2) and produces a sequence of bits; this sequence is said to be pseudo-random if it passes a number of statistical tests, and thus appears random. These tests are discussed in the following section; simple examples include measuring the frequency of bits and bit sequences, evaluating entropy by trying to compress the sequence, or generating random matrices and computing their rank.

#### Formal definition

A pseudo-random number generator is formally defined by an initialisation function, a state (a sequence of bits of bounded length), a transition function, and an output function:

• The initialisation function takes a number (the seed), and puts the generator in its initial state.
• The transition function transforms the state of the generator.
• The output function transforms the current state to produce a fixed number of bits (here, a single zero or one).

A sequence of pseudo-random bits, called a run, is obtained by seeding the generator (that is, initializing it using a given seed to put it in its initial state), and repeatedly calling the transition function and the output function.3 This procedure is illustrated by the following diagram:

Schematic diagram of a pseudo-random number generator
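To make this decomposition concrete, here is a toy generator (a linear congruential generator with the classic glibc constants; a deliberately weak example chosen only to illustrate the seed/state/transition/output structure, not a generator anyone should use):

```python
class ToyPRNG:
    """A toy linear congruential generator, illustrating the formal
    decomposition: initialisation, state, transition, output."""

    def __init__(self, seed):
        # Initialisation function: the seed determines the initial state.
        self.state = seed % (2 ** 31)

    def _transition(self):
        # Transition function: transforms the state (glibc LCG constants).
        self.state = (1103515245 * self.state + 12345) % (2 ** 31)

    def next_bit(self):
        # Output function: extract one bit (the high bit) from the state.
        self._transition()
        return (self.state >> 30) & 1

# The run is fully determined by the seed: same seed, same bits.
g1, g2 = ToyPRNG(42), ToyPRNG(42)
assert [g1.next_bit() for _ in range(32)] == [g2.next_bit() for _ in range(32)]
```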

In particular, this means that the sequence of bits produced by a PRNG is fully determined by the seed4. This has a few disagreeable consequences, among which the fact that a PRNG can only generate as many sequences as it can accept different seeds. Thus, since the range of values that the state can take is usually much wider than that of the seed, it is generally best to only seldom reset (re-seed) the generator5.

#### Cryptographically secure pseudo-random number generators

Since PRNGs are commonly used for cryptographic purposes, the transition and output functions are sometimes required to satisfy two additional properties:

1. Unpredictability: given a sequence of output bits, the preceding and following bits shouldn’t be predictable
2. Non-reversibility: given the current state of the PRNG, the previous states shouldn’t be computable.

Rule 1 ensures that eavesdropping the output of a PRNG doesn’t allow an attacker to learn more than the bits they overheard.
Rule 2 ensures that past communications wouldn’t be compromised should an attacker manage to gain knowledge of the current state of the PRNG (thereby compromising all future communications based on the same run).

When a PRNG satisfies these two properties6, it is said to be cryptographically secure7.

## Testing pseudo-random number generators

There are a number of statistical tests devised to distinguish pseudo-random number generators from true ones; the more tests a PRNG passes, the closer it is considered to be to a true random source. This section presents the general methodology, and studies a few example tests.

### Testing methodology

#### Testing a single sequence

Almost all tests are built on the same structure: the pseudo-random stream of bits produced by a PRNG is transformed (bits are grouped or rearranged, arranged in matrices, some bits are removed, …), statistical measurements are made (number of zeros, of ones, matrix rank, …), and these measurements are compared to the values mathematically expected from a truly random sequence.

More precisely, assume that $$f$$ is a function taking any finite sequence of zeros and ones, and returning a non-negative real value. Then, given a sequence of independent and uniformly distributed random variables $$X_n$$, applying $$f$$ to the finite sequence of random variables $$(X_1, …, X_n)$$ yields a new random variable, $$Y_n$$. This new variable has a certain cumulative probability distribution $$F_n(x) = \mathbb{P}(Y_n \leq x)$$, which in some cases approaches a function $$F$$ as $$n$$ grows large. This limit function $$F$$ can be seen as the cumulative probability distribution of a new random variable, $$Y$$, and in these cases $$Y_n$$ is said to converge in distribution to $$Y$$.

Randomness tests are generally based on carefully selecting such a function8, and calculating the resulting limit probability distribution. Then, given a random sequence $$\epsilon_1, \ldots, \epsilon_n$$, it is possible to calculate the value of $$y = f(\epsilon_1, \ldots, \epsilon_n)$$, and assess how likely this value was by evaluating the probability $$\mathbb{P}(Y \geq y)$$ — that is, the probability that a single draw of the limit random variable $$Y$$ yields a result greater than or equal to $$y$$. If this probability is lower than $$0.01$$, the sequence $$\epsilon_1, \ldots, \epsilon_n$$ is said to be non-random with confidence 99%. Otherwise, the sequence is considered random.

#### An example: the bit frequency test

Let $$(X_n)$$ be an infinite sequence of independent, identically distributed random variables. Take $$f$$ to be $$f(x_1, \ldots, x_n) = \frac{1}{\sqrt{n}} \sum_{k=1}^n \frac{x_k - \mu}{\sigma}$$ where $$\mu$$ is the average, and $$\sigma$$ the standard deviation, of any $$X_k$$. Define $$Y_n = f(X_1, \ldots, X_n)$$. The central limit theorem states that the distribution of $$Y_n$$ tends to the standard normal distribution. This theorem is illustrated below.

Example of convergence in distribution to the standard normal distribution, after summing 1, 2, 3, 4, 5, 8, 10, and 50 independent random variables following the same distribution.

Consider the particular case where $$X_n$$ is uniformly distributed over the discrete set $$\{-1,1\}$$. The central limit theorem applies, and $$Y_n = f(X_1, …, X_n)$$ converges in distribution to the standard normal law. For practical reasons, however, we choose to consider $$Y_n = |f(X_1, …, X_n)|$$, which converges in distribution to a variable $$Y_\infty$$. $$Y_\infty$$ has the half-normal distribution (illustrated on the left), which has the nice property of being monotonically decreasing.

In plain colours, the half-normal distribution. In pale colours, the corresponding cumulative distribution function. If a pseudo-random sequence is picked in the red area, it is declared non-random with 99% certainty (the yellow and orange areas correspond to a certainty of 90% and 95% respectively).

Here is how the test proceeds: the sequence $$(\epsilon_1, \ldots, \epsilon_n)$$ given by the PRNG is converted to a sequence of elements of $$\{-1, 1\}$$ by changing each $$\epsilon_i$$ into $$x_i = 2 \cdot \epsilon_i - 1$$. This gives $$x_1, \ldots, x_n$$. Then, $$|f(x_1, \ldots, x_n)| = \left|\sum_{k=1}^n \frac{x_k}{\sqrt{n}}\right|$$ is numerically calculated, yielding $$y$$. Finally, the theoretical likelihood of a truly random sequence yielding a value equal to or greater than $$y$$ is evaluated using the cumulative distribution function of $$Y_\infty$$ (in pale colours).

If this probability is too low (red area, probability below $$0.01$$), the sequence is rejected as non-random, with 99% certainty. If it is greater than $$0.01$$, but less than $$0.05$$ (orange area) the sequence is sometimes9 rejected as non-random with 95% certainty. Similarly, if it is greater than $$0.05$$ but less than $$0.10$$ (yellow area), the sequence is sometimes rejected as non-random, with 90% certainty.

High probabilities (green area, probabilities greater than $$0.10$$), finally, do not make it possible to distinguish the pseudo-random sequence from a truly random one10.
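The whole procedure fits in a few lines of Python. Since $$Y_\infty$$ is half-normal, the tail probability $$\mathbb{P}(Y_\infty \geq y)$$ equals $$\operatorname{erfc}(y/\sqrt{2})$$, which gives the p-value directly; a minimal sketch of the monobit test described above:

```python
import math

def monobit_p_value(bits):
    """Frequency (monobit) test: probability that a truly random source
    produces a partial sum at least as extreme as this one."""
    n = len(bits)
    s = sum(2 * b - 1 for b in bits)        # map {0, 1} to {-1, +1} and sum
    y = abs(s) / math.sqrt(n)               # observed value of |f(x_1..x_n)|
    return math.erfc(y / math.sqrt(2))      # P(Y >= y) for the half-normal law

assert monobit_p_value([1] * 100) < 0.01    # all ones: rejected as non-random
assert monobit_p_value([0, 1] * 50) > 0.10  # balanced: not rejected
```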

#### Testing a pseudo-random number generator

To test whether a pseudo-random number generator is close to a true one, a sequence length is chosen, and $$m$$ pseudo-random sequences of that length are retrieved from the PRNG, then analysed according to the previous methodology. If the confidence level is 99%, about 99% of the sequences are expected to pass, and 1% to fail; if the observed ratio differs significantly from 99:1, the PRNG is declared non-random. More precisely, the confidence interval is generally chosen to be $$p \pm 3\frac{\sqrt{p(1-p)}}{\sqrt{m}}$$, where $$p$$ is the theoretical pass ratio (99% here); this takes the probability of incorrectly rejecting a good number generator down to 0.3%.

This confidence interval is obtained as follows. For large values of $$m$$, the probability of rejecting $$r$$ sequences out of $$m$$ is $$\binom{m}{r} p^{m-r} (1-p)^r$$ (here $$p = 99\%$$). Write $$\sigma_i$$ for the random variable indicating whether the $$i$$th sequence was rejected: the $$\sigma_i$$ are independent and take value $$0$$ with probability $$p$$, $$1$$ with probability $$1-p$$ (standard deviation: $$\sqrt{p(1-p)}$$). The central limit theorem states that, with probability close to $$\displaystyle \int_a^b \frac{e^{-x^2/2}}{\sqrt{2\pi}} \, dx$$, the observed pass frequency $$\hat{p}$$ lies in the interval $$\left[p + a\frac{\sqrt{p(1-p)}}{\sqrt{m}}, p + b\frac{\sqrt{p(1-p)}}{\sqrt{m}}\right]$$. Finally, setting $$a = -b$$ and adjusting $$b$$ so that this probability is $$\approx 99.7\%$$ yields $$b \approx 3$$.
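For instance, with $$p = 99\%$$ and $$m = 1000$$ sequences, the band works out to roughly $$99\% \pm 0.94\%$$; a quick sketch:

```python
import math

def acceptance_interval(p, m):
    """Pass-ratio band compatible with a good generator:
    p +/- 3 * sqrt(p * (1 - p)) / sqrt(m)."""
    half_width = 3 * math.sqrt(p * (1 - p) / m)
    return p - half_width, p + half_width

low, high = acceptance_interval(0.99, 1000)  # approximately (0.9806, 0.9994)
```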

### Test suites

A number of test suites have been proposed as standards. The state of the art was for a long time the DieHard test suite (designed by George Marsaglia), though it was eventually superseded by the National Institute of Standards and Technology recommendations. Pierre L’Ecuyer and Richard Simard, from the Université de Montréal, have more recently published a new collection of tests, TestU01, gathering and implementing an impressive number of previously published tests and adding new ones.11

A subset of these tests is presented below.

### A list of common statistical tests used to evaluate randomness

#### Equidistribution (Monobit frequency test, discussed above)

Purpose: Evaluate whether the sequence has a systematic bias towards 0 or 1 (real-time example on random.org)
Description: Verify that the arithmetic mean of the sequence approaches $$0.5$$ (based on the law of large numbers). Alternatively, verify that the normalized partial sums $$s_n = \frac{1}{\sqrt{n}} \sum_{k=1}^n (2\cdot\epsilon_k - 1)$$ approach a standard normal distribution (based on the central limit theorem).
Variants: Block frequency test (group bits in blocks of constant length and perform the same test), Frequency test in a block (group bits in blocks of constant length and perform the test on every block).

#### Runs test

Purpose: Evaluate whether runs of ones or zeros are too long (or too short) (real-time example on random.org)
Description: Count the number of same-digit blocks in the sequence; for a fair source, this count minus one should follow a binomial distribution $$\mathcal{B}(n-1, 1/2)$$. By noting that the number of block boundaries is the sum of the $$\sigma_i$$, where $$\sigma_i = 1$$ if $$\epsilon_i \not= \epsilon_{i+1}$$ and $$\sigma_i = 0$$ otherwise, the previous method can be applied.
Variants: Longest run of ones (divide the sequence in blocks and measure the longest run of ones in each block).
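Counting the runs themselves is a one-liner: a new run starts at every index where two consecutive digits differ (a sketch):

```python
def count_runs(bits):
    """Number of maximal blocks of identical consecutive digits."""
    return 1 + sum(a != b for a, b in zip(bits, bits[1:]))

assert count_runs([0, 0, 1, 1, 1, 0]) == 3  # 00 | 111 | 0
```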

#### Binary matrix rank

Purpose: Search for linear dependencies between successive blocks of the sequence
Description: Partition the sequence into blocks of length $$n \times p$$. Build one $$n \times p$$ binary matrix from each block, and compute its rank. The theoretical distribution of the ranks is however not easy to compute — this problem is discussed in detail in a paper by Ian F. Blake and Chris Studholme from the University of Toronto.

#### Discrete Fourier transform

Purpose: Search for periodic components in the pseudo-random sequence
Description: Compute a discrete Fourier transform of the pseudo-random input $$2\epsilon_1 - 1, \ldots, 2\epsilon_n - 1$$, and take the modulus of each complex coefficient12. Compute the 95% peak height threshold, defined to be the theoretical value that no more than 5% of the previously calculated moduli should exceed ($$\sqrt{-n\log(0.05)}$$ (NIST)), and count the actual number of moduli exceeding this threshold. Assess the likelihood of such a value.
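A naive $$O(n^2)$$ sketch of the peak-counting step (using the NIST threshold quoted above; a real implementation would use an FFT):

```python
import cmath
import math

def dft_peak_count(bits):
    """Count DFT peaks, over the first half of the spectrum, that exceed
    the 95% threshold sqrt(-n * log(0.05))."""
    x = [2 * b - 1 for b in bits]
    n = len(x)
    moduli = [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                      for t in range(n)))
              for k in range(n // 2)]
    threshold = math.sqrt(-n * math.log(0.05))
    return sum(m > threshold for m in moduli)

# A strongly periodic input produces an obvious peak.
assert dft_peak_count([1, 1, 0, 0] * 16) >= 1
```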

#### Compressibility (Maurer’s universal statistical test)

Purpose: Determine whether the sequence can be compressed without loss of information (evaluate information density). A random sequence should have a high information density.
Description: Partition the sequence into blocks of length $$L$$, and separate these blocks into two categories. Call the first $$Q$$ blocks the initialisation sequence, and the remaining $$K$$ blocks the test sequence. Then, give each block in the test sequence a score equal to the distance that separates it from the previous occurrence of a block with the same contents, and sum the binary logarithms of all scores. Divide by the number of blocks to obtain the value of the test function, and verify that this value is close enough to the theoretically expected value.13

#### Maximum distance to zero, Average flight time, Random excursions

Purpose: Verify that the sequence has some of the properties of a truly random walk
Description: Consider the pseudo-random sequence to represent a random walk starting from zero and moving up (down) by one unit every time a zero (a one) occurs in the sequence. Measure the maximum height reached by the walk, as well as the average flight time and the number of states visited. Also, for each possible state (integer), measure in how many cycles it appears. Evaluate the theoretical probability of obtaining these values.

The sources previously cited (in particular the NIST recommendation paper) present mathematical background about these tests, as well as lots of other tests.

Did I miss your favourite randomness test? Have you ever been confronted with obvious imperfections in a pseudo-random number generator? Tell us in the comments!

1. See this post by Arnold Reinhold for a discussion of how to manually increase the entropy of /dev/random
2. The current system time is commonly used as a default seed, although for more security-sensitive applications user input is sometimes used (most often by asking the user to shake their mouse in a random fashion)
3. By definition, since the number of possible states is bounded, this sequence will be periodic; in practice, however, this period can be made as large as desired by increasing the number of bits in the state.
4. This property is very useful when using PRNGs to simulate experiments; seeds used during the simulation are recorded in case the results of one specific run need to be reproduced
5. Except when the state risks being compromised — see the following section
6. Again, saying that a PRNG satisfies these properties means either that it can be proven that the rules cannot be broken, or that it is not practically possible to break them in a reasonable amount of time
7. Cryptographically secure pseudo-random number generators are often abbreviated as CSPRNGs
8. In particular, the function is often chosen so that the limit random variable $$Y$$ has a monotonically decreasing density function
9. Some papers advise rejecting any sequence with a p-value $$\leq 0.05$$, but most choose $$0.01$$ as the threshold
10. Note that this generates false positives: a true random number generator will, in the long run, produce every possible sequence: sequences yielding measurements that fall in the red area will be rare, but they will occur, and will be discarded as non-random. This is very important to the PRNG-testing procedure, as explained in the next sub-section
11. Ent is another implementation of a few tests.
12. Note that, since the input sequence is real-valued, its DFT is symmetrical; hence the counting need only happen on the first half of the DFT sequence
13. Since the binary logarithm is concave, this value gets smaller as the dispersion of the scores increases; informally, since the expected distance between two blocks of equal contents is approximately the number of possible different blocks ($$2^L$$), the expected value should be about $$\log_2(2^L)$$ — that is, $$L$$. Computing the actual expected values yields a somewhat smaller number, which approaches $$L - 0.83$$ as $$L \to \infty$$ (Coron, Naccache)

## Using abstract classes to simulate tagged unions (aka sum types)

Most functional languages offer support for tagged unions (also called sum types), a type of data structure capable of successively holding values of several fixed types. This article shows how to use abstract classes to emulate such behaviour in high-level object-oriented languages such as C#, Java, or VB.Net1.

## An introduction to sum types and pattern matching

### A preliminary example: linked lists

Starting with a simple example, imagine you want to create a linked list structure. Most C#/Java programmers will write something like

class MyList<T> {
    T value;
    MyList<T> next;

    MyList(T value, MyList<T> next) {
        this.value = value;
        this.next = next;
    }
}

MyList<int> example = new MyList<int>(1, new MyList<int>(2, null));


This relies on a rather ugly trick: null is used to represent the empty list. Functional languages prefer to clearly separate two cases: the empty list (usually called nil), and the concatenation of a value (called the head of the list) and another list (called the tail).

(* 't is a type variable, just like T above.
   Note that OCaml type names must start with a lowercase letter. *)
type 't mylist = Nil | Cons of 't * 't mylist;;
let example = Cons (1, Cons (2, Nil))

  {- t is a type variable, just like T above -}
data MyList t = Nil | Cons t (MyList t)
example = Cons 1 (Cons 2 Nil)


This introduces the concept of a sum type: a type whose values belong to one of a number of sets (here, either to the singleton set {Nil}, or to the set of all values of type t, for example integers if t is int).

Here is another example, where a value of type Value can be either an integer, or a floating-point number, or a string, or a boolean.

  type value =
  | IntegerValue of int
  | FloatValue of float
  | StringValue of string
  | BooleanValue of bool

  data Value =
      IntegerValue Int
    | FloatValue Float
    | StringValue String
    | BooleanValue Bool


Quite naturally, functional languages include a special construct to handle values whose type is a sum type. This construct is called pattern matching; it is a bit like a switch statement, only the switch is on the object’s inner type (the subset of the sum type which the object belongs to). For example, to identify the contents of a variable whose type is Value, functional programmers write

  let identify = function
  | IntegerValue _ -> "int!"
  | FloatValue _ -> "float!"
  | StringValue _ -> "string!"
  | BooleanValue _ -> "bool!"

  identify (IntegerValue val) = "int!"
  identify (FloatValue val) = "float!"
  identify (StringValue str) = "string!"
  identify (BooleanValue b) = "bool!"


Here’s another example, on lists this time: if functional programmers want to count elements in a list, they write

  let rec count = function
| Nil -> 0
| Cons (head, tail) -> 1 + (count tail)
;;

  count Nil = 0
count (Cons hd tl) = 1 + count tl


Let’s move on to one last example.

### XML node trees

In this section, we use a sum type to represent the structure of an XML document (some node types have been omitted). We first define an attribute to be a pair of two strings, and a node to be one of DocumentFragment (a list of XML nodes), Element (a name, attributes, and children elements), Comment (some comment text), or Text (plain text).

  type attr =
Attribute of string * string;;

type xmlnode =
| DocumentFragment of xmlnode list
| Element of string
* (attr list)
* (xmlnode list)
| Comment of string
| Text of string
;;

  data Attribute = Attribute String String

  data XmlNode =
      DocumentFragment [XmlNode]
    | Element String [Attribute] [XmlNode]
    | Comment String
    | Text String


With such a type declaration, writing concise tree-traversal functions is easy: as an example, we define a serialize function, which generates XML code from an xmlnode object.

  let serialize_attribute (Attribute (name, value)) =
    Printf.printf " %s=\"%s\"" name value
  ;;

  let rec serialize = function
  | DocumentFragment nodes ->
    List.iter serialize nodes (* applies serialize to every xmlnode in nodes *)
  | Element (label, attributes, nodes) ->
    Printf.printf "<%s" label;
    List.iter serialize_attribute attributes;
    Printf.printf ">\n";
    List.iter serialize nodes;
    Printf.printf "</%s>\n" label
  | Comment str ->
    Printf.printf "<!-- %s -->" str
  | Text str ->
    Printf.printf "%s\n" str
  ;;


Here is a small example to illustrate the previous function:

  serialize (
Element("a", [
Attribute("href", "http://pit-claudel.fr/clement/blog");
Attribute("title", "Code crumbs")
], [
Text("Code crumbs -- Personal thoughts about programming")
])
);;


This prints

<a href="http://pit-claudel.fr/clement/blog" title="Code crumbs">
Code crumbs -- Personal thoughts about programming
</a>


## Simulating sum types using abstract classes

### The problem

Returning to a simpler example, suppose we need to represent a generic binary tree (a tree is defined as being either a branch, containing two sub-trees, or a leaf, containing a value). Functional programmers will write something like

  type 't tree = Branch of ('t tree * 't tree) | Leaf of 't
let example = Branch (Branch (Leaf 1, Leaf 2), Leaf 3)

  data Tree t = Branch (Tree t) (Tree t) | Leaf t
example = Branch (Branch (Leaf 1) (Leaf 2)) (Leaf 3)


On the other hand, (hurried) C#/Java programmers will often implement this as a product type — that is, they’ll pack both cases in a single class, and use an extra field to differentiate branches and leaves:

  enum TreeType { BRANCH, LEAF }

  class BadTree<T> {
      TreeType type;

      BadTree<T> left_branch, right_branch;
      T leaf_value;

      BadTree(BadTree<T> left_branch, BadTree<T> right_branch) {
          this.type = TreeType.BRANCH;
          this.left_branch = left_branch;
          this.right_branch = right_branch;
          // Memory waste: leaf_value is unused (!)
      }

      BadTree(T leaf_value) {
          this.type = TreeType.LEAF;
          this.leaf_value = leaf_value;
          // Memory waste: left_branch and right_branch are unused (!)
      }
  }

  BadTree<int> example = new BadTree<int>(new BadTree<int>(new BadTree<int>(1),
                                                           new BadTree<int>(2)),
                                          new BadTree<int>(3));


This representation wastes a lot of memory and, though rather concise, quickly degenerates when more cases are added to the type definition (for example in the xmlnode case — for another real-world example, see the end of this post).

### The solution

The idea is to encode the type disjunction (Branch or Leaf, DocumentFragment or Element or Comment or Text) in the inheritance hierarchy; practically speaking, the trick is to define an abstract Tree class, and make two new classes, Leaf and Branch, which both inherit from Tree:

  abstract class Tree<T> {
      public class Leaf : Tree<T> {
          public T value;

          public Leaf(T value) {
              this.value = value;
          }
      }

      public class Branch : Tree<T> {
          public Tree<T> left, right;

          public Branch(Tree<T> left, Tree<T> right) {
              this.left = left;
              this.right = right;
          }
      }
  }

  Tree<int> example = new Tree<int>.Branch(new Tree<int>.Branch(new Tree<int>.Leaf(1),
                                                                new Tree<int>.Leaf(2)),
                                           new Tree<int>.Leaf(3));


This allows for much cleaner code, and the memory waste is gone!

## Pattern matching on abstract classes

The BadTree class, which uses an enum to distinguish branches and leaves, makes it possible to perform pattern matching over trees by switching on the BadTree.type field. Though this approach directly translates to the Tree abstract class using the is keyword2, it’s not ideal.

  BadTree<int> tree = (...);

  switch (tree.type) {
      case TreeType.BRANCH:
          (...); break;
      case TreeType.LEAF:
          (...); break;
  }

  Tree<int> tree = (...);

  if (tree is Tree<int>.Branch) {
      Tree<int>.Branch branch =
          (Tree<int>.Branch) tree;
      (...);
  } else if (tree is Tree<int>.Leaf) {
      Tree<int>.Leaf leaf =
          (Tree<int>.Leaf) tree;
      (...);
  }


Though functional, this new version of the pattern-matching code is not really pretty. A much cleaner approach is to implement the pattern matching directly in the derived classes (Branch and Leaf), by declaring an abstract method in the parent class, Tree. That is, instead of having one big function doing some explicit pattern matching in the Tree class, we’ll have multiple specialized functions — one in each class inheriting Tree. Philosophically speaking, just like we implemented type disjunctions as different classes inheriting a common parent, we’ll implement logical disjunctions as different functions overriding a common parent.

For example, here is how we can write a Fork function, which returns a new Tree with every leaf split into two identical leaves:

  abstract class Tree<T> {
      public abstract Tree<T> Fork();

      public class Leaf : Tree<T> {
          public T value;

          // Constructor omitted

          public override Tree<T> Fork() {
              return new Branch(new Leaf(value), new Leaf(value));
          }
      }

      public class Branch : Tree<T> {
          public Tree<T> left, right;

          // Constructor omitted

          public override Tree<T> Fork() {
              return new Branch(left.Fork(), right.Fork());
          }
      }
  }
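The technique is not specific to C#; here is a rough Python transcription of the same Tree hierarchy and Fork method (an illustrative sketch, not code from the original article):

```python
# A Python sketch of the same technique: the "sum type" is an abstract base
# class, and each case (Leaf, Branch) is a subclass carrying its own data.
from abc import ABC, abstractmethod

class Tree(ABC):
    @abstractmethod
    def fork(self):
        """Return a new tree in which every leaf is split in two."""

class Leaf(Tree):
    def __init__(self, value):
        self.value = value

    def fork(self):
        return Branch(Leaf(self.value), Leaf(self.value))

class Branch(Tree):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def fork(self):
        return Branch(self.left.fork(), self.right.fork())

example = Branch(Branch(Leaf(1), Leaf(2)), Leaf(3))
forked = example.fork()  # forked.right is a Branch of two leaves holding 3
```

As in the C# version, no type field and no unused fields are needed: the object's class is the tag.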


## Going further: a real-life example

The benefits of using this abstract class approach are particularly visible when the data types involved get more complex. Below is an example implementation of a data structure used to describe arithmetic expressions.

  using System;
  using System.Collections.Generic;

  abstract class ArithmeticExpression {
      // Evaluates an arithmetic expression. context is a dictionary mapping variable
      // names to their values.
      public abstract float? Eval(Dictionary<string, float?> context);

      public class Variable : ArithmeticExpression {
          string label;

          public Variable(string label) {
              this.label = label;
          }

          public override float? Eval(Dictionary<string, float?> context) {
              float? value;
              context.TryGetValue(label, out value);
              return value;
          }
      }

      public abstract class BinaryOperation : ArithmeticExpression {
          protected ArithmeticExpression left_operand, right_operand;

          public BinaryOperation(ArithmeticExpression left_operand, ArithmeticExpression right_operand) {
              this.left_operand = left_operand;
              this.right_operand = right_operand;
          }

          protected abstract float? Apply(float left_value, float right_value);

          public override float? Eval(Dictionary<string, float?> context) {
              float? left_result = left_operand.Eval(context), right_result = right_operand.Eval(context);

              if (left_result.HasValue && right_result.HasValue)
                  return Apply(left_result.Value, right_result.Value);
              else
                  return null;
          }

          class Sum : BinaryOperation {
              public Sum(ArithmeticExpression left_operand, ArithmeticExpression right_operand)
                  : base(left_operand, right_operand) { }

              protected override float? Apply(float left_value, float right_value) {
                  return left_value + right_value;
              }
          }

          class Subtraction : BinaryOperation {
              public Subtraction(ArithmeticExpression left_operand, ArithmeticExpression right_operand)
                  : base(left_operand, right_operand) { }

              protected override float? Apply(float left_value, float right_value) {
                  return left_value - right_value;
              }
          }

          class Product : BinaryOperation {
              public Product(ArithmeticExpression left_operand, ArithmeticExpression right_operand)
                  : base(left_operand, right_operand) { }

              protected override float? Apply(float left_value, float right_value) {
                  return left_value * right_value;
              }
          }

          class Division : BinaryOperation {
              public Division(ArithmeticExpression left_operand, ArithmeticExpression right_operand)
                  : base(left_operand, right_operand) { }

              protected override float? Apply(float left_value, float right_value) {
                  return left_value / right_value;
              }
          }
      }
  }
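To see the pattern in action, here is a rough Python transcription of a trimmed-down version of this hierarchy, keeping only Variable and Sum (an illustrative sketch, not part of the C# code above):

```python
# A rough Python transcription of the same hierarchy: each expression kind is
# a subclass, and eval dispatches through inheritance instead of a type field.
class ArithmeticExpression:
    def eval(self, context):
        raise NotImplementedError

class Variable(ArithmeticExpression):
    def __init__(self, label):
        self.label = label

    def eval(self, context):
        # Returns None for undefined variables, like the C# float?
        return context.get(self.label)

class Sum(ArithmeticExpression):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def eval(self, context):
        l, r = self.left.eval(context), self.right.eval(context)
        return None if l is None or r is None else l + r

expr = Sum(Variable("x"), Variable("y"))
expr.eval({"x": 40.0, "y": 2.0})  # 42.0
expr.eval({"x": 40.0})            # None: y is undefined
```

Adding a new operation means adding a new subclass; no existing code needs to change, and no case can be forgotten silently.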

1. .Net languages have the [StructLayout(LayoutKind.Explicit)] attribute, which makes it possible to create structs which behave a lot like C++ unions. But that only works with primitive types. []
2. Java equivalent: instanceof []

## Introduction

Boolean flags are commonly used to disable a block of code while another is running: for example,

  private bool in_use;

  private void Process() {
      in_use = true;
      (...)
      in_use = false;
  }

  private void OnEvent() {
      if (in_use)
          return;

      (...)
  }

  Private InUse As Boolean

  Private Sub Process()
      InUse = True
      (...)
      InUse = False
  End Sub

  Private Sub OnEvent()
      If InUse Then Return

      (...)
  End Sub


This design has a major drawback: in_use blocks cannot be nested, since nesting two such blocks will set in_use to false too early. In the following excerpt, the (...) section in the Process function will not run correctly, because in_use will have been set to false when it runs.

  private void Process() {
      in_use = true;
      Action1();
      Action2();
      (...) // in_use == false here (!)
      in_use = false;
  }

  private void Action1() {
      in_use = true;
      (...)
      in_use = false;
  }

  private void Action2() {
      in_use = true;
      (...)
      in_use = false;
  }

  Private Sub Process()
      InUse = True
      Action1()
      Action2()
      (...) ' InUse = False here (!)
      InUse = False
  End Sub

  Private Sub Action1()
      InUse = True
      (...)
      InUse = False
  End Sub

  Private Sub Action2()
      InUse = True
      (...)
      InUse = False
  End Sub


Such errors are difficult to spot in large applications, and often lead to hard-to-track bugs.

## Robust boolean flags

A useful trick in such cases is to replace your boolean flags with boolean properties, backed by a counter: every time you set in_use to true, the counter is incremented; every time you set it to false, the counter is decremented. Reading in_use returns true when the counter is greater than 0, and false otherwise. That’s somewhat similar to the way semaphores work in parallel programming.

  using System.Diagnostics;

  private int users_count = 0;

  public bool in_use {
      get {
          return users_count > 0;
      }

      set {
          users_count += (value ? 1 : -1);
          Debug.Assert(users_count >= 0);
      }
  }

  Imports System.Diagnostics

  Private UsersCount As Integer

  Public Property InUse As Boolean
      Get
          Return UsersCount > 0
      End Get

      Set(value As Boolean)
          UsersCount += If(value, 1, -1)
          Debug.Assert(UsersCount >= 0)
      End Set
  End Property


You can now nest in_use blocks safely.
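Incidentally, the same counter-backed property can be sketched in Python (an illustrative transcription, not part of the C#/VB code above):

```python
# Illustrative Python transcription: a boolean property backed by a counter.
# Setting in_use to True increments the counter; False decrements it; reading
# it reports whether the counter is positive, so nested blocks compose safely.
class Worker:
    def __init__(self):
        self._users_count = 0

    @property
    def in_use(self):
        return self._users_count > 0

    @in_use.setter
    def in_use(self, value):
        self._users_count += 1 if value else -1
        assert self._users_count >= 0

w = Worker()
w.in_use = True   # outer block starts
w.in_use = True   # nested block starts
w.in_use = False  # nested block ends: w.in_use is still True here
w.in_use = False  # outer block ends: w.in_use is now False
```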

### Going further

The previous construct is not thread-safe, and it may end up in an inconsistent state if an exception occurs in the body of an in_use block: in that case, in_use is never reset to false.

We solve both problems by declaring a Flag class which allows for the following construct:

  private void Process() {
      using (Flag flag = new Flag("custom name")) {
          (...)
      }
  }

  private void OnEvent() {
      if (Flag.InUse("custom name"))
          return;

      (...)
  }

  Private Sub Process()
      Using flag As New Flag("custom name")
          (...)
      End Using
  End Sub

  Private Sub OnEvent()
      If Flag.InUse("custom name") Then Return

      (...)
  End Sub


Here is the implementation; the Flag class exposes one static method, InUse(string), which determines whether the resource designated by a given string is in use. The Flag constructor registers a new user by incrementing the flag_users counter, while the Dispose method accordingly decrements it.

  using System;
  using System.Collections.Generic;

  class Flag : IDisposable {
      string name;
      private static Dictionary<string, int> flag_users = new Dictionary<string, int>();

      public static bool InUse(string name) {
          lock (flag_users)
              return (flag_users.ContainsKey(name) && flag_users[name] > 0);
      }

      public Flag(string name) {
          this.name = name;

          lock (flag_users) {
              if (!flag_users.ContainsKey(name))
                  flag_users[name] = 0;

              flag_users[name] += 1;
          }
      }

      public void Dispose() {
          lock (flag_users)
              flag_users[name] -= 1;
      }
  }

  Imports System.Collections.Generic

  Class Flag
      Implements IDisposable

      Private Name As String
      Private Shared FlagUsersCount As New Dictionary(Of String, Integer)()

      Public Shared Function InUse(name As String) As Boolean
          SyncLock FlagUsersCount
              Return (FlagUsersCount.ContainsKey(name) AndAlso FlagUsersCount(name) > 0)
          End SyncLock
      End Function

      Public Sub New(Name As String)
          Me.Name = Name

          SyncLock FlagUsersCount
              If Not FlagUsersCount.ContainsKey(Name) Then
                  FlagUsersCount(Name) = 0
              End If

              FlagUsersCount(Name) += 1
          End SyncLock
      End Sub

      Public Sub Dispose() Implements System.IDisposable.Dispose
          SyncLock FlagUsersCount
              FlagUsersCount(Name) -= 1
          End SyncLock
      End Sub
  End Class
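The same idea maps naturally onto Python's with statement, where a context manager plays the role of IDisposable; here is a rough, illustrative sketch (not part of the code above):

```python
# Illustrative Python counterpart: a context manager tracks per-name user
# counts in a shared dictionary, decrementing even when an exception occurs.
import threading
from collections import defaultdict

class Flag:
    _users = defaultdict(int)
    _lock = threading.Lock()

    @classmethod
    def in_use(cls, name):
        with cls._lock:
            return cls._users[name] > 0

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        with Flag._lock:
            Flag._users[self.name] += 1
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs even if an exception was raised inside the with block
        with Flag._lock:
            Flag._users[self.name] -= 1
        return False  # do not swallow exceptions

with Flag("custom name"):
    assert Flag.in_use("custom name")
assert not Flag.in_use("custom name")
```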


How do you implement your own boolean flags? Post examples in the comments!

## Unscrambling shuffled text

A story which surfaced a few years ago, and met with quite some success in the press and on the internet, claimed that Cambridge University had been conducting research on some of the most amazing faculties of the human brain. According to this supposed study, the order in which letters are laid out when writing a word matters very little, provided the first and last letters are kept in place: this conclusion was supported by a short excerpt of shuffled text, which anyone could easily decipher1.

As many commentators pointed out at the time, the trick works well because the words used in such excerpts are relatively short; the following passage should be much harder to understand:

The eofrft rureieqd to sfssllcceuuy dhiepecr sbecmrald pgsaases daiaarclltmy ianserecs as wdors get lehegtinr.

## Shuffling text

Shuffling a piece of text is easy:

import re, random

def shuffle_one(word):
    if len(word) <= 3:
        return word

    middle = list(word[1:-1])
    random.shuffle(middle)

    return word[0] + ''.join(middle) + word[-1]

def repl(match):
    return shuffle_one(match.groups()[0])

def shuffle(text):
    return re.sub(r"(\w+)", repl, text)

shuffle("Normally, reading this shouldn't be too hard.")


Consider the opposite problem:

## Unshuffling scrambled text

Deciphering a shuffled message is non-trivial because one word can be shuffled in many different ways: "Cadgmrbie", "Cgamdbire", "Cmbiagrde" and "Cgrbimdae" are all acceptable representations of the word "Cambridge".

To overcome this difficulty, we define a shuffling-independent signature — a function that maps each word to a signature that does not change when the word is shuffled. The following snippet implements such a signature by isolating the first and last letter of each word and sorting the central part alphabetically.

def signature(word):
    word = word.lower()

    if len(word) <= 3:
        return word

    middle = list(word[1:-1])
    return word[0] + ''.join(sorted(middle)) + word[-1]

signature("Welcome") # 'wcelmoe'
signature("Woclmee") # 'wcelmoe'


Using this function, we can now compute a normalized version of the word "Cambridge", which does not change when its letters are shuffled: signature("Cambridge"), signature("Cadgmrbie"), signature("Cgamdbire"), signature("Cmbiagrde") and signature("Cgrbimdae") are all equal to "cabdgimre".

We then compute a signature-to-word mapping (possibly multi-valued — different words may map to the same signature) using a word list2 (english.txt).

with open("english.txt", encoding="utf-8") as handler:
    words = [word.strip() for word in handler]

dictionary = {signature(word): word for word in words}


This gives us a signature-to-word dictionary, which we can then use to decipher the shuffled message. For this, we first compute the signature of each scrambled word in the input, and then retrieve the corresponding clear-text words from the dictionary.

def unshuffle_one(word):
    sig = signature(word)
    return dictionary[sig] if sig in dictionary else word

def unrepl(match):
    return unshuffle_one(match.groups()[0])

def unshuffle(text):
    return re.sub(r"(\w+)", unrepl, text)

print(unshuffle("The eofrft rreuqied to sfssllcceuuy dhiepecr sbecmrald "
                "pgsaases daiaarclltmy ianserecs as wdors get lehegtinr."))


Taking the signature of each word in our example sentence, we obtain

The effort reeiqrud to sccefllssuuy dceehipr sabcelmrd paaegsss daaacillmrty iaceenrss as wdors get leeghintr.

And by applying the signature-to-word mapping to each word, we finally obtain the clear text:

The effort required to successfully decipher scrambled passages dramatically increases as words get lengthier.

## Further thoughts

### Handling signature collisions

The code can be modified to allow for signature collisions (collisions happen when multiple words share the same signature): for this, instead of using the dictionary as a signature-to-word mapping, we use it as a signature-to-list-of-words mapping:

# Load dictionary entries
dictionary = {}
for word in words:
    sig = signature(word)

    if sig not in dictionary:
        dictionary[sig] = [] # Create an empty list

    # Add word to the list of all words sharing signature sig
    dictionary[sig].append(word)

def unshuffle_one(word):
    sig = signature(word)
    return '|'.join(dictionary[sig]) if sig in dictionary else word


The following function prints a list of all signature collisions.

def find_collisions():
    for sig in dictionary:
        if len(dictionary[sig]) > 1:
            print(', '.join(dictionary[sig]))


Relatively common words triggering signature collisions are listed below.

Most common English words triggering signature collisions
angels, angles
barking, braking
bedroom, boredom
blowing, bowling
brunt, burnt
caller, cellar
calvary, cavalry
carve, crave
cashing, chasing
catering, creating
centers, centres
circles, clerics
clot, colt
cloud, could
compiled, complied
complaint, compliant
conservation, conversation
dairies, diaries
damned, demand
density, destiny
divers, drives
entirety, eternity
excepted, expected
feeling, fleeing
files, flies
fingers, fringes
fired, fried
flower, fowler
forests, fosters
gateway, getaway
hate, heat
haters, hearts
incest, insect
lair, liar
lakes, leaks
lanes, leans
maiden, median
males, meals
marital, martial
panel, penal
panels, planes
parties, pirates
patrol, portal
pluses, pulses
pointers, proteins
recourse, resource
redrawing, rewarding
reserve, reverse
resorts, rosters
sacred, scared
saints, stains
sales, seals
shooting, soothing
signing, singing
sinks, skins
skates, stakes, steaks
sorting, storing
spilt, split
spotlight, stoplight
spots, stops
sprite, stripe
taxes, texas
tenor, toner
trail, trial
tribune, turbine
warp, wrap
weird, wired