The cereal box problem: How many boxes does it take to find all prizes?

Children’s cereal manufacturers often attract the attention of young clients by including small prizes and toys in every box; sometimes all prizes are identical, but most often individual prizes are part of a collection, and kids are encouraged to collect them and try to complete a full collection. How long does it take ?

Simple probabilistic modeling shows that on average \(n (1 + 1/2 + \ldots + 1/n)\) boxes are required to complete a full set of \(n\) prizes: for example, it takes on average \(14.7\) boxes to complete a full set of six prizes.

Continue reading →

An experimental estimation of the entropy of English, in 50 lines of Python code

“Th_ onl_ wa_ to ge_ ri_ of a tempta____ is to yie__ to it. Resi__ it, an_ you_ soul gro__ sic_ wi__ longi__ fo_ th_ thin__ it ha_ forbi____ to itse__.”

(Osc__ Wil__, The Picture __ ______ ____)

Thanks to the verbosity of the English language, proficient English speakers generally find it relatively easy to decipher the above passage despite the numerous omissions.

How does one quantify this redundancy? This article introduces the notions of Shannon entropy and information rate, and experimentally estimates the information rate of written English by training a Markov model on a large corpus of English texts. This model is finally used to generate gibberish that presents all the statistical properties of written English. Best of all, the entire source code fits in 50 lines of elegant Python code.

Continue reading →

Linear time probabilistic pattern matching and the Rabin-Karp algorithm

Most linear-time string searching algorithms are tricky to implement, and require heavy preprocessing of the pattern before running the search. This article presents the Rabin-Karp algorithm, a simple probabilistic string searching algorithm based on hashing and polynomial equality testing, along with a Python implementation. A streaming variant of the algorithm and a generalization to searching for multiple patterns in one pass over the input are also described, and performance aspects are discussed.

The algorithm is probabilistic in that it doesn’t always return correct results; more precisely, it returns all valid matches and (with reasonably small probability) a few incorrect matches (algorithms such as this one that tend to be over-optimistic in reporting their results are usually said to be true-biased).

Continue reading →

Six months of Windows Phone Development — Tips, tricks, and performance considerations

A stroke order diagram for the character 羲 TL;DR: I’ve taken plenty of notes while developing my YiXue Chinese dictionary for Windows Phone. Topics covered in this article include performance tips, best practices, and plenty of code snippets for Windows Phone.

Introduction

This article presents notes and remarks that I gathered while working on a Chinese Dictionary App for Windows Phone, YiXue Chinese Dictionary: mistakes I made, fun tips I wrote down, and so on.
I initially didn’t really intend to create a full blog post out of these notes, but their increasing number, and my app recently placing second in Microsoft France’s App Awards contest, gave me enough motivation to share them with the community. Along with various tips and tricks and answers to often-asked (but seldom answered) questions, I will discuss a number of performance improvements that specifically apply to Windows Phone apps.

Continue reading →

Modeling and measuring string comparison performance in C, C++, C# and Python.

O(1) Comparing strings is often — erroneously — said to be a costly process. In this article I derive the theoretical asymptotic cost of comparing random strings of arbitrary length, and measure it in C, C++, C# and Python.

Continue reading →

Generating uniformly random data from skewed input: biased coins, loaded dice, skew correction, and the Von Neumann extractor

A spinning coin, about to fall on tails In a famous article published 1951 ((Various techniques used in connection with random digits. NIST journal, Applied Math Series, 12:36-38, 1951. This article does not seem to be available online, though it is widely cited. It is reprinted in pages 768-770 of Von Neumann’s collected works, Vol. 5, Pergamon Press 1961)), John Von Neumann presented a way of skew-correcting a stream of random digits so as to ensure that 0s and 1s appeared with equal probability. This article introduces a simple and mentally workable generalization of his technique to random dice, so a loaded die can be used to uniformly draw numbers from the set \(\{1, 2, 3, 4, 5, 6\}\), with reasonable success.

Continue reading →

The cafeteria paradox: stop using the water dispenser while someone else does!

Most cafeteria water dispensers will let two (sometimes more) people fill a jug at the same time. This article uses simple maths to prove that it’s a waste of time. In other words, two people should never use the same water dispenser at the same time: I call this the cafeteria paradox.

An intuitive presentation

Let’s start with a little brain teaser:

Alice and Bob are at the cafeteria, seating at different tables. Alice stands up to refill her table’s water jug. Little after, Bob stands up with his own table’s jug, heading to the same water dispenser. That water dispenser is a perfectly standard one, with two taps, and Bob finds himself standing near Alice. After a small hesitation, Bob starts using the second tap to fill his own jug, thereby diverting part of the output previously devoted to Alice.

This grants him an exasperated and somewhat puzzled look from Alice. Why?

Continue reading →

Willy Wonka’s golden tickets: certainly the most profitable marketing campaign of all times

There’s something that has troubled me since childhood. In Charlie and the chocolate factory, one of the lucky children (Veruca Salt) gets her golden ticket thanks to her father repruposing his peanut shelling factory in a chocolate-bar-unwrapping factory.

Now, there are only five golden tickets, and chocolate bars in the book seem quite popular. How many bars would one need to buy (and unwrap) just to have a seizable chance of finding one of the tickets? Most likely a lot. After doing the maths, I would estimate the number of chocolate bars that Mr. Salt had to buy to something between 12 and 40 million chocolate bars; which means this promotional campaign was most certainly one of the most profitable in history. Details below.

Continue reading →

How random is pseudo-random? Testing pseudo-random number generators and measuring randomness

After introducing true and pseudo-random number generators, and presenting the methods used to measure randomness, this article details a number of common statistical tests used to evaluate the quality of random number generators.

Continue reading →

Using abstract classes to simulate tagged unions (aka sum types)

Most functional languages offer support for tagged unions (also called sum types), a type of data structure capable of successively holding values of several fixed types. This article shows how to use abstract classes to emulate such behaviour in high-level object-oriented languages such as C#, Java, or VB.Net ((.Net languages have the [StructLayout(LayoutKind.Explicit)] attribute, which makes it possible to create structs which behave a lot like C++ unions. But that only works with primitive types.)).

Continue reading →

Code crumbs, by Clément Pit-Claudel

Personal thoughts on maths and programming, often illustrated with code snippets

The cereal box problem: How many boxes does it take to find all prizes?

An experimental estimation of the entropy of English, in 50 lines of Python code

Linear time probabilistic pattern matching and the Rabin-Karp algorithm

Six months of Windows Phone Development — Tips, tricks, and performance considerations

Introduction

Modeling and measuring string comparison performance in C, C++, C# and Python.

Generating uniformly random data from skewed input: biased coins, loaded dice, skew correction, and the Von Neumann extractor

The cafeteria paradox: stop using the water dispenser while someone else does!

An intuitive presentation

Willy Wonka’s golden tickets: certainly the most profitable marketing campaign of all times

How random is pseudo-random? Testing pseudo-random number generators and measuring randomness

Using abstract classes to simulate tagged unions (aka sum types)