# The cereal box problem: How many boxes does it take to find all prizes?

Children’s cereal manufacturers often attract the attention of young clients by including small prizes and toys in every box; sometimes all prizes are identical, but most often individual prizes are part of a collection, and kids are encouraged to collect them and try to complete a full collection. How long does it take ?

Simple probabilistic modeling shows that on average $$n (1 + 1/2 + \ldots + 1/n)$$ boxes are required to complete a full set of $$n$$ prizes: for example, it takes on average $$14.7$$ boxes to complete a full set of six prizes.

# An experimental estimation of the entropy of English, in 50 lines of Python code

“Th_ onl_ wa_ to ge_ ri_ of a tempta____ is to yie__ to it. Resi__ it, an_ you_ soul gro__ sic_ wi__ longi__ fo_ th_ thin__ it ha_ forbi____ to itse__.”

(Osc__ Wil__, The Picture __ ______ ____)

Thanks to the verbosity of the English language, proficient English speakers generally find it relatively easy to decipher the above passage despite the numerous omissions.

How does one quantify this redundancy? This article introduces the notions of Shannon entropy and information rate, and experimentally estimates the information rate of written English by training a Markov model on a large corpus of English texts. This model is finally used to generate gibberish that presents all the statistical properties of written English. Best of all, the entire source code fits in 50 lines of elegant Python code.

# Linear time probabilistic pattern matching and the Rabin-Karp algorithm

Most linear-time string searching algorithms are tricky to implement, and require heavy preprocessing of the pattern before running the search. This article presents the Rabin-Karp algorithm, a simple probabilistic string searching algorithm based on hashing and polynomial equality testing, along with a Python implementation. A streaming variant of the algorithm and a generalization to searching for multiple patterns in one pass over the input are also described, and performance aspects are discussed.

The algorithm is probabilistic in that it doesn’t always return correct results; more precisely, it returns all valid matches and (with reasonably small probability) a few incorrect matches (algorithms such as this one that tend to be over-optimistic in reporting their results are usually said to be true-biased).

# Modeling and measuring string comparison performance in C, C++, C# and Python.

Comparing strings is often — erroneously — said to be a costly process. In this article I derive the theoretical asymptotic cost of comparing random strings of arbitrary length, and measure it in C, C++, C# and Python.

# Generating uniformly random data from skewed input: biased coins, loaded dice, skew correction, and the Von Neumann extractor

In a famous article published 1951 ((Various techniques used in connection with random digits. NIST journal, Applied Math Series, 12:36-38, 1951. This article does not seem to be available online, though it is widely cited. It is reprinted in pages 768-770 of Von Neumann’s collected works, Vol. 5, Pergamon Press 1961)), John Von Neumann presented a way of skew-correcting a stream of random digits so as to ensure that 0s and 1s appeared with equal probability. This article introduces a simple and mentally workable generalization of his technique to random dice, so a loaded die can be used to uniformly draw numbers from the set $$\{1, 2, 3, 4, 5, 6\}$$, with reasonable success.

# Unscrambling shuffled text

A story which surfaced a few years ago, and met quite some success in the press and on the internet, pretended Cambridge University had been conducing research on some of the most amazing faculties of the human brain. According to a supposed study, the order in which letters were laid out when writing a word mattered very little, provided the first and last letter be kept in place : this conclusion was supported by a short excerpt of shuffled text, which anyone could easily decipher ((Matt Davis has a fascinating write-up on this. Snopes also has some reference material; the full text was something like Aoccdrnig to rscheearch at Cmabrigde uinervtisy, it deosn’t mttaer waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteres are at the rghit pclae. The rset can be a tatol mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.)). As a short example, consider the following sentence: