Simple probabilistic modeling shows that on average \(n (1 + 1/2 + \ldots + 1/n)\) boxes are required to complete a full set of \(n\) prizes: for example, it takes on average \(14.7\) boxes to complete a full set of six prizes.
The expected number of boxes to buy before finding all \(n\) prizes is shown below:
This histogram shows the calculated average number of purchases required to accumulate at least one copy of each prize, for various prize-set sizes \(n\): for example, finding all prizes in a series of twelve requires on average 37.24 purchases. Note that the curve is somewhat steeper than linear, due to a multiplicative logarithmic factor \(\left(1 + \frac12 + \frac13 + \ldots + \frac1{n}\right) \sim \log(n)\).
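Both averages quoted above can be checked directly from the formula \(n(1 + 1/2 + \ldots + 1/n)\); here is a quick sketch (the helper name is ours), using exact rational arithmetic to avoid rounding:

```python
from fractions import Fraction

def expected_boxes(n):
    """Expected number of purchases to complete a set of n prizes: n * (1 + 1/2 + ... + 1/n)."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))

print(float(expected_boxes(6)))   # 14.7
print(float(expected_boxes(12)))  # ≈ 37.24
```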
And here is an experimental simulation for \(n = 6\): the graph below shows how many purchases were required to obtain six prizes out of six, over 1,000,000 simulations.
This histogram shows the experimental probability distribution of \(T_6\); that is, of the number of cereal boxes that one needs to purchase before finding all prizes in a series of six. Interestingly, despite the peak at \(t = 11\), the distribution's long tail pulls the average up to \(14.65\) in this experiment, close to the theoretical prediction of \(14.7\).
Interested in how these plots were obtained? Read on to find out, or skip directly to the comments to share your thoughts!
Assume there are \(n\) different prizes, all equally distributed among cereal boxes. Model buyer behavior as a sequence of purchases, and call \(T_k\) the time at which they find the \(k\)^{th} prize: clearly \(T_0 = 0\) (no purchases are needed to find 0 prizes) and \(T_1 = 1\) (one finds their first prize immediately after buying their first cereal box). Naturally, \(T_n\) denotes the time required to find all prizes.
To calculate \(\newcommand{\E}[1]{\mathrm{\mathbf{E}}\left[#1\right]}\E{T_n}\), call \(\Delta_k\) the elapsed time between the \(k\)^{th} and \(k+1\)^{th} prizes, i.e. \(\Delta_k = T_{k+1} – T_k\). Summing over \(k\) yields \(\sum_{k=0}^{n-1} \Delta_k = T_n – T_0\) which, given the linearity of the expectation operator, yields $$
\sum_{k=0}^{n-1} \E{\Delta_k} = \E{T_n}
$$
All that remains to be done now is to calculate \(\E{\Delta_k}\) for all \(k \in \{ 0, \ldots, n-1 \}\). By definition $$
\newcommand{\P}[1]{\mathrm{\mathbf{P}}\left(#1\right)}\E{\Delta_k} = \sum_{\delta = 1}^\infty \P{T_{k+1} = T_{k} + \delta} · \delta
$$
\(\P{T_{k+1} = T_{k} + \delta}\) is the probability of finding the \(k+1\)^{th} prize exactly \(\delta\) purchases after obtaining the \(k\)^{th}; that is, the probability of completing \(\delta – 1\) unproductive purchases, then a successful one. Now the probability of hitting a sequence of \(\delta – 1\) unproductive purchases is exactly \((k/n)^{\delta-1}\); hence $$
\P{T_{k+1} = T_{k} + \delta} = \left(\frac{k}{n}\right)^{\delta-1}\left(1 – \frac{k}{n}\right)
$$
Summing over positive values of \(\delta\) yields $$
\begin{align*}
\E{\Delta_k} &= \sum_{\delta = 1}^\infty \P{T_{k+1} = T_{k} + \delta} · \delta = \sum_{\delta = 1}^\infty (k/n)^{\delta-1} · (1 – k/n) · \delta\\
&= (1 – k/n) \sum_{\delta = 1}^\infty \delta (k/n)^{\delta-1} = (1 – k/n) \frac{1}{(1 – k/n)^2}\\
&= \frac{1}{1 – k/n} = \frac{n}{n-k}
\end{align*}
$$
And summing over \(k\) yields $$
\begin{align*}
\E{T_n} &= \sum_{k=0}^{n-1} \E{\Delta_k} = \sum_{k=0}^{n-1} \frac{n}{n-k} = n \sum_{k=1}^{n} \frac{1}{k}\\
&= n \left(1 + \frac12 + \frac13 + \ldots + \frac1{n}\right)
\end{align*}
$$
For large \(n\), this is \(n \left(\log(n) + \gamma + \frac1{2n} + o\left(\frac1n\right)\right)\), where \(\gamma \approx 0.57722\) denotes the Euler–Mascheroni constant.
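A quick numerical sanity check of this asymptotic expansion (a sketch, with \(\gamma\) truncated to a few digits):

```python
import math

GAMMA = 0.5772156649  # Euler–Mascheroni constant, truncated

def exact(n):
    """n * (1 + 1/2 + ... + 1/n), computed directly."""
    return n * sum(1 / k for k in range(1, n + 1))

def approx(n):
    """The asymptotic expansion n * (log(n) + gamma + 1/(2n))."""
    return n * (math.log(n) + GAMMA + 1 / (2 * n))

for n in (6, 100, 10000):
    print(n, exact(n), approx(n))  # the two columns agree increasingly well
```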
Note that, as expected, \(\E{\Delta_k}\) quickly increases as \(k\) grows larger: the more prizes have already been found, the harder it is to get a new one, to the point that finding the last prize requires \(n\) purchases on average.
The experimental data above was simulated using the following code:
```python
import random
from collections import defaultdict, Counter

n = 6                  # Size of the prize set
repeat = 1000 * 1000   # Number of simulated collections
time_to = defaultdict(Counter)
for _ in range(repeat):
    purchased = 0
    collected_prizes = set()
    while len(collected_prizes) < n:       # Until all prizes have been found
        purchased += 1                     # Purchase one more box
        prize = random.randint(1, n)       # Open it
        if prize not in collected_prizes:  # Is it a new prize?
            time_to[len(collected_prizes) + 1][purchased] += 1  # Yes, count it!
            collected_prizes.add(prize)    # And add it to the list of found prizes
```
One might want some intuitive way to comprehend the formula derived above, \(\E{T_n} = n\left(1 + 1/2 + \ldots + 1/n\right)\). A relatively easy way to build that intuition is to summarize the calculation above: each new box bought after finding the \(k\)^{th} prize brings a new prize with probability \((n-k)/n\); since each purchase is independent, it takes on average \(n / (n-k)\) purchases to find a new prize, and summing over \(k\) yields the final formula.
Have you ever completed a full set of cereal box prizes? How long did it take?
(Osc__ Wil__, The Picture __ ______ ____)
Thanks to the verbosity of the English language, proficient English speakers generally find it relatively easy to decipher the above passage despite the numerous omissions.
How does one quantify this redundancy? This article introduces the notions of Shannon entropy and information rate, and experimentally estimates the information rate of written English by training a Markov model on a large corpus of English texts. This model is finally used to generate gibberish that presents all the statistical properties of written English. Best of all, the entire source code fits in 50 lines of elegant Python code.
Entropy, a notion formalized in the context of information theory by Claude Shannon in his 1948 article “A Mathematical Theory of Communication”, is a measure of how much information an observer can gain by learning the result of a certain random process.
Similarly, the entropy of an information source, also known as its information rate, or its source entropy, is defined as the limit of the conditional entropy of each consecutive letter given the previous ones; in other words, the entropy of an information source quantifies how much extra information is gained by reading one extra letter of a text.
As the example above demonstrates, not every letter of a text is needed to decipher it properly; indeed, humans rely on a huge amount of statistical knowledge to speed up their reading: for example, they know that ‘q’ is almost always followed by ‘u’, and that ‘wh’ is most often followed by ‘o’ (‘who’), ‘y’ (‘why’), ‘at’ (‘what’), ‘en’ (‘when’), ‘ere’ (‘where’), or ‘ich’ (‘which’). Source entropy yields a simple, practical measurement of this intrinsic predictability, and of the amount of information carried by each character in a language.
Given a random variable \(X\), Shannon defines its entropy as
$$ H(X) = – \sum_{x} \P{X=x} \lg(\P{X=x}) $$
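As a concrete illustration (not from the original article): a fair coin toss carries exactly one bit of information, while a biased coin carries less, since its outcome is more predictable:

```python
import math

def H(distribution):
    """Shannon entropy, in bits, of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

print(H({"heads": 0.5, "tails": 0.5}))  # 1.0
print(H({"heads": 0.9, "tails": 0.1}))  # ≈ 0.47
```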
Generalizing, the entropy of \(X\) conditioned on \(Y\) is defined as the expected entropy of \(X\), conditioned on the values of \(Y\):
$$
\begin{align*}
H(X|Y) &= \sum_{y} \P{Y = y} H(X|Y=y)\\
&= – \sum_{y, x} \P{Y = y} \P{X=x|Y=y} \lg(\P{X=x|Y=y})\\
&= – \sum_{y, x} \P{X = x, Y = y} \lg(\P{X=x|Y=y})
\end{align*}
$$
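To make the definition concrete, here is a small worked example with a made-up joint distribution: \(Y\) fully determines \(X\) half of the time, and leaves it completely uncertain the other half, so conditioning on \(Y\) leaves half a bit of entropy:

```python
import math

def H(dist):
    """Shannon entropy, in bits, of {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Hypothetical joint distribution P(Y, X)
joint = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.5}
p_y = {"a": 0.5, "b": 0.5}

def conditional(y):
    """P(X | Y=y), renormalized from the joint distribution."""
    return {x: p / p_y[y] for (yy, x), p in joint.items() if yy == y}

h_x_given_y = sum(p_y[y] * H(conditional(y)) for y in p_y)
print(h_x_given_y)  # H(X|Y=a) = 1 bit, H(X|Y=b) = 0 bits, average 0.5
```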
Finally, the entropy rate of an information source is the limit of the entropy of the \(n+1\)^{th} symbol conditioned on the \(n\) previous ones (if that sequence converges):
$$ h(X) = \lim_{n \to \infty} H(X_{n+1} | X_1, \ldots, X_n) $$
This section presents an algorithm to estimate the entropy of English text, using a large corpus of texts. As entropy fundamentally deals with streams, the code presented below handles textual samples as streams of tokens (either characters or words, depending on whether we're estimating the entropy per character or the entropy per word).
Note: this algorithm assumes that the distribution of the Markov process being modelled is stationary; that is, that any infinite stream of English text satisfies the property that
$$ \forall k, \forall n, \P{X_{k+n+1}|X_{k+1}, \ldots, X_{k+n}} = \P{X_{n+1}|X_{1}, \ldots, X_{n}} $$
First, define a utility function that reads the contents of a file and turns them into a continuous stream of tokens (characters, words, etc.). tokenize takes a path file_path, and a function tokenizer that splits a line of text into a list of tokens.
```python
def tokenize(file_path, tokenizer):
    with open(file_path, mode="r", encoding="utf-8") as file:
        for line in file:
            for token in tokenizer(line.lower().strip()):
                yield token
```
Two typical tokenizers are presented below (the words function will be useful in the next section, when we start generating random English text):
```python
import re

def chars(file_path):
    return tokenize(file_path, lambda s: s + " ")

def words(file_path):
    return tokenize(file_path, lambda s: re.findall(r"[a-zA-Z']+", s))
```
Then, define a function to build a statistical model of a stream. Since computing the limit \(h(X)\) presented above is not possible in practice, we approximate it by computing the conditional entropies for large enough values of \(n\). The model_order parameter below denotes \(n-1\). This function, called markov_model, returns two objects, stats and model:
```python
stats = {('w', 'h'): 53,
         ('t', 'h'): 67,
         ...}

model = {('w', 'h'): {'y': 25, 'o': 12, 'a': 16, ...},
         ('t', 'h'): {'i': 15, 'a': 18, 'e': 34, ...},
         ...}
```
indicating that ‘why’ appeared 25 times, ‘who’ 12 times, ‘wha’ 16 times, ‘thi’ 15 times, ‘tha’ 18 times, ‘the’ 34 times, and so on.
To construct these objects, iterate over the stream of tokens, keeping the last \(n-1\) tokens in a buffer and progressively building stats and model:
```python
from collections import defaultdict, deque, Counter

def markov_model(stream, model_order):
    model, stats = defaultdict(Counter), Counter()
    circular_buffer = deque(maxlen=model_order)
    for token in stream:
        prefix = tuple(circular_buffer)
        circular_buffer.append(token)
        if len(prefix) == model_order:
            stats[prefix] += 1
            model[prefix][token] += 1
    return model, stats
```
Finally, implement the calculation of the entropy:
```python
import math

def entropy(stats, normalization_factor):
    return -sum(proba / normalization_factor * math.log2(proba / normalization_factor)
                for proba in stats.values())

def entropy_rate(model, stats):
    return sum(stats[prefix] * entropy(model[prefix], stats[prefix])
               for prefix in stats) / sum(stats.values())
```
And pack everything together using (for a model of order 3)
```python
model, stats = markov_model(chars("sample.txt"), 3)
print("Entropy rate:", entropy_rate(model, stats))
```
…all in less than 40 lines of neat Python code!
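As a sanity check (not part of the original article), the estimator can be run on a synthetic stream whose entropy rate is known in advance: a uniformly random stream over four symbols has an entropy rate of exactly \(\log_2 4 = 2\) bits per token. The snippet below restates the functions above so that it runs standalone:

```python
import math, random
from collections import defaultdict, deque, Counter

def markov_model(stream, model_order):
    model, stats = defaultdict(Counter), Counter()
    circular_buffer = deque(maxlen=model_order)
    for token in stream:
        prefix = tuple(circular_buffer)
        circular_buffer.append(token)
        if len(prefix) == model_order:
            stats[prefix] += 1
            model[prefix][token] += 1
    return model, stats

def entropy(stats, normalization_factor):
    return -sum(p / normalization_factor * math.log2(p / normalization_factor)
                for p in stats.values())

def entropy_rate(model, stats):
    return sum(stats[prefix] * entropy(model[prefix], stats[prefix])
               for prefix in stats) / sum(stats.values())

random.seed(42)
stream = (random.choice("abcd") for _ in range(100000))
model, stats = markov_model(stream, 2)
print(entropy_rate(model, stats))  # close to 2.0 bits/token
```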
Running this algorithm on the entire Open American National Corpus (about 95 million characters) yields the following results:
Estimated entropy rates using increasingly complex Markov models. As expected, the estimated entropy rate decreases as the order of the model (i.e. the number of characters being taken into account) increases.
Note that, because the Markov model being constructed is based on a necessarily finite corpus of textual samples, estimating the entropy rate for large values of \(n\) will yield artificially low information rates. Indeed, increasing the order of the model decreases the number of samples per \(n\)-letter prefix, to the point that for very large values of \(n\), knowledge of the first \(n-1\) letters of a passage uniquely identifies the passage in question, deterministically yielding the \(n\)^{th} letter.
In the graph presented above, it is therefore likely that the estimates for \(n\) over 5 or 6 are already plagued by a significant amount of sampling error; indeed, for any given \(n\) there are in total \(27^{n}\) possible \(n\)-grams consisting only of the 26 alphabetic letters and a space, which exceeds the size of the corpus used in this experiment as soon as \(n > 5\).
Extrapolating these results to large values of \(n\) is particularly tricky, as in the general case nothing is known about the shape of this sequence of values except that it is positive and decreasing: deriving the limit entropy rate from this set of measurements requires building a model of the consecutive estimates, and fitting this model to the measured estimates.
As a rough example, call this sequence of values \(F_k\) and assume that it satisfies the recurrence \(F_{k+1} – F_k = \alpha (F_{k} – F_{k-1})\). Then the \(\alpha\) that yields the best approximation (taking the two initial values for granted, since they are less likely to suffer from sampling errors) is \(\alpha \approx 0.68\) (\(\mathcal{L}^2\) error: \(6.7 · 10^{-3}\)), and the corresponding entropy rate is \(h \approx 1.14\) bits/character.
Extrapolated entropy rate values for \(\alpha \approx 0.68\). In this heuristic model, the limit entropy rate is \(h \approx 1.14\) bits/character.
The statistical model of English built in the previous section can be used to generate fake English text: to this end, we introduce a pick function that, given a Counter (that is, a mapping from elements to numbers of occurrences), picks an element at random, respecting the distribution described by that Counter object.
```python
import random

def pick(counter):
    sample, accumulator = None, 0
    for key, count in counter.items():
        accumulator += count
        if random.randint(0, accumulator - 1) < count:
            sample = key
    return sample
```
A detailed explanation of this function will be the subject of a later post: efficiently sampling from an unordered, non-uniformly distributed sequence without pre-processing it is a non-trivial problem.
Given a length-\(n\) prefix, the generation algorithm obtains a new token by sampling from the distribution of tokens directly following this prefix. The token thus obtained is then appended to the prefix and the first token of the prefix is dropped, thus yielding a new prefix. The procedure is then repeated as many times as the number of characters that we want to obtain.
The following snippet implements this logic: model is a model obtained using the previously listed functions, state is the initial prefix to use, and length is the number of tokens that we want to generate.
```python
def generate(model, state, length):
    for token_id in range(0, length):
        yield state[0]
        state = state[1:] + (pick(model[state]),)
```
All that remains to be done is to pick a random initial prefix, using pick once again (the fill function wraps text at 80 columns to make the display prettier):
```python
import textwrap

prefix = pick(stats)
gibberish = "".join(generate(model, prefix, 100))
print(textwrap.fill(gibberish))
```
Feeding the whole of Shakespeare’s works into the model yields the following samples. Note how increasing the order of the model yields increasingly realistic results.
me frevall mon; gown good why min to but lone’s and the i wit, gresh, a wo. blace. belf, to thouch theen. anow mass comst shou in moster. she so, hemy gat it, hat our onsch the have puck.
glouces: let is thingland to come, lord him, for secress windeepskirrah, and if him why does, we think, this bottle be a croyal false we’ll his bein your be of thou in for sleepdamned who cond!
a gentlewomen, to violence, i know no matter extinct. before moment in pauncheon my mouth to put you in this image, my little join’d, as weak them still the gods! she hath been katharine; and sea and from argies, in the sun to be a strawberries ‘god servant here is ceremonious and heard.
gloucester, what he was like thee by my lord, we were cast away, away! mark that seeth not a jar o’ the tiger; take it on him as i have, and without control the whole inherit of this.
The code above defines tokens as characters, and builds a model of sequences of \(n\) consecutive characters. Yet, to generate more realistic-looking text, we can also split the training corpus by words, using
```python
model, stats = markov_model(words("sample.txt"), 3)
print(textwrap.fill(" ".join(generate(model, pick(stats), 100))))
```
Here’s an example of a statistically sound sequence of English words, generated using a Markov model of order 3 of Shakespeare’s writings:
he rouseth up himself and makes a pause while she the picture of an angry chafing boar under whose sharp fangs on his back doth lie an image like thyself all stain’d with gore whose blood upon the fresh flowers being shed doth make them droop with grief and extreme age.
Shannon’s 1948 article laid the foundations of a whole new science, information theory. The original paper is an easy yet extremely interesting read; it extends the notions presented here to derive theorems that link source entropy to the maximal possible compression of a source, and goes on to describe a mathematical framework for error correction.
Shannon himself also worked on estimating the information rate of English. However, since large-scale statistical experiments were impractical at the time, Shannon instead constructed an experiment involving humans: subjects were asked to guess the letters of a text one by one, and incorrect guesses per character were recorded to estimate the source entropy of sentences. In this experimental setting, Shannon obtained an estimate of the information rate of English between 0.6 and 1.3 bits per letter, in line with the results presented in this article.
Want to experiment by yourself? Download the full source code (also available for Python 2.7)!
Did you find the initial quote hard to decipher? Does that pseudo-Shakespeare text look realistic? Do you know other beautiful applications of information theory? Let us know in the comments!
The algorithm is probabilistic in that it doesn't always return correct results; more precisely, it returns all valid matches and (with reasonably small probability) a few incorrect matches (algorithms such as this one, which tend to be over-optimistic in reporting their results, are usually said to be true-biased).
Given a string s of length \(n\) and a pattern of length \(k\), the algorithm computes rolling checksums both for the pattern (the “needle”) and for the consecutive substrings of length \(k\) of the searched string (the “haystack”). Every time a substring's checksum equals that of the pattern, a match is reported.
```
matches = {}
pattern_checksum = checksum(pattern)
substring_checksum = checksum(s[0 : k[)
for position from 0 to n - k do
    update substring_checksum to checksum(s[position : position + k[)
    if substring_checksum = pattern_checksum then
        add position to matches
return matches
```
Formally, checksum functions map fixed-length strings to numbers, and good checksum functions satisfy two properties. First, collisions (two different strings being mapped to the same checksum) must be rare: the higher the collision rate, the more likely incorrect matches are to turn up. Second, checksums must be computed in a way that allows for efficient updating; that is, the value of the checksum of s[i+1:i+k+1[ should be quickly derivable from that of s[i:i+k[.
One such easily implemented hash function is based on polynomials. Given a list of \(k\) numbers \(a_0, \ldots, a_{k-1}\), we define the polynomial
$$ H(a)(X) = a_0 X^{k-1} + a_1 X^{k-2} + \ldots + a_{k-2} X + a_{k-1} = \sum_{j < k} a_j X^{(k-1)-j} $$
Assuming that each character of a string \(s\) can be mapped to a number, we then define the checksum of a substring s[i:i+k[ of length \(k\) of \(s\) as \(H(s_{[i:i+k[})(X) = \sum_{j < k} s_{i + j} X^{(k-1)-j}\).
As it stands, the mapping from s[i:i+k[ to \(H(s_{[i:i+k[})\) is injective, and the problem of checking whether s[i:i+k[ matches a pattern p can thus be reformulated as checking whether \(H(s_{[i:i+k[})(X) = H(p)(X)\). This problem is usually called polynomial identity testing, and can be solved deterministically in time \(O(k)\) by comparing the polynomials coefficient-wise.
Allowing for infrequent errors, however, takes the complexity of this process down to probabilistic \(O(1)\): instead of comparing two polynomials \(P\) and \(Q\), one can pick a random value \(a\) and compare the values of \(P(a)\) and \(Q(a)\), accepting if and only if \(P(a) = Q(a)\) (since values of \(P(a)\) and \(Q(a)\) may be too large to fit in standard-sized integers, computations are generally made modulo a certain large prime modulus \(m\)). The reasoning is that if indeed \(P=Q\) then the test will always accept \(P\) and \(Q\) as equal, while if \(P \not= Q\) then the probability of hitting a bad witness and incorrectly accepting \(P\) and \(Q\) as equal is low.
More precisely, if \(P\) and \(Q\) are polynomials of degree \(≤ k\), and \(m\) is a prime number larger than the largest coefficient of \(P – Q\), then the probability of hitting a root of \(P-Q\) is lower than \(k/m\) if \(P\) and \(Q\) differ.
Picking a large enough random prime modulus \(m\) and a random seed \(a \in \{ 0, \ldots, m – 1 \}\) therefore allows for the safe definition of $$h_i = H(s_{[i:i+k[})(a) \mod m = \sum_{0 ≤ j < k} s_{i + j} a^{(k-1)-j} \mod m$$ as the checksum of s[i:i+k[.
This definition allows the checksum \(h_{i+1}\) of s[i+1:i+k+1[ to be derived from that of s[i:i+k[ (that is, \(h_i\)) in constant time (provided the value of \(a^k\) is pre-computed once and for all before running the algorithm), using the formula \(h_{i+1} = a · h_i – s_i a^k + s_{i+k}\). Indeed,
$$
\begin{align}
h_{i+1} = H(s_{[i+1:i+k+1[})(a) &= \sum_{0 ≤ j < k} s_{i + 1 + j} a^{(k-1)-j}\\
&= \sum_{1 ≤ j < k + 1} s_{i + j} a^{(k-1)-(j-1)}\\
&= \sum_{0 ≤ j < k} s_{i + j} a^{(k-1)-j+1} – s_i a^k + s_{i+k}\\
&= a · H(s_{[i:i+k[})(a) – s_i a^k + s_{i+k}\\
&= a · h_i – s_i a^k + s_{i+k}
\end{align}
$$
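This update formula can be checked numerically against direct recomputation; here is a quick sketch, with arbitrary small parameters (seed 31, modulus 1000003) chosen purely for illustration:

```python
def poly_hash(s, a, m):
    """Evaluates H(s)(a) mod m using Horner's scheme."""
    h = 0
    for ch in s:
        h = (h * a + ord(ch)) % m
    return h

s, k, a, m = "abracadabra", 4, 31, 1000003
a_k = pow(a, k, m)                 # pre-computed a^k
h = poly_hash(s[0:k], a, m)
for i in range(len(s) - k):
    h = (a * h - ord(s[i]) * a_k + ord(s[i + k])) % m  # rolling update
    assert h == poly_hash(s[i + 1 : i + 1 + k], a, m)  # matches recomputation
print("rolling update agrees with direct recomputation")
```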
```python
import random

# See the implementation notes below for details. "(→ n)" indicates a footnote.

def checksum(string, seed, modulus, length):
    """Computes the checksum of string[0 : length]"""
    checksum = 0
    for pos in range(0, length):
        checksum = (checksum * seed + ord(string[pos])) % modulus
    return checksum

def string_search(haystack, needle):
    """Yields the positions p such that haystack[p : p + len(needle)] == needle"""
    LARGEST_CODEPOINT = 0x10FFFF  # (→ 1)

    # Sanity check
    if len(haystack) < len(needle):
        return

    # Initialization
    needle_length = len(needle)
    haystack_length = len(haystack)
    modulus = 0xB504F32D  # (→ 2)
    seed = random.randint(0, modulus - 1)
    seed_k = pow(seed, needle_length, modulus)  # Pre-compute a^k
    needle_checksum = checksum(needle, seed, modulus, needle_length)
    substr_checksum = checksum(haystack, seed, modulus, needle_length)  # (→ 3)

    # String processing
    for pos in range(0, haystack_length - needle_length + 1):
        if needle_checksum == substr_checksum:  # (→ 4)
            yield pos
        if pos < haystack_length - needle_length:  # (→ 5)
            substr_checksum = (seed * substr_checksum
                               - ord(haystack[pos]) * seed_k
                               + ord(haystack[pos + needle_length])) % modulus
```
Implementation notes:

1. Checksums are computed from characters using Python's built-in ord function, which returns the Unicode code point of any character. The largest such code point is 0x10FFFF, hence the value of LARGEST_CODEPOINT.
2. As the modulus must exceed all checksum polynomial coefficients, it must exceed LARGEST_CODEPOINT. Further requiring that the modulus be as large as possible helps reduce the probability of checksum collisions, and hence of spurious matches. Finally, enforcing that it be prime bounds the number of integer roots of degree-\(k\) polynomials. 0xB504F32D was chosen as it is the largest value satisfying these requirements whose square fits in a 64-bit signed integer. (On architectures with smaller machine integers, a smaller modulus such as 0xB501 would do, at the cost of increasing the rate of erroneous matches.)
3. substr_checksum is initialized to the checksum of \(s[0:k[\).
4. Reported matches may be spurious; they can be verified by defining

   ```python
   def verify(haystack, pos, needle):
       return all(haystack[pos + k] == needle[k] for k in range(0, len(needle)))
   ```

   and replacing the checksum test with

   ```python
   if needle_checksum == substr_checksum and verify(haystack, pos, needle):
   ```

   This slows down the algorithm a bit, however, taking its worst-case performance to \(O(nk)\); in general, though, performance remains good as long as the pattern is not redundant.
5. The checksum update reads the character at haystack[pos + needle_length], which does not exist on the last iteration; this guard therefore skips the final, useless update.

As previously noted, the probability of a spurious match turning up at a given position \(i \in \{0, \ldots, n-1\}\) is lower than \(k/m\) (where \(n\) denotes the length of the haystack, \(k\) that of the needle, and \(m\) the modulus). The probability of at least one incorrect match turning up anywhere in the string, however, is higher: in the general case, it can only be shown to be lower than \(n\) times the probability of an error popping up at an arbitrary given position, that is, \(n · k/m\). An ideal implementation would choose \(m\) to be as large as \(n · k\) times a certain large constant \(C\), thus taking the probability of one or more incorrect matches popping up down to \(1/C\).
Real-life implementations, however, are constrained by the size of machine integers (avoiding automatic promotion to Python's arbitrary-sized integers is crucial to ensuring acceptable performance). The implementation above therefore chooses the largest safe modulus value on a 64-bit architecture, 0xB504F32D. This takes the probability of obtaining one or more incorrect matches down to about \(n · k / 2^{32}\), which is below \(0.5\%\) for typical string searching problems.
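For instance (an illustrative calculation, with made-up input sizes): searching for a 100-character needle in a 100,000-character haystack gives

```python
n, k = 100_000, 100   # haystack and needle lengths (illustrative)
m = 0xB504F32D        # the modulus used above
print(n * k / m)      # ≈ 0.0033, i.e. about a 0.33% chance of any spurious match
```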
The culprit here is the checksum update calculation, and in particular the seed * substr_checksum multiplication, which yields values as large as the square of the modulus. Unfortunately, there is no way (that I am aware of) to perform the multiplication and modular reduction without resorting to large integers when the intermediate product exceeds the size of machine integers. Things get even worse on 32-bit machines, where the modulus must be chosen smaller than \(2^{15} – 1\).
Low-level optimizations, however, allow for the use of larger moduli:
- In C, unsigned 64-bit arithmetic allows the modulus to be as large as the largest prime that fits in 32 bits, 0xFFFFFFFB. In that case, special care must be taken in the checksum update code, so that the subtraction does not cause wrapping. The naïve implementation, shown below, performs incorrectly:

  ```c
  substr_checksum = (seed * substr_checksum - outgoing * seed_k + incoming) % modulus;
  ```

  Specifically, if the result of seed * substr_checksum - outgoing * seed_k is negative, the value wraps around modulo 1 << 64. Instead, the correct calculation is as shown below:

  ```c
  uint64_t mul = (seed * substr_checksum) % modulus;
  uint64_t sub = ((modulus - outgoing) * seed_k) % modulus;
  substr_checksum = (mul + sub + incoming) % modulus;
  ```
- In assembly, using the MUL instruction to calculate the product above yields a double-width 128-bit result stored in the RDX:RAX register pair, whose remainder in the division by a 64-bit modulus can be calculated using the DIV instruction (MOD on HLA).

More importantly, this \(n · k/m\) bound on the error rate is only theoretical, and the effective error rate on real-life texts is often much lower. Experimental results are hard to obtain given the time it takes to search through large files, but searching 200 times for a 5,000-character random string in Project Gutenberg's pg2554 (Crime and Punishment) did not return a single false positive, even though the theoretical bound derived from the formula above for the probability of getting at least one erroneous match in such an experiment was above one.
In a second experiment, I searched for the string TATAAA (the TATA box consensus sequence, of length 6) in the genome of a female platypus (length 1 684 663 807). Again, no incorrect matches were found.
One nice property of the algorithm above is that it does not require the entire string to be loaded into memory; instead, it only needs a buffer whose length equals that of the pattern being searched for. This makes it especially suitable for looking for relatively small patterns in huge strings, such as gene patterns in DNA sequences. In this particular case, verifying each match is especially advisable, given that the input stream is discarded after being processed, making it impossible to verify matches after the algorithm has finished running. Moreover, the enormous length of the input does increase the probability of an incorrect match popping up at some point.
The modifications required to allow for streaming are presented below:
```python
import collections, itertools, random

def verify(match, needle):
    return all(match[k] == needle[k] for k in range(0, len(needle)))

def checksum(input, seed, modulus):
    checksum = 0
    for byte in input:  # (→ 1)
        checksum = (checksum * seed + byte) % modulus
    return checksum

def stream_search(stream, needle):
    """Yields the positions p such that stream[p : p + len(needle)] == needle"""
    # Initialization
    needle_length = len(needle)
    modulus = 0xB504F32D
    seed = random.randint(0, modulus - 1)
    seed_k = pow(seed, needle_length, modulus)  # Pre-compute a^k
    prefix = list(itertools.islice(stream, needle_length - 1))  # (→ 2)
    needle_checksum = checksum(needle, seed, modulus)
    substr_checksum = checksum(prefix, seed, modulus)

    # Stream processing
    pos = 0
    substr = collections.deque(prefix)  # (→ 3)
    for incoming in stream:
        substr.append(incoming)
        outgoing = substr.popleft() if pos > 0 else 0  # (→ 4)
        substr_checksum = (seed * substr_checksum
                           - outgoing * seed_k
                           + incoming) % modulus
        if needle_checksum == substr_checksum and verify(substr, needle):
            yield pos
        pos += 1
```
Implementation notes:

1. The streaming version manipulates bytes objects instead of string objects, as we are mostly going to process huge byte streams loaded from files or from a network. Iterating over a bytes object directly yields integers, so ord is no longer needed.
2. The itertools.islice(stream, n) function consumes n values from a stream. We only take the needle_length - 1 first ones here; the first round of updates will complete the checksum calculation.
3. The substr deque keeps a copy of the current sliding window, so that candidate matches can be verified against the needle.
4. The pos > 0 test prevents popleft from being called on an incomplete window in the first rolling checksum update step. This is needed because on the first iteration of the loop the checksum is incomplete, and the sliding window is not full yet.

As it stands, the algorithm presented above performs rather poorly for two reasons: first, it is written in pure Python, and the interpreted arithmetic on every input character is costly; second, the algorithm inherently performs a multiplication and a modulus operation for each character it reads.
Though it performs acceptably, the Python implementation of this algorithm is therefore no match for the highly optimized built-in string.find CPython function. The first culprit pointed out above can easily be fixed by moving to another, more low-level language; the second, which is directly related to the algorithm being used, is much harder to mitigate.
Using a non-prime modulus is sometimes advocated; in that case, the modulus most often chosen is \(2^{64}\) (or, more generally, \(2^w\), where \(w\) is the size of machine integers), the idea being that C wrapping semantics guarantee that the modulo operation is performed almost for free by the hardware. The explicit modulus operations can then be removed altogether, but care must be taken to ensure a proper choice of seed, so as to minimize the probability of checksum collisions. This yields an experimental speedup of about 4 times in C; the resulting code is in the order of 5 to 10 times slower than the optimized CPython routine, but it is deterministic, and hence vulnerable to collision-based attacks.
Where Rabin-Karp does shine is when searching for multiple same-length patterns. With trivial modifications, the algorithm above can be extended to match the current substring's checksum against a list of pre-computed pattern checksums. Provided this lookup operation is implemented efficiently enough, Rabin-Karp becomes an efficient alternative to repeatedly running a faster single-pattern search algorithm.
The modified pseudo-code is presented below:
```
matches = {}
patterns_checksums = [checksum(pattern) for pattern in patterns]
substring_checksum = checksum(s[0 : k[)
for position from 0 to n - k do
    update substring_checksum
    if substring_checksum belongs to patterns_checksums then  # (→ 1)
        add position to matches
return matches
```
All that remains is to find a proper implementation of the belongs to (→ 1) operation. Hash tables, though they might seem appropriate, require a large amount of memory to avoid introducing extra collisions, and they introduce a slight performance overhead. Instead, sorting the pattern checksums and binary-searching the resulting array requires no additional memory, and introduces very little overhead for reasonable pattern counts.
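A minimal sketch of this sorted-array lookup in Python (the names are illustrative), using the standard bisect module:

```python
import bisect

def make_lookup(checksums):
    """Pre-sorts the pattern checksums, then returns a fast membership test."""
    table = sorted(checksums)
    def contains(value):
        i = bisect.bisect_left(table, value)
        return i < len(table) and table[i] == value
    return contains

belongs_to = make_lookup([0x2A, 0x07, 0x13])
print(belongs_to(0x13), belongs_to(0x14))  # True False
```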
Let us now measure the effect of using this modified algorithm to search for large numbers of patterns. Four implementations are compared:
for pattern in patterns:
    if haystack.find(pattern) >= 0:
        return True
for (int pattern_id = 0; pattern_id < patterns_count; pattern_id++)
    if (strstr(haystack, patterns[pattern_id]))
        return true;
Execution time grows linearly with the number of patterns, making these naive approaches useless for more than about 50 patterns. They also require multiple passes over the input.
regexp = "|".join(patterns)
if re.search(regexp, haystack):
    return True
Execution time also grows linearly with the number of patterns, but the growth is much slower, making this solution practical for pattern counts under 5,000. It requires only one pass over the data, but the input patterns must be sanitized for the regular expression to compile properly.
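The sanitization step mentioned above can be handled with re.escape; here is a small sketch (the wrapper function and its name are mine):

```python
import re

def any_pattern_found(haystack, patterns):
    # re.escape neutralizes regex metacharacters such as '.', '*' or '('
    regexp = "|".join(re.escape(pattern) for pattern in patterns)
    return re.search(regexp, haystack) is not None
```

Without escaping, a pattern such as "$4.99" would be silently misinterpreted as a regular expression instead of a literal string.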
The following graph shows the respective performance of these implementations.
Multi-pattern search performance. While naive implementations quickly become impractical, the regex-based implementation yields pretty good performance. It is eventually superseded, however, by Rabin-Karp–based implementations, notably by the hardware modulus implementation. (Experimental settings: searching for random strings of length 11 in pg2554.)
Although its performance is only middling on single-pattern string searching problems, the randomized algorithm presented above shines by its conceptual simplicity, its ease of implementation, and its performance on multi-string searching problems. One of its most successful applications is in similarity-detection tools, which search for a large number of known sentences in newly submitted texts (articles, exams, …).
The original paper by Rabin & Karp (1986) is a fascinating read; it includes many more details on the algorithm’s expected performance in various settings, and generalizations to a broader class of problems. Further results on the average-case complexity of this algorithm were presented by Gonnet & Baeza-Yates in 1990.
Another approach to the multi-string matching problem, which relies on the remarks made above about converting regular expressions to finite automata, was presented by Aho & Corasick in 1975. It preprocesses the input set of patterns to construct a deterministic finite automaton, thus achieving one-pass, linear-time matching of all patterns in the set. Its implementation, however, is much trickier than that of Rabin-Karp.
Have you ever had to use string searching functions other than those provided by the standard library of your programming language? Which algorithms did you use?
This article presents notes and remarks that I gathered while working on a Chinese dictionary app for Windows Phone, YiXue Chinese Dictionary: mistakes I made, fun tips I wrote down, and so on.
I initially didn’t really intend to create a full blog post out of these notes, but their increasing number, and my app recently placing second in Microsoft France’s App Awards contest, gave me enough motivation to share them with the community. Along with various tips and tricks and answers to often-asked (but seldom answered) questions, I will discuss a number of performance improvements that specifically apply to Windows Phone apps.
You can skip the DOs and DON’Ts and jump directly to the tips and tricks section, or even straight to the performance tips section, if you feel really confident with Windows Phone development.
XAP files are just ZIP files with a fancy file extension.Shown above, the contents of the YiXue Dictionary XAP package.
As of March 2013, the limit for over-the-air downloads is 20MB. Above this limit, users will have to connect to a WiFi network before they can download your app. Below 20MB, you’re safe.
That said, keeping an eye on your XAP size is definitely a good idea. PC World ran two studies in 2012, concluding that the average 3G download speed in the United States is about 2.5Mbps. That’s about 3.2 seconds per MB of your app’s XAP — about 1 minute for a 20MB app downloaded over 3G. In areas with poorer coverage (assuming 1Mbps), this will probably climb to 2 or 3 minutes.
The Silverlight Toolkit for Windows Phone is a library distributed by Microsoft, which includes a number of controls that didn’t make it into the official SDK. It’s frequently updated, it contains a number of gems that can greatly improve your app, and you can download it using NuGet straight from Visual Studio (more information here).
Though I was initially reluctant to increase my download size by bundling a 500kB dll with my app, I quickly realized it can dramatically improve the user experience. I’ll focus on just two things here: the PerformanceProgressBar, and page transitions (the Windows Phone Geek website has covered in great detail pretty much every control in the toolkit).
PerformanceProgressBar
It’s been said pretty much everywhere, but using the default Windows Phone ProgressBar in Indeterminate mode is a bad idea. The animation is based on custom-themed sliders, which eat up a lot of processing power (more on that below). The PerformanceProgressBar included in the Silverlight Toolkit fixes that.
Load the toolkit by adding the following to the phone:PhoneApplicationPage tag:
xmlns:tk="clr-namespace:Microsoft.Phone.Controls;assembly=Microsoft.Phone.Controls.Toolkit"
and create a progress bar using the following:
<tk:PerformanceProgressBar IsIndeterminate="true" />
Page transitions (toolkit:TransitionService.NavigationInTransition)

It’s pretty standard in all system apps on Windows Phone: when you tap a button or a link, the current page flips out, and a new page flips in.
Apart from the eye candy, the main advantage of this animation is that it vastly improves the perceived speed of your app, by virtually suppressing the delay that separates user actions (tapping a button, clicking a link, picking a ListBox item) from the corresponding reaction; the animation starts immediately, and the constructor and OnNavigatedTo method of your new page are called while the animation runs.
This way, your users have direct feedback that their tap was taken into account. I initially implemented that in YiXue because the pretty complex layout of the Dictionary page would take a split second to render, and users would tap two or three times on the “Dictionary” button, thinking the tap hadn’t been taken into account.
To improve the visual consistency of your app, I recommend using the animation on all your pages. One neat thing is that this animation can be bundled in a style, and reused on all pages without repeating the same code on each. That’s something I didn’t know at first, and that was another reason I was reluctant to generalize the use of that animation to all my pages. But it’s actually rather simple:
First declare the toolkit namespace in your App.xaml using the following line:
xmlns:toolkit="clr-namespace:Microsoft.Phone.Controls;assembly=Microsoft.Phone.Controls.Toolkit"
Then declare the following style in your App.xaml file:
<Style x:Key="TransitionPageStyle" TargetType="phone:PhoneApplicationPage">
    <Setter Property="toolkit:TransitionService.NavigationInTransition">
        <Setter.Value>
            <toolkit:NavigationInTransition>
                <toolkit:NavigationInTransition.Backward>
                    <toolkit:TurnstileTransition Mode="BackwardIn"/>
                </toolkit:NavigationInTransition.Backward>
                <toolkit:NavigationInTransition.Forward>
                    <toolkit:TurnstileTransition Mode="ForwardIn"/>
                </toolkit:NavigationInTransition.Forward>
            </toolkit:NavigationInTransition>
        </Setter.Value>
    </Setter>
    <Setter Property="toolkit:TransitionService.NavigationOutTransition">
        <Setter.Value>
            <toolkit:NavigationOutTransition>
                <toolkit:NavigationOutTransition.Backward>
                    <toolkit:TurnstileTransition Mode="BackwardOut"/>
                </toolkit:NavigationOutTransition.Backward>
                <toolkit:NavigationOutTransition.Forward>
                    <toolkit:TurnstileTransition Mode="ForwardOut"/>
                </toolkit:NavigationOutTransition.Forward>
            </toolkit:NavigationOutTransition>
        </Setter.Value>
    </Setter>
</Style>
Finally, reuse your new style on all pages like this:
<phone:PhoneApplicationPage (...) Style="{StaticResource TransitionPageStyle}">
Trying to hit the “change language” button (繁) of the keyboard when the application bar is minimized generally results in a lot of frustration.
That’s a usability tip which plenty of apps on the marketplace fail to follow, especially translation / dictionary apps. If you have TextBoxes on a page showing an ApplicationBar, make sure it is not minimized. Otherwise, your international users will have the frustration of hitting the application bar instead of the “change language” button on the keyboard, thereby expanding the application bar above the keyboard. See the screenshot if you’re not sure what I mean by that.
The solution is pretty simple: either hide the application bar when your TextBox gets the focus, or just set the ApplicationBar’s Mode to Default instead of Minimized.
DO:
<shell:ApplicationBar Mode="Default">
DON’T:
<shell:ApplicationBar Mode="Minimized">
Don’t use the GestureService from the Silverlight toolkit

However awesome it might be, the Silverlight toolkit still has a few rough edges. In particular, it doesn’t unregister its listeners when you use a GestureService (the GestureService allows you to catch touch events like Flick, Slide, and so on). This will bite you if you use an InkSurface for drawing at some point, because a GestureService loaded on another page will keep firing touch events although the page it initially came from should have been disposed. Using the GestureService, furthermore, will prevent your pages from being properly disposed.
Fortunately, there’s a pretty simple workaround: use the XNA TouchPanel instead.
First, enable the listener (this can be done in your app initialization code):
Microsoft.Xna.Framework.Input.Touch.TouchPanel.EnabledGestures = Microsoft.Xna.Framework.Input.Touch.GestureType.Flick;
Then, register the ManipulationCompleted event on a control:
private void Control_ManipulationCompleted(object sender, ManipulationCompletedEventArgs e)
{
    GestureSample gesture = TouchPanel.ReadGesture();
    float delta_x = gesture.Delta.X, delta_y = gesture.Delta.Y;
    if (gesture.GestureType == GestureType.Flick && Math.Abs(delta_x) > 1.5 * Math.Abs(delta_y))
    {
        // User flicked horizontally (delta_x < 0: flicked left; otherwise right)
    }
}
That works well, but not perfectly. Can you guess why? The catch is that the TouchPanel will record all gestures made on the page; even if only part of your page is touch-enabled using ManipulationCompleted, the call to TouchPanel.ReadGesture will also report gestures made outside of that area.
The problem, luckily, is easily fixed by discarding all gestures but the last when ManipulationCompleted fires; replace the body of the previous function with the following snippet:
GestureSample? gesture = null;
while (TouchPanel.IsGestureAvailable)  // Discard all gestures but the last
    gesture = TouchPanel.ReadGesture();
if (gesture.HasValue)
{
    float delta_x = gesture.Value.Delta.X, delta_y = gesture.Value.Delta.Y;
    if (gesture.Value.GestureType == GestureType.Flick && Math.Abs(delta_x) > 1.5 * Math.Abs(delta_y))
    {
        // User flicked horizontally (delta_x < 0: flicked left; otherwise right)
    }
}
Windows Phone users can choose between light and dark themes, which at the core are just white text on a dark background, or black text on a light background.
If your app includes a background image (see my tip on that below), you’ll probably want to force one of the color themes (light / dark). I’d advise going for a rather dark background, since that will use much less battery; in that case you will want to enforce the dark theme in your app. I did that in YiXue Dictionary, including a piece of famous Chinese calligraphy in the background, and users told me they really liked it.
First, note that you can’t directly replace an existing value in the App.Current.Resources dictionary. Removing the key and adding it again, however, does work (it took me a while to find that out):
private static void SetAppResource(object key, object value)
{
    // Cannot replace an existing value:
    // using 'App.Current.Resources[foo] = bar' won't work.
    if (App.Current.Resources.Contains(key))
        App.Current.Resources.Remove(key);
    App.Current.Resources.Add(key, value);
}
You can also override properties of existing objects, like system brushes. For your convenience, here is a quick snippet setting most of the common color settings to their default Dark Theme values.
//Code from http://pit-claudel.fr/clement/blog/ ; attribution much appreciated!
(App.Current.Resources["PhoneForegroundBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneBackgroundBrush"] as SolidColorBrush).Color = Colors.Black;
(App.Current.Resources["PhoneDisabledBrush"] as SolidColorBrush).Color = Color.FromArgb(102, 255, 255, 255);
(App.Current.Resources["PhoneInactiveBrush"] as SolidColorBrush).Color = Color.FromArgb(51, 255, 255, 255);
(App.Current.Resources["PhoneContrastBackgroundBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneContrastForegroundBrush"] as SolidColorBrush).Color = Colors.Black;
(App.Current.Resources["PhoneTextBoxBrush"] as SolidColorBrush).Color = Color.FromArgb(191, 255, 255, 255);
(App.Current.Resources["PhoneTextBoxForegroundBrush"] as SolidColorBrush).Color = Colors.Black;
(App.Current.Resources["PhoneTextBoxSelectionForegroundBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneTextBoxEditBackgroundBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneTextBoxEditBorderBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneTextBoxReadOnlyBrush"] as SolidColorBrush).Color = Color.FromArgb(119, 0, 0, 0);
(App.Current.Resources["PhoneRadioCheckBoxBrush"] as SolidColorBrush).Color = Color.FromArgb(191, 255, 255, 255);
(App.Current.Resources["PhoneRadioCheckBoxCheckBrush"] as SolidColorBrush).Color = Colors.Black;
(App.Current.Resources["PhoneRadioCheckBoxCheckDisabledBrush"] as SolidColorBrush).Color = Color.FromArgb(102, 0, 0, 0);
(App.Current.Resources["PhoneRadioCheckBoxDisabledBrush"] as SolidColorBrush).Color = Color.FromArgb(102, 255, 255, 255);
(App.Current.Resources["PhoneRadioCheckBoxPressedBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneRadioCheckBoxPressedBorderBrush"] as SolidColorBrush).Color = Colors.White;
(App.Current.Resources["PhoneBorderBrush"] as SolidColorBrush).Color = Color.FromArgb(191, 255, 255, 255);
(App.Current.Resources["PhoneSubtleBrush"] as SolidColorBrush).Color = Color.FromArgb(153, 255, 255, 255);
As a side note, remember that you can’t set the system tray’s foreground color to pure white. Just set it to something like #EFEFEF instead, and it will look just as if it were all white.
DO:
shell:SystemTray.ForegroundColor="#FEFEFE"
DON’T:
shell:SystemTray.ForegroundColor="#FFFFFF"
A TextBox styled like a TextBlock, with some text highlighted.
Since text can’t be directly copied from TextBlocks, most websites advise that you use a ReadOnly TextBox instead, styled to look just like a TextBlock. That’s pretty sweet, but they fail to mention that users need to give input focus to a TextBox before they can select text. This means they’ll have to tap on your modified TextBox twice: once to give it focus (but the caret won’t appear, since the TextBox is read-only), and a second time to actually select text. As a user, this drives me nuts.
Luckily there’s a pretty simple work-around: if it’s a very small block of text, intercept the Tap event and select all text immediately. Otherwise, use the GotFocus event. Users can then refine their selection if they want, but they can also copy all contents straight away.
private void CopyableTextArea_Tap(object sender, GestureEventArgs e)
{
    (sender as TextBox).SelectAll();
}
Pages sharing a common background, with the Background property of their LayoutRoot set to Transparent.
If you want to set an app-level background image (perhaps the same as your splash screen, although I’d advise using something a bit darker to improve the legibility of text over your background), manually setting the background image on each page is not the best approach, especially if you use page-turning animations.
Indeed, if you set the background on a per-page level, you’ll quickly notice that the background flips with the rest of the page content when you change pages. It would be much better if it stayed in place, and only page contents flipped out when changing pages. It’s actually pretty easy to do if you know the trick (that, too, took me a while to find):
In App.xaml:
<ImageBrush x:Key="MainBackground" ImageSource="/resources/MainBackground.jpg" />
In App.xaml.cs:
(App.Current as App).RootFrame.Background = App.Current.Resources["MainBackground"] as Brush;
Now the catch is, if some of your pages show an application bar and others don’t, your background will be resized to fit only the available space: the application bar will force it to shrink, which is pretty ugly. Instead, what I do is add a 72-pixel-high row to the LayoutRoot grid, and set the ApplicationBar’s Opacity to 0.99 (choosing a value other than 1 draws the app bar above your page instead of shrinking it). That way, the application bar is drawn above your page, and nothing is hidden behind it because there’s an empty row of just the right height below:
DO:
In your Grid’s RowDefinitions:
<RowDefinition Height="72" />
Further below:
<phone:PhoneApplicationPage.ApplicationBar>
    <shell:ApplicationBar IsVisible="True" Opacity="0.99" Mode="Default">
        (...)
    </shell:ApplicationBar>
</phone:PhoneApplicationPage.ApplicationBar>
Abandoning your ProgressBars in favour of text-only loading messages can yield substantial performance improvements.
Really, that’s pretty universal advice. If you’ve never tried, you’ll probably be surprised how much performance you can gain by simply disabling your “loading” progress bar and replacing it with a Loading... message, possibly including a percentage at the end, as in Loading... 10%. In YiXue, where the initial loading of the dictionary is an area I’ve worked on a lot, disabling the Loading... progress bar reduced the overall loading time from 3.2 seconds to 2.7s. That’s a sizable 0.5 seconds, or a 15% performance improvement :)
Another common mistake when displaying progress is to use a timer running in the background and firing every x milliseconds to update your progress bar. I did this in YiXue, because the animation of the progress bar I used back then would be smoother that way. The loading time was 3.8 seconds then, and the UI timer fired every 20ms. Increasing this delay to 100ms took the loading time down to 3.5 seconds. Removing the UI timer altogether and firing a ReportProgress event after each significant loading step took that down to 3.2 seconds… and getting rid of the progress bar brought it down to 2.7s.
XAML complexity is, in my experience, the main cause of pages loading slowly. Complex bindings (or even simple ones, really, if you’re populating a list) are also very time-consuming. If you’re after the last bit of performance, consider populating your list manually, possibly by removing your ListBox altogether and using a vertical StackPanel to which you directly add your items, in code-behind. I know that doesn’t really fit the MVVM pattern, but when I had to display a grid of 214 elements, binding a ListBox using a WrapPanel as its ItemsPanel took close to 2 seconds, while populating a StackPanel only took 0.3 seconds.
To the best of my knowledge, the toolkit’s WrapPanel doesn’t do any kind of UI virtualization, making it useless for displaying moderately large amounts of information.
Don’t use the PivotControl for what it’s not made for

The PivotControl has to render all its pages before it’s displayed. That’s a huge performance killer when you use it to display multiple data sets (plus, it’s against the design guidelines: pivots are used throughout the system to show multiple visualisations of the same data set — all mail, unread, starred, and so on, for example).
In YiXue Dictionary I have study lists (roughly equivalent to stacks of flash cards, only better ;), and I initially displayed them using a Pivot control, one list per pivot page. When I tested that page with 12 study lists containing about 40 words each, not only did it take 2 seconds to render the page, but the memory profiler reported a 130MB memory usage (!). I’ve now replaced it with a single list, letting users flip between pages using gestures, which doesn’t use more than a few MBs.
The WP emulator is really neat, but there’s one thing it really doesn’t mimic realistically, namely I/O performance. Although it’s much better on my Lumia than on my old HTC, I/O is still much slower on a phone than on your development computer. That’s really something to keep in mind when testing your app, especially if you don’t have a device to test on. Very roughly, my testing indicates that an I/O operation that takes 1 second on your computer will take 2 or 3 seconds on the Lumia 920, 3 to 5 seconds on the Lumia 800, and 5 to 7 seconds on the Trophy. Take these timings with a pinch of salt, since they are really gross approximations, but the general figures are about right. Similarly, garbage collections will be much slower on your phone than in the emulator; if your loading procedure involves a lot of text parsing, it’ll suffer a lot from that on your phone.
So how do you improve that loading performance? It turns out there are a number of simple tricks you can use. I won’t go into too much detail here, because there’s so much to say, and because you obviously should run benchmarks for your micro-optimisations, but the following general principles will help:
Screenshot from MSDN’s remarks section regarding String.Split.
String.Split is nice, but it’s a garbage collector nightmare. It allocates plenty of intermediate buffers (the documentation is pretty straightforward about that, with a section called “Performance Considerations” well worth reading), and it’s too generic to be as fast as it could be. If you’re really after the last bit of performance, re-code the split function by hand. It’s really not hard; here it is:
// Assumes that the string to split ends with a '\t'
int start = -1, end = -1;
List<string> parts = new List<string>();
while (end + 1 < string_to_split.Length)
{
    start = ++end;
    while (string_to_split[end] != '\t')
        ++end;
    parts.Add(string_to_split.Substring(start, end - start));
}
In YiXue, that shaved an extra 30% off the loading time.
Changing the build action of your images from Resource to Content will neatly improve the splash screen time, because your images will only be loaded when needed, instead of being loaded when your app starts.
Left: efficient representation. Right: Human-readable version of the same data. Smaller is better.
Most apps store user data as XML, using an XmlSerializer. That’s really neat, but it can get slow if you don’t take a few precautions. There are two things that really helped me in YiXue. The first is disabling white space in the XML output: your files won’t be as legible to a human as they were before, but it easily saved me 20% off the size of my user data files, and the parser won’t see the difference.
XmlWriterSettings settings = new XmlWriterSettings { Indent = false, NewLineHandling = NewLineHandling.None };
The second is specifying defaults. In YiXue there can be plenty of info associated with each word in your study lists, but most words don’t have all that info filled in. You can get the XML serializer to omit the empty properties by simply specifying the default value for the attribute this way:
[DefaultValue(0)] public int CorrectAnswersGiven;
I’d also advise choosing very short aliases for your attribute and element names; for example:
[XmlAttribute(AttributeName = "f")] public string foo;
This will further shrink your data files, and will make it much easier to rename your variables should you want to.
If you need time to load app resources (beyond the split-second), entertain your users while your app is loading by including tips and fun facts: they won’t notice the delay if they have something else to focus on — game developers know this, and always include tips in their loading screens. This principle is easily extended to other apps.
In YiXue, I show tips and fun facts about learning Chinese while the app loads.
That’s pretty much all for now. Hopefully that helps you create even better apps!
You can follow me on Twitter here (me) and here (my YiXue Dictionary app). Of course, links to http://pit-claudel.fr/clement/yixue/ and to this blog are most appreciated. And since I’m sure I forgot plenty:
DO: share your own tips and tricks in the comments!
Happy WP development!
Clément.
PS: If you liked this article, you can flattr this blog post using the link below, or even better buy the app!
Suppose one draws two random strings \(A = a_0a_1a_2\ldots\) and \(B = b_0b_1b_2\ldots\) of infinite length, with \(a_n\) and \(b_n\) respectively denoting the \(n\)^{th} character of \(A\) and \(B\). We will assume that characters are independent, and uniformly drawn from the set \(\{a, b, c, \ldots, z\}\). For now, we will consider all strings to be of infinite length; we’ll study bounded-length strings in the next section.
Given that characters making up \(A\) and \(B\) are independent, the probability of \(A\) differing from \(B\) at index \(n\) is \(25/26\). Conversely, the probability of \(A\) matching string \(B\) at index \(n\) is \(1/26\).
Consequently, the probability of \(A\) matching \(B\) up to index \(n – 1\), and differing at index \(n\), is \((1/26)^n ⋅ (25/26)\). Therefore, given that comparing strings up to index \(n\) included requires \((n+1)\) character comparisons, the expected number of characters to compare to discriminate \(A\) and \(B\) is $$ C = \sum_{n=0}^{\infty} (n+1) ⋅ (1/26)^n ⋅ (25/26) = 26/25$$
Indeed, defining \(C(x) = (1-x) ⋅ \sum_{n=0}^{\infty} (n+1) x^n\), the expected number of comparisons \(C\) required to discriminate \(A\) and \(B\) is \(C(1/26)\). Now \(C(x) = (1-x) ⋅ \sum_{n=0}^{\infty} \frac{∂ x^{n+1}}{∂ x} = (1-x) ⋅ \frac{∂}{∂ x} \sum_{n=0}^{\infty} x^{n+1} = (1-x) ⋅ (1-x)^{-2} = 1/(1-x)\). Plugging in \(x=1/26\) yields \(C = 26/25 \approx 1.04\).
In other words, it takes an average of \(1.04\) character comparisons to discriminate two random strings of infinite length. That’s rather cheap.
In this section we assume that \(A\) and \(B\) are bounded in length by a constant \(L\). This implies that the probability of the comparison routine reaching any index greater than \((L-1)\) is \(0\); hence the expected number of comparisons \(C_L\) is $$C_L = \sum_{n=0}^{L-1} (n+1) ⋅ (1/26)^n ⋅ (25/26) = (26/25) ⋅ (1 – \frac{25L + 26}{26^{L+1}})$$
For any \(L > 3\), this is virtually indistinguishable from the previously calculated value \(C = 1.04\).
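These formulas are easy to check numerically; the following snippet (a verification sketch of mine, not from the original article) compares the direct sum defining \(C_L\) with the closed form derived above:

```python
def direct_sum(L):
    """Expected comparisons for strings bounded by length L (direct summation)."""
    return sum((n + 1) * (1 / 26) ** n * (25 / 26) for n in range(L))

def closed_form(L):
    """The closed form: (26/25) * (1 - (25L + 26) / 26^(L+1))."""
    return (26 / 25) * (1 - (25 * L + 26) / 26 ** (L + 1))

# Both expressions agree, and converge to C = 26/25 = 1.04 as L grows
for L in (1, 2, 5, 10):
    assert abs(direct_sum(L) - closed_form(L)) < 1e-12
```

Already at \(L = 5\), both values agree with \(26/25\) to six decimal places.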
The theoretical amortized time required to compare two strings is constant — independent of their length. Of course, there may be a sizeable overhead associated with comparing strings instead of ints: experimental measurements of this overhead are presented below.
The following charts present the time it took to perform 100’000 string comparisons on my laptop, compared to a baseline of 100’000 integer comparisons, in 3 different programming languages and for string lengths varying from 1 to 500 characters. Despite large variations for small string lengths, measured times quickly converge to a fixed limit as strings lengthen.
Time required to compare 100’000 pairs of random strings, as a function of string length, in Python (3.2.3)
Time required to compare 100’000 pairs of random strings, as a function of string length, in C# (csc 4.0.30319.17929)
Time required to compare 100’000 pairs of random strings, as a function of string length, in C++ (gcc 4.3.4).
Though large strings (over ~200 characters) experimentally match our statistical model, small strings exhibit a much more complex behaviour. I can think of two causes for such variations:
Comparing strings, especially small strings, is very fast. The amortized time required to compare two strings is bounded by a constant, which is experimentally about twice the time required to compare two integers in high-level languages like Python or C#, and 5 to 7 times higher in low-level languages like C++.
Do you usually try to avoid comparing strings? Do you have a more detailed explanation of the irregularities observed for lengths < 200 characters? Tell us in the comments!
Von Neumann originally proposed the following technique for getting an unbiased result from a biased coin:
If independence of successive tosses is assumed, we can reconstruct a 50-50 chance out of even a badly biased coin by tossing twice. If we get heads-heads or tails-tails, we reject the tosses and try again. If we get heads-tails (or tails-heads), we accept the result as heads (or tails).
This algorithm is illustrated below:
Flowchart recapitulating Von Neumann’s bias-suppressing algorithm
This amazingly simple strategy does yield heads and tails with equal probabilities, because coin tosses, though biased, are assumed to be independent. Therefore, if heads come out with probability \(p\), and tails with probability \(1-p\), then both (head, tails) and (tails, heads) sequences occur with probability \(p(1-p)\).
Generalizing this technique to a stream of random digits is straightforward: pack digits two-by-two, discard any pair of two equal digits (00 or 11), and finally convert 01 to 0 and 10 to 1.
Illustration of Von Neumann’s bias-suppressing algorithm on a stream of random digits
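For concreteness, here is a short Python rendition of this stream version (my own sketch; following the convention above, a (0, 1) pair yields 0 and a (1, 0) pair yields 1):

```python
import random

def von_neumann(bits):
    """Extract unbiased bits from a stream of independent, identically biased bits."""
    out = []
    it = iter(bits)
    for first, second in zip(it, it):  # pack the digits two-by-two
        if first != second:            # keep 01 -> 0 and 10 -> 1; discard 00 and 11
            out.append(first)
    return out

# A heavily biased source: 1 with probability 0.9
random.seed(0)
biased = [1 if random.random() < 0.9 else 0 for _ in range(200_000)]
unbiased = von_neumann(biased)
frequency = sum(unbiased) / len(unbiased)  # close to 0.5 despite the 0.9 bias
```

Running this, the frequency of 1s in the output hovers around 0.5, at the cost of discarding most of the input pairs.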
Continuing the previous calculations, since each coin toss is taken to be independent from the others, both (heads, tails) and (tails, heads) sequences will occur with probability \(p(1-p)\), while (heads, heads) and (tails, tails) sequences will occur with respective probabilities \(p^2\) and \((1-p)^2\).
Define the expected rejection rate \(r\) to be the probability of obtaining (heads, heads) or (tails, tails) (and thus of discarding the sequence). Then \(r = p^2 + (1-p)^2 = 1 – 2 ⋅ p(1 – p) = (1/2) + 2 ⋅ (p – 1/2)^2\), which implies that \(r ≥ 1/2\). In other words, even in the best case (when the coin is perfectly balanced), half of the coin sequences will be discarded.
Rejection rate as a function of heads probability
Since the rejection rate is \(r = 1 – 2 ⋅ p(1 – p)\), the odds of discarding exactly \(n\) two-coins sequences before obtaining an acceptable sequence are \(r^n ⋅ (1-r)\), and the expected number of two-coins sequences to be discarded is \(d = \sum_{k=0}^\infty k ⋅ r^k ⋅ (1-r) = \frac{r}{1-r}\).
Indeed, $$\sum_{k=0}^\infty k ⋅ r^k ⋅ (1-r) = r(1-r) ⋅ \sum_{k=0}^\infty k r^{k-1} = r(1-r) ⋅ \frac{∂}{∂r} \sum_{k=0}^\infty r^k = r(1-r) ⋅ \frac{∂}{∂r} \frac{1}{1-r} = \frac{r}{1-r}$$
Now, if \(d\) two-coins sequences are discarded before an acceptable sequence is obtained, then the number of coin tosses required is \(2 ⋅ (1 + d) = \frac{1}{p(1-p)} ≥ 4\). In other words, it will take in the best case an average of 4 single-coin tosses to obtain an acceptable sequence. Or, as Von Neumann puts it,
The resulting process is rigorously unbiased, although the amended process is at most 25 percent as efficient as ordinary coin-tossing.
Expected number of tosses before an acceptable two-coins sequence is found, plotted as a function of heads probability
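The \(\frac{1}{p(1-p)}\) figure is easy to confirm by simulation; here is a quick sketch under the independence assumption above (the sample size is an arbitrary choice of mine):

```python
import random

def tosses_until_accept(p, rng):
    """Toss a p-biased coin in pairs until a mixed (heads, tails) pair appears."""
    tosses = 0
    while True:
        heads1, heads2 = rng.random() < p, rng.random() < p
        tosses += 2
        if heads1 != heads2:
            return tosses

rng = random.Random(42)
averages = {}
for p in (0.5, 0.7, 0.9):
    averages[p] = sum(tosses_until_accept(p, rng) for _ in range(50_000)) / 50_000
    # theoretical value 1 / (p * (1 - p)): 4.0, ~4.76 and ~11.1 tosses respectively
```

The simulated averages land within a fraction of a toss of the theoretical \(1/(p(1-p))\), including the best-case 4 tosses for a fair coin.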
The general idea behind Von Neumann-style skew-correction techniques is to consider sequences of coin tosses instead of isolated ones, and pick a sequence length long enough that an even number of possible outcomes have equal probabilities. These outcomes can then be split in two groups, one representing heads and the other representing tails, while other outcomes are discarded.
Generalizing this strategy to loaded dice excludes picking a sequence length \(n ≤ 2\), since for \(n=1\) or \(n=2\) the possible outcomes cannot in general be partitioned into six sets of equal probability. For \(n=3\), on the other hand, the outcomes of any die (crooked or not) can be partitioned into six categories of equal probability according to the relative ordering of the successive numbers rolled, provided these numbers are all different; that is, sequences can be grouped according to whether the second number is greater (↑) or smaller (↓) than the first, and whether the third number is greater (↑↑↑, ↓↑↑) or smaller (↑↓↓, ↓↓↓) than both the first and the second, or between them (↑↑↓, ↓↓↑).
This strategy is illustrated in the diagram below, which provides a reference table for interpreting sequences of three rolls of a loaded die to produce unbiased results. Should the same number occur twice, all three rolls should be discarded, and the process started over.
Decision table for suppressing dice bias. Given a sequence of three throws, chart the three throws so as to form an arrow pattern, and use this table to obtain the corresponding result. Examples are given in green for each pattern. Using this table, rolling \(1 2 5\) would yield a \(1\), while \(1 5 2\) would yield a \(2\), and \(5 2 1\), a \(4\).
All six possible orderings will occur with equal probability, because each of the sequences belonging to any of these orderings matches exactly one sequence of equal probability in every other ordering: for example, \(1 2 4\) matches the ordering ↑↑↑, whereas \(2 1 4\), which is equally likely, matches ↓↑↑.
Given a sequence of three independent rolls of a possibly biased die, this function returns a fair roll using the technique described above:
def translate(*rolls):
    # Returns False if 'rolls' contains duplicates, and a fair die roll otherwise.
    # Possible relative orderings of the three rolls: 1 2 3 is ↑↑↑, 2 1 3 is ↓↑↑.
    orderings = [(True, True, True), (True, True, False),
                 (True, False, False), (False, False, False),
                 (False, False, True), (False, True, True)]  # True stands for ↑
    (r1, r2, r3) = rolls
    unique = len(set(rolls)) >= 3
    return unique and 1 + orderings.index((r2 > r1, r3 > r1, r3 > r2))
Here’s a test function. The bias parameter measures how unbalanced the die is: the higher the bias, the higher the probability of rolling a 6.
import random

def unfair_roll(bias):
    return random.choice([1, 2, 3, 4, 5] + [6] * bias)

def test(number_of_rolls, bias):
    count = [0] * 7
    bias = max(bias, 1)
    for roll_id in range(0, number_of_rolls):
        rolls = unfair_roll(bias), unfair_roll(bias), unfair_roll(bias)
        count[translate(*rolls) or 0] += 1
    print("Rejection rate: {:.2f}%".format(count[0] / number_of_rolls * 100))
    print("Relative digit frequency:")
    for digit in range(1, 6 + 1):
        print("  {}: {:.2f}%".format(digit, count[digit] / number_of_rolls * 100))
    print("Before correction, 6 was {} times more likely to occur"
          " than other digits.".format(bias))
Here’s a single run for 50 000 sequences of three rolls of a die ten times more likely to give a 6:
>>> test(50*1000, 10)
Rejection rate: 80.52%
Relative digit frequency:
  1: 3.24%
  2: 3.30%
  3: 3.26%
  4: 3.18%
  5: 3.17%
  6: 3.32%
Before correction, 6 was 10 times more likely to occur than other digits.
Though easy to work out mentally, this bias-correction technique for dice is far from optimal. Indeed, the minimal rejection rate (obtained for a perfectly balanced die) is about 44%, and the rejection rate quickly increases as the bias gets stronger, reaching 64% when a 6 is 5 times more likely to occur than each other individual digit.
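These figures are easy to check numerically: a three-roll sequence is accepted exactly when all three rolls are distinct, so the rejection rate follows directly from the face probabilities. A quick sketch (the function name is mine):

```python
from itertools import permutations

def rejection_rate(probs):
    # Probability that three independent rolls of a die with face
    # probabilities 'probs' are NOT all distinct, i.e. get discarded.
    accept = sum(probs[i] * probs[j] * probs[k]
                 for (i, j, k) in permutations(range(6), 3))
    return 1 - accept

print(rejection_rate([1/6] * 6))          # fair die: about 0.444
print(rejection_rate([0.1] * 5 + [0.5]))  # 6 five times more likely: about 0.64
```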
More subtle techniques, which I will discuss in a future article, take the rejection rate down to 1.5% for perfectly balanced dice, and keep it under 7% for dice with a 6 to 1 chance of getting a 6 over any other digit.
Do you know other strategies to even the odds when playing with a loaded die? Sound off in the comments!
Let’s start with a little brain teaser:
Alice and Bob are at the cafeteria, sitting at different tables. Alice stands up to refill her table’s water jug. A little later, Bob stands up with his own table’s jug, heading to the same water dispenser. That water dispenser is a perfectly standard one, with two taps, and Bob finds himself standing next to Alice. After a brief hesitation, Bob starts using the second tap to fill his own jug, thereby diverting part of the output previously devoted to Alice.
This grants him an exasperated and somewhat puzzled look from Alice. Why?
The reason is quite simple: though most people will act just like Bob, using the water dispenser while someone else does is — in most cases — a useless waste of time. Indeed, starting to fill his own jug before Alice is done will not save Bob any time, but it will waste some of Alice’s time. That’s the cafeteria paradox.
Fortunately, this paradox is readily explained: assuming Alice and Bob use similar jugs, and both start with an empty jug, Alice will finish before Bob does; when Bob leaves, therefore, the water dispenser will have dispensed precisely the contents of two cafeteria jugs.
Now, since the combined output of the two taps equals that of only one, dispensing this amount of water will take a constant time (precisely twice the time that it would take to fill a single jug). Hence Bob will gain no time at all by starting to fill his jug right away, while some of Alice’s time will be wasted.
Not convinced? Imagine Bob and Alice are waiting for some pizza at a take-away restaurant. Given that Bob and Alice won’t start eating before they get to their respective homes, should Bob claim half of Alice’s pizza when it arrives, or let Alice get the full pizza first and wait for his own pizza? If Bob asks for half of Alice’s, both will wait for two pizzas to be ready; on the other hand, if Bob lets Alice go first, Alice will have waited for only one, while Bob will have waited for two. Clearly the second solution is better for Alice, and neutral for Bob; and the situation is exactly analogous to that of the water dispenser.
Still not convinced? Read on for a full mathematical description of this situation. Otherwise, skip to the conclusion (and the sticker!).
Let \(t_A\) and \(t_B\) be the respective times when Alice and Bob reach the water dispenser, and \(t_A’, t_B’\) the respective times when they leave. Let \(t^*\) be the time when Bob starts using the water dispenser. The water dispenser has a flow rate \(f\), and each jug has a capacity \(c\).
From \(t_A\) to \(t^*\), Alice’s jug is filling at rate \(f\); from \(t^*\) to \(t_A’\), at rate \(f/2\). Hence Alice’s departure time \(t_A’\) satisfies the following equation: $$ (t^* – t_A) ⋅ f + (t_A’ – t^*) ⋅ (f/2) = c$$
which yields \(t_A’ = t^* + \frac{c}{f/2} – 2 ⋅ (t^* – t_A)\), so $$ t_A’ = 2 t_A – t^* + \frac{c}{f/2} $$
Similarly, Bob’s departure time \(t_B’\) satisfies the following equation: $$ (t_A’ – t^*) ⋅ (f/2) + (t_B’ – t_A’) ⋅ f = c$$
which yields \(t_B’ = t_A’ + \frac{c}{f} – (1/2) ⋅ (t_A’ – t^*) \), so $$ t_B’ = (1 / 2) ⋅ (t_A’ + t^*) + \frac{c}{f} $$
Finally, substituting \(2 t_A – t^* + \frac{c}{f/2}\) for \(t_A’\) yields \(t_B’ = t_A + (1 / 2) \cdot \frac{c}{f/2} + \frac{c}{f}\); that is $$ t_B’ = t_A + 2 ⋅ \frac{c}{f} $$
To summarize, Alice will leave at \(t_A’ = 2 t_A – t^* + \frac{c}{f/2}\) (the later Bob starts, the earlier Alice will leave), while Bob will leave at \(t_B’ = t_A + 2 ⋅ \frac{c}{f}\), which does not depend on \(t_B\) or \(t^*\).
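These closed-form solutions are easy to sanity-check numerically; the sketch below (function and variable names are mine) confirms that Bob's departure time is independent of \(t^*\), while Alice's departure gets later the earlier Bob opens his tap:

```python
def departure_times(t_A, t_star, c, f):
    # Closed-form departure times derived above: Alice arrives at t_A,
    # Bob opens the second tap at t_star, jugs hold c, total flow rate is f.
    t_A_out = 2 * t_A - t_star + c / (f / 2)   # Alice's departure time
    t_B_out = t_A + 2 * c / f                  # Bob's departure time
    return t_A_out, t_B_out

# Bob's departure time never moves, but Alice leaves later the sooner Bob starts:
for t_star in (0.5, 1.0, 1.5):
    print(departure_times(t_A=0.0, t_star=t_star, c=2.0, f=1.0))
```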
The socially optimal solution is that Bob waits until Alice is done before starting to use the water dispenser; otherwise, some of Alice’s time will be wasted, though Bob will not benefit from it. The key to understanding this is that Bob will not leave the water dispenser before both jugs are full, and filling both jugs takes a constant time (the time needed for the water dispenser to deliver twice the volume of a jug, spanning from Alice’s arrival to Bob’s departure). Thus using the water dispenser while someone else does is not only selfish: it’s downright stupid.
Shocked? Leave a message in the comments! Otherwise join our campaign to stop multi-threading in water dispensers: stick the following message on your cafeteria’s water dispenser and post a picture of your sticker in the comments!
Title image © Cosmetal.
Now, there are only five golden tickets, and chocolate bars in the book seem quite popular. How many bars would one need to buy (and unwrap) just to have a sizeable chance of finding one of the tickets? Most likely a lot. After doing the maths, I would estimate the number of chocolate bars that Mr. Salt had to buy at something between 10 and 30 million chocolate bars; which means this promotional campaign was most certainly one of the most profitable in history. Details below.
In Charlie and the Chocolate Factory, Roald Dahl pictured the archetype of the spoiled little child in the character of Veruca Salt (left). When Willy Wonka, an eccentric, genius chocolate maker, announces that five golden tickets have been hidden underneath the wrapping of ordinary chocolate bars, granting their lucky discoverers a visit to his chocolate factory and a lifetime supply of sweets, the world goes into a chocolate-devouring frenzy. It’s only a few days before the first and second tickets are found, the latter by a girl named Veruca Salt. Here is how her father explains the discovery:
‘You see, boys,’ he had said, ‘as soon as my little girl told me that she simply had to have one of those Golden Tickets, I went out into the town and started buying up all the Wonka bars I could lay my hands on. Thousands of them, I must have bought. Hundreds of thousands!
Assume Mr. Salt wants to have a reasonable chance of finding one of the tickets, perhaps 70 or 80%. Let \(\alpha\) be the acceptable risk level, so that the probability of actually finding the ticket is \(1 – \alpha\). Since televisions seem rather common in the book, and it was published in 1964, assume the action takes place in the early 1960’s. Say the children population of England at the time is \(P\), and assume Mr. Wonka does most of his business in this country, and only with children (including other countries, or adults, would further reduce Mr. Salt’s chances of success). There are \(T = 5\) golden tickets in play, and we can estimate the number of bars sold per child at \(b = 3\) or more, given that Charlie Bucket, though extremely poor, buys three.
Excluding Mr. Salt’s compulsive buying, this generates a total static demand \(D = P \times b\). Let \(N\) be the number of chocolate bars sold in total, and \(n\) the number of bars bought by the Salt family, so \(N = D + n\) — the underlying assumption being that chocolate production is profitable at any level.
The chance of these \(n\) bars yielding no golden ticket is \(\binom{N-T}{n} / \binom{N}{n}\); conversely, the probability that at least one of these \(n\) bars contains a golden ticket is \(1 – \binom{N-T}{n} / \binom{N}{n}\). Assuming the family is ready to accept a probability of failure of \(\alpha\), \(n\) will be chosen so that \(1 – \binom{N-T}{n} / \binom{N}{n} > 1 – \alpha\), or equivalently that $$ \binom{N-T}{n} < \alpha \binom{N}{n} $$

Expanding this expression and substituting \(N – D\) for \(n\) yields \((N – T)! / (D – T)! < \alpha \, (N! / D!)\), which is the same as $$ \frac{D!}{(D – T)!} < \alpha \frac{N!}{(N – T)!} $$

The children population of England in the 60’s was about \(12.2\) million kids. Assuming each bought 3 chocolate bars, the static demand was \(36.6\) million bars. Finally, setting \(T = 5\) and numerically solving the inequality above yields \(N \approx 46.6\) million chocolate bars, and thus \(n \approx 10\) million chocolate bars. In other words, Mr. Salt would need to buy about 10 million chocolate bars to reach a 70% chance of securing a golden ticket. Taking this probability up to 95% would imply buying an extra 20 million chocolate bars, bringing the grand total to about 30 million chocolate bars.
Here’s the Octave/Matlab code used to numerically solve the inequality above. The key is to notice that \(\frac{D!}{(D – T)!}\) and \(\frac{N!}{(N – T)!}\) are both values taken by the monotonically increasing polynomial \(f(X) = \frac{X!}{(X-T)!} = \prod_{k=0}^{T-1} (X – k)\). Thus the inequality can be rewritten as \(g(N) > 0\), where \(g(N) = (f(N)/f(D)) – (1/\alpha)\).
function y = f(x, T)
  y = prod(x - (0:(T-1)));
endfunction

function y = g(N, D, T, alpha)
  y = f(N, T)/f(D, T) - 1/alpha;
endfunction

N = fzero(@(x) g(x, 36.6e6, 5, 0.3), [10e6, 10e20]);
Illustration by Quentin Blake. Title image from wikimedia.
Obtaining long sequences of random numbers, as required by some cryptographic algorithms, is a delicate problem. There are basically two types of random number generators: true random number generators, and pseudo-random number generators.
This article assumes no particular knowledge about the generation of random numbers, so if you already know about TRNGs and PRNGs, you may want to skip to the core of this article.
True random number generators rely on a physical source of randomness, and measure a quantity that is either theoretically unpredictable (quantum), or practically unpredictable (chaotic — so hard to predict that it appears random).
I’ve compiled a list of sources of true randomness (capable of sufficiently fast output); please let me know in the comments if I forgot any.
/dev/random in Unix-like operating systems

The coupled inverters technique devised by Greg Taylor and George Cox at Intel is truly the most fascinating one: it relies on collapsing the wave function of two inverters put into a superposed state.
Pseudo-random number generators are very different: they act as a black box, which takes one number (called the seed) and produces a sequence of bits; this sequence is said to be pseudo-random if it passes a number of statistical tests, and thus appears random. These tests are discussed in the following section; simple examples include measuring the frequency of bits and bit sequences, evaluating entropy by trying to compress the sequence, or generating random matrices and computing their rank.
A pseudo random number generator is formally defined by an initialization function, a state (a sequence of bits of bounded length), a transition function, and an output function:
A sequence of pseudo-random bits, called a run, is obtained by seeding the generator (that is, initializing it using a given seed to put it in its initial state), and repeatedly calling the transition function and the output function. This procedure is illustrated by the following diagram:
Schematic diagram of a pseudo-random number generator
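As a concrete illustration of this structure, here is a toy generator (a sketch using a xorshift-style transition function; it is emphatically not cryptographically secure) with explicit initialization, transition, and output functions:

```python
class ToyPRNG:
    # A toy PRNG illustrating the state/transition/output decomposition.
    # NOT cryptographically secure -- for illustration only.
    MASK = (1 << 64) - 1  # the state is a sequence of 64 bits

    def __init__(self, seed):
        # Initialization function: derive the initial state from the seed.
        self.state = (seed | 1) & self.MASK  # avoid the all-zero state

    def _transition(self):
        # Transition function: a xorshift64 state update.
        s = self.state
        s ^= (s << 13) & self.MASK
        s ^= s >> 7
        s ^= (s << 17) & self.MASK
        self.state = s

    def next_bit(self):
        # Output function: expose the low-order bit of the state.
        self._transition()
        return self.state & 1

rng = ToyPRNG(seed=42)
run = [rng.next_bit() for _ in range(16)]  # a short run of pseudo-random bits
```

Seeding two instances identically reproduces the exact same run.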
In particular, this means that the sequence of bits produced by a PRNG is fully determined by the seed. This has a few disagreeable consequences, among which the fact that a PRNG can only generate as many sequences as it can accept different seeds. Thus, since the range of values that the state can take is usually much wider than that of the seed, it is generally best to only seldom reset (re-seed) the generator.
Since PRNGs are commonly used for cryptographic purposes, the transition and output functions are sometimes required to satisfy two additional properties:
Rule 1 ensures that eavesdropping on the output of a PRNG doesn’t allow an attacker to learn more than the bits they overheard.
Rule 2 ensures that past communications wouldn’t be compromised should an attacker manage to gain knowledge of the current state of the PRNG (thereby compromising all future communications based on the same run).
When a PRNG satisfies these two properties, it is said to be cryptographically secure.
There are a number of statistical tests devised to distinguish pseudo-random number generators from true ones; the more tests a PRNG passes, the closer it is considered to be to a true random source. This section presents the general methodology, and studies a few example tests.
Almost all tests are built on the same structure: the pseudo-random stream of bits produced by a PRNG is transformed (bits are grouped or rearranged, arranged in matrices, some bits are removed, …), statistical measurements are made (number of zeros, of ones, matrix rank, …), and these measurements are compared to the values mathematically expected from a truly random sequence.
More precisely, assume that \(f\) is a function taking any finite sequence of zeros and ones, and returning a non-negative real value. Then, given a sequence of independent and uniformly distributed random variables \(X_n\), applying \(f\) to the finite sequence of random variables \((X_1, …, X_n)\) yields a new random variable, \(Y_n\). This new variable has a certain cumulative probability distribution \(F_n(x) = \mathbb{P}(Y_n \leq x)\), which in some cases approaches a function \(F\) as \(n\) grows large. This limit function \(F\) can be seen as the cumulative probability distribution of a new random variable, \(Y\), and in these cases \(Y_n\) is said to converge in distribution to \(Y\).
Randomness tests are generally based on carefully selecting such a function, and calculating the resulting limit probability distribution. Then, given a random sequence \(\epsilon_1, \ldots, \epsilon_n\), it is possible to calculate the value of \(y = f(\epsilon_1, \ldots, \epsilon_n)\), and assess how likely this value was by evaluating the probability \(\mathbb{P}(Y \geq y)\) — that is, the probability that a single draw of the limit random variable \(Y\) yields a result greater than or equal to \(y\). If this probability is lower than \(0.01\), the sequence \(\epsilon_1, \ldots, \epsilon_n\) is said to be non-random with confidence 99%. Otherwise, the sequence is considered random.
Let \((X_n)\) be an infinite sequence of independent, identically distributed random variables. Take \(f\) to be \(f(x_1, …, x_n) = \frac{1}{\sqrt{n}} \sum_{k=1}^n \frac{x_k – \mu}{\sigma}\) where \(\mu\) is the average, and \(\sigma\) the standard deviation, of any \(X_k\). Define \(Y_n = f(X_1, …, X_n)\). The central limit theorem states that the distribution of \(Y_n\) tends to the standard normal distribution. This theorem is illustrated below.
Example of convergence in distribution to the standard normal distribution, after summing 1, 2, 3, 4, 5, 8, 10, and 50 independent random variables following the same distribution.
Consider the particular case where \(X_n\) is uniformly distributed over the discrete set \(\{-1,1\}\). The central limit theorem applies, and \(Y_n = f(X_1, …, X_n)\) converges in distribution to the standard normal law. For practical reasons, however, we choose to consider \(Y_n = |f(X_1, …, X_n)|\), which converges in distribution to a variable \(Y_\infty\). \(Y_\infty\) has the half-normal distribution (illustrated on the left), which has the nice property of being monotonically decreasing.
In plain colours, the half-normal distribution. In pale colours, the corresponding cumulative distribution function. If a pseudo-random sequence is picked in the red area, it is declared non-random with 99% certainty (the yellow and orange areas correspond to a certainty of 90% and 95% respectively).
Here is how the test proceeds: the sequence \((\epsilon_1, \ldots, \epsilon_n)\) given by the PRNG is converted to a sequence of elements of \(\{-1, 1\}\) by changing each \(\epsilon_i\) to \(x_i = 2 \cdot \epsilon_i – 1\). This gives \(x_1, \ldots, x_n\). Then, \(|f(x_1, \ldots, x_n)| = \left|\sum_{k=1}^n \frac{x_k}{\sqrt{n}}\right|\) is numerically calculated, yielding \(y\). Finally, the theoretical likeliness of a truly random sequence yielding a value equal to or greater than \(y\) is evaluated using the cumulative distribution function of \(Y_\infty\) (in pale colours).
If this probability is too low (red area, probability below \(0.01\)), the sequence is rejected as non-random, with 99% certainty. If it is greater than \(0.01\), but less than \(0.05\) (orange area) the sequence is sometimes rejected as non-random with 95% certainty. Similarly, if it is greater than \(0.05\) but less than \(0.10\) (yellow area), the sequence is sometimes rejected as non-random, with 90% certainty.
High probabilities (green area, probabilities greater than \(0.10\)), finally, do not permit to distinguish the pseudo-random sequence from a truly random one.
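Putting the pieces together, the whole frequency test fits in a few lines (a sketch; the function name is mine): the bits are mapped to ±1, the normalized sum is taken in absolute value, and the half-normal tail probability \(\mathbb{P}(|N(0,1)| \geq y) = \mathrm{erfc}(y/\sqrt{2})\) gives the likelihood of the observed value.

```python
import math

def monobit_pvalue(bits):
    # Frequency (monobit) test: map bits to +/-1, normalize the sum,
    # and compute P(Y >= y) under the half-normal limit distribution.
    n = len(bits)
    s = sum(2 * b - 1 for b in bits)      # sum of the x_i = 2*eps_i - 1
    y = abs(s) / math.sqrt(n)             # y = |f(x_1, ..., x_n)|
    return math.erfc(y / math.sqrt(2))    # P(|N(0,1)| >= y)

print(monobit_pvalue([1] * 1000))    # vanishingly small: rejected as non-random
print(monobit_pvalue([0, 1] * 500))  # 1.0: balanced, so this test passes it
```

Note that a perfectly alternating sequence sails through this test, since bias in the 0/1 proportions is all it measures; catching that kind of regularity is the job of the runs test.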
To test whether a pseudo-random number generator is close to a true one, a sequence length is chosen, and \(m\) pseudo-random sequences of that length are retrieved from the PRNG, then analysed according to the previous methodology. It is expected, if the confidence level is 1%, that about 99% of the sequences pass, and 1% of the sequences fail; if the observed ratio significantly differs from 99:1, the PRNG is said to be non-random. More precisely, the confidence interval is generally chosen to be \(p \pm 3\frac{\sqrt{p(1-p)}}{\sqrt{m}}\), where \(p\) is the theoretical pass ratio (99% here); this takes the probability of incorrectly rejecting a good number generator down to 0.3%.
This confidence interval is obtained as follows. For large values of \(m\), approximate the probability of rejecting \(r\) sequences by \(\binom{m}{r} p^{m-r} (1-p)^r\) (here \(p = 99\%\)). Write \(\sigma_i\) for the random variable denoting whether the \(i\)^{th} sequence was rejected. With this approximation, the \(\sigma_i\) are independent and take value \(0\) with probability \(p\), and \(1\) with probability \(1-p\) (standard deviation: \(\sqrt{p(1-p)}\)). The central limit theorem then states that, with probability close to \(\displaystyle \int_{-a}^{a} \frac{e^{-x^2/2}}{\sqrt{2\pi}}\,\mathrm{d}x\), the observed frequency \(\hat{p}\) of accepted sequences lies in the interval \(\left[p - a\frac{\sqrt{p(1-p)}}{\sqrt{m}},\; p + a\frac{\sqrt{p(1-p)}}{\sqrt{m}}\right]\). Finally, choosing \(a\) so that this probability is \(\approx 99.7\%\) yields \(a \approx 3\).
A number of test suites have been proposed as standards. The state of the art was for a long time the DieHard test suite (designed by George Marsaglia), though it was eventually superseded by the National Institute of Standards and Technology recommendations. Pierre l’Ecuyer and Richard Simard, from the University of Montreal, have more recently published a new collection of tests, TestU01, gathering and implementing an impressive number of previously published tests and adding new ones.
A subset of these tests is presented below.
Purpose: Evaluate whether the sequence has a systematic bias towards 0 or 1 (real-time example on random.org)
Description: Verify that the arithmetic mean of the sequence approaches \(0.5\) (based on the law of large numbers). Alternatively, verify that the normalized partial sums \(s_n = \frac{1}{\sqrt{n}} \sum_{k=1}^n (2\cdot\epsilon_k – 1)\) approach a standard normal distribution (based on the central limit theorem).
Variants: Block frequency test (group bits into blocks of constant length and perform the same test), Frequency test within a block (group bits into blocks of constant length and perform the test on every block).
Purpose: Evaluate whether runs of ones or zeros are too long (or too short) (real-time example on random.org)
Description: Count the number of same-digit blocks (runs) in the sequence. This count, minus one, should follow a binomial distribution \(\mathcal{B}(n-1, 1/2)\). By noting that the number of runs is one plus the sum of the \(\sigma_i\), where \(\sigma_i = 1\) if \(\epsilon_i \not= \epsilon_{i+1}\) and \(\sigma_i = 0\) otherwise, the previous method can be applied.
Variants: Longest run of ones (divide the sequence in blocks and measure the longest run of ones in each block).
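Following the same recipe as the frequency test, a minimal version of the runs test can be sketched as follows (a sketch, with the normal approximation to \(\mathcal{B}(n-1, 1/2)\) applied to the count of switches between adjacent bits):

```python
import math

def runs_pvalue(bits):
    # Runs test: the number of adjacent unequal pairs ("switches") follows
    # B(n-1, 1/2) under randomness; normalize it and use the normal limit.
    n = len(bits)
    switches = sum(bits[i] != bits[i + 1] for i in range(n - 1))
    z = (switches - (n - 1) / 2) / (math.sqrt(n - 1) / 2)
    return math.erfc(abs(z) / math.sqrt(2))

print(runs_pvalue([0, 1] * 500))        # ~0: far too many runs, rejected
print(runs_pvalue([0, 0, 1, 1] * 250))  # close to 1: plausible run count
```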
Purpose: Search for linear dependencies between successive blocks of the sequence
Description: Partition the sequence in blocks of length \(n \times p\). Build one \(n \times p\) binary matrix from each block, and compute its rank. The theoretical distribution of the ranks is however not easy to compute — this problem is discussed in detail in a paper by Ian F. Blake and Chris Studholme from the University of Toronto.
Purpose: Search for periodic components in the pseudo-random sequence
Description: Compute a discrete Fourier transform of the pseudo-random input \(2\epsilon_1 – 1, \ldots, 2\epsilon_n – 1\), and take the modulus of each complex coefficient. Compute the 95% peak height threshold value, defined to be the theoretical value that no more than 5% of the previously calculated moduli should exceed (\(\sqrt{-n\log(0.05)}\) (NIST)), and count the actual number of moduli exceeding this threshold. Assess the likeliness of such a value.
Purpose: Determine whether the sequence can be compressed without loss of information (evaluate information density). A random sequence should have a high information density.
Description: Partition the sequence in blocks of length \(L\), and separate these blocks into two categories. Call the first \(Q\) blocks the initialisation sequence, and the remaining \(K\) blocks the test sequence. Then, give each block in the test sequence a score equal to the distance that separates it from the previous occurrence of a block with the same contents, and sum the binary logarithms of all scores. Divide by the number of blocks to obtain the value of the test function, and verify that this value is close enough to the theoretically expected value.
Purpose: Verify that the sequence has some of the properties of a truly random walk
Description: Consider the pseudo-random sequence to represent a random walk starting from zero and moving up (down) by one unit every time a zero (a one) occurs in the sequence. Measure the maximum height reached by the walk, as well as the average time and the number of states visited. Also, for each possible state (integer), measure in how many cycles it appears. Evaluate the theoretical probability of obtaining these values.
The sources previously cited (in particular the NIST recommendation paper) present mathematical background about these tests, as well as lots of other tests.
Did I miss your favourite randomness test? Have you ever run into obvious imperfections in a pseudo-random number generator? Tell us in the comments!
This article features a rather long introduction to sum types in functional languages. If you already know about this, you can skip to the core of this article.
Starting with a simple example, imagine you want to create a linked list structure. Most C#/Java programmers will write something like
class MyList<T> {
    T value;
    MyList<T> next;

    MyList(T value, MyList<T> next) {
        this.value = value;
        this.next = next;
    }
}

MyList<int> example = new MyList<int>(1, new MyList<int>(2, null));
This relies on a rather ugly trick: null is used to represent the empty list. Functional languages prefer to clearly separate two cases: the empty list (usually called nil), and the concatenation of a value (called the head of the list) and another list (called the tail).
(* 't is a type variable, just like T above *)
type 't mylist = Nil | Cons of 't * ('t mylist);;

let example = Cons(1, Cons(2, Nil))
{- t is a type variable, just like T above -}
data MyList t = Nil | Cons t (MyList t)

example = Cons 1 (Cons 2 Nil)
This introduces the concept of a sum type: a type in which values belong to one of a number of sets (here, either to the singleton set {Nil}, or to the set of all values of type t — for example integers if t is int).
Here is another example, where a value of type Value can be either an integer, or a floating-point number, or a string, or a boolean.
type value =
  | IntegerValue of int
  | FloatValue of float
  | StringValue of string
  | BooleanValue of bool
data Value = IntegerValue Int
           | FloatValue Float
           | StringValue String
           | BooleanValue Bool
Quite naturally, functional languages include a special construct to handle values whose type is a sum type. This construct is called pattern matching; it is a bit like a switch statement, only the switch is on the object’s inner type (the subset of the sum type which the object belongs to). For example, to identify the contents of a variable whose type is Value, functional programmers write
let identify = function
  | IntegerValue v -> "int!"
  | FloatValue v -> "float!"
  | StringValue str -> "string!"
  | BooleanValue b -> "bool!"
identify (IntegerValue val) = "int!"
identify (FloatValue val) = "float!"
identify (StringValue str) = "string!"
identify (BooleanValue b) = "bool!"
Here’s another example, on lists this time: if functional programmers want to count elements in a list, they write
let rec count = function
  | Nil -> 0
  | Cons (head, tail) -> 1 + (count tail)
;;
count Nil = 0
count (Cons hd tl) = 1 + count tl
Let’s move on to our last example.
In this section, we use a sum type to represent the structure of an XML document (some node types have been omitted). We first define an attribute to be a pair of two strings, and a node to be one of DocumentFragment (a list of XML nodes), Element (a name, attributes, and child elements), Comment (some comment text), or Text (plain text).
type attr = Attribute of string * string;;

type xmlnode =
  | DocumentFragment of xmlnode list
  | Element of string * (attr list) * (xmlnode list)
  | Comment of string
  | Text of string
;;
data Attribute = Attribute String String

data XmlNode = DocumentFragment [XmlNode]
             | Element String [Attribute] [XmlNode]
             | Comment String
             | Text String
With such a type declaration, writing concise tree-traversal functions is easy: as an example, we define a serialize function, which generates XML code from an xmlnode object.
let serialize_attribute (Attribute (name, value)) =
  Printf.printf " %s=\"%s\"" name value
;;

let rec serialize = function
  | DocumentFragment nodes ->
    List.iter serialize nodes (* applies serialize to every xmlnode in nodes *)
  | Element (label, attributes, nodes) ->
    Printf.printf "<%s" label;
    List.iter serialize_attribute attributes;
    Printf.printf ">\n";
    List.iter serialize nodes;
    Printf.printf "</%s>\n" label
  | Comment str ->
    Printf.printf "<!-- %s -->" str
  | Text str ->
    Printf.printf "%s\n" str
;;
Here is a small example to illustrate the previous function:
serialize (
  Element ("a",
    [ Attribute ("href", "http://pit-claudel.fr/clement/blog");
      Attribute ("title", "Code crumbs") ],
    [ Text ("Code crumbs -- Personal thoughts about programming") ])
);;
This prints
<a href="http://pit-claudel.fr/clement/blog" title="Code crumbs">
Code crumbs -- Personal thoughts about programming
</a>
We can now move on to the core of this article:
Returning to a simpler example, suppose we need to represent a generic binary tree (a tree is defined as being either a branch, containing two sub-trees, or a leaf, containing a value). Functional programmers will write something like
type 't tree =
  | Branch of ('t tree * 't tree)
  | Leaf of 't

let example = Branch (Branch (Leaf 1, Leaf 2), Leaf 3)
data Tree t = Branch (Tree t) (Tree t) | Leaf t

example = Branch (Branch (Leaf 1) (Leaf 2)) (Leaf 3)
On the other hand, (hurried) C#/Java programmers will often implement this as a product type — that is, they’ll pack both cases in a single class, and use an extra field to differentiate branches and leaves:
enum TreeType { BRANCH, LEAF }

class BadTree<T> {
    TreeType type;
    T leaf_value;
    BadTree<T> left_branch, right_branch;

    BadTree(BadTree<T> left_branch, BadTree<T> right_branch) {
        this.type = TreeType.BRANCH;
        this.left_branch = left_branch;
        this.right_branch = right_branch;
        // Memory waste: leaf_value is unused (!)
    }

    BadTree(T leaf_value) {
        this.type = TreeType.LEAF;
        this.leaf_value = leaf_value;
        // Memory waste: left_branch and right_branch are unused (!)
    }
}

BadTree<int> example = new BadTree<int>(new BadTree<int>(new BadTree<int>(1), new BadTree<int>(2)), new BadTree<int>(3));
This representation wastes a lot of memory and, though rather concise, quickly degenerates when more cases are added to the type definition (for example in the xmlnode case — for another real-world example, see the end of this post).
The idea is to encode the type disjunction (Branch or Leaf, DocumentFragment or Element or Comment or Text) in the inheritance hierarchy; practically speaking, the trick is to define an abstract Tree class, and make two new classes, Leaf and Branch, which both inherit from Tree:
abstract class Tree<T> {
    public class Leaf : Tree<T> {
        public T value;
        public Leaf(T value) { this.value = value; }
    }

    public class Branch : Tree<T> {
        public Tree<T> left, right;
        public Branch(Tree<T> left, Tree<T> right) {
            this.left = left;
            this.right = right;
        }
    }
}

Tree<int> example =
    new Tree<int>.Branch(
        new Tree<int>.Branch(new Tree<int>.Leaf(1), new Tree<int>.Leaf(2)),
        new Tree<int>.Leaf(3));
This allows for much cleaner code; and the memory waste is gone!
The BadTree class, which uses an enum to distinguish branches and leaves, makes it possible to perform pattern matching over trees by switching on the type field. Though this approach translates directly to the abstract Tree class using the is keyword, it's not ideal:
BadTree<int> tree = (...);
switch (tree.type) {
  case TreeType.BRANCH:
    (...);
    break;
  case TreeType.LEAF:
    (...);
    break;
}
Tree<int> tree = (...);
if (tree is Tree<int>.Branch) {
  Tree<int>.Branch branch = (Tree<int>.Branch) tree;
  (...);
} else if (tree is Tree<int>.Leaf) {
  Tree<int>.Leaf leaf = (Tree<int>.Leaf) tree;
  (...);
}
Though functional, this new version of the pattern-matching code is not really pretty. A much cleaner approach is to implement the pattern matching directly in the derived classes (Branch and Leaf), by declaring an abstract method in the parent class, Tree. That is, instead of having one big function doing explicit pattern matching in the Tree class, we'll have multiple specialized functions, one in each class inheriting from Tree. Philosophically speaking, just like we implemented type disjunctions as different classes inheriting a common parent, we'll implement logical disjunctions as different functions overriding a common parent.
For example, here is how we can write a Fork function, which returns a new Tree with every leaf split into two identical leaves:
abstract class Tree<T> {
  public abstract Tree<T> Fork();

  public class Leaf : Tree<T> {
    public T value;
    // Constructor omitted

    public override Tree<T> Fork() {
      return new Branch(new Leaf(value), new Leaf(value));
    }
  }

  public class Branch : Tree<T> {
    public Tree<T> left, right;
    // Constructor omitted

    public override Tree<T> Fork() {
      return new Branch(left.Fork(), right.Fork());
    }
  }
}
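For comparison, here is a Java sketch of the same idea (the names simply mirror the C# version above): the abstract fork method is overridden in each subclass, so no type switch or cast appears anywhere.

```java
abstract class JTree<T> {
    abstract JTree<T> fork();

    static class Leaf<T> extends JTree<T> {
        final T value;
        Leaf(T value) { this.value = value; }
        @Override JTree<T> fork() {
            // A leaf splits into a branch holding two copies of itself.
            return new Branch<>(new Leaf<>(value), new Leaf<>(value));
        }
    }

    static class Branch<T> extends JTree<T> {
        final JTree<T> left, right;
        Branch(JTree<T> left, JTree<T> right) { this.left = left; this.right = right; }
        @Override JTree<T> fork() {
            // A branch forks both of its subtrees.
            return new Branch<>(left.fork(), right.fork());
        }
    }
}
```

Calling fork on any tree dispatches to the right implementation automatically, which is exactly the pattern matching we wanted, paid for once in the class hierarchy instead of at every use site.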
The benefits of using this abstract class approach are particularly visible when the data types involved get more complex. Below is an example implementation of a data structure used to describe arithmetic expressions.
abstract class ArithmeticExpression {
  // Evaluates an arithmetic expression. context is a dictionary mapping
  // variable names to their values.
  public abstract float? Eval(Dictionary<string, float?> context);

  public class Variable : ArithmeticExpression {
    string label;
    // Constructor omitted

    public override float? Eval(Dictionary<string, float?> context) {
      float? value;
      context.TryGetValue(label, out value);
      return value;
    }
  }

  public abstract class BinaryOperation : ArithmeticExpression {
    protected ArithmeticExpression left_operand, right_operand;

    public BinaryOperation(ArithmeticExpression left_operand,
                           ArithmeticExpression right_operand) {
      this.left_operand = left_operand;
      this.right_operand = right_operand;
    }

    protected abstract float? Apply(float left_value, float right_value);

    public override float? Eval(Dictionary<string, float?> context) {
      float? left_result = left_operand.Eval(context),
             right_result = right_operand.Eval(context);
      if (left_result.HasValue && right_result.HasValue)
        return Apply(left_result.Value, right_result.Value);
      else
        return null;
    }

    class Sum : BinaryOperation {
      public Sum(ArithmeticExpression left_operand,
                 ArithmeticExpression right_operand)
        : base(left_operand, right_operand) { }

      protected override float? Apply(float left_value, float right_value) {
        return left_value + right_value;
      }
    }

    class Subtraction : BinaryOperation {
      public Subtraction(ArithmeticExpression left_operand,
                         ArithmeticExpression right_operand)
        : base(left_operand, right_operand) { }

      protected override float? Apply(float left_value, float right_value) {
        return left_value - right_value;
      }
    }

    class Product : BinaryOperation {
      public Product(ArithmeticExpression left_operand,
                     ArithmeticExpression right_operand)
        : base(left_operand, right_operand) { }

      protected override float? Apply(float left_value, float right_value) {
        return left_value * right_value;
      }
    }

    class Division : BinaryOperation {
      public Division(ArithmeticExpression left_operand,
                      ArithmeticExpression right_operand)
        : base(left_operand, right_operand) { }

      protected override float? Apply(float left_value, float right_value) {
        return left_value / right_value;
      }
    }
  }
}
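To see the evaluator's design in action, here is a hedged Java transcription of the same structure (a sketch: Double stands in for C#'s float?, class names are abbreviated, and only two of the four operations are shown). A null result propagates upward when a variable is unbound:

```java
import java.util.HashMap;
import java.util.Map;

abstract class Expr {
    // Evaluates the expression; returns null if any variable is unbound.
    abstract Double eval(Map<String, Double> context);

    static class Variable extends Expr {
        final String label;
        Variable(String label) { this.label = label; }
        @Override Double eval(Map<String, Double> context) {
            return context.get(label); // null when the variable is unbound
        }
    }

    static abstract class BinaryOperation extends Expr {
        final Expr left, right;
        BinaryOperation(Expr left, Expr right) { this.left = left; this.right = right; }
        abstract double apply(double l, double r);
        @Override Double eval(Map<String, Double> context) {
            // Evaluate both operands once, then combine; propagate null.
            Double lv = left.eval(context), rv = right.eval(context);
            if (lv == null || rv == null) return null;
            return apply(lv, rv);
        }
    }

    static class Sum extends BinaryOperation {
        Sum(Expr l, Expr r) { super(l, r); }
        @Override double apply(double l, double r) { return l + r; }
    }

    static class Product extends BinaryOperation {
        Product(Expr l, Expr r) { super(l, r); }
        @Override double apply(double l, double r) { return l * r; }
    }
}
```

With x = 2, y = 3 and z = 5 in the context, evaluating (x + y) * z yields 25; with an empty context it yields null, without any explicit case analysis at the call site.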