A story which surfaced a few years ago, and met quite some success in the press and on the internet, pretended Cambridge University had been conducing research on some of the most amazing faculties of the human brain. According to a supposed study, the order in which letters were laid out when writing a word mattered very little, provided the first and last letter be kept in place : this conclusion was supported by a short excerpt of shuffled text, which anyone could easily decipher ((Matt Davis has a fascinating write-up on this. Snopes also has some reference material; the full text was something like Aoccdrnig to rscheearch at Cmabrigde uinervtisy, it deosn’t mttaer waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteres are at the rghit pclae. The rset can be a tatol mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.)). As a short example, consider the following sentence:
Narlmloy, radneig tihs shdulon’t be too hrad.
As many commentators pointed out at the time, the trick works well because the words used are relatively short; the following passage should be much harder to understand:
The eofrft rureieqd to sfssllcceuuy dhiepecr sbecmrald pgsaases daiaarclltmy ianserecs as wdors get lehegtinr.
This article presents an efficient algorithm to unshuffle scrambled text.
Shuffling text
Shuffling a piece of text is easy:
import re, random def shuffle_one(word): if len(word) <= 3: return word middle = list(word[1:-1]) random.shuffle(middle) return word[0] + ''.join(middle) + word[-1] def repl(match): return shuffle_one(match.groups()[0]) def shuffle(text): return re.sub("(\w+)", repl, text) shuffle("Normally, reading this shouldn't be too hard.")
Consider the opposite problem:
Unshuffling scrambled text
Deciphering a shuffled message is non-trivial because one word can be shuffled in many different ways: "Cadgmrbie"
, "Cgamdbire"
, "Cmbiagrde"
and "Cgrbimdae"
are all acceptable representations of the word "Cambridge"
.
To overcome this difficulty, we define a shuffling-independent signature — a function that maps each word to a signature that does not change when the word is shuffled. The following snippet implements such a signature by isolating the first and last letter of each word and sorting the central part alphabetically.
def signature(word): word = word.lower() if len(word) <= 3: return word middle = list(word[1:-1]) return word[0] + ''.join(sorted(middle)) + word[-1] signature("Welcome") # 'wcelmoe' signature("Woclmee") # 'wcelmoe'
Using this function, we can now compute a normalized version of the word "Cambridge"
, which does not change when its letters are shuffled: signature("Cambridge")
, signature("Cadgmrbie")
, signature("Cgamdbire")
, signature("Cmbiagrde")
and signature("Cgrbimdae")
are all equal to "cabdgimre"
.
We then compute a signature-to-word mapping (possibly multi-valued — different words may map to the same signature) using a word list ((This list, found on Keith Vertanen’s website, works well)) (english.txt
).
handler = open("english.txt", encoding = "utf-8") words = [word.strip() for word in handler.readlines()] handler.close() # Load dictionary entries dictionary = {signature(word):word for word in words}
This gives us a signature-to-word dictionary, which we can then use to decipher the shuffled message. For this, we first compute the signature of each scrambled word in the input, and then retrieve the corresponding clear-text words from the dictionary.
def unshuffle_one(word): sig = signature(word) return dictionary[sig] if sig in dictionary else word def unrepl(match): return unshuffle_one(match.groups()[0]) def unshuffle(text): return re.sub("(\w+)", unrepl, text) print(unshuffle("The eofrft rreuqied to sfssllcceuuy dhiepecr sbecmrald " "pgsaases daiaarclltmy ianserecs as wdors get lehegtinr."))
Taking the signature of each word in our example sentence, we obtain
The effort reeiqrud to sccefllssuuy dceehipr sabcelmrd paaegsss daaacillmrty iaceenrss as wdors get leeghintr.
And by applying the signature-to-word mapping to each word, we finally obtain the clear text:
The effort required to successfully decipher scrambled passages dramatically increases as words get lengthier.
Further thoughts
Handling signature collisions
The code can be modified to allow for signature collisions (collisions happen when multiple words have the same signature): for this, instead of using a dictionary as a word-to-word mapping, we use a dictionary as a word-to-list-of-words mapping:
# Load dictionary entries dictionary = {} for word in words: sig = signature(word) if not sig in dictionary: dictionary[sig] = [] # Create an empty list # Add word to the list of all words sharing signature sig dictionary[sig].append(word) def unshuffle_one(word): sig = signature(word) return '|'.join(dictionary[sig]) if sig in dictionary else word
The following function prints a list of all signature collisions.
def find_collisions(): for sig in dictionary: if len(dictionary[sig]) > 1: print(', '.join(dictionary[sig]))
Relatively common words triggering signature collisions are listed below.
Most common English words triggering signature collisions | |||
---|---|---|---|
aboard, abroad alerting, altering angels, angles barking, braking beard, bread bedroom, boredom blowing, bowling brunt, burnt caller, cellar calvary, cavalry carve, crave cashing, chasing catering, creating centers, centres circles, clerics clot, colt cloud, could compiled, complied |
complaint, compliant conservation, conversation dairies, diaries damned, demand density, destiny divers, drives entirety, eternity excepted, expected feeling, fleeing files, flies fingers, fringes fired, fried flower, fowler forests, fosters gateway, getaway hate, heat haters, hearts incest, insect |
lair, liar lakes, leaks lanes, leans maiden, median males, meals marital, martial panel, penal panels, planes parties, pirates patrol, portal pedals, pleads pluses, pulses pointers, proteins recourse, resource redrawing, rewarding reserve, reverse resorts, rosters sacred, scared |
saints, stains sales, seals shooting, soothing signing, singing sinks, skins skates, stakes, steaks sorting, storing spared, spread spilt, split spotlight, stoplight spots, stops sprite, stripe taxes, texas tenor, toner trail, trial tribune, turbine warp, wrap weird, wired |
To experiment by yourself, you can download the full source code of this article.
Did you like this article? Leave a comment! (If you hated it, that’s fine too: leave a message and tell me what I should improve next time!)
Leave comments below if you liked this post!
Pingback: Msot popele can undreatnsd tihs setennce