Unscrambling shuffled text

A story which surfaced a few years ago, and met quite some success in the press and on the internet, pretended Cambridge University had been conducing research on some of the most amazing faculties of the human brain. According to a supposed study, the order in which letters were laid out when writing a word mattered very little, provided the first and last letter be kept in place : this conclusion was supported by a short excerpt of shuffled text, which anyone could easily decipher ((Matt Davis has a fascinating write-up on this. Snopes also has some reference material; the full text was something like Aoccdrnig to rscheearch at Cmabrigde uinervtisy, it deosn’t mttaer waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteres are at the rghit pclae. The rset can be a tatol mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.)). As a short example, consider the following sentence:

Narlmloy, radneig tihs shdulon’t be too hrad.

As many commentators pointed out at the time, the trick works well because the words used are relatively short; the following passage should be much harder to understand:

The eofrft rureieqd to sfssllcceuuy dhiepecr sbecmrald pgsaases daiaarclltmy ianserecs as wdors get lehegtinr.

This article presents an efficient algorithm to unshuffle scrambled text.

Shuffling text

Shuffling a piece of text is easy:

import re, random
 
def shuffle_one(word):
  if len(word) <= 3:
    return word
  
  middle = list(word[1:-1])
  random.shuffle(middle)
   
  return word[0] + ''.join(middle) + word[-1]
 
def repl(match):
  return shuffle_one(match.groups()[0])
 
def shuffle(text):
  return re.sub("(\w+)", repl, text)
 
shuffle("Normally, reading this shouldn't be too hard.")

Consider the opposite problem:

Unshuffling scrambled text

Deciphering a shuffled message is non-trivial because one word can be shuffled in many different ways: "Cadgmrbie", "Cgamdbire", "Cmbiagrde" and "Cgrbimdae" are all acceptable representations of the word "Cambridge".

To overcome this difficulty, we define a shuffling-independent signature — a function that maps each word to a signature that does not change when the word is shuffled. The following snippet implements such a signature by isolating the first and last letter of each word and sorting the central part alphabetically.

def signature(word):
  word = word.lower()

  if len(word) <= 3:
    return word
   
  middle = list(word[1:-1])
  return word[0] + ''.join(sorted(middle)) + word[-1]
 
signature("Welcome") # 'wcelmoe'
signature("Woclmee") # 'wcelmoe'

Using this function, we can now compute a normalized version of the word "Cambridge", which does not change when its letters are shuffled: signature("Cambridge"), signature("Cadgmrbie"), signature("Cgamdbire"), signature("Cmbiagrde") and signature("Cgrbimdae") are all equal to "cabdgimre".

We then compute a signature-to-word mapping (possibly multi-valued — different words may map to the same signature) using a word list ((This list, found on Keith Vertanen’s website, works well)) (english.txt).

handler = open("english.txt", encoding = "utf-8")
words = [word.strip() for word in handler.readlines()]
handler.close()
 
# Load dictionary entries
dictionary = {signature(word):word for word in words}

This gives us a signature-to-word dictionary, which we can then use to decipher the shuffled message. For this, we first compute the signature of each scrambled word in the input, and then retrieve the corresponding clear-text words from the dictionary.

def unshuffle_one(word):
  sig = signature(word)
  return dictionary[sig] if sig in dictionary else word
   
def unrepl(match):
  return unshuffle_one(match.groups()[0])
 
def unshuffle(text):
  return re.sub("(\w+)", unrepl, text)
 
print(unshuffle("The eofrft rreuqied to sfssllcceuuy dhiepecr sbecmrald "
                "pgsaases daiaarclltmy ianserecs as wdors get lehegtinr."))

Taking the signature of each word in our example sentence, we obtain

The effort reeiqrud to sccefllssuuy dceehipr sabcelmrd paaegsss daaacillmrty iaceenrss as wdors get leeghintr.

And by applying the signature-to-word mapping to each word, we finally obtain the clear text:

The effort required to successfully decipher scrambled passages dramatically increases as words get lengthier.

Further thoughts

Handling signature collisions

The code can be modified to allow for signature collisions (collisions happen when multiple words have the same signature): for this, instead of using a dictionary as a word-to-word mapping, we use a dictionary as a word-to-list-of-words mapping:

# Load dictionary entries
dictionary = {}
for word in words:
  sig = signature(word)

  if not sig in dictionary:
    dictionary[sig] = [] # Create an empty list
  
  # Add word to the list of all words sharing signature sig
  dictionary[sig].append(word)

def unshuffle_one(word):
  sig = signature(word)
  return '|'.join(dictionary[sig]) if sig in dictionary else word

The following function prints a list of all signature collisions.

def find_collisions():
  for sig in dictionary:
    if len(dictionary[sig]) > 1:
      print(', '.join(dictionary[sig]))

Relatively common words triggering signature collisions are listed below.

Most common English words triggering signature collisions
aboard, abroad
alerting, altering
angels, angles
barking, braking
beard, bread
bedroom, boredom
blowing, bowling
brunt, burnt
caller, cellar
calvary, cavalry
carve, crave
cashing, chasing
catering, creating
centers, centres
circles, clerics
clot, colt
cloud, could
compiled, complied
complaint, compliant
conservation, conversation
dairies, diaries
damned, demand
density, destiny
divers, drives
entirety, eternity
excepted, expected
feeling, fleeing
files, flies
fingers, fringes
fired, fried
flower, fowler
forests, fosters
gateway, getaway
hate, heat
haters, hearts
incest, insect
lair, liar
lakes, leaks
lanes, leans
maiden, median
males, meals
marital, martial
panel, penal
panels, planes
parties, pirates
patrol, portal
pedals, pleads
pluses, pulses
pointers, proteins
recourse, resource
redrawing, rewarding
reserve, reverse
resorts, rosters
sacred, scared
saints, stains
sales, seals
shooting, soothing
signing, singing
sinks, skins
skates, stakes, steaks
sorting, storing
spared, spread
spilt, split
spotlight, stoplight
spots, stops
sprite, stripe
taxes, texas
tenor, toner
trail, trial
tribune, turbine
warp, wrap
weird, wired

To experiment by yourself, you can download the full source code of this article.

Did you like this article? Leave a comment! (If you hated it, that’s fine too: leave a message and tell me what I should improve next time!)

2 thoughts on “Unscrambling shuffled text

  1. Pingback: Msot popele can undreatnsd tihs setennce

Comments are closed.