{"id":9,"date":"2012-07-16T23:59:58","date_gmt":"2012-07-16T21:59:58","guid":{"rendered":"http:\/\/pit-claudel.fr\/clement\/blog\/?p=9"},"modified":"2013-11-16T10:58:32","modified_gmt":"2013-11-16T09:58:32","slug":"unscrambling-shuffled-text","status":"publish","type":"post","link":"https:\/\/pit-claudel.fr\/clement\/blog\/unscrambling-shuffled-text\/","title":{"rendered":"Unscrambling shuffled text"},"content":{"rendered":"<p>A story which surfaced a few years ago, and met quite some success in the press and on the internet, pretended Cambridge University had been conducing research on some of the most amazing faculties of the human brain. According to a supposed study, the order in which letters were laid out when writing a word mattered very little, provided the first and last letter be kept in place : this conclusion was supported by a short excerpt of shuffled text, which anyone could easily decipher ((<a href=\"http:\/\/www.mrc-cbu.cam.ac.uk\/people\/matt.davis\/cmabridge\/\">Matt Davis<\/a> has a fascinating write-up on this. <a href=\"http:\/\/www.snopes.com\/language\/apocryph\/cambridge.asp\">Snopes<\/a> also has some reference material; the full text was something like <em>Aoccdrnig to rscheearch at Cmabrigde uinervtisy, it deosn&#8217;t mttaer waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteres are at the rghit pclae. The rset can be a tatol mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by itslef but the wrod as a wlohe.<\/em>)). As a short example, consider the following sentence: <\/p>\n<blockquote><p>\n  Narlmloy, radneig tihs shdulon&#8217;t be too hrad.\n<\/p><\/blockquote>\n<p>As many commentators pointed out at the time, the trick works well because the words used are relatively short; the following passage should be much harder to understand:<\/p>\n<blockquote><p>\nThe eofrft rureieqd to sfssllcceuuy dhiepecr sbecmrald pgsaases daiaarclltmy ianserecs as wdors get lehegtinr.\n<\/p><\/blockquote>\n<p>This article presents an efficient algorithm to unshuffle scrambled text.<\/p>\n<p><!--more--><\/p>\n<h2>Shuffling text<\/h2>\n<p>Shuffling a piece of text is easy:<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nimport re, random\r\n \r\ndef shuffle_one(word):\r\n  if len(word) &lt;= 3:\r\n    return word\r\n  \r\n  middle = list(word&#x5B;1:-1])\r\n  random.shuffle(middle)\r\n   \r\n  return word&#x5B;0] + ''.join(middle) + word&#x5B;-1]\r\n \r\ndef repl(match):\r\n  return shuffle_one(match.groups()&#x5B;0])\r\n \r\ndef shuffle(text):\r\n  return re.sub(&quot;(\\w+)&quot;, repl, text)\r\n \r\nshuffle(&quot;Normally, reading this shouldn't be too hard.&quot;)\r\n<\/pre>\n<p>Consider the opposite problem:<\/p>\n<h2>Unshuffling scrambled text<\/h2>\n<p>Deciphering a shuffled message is non-trivial because one word can be shuffled in many different ways: <code>\"Cadgmrbie\"<\/code>, <code>\"Cgamdbire\"<\/code>, <code>\"Cmbiagrde\"<\/code> and <code>\"Cgrbimdae\"<\/code> are all acceptable representations of the word <code>\"Cambridge\"<\/code>.<\/p>\n<p>To overcome this difficulty, we define a shuffling-independent signature &#8212; a function that maps each word to a signature that does not change when the word is shuffled. The following snippet implements such a signature by isolating the first and last letter of each word and sorting the central part alphabetically.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\ndef signature(word):\r\n  word = word.lower()\r\n\r\n  if len(word) &lt;= 3:\r\n    return word\r\n   \r\n  middle = list(word&#x5B;1:-1])\r\n  return word&#x5B;0] + ''.join(sorted(middle)) + word&#x5B;-1]\r\n \r\nsignature(&quot;Welcome&quot;) # 'wcelmoe'\r\nsignature(&quot;Woclmee&quot;) # 'wcelmoe'\r\n<\/pre>\n<p>Using this function, we can now compute a normalized version of the word <code>\"Cambridge\"<\/code>, which does not change when its letters are shuffled: <code>signature(\"Cambridge\")<\/code>, <code>signature(\"Cadgmrbie\")<\/code>, <code>signature(\"Cgamdbire\")<\/code>, <code>signature(\"Cmbiagrde\")<\/code> and <code>signature(\"Cgrbimdae\")<\/code> are all equal to <code>\"cabdgimre\"<\/code>.<\/p>\n<p>We then compute a signature-to-word mapping (possibly multi-valued &#8212; different words may map to the same signature) using a word list ((<a href=\"http:\/\/www.keithv.com\/software\/wlist\/wlist_match6.zip\">This list<\/a>, found on <a href=\"http:\/\/www.keithv.com\/software\/wlist\/\">Keith Vertanen&#8217;s website<\/a>, works well)) (<code>english.txt<\/code>).<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\nhandler = open(&quot;english.txt&quot;, encoding = &quot;utf-8&quot;)\r\nwords = &#x5B;word.strip() for word in handler.readlines()]\r\nhandler.close()\r\n \r\n# Load dictionary entries\r\ndictionary = {signature(word):word for word in words}\r\n<\/pre>\n<p>This gives us a signature-to-word dictionary, which we can then use to decipher the shuffled message. For this, we first compute the signature of each scrambled word in the input, and then retrieve the corresponding clear-text words from the dictionary.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\ndef unshuffle_one(word):\r\n  sig = signature(word)\r\n  return dictionary&#x5B;sig] if sig in dictionary else word\r\n   \r\ndef unrepl(match):\r\n  return unshuffle_one(match.groups()&#x5B;0])\r\n \r\ndef unshuffle(text):\r\n  return re.sub(&quot;(\\w+)&quot;, unrepl, text)\r\n \r\nprint(unshuffle(&quot;The eofrft rreuqied to sfssllcceuuy dhiepecr sbecmrald &quot;\r\n                &quot;pgsaases daiaarclltmy ianserecs as wdors get lehegtinr.&quot;))\r\n<\/pre>\n<p>Taking the signature of each word in our example sentence, we obtain<\/p>\n<blockquote><p>The effort reeiqrud to sccefllssuuy dceehipr sabcelmrd paaegsss daaacillmrty iaceenrss as wdors get leeghintr.<\/p><\/blockquote>\n<p>And by applying the signature-to-word mapping to each word, we finally obtain the clear text:<\/p>\n<blockquote><p>The effort required to successfully decipher scrambled passages dramatically increases as words get lengthier.<\/p><\/blockquote>\n<h2>Further thoughts<\/h2>\n<h3>Handling signature collisions<\/h3>\n<p>The code can be modified to allow for signature collisions (collisions happen when multiple words have the same signature): for this, instead of using a dictionary as a word-to-word mapping, we use a dictionary as a word-to-list-of-words mapping:<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n# Load dictionary entries\r\ndictionary = {}\r\nfor word in words:\r\n  sig = signature(word)\r\n\r\n  if not sig in dictionary:\r\n    dictionary&#x5B;sig] = &#x5B;] # Create an empty list\r\n  \r\n  # Add word to the list of all words sharing signature sig\r\n  dictionary&#x5B;sig].append(word)\r\n\r\ndef unshuffle_one(word):\r\n  sig = signature(word)\r\n  return '|'.join(dictionary&#x5B;sig]) if sig in dictionary else word\r\n<\/pre>\n<p>The following function prints a list of all signature collisions.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\ndef find_collisions():\r\n  for sig in dictionary:\r\n    if len(dictionary&#x5B;sig]) &gt; 1:\r\n      print(', '.join(dictionary&#x5B;sig]))\r\n<\/pre>\n<p>Relatively common words triggering signature collisions are listed below.<\/p>\n<table class=\"tight\">\n<tr>\n<th colspan=\"4\">Most common English words triggering signature collisions<\/th>\n<\/tr>\n<tr>\n<td>\n      aboard, abroad<br \/>\n      alerting, altering<br \/>\n      angels, angles<br \/>\n      barking, braking<br \/>\n      beard, bread<br \/>\n      bedroom, boredom<br \/>\n      blowing, bowling<br \/>\n      brunt, burnt<br \/>\n      caller, cellar<br \/>\n      calvary, cavalry<br \/>\n      carve, crave<br \/>\n      cashing, chasing<br \/>\n      catering, creating<br \/>\n      centers, centres<br \/>\n      circles, clerics<br \/>\n      clot, colt<br \/>\n      cloud, could<br \/>\n      compiled, complied\n    <\/td>\n<td>\n      complaint, compliant<br \/>\n      conservation, conversation<br \/>\n      dairies, diaries<br \/>\n      damned, demand<br \/>\n      density, destiny<br \/>\n      divers, drives<br \/>\n      entirety, eternity<br \/>\n      excepted, expected<br \/>\n      feeling, fleeing<br \/>\n      files, flies<br \/>\n      fingers, fringes<br \/>\n      fired, fried<br \/>\n      flower, fowler<br \/>\n      forests, fosters<br \/>\n      gateway, getaway<br \/>\n      hate, heat<br \/>\n      haters, hearts<br \/>\n      incest, insect\n    <\/td>\n<td>\n      lair, liar<br \/>\n      lakes, leaks<br \/>\n      lanes, leans<br \/>\n      maiden, median<br \/>\n      males, meals<br \/>\n      marital, martial<br \/>\n      panel, penal<br \/>\n      panels, planes<br \/>\n      parties, pirates<br \/>\n      patrol, portal<br \/>\n      pedals, pleads<br \/>\n      pluses, pulses<br \/>\n      pointers, proteins<br \/>\n      recourse, resource<br \/>\n      redrawing, rewarding<br \/>\n      reserve, reverse<br \/>\n      resorts, rosters<br \/>\n      sacred, scared\n    <\/td>\n<td>\n      saints, stains<br \/>\n      sales, seals<br \/>\n      shooting, soothing<br \/>\n      signing, singing<br \/>\n      sinks, skins<br \/>\n      skates, stakes, steaks<br \/>\n      sorting, storing<br \/>\n      spared, spread<br \/>\n      spilt, split<br \/>\n      spotlight, stoplight<br \/>\n      spots, stops<br \/>\n      sprite, stripe<br \/>\n      taxes, texas<br \/>\n      tenor, toner<br \/>\n      trail, trial<br \/>\n      tribune, turbine<br \/>\n      warp, wrap<br \/>\n      weird, wired\n    <\/td>\n<\/table>\n<p>To experiment by yourself, you can <a href=\"http:\/\/pit-claudel.fr\/clement\/blog\/files\/shuffle.py\">download the full source code of this article<\/a>.<\/p>\n<p><strong>Did you like this article? Leave a comment! (If you hated it, that&#8217;s fine too: leave a message and tell me what I should improve next time!)<\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This article presents an efficient algorithm to unscramble shuffled text (suhffled txet), with a full implementation in Python.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7,17,8,14],"tags":[13,15],"_links":{"self":[{"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/posts\/9"}],"collection":[{"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/comments?post=9"}],"version-history":[{"count":5,"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/posts\/9\/revisions"}],"predecessor-version":[{"id":115,"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/posts\/9\/revisions\/115"}],"wp:attachment":[{"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/media?parent=9"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/categories?post=9"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/tags?post=9"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}