{"id":458,"date":"2012-09-30T02:57:08","date_gmt":"2012-09-30T00:57:08","guid":{"rendered":"http:\/\/pit-claudel.fr\/clement\/blog\/?p=458"},"modified":"2013-12-19T01:22:19","modified_gmt":"2013-12-19T00:22:19","slug":"modeling-and-measuring-string-comparison-performance-in-c-cpp-csharp-and-python","status":"publish","type":"post","link":"https:\/\/pit-claudel.fr\/clement\/blog\/modeling-and-measuring-string-comparison-performance-in-c-cpp-csharp-and-python\/","title":{"rendered":"Modeling and measuring string comparison performance in C, C++, C# and Python."},"content":{"rendered":"<p> <img decoding=\"async\" class=\"wrap\" style=\"padding-top:10px\" alt=\"O(1)\" src=\"http:\/\/pit-claudel.fr\/clement\/blog\/images\/O(1).png\"\/> Comparing strings is often &#8212; erroneously &#8212; said to be a costly process. In this article I derive the theoretical asymptotic cost of comparing random strings of arbitrary length, and measure it in C, C++, C# and Python.<\/p>\n<p><!--more--><\/p>\n<h2>Theoretical amortized cost<\/h2>\n<h3>Infinite strings<\/h3>\n<p>Suppose one draws two random strings \\(A = &#8220;a_0a_1a_2&#8230;&#8221;\\) and \\(B = &#8220;b_0b_1b_2&#8230;&#8221;\\) of infinite length, with \\(a_n\\) and \\(b_n\\) respectively denoting the \\(n\\)<sup>th<\/sup> character of \\(A\\) and \\(B\\). We will assume that characters are independent, and uniformly drawn over the set \\(\\{a, b, c, &#8230;, z\\}\\). For now, we will consider all strings to be of infinite length &#8212; we&#8217;ll study bounded-length strings in the next section.<\/p>\n<p>Given that characters making up \\(A\\) and \\(B\\) are independent, the probability of \\(A\\) differing from \\(B\\) at index \\(n\\) is \\(25\/26\\). Conversely, the probability of \\(A\\) matching string \\(B\\) at index \\(n\\) is \\(1\/26\\).<\/p>\n<p>Consequently, the probability of \\(A\\) matching \\(B\\) up to index \\(n &#8211; 1\\), and differing at index \\(n\\), is \\((1\/26)^n &sdot; (25\/26)\\). Therefore, given that comparing strings up to index \\(n\\) included is \\((n+1)\\) the expected number of characters to compare to discriminate \\(A\\) and \\(B\\) is $$ C = \\sum_{n=0}^{\\infty} (n+1) &sdot; (1\/26)^n &sdot; (25\/26) = 26\/25$$<\/p>\n<p class=\"details\">Indeed, defining \\(C(x) = (1-x) &sdot; \\sum_{n=0}^{\\infty} (n+1) x^n\\), the expected number of comparisons \\(C\\) required to discriminate \\(A\\) and \\(B\\) is \\(C(1\/26)\\). Now \\(C(x) = (1-x) &sdot; \\sum_{n=0}^{\\infty} \\frac{&part; x^n}{&part; x} = (1-x) &sdot; \\frac{&part; \\sum_{n=0}^{\\infty} x^{n+1}}{&part; x} = (1-x) &sdot; (1-x)^{-2} = 1\/(1-x)\\). Plugging \\(x=1\/26\\) yields \\(C = 26\/25 \\approx 1.04\\).<\/p>\n<p>In other words, it takes an average of \\(1.04\\) character comparisons to discriminate two random strings of infinite length. That&#8217;s rather cheap. <\/p>\n<h3>Finite strings<\/h3>\n<p>In this section we assume that \\(A\\) and \\(B\\) are bounded in length by a constant \\(L\\). This implies that the probability of the comparison routine reaching any index greater than \\((L-1)\\) is \\(0\\); hence the expected number of comparisons \\(C_L\\) is $$C_L = \\sum_{n=0}^{L-1} (n+1) &sdot; (1\/26)^n &sdot; (25\/26) = (26\/25) &sdot; (1 &#8211; \\frac{25L + 26}{26^{L+1}})$$<br \/>\nFor any \\(L > 3\\), this is virtually indistinguishable from the previously calculated value \\(C = 1.04\\).<\/p>\n<h3>Conclusion<\/h3>\n<p>The theoretical amortized time required to compare two strings is constant &#8212; independent of their length. Of course, there may be a sizeable overhead associated to comparing strings instead of ints: experimental measurements of this overhead are presented below.<\/p>\n<h2>Experimental measurements<\/h2>\n<p>The following charts present the time it took to perform 100&#8217;000 string comparisons on my laptop, compared to a baseline of 100&#8217;000 integer comparisons, in 3 different programming languages and for string lengths varying from 1 to 500 characters. Despite large variations for small string lengths, measured times quickly converge to a fixed limit as strings lengthen ((If you want to experiment with the tests, you can download the full <a href=\"http:\/\/pit-claudel.fr\/clement\/blog\/files\/string-comparison-performance.py\">python<\/a>, <a href=\"http:\/\/pit-claudel.fr\/clement\/blog\/files\/string-comparison-performance.cpp\">C++<\/a>, or <a href=\"http:\/\/pit-claudel.fr\/clement\/blog\/files\/string-comparison-performance.cs\">C#<\/a> source code.)).<\/p>\n<div class=\"illustration\">\n  <img decoding=\"async\" src=\"http:\/\/pit-claudel.fr\/clement\/blog\/images\/python-string-comparison-performance.png\" alt=\"Chart comparing the relative performance of strings and integer comparisons\" \/><\/p>\n<p>Time required to compare 100&#8217;000 pairs of random strings, as a function of string length, in Python (3.2.3)<\/p>\n<\/div>\n<div class=\"illustration\">\n\t<img decoding=\"async\" src=\"http:\/\/pit-claudel.fr\/clement\/blog\/images\/csharp-string-comparison-performance.png\" alt=\"Chart comparing the relative performance of strings and integer comparisons\" \/><\/p>\n<p>Time required to compare 100&#8217;000 pairs of random strings, as a function of string length, in C# (csc 4.0.30319.17929)<\/p>\n<\/div>\n<div class=\"illustration\">\n\t<img decoding=\"async\" src=\"http:\/\/pit-claudel.fr\/clement\/blog\/images\/c++-string-comparison-performance.png\" alt=\"Chart comparing the relative performance of strings and integer comparisons\" \/><\/p>\n<p>Time required to compare 100&#8217;000 pairs of random strings, as a function of string length, in C++ (gcc 4.3.4).<\/p>\n<\/div>\n<h3>Remarks<\/h3>\n<p>Though large strings (over ~200 characters) experimentally match our statistical model, small strings exhibit a much more complex behaviour. I can think of a two causes for such variations:<\/p>\n<ul>\n<li><a href=\"http:\/\/en.wikipedia.org\/wiki\/String_interning\">String interning<\/a>: Python <a href=\"http:\/\/docs.python.org\/library\/functions.html#intern\">interns<\/a> some strings &#8212; exactly which depends on the implementation. On my laptop, this turns small (< 64 characters) string comparisons into direct pointer comparisons, which are about as fast as integer comparisons.<\/li>\n<li><a href=\"http:\/\/en.wikipedia.org\/wiki\/Locality_of_reference\">Locality<\/a>: In C++ in particular, under <a href=\"http:\/\/pit-claudel.fr\/clement\/blog\/flies\/locality.cpp\">certain conditions<\/a>, applying a function to a <samp>char*<\/samp> gets slower as the referred string lengthens, even if the function only access the first few elements. The effect is barely noticeable for strings over 200 characters.<\/li>\n<\/ul>\n<h2>Conclusion<\/h2>\n<p>Comparing strings, especially small strings, is very fast. The amortized time required to compare two strings is bounded by a constant, which is experimentally about twice the time required to compare two integers in high level languages like Python or C#, and 5 to 7 times higher in low-levels languages like C++.<\/p>\n<p class=\"conclusion\">Do you usually try to avoid comparing strings? Do you have a more detailed explanation of the irregularities observed for lengths < 200 characters? Tell us comments!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Comparing strings is often &#8212; erroneously &#8212; said to be a costly process. In this article I derive the theoretical asymptotic cost of comparing random strings of arbitrary length, and measure it in C, C++, C# and Python.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7,72,16,33,57,32,17,8,14],"tags":[74,73,76,62,77,78,63,34,75],"_links":{"self":[{"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/posts\/458"}],"collection":[{"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/comments?post=458"}],"version-history":[{"count":5,"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/posts\/458\/revisions"}],"predecessor-version":[{"id":758,"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/posts\/458\/revisions\/758"}],"wp:attachment":[{"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/media?parent=458"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/categories?post=458"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/pit-claudel.fr\/clement\/blog\/wp-json\/wp\/v2\/tags?post=458"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}