Monday, July 13, 2009

Replacing Invalid Characters v. Performance

It appears a large amount of StackOverflow posts focus on eeking out an impossible amount of performance for rather mundane tasks. Sometimes, when you read the question, you can't help but wonder why they even want to improve the performance of the given algorithm. The odds are most performance enhancements would come at the expense of readability. However, that does not mean it isn't worth taking a look at--especially if you're bored and don't want to do something else.

I had some fun with the following question: Most efficient way to remove special characters from string. Just from looking at the specification given, my guess is this is going to get called maybe 100 times at the most. O(f(n)) where n is 100 is boring. Even more interesting is that the given solution from the poster is already quite good (barring a logic error). Here is the OP's solution without any logical errors:

public static string RemoveSpecialCharacters(string str)
{
StringBuilder sb = new StringBuilder();
for (int i = 0; i < str.Length; ++i)
{
if ((str[i] >= '0' && str[i] <= '9')
|| (str[i] >= 'A' && str[i] <= 'Z')
|| (str[i] >= 'a' && str[i] <= 'z')
|| (str[i] == '.' || str[i] == '_'))
sb.Append(str[i]);
}
return sb.ToString();
}

As a frequent code reviewer, I think this method looks great. In fact, running this method against a corpus of 10000 randomly generated lines with 4096 characters executes in 0.0479ms/string. This is very fast. The simplest improvement would be to prefill the StringBuilder to str.Length characters. This results in a new runtime of 0.0442ms/string. Now we're cooking with gas!

A few posters suggested regular expressions. Well, the naive approach (uncompiled) takes 0.489ms/string which is 10 times slower. Making the regular expression compiled takes 0.363ms/string, still 8 times slower. However, take a look at the new code you must maintain:

static Regex replacer = new Regex(@"[^a-zA-Z0-9_.]+", RegexOptions.Compiled);
public static string RemoveSpecialCharacters(string str)
{
return replacer.Replace(str, String.Empty);
}

Ok, so a big drop in performance, but perhaps a big gain in readability and maintainability. However, the OP's algorithm is surely very simple to read. So the regular expressions make sense if this method is going to be called a few times on small strings.

Another suggestion was to use foreach rather than a for-loop. This results in another minor speedup to 0.0412ms/string. The last of the improvements suggested was to go with a lookup table. Personally it is my least favorite because it will be less readable and maintainable. C# 3.0 can help us some, using LINQ to improve those.

static bool[] allowedChars = new List<bool>(
from ii in Enumerable.Range(0, 128)
let c = (char)ii
select (c >= '0' && c <= '9')
|| (c >= 'A' && c <= 'Z')
|| (c >= 'a' && c <= 'z')
|| (c == '.')
|| (c == '_')
).ToArray();

public static string RemoveSpecialCharacters(string str)
{
StringBuilder sb = new StringBuilder(str.Length);
for (int ii = 0; ii < str.Length; ++ii)
{
if (str[ii] < allowedChars.Length && allowedChars[str[ii]])
sb.Append(str[ii]);
}
return sb.ToString();
}

Take a look at how many extra lines of code we have added. This solution may be the fastest so far, at 0.0399ms/string, but it is also bordering on the ugliest. However, we haven't even touched on the simplest improvement that can be made which results in the largest absolute increase in performance. If you replace the StringBuilder with a char[] a 100% speedup can be achieved!

public static string RemoveSpecialCharacters(string str)
{
int idx = 0;
char[] chars = new char[str.Length];
foreach (char c in str)
{
if ((c >= '0' && c <= '9')
|| (c >= 'A' && c <= 'Z')
|| (c >= 'a' && c <= 'z')
|| (c == '.') || (c == '_'))
{
chars[idx++] = c;
}
}
return new string(chars, 0, idx);
}

This method runs in a blazing 0.0299ms/string, is simple to read, is easy to maintain, and is highly performant. We have a winner folks.