jordan.terrell
Just trying to make sense of things...

Never Underestimate A Well Written Regular Expression

Tuesday, 6 April 2010 11:36 by jordan.terrell

A couple of weeks ago, Kirill Osenkov posted an interview question that got the attention of a few .NET developers, myself included.  Like moths to a flame, we were all eager to present a solution to this interview question:

In a given .NET string, assume there are line breaks in standard \r\n form (basically Environment.NewLine).

Write a method that inserts a space between two consecutive line breaks to separate any two line breaks from each other.

Roughly 20 answers were given in the comments to Kirill’s post, some with subtle differences, some with completely different approaches.  Some were even written in F#.

When I first saw the interview question, I very quickly came to my answer:

    string output = Regex.Replace(input, @"(\r\n)(?=\r\n)", "$1 ");
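To unpack that one-liner a little, here is a minimal sketch that wraps it in a method and annotates each part of the pattern. The class and method names (LineBreakSpacer, Separate) are hypothetical, chosen only for illustration; the pattern and replacement string are the ones from my answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class; the name is an assumption made for this sketch.
public static class LineBreakSpacer
{
    // (\r\n)    captures a single line break as group 1.
    // (?=\r\n)  is a positive lookahead: it only matches when another line
    //           break follows, but it does not consume that second break.
    private const string Pattern = @"(\r\n)(?=\r\n)";

    public static string Separate(string input)
    {
        // "$1 " writes the captured line break back, followed by one space.
        return Regex.Replace(input, Pattern, "$1 ");
    }
}
```

Because the lookahead does not consume the second line break, that break is still available as the start of the next match. As a result, a run of three or more line breaks gets a space between every adjacent pair: Separate("a\r\n\r\n\r\nb") returns "a\r\n \r\n \r\nb".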

My answer uses Regular Expressions, a concise language for searching text, sometimes in complex ways.  I first became aware of Regular Expressions in 2004, and I was immediately enamored with them.  Until then, I had always written complex search functions using operations like IndexOf() or by directly accessing character arrays.  Text searching always seemed slow to me, but I later realized that it was just my poorly written code.  Very quickly I dove into learning the Regular Expression language (at least the dialect in .NET), and found many uses for it.
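For contrast, here is a rough sketch of what the same routine might look like when written by hand with IndexOf().  This is not one of the answers from Kirill's post; it is a hypothetical version (the class name ManualLineBreakSpacer is made up) meant only to show how much bookkeeping the regular expression hides.

```csharp
using System;
using System.Text;

// Hypothetical hand-written equivalent; an illustrative assumption,
// not code taken from the original discussion.
public static class ManualLineBreakSpacer
{
    public static string Separate(string input)
    {
        const string newLine = "\r\n";
        var result = new StringBuilder(input.Length);
        int position = 0;

        while (true)
        {
            // Find the next line break at or after the current position.
            int index = input.IndexOf(newLine, position, StringComparison.Ordinal);
            if (index < 0)
            {
                // No more line breaks: copy the remaining text and finish.
                result.Append(input, position, input.Length - position);
                return result.ToString();
            }

            // Copy everything up to and including this line break.
            int afterBreak = index + newLine.Length;
            result.Append(input, position, afterBreak - position);

            // If another line break follows immediately, separate the two with a space.
            if (string.CompareOrdinal(input, afterBreak, newLine, 0, newLine.Length) == 0)
            {
                result.Append(' ');
            }

            position = afterBreak;
        }
    }
}
```

The loop has to track positions, handle the end of the string, and check for the following line break explicitly, all of which the lookahead in the regular expression expresses in a few characters.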

I’ve found since then that many developers are either unaware of Regular Expressions or fearful of them.  I’ll admit, some expressions that I’ve seen look very cryptic and intimidating.  But they are very powerful.  Plus they have the benefit of usually being very fast (although you can write slow ones).

During the discussion in the comments of Kirill’s post, it became obvious that performance is something to be considered in such a routine.  As a result, Rik Hemsley commented that he had created a benchmarking test bed to run each suggested solution.  Here are the results:

[Image: benchmark results from Rik Hemsley's test bed]

My solution came out as the best performing.  I say this not to gloat, because in reality I’m just using what Microsoft wrote for us to use.  I say it because I wanted to show that knowing about and using Regular Expressions is important when you need to parse text.  I’m sure that someone could come up with a better performing solution, but for a one-liner, Regular Expressions are hard to beat.

If you are interested in learning about Regular Expressions in .NET, the MSDN documentation is pretty good.  There are also books you can read on them, as well as sites that have examples.