A beginner's guide to regular expressions

andyMatthews.net

Over the past few months I've noticed quite a few developers with little to no knowledge of regular expressions (regex from here on out). For whatever reason they haven't taken the time, or had the chance, to learn what I consider to be one of the most powerful and useful tools available in programming. Even knowing a few basics can really streamline your workflow and improve your code. Not only are they useful IN code, but they can even help you write code. In this post I'm going to cover some regex basics, then show you some real examples of how they can solve problems for you.

The basics are always a great place to start, so let's look at some syntax. At its heart, regex is simply a way to match (and replace) one string with another. So we replace the literal string cat with mouse, or the number 8675309 with 42. Now that would be useful if we wanted to replace many occurrences of 8675309, but that's not all that common. Wouldn't it be nice if, in addition to replacing 8675309 with 42, we could also replace 5318008 with 42? Regular expressions let you do that with special characters called meta characters. Meta characters are what give regex its incredible power. Here's a list of some of the most common meta characters and how they're used.
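
To make the cat and mouse idea concrete, here's a minimal sketch using Python's built-in re module. The choice of Python and the sample sentence are my assumptions for illustration; any language with regex support works along the same lines.

```python
import re

# Sample text invented for this illustration.
text = "The cat called 8675309, then 5318008."

# A literal pattern matches exactly the characters you type.
literal = re.sub(r"cat", "mouse", text)

# One pattern that covers both numbers: any run of seven digits.
# (\d and {7} are covered later in this guide.)
numbers = re.sub(r"\d{7}", "42", text)

print(literal)  # The mouse called 8675309, then 5318008.
print(numbers)  # The cat called 42, then 42.
```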

Meta Characters
- ^ beginning of a string
  - Except when used inside [] as part of a character set
- $ end of a string
- . matches any single character
  - . matches a or A, & or @, 0 or 4
- ? matches 0 or 1 times
  - https? matches both http and https
- + matches 1 or more times
  - bo+ matches bo and boo, but not b
- * matches 0 or more times
  - bo* matches b, bo, and boo
  - .* matches everything (generally to the end of the line)
- \ used to escape meta characters, to use as literal strings
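
Here's a quick sketch of each meta character from the list above in action, again assuming Python's re module. The sample strings are made up for the demonstration.

```python
import re

# ^ and $ anchor the pattern to the start and end of the string.
starts_with_cat = bool(re.search(r"^cat", "catalog"))   # True
ends_with_cat = bool(re.search(r"cat$", "catalog"))     # False

# . stands in for any single character (except a newline, by default).
dot_matches = re.findall(r"c.t", "cat cot c@t ct")      # ['cat', 'cot', 'c@t']

# ? makes the preceding character optional.
schemes = re.findall(r"https?", "http and https")       # ['http', 'https']

# + needs at least one of the preceding character; * allows none at all.
plus_matches = re.findall(r"bo+", "b bo boo")           # ['bo', 'boo']
star_matches = re.findall(r"bo*", "b bo boo")           # ['b', 'bo', 'boo']

# \ turns a meta character back into a literal one.
literal_dot = re.findall(r"3\.14", "3.14 3514")         # ['3.14']
```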

You can probably start to tell that meta characters are powerful, but it's not that common to want to match just one specific character. Character sets allow developers to create groupings of characters to be used in a match. Character classes are predefined shorthands which offer the same functionality. To create a character set, simply wrap any number of characters in square brackets []. Let's take a look at character sets and character classes.

Character Sets
- [abc] Matches only a,b,c
- [a-zA-Z0-9] Matches any alphanumeric character (each range runs from its first character to its last)
- [A-Z] Matches all uppercase letters
- [a-z] Matches all lowercase letters
- [0-9] Matches all digits
- [x-z] Matches x through z (x, y, and z)
- [^abc] Matches everything but a,b,c (a ^ at the start of the brackets means "everything but these characters")

Character Classes
- \d matches digits, same as [0-9]
- \w matches "word" characters, same as [A-Za-z0-9_]
- \W matches "non-word" characters, essentially everything BUT the previous set
- \t matches tab
- \n matches line break
- \s matches whitespace characters, generally same as [ \t\n\r\f\v]
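
To see sets and classes side by side, here's a small sketch. Python's re module and the sample sentence are assumptions on my part, not something the patterns depend on.

```python
import re

# Sample text invented for this illustration.
sample = "Call Bob at 555-0199 today"

# Character sets: you choose the characters yourself.
abc_only = re.findall(r"[abc]", sample)           # ['a', 'b', 'a', 'a']
uppercase = re.findall(r"[A-Z]", sample)          # ['C', 'B']
not_letters = re.findall(r"[^a-zA-Z ]+", sample)  # ['555-0199']

# Character classes: predefined shorthands for common sets.
digit_runs = re.findall(r"\d+", sample)           # ['555', '0199']
words = re.findall(r"\w+", sample)
# ['Call', 'Bob', 'at', '555', '0199', 'today']

# \s is handy for splitting on any run of whitespace.
pieces = re.split(r"\s+", sample)
# ['Call', 'Bob', 'at', '555-0199', 'today']
```

Notice that [abc] skipped the capital C and B: sets are case sensitive unless you list both cases or turn on a case-insensitive flag.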

It's nice to have the ?, +, and * meta characters available to us, but wouldn't it be nice if we could specify an exact number of characters to match? Character ranges (often called quantifiers) allow you to do this by specifying a minimum count, a maximum count, or both within curly braces { }. Like ?, +, and *, they apply to whatever comes directly before them.

Character Ranges
- {x} Matches exactly x times
- {x,y} Matches at least x times, not more than y (e.g. {3,8})
- {x,} Matches at least x times
- {,y} Matches no more than y times (not supported in every regex flavor; {0,y} is the safer way to write it)
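
Here's a short sketch of the curly brace counts, assuming Python's re module. I'm using re.fullmatch, which only succeeds when the entire string fits the pattern, so the counts are easy to see.

```python
import re

# {4} means exactly four of the preceding element (here, digits).
is_year = bool(re.fullmatch(r"\d{4}", "1999"))       # True
too_long = bool(re.fullmatch(r"\d{4}", "19999"))     # False

# {3,8} accepts between three and eight word characters.
in_range = bool(re.fullmatch(r"\w{3,8}", "regex"))   # True
too_short = bool(re.fullmatch(r"\w{3,8}", "re"))     # False

# {2,} sets a minimum with no upper limit.
long_runs = re.findall(r"o{2,}", "so soo sooooo")    # ['oo', 'ooooo']
```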

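Finally, regular expressions allow you to store one or more of your matches in temporary variables by wrapping part of the pattern in parentheses. You can then refer to what was captured with back references (\1 for the first set of parentheses, \2 for the second, and so on), either later in the expression itself or in the replacement string. Here's how they work.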

Back References
- (.*) Matches entire string, stores it in a temp variable
  - Could be used to wrap a string in bold tag, <b>\1</b>
- (.+)\.php Matches a filename ending in .php, stores everything before the extension
  - Replacing with \1.cfm swaps the PHP extension for a CFM extension
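
Here's how those back references might look in practice. This is a sketch assuming Python's re module, where \1 works both in the replacement string and inside the pattern itself; other tools sometimes write it as $1. The sample strings are made up.

```python
import re

# Wrap a whole string in a bold tag; \1 is whatever the parentheses captured.
# (.+) is used here rather than (.*) so that an empty match at the end of
# the string cannot produce an extra, empty pair of tags.
bold = re.sub(r"(.+)", r"<b>\1</b>", "hello world")

# Keep the captured filename, swap the extension.
renamed = re.sub(r"(.+)\.php", r"\1.cfm", "index.php")

# A back reference inside the pattern itself: find a doubled word.
doubled = re.search(r"(\w+) \1", "it is is fine")

print(bold)              # <b>hello world</b>
print(renamed)           # index.cfm
print(doubled.group(0))  # is is
```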

Whew...fingers are tired. Now that we have a reference for the important aspects of regex, let's look at how we'd use them in real world examples.

Common regular expressions
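- ^[\w_]{3,16}$ Matches common usernames (3 to 16 letters, digits, or underscores)
  - The _ is redundant here because \w already includes it, so ^\w{3,16}$ does the same job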
- <myTag[^>]*>(.*?)</myTag> Matches an HTML/xHTML tag
  - By default .* is "greedy": it grabs as much as it can, so it would run all the way to the last </myTag> in the string. You probably would not want that. Putting a ? after the * turns the expression into a "lazy match" which gives you "just enough".
- [a-f0-9]{6} Matches a six digit hex color written in lowercase (use [a-fA-F0-9] or a case-insensitive flag to catch uppercase too)
- ([\w]+[-._+&])*[\w]+@([-\w]+[.])+[a-zA-Z]{2,6} Matches most email addresses
  - Note that this is absolutely not perfect, but should catch most common emails
- (([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? Matches most URLs
  - When matching URLs with the range of characters found in links these days, it's best to find something "off the shelf" as it were. There are just too many possibilities to try to catch everything. This should get you most of the way there however.
- [ \t]+$ Matches spaces and tabs at the end of a line (replace the match with nothing to remove them)
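
To round things off, here's a sketch that runs a few of the expressions above against sample input. It assumes Python's re module, and the usernames, markup, and CSS snippet are invented for the demonstration.

```python
import re

# Username check: 3 to 16 word characters and nothing else.
username = re.compile(r"^[\w_]{3,16}$")
valid_name = bool(username.search("andy_m"))   # True
invalid_name = bool(username.search("a!"))     # False

# Greedy versus lazy matching inside a tag.
html = "<myTag>one</myTag><myTag>two</myTag>"
greedy = re.findall(r"<myTag[^>]*>(.*)</myTag>", html)
lazy = re.findall(r"<myTag[^>]*>(.*?)</myTag>", html)
# greedy -> ['one</myTag><myTag>two']
# lazy   -> ['one', 'two']

# Hex colors: lowercase only, unless a case-insensitive flag is added.
css = "color: #ff8800; background: #FFFFFF;"
lower_hex = re.findall(r"[a-f0-9]{6}", css)
any_hex = re.findall(r"[a-f0-9]{6}", css, flags=re.IGNORECASE)
# lower_hex -> ['ff8800']
# any_hex   -> ['ff8800', 'FFFFFF']

# Strip trailing spaces and tabs from every line.
# re.MULTILINE makes $ match at the end of each line, not just the string.
messy = "x = 1 \t\ny = 2  \n"
clean = re.sub(r"[ \t]+$", "", messy, flags=re.MULTILINE)
# clean -> 'x = 1\ny = 2\n'
```

The greedy result is the one to watch out for: a single "match" that swallows both tags is almost never what you're after.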

I hope this will help you get started with regular expressions. Feel free to post your questions in the comments.