Regex is a powerful tool that can be used in your day-to-day programming tasks. Here are four tips on how to use regex in your programming work:

  1. Use regex to validate input. If you need to check the validity of user input, using regex can be a quick and easy way to do so. For example, you could use a regular expression to check for valid email addresses or phone numbers.
  2. Use regex to search for specific text patterns. You can use regex to find specific text patterns in your code or data files. For example, you could use it to find all instances of a certain word or phrase, or all occurrences of a certain pattern of characters.
  3. Use regex as part of an automated testing process. Test scripts, including those written for automated testing tools like Selenium, can use regular expressions to check that output or page content matches an expected pattern rather than one exact string. This way, you can verify that your code is working as expected even when the exact text varies.
  4. Use regex when debugging code problems. When something goes wrong with your code, using regular expressions can help you track down the source of the problem quickly and easily.

Regex, short for regular expression, is often used in programming languages for matching patterns in strings, find and replace, input validation, and reformatting text. Learning how to properly use Regex can make working with text much easier.
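As a rough illustration of those four uses, here is a minimal sketch using Python's built-in re module. The sample text and the date format are assumptions made up for this example, not something taken from a real system.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Order 66 shipped on 2023-04-05"

# Matching a pattern: find the first run of digits.
first_number = re.search(r"\d+", text).group()

# Find and replace: mask every digit with "#".
masked = re.sub(r"\d", "#", text)

# Input validation: the whole string must look like YYYY-MM-DD.
is_date = re.fullmatch(r"\d{4}-\d{2}-\d{2}", "2023-04-05") is not None

# Reformatting: reorder the captured year, month, and day.
reformatted = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", text)

print(first_number, masked, is_date, reformatted, sep="\n")
```

Note that the validation pattern only checks the shape of the string; it would still accept an impossible date such as 2023-99-99.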

Regex Syntax, Explained

Regex has a reputation for having horrendous syntax, but it’s much easier to write than it is to read. For example, here is a general regex for an RFC 5322-compliant email validator:

If you think it looks like someone smashed their face into the keyboard, you’re not alone. But under the hood, all of this mess is actually programming a finite-state machine. This machine steps through the input one character at a time, chugging along and matching based on rules you’ve set. Plenty of online tools will render railroad diagrams, showing how your Regex machine works. Here’s that same Regex in visual form:

Still very confusing, but it’s a lot more understandable. It’s a machine with moving parts that have rules defining how it all fits together. You can see how someone assembled this; it’s not just a big glob of text.

First Off: Use a Regex Debugger

Before we begin, unless your Regex is particularly short or you’re particularly proficient, you should use an online debugger when writing and testing it. It makes understanding the syntax much easier. We recommend Regex101 and RegExr, both of which offer testing and a built-in syntax reference.

How Does Regex Work?

For now, let’s focus on something much simpler. This is a diagram from Regulex for a very short (and definitely not RFC 5322 compliant) email-matching Regex:

The Regex engine starts at the left and travels down the lines, matching characters as it goes. Group #1 matches any character except a line break, and will continue to match characters until the next block finds a match. In this case, it stops when it reaches an @ symbol, which means Group #1 captures the name of the email address and everything after matches the domain.

The Regex that defines Group #1 in our email example is:

The parentheses define a capture group, which tells the Regex engine to include the contents of this group’s match in a special variable. When you run a Regex on a string, the default return is the entire match (in this case, the whole email). But it also returns each capture group, which makes this Regex useful for pulling names out of emails.
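To see the difference between the entire match and the capture groups, here is a small Python sketch. The pattern is a simplified stand-in in the spirit of the email example, and the address is hypothetical.

```python
import re

# Simplified pattern (an assumption for illustration): name, "@", then domain.
match = re.match(r"(.+)@(.+)", "jane@example.com")

whole = match.group(0)   # the entire match
name = match.group(1)    # contents of the first capture group
domain = match.group(2)  # contents of the second capture group

print(whole, name, domain)
```

In Python, group(0) is always the whole match, and the numbered groups follow the order of their opening parentheses.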

The period is the symbol for “Any Character Except Newline.” This matches everything on a line, so if you passed this email Regex an address like:

It would match %$#^&%*#%$#^ as the name, even though that’s ludicrous.

The plus (+) symbol is a control structure that means “match the preceding character or group one or more times.” It ensures that the whole name is matched, and not just the first character. This is what creates the loop found on the railroad diagram.
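The effect of the plus is easiest to see side by side. This sketch, using a made-up address, compares a group with and without the quantifier.

```python
import re

# Hypothetical address with a nonsense name, for illustration only.
address = "%$#^&@example.com"

# Without +, the group matches exactly one character.
one_char = re.match(r"(.)", address).group(1)

# With +, the group keeps matching until the "@" that follows it.
whole_name = re.match(r"(.+)@", address).group(1)

print(one_char, whole_name)
```

Because the period accepts any character except a newline, the nonsense name is matched without complaint.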

The rest of the Regex is fairly simple to decipher:

The first group stops when it hits the @ symbol. The next group then starts, which again matches multiple characters until it reaches a period character.

Because characters like periods, parentheses, and slashes are used as part of the syntax in Regex, anytime you want to match those characters you need to properly escape them with a backslash. In this example, to match the period we write \. and the parser treats it as one symbol meaning “match a period.”
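A short Python sketch of why escaping matters. The strings are invented for the example; re.escape is the standard-library helper that adds the backslashes for you.

```python
import re

# Unescaped: the period matches any character, so "X" slips through.
unescaped_hit = re.fullmatch(r"example.com", "exampleXcom") is not None

# Escaped: only a literal period is accepted.
escaped_hit = re.fullmatch(r"example\.com", "exampleXcom") is not None
escaped_ok = re.fullmatch(r"example\.com", "example.com") is not None

# re.escape escapes special characters in a literal string.
auto_escaped = re.escape("example.com")

print(unescaped_hit, escaped_hit, escaped_ok, auto_escaped)
```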

Character Matching

If you have non-control characters in your Regex, the Regex engine will assume those characters will form a matching block. For example, the Regex:

Will match the word “hello” with any number of e’s. Characters that have a special meaning in Regex syntax need to be escaped to be matched literally.
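As a sketch of literal characters forming a matching block, the following assumes a pattern of he+llo, which is a guess at the kind of Regex described here rather than a quote from the text. With +, at least one e is required.

```python
import re

# Assumed pattern: literal "h", one or more "e", then literal "llo".
pattern = re.compile(r"he+llo")

candidates = ["hello", "heeeello", "hllo", "help"]
results = [pattern.fullmatch(word) is not None for word in candidates]

print(dict(zip(candidates, results)))
```

Swapping the + for * would also accept "hllo", since * allows zero repetitions.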

Regex also has character classes, which act as shorthand for a set of characters. These can vary based on the Regex implementation, but these few are standard:

  . – matches any character except a newline.
  \w – matches any “word” character, including digits and underscores.
  \d – matches digits.
  \s – matches whitespace characters (i.e., space, tab, newline).

The last three all have uppercase counterparts that invert their function. For example, \D matches anything that isn’t a digit.
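Here is a brief Python sketch of the shorthand classes \w, \d, and \s, plus the inverted \D, run against a made-up sample string.

```python
import re

# Hypothetical sample string for illustration.
sample = "Room 42_b, floor 3"

words = re.findall(r"\w+", sample)       # runs of word characters
digits = re.findall(r"\d+", sample)      # runs of digits
spaces = re.findall(r"\s", sample)       # individual whitespace characters
non_digits = re.findall(r"\D+", sample)  # runs of anything except digits

print(words, digits, spaces, non_digits, sep="\n")
```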

Regex also has character-set matching. For example:

Will match either a, b, or c. This acts as one block, and the square brackets are just control structures. Alternatively, you can specify a range of characters:

Or negate the set, which will match any character that isn’t in the set:
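The three forms of character set can be sketched in Python as follows. The specific sets and the input string are assumptions chosen for illustration.

```python
import re

text = "abcdef"

# A plain set: matches a, b, or c.
either = re.findall(r"[abc]", text)

# A range: matches any character from a through e.
in_range = re.findall(r"[a-e]", text)

# A negated set: matches any character that is not a, b, or c.
negated = re.findall(r"[^abc]", text)

print(either, in_range, negated, sep="\n")
```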

Quantifiers

Quantifiers are an important part of Regex. They let you match strings where you don’t know the exact format, but you have a pretty good idea.

The + operator from the email example is a quantifier, specifically the “one or more” quantifier. If we don’t know how long a certain string is, but we know it’s made up of alphanumeric characters (and isn’t empty), we can write:
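One way to express "one or more alphanumeric characters" in Python is a character set followed by +. The exact set used here is an assumption; \w would also work if underscores are acceptable.

```python
import re

# One or more ASCII letters or digits, and nothing else.
token = re.compile(r"[A-Za-z0-9]+")

accepts_alnum = token.fullmatch("abc123") is not None
accepts_empty = token.fullmatch("") is not None
accepts_dash = token.fullmatch("abc-123") is not None

print(accepts_alnum, accepts_empty, accepts_dash)
```

The empty string is rejected because + requires at least one character.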

In addition to +, there’s also:

  The * operator, which matches “zero or more.” Essentially the same as +, except it has the option of not finding a match.
  The ? operator, which matches “zero or one.” It has the effect of making a character optional; either it’s there or it isn’t, and it won’t match more than once.
  Numerical quantifiers. These can be a single number like {3}, which means “exactly 3 times,” or a range like {3,6}. You can leave out the second number to make it unlimited. For example, {3,} means “3 or more times.” In many engines you can’t leave out the first number, so if you want “3 or fewer times,” write an explicit range such as {0,3}.
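The quantifiers above can be checked with a small Python sketch. The helper name matches is hypothetical, and the test strings are invented for the example.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole text matches the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

checks = {
    "star allows zero b's": matches(r"ab*", "a"),
    "plus needs one b": matches(r"ab+", "a"),
    "optional u absent": matches(r"colou?r", "color"),
    "optional u present": matches(r"colou?r", "colour"),
    "optional u twice": matches(r"colou?r", "colouur"),
    "exactly three digits": matches(r"\d{3}", "123"),
    "four digits vs {3}": matches(r"\d{3}", "1234"),
    "range {3,6}": matches(r"\d{3,6}", "12345"),
    "open-ended {3,}": matches(r"\d{3,}", "1234567"),
    "at most three via {0,3}": matches(r"\d{0,3}", "12"),
}

for label, result in checks.items():
    print(f"{label}: {result}")
```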

Greedy and Lazy Quantifiers

Under the hood, the * and + operators are greedy. They match as much as possible, and give back only what is needed for the next block to match. This can be a massive problem.
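A small Python sketch of that give-back behavior, using a made-up string of digits. The lazy variant, written by adding ? after the quantifier, is shown for contrast.

```python
import re

# Greedy: \d+ takes as many digits as it can, then gives back one
# so the second group still has a digit to match.
greedy = re.match(r"(\d+)(\d)", "12345").groups()

# Lazy: \d+? takes as few digits as it can, so the second group
# matches right after the first digit.
lazy = re.match(r"(\d+?)(\d)", "12345").groups()

print(greedy, lazy)
```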

Here’s an example: say you’re trying to match HTML, or anything else with opening and closing angle brackets. Your input text is:

And you want to match everything within the brackets. You may write something like:

This is the right idea, but it fails for one crucial reason: the Regex engine matches “div>Hello World