Ideal regex delimiters in PHP

We want to find regular expression delimiters that enable us to avoid inserting additional escape sequences into our patterns. This is ideal when we want to inject foreign patterns where we can't guarantee which characters will be used.

Regex grammar
Delimiters: the two slashes in /foo/i
Pattern: foo in /foo/i
Modifier: the trailing i in /foo/i

The regex grammar table above defines the terms used in this article. A metacharacter is any character that has special meaning in a regex pattern.

PCRE knows no delimiters

PHP provides regular expression support through the Perl Compatible Regular Expressions (PCRE) library, which is written in C. Perl's regex flavour is among the most feature-rich, supporting powerful constructs such as assertions, conditionals and recursion that make many problems easier to solve than in other flavours.

However, PCRE has no notion of modifiers appended to a pattern; the pattern and its options are passed to the compiler separately, as the following function signature shows. (Some options can also be switched on inside the pattern with inline syntax such as (?i), but that syntax is part of the pattern itself.)

pcre *pcre_compile(const char *pattern, int options, ...);  

It is up to implementations like PHP to decide how to expose modifiers to us. Most implementations map each option to a single letter, such as the i modifier, which sets the PCRE_CASELESS option to enable case-insensitive matching. Modifiers are typically appended to the pattern and separated from it by delimiters, but PCRE itself has no concept of delimiters.
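As a small illustration of how this surfaces in PHP, the sketch below runs the same pattern with and without the i modifier. The subject string is an arbitrary choice for the example.

```php
<?php
// Without a modifier the match is case-sensitive.
$sensitive = preg_match('/foo/', 'FOO');

// The trailing "i" is PHP's spelling of PCRE's caseless option.
$insensitive = preg_match('/foo/i', 'FOO');

var_dump($sensitive);    // int(0)
var_dump($insensitive);  // int(1)
```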

Delimiters in PHP

Most implementations use a pair of slashes to delimit patterns, for example, /pattern/. In PHP we may choose any non-alphanumeric, non-whitespace character other than backslash to delimit our pattern, with the crucial caveat that each occurrence of our delimiter within the pattern must be escaped by prefixing it with a backslash (\), as shown below.

  • /a/b/ – invalid
  • /a\/b/ – valid
  • #a/b# – valid
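The three cases above can be checked with a short sketch. The compiles() helper is hypothetical, written only for this example; it relies on preg_match returning false when PHP cannot compile an expression.

```php
<?php
// Hypothetical helper: true when PHP can compile the delimited expression.
function compiles(string $expression): bool
{
    // @ silences the warning PHP raises for a malformed expression.
    return @preg_match($expression, '') !== false;
}

var_dump(compiles('/a/b/'));   // bool(false): "b" is read as a modifier
var_dump(compiles('/a\/b/'));  // bool(true): the inner slash is escaped
var_dump(compiles('#a/b#'));   // bool(true): "/" is no longer the delimiter
```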

We'd like to choose a delimiter that will never appear in our pattern, so that we never have to insert unnecessary escapes. If we're hard-coding our pattern this should be possible, even if we have to change the delimiter a few times as our pattern evolves. However, if we need to insert a user-defined pattern, how can we ensure that our chosen delimiter will not conflict? The solution lies in asymmetric delimiters.

Asymmetric delimiters

We may never know why PHP is unable to detect the last occurrence of our delimiter without requiring us to escape every occurrence before it, but fortunately there is another solution. PHP inherits Perl's bracket delimiter syntax, giving us four asymmetric delimiters to choose from:

  • (pattern)i – round brackets
  • [pattern]i – square brackets
  • {pattern}i – curly brackets
  • <pattern>i – angled brackets

Angled brackets

Round, square and curly brackets are all metacharacters in patterns, although they carry no special meaning when used as delimiters. Angled brackets are not metacharacters, so we might expect to use them unescaped in patterns delimited by angled brackets.

  • <>> – invalid
  • <\>> – valid
  • <<> – invalid
  • <\<> – valid
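A minimal sketch for reproducing the list above follows. It treats an expression as valid when preg_match does not return false; the $results array is just a name chosen for this example.

```php
<?php
// Angle-bracket-delimited expressions taken from the list above.
$expressions = ['<>>', '<\>>', '<<>', '<\<>'];

$results = [];
foreach ($expressions as $expression) {
    // @ silences the compile warning; false means PHP rejected the expression.
    $results[$expression] = @preg_match($expression, '') !== false;
    printf("%-5s %s\n", $expression, $results[$expression] ? 'valid' : 'invalid');
}
```

Only the expressions in which the lone bracket is escaped should be reported as valid.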

Unfortunately, a lone angled bracket inside the pattern has to be escaped when angled brackets are the delimiters, just as with symmetric delimiters, except that now we've inherited the burden of escaping two characters instead of just one! Angled brackets are a step backward, but the other three bracket types offer something different.

Non-angled brackets

When we use non-angled brackets as our delimiters, we can still use the same bracket type as normal inside the pattern, without escaping it and without changing its meaning, provided the brackets in the pattern pair up, as they do in ordinary grouping, character-class and quantifier syntax. This is exactly what we've been searching for! Each of the following expressions matches the character a once.

  • ((a)) – valid
  • [[a]] – valid
  • {a{1}} – valid
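The three expressions above can be tried directly, as in the sketch below. The last two lines add one caveat that is an observation about how PHP pairs bracket delimiters, not something stated in the list: an unpaired closing bracket, such as a literal } inside a character class, still needs a backslash when curly brackets are the delimiters.

```php
<?php
// Each delimiter pair wraps a pattern that uses the same bracket type.
$round  = preg_match('((a))', 'a');   // pattern: (a)
$square = preg_match('[[a]]', 'a');   // pattern: [a]
$curly  = preg_match('{a{1}}', 'a');  // pattern: a{1}

var_dump($round, $square, $curly);    // int(1) int(1) int(1)

// Caveat: an unpaired "}" ends the pattern early unless it is escaped.
$unbalanced = @preg_match('{[}]}', '}');   // false: the pattern ends at the first "}"
$escaped    = preg_match('{[\}]}', '}');   // pattern: [\}]

var_dump($unbalanced, $escaped);      // bool(false) int(1)
```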

The ideal delimiter

We have been searching for a delimiter that does not require us to insert additional escapes into our pattern and any of the non-angled brackets serve this purpose, but we still have to choose one of them, so which is best?

Technically each option is as good as the others, but curly brackets are the least common metacharacter of the three and, as a quantifier, cannot appear at the start of an expression, so they can only double up at the end. This sets curly brackets apart as the most suitable delimiters to use in PHP.
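To tie this back to the original goal of injecting a foreign pattern, here is a minimal sketch. The delimit() helper, the sample pattern and the subject string are all hypothetical and chosen only for illustration. The foreign pattern contains both / and #, two common delimiter choices, yet needs no escaping.

```php
<?php
// Hypothetical helper: wraps a foreign pattern in curly-bracket delimiters.
function delimit(string $pattern, string $modifiers = ''): string
{
    return '{' . $pattern . '}' . $modifiers;
}

// A foreign pattern containing "/" and "#", left exactly as supplied.
$foreign = 'https?://[^/#]+/';

$expression = delimit($foreign, 'i');  // {https?://[^/#]+/}i
var_dump(preg_match($expression, 'HTTP://example.com/'));  // int(1)
```

This sketch assumes any curly brackets in the foreign pattern are balanced or escaped; it does not validate the pattern in any other way.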

Conclusion

Three different asymmetric delimiters allow us to avoid extra escaping: round brackets (), square brackets [] and curly brackets {}. As the least used metacharacter, less likely to be confused with other parts of a pattern, curly brackets may be the ideal delimiters for most expressions.

