Irregular Expressions

If you’ve worked extensively with strings, you’ve probably encountered regular expressions (abbreviated “regex” or “regexp”, depending on your political party and the current moon phase—I’ll use regexp below).

Regular expressions are a popular and powerful tool for manipulating, validating, and parsing strings. You can use regular expressions to extract phone numbers, validate email addresses, and inflict pain and suffering to future maintainers of your code (including you!).

But regular expressions are a lot more powerful than you think! This post will describe an example of a way you can use regexps to perform computation, in particular, determining whether a number is prime ¹. (This post assumes familiarity with basic regular expression concepts. If you’re not acquainted with regexps, a decent introduction to them lives here.)

1. Primality testing
2. Explanation
3. Closing remarks

1 Primality testing

I recently came across a regexp-based test for prime numbers², which is what inspired this blog post.

In Perl:

sub is_prime {
  my ($number) = @_;
  return (1 x $number) !~ m/\A (?: 1? | (11+?) (?> \1+ ) ) \z/xms;
}

print join ' ', grep { is_prime($_) } 0..100;

The above program produces a list of primes up to 100, as expected:

2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97

2 Explanation

How the hell does a regular expression, which simply matches some string pattern, calculate whether or not a number is prime? Let’s dissect the code, and we’ll see that it’s equivalent to a naive trial division algorithm.

Here’s a simplified and annotated version of the above code.

sub is_prime {
  my ($number) = @_;
  return (1 x $number)          # a string with '1' repeated $number times
    !~                          # negated regexp match operator
    m{
       \A                       # beginning of string
       (?:                      # non-capturing group
         1?                     # zero or one '1'
       |
         (11+)                  # two or more '1's
         \1+                    # one or more repetitions of previous group
       )
       \z                       # end of string
   }xms;                        # eXtended syntax, allows comments and spaces
}

print join ' ', grep { is_prime($_) } 0..100;

First, we construct a string consisting of $number occurrences of the digit 1. We could have selected any other character instead of 1; what’s important is that the string has length $number.

The regexp itself matches any string of 1’s whose length is zero, one, or a composite number. Therefore, we use Perl’s !~ operator, which returns a true value if the input string does not match the pattern. That’ll tell us whether or not $number is prime.

2.1 Regexp constructs used

\A and \z specify that the match operator should return true only when the entire string fits the pattern, instead of the default behavior, which is to return true if any substring of the input fits the pattern.

1? is straightforward; it matches exactly zero or one occurrences of 1.

(11+) matches two or more occurrences of 1. Because of the parentheses, it also captures the matching substring, which can then be referred to later in the pattern using the \1 syntax (1 because it’s the first capturing group).

\1+ then matches one or more occurrences of whatever was matched by (11+?).

2.2 Intuition

The length of the substring captured by (11+?) acts like a trial divisor. The left-hand side of the alternation bar | is fairly simple. It takes care of the edge cases of 0 and 1, which are not prime or composite.

(11+) acts as a trial divisor. It matches a substring of two or more 1’s, capturing however many it matched into group 1. The length of that substring is divided into the rest of the string.

\1+ tries to match a substring whose length is some whole number multiple of the trial divisor. If it reaches exactly the end of the string (of length $number), then $number must be divisible by the trial divisor.

In total, the right-hand side of the alternation bar | ends up matching two or more occurrences of the trial divisor string. Therefore, if it matches the whole string, then $number must be composite.

2.3 Examples

Let’s see how the regexp would work on a couple of example inputs.

2.3.1 n = 1

'1' !~ m/\A (?: 1? | (11+) \1+ ) \z/xms;

The string '1', as well as the empty string, matches the left side of the alternation bar, 1?. This is the special case (neither prime nor composite).

2.3.2 n = 9 (composite)

'111111111' !~ m/\A (?: 1? | (11+) \1+ ) \z/xms;

We know the left-hand side of the alternation bar won’t match, so let’s focus on the right-hand side: (11+)\1+.

For the capture group, (11+), the regexp engine will match it with the first two 1’s, then the first three, the first four, etc. For each prefix of the string matched by (11+), the engine will attempt to match the remainder of the string against \1+.

When the engine attempts to match the pattern with (11+) corresponding to 11, it will then try to match the rest of the string with (11)+, i.e., some positive even number of 1’s. It will not be able to do so; after consuming the first 8 characters of the input, the regexp engine will not be able to match another occurrence of (11). However, it won’t be at the end of the string yet, so the \z assertion fails and the match must backtrack.

When the engine attempts to match the pattern with (11+) corresponding to 111, it will try to match the remaining 6 characters against (111)+, and will succeed because 3 divides 6 evenly. Therefore, the regular expression matches and the !~ operator returns a false value, telling us that the number 9 is composite.

2.3.3 n = 7 (prime)

'1111111' !~ m/\A (?: 1? | (11+) \1+ ) \z/xms;

Again, we’ll focus on the right-hand side of the alternation bar.

(11+) matches prefixes of length 2 or greater. For each of these “divisors”, the engine will attempt to match one or more additional occurrences of the prefix until the end of the string. Since none of (2, 3, …, 6) divide evenly into 7, they cannot match the rest of the pattern successfully.

How about the prefix of length 7, i.e., the entire string being matched by (11+)? Well, since \1+ requires at least one more occurrence of the captured group to appear before the end of the string, the engine will reject the input even though 7 divides itself evenly.

2.4 Performance enhancements

Let’s return to the original version of the code:

sub is_prime {
  my ($number) = @_;
  return (1 x $number) !~ m/\A (?: 1? | (11+?) (?> \1+ ) ) \z/xms;
}

print join ' ', grep { is_prime($_) } 0..100;

In (11+?), the ? modifies the + quantifier, causing it to become “non-greedy”. This means the regexp engine will try to match as few occurrences of the preceding pattern as possible. With this modifier, the engine will try using a divisor of two, then three, then four, and so on increasing, rather than decreasing from $number down to two (which is the default behavior for +). Since a composite number is much more likely to be divisible by 2 than some large divisor, it is much more cost-effective to start small and work upwards.

(?> ) prevents the subpattern contained within from backtracking. For example, in (?> \1+ ), the regexp engine will match as many occurrences of \1 as possible. If it fails to match the remainder of the pattern, it won’t retry the match with fewer occurrences, since that would be futile–it definitely won’t reach the end of the string (\z) after matching fewer characters than before!

With these performance enhancements in place, the regexp-based primality test enjoys the speed of a sensible implementation of a naïve trial-division algorithm, if not a more sophisticated primality testing algorithm.

3 Closing remarks

Regular expressions are more powerful than you think. In fact, they’re strictly more powerful than what mathematicians formally define as “regular expressions”, since they include features like backreferences (\1) and look-ahead/-behind ((?=)).

That being said, never use regular expressions to test primality.

Footnotes:

This material is provided for entertainment purposes only. Never ever do this, except to show off. Please.

Damian Conway, Perl Best Practices. Page 235.

Irregular Expressions

Table of Contents

1 Primality testing

2 Explanation

2.1 Regexp constructs used

2.2 Intuition

2.3 Examples

2.3.1 n = 1

2.3.2 n = 9 (composite)

2.3.3 n = 7 (prime)

2.4 Performance enhancements

3 Closing remarks

Footnotes:

Comments

Table of Contents

1 Primality testing

2 Explanation

2.1 Regexp constructs used

2.2 Intuition

2.3 Examples

2.3.1 n = 1

2.3.2 n = 9 (composite)

2.3.3 n = 7 (prime)

2.4 Performance enhancements

3 Closing remarks

Footnotes:

Comments

Related Posts

Emacs Plugins in OCaml: Brain-what?! (part 2) 2017-11-12

Emacs Plugins in OCaml: Ecaml API Overview (part 3) 2017-11-12

Emacs Plugins in OCaml: Putting It All Together (part 4) 2017-11-19