Irregular Expressions
Published: Last updated:· Tags: perl · Category: uncategorized
If you’ve worked extensively with strings, you’ve probably encountered regular expressions (abbreviated “regex” or “regexp”, depending on your political party and the current moon phase—I’ll use regexp below).
Regular expressions are a popular and powerful tool for manipulating, validating, and parsing strings. You can use regular expressions to extract phone numbers, validate email addresses, and inflict pain and suffering to future maintainers of your code (including you!).
But regular expressions are a lot more powerful than you think! This post will describe an example of a way you can use regexps to perform computation, in particular, determining whether a number is prime 1. (This post assumes familiarity with basic regular expression concepts. If you’re not acquainted with regexps, a decent introduction to them lives here.)
Table of Contents
1 Primality testing
I recently came across a regexp-based test for prime numbers2, which is what inspired this blog post.
In Perl:
sub is_prime { my ($number) = @_; return (1 x $number) !~ m/\A (?: 1? | (11+?) (?> \1+ ) ) \z/xms; } print join ' ', grep { is_prime($_) } 0..100;
The above program produces a list of primes up to 100, as expected:
2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 59 61 67 71 73 79 83 89 97
2 Explanation
How the hell does a regular expression, which simply matches some string pattern, calculate whether or not a number is prime? Let’s dissect the code, and we’ll see that it’s equivalent to a naive trial division algorithm.
Here’s a simplified and annotated version of the above code.
sub is_prime { my ($number) = @_; return (1 x $number) # a string with '1' repeated $number times !~ # negated regexp match operator m{ \A # beginning of string (?: # non-capturing group 1? # zero or one '1' | (11+) # two or more '1's \1+ # one or more repetitions of previous group ) \z # end of string }xms; # eXtended syntax, allows comments and spaces } print join ' ', grep { is_prime($_) } 0..100;
First, we construct a string consisting of $number
occurrences of the digit
1
. We could have selected any other character instead of 1
; what’s
important is that the string has length $number
.
The regexp itself matches any string of 1
’s whose length is zero, one, or a
composite number. Therefore, we use Perl’s !~
operator, which returns a true
value if the input string does not match the pattern. That’ll tell us
whether or not $number
is prime.
2.1 Regexp constructs used
\A
and \z
specify that the match operator should return true only when
the entire string fits the pattern, instead of the default behavior, which is
to return true if any substring of the input fits the pattern.
1?
is straightforward; it matches exactly zero or one occurrences of 1
.
(11+)
matches two or more occurrences of 1
. Because of the parentheses,
it also captures the matching substring, which can then be referred to
later in the pattern using the \1
syntax (1 because it’s the first
capturing group).
\1+
then matches one or more occurrences of whatever was matched by
(11+?)
.
2.2 Intuition
The length of the substring captured by (11+?)
acts like a trial divisor.
The left-hand side of the alternation bar |
is fairly simple. It takes care
of the edge cases of 0 and 1, which are not prime or composite.
(11+)
acts as a trial divisor. It matches a substring of two or more 1
’s,
capturing however many it matched into group 1. The length of that substring
is divided into the rest of the string.
\1+
tries to match a substring whose length is some whole number multiple
of the trial divisor. If it reaches exactly the end of the string (of length
$number
), then $number
must be divisible by the trial divisor.
In total, the right-hand side of the alternation bar |
ends up matching two
or more occurrences of the trial divisor string. Therefore, if it matches the
whole string, then $number
must be composite.
2.3 Examples
Let’s see how the regexp would work on a couple of example inputs.
2.3.1 n = 1
'1' !~ m/\A (?: 1? | (11+) \1+ ) \z/xms;
The string '1'
, as well as the empty string, matches the left side of the
alternation bar, 1?
. This is the special case (neither prime nor
composite).
2.3.2 n = 9 (composite)
'111111111' !~ m/\A (?: 1? | (11+) \1+ ) \z/xms;
We know the left-hand side of the alternation bar won’t match, so let’s
focus on the right-hand side: (11+)\1+
.
For the capture group, (11+)
, the regexp engine will match it with the
first two 1
’s, then the first three, the first four, etc. For each prefix
of the string matched by (11+)
, the engine will attempt to match the
remainder of the string against \1+
.
When the engine attempts to match the pattern with (11+)
corresponding to
11
, it will then try to match the rest of the string with (11)+
, i.e.,
some positive even number of 1
’s. It will not be able to do so; after
consuming the first 8 characters of the input, the regexp engine will not be
able to match another occurrence of (11)
. However, it won’t be at the end
of the string yet, so the \z
assertion fails and the match must backtrack.
When the engine attempts to match the pattern with (11+)
corresponding to
111
, it will try to match the remaining 6 characters against (111)+
, and
will succeed because 3 divides 6 evenly. Therefore, the regular expression
matches and the !~
operator returns a false value, telling us that the
number 9 is composite.
2.3.3 n = 7 (prime)
'1111111' !~ m/\A (?: 1? | (11+) \1+ ) \z/xms;
Again, we’ll focus on the right-hand side of the alternation bar.
(11+)
matches prefixes of length 2 or greater. For each of these
“divisors”, the engine will attempt to match one or more additional
occurrences of the prefix until the end of the string. Since none of (2, 3,
…, 6) divide evenly into 7, they cannot match the rest of the pattern
successfully.
How about the prefix of length 7, i.e., the entire string being matched by
(11+)
? Well, since \1+
requires at least one more occurrence of the
captured group to appear before the end of the string, the engine will
reject the input even though 7 divides itself evenly.
2.4 Performance enhancements
Let’s return to the original version of the code:
sub is_prime { my ($number) = @_; return (1 x $number) !~ m/\A (?: 1? | (11+?) (?> \1+ ) ) \z/xms; } print join ' ', grep { is_prime($_) } 0..100;
In (11+?)
, the ?
modifies the +
quantifier, causing it to become
“non-greedy”. This means the regexp engine will try to match as few
occurrences of the preceding pattern as possible. With this modifier, the
engine will try using a divisor of two, then three, then four, and so on
increasing, rather than decreasing from $number
down to two (which is the
default behavior for +
). Since a composite number is much more likely to be
divisible by 2 than some large divisor, it is much more cost-effective to
start small and work upwards.
(?> )
prevents the subpattern contained within from backtracking. For
example, in (?> \1+ )
, the regexp engine will match as many occurrences of
\1
as possible. If it fails to match the remainder of the pattern, it won’t
retry the match with fewer occurrences, since that would be futile–it
definitely won’t reach the end of the string (\z
) after matching fewer
characters than before!
With these performance enhancements in place, the regexp-based primality test enjoys the speed of a sensible implementation of a naïve trial-division algorithm, if not a more sophisticated primality testing algorithm.
3 Closing remarks
Regular expressions are more powerful than you think. In fact, they’re
strictly more powerful than what mathematicians formally define as “regular
expressions”, since they include features like backreferences (\1
) and
look-ahead/-behind ((?=)
).
That being said, never use regular expressions to test primality.