
Greedy Gobbling Regex Capture

By Matt Polito

Ever get a regex capture that wasn't exactly what you were expecting? Not able to figure out why? Well, if you happen to have used a quantifier (such as the * in .*), this could be your issue.

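By default, quantifiers are of the 'greedy' variety. They will try to match as much as possible, even swallowing text you meant for a later part of your regex, and they only give characters back when the rest of the pattern would otherwise fail.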
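Here is a minimal sketch of that default behavior. It uses Python's `re` module purely for illustration (my choice, not something this post depends on), and the pattern and sample string are made up for the demo:

```python
import re

# Two captures: "anything", followed by "one or more digits".
demo = re.match(r"(.+)(\d+)", "abc123")

# The greedy .+ first swallows the whole string, then gives back one
# character at a time until \d+ can match. A single digit is enough
# for \d+, so that is all .+ gives up.
anything = demo.group(1)
digits = demo.group(2)

print(anything)  # abc12
print(digits)    # 3
```

You might have expected `abc` and `123`, but the greedy `.+` reaches into the digits and leaves the second capture with the bare minimum.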

For example, I will be trying to break apart a string that includes a movie title and a service where it is available digitally.

Why Him UVHDX or itunes HD via MA

Interestingly I won't know specifics of the title but I can break it apart by possible service types in the string. So we'll start the regex like so:

/^(?<title>.+)\s(?<type>(?:UVHDX|itunes).*)$/i

This says: start at the beginning of the string, and everything between there and the whitespace before the type match is the title. Everything after that whitespace is the type. (The i flag makes the whole pattern case insensitive.)

Note: I am using named captures for clarity

Whelp... that sorta happens.

Our result is:

title: Why Him UVHDX or
type: itunes HD via MA

As you can see we're close but not exactly there. Why is that? Upon review of the regex, everything seems to make sense. However this is where greedy quantifiers come into play.

The first quantifier, the .+ in the title capture, grabs more of the string than we might think. The match still succeeds because itunes is also a valid start for the type capture, so the engine has no reason to give back any more of the string.
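To see this outside of a regex tester, here is a rough Python sketch of the same pattern. Python is my choice for illustration, and its `re` module spells named captures `(?P<name>...)`, so that part of the syntax differs from the regex above. The `split_points` helper is hypothetical and only there to show that two splits are valid:

```python
import re

text = "Why Him UVHDX or itunes HD via MA"

# Same pattern as in the post, using Python's (?P<name>...) named groups.
greedy_pattern = re.compile(
    r"^(?P<title>.+)\s(?P<type>(?:UVHDX|itunes).*)$", re.IGNORECASE
)
greedy_match = greedy_pattern.match(text)

greedy_title = greedy_match.group("title")
greedy_type = greedy_match.group("type")
print(greedy_title)  # Why Him UVHDX or
print(greedy_type)   # itunes HD via MA

# Every space that is directly followed by a service name is a valid
# place to split the string. The greedy .+ settles on the last one.
split_points = [
    m.start() for m in re.finditer(r"\s(?=UVHDX|itunes)", text, re.IGNORECASE)
]
print(split_points)  # [7, 16]
```

The greedy `.+` starts from the end of the string and backs up, so it stops at the space at index 16 and never considers the one at index 7.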

How do we remedy this?

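Luckily, there are options to make our quantifiers lazy. We have a global way, passing U (Ungreedy) as a flag in engines that support it, such as PCRE (be aware that it flips the default for every quantifier in the pattern, and that many engines have no such flag)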

/^(?<title>.+)\s(?<type>(?:UVHDX|itunes).*)$/iU
title: Why Him
type: UVHDX or itunes HD via MA

or a local way, adding ? directly after the quantifier. (Be aware that ? in a regex can mean MANY different things depending on where it is placed: YOU'VE BEEN WARNED)
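To make that warning concrete, here is a short sketch of a few common meanings of `?`, again in Python's `re` module as an illustration. The sample strings are made up, and the list is not exhaustive:

```python
import re

# 1. On its own after an item: that item is optional (zero or one).
optional = re.findall(r"colou?r", "color colour")
print(optional)   # ['color', 'colour']

# 2. Directly after another quantifier: makes that quantifier lazy.
lazy = re.match(r"a+?", "aaa").group(0)
print(lazy)       # a

# 3. Directly after "(": introduces special group syntax, here a
#    non-capturing group (?:...) and a lookahead (?=...).
grouped = re.match(r"(?:ab)+(?=c)", "ababc").group(0)
print(grouped)    # abab

# 4. Escaped with a backslash: a literal question mark.
literal = re.search(r"why\?", "why?").group(0)
print(literal)    # why?
```

Only the second meaning has anything to do with greediness, which is why the position of the `?` matters so much.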

In this case we want to affect the title's quantifier, so we'll add a ? after the .+ in the title capture.

/^(?<title>.+?)\s(?<type>(?:UVHDX|itunes).*)$/i
title: Why Him
type: UVHDX or itunes HD via MA
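The same fix can be sketched in Python, assuming its `re` module and its `(?P<name>...)` spelling for named captures. The second input string is invented here just to show the pattern on a title with a single service:

```python
import re

# Lazy title capture: .+? grows one character at a time and stops at the
# first whitespace that is followed by a known service name.
lazy_pattern = re.compile(
    r"^(?P<title>.+?)\s(?P<type>(?:UVHDX|itunes).*)$", re.IGNORECASE
)

lazy_match = lazy_pattern.match("Why Him UVHDX or itunes HD via MA")
lazy_title = lazy_match.group("title")
lazy_type = lazy_match.group("type")
print(lazy_title)  # Why Him
print(lazy_type)   # UVHDX or itunes HD via MA

# A made-up second input with only one service in it.
other_match = lazy_pattern.match("Moana itunes HD")
other_title = other_match.group("title")
other_type = other_match.group("type")
print(other_title)  # Moana
print(other_type)   # itunes HD
```

Note that the `.*` in the type capture stays greedy. That is fine here because the `$` anchor forces it to run to the end of the string either way.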

There we go... now I'm getting the match I would expect.

If you'd like to play around, I've included this example so you can see exactly what is happening.


Photo by Emily Morter
