I don't really understand regular expressions. Can you explain them to me in an easy-to-follow manner? If there are any online tools or books, could you also link to them?
Regex – Learning Regular Expressions
Related Solutions
/
^ # start of string
( # first group start
(?:
(?:[^?+*{}()[\]\\|]+ # literals and ^, $
| \\. # escaped characters
| \[ (?: \^?\\. | \^[^\\] | [^\\^] ) # character classes
(?: [^\]\\]+ | \\. )* \]
| \( (?:\?[:=!]|\?<[=!]|\?>)? (?1)?? \) # parenthesis, with recursive content
| \(\? (?:R|[+-]?\d+) \) # recursive matching
)
(?: (?:[?+*]|\{\d+(?:,\d*)?\}) [?+]? )? # quantifiers
| \| # alternative
)* # repeat content
) # end first group
$ # end of string
/
This is a recursive regex, and is not supported by many regex engines. PCRE based ones should support it.
Without whitespace and comments:
/^((?:(?:[^?+*{}()[\]\\|]+|\\.|\[(?:\^?\\.|\^[^\\]|[^\\^])(?:[^\]\\]+|\\.)*\]|\((?:\?[:=!]|\?<[=!]|\?>)?(?1)??\)|\(\?(?:R|[+-]?\d+)\))(?:(?:[?+*]|\{\d+(?:,\d*)?\})[?+]?)?|\|)*)$/
.NET does not support recursion directly (the (?1) and (?R) constructs). The recursion would have to be converted to counting balanced groups:
^ # start of string
(?:
(?: [^?+*{}()[\]\\|]+ # literals and ^, $
| \\. # escaped characters
| \[ (?: \^?\\. | \^[^\\] | [^\\^] ) # character classes
(?: [^\]\\]+ | \\. )* \]
| \( (?:\?[:=!]
| \?<[=!]
| \?>
| \?<[^\W\d]\w*>
| \?'[^\W\d]\w*'
)? # opening of group
(?<N>) # increment counter
| \) # closing of group
(?<-N>) # decrement counter
)
(?: (?:[?+*]|\{\d+(?:,\d*)?\}) [?+]? )? # quantifiers
| \| # alternative
)* # repeat content
$ # end of string
(?(N)(?!)) # fail if counter is non-zero.
Compacted:
^(?:(?:[^?+*{}()[\]\\|]+|\\.|\[(?:\^?\\.|\^[^\\]|[^\\^])(?:[^\]\\]+|\\.)*\]|\((?:\?[:=!]|\?<[=!]|\?>|\?<[^\W\d]\w*>|\?'[^\W\d]\w*')?(?<N>)|\)(?<-N>))(?:(?:[?+*]|\{\d+(?:,\d*)?\})[?+]?)?|\|)*$(?(N)(?!))
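Engines without recursion or balancing groups (JavaScript is one) cannot run either pattern as written. A commonly used alternative, which is a different technique from the one in the answer above, is to let the engine validate the pattern by trying to compile it. A minimal sketch, where isValidRegex is a hypothetical helper name and "valid" means valid for the JavaScript flavor only:

```javascript
// Hypothetical helper (not from the answer above): checks whether a string
// is a valid pattern for the JavaScript engine by trying to compile it.
function isValidRegex(source) {
  try {
    new RegExp(source); // throws SyntaxError if the pattern is malformed
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err; // anything else is unexpected, so re-raise it
  }
}

console.log(isValidRegex("^abcdefg$")); // well-formed pattern
console.log(isValidRegex("(abc"));      // unbalanced parenthesis
```

This only answers "will this engine accept it?", so a pattern that is valid in PCRE but not in JavaScript would be rejected.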
From the comments:
Will this validate substitutions and translations?
It will validate just the regex part of substitutions and translations. s/<this part>/.../
It is not theoretically possible to match all valid regex grammars with a regex.
It is possible if the regex engine supports recursion, such as PCRE, but that can't really be called regular expressions any more.
Indeed, a "recursive regular expression" is not a regular expression. But this is an often-accepted extension to regex engines... Ironically, this extended regex doesn't match extended regexes.
"In theory, theory and practice are the same. In practice, they're not." Almost everyone who knows regular expressions knows that regular expressions do not support recursion. But PCRE and most other implementations support much more than basic regular expressions.
Using this in a shell script with the grep command gives me an error: grep: Invalid content of {}. I am making a script that could grep a code base to find all the files that contain regular expressions.
This pattern exploits an extension called recursive regular expressions. This is not supported by the POSIX flavor of regex. You could try with the -P switch, to enable the PCRE regex flavor.
Regex itself "is not a regular language and hence cannot be parsed by regular expression..."
This is true for classical regular expressions. Some modern implementations allow recursion, which makes it into a Context Free language, although it is somewhat verbose for this task.
I see where you're matching []()/\. and other special regex characters. Where are you allowing non-special characters? It seems like this will match ^(?:[\.]+)$, but not ^abcdefg$. That's a valid regex.
[^?+*{}()[\]\\|] will match any single character that is not part of any of the other constructs. This includes both literals (a - z) and certain special characters (^, $, .).
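That claim is easy to check by hand. The sketch below assumes JavaScript, which accepts this character class as written when the v flag is not used; the name plainChar and the anchors around the class are additions for the test, not part of the original pattern.

```javascript
// The character class quoted above, anchored so it must match exactly one character.
const plainChar = /^[^?+*{}()[\]\\|]$/;

// Literals and the specials ^, $, . fall through to this class...
for (const ch of ["a", "z", "^", "$", "."]) {
  console.log(JSON.stringify(ch), plainChar.test(ch)); // true for each
}

// ...while characters handled by the other branches are excluded.
for (const ch of ["?", "+", "*", "(", "[", "\\", "|"]) {
  console.log(JSON.stringify(ch), plainChar.test(ch)); // false for each
}
```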
You can use the filter
function to apply more complicated regex matching.
Here's an example which would just match the first three divs:
$('div')
  .filter(function() {
    return this.id.match(/abc+d/);
  })
  .html("Matched!");
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
<div id="abcd">Not matched</div>
<div id="abccd">Not matched</div>
<div id="abcccd">Not matched</div>
<div id="abd">Not matched</div>
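To see which ids that pattern accepts without loading a page, here is a DOM-free sketch of the same filtering idea. It uses Array.prototype.filter rather than jQuery's filter, and the names ids, idPattern and matchedIds are made up for illustration.

```javascript
// The ids of the four divs in the snippet above.
const ids = ["abcd", "abccd", "abcccd", "abd"];

// Same pattern as in the jQuery example: "ab", one or more "c", then "d".
const idPattern = /abc+d/;

const matchedIds = ids.filter((id) => idPattern.test(id));
console.log(matchedIds); // [ 'abcd', 'abccd', 'abcccd' ]
```

The last id, abd, is rejected because c+ needs at least one c.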
Best Solution
The most important part is the concepts. Once you understand how the building blocks work, differences in syntax amount to little more than mild dialects. A layer on top of your regular expression engine's syntax is the syntax of the programming language you're using. Languages such as Perl remove most of this complication, but you'll have to keep in mind other considerations if you're using regular expressions in a C program.
If you think of regular expressions as building blocks that you can mix and match as you please, it helps you learn not only how to write and debug your own patterns but also how to understand patterns written by others.
Start simple
Conceptually, the simplest regular expressions are literal characters. The pattern N matches the character 'N'.
Regular expressions next to each other match sequences. For example, the pattern Nick matches the sequence 'N' followed by 'i' followed by 'c' followed by 'k'.
If you've ever used grep on Unix, even if only to search for ordinary-looking strings, you've already been using regular expressions! (The re in grep refers to regular expressions.)
Order from the menu
Adding just a little complexity, you can match either 'Nick' or 'nick' with the pattern [Nn]ick. The part in square brackets is a character class, which means it matches exactly one of the enclosed characters. You can also use ranges in character classes, so [a-c] matches either 'a' or 'b' or 'c'.
The pattern . is special: rather than matching a literal dot only, it matches any character†. It's the same conceptually as the really big character class [-.?+%$A-Za-z0-9...].
Think of character classes as menus: pick just one.
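If you want to try these first blocks yourself, here is a small sketch in JavaScript. The sample strings and the variable names are made up for illustration; the patterns are the ones discussed above.

```javascript
// Literal characters in sequence: matches "Nick" anywhere in the input.
const literal = /Nick/;
console.log(literal.test("Hi, Nick!")); // true
console.log(literal.test("Hi, nick!")); // false: 'n' is not 'N'

// Character class: exactly one of the enclosed characters.
const eitherCase = /[Nn]ick/;
console.log(eitherCase.test("Hi, nick!")); // true

// Range inside a class: [a-c] offers the same menu as [abc].
const rangeHits = "abcd".match(/[a-c]/g);
console.log(rangeHits); // [ 'a', 'b', 'c' ]

// Dot matches any single character (with the newline caveat from the footnote).
const anyMiddle = /N.ck/;
console.log(anyMiddle.test("Nock")); // true
```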
Helpful shortcuts
Using . can save you lots of typing, and there are other shortcuts for common patterns. Say you want to match a digit: one way to write that is [0-9]. Digits are a frequent match target, so you could instead use the shortcut \d. Others are \s (whitespace) and \w (word characters: alphanumerics or underscore).
The uppercased variants are their complements, so \S matches any non-whitespace character, for example.
Once is not enough
From there, you can repeat parts of your pattern with quantifiers. For example, the pattern ab?c matches 'abc' or 'ac' because the ? quantifier makes the subpattern it modifies optional. Other quantifiers are
- * (zero or more times)
- + (one or more times)
- {n} (exactly n times)
- {n,} (at least n times)
- {n,m} (at least n times but no more than m times)
Putting some of these blocks together, the pattern [Nn]*ick matches all of
- ick
- Nick
- nick
- Nnick
- nNick
- nnick
- (and so on)
The first match demonstrates an important lesson: * always succeeds! Any pattern can match zero times.
A few other useful examples:
- [0-9]+ (and its equivalent \d+) matches any non-negative integer
- \d{4}-\d{2}-\d{2} matches dates formatted like 2019-01-01
Grouping
A quantifier modifies the pattern to its immediate left. You might expect 0abc+0 to match '0abc0', '0abcabc0', and so forth, but the pattern immediately to the left of the plus quantifier is c. This means 0abc+0 matches '0abc0', '0abcc0', '0abccc0', and so on.
To match one or more sequences of 'abc' with zeros on the ends, use 0(abc)+0. The parentheses denote a subpattern that can be quantified as a unit. It's also common for regular expression engines to save or "capture" the portion of the input text that matches a parenthesized group. Extracting bits this way is much more flexible and less error-prone than counting indices and substr.
Alternation
Earlier, we saw one way to match either 'Nick' or 'nick'. Another is with alternation as in Nick|nick. Remember that alternation includes everything to its left and everything to its right. Use grouping parentheses to limit the scope of |, e.g., (Nick|nick).
For another example, you could equivalently write [a-c] as a|b|c, but this is likely to be suboptimal because many implementations assume alternatives will have lengths greater than 1.
Escaping
Although some characters match themselves, others have special meanings. The pattern \d+ doesn't match backslash followed by lowercase D followed by a plus sign: to get that, we'd use \\d\+. A backslash removes the special meaning from the following character.
Greediness
Regular expression quantifiers are greedy. This means they match as much text as they possibly can while allowing the entire pattern to match successfully.
For example, say the input is
"Hello," she said, "How are you?"
You might expect ".+" to match only 'Hello,' and will then be surprised when you see that it matched from 'Hello' all the way through 'you?'.
To switch from greedy to what you might think of as cautious, add an extra ? to the quantifier. Now you understand how \((.+?)\), the example from your question, works. It matches the sequence of a literal left-parenthesis, followed by one or more characters, and terminated by a right-parenthesis.
If your input is '(123) (456)', then the first capture will be '123'. Non-greedy quantifiers want to allow the rest of the pattern to start matching as soon as possible.
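The difference is easiest to see side by side. A minimal sketch in JavaScript; the quoted sample string and the variable names are assumptions made for this illustration, while '(123) (456)' is the input mentioned above.

```javascript
// Sample input is an assumption made for this sketch.
const input = 'say "one" and "two"';

// Greedy: .+ grabs as much as it can, so the match runs to the last quote.
const greedy = input.match(/".+"/)[0];
console.log(greedy); // "one" and "two"

// Non-greedy: .+? stops at the first closing quote that lets the pattern succeed.
const lazy = input.match(/".+?"/)[0];
console.log(lazy); // "one"

// The pattern from the text: the first capture is just the first parenthesized number.
const firstCapture = "(123) (456)".match(/\((.+?)\)/)[1];
console.log(firstCapture); // 123

// For contrast, the greedy version of the same pattern swallows both groups.
const greedyCapture = "(123) (456)".match(/\((.+)\)/)[1];
console.log(greedyCapture); // 123) (456
```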
(As to your confusion, I don't know of any regular-expression dialect where ((.+?)) would do the same thing. I suspect something got lost in transmission somewhere along the way.)
Anchors
Use the special pattern ^ to match only at the beginning of your input and $ to match only at the end. Making "bookends" with your patterns where you say, "I know what's at the front and back, but give me everything between" is a useful technique.
Say you want to match comments of the form
-- This is a comment --
you'd write ^--\s+(.+)\s+--$.
Build your own
Regular expressions are recursive, so now that you understand these basic rules, you can combine them however you like.
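As a rough illustration of combining the blocks, the JavaScript sketch below reuses patterns from the sections above. The variable names and sample inputs are made up, and the anchors added around 0abc+0 and 0(abc)+0 are there only so that test checks the whole string.

```javascript
// Quantifier binding: + applies only to the c, unless a group widens its scope.
const tight = /^0abc+0$/;
const grouped = /^0(abc)+0$/;
console.log(tight.test("0abccc0"), tight.test("0abcabc0"));     // true false
console.log(grouped.test("0abcabc0"), grouped.test("0abccc0")); // true false

// Shortcuts, counted repetition, and captures: pull the parts out of a date.
const date = /^(\d{4})-(\d{2})-(\d{2})$/;
const [, year, month, day] = "2019-01-01".match(date);
console.log(year, month, day); // 2019 01 01

// Alternation scoped by a group, plus an escaped dot that matches only '.'.
const greeting = /^(Nick|nick)\.$/;
console.log(greeting.test("nick."), greeting.test("nickX")); // true false

// Anchors as bookends around a capture (the comment text is a made-up sample).
const comment = /^--\s+(.+)\s+--$/;
const commentBody = "-- build your own --".match(comment)[1];
console.log(commentBody); // build your own
```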
Tools for writing and debugging regexes:
Books
Free resources
Footnote
†: The statement above that . matches any character is a simplification for pedagogical purposes that is not strictly true. Dot matches any character except newline, "\n", but in practice you rarely expect a pattern such as .+ to cross a newline boundary. Perl regexes have a /s switch and Java has Pattern.DOTALL, for example, to make . match any character at all. For languages that don't have such a feature, you can use something like [\s\S] to match "any whitespace or any non-whitespace", in other words anything.
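Here is how that plays out in JavaScript, as one concrete case. The s (dotAll) flag shown is JavaScript's counterpart to the Perl and Java switches and is not mentioned in the answer itself; it needs an ES2018 or later engine. The sample text is made up.

```javascript
const text = "first line\nsecond line";

// By default the dot stops at a newline, so .+ stays on the first line.
const defaultDot = text.match(/.+/)[0];
console.log(defaultDot); // first line

// The s (dotAll) flag lets the dot match newlines too.
const dotAll = text.match(/.+/s)[0];
console.log(dotAll === text); // true

// Portable fallback from the footnote: whitespace or non-whitespace, i.e. anything.
const anything = text.match(/[\s\S]+/)[0];
console.log(anything === text); // true
```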