To match a string that contains only those characters (or an empty string), try
"^[a-zA-Z0-9_]*$"
This works for .NET regular expressions, and probably a lot of other languages as well.
Breaking it down:
^ : start of string
[ : beginning of character group
a-z : any lowercase letter
A-Z : any uppercase letter
0-9 : any digit
_ : underscore
] : end of character group
* : zero or more of the given characters
$ : end of string
If you don't want to allow empty strings, use + instead of *.
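As a quick check, here is a minimal sketch of the pattern in Python (chosen only for illustration; the helper name is hypothetical). One assumption worth flagging: in Python, $ also matches just before a trailing newline, so the sketch uses re.fullmatch rather than relying on ^ and $.

```python
import re

# Explicit character class: ASCII letters, digits, and underscore only.
WORD_CHARS = re.compile(r"[a-zA-Z0-9_]*")


def is_word_chars_only(text: str, allow_empty: bool = True) -> bool:
    """Return True if text consists solely of [a-zA-Z0-9_] characters."""
    if not allow_empty and text == "":
        # Equivalent to using + instead of * in the pattern.
        return False
    # fullmatch requires the whole string to match, like ^...$ but without
    # Python's behavior of $ also matching before a trailing newline.
    return WORD_CHARS.fullmatch(text) is not None
```

Calling is_word_chars_only("abc_123") accepts the input, while "abc-123" is rejected because of the hyphen.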
As others have pointed out, some regex languages have a shorthand form for [a-zA-Z0-9_]. In the .NET regex language, you can turn on ECMAScript behavior and use \w as a shorthand (yielding ^\w*$ or ^\w+$). Note that in other languages, and by default in .NET, \w is somewhat broader, and will match other sorts of Unicode characters as well (thanks to Jan for pointing this out). So if you're really intending to match only those characters, using the explicit (longer) form is probably best.
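The same distinction can be seen in Python, shown here purely as an illustration and not as a statement about .NET: for str patterns \w is Unicode-aware by default, and the re.ASCII flag narrows it to [a-zA-Z0-9_].

```python
import re

# Default: \w matches Unicode word characters (letters from any script, etc.).
unicode_word = re.compile(r"\w+")
# re.ASCII restricts \w to [a-zA-Z0-9_].
ascii_word = re.compile(r"\w+", re.ASCII)

sample = "naïve"
print(unicode_word.fullmatch(sample) is not None)  # True
print(ascii_word.fullmatch(sample) is not None)    # False
```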
The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:
^((?!hede).)*$
The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.
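A minimal Python sketch of the pattern above, for anyone who wants to try it (the function name is hypothetical):

```python
import re

# Match a whole line only if "hede" does not occur anywhere in it.
NO_HEDE = re.compile(r"^((?!hede).)*$")


def lacks_hede(line: str) -> bool:
    """Return True if the line does not contain the substring 'hede'."""
    return NO_HEDE.match(line) is not None


for candidate in ["hoho", "hihi", "haha", "hede", "ABhedeCD"]:
    print(candidate, lacks_hede(candidate))
```

Only the last two candidates are rejected. In real code a plain substring test ("hede" not in line) is simpler; the regex form is mainly useful when the check has to be embedded in a larger pattern.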
And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):
/^((?!hede).)*$/s
or use it inline:
/(?s)^((?!hede).)*$/
(where the /.../ are the regex delimiters, i.e., not part of the pattern)
If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:
/^((?!hede)[\s\S])*$/
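In Python there are no /.../ delimiters; the DOT-ALL modifier is the re.DOTALL flag or the inline (?s). The sketch below, with a made-up two-line input, compares the three variants against the plain pattern:

```python
import re

text = "first line\nsecond line"

plain = re.compile(r"^((?!hede).)*$")              # . stops at line breaks
dotall = re.compile(r"^((?!hede).)*$", re.DOTALL)  # flag form of /.../s
inline = re.compile(r"(?s)^((?!hede).)*$")         # inline modifier
char_class = re.compile(r"^((?!hede)[\s\S])*$")    # no modifier needed

print(plain.match(text) is not None)       # False
print(dotall.match(text) is not None)      # True
print(inline.match(text) is not None)      # True
print(char_class.match(text) is not None)  # True
```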
Explanation
A string is just a list of n characters. Before, and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":
    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
    └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘
index    0      1      2      3      4      5      6      7
where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width assertions because they don't consume any characters. They only assert/validate something.
So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$
As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
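To make the walk over the empty strings concrete, here is a small Python sketch that tests the lookahead on its own at every position of the input. The function name and the e1..e9 labelling are just illustration, following the explanation above.

```python
import re


def failing_positions(s: str, needle: str = "hede") -> list[str]:
    """Return labels of the empty strings at which (?!needle) fails."""
    lookahead = re.compile(f"(?!{re.escape(needle)})")
    # Position pos is the empty string just before the character at index pos;
    # position len(s) is the empty string at the very end.
    return [
        f"e{pos + 1}"
        for pos in range(len(s) + 1)
        if lookahead.match(s, pos) is None
    ]


print(failing_positions("ABhedeCD"))  # ['e3']
```

Because the lookahead fails at e3, the repeated group cannot get past "AB", the $ anchor cannot be reached, and the overall match fails.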
Best Answer
EDIT: Since this has gotten a lot of views, let me start by giving everybody what they Googled for:
Now that that's out of the way, most of the following is meant as commentary on how complex regex can get if you try to be clever with it, and why you should seek alternatives. Read at your own risk.
This is a very common task, but all the answers I see here so far will accept inputs that don't match your number format, such as ",111", "9,9,9", or even ".,,.". That's simple enough to fix, even if the numbers are embedded in other text. IMHO anything that fails to pull 1,234.56 and 1234 (and only those numbers) out of "abc22 1,234.56 9.9.9.9 def 1234" is a wrong answer.
First of all, if you don't need to do this all in one regex, don't. A single regex for two different number formats is hard to maintain even when they aren't embedded in other text. What you should really do is split the whole thing on whitespace, then run two or three smaller regexes on the results. If that's not an option for you, keep reading.
Basic pattern
Considering the examples you've given, here's a simple regex that allows pretty much any integer or decimal in 0000 format and blocks everything else:
Here's one that requires 0,000 format:
Put them together, and commas become optional as long as they're consistent:
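As a rough Python illustration of the three cases described above: the patterns below are assumptions reconstructed from the prose alone and are not necessarily the exact ones this answer had in mind, and the names are hypothetical.

```python
import re

# Assumed pattern for "0000" format: optional integer part, optional decimal
# point, at least one trailing digit (1234, 12.5, .5).
PLAIN = r"\d*\.?\d+"
# Assumed pattern for "0,000" format: groups of exactly three digits after
# the first comma, optional fraction (1,234.56).
GROUPED = r"\d{1,3}(?:,\d{3})*(?:\.\d+)?"
# Either format; commas are optional but must be consistent when present.
EITHER = re.compile(rf"(?:{PLAIN}|{GROUPED})")


def is_number(text: str) -> bool:
    """Return True if the whole input is a number in one of the two formats."""
    return EITHER.fullmatch(text) is not None


for sample in ["1234", "1,234.56", ",111", "9,9,9", ".,,."]:
    print(repr(sample), is_number(sample))
```

Under these assumed patterns the first two samples are accepted and the last three are rejected.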
Embedded numbers
The patterns above require the entire input to be a number. You're looking for numbers embedded in text, so you have to loosen that part. On the other hand, you don't want it to see catch22 and think it's found the number 22. If you're using something with lookbehind support (like .NET), this is pretty easy: replace ^ with (?<!\S) and $ with (?!\S) and you're good to go:
If you're working with an engine that lacks lookbehind support (JavaScript before ES2018 or Ruby 1.8, for example), things start looking more complex:
You'll have to use capture groups; I can't think of an alternative without lookbehind support. The numbers you want will be in Group 1 (assuming the whole match is Group 0).
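Python supports lookbehind, so both approaches can be compared in one place. The sketch below is only an illustration: the number pattern is an assumed one built from the description of the two formats, and the function names are hypothetical.

```python
import re

# Assumed number pattern covering both the 0000 and the 0,000 formats.
NUMBER = r"\d*\.?\d+|\d{1,3}(?:,\d{3})*(?:\.\d+)?"

# Variant 1: lookbehind and lookahead replace ^ and $.
WITH_LOOKAROUND = re.compile(rf"(?<!\S)(?:{NUMBER})(?!\S)")
# Variant 2: no lookbehind; the leading whitespace is consumed by the match
# and the number itself is taken from capture group 1.
WITH_GROUP = re.compile(rf"(?:^|\s)({NUMBER})(?!\S)")


def with_lookaround(text: str) -> list[str]:
    return [m.group(0) for m in WITH_LOOKAROUND.finditer(text)]


def with_capture_group(text: str) -> list[str]:
    return [m.group(1) for m in WITH_GROUP.finditer(text)]


sample = "abc22 1,234.56 9.9.9.9 def 1234"
print(with_lookaround(sample))     # ['1,234.56', '1234']
print(with_capture_group(sample))  # ['1,234.56', '1234']
```

Note that the whole match (Group 0) in the second variant may include a leading whitespace character, which is why the number has to be read from Group 1.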
Validation and more complex rules
I think that covers your question, so if that's all you need, stop reading now. If you want to get fancier, things turn very complex very quickly. Depending on your situation, you may want to block any or all of the following:
Just for the hell of it, let's assume you want to block the first 3, but allow the last one. What should you do? I'll tell you what you should do: use a different regex for each rule and progressively narrow down your matches. But for the sake of the challenge, here's how you do it all in one giant pattern:
And here's what it means:
Tested here: http://rextester.com/YPG96786
This will allow things like:
It will block things like:
There are several ways to make this regex simpler and shorter, but understand that changing the pattern will loosen what it considers a number.
Since some regex engines (e.g. older versions of JavaScript and Ruby) don't support negative lookbehind, the only way to do this correctly in those engines is with capture groups:
The numbers you're looking for will be in capture group 1.
Tested here: http://rubular.com/r/3HCSkndzhT
One final note
Obviously, this is a massive, complicated, nigh-unreadable regex. I enjoyed the challenge, but you should consider whether you really want to use this in a production environment. Instead of trying to do everything in one step, you could do it in two: a regex to catch anything that might be a number, then another one to weed out whatever isn't a number. Or you could do some basic processing, then use your language's built-in number parsing functions. Your choice.
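A minimal Python sketch of the two-step idea, under stated assumptions: the candidate pattern, the strict pattern, and the function name are all hypothetical, and Decimal stands in for "your language's built-in number parsing".

```python
import re
from decimal import Decimal

# Step 1: anything whitespace-delimited that contains a digit is a candidate.
CANDIDATE = re.compile(r"\S*\d\S*")
# Step 2: a stricter (assumed) check for 0000 or 0,000 format.
STRICT = re.compile(r"(?:\d*\.?\d+|\d{1,3}(?:,\d{3})*(?:\.\d+)?)")


def extract_numbers(text: str) -> list[Decimal]:
    """Collect candidates loosely, then keep only well-formed numbers."""
    numbers = []
    for token in CANDIDATE.findall(text):
        if STRICT.fullmatch(token):
            # Strip the thousands separators before parsing.
            numbers.append(Decimal(token.replace(",", "")))
    return numbers


print(extract_numbers("abc22 1,234.56 9.9.9.9 def 1234"))
# [Decimal('1234.56'), Decimal('1234')]
```

Each of the two patterns stays short enough to read, and additional rules can be added as further filtering steps instead of being folded into one expression.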