Regular Expression
Teacher: Today we will learn regex and how to use it in Java.
Jonny and Alice screamed with fear and said in chorus: Sir, it is the most confusing thing, and it makes our lives miserable as programmers.
Teacher: (smiling) Yes, I used to think the same way when I was a student :) But it is not as hard as you think. You just need to remember some key points, the way you remember History and Geography.
I will try to point out those key points, which will break your fear.
Before that, tell me why you find regex confusing, and share your confusion with me.
Alice: Sir, the common problem is that it is hard to read and write. What we mean is that if we have a string to validate against a complex pattern, say email validation, the regex for it is a one-liner with multiple backslashes, square brackets, parentheses, etc., so we are often perplexed about what it says. In simple words, just by looking at a regex solution, we don't understand what it is trying to say.
Teacher: So you mean readability, right? You prefer to write more code to avoid regex because that code is easier to read. But let me tell you, regex is a Holy Grail: if you unleash its power, you can write concise and readable code.
Jonny: Sir, another problem is that there is no fixed solution for a problem. Take the example of email validation: if you search for it on Google, you see a ton of different solutions to validate an email. So it is hard to pick the right one.
Teacher: This is because you have not understood the crux of regex. Any other problem?
Students: Pin-drop silence.
The teacher slowly takes a step towards the board and starts his lesson.
What is Regex?
The teacher said: A regular expression is a technique for searching for a pattern in a String. The search pattern can range from very simple to very complex: a word, a sentence, or an expression built from the different metacharacters and symbols used in regex.
To understand regex correctly we need to know the metacharacters and symbols and their meanings. This is the only thing you need to remember.
We find regex hard because we do not understand how the symbols are used.
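As a rough sketch of what "searching for a pattern in a String" looks like in Java: the standard `java.util.regex` package provides `Pattern` and `Matcher`. The class name `RegexBasics` and its two helper methods are made up for this illustration and are not part of the lesson.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexBasics {

    // Returns true if the pattern occurs anywhere inside the input.
    static boolean containsPattern(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(input);
        return matcher.find();
    }

    // Returns true only if the whole input matches the pattern.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(containsPattern("regex", "learn regex in Java")); // true
        System.out.println(matchesWhole("regex", "learn regex in Java"));    // false
        System.out.println(matchesWhole("regex", "regex"));                  // true
    }
}
```

The difference between `find()` (search inside the string) and `matches()` (the whole string must fit) is worth keeping in mind when reading any "Matched / Not Matched" example.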
Let us take a look at the different symbols used in regex.
We can classify regex symbols into 3 categories.
- Meta-Characters.
- Ranges & reserved symbols.
- Quantifiers.
Meta-Characters : In regex, there are some reserved metacharacters with predefined meanings that express common patterns, such as a digit or a whitespace, in a compact way.
Meta Character | Expression | Alternate Expr. | Definition
To express a digit | \d | [0-9] or [^\D] | Represents a digit character
To express anything but a digit | \D | [^0-9] or [^\d] | Represents a non-digit character
To express a word character | \w | [a-zA-Z_0-9] or [^\W] | Represents a word character (letter, digit or underscore)
To express anything but a word character | \W | [^a-zA-Z_0-9] or [^\w] | Represents a non-word character
To express a whitespace | \s | [ \t\n\x0B\f\r] or [^\S] | Represents any whitespace such as a space, \r, \t, \n etc.
To express anything but a whitespace | \S | [^ \t\n\x0B\f\r] or [^\s] | Represents any non-whitespace character
To express a word boundary | \b | No character-class equivalent | Represents the position between a word character and a non-word character; it matches a position, not a character
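The metacharacters can be tried out with a small Java sketch like the one below. The class `MetaCharacterDemo` and its method names are hypothetical helpers for this illustration; only `Pattern` comes from the JDK. Note that inside a Java string literal every regex backslash is written twice, so the regex `\d` becomes `"\\d"`.

```java
import java.util.regex.Pattern;

class MetaCharacterDemo {

    // Each helper checks a single-character string against one metacharacter.
    static boolean isDigit(String s)      { return Pattern.matches("\\d", s); }
    static boolean isWordChar(String s)   { return Pattern.matches("\\w", s); }
    static boolean isWhitespace(String s) { return Pattern.matches("\\s", s); }

    // \b matches a position, so "\\bcat\\b" finds "cat" only as a separate word.
    static boolean containsWordCat(String s) {
        return Pattern.compile("\\bcat\\b").matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(isDigit("7"));                   // true
        System.out.println(isDigit("a"));                   // false
        System.out.println(isWordChar("_"));                // true
        System.out.println(isWhitespace("\t"));             // true
        System.out.println(containsWordCat("a cat sat"));   // true
        System.out.println(containsWordCat("concatenate")); // false
    }
}
```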
Ranges & reserved symbols : When we match a pattern in regex, some extra information often has to be stated: which characters are allowed at a position, whether the match must be at the beginning or the end of the line, or, for more complex patterns, the minimum and maximum number of times a part may appear in the String. We define these using ranges and reserved symbols.
Symbol | Description | Example | Example Definition
. | Any character | .ha. | Any character, followed by ha, then any character. sham: Matched; gyan: Not Matched
^ | Checks the beginning of the line | ^sha | Matches if the line starts with sha. sham: Matched; Aha: Not Matched
$ | Checks the end of the line | tra$ | Matches if the line ends with tra. Mitra: Matched; Chakra: Not Matched
[xyz] | Match either x or y or z | a[xyz] | ax: Matched; aa: Not Matched
[xyz][abc] | Match x, y or z followed by a, b or c | s[hwo][abc] | sha: Matched; sou: Not Matched
XA | Exactly X followed by A | sm | sm: Matched; Sa: Not Matched
X|A | X or A | s(X|A) | sX: Matched; sZ: Not Matched
[^abc] | Any character except a, b or c. Remember: when ^ is used inside square brackets it acts as a negation | s[^abc]m | shm: Matched; sam: Not Matched
[a-c1-9] | Match one character from a to c or one digit from 1 to 9. Remember: a range covers single characters, so [1-10] means 1 or 0, not 1 to 10 | s[x-z1-9] | sy: Matched; sb: Not Matched
() | Used for grouping | (s[^yz])(a|b)([a-c1-9]) | sab1: Matched; shac: Matched; syab: Not Matched; sabb: Matched
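A few of these symbols can be checked with a short Java sketch. `SymbolDemo`, `found` and `groups` are names invented for this illustration. Two details are assumptions of the sketch rather than statements from the lesson: alternation is written with parentheses as `s(X|A)`, because inside square brackets `|` is just a literal character, and digit ranges are written as `1-9`, because a range always covers single characters.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SymbolDemo {

    // Returns true if the pattern is found anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Returns the three captured groups, or null if the whole input does not match.
    static String[] groups(String input) {
        Matcher m = Pattern.compile("(s[^yz])(a|b)([a-c1-9])").matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        System.out.println(found(".ha.", "sham"));      // true
        System.out.println(found("^sha", "Aha"));       // false
        System.out.println(found("tra$", "Mitra"));     // true
        System.out.println(found("s(X|A)", "sX"));      // true
        System.out.println(found("s[^abc]m", "sam"));   // false
        System.out.println(found("s[x-z1-9]", "sy"));   // true
        System.out.println(String.join(",", groups("sab1"))); // sa,b,1
        System.out.println(groups("syab") == null);     // true
    }
}
```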
Quantifiers: Quantifiers say how many times a pattern can occur in a String.
Quantifier | Description | Example | Example Definition
* | Pattern can occur zero or more times | s(\s)*m | sm: Matched; s m: Matched; s:m: Not Matched
+ | Pattern can occur one or more times | s(\s)+m | s m: Matched; sm: Not Matched
? | Pattern can occur zero times or one time | s(h)?a | sha: Matched; sa: Matched; shha: Not Matched
{X} | Pattern must occur exactly X times | s(\d){3} | s123: Matched; s1234: Not Matched; s1: Not Matched
{X,Y} | Pattern must occur at least X and at most Y times | s(\d){2,4} | s12: Matched; s1: Not Matched; s12345: Not Matched
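The quantifier examples can be verified with a sketch along these lines. `QuantifierDemo` and `whole` are illustrative names only. The sketch assumes the whole string has to match (Java's `Pattern.matches`), which is why s1234 counts as not matched for exactly three digits; counts are written in curly braces, as in `s(\d){3}` and `s(\d){2,4}`.

```java
import java.util.regex.Pattern;

class QuantifierDemo {

    // Returns true only if the entire input matches the regex.
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(whole("s(\\s)*m", "sm"));         // true  (zero whitespace)
        System.out.println(whole("s(\\s)*m", "s:m"));        // false (':' is not whitespace)
        System.out.println(whole("s(\\s)+m", "sm"));         // false (needs at least one)
        System.out.println(whole("s(h)?a", "shha"));         // false (at most one h)
        System.out.println(whole("s(\\d){3}", "s123"));      // true
        System.out.println(whole("s(\\d){3}", "s1234"));     // false
        System.out.println(whole("s(\\d){2,4}", "s12"));     // true
        System.out.println(whole("s(\\d){2,4}", "s12345"));  // false
    }
}
```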
Email Validation :
Teacher : So Jonny, earlier you said that email validation is confusing. Now can you two tell us what the email validation regex below says?
^[A-Za-z0-9]+(\\.[A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$
Jonny: Yes sir, the first part is ^[A-Za-z0-9]+. The ^ denotes the start of the line, the brackets allow any letter or digit, and + says one or more, so the email must start with at least one letter or digit and this part can have any length.
Sir: Very good. Alice, you tell me the second part.
Alice : (\\.[A-Za-z0-9-]+)* says that the first part may be followed by a dot and then at least one letter, digit or hyphen, of any length. This whole group is optional and may repeat, as * comes last.
Sir: Impressive.
Jonny: @[A-Za-z0-9-]+ then strictly matches @ followed by at least one letter, digit or hyphen, as + is there.
Alice : (\\.[A-Za-z0-9]+)* is again a dot followed by at least one letter or digit, and it is optional again.
Jonny : (\\.[A-Za-z]{2,})$ says the email ends ($) with a dot followed by letters in a-z or A-Z, at least two of them with no upper limit.
Sir: Great. Now Alice, tell me a valid email according to this regex.
Alice: shamik.mitra@gmail.co.in or shamik@gmail.com
Sir: Good. Jonny, tell me an invalid one.
Jonny: shamik.mitra@co.i or .mitra@gmail.co.uk
Sir: Well, it seems you are learning regex very quickly. Before we finish today's lesson, I will give you one tip: stick the tables above on your desk so you can go through the regex symbols every day; then you will remember them easily.
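To try the lesson's email regex in Java, a minimal sketch could look like the following. `EmailValidator` and `isValid` are hypothetical names chosen for this illustration, and the pattern is simply the one discussed in the lesson, not a complete check of every legal email address. In the string literal, `\\.` is the Java way of writing the regex `\.`, a literal dot.

```java
import java.util.regex.Pattern;

class EmailValidator {

    // The lesson's pattern, split over two string literals for readability.
    private static final Pattern EMAIL = Pattern.compile(
            "^[A-Za-z0-9]+(\\.[A-Za-z0-9-]+)*"
            + "@[A-Za-z0-9-]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})$");

    // Returns true if the whole string matches the lesson's email pattern.
    static boolean isValid(String email) {
        return email != null && EMAIL.matcher(email).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("shamik.mitra@gmail.co.in")); // true
        System.out.println(isValid("shamik@gmail.com"));         // true
        System.out.println(isValid("shamik.mitra@co.i"));        // false (last part needs 2+ letters)
        System.out.println(isValid(".mitra@gmail.co.uk"));       // false (cannot start with a dot)
    }
}
```

Compiling the pattern once into a static field avoids recompiling it on every call, which is a common choice when the same regex is reused.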