Perl - Regular Expressions
A regular expression is a string of characters that defines the pattern or patterns you are looking for. The syntax of regular expressions in Perl is very similar to what you will find in other programs that support regular expressions, such as sed, grep, and awk.
The basic method for applying a regular expression is to use the pattern binding operators =~ and !~. The first operator tests whether the string matches (and, with s/// or tr///, assigns the result back to it); the second is its negation.
There are three regular expression operators within Perl:
· Match Regular Expression - m//
· Substitute Regular Expression - s///
· Transliterate Regular Expression - tr///
The forward slashes in each case act as delimiters for the regular expression (regex) that you are specifying. If you prefer another delimiter, you can use it in place of the forward slash.
The Match Operator
The match operator,
m//, is used to match a string or statement to a regular expression. For
example, to match the character sequence "foo" against the scalar
$bar, you might use a statement like this:
#!/usr/bin/perl
$bar = "This is foo and again foo";
if ($bar =~ /foo/){
print "First time is matching\n";
}else{
print "First time is not matching\n";
}
$bar = "foo";
if ($bar =~ /foo/){
print "Second time is matching\n";
}else{
print "Second time is not matching\n";
}
When the above program is executed, it produces the following result:
First time is matching
Second time is matching
The m// operator actually works in the same fashion as the q// operator series: you can use any combination of naturally matching characters to act as delimiters for the expression. For example, m{}, m(), and m<> are all valid. So the above example can be rewritten as follows:
#!/usr/bin/perl
$bar = "This is foo and again foo";
if ($bar =~ m[foo]){
print "First time is matching\n";
}else{
print "First time is not matching\n";
}
$bar = "foo";
if ($bar =~ m{foo}){
print "Second time is matching\n";
}else{
print "Second time is not matching\n";
}
You can omit m from
m// if the delimiters are forward slashes, but for all other delimiters you
must use the m prefix.
Note that the entire match expression, that is, the expression on the left of =~ or !~ together with the match operator, returns true (in a scalar context) if the expression matches. Therefore the statement:
$true = ($foo =~ m/foo/);
will set $true to 1 if $foo matches the regex, or to the empty string (Perl's false value, which is 0 in numeric context) if the match fails. In a list context, the match returns the contents of any grouped expressions. For example, when extracting the hours, minutes, and seconds from a time string, we can use:
my ($hours, $minutes, $seconds) = ($time =~ m/(\d+):(\d+):(\d+)/);
Match Operator Modifiers
The match operator supports its own set of modifiers. The /g modifier allows for global matching. The /i modifier will make the match case insensitive. Here is the list of modifiers:
Modifier | Description
i | Makes the match case insensitive
m | Specifies that if the string has newline or carriage return characters, the ^ and $ operators will now match against a newline boundary, instead of a string boundary
o | Evaluates the expression only once
s | Allows use of . to match a newline character
x | Allows you to use white space in the expression for clarity
g | Globally finds all matches
cg | Allows the search to continue even after a global match fails
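As a minimal sketch of how a few of these modifiers behave, the following script applies /i, /g, and /x to some made-up sample strings (the strings and variable names are illustrative assumptions, not part of the original examples):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "Perl is fun. PERL is old. perl is everywhere.";

# /i: the match succeeds regardless of letter case.
my $found = ($text =~ /perl/i) ? 1 : 0;

# /g in list context returns every match; with /i it finds all three spellings.
my @all = ($text =~ /perl/gi);

# /x: white space and comments inside the pattern are ignored.
my ($h, $m) = ("Lunch at 12:45" =~ /
    (\d+)   # hours
    :       # literal colon
    (\d+)   # minutes
/x);

print "Found: $found\n";
print "Matches: @all\n";
print "Hours: $h, Minutes: $m\n";
```

Note that /g changes meaning with context: in a list context it returns all matches at once, while in a scalar context it finds one match per call.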
Matching Only Once
There is also a simpler version of the match operator: the ?PATTERN? operator. This is basically identical to the m// operator except that it only matches once within the string you are searching between each call to reset. Since Perl 5.22 it must be written with the explicit m prefix, as m?PATTERN?.
For example, you can use this to get the first and last matching elements within a list:
#!/usr/bin/perl
@list = qw/food foosball subeo footnote terfoot canic footbridge/;
foreach (@list)
{
$first = $1 if m?(foo.*)?;
$last = $1 if /(foo.*)/;
}
print "First: $first, Last: $last\n";
When the above program is executed, it produces the following result:
First: food, Last: footbridge
Regular Expression Variables
Regular expression variables include $+, which contains whatever the last grouping match matched; $&, which contains the entire matched string; $`, which contains everything before the matched string; and $', which contains everything after the matched string. The following code demonstrates the result:
#!/usr/bin/perl
$string = "The food is in the salad bar";
$string =~ m/foo/;
print "Before: $`\n";
print "Matched: $&\n";
print "After: $'\n";
When the above program is executed, it produces the following result:
Before: The
Matched: foo
After: d is in the salad bar
The Substitution Operator
The substitution
operator, s///, is really just an extension of the match operator that allows
you to replace the text matched with some new text. The basic form of the
operator is:
s/PATTERN/REPLACEMENT/;
The PATTERN is the regular expression for the text that we are looking for. The REPLACEMENT is a specification for the text or expression that we want to use to replace the found text. For example, we can replace the first occurrence of cat with dog using the following regular expression:
#!/usr/bin/perl
$string = "The cat sat on the mat";
$string =~ s/cat/dog/;
print "$string\n";
When the above program is executed, it produces the following result:
The dog sat on the mat
Substitution Operator Modifiers
Here is the list of modifiers used with the substitution operator:
Modifier | Description
i | Makes the match case insensitive
m | Specifies that if the string has newline or carriage return characters, the ^ and $ operators will now match against a newline boundary, instead of a string boundary
o | Evaluates the expression only once
s | Allows use of . to match a newline character
x | Allows you to use white space in the expression for clarity
g | Replaces all occurrences of the found expression with the replacement text
e | Evaluates the replacement as if it were a Perl statement, and uses its return value as the replacement text
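The /g and /e modifiers are the ones that most change what a substitution does. A short sketch, using made-up sample strings, that contrasts a plain substitution with a global one and with an evaluated replacement:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "The cat sat on the cat mat";

# Without /g only the first occurrence is replaced.
(my $first_only = $string) =~ s/cat/dog/;

# With /g every occurrence is replaced.
(my $every = $string) =~ s/cat/dog/g;

# With /e the replacement is evaluated as Perl code: here each number is doubled.
my $prices = "apple 3, pear 5";
(my $doubled = $prices) =~ s/(\d+)/$1 * 2/ge;

print "$first_only\n";
print "$every\n";
print "$doubled\n";
```

The (my $copy = $original) =~ s/.../.../ idiom copies the string first, so the original value is left untouched.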
The Translation Operator
Translation is similar, but not identical, to substitution: unlike substitution, translation (or transliteration) does not use regular expressions for its search or replacement values. The translation operators are:
tr/SEARCHLIST/REPLACEMENTLIST/cds
y/SEARCHLIST/REPLACEMENTLIST/cds
The translation replaces all occurrences of the characters in SEARCHLIST with the corresponding characters in REPLACEMENTLIST. For example, using the "The cat sat on the mat." string we have been using in this chapter:
#!/usr/bin/perl
$string = 'The cat sat on the mat.';
$string =~ tr/a/o/;
print "$string\n";
When the above program is executed, it produces the following result:
The cot sot on the mot.
Standard Perl ranges can also be used, allowing you to specify ranges of characters either by letter or numerical value. To convert a string to uppercase, you might use the following syntax in place of the uc function:
$string =~ tr/a-z/A-Z/;
Translation Operator Modifiers
Following is the list of modifiers related to translation:
Modifier | Description
c | Complement SEARCHLIST.
d | Delete found but unreplaced characters.
s | Squash duplicate replaced characters.
The /d modifier deletes the characters matching SEARCHLIST that do not have a corresponding entry in REPLACEMENTLIST. For example:
#!/usr/bin/perl
$string = 'the cat sat on the mat.';
$string =~ tr/a-z/b/d;
print "$string\n";
When the above program is executed, it produces the following result (a becomes b, every other lowercase letter is deleted, and the spaces and the period are left alone):
 b b   b.
The last modifier, /s, squashes runs of identical replaced characters into a single character, so:
#!/usr/bin/perl
$string = 'food';
$string =~ tr/a-z/a-z/s;
print "$string\n";
When the above program is executed, it produces the following result:
fod
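The c modifier and the return value of tr are not demonstrated above. The following is a minimal sketch, reusing the sample sentence; the variable names are illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = 'the cat sat on the mat.';

# tr returns the number of characters it matched. With an empty
# replacement list and no /d, the string itself is left unchanged.
my $count_a = ($string =~ tr/a//);

# /c complements SEARCHLIST: everything that is NOT a-z becomes '-'.
(my $masked = $string) =~ tr/a-z/-/c;

# /c combined with /d deletes everything that is NOT a-z.
(my $letters = $string) =~ tr/a-z//cd;

print "Count of a: $count_a\n";
print "$masked\n";
print "$letters\n";
```

Counting with tr is a common idiom because it avoids the overhead of the regular expression engine.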
More complex regular expressions
You don't just have
to match on fixed strings. In fact, you can match on just about anything you
could dream of by using more complex regular expressions. Here's a quick cheat
sheet:
The following table lists the regular expression syntax that is available in Perl.
Pattern | Description
^ | Matches beginning of string (or of each line with the m modifier).
$ | Matches end of string (or of each line with the m modifier).
. | Matches any single character except newline. Using the s option allows it to match newline as well.
[...] | Matches any single character in brackets.
[^...] | Matches any single character not in brackets.
* | Matches 0 or more occurrences of preceding expression.
+ | Matches 1 or more occurrences of preceding expression.
? | Matches 0 or 1 occurrence of preceding expression.
{n} | Matches exactly n occurrences of preceding expression.
{n,} | Matches n or more occurrences of preceding expression.
{n,m} | Matches at least n and at most m occurrences of preceding expression.
a|b | Matches either a or b.
\w | Matches word characters.
\W | Matches nonword characters.
\s | Matches whitespace. Equivalent to [ \t\n\r\f] for ASCII text.
\S | Matches nonwhitespace.
\d | Matches digits. Equivalent to [0-9] for ASCII text.
\D | Matches nondigits.
\A | Matches beginning of string.
\Z | Matches end of string. If a final newline exists, it matches just before that newline.
\z | Matches end of string.
\G | Matches point where last match finished.
\b | Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
\B | Matches nonword boundaries.
\n, \t, etc. | Matches newlines, carriage returns, tabs, etc.
\1...\9 | Matches nth grouped subexpression.
\10 | Matches nth grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code.
[aeiou] | Matches a single character in the given set.
[^aeiou] | Matches a single character outside the given set.
The ^ metacharacter matches the beginning of the string and the $ metasymbol matches the end of the string. Here are some brief examples:
# nothing in the string (start and end are adjacent)
/^$/
# three digits, each followed by a whitespace
# character (eg "3 4 5 ")
/(\d\s){3}/
# matches a string in which every
# odd-numbered letter is a (eg "abacadaf")
/(a.)+/
# string starts with one or more digits
/^\d+/
# string that ends with one or more digits
/\d+$/
Let's have a look at another example:
#!/usr/bin/perl
$string = "Cats go Catatonic\nWhen given Catnip";
($start) = ($string =~ /\A(.*?) /);
@lines = $string =~ /^(.*?) /gm;
print "First word: $start\n","Line starts: @lines\n";
When the above program is executed, it produces the following result:
First word: Cats
Line starts: Cats When
Matching Boundaries
The \b assertion matches at any word boundary, as defined by the difference between the \w class and the \W class. Because \w includes the characters for a word, and \W the opposite, this normally means the start or the termination of a word. The \B assertion matches any position that is not a word boundary. For example:
/\bcat\b/ # Matches 'the cat sat' but not 'cats on the mat'
/\Bcat\B/ # Matches 'verification' but not 'the cat on the mat'
/\bcat\B/ # Matches 'catatonic' but not 'polecat'
/\Bcat\b/ # Matches 'polecat' but not 'catatonic'
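The four boundary patterns can be checked against a few words in one loop. This is a small sketch; the word list and the labels whole, prefix, suffix, and inside are illustrative names chosen here:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %result;
for my $word ('cat', 'cats', 'polecat', 'verification') {
    # Record which of the four boundary patterns match this word.
    $result{$word} = join ',',
        ($word =~ /\bcat\b/ ? 'whole'  : ()),
        ($word =~ /\bcat\B/ ? 'prefix' : ()),
        ($word =~ /\Bcat\b/ ? 'suffix' : ()),
        ($word =~ /\Bcat\B/ ? 'inside' : ());
    print "$word: $result{$word}\n";
}
```

Each word matches exactly one pattern, because the start and the end of "cat" are each either at a word boundary or not.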
Selecting Alternatives
The | character works much like the logical OR within Perl. It specifies alternate matches within a regular expression or group. For example, to match "cat" or "dog" in an expression, you might use this:
if ($string =~ /cat|dog/)
You can group individual elements of an expression together in order to support complex matches. Searching for two people's names could be achieved with two separate tests, like this:
if (($string =~ /Martin Brown/) ||
($string =~ /Sharon Brown/))
This could be written as follows:
if ($string =~ /(Martin|Sharon) Brown/)
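The parentheses matter here, because | has very low precedence inside a pattern. A brief sketch with a made-up list of names shows the difference between the grouped and the ungrouped form:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @names = ('Martin Brown', 'Sharon Brown', 'Martin Smith', 'Sharon');

# Grouped: either first name, followed by " Brown".
my @grouped = grep { /(Martin|Sharon) Brown/ } @names;

# Ungrouped: "Martin" alone, OR the full text "Sharon Brown".
my @ungrouped = grep { /Martin|Sharon Brown/ } @names;

print "Grouped:   ", join(', ', @grouped), "\n";
print "Ungrouped: ", join(', ', @ungrouped), "\n";
```

The ungrouped pattern also accepts "Martin Smith", which is probably not what was intended.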
Grouping Matching
From a regular-expression point of view, there is no difference between
$string =~ /(\S+)\s+(\S+)/;
and
$string =~ /\S+\s+\S+/;
except, perhaps, that the former is slightly clearer.
However, the benefit
of grouping is that it allows us to extract a sequence from a regular
expression. Groupings are returned as a list in the order in which they appear
in the original. For example, in the following fragment we have pulled out the
hours, minutes, and seconds from a string.
my ($hours, $minutes, $seconds) = ($time =~ m/(\d+):(\d+):(\d+)/);
As well as this
direct method, matched groups are also available within the special $x
variables, where x is the number of the group within the regular expression. We
could therefore rewrite the preceding example as follows:
#!/usr/bin/perl
$time = "12:05:30";
$time =~ m/(\d+):(\d+):(\d+)/;
my ($hours, $minutes, $seconds) = ($1, $2, $3);
print "Hours : $hours, Minutes: $minutes, Second: $seconds\n";
When the above program is executed, it produces the following result:
Hours : 12, Minutes: 05, Second: 30
When groups are used
in substitution expressions, the $x syntax can be used in the replacement text.
Thus, we could reformat a date string using this:
#!/usr/bin/perl
$date = '03/26/1999';
$date =~ s#(\d+)/(\d+)/(\d+)#$3/$1/$2#;
print "$date\n";
When the above program is executed, it produces the following result:
1999/03/26
The \G Assertion
The \G assertion
allows you to continue searching from the point where the last match occurred.
For example, in the following code we have used \G so that we can search to the
correct position and then extract some information, without having to create a
more complex, single regular expression:
#!/usr/bin/perl
$string = "The time is: 12:31:02 on 4/12/00";
$string =~ /:\s+/g;
($time) = ($string =~ /\G(\d+:\d+:\d+)/);
$string =~ /.+\s+/g;
($date) = ($string =~ m{\G(\d+/\d+/\d+)});
print "Time: $time, Date: $date\n";
When the above program is executed, it produces the following result:
Time: 12:31:02, Date: 4/12/00
The \G assertion is actually just the metasymbol equivalent of the pos function, so between regular expression calls you can continue to use pos, and even modify the value of pos (and therefore \G) by using pos as an lvalue subroutine.
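Reading pos and assigning to it can be sketched as follows. The sample string and the variable names are assumptions made for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

#             0123456
my $string = "aXbXcXd";

$string =~ /X/g;                  # finds the X at offset 1
my $after_first = pos($string);   # position just past that match

pos($string) = 4;                 # assign to pos: skip the X at offset 3
$string =~ /X/g;                  # search resumes at offset 4, finds the X at offset 5
my $after_jump = pos($string);

# \G anchors the next match at the current pos.
my ($rest) = $string =~ /\G(.*)/;

print "After first match: $after_first\n";
print "After jump: $after_jump\n";
print "Rest: $rest\n";
```

Because a failed /g match resets pos, add the c modifier when the position has to survive a failure.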
Regular-expression Examples
Literal characters:
Example | Description
Perl | Match "Perl".
Character classes:
Example | Description
[Pp]ython | Match "Python" or "python"
rub[ye] | Match "ruby" or "rube"
[aeiou] | Match any one lowercase vowel
[0-9] | Match any digit; same as [0123456789]
[a-z] | Match any lowercase ASCII letter
[A-Z] | Match any uppercase ASCII letter
[a-zA-Z0-9] | Match any of the above
[^aeiou] | Match anything other than a lowercase vowel
[^0-9] | Match anything other than a digit
Special Character Classes:
Example | Description
. | Match any character except newline
\d | Match a digit: [0-9]
\D | Match a nondigit: [^0-9]
\s | Match a whitespace character: [ \t\r\n\f]
\S | Match nonwhitespace: [^ \t\r\n\f]
\w | Match a single word character: [A-Za-z0-9_]
\W | Match a nonword character: [^A-Za-z0-9_]
Repetition Cases:
Example | Description
ruby? | Match "rub" or "ruby": the y is optional
ruby* | Match "rub" plus 0 or more ys
ruby+ | Match "rub" plus 1 or more ys
\d{3} | Match exactly 3 digits
\d{3,} | Match 3 or more digits
\d{3,5} | Match 3, 4, or 5 digits
Nongreedy repetition:
This matches the smallest number of repetitions:
Example | Description
<.*> | Greedy repetition: matches "<python>perl>"
<.*?> | Nongreedy: matches "<python>" in "<python>perl>"
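The difference between the two rows can be seen by capturing what each pattern matches in the same string. A minimal sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = "<python>perl>";

# Greedy: .* grabs as much as possible, then backs off to the last ">".
my ($greedy) = $text =~ /(<.*>)/;

# Nongreedy: .*? grabs as little as possible, stopping at the first ">".
my ($lazy) = $text =~ /(<.*?>)/;

print "Greedy:    $greedy\n";
print "Nongreedy: $lazy\n";
```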
Grouping with parentheses:
Example | Description
\D\d+ | No group: + repeats \d
(\D\d)+ | Grouped: + repeats \D\d pair
([Pp]ython(, )?)+ | Match "Python", "Python, python, python", etc.
Backreferences:
This matches a previously matched group again:
Example | Description
([Pp])ython&\1ails | Match python&pails or Python&Pails
(['"]).*?\1 | Single or double-quoted string. \1 matches whatever the 1st group matched. \2 matches whatever the 2nd group matched, etc. (A backreference cannot be used inside a bracketed character class, so the nongreedy .*? stands in for "anything up to the closing quote".)
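A backreference only succeeds when the later text repeats exactly what the group captured. The sketch below tries this on a few made-up inputs and uses a nongreedy .*? between the quotes, which is one common way to write a quoted-string pattern:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# \1 must repeat whatever ([Pp]) captured, so mixed case fails.
my @inputs = ('python&pails', 'Python&Pails', 'python&Pails');
my @ok = grep { /([Pp])ython&\1ails/ } @inputs;

# The closing quote must be the same character as the opening quote.
my $line = q{say "it's" or 'done'};
my @quoted;
while ($line =~ /(['"])(.*?)\1/g) {
    push @quoted, $2;
}

print "Matched: @ok\n";
print "Quoted: ", join(' | ', @quoted), "\n";
```

The apostrophe inside "it's" does not end the first string, because \1 there stands for a double quote.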
Alternatives:
Example | Description
python|perl | Match "python" or "perl"
rub(y|le) | Match "ruby" or "ruble"
Python(!+|\?) | "Python" followed by one or more ! or one ?
Anchors:
These specify the match position:
Example | Description
^Python | Match "Python" at the start of a string or internal line
Python$ | Match "Python" at the end of a string or line
\APython | Match "Python" at the start of a string
Python\Z | Match "Python" at the end of a string
\bPython\b | Match "Python" at a word boundary
\brub\B | \B is nonword boundary: match "rub" in "rube" and "ruby" but not alone
Python(?=!) | Match "Python", if followed by an exclamation point
Python(?!!) | Match "Python", if not followed by an exclamation point
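The last two rows are lookahead assertions: they test what follows without consuming it. A short sketch with made-up inputs illustrates both the test and the fact that the exclamation point is not part of the match:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @inputs = ('Python!', 'Python?', 'Python');

# Positive lookahead: "Python" only when "!" comes next.
my @excited = grep { /Python(?=!)/ } @inputs;

# Negative lookahead: "Python" only when "!" does NOT come next.
my @calm = grep { /Python(?!!)/ } @inputs;

# The lookahead consumes nothing, so the "!" survives a substitution.
(my $swapped = 'Python!') =~ s/Python(?=!)/Perl/;

print "Followed by !: @excited\n";
print "Not followed by !: @calm\n";
print "Swapped: $swapped\n";
```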
Special syntax with parentheses:
Example | Description
R(?#comment) | Matches "R". All the rest is a comment
R(?i)uby | Case-insensitive while matching "uby"
R(?i:uby) | Same as above
rub(?:y|le) | Group only without creating \1 backreference
Perl - Sending Email
Using sendmail utility
Sending a plain message
If you are working on a Linux/Unix machine, you can simply use the sendmail utility inside your Perl program to send email. Here is a sample script that can send an email to a given email ID. Just make sure the given path for the sendmail utility is correct; it may be different on your Linux/Unix machine.
#!/usr/bin/perl
$to = 'abcd@gmail.com';
$from = 'webmaster@yourdomain.com';
$subject = 'Test Email';
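$message = 'This is test email sent by Perl Script';
# Stop with an error message if the sendmail pipe cannot be opened
open(MAIL, "|/usr/sbin/sendmail -t") or die "Cannot run sendmail: $!";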
# Email Header
print MAIL "To: $to\n";
print MAIL "From: $from\n";
print MAIL "Subject: $subject\n\n";
# Email Body
print MAIL $message;
close(MAIL);
print "Email Sent Successfully\n";
The above script is actually a client email script, which drafts the email and submits it to the mail server running locally on your Linux/Unix machine. The script itself is not responsible for delivering the email to its actual destination, so you have to make sure an email server is properly configured and running on your machine to send email to the given email ID.
Sending an HTML message
If you want to send HTML formatted email using sendmail, you simply need to add Content-type: text/html\n in the header part of the email, before the blank line that ends the headers, as follows:
#!/usr/bin/perl
$to = 'abcd@gmail.com';
$from = 'webmaster@yourdomain.com';
$subject = 'Test Email';
$message = '<h1>This is test email sent by Perl Script</h1>';
open(MAIL, "|/usr/sbin/sendmail -t") or die "Cannot run sendmail: $!";
# Email Header
print MAIL "To: $to\n";
print MAIL "From: $from\n";
print MAIL "Subject: $subject\n";
# The blank line after the last header ends the header section
print MAIL "Content-type: text/html\n\n";
# Email Body
print MAIL $message;
close(MAIL);
print "Email Sent Successfully\n";
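A mail message is a block of header lines, one blank line, and then the body, so every header, including Content-type, has to appear before that first blank line. The sketch below builds such a message as a plain string without sending anything; build_message is a hypothetical helper introduced here for illustration and the addresses are placeholders:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: joins the headers, adds the blank separator line,
# then appends the body.
sub build_message {
    my (%args) = @_;
    my @headers = (
        "To: $args{to}",
        "From: $args{from}",
        "Subject: $args{subject}",
        "Content-type: $args{type}",
    );
    return join("\n", @headers) . "\n\n" . $args{body};
}

my $raw = build_message(
    to      => 'abcd@gmail.com',
    from    => 'webmaster@yourdomain.com',
    subject => 'Test Email',
    type    => 'text/html',
    body    => '<h1>Hello</h1>',
);

# Everything before the first blank line is header, the rest is body.
my ($head, $body) = split /\n\n/, $raw, 2;

print "--- headers ---\n$head\n";
print "--- body ---\n$body\n";
```

A string built this way could be printed to the sendmail pipe in one call instead of header by header.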
Using MIME::Lite module
If you are working on a Windows machine, you will not have access to the sendmail utility. As an alternative, you can write your own email client using the MIME::Lite Perl module. You can download this module from MIME-Lite-3.01.tar.gz and install it on either a Windows or a Linux/Unix machine. To install it, follow these simple steps:
$ tar xvfz MIME-Lite-3.01.tar.gz
$ cd MIME-Lite-3.01
$ perl Makefile.PL
$ make
$ make install
That's it: you will have the MIME::Lite module installed on your machine. Now you are ready to send your email with the simple scripts explained below.
Sending a plain message
The following script will take care of sending email to the given email ID:
#!/usr/bin/perl
use MIME::Lite;
$to = 'abcd@gmail.com';
$cc = 'efgh@mail.com';
$from = 'webmaster@yourdomain.com';
$subject = 'Test Email';
$message = 'This is test email sent by Perl Script';
$msg = MIME::Lite->new(
From => $from,
To => $to,
Cc => $cc,
Subject => $subject,
Data => $message
);
$msg->send;
print "Email Sent Successfully\n";
Sending an HTML message
If you want to send HTML formatted email using MIME::Lite, you simply need to set the message's content-type attribute to text/html. Following is the script which will take care of sending HTML formatted email:
#!/usr/bin/perl
use MIME::Lite;
$to = 'abcd@gmail.com';
$cc = 'efgh@mail.com';
$from = 'webmaster@yourdomain.com';
$subject = 'Test Email';
$message = '<h1>This is test email sent by Perl Script</h1>';
$msg = MIME::Lite->new(
From => $from,
To => $to,
Cc => $cc,
Subject => $subject,
Data => $message
);
$msg->attr("content-type" => "text/html");
$msg->send;
print "Email Sent Successfully\n";
Sending an attachment
If you want to send an attachment, the following script serves the purpose:
#!/usr/bin/perl
use MIME::Lite;
$to = 'abcd@gmail.com';
$cc = 'efgh@mail.com';
$from = 'webmaster@yourdomain.com';
$subject = 'Test Email';
$message = 'This is test email sent by Perl Script';
$msg = MIME::Lite->new(
From => $from,
To => $to,
Cc => $cc,
Subject => $subject,
Type => 'multipart/mixed'
);
# Add your text message.
$msg->attach(Type => 'text/plain',
Data => $message
);
# Specify your file as attachment.
$msg->attach(Type => 'image/gif',
Path => '/tmp/logo.gif',
Filename => 'logo.gif',
Disposition => 'attachment'
);
$msg->send;
print "Email Sent Successfully\n";
You can attach as many files as you like to your email using the attach() method.
Using SMTP Server
If your machine is not running an email server, you can use any other email server available at a remote location. To use another email server you will need an id, its password, its URL, and so on. Once you have all the required information, you simply need to provide it in the send() method as follows:
$msg->send('smtp', "smtp.myisp.net", AuthUser=>"id", AuthPass=>"password" );
You can contact your email server administrator to obtain the information used above. If a user id and password are not already available, your administrator can create them for you.