Tuesday, October 7, 2014

Regular expressions with Python

This post is about getting started with regular expressions and writing programs that use them, with Python as the programming language.

There are many theoretical introductions that explain why we use regular expressions. Put simply, a regex is used to find and parse text, for example in search engines or web scraping.

Regular expressions: The following line of code shows the general shape of a regex search in Python, where regexp is a compiled pattern and text is the string to search.
match = regexp.search(text)
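
As a minimal sketch of that shape, the snippet below compiles a pattern and searches a string with it. The sample strings and variable names are made up for illustration.

```python
import re

# Compile once, then reuse the compiled pattern for searching.
number = re.compile(r"\d+")

match = number.search("order 66 shipped")
if match:
    print(match.group())   # the matched text
    print(match.span())    # (start, end) positions of the match

# search() returns None when the pattern is not found anywhere.
missing = number.search("no digits here")
print(missing)
```

Checking the result against None before calling group() avoids an AttributeError on text that does not match.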

Before getting into the lines of code let's have a glance at the patterns.

Quick patterns:
[abc]
A single character of: a, b, or c
[^abc]
Any single character except: a, b, or c
[a-z]
Any single character in the range a-z
[a-zA-Z]
Any single character in the range a-z or A-Z
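^
Start of string (start of each line with re.MULTILINE)
$
End of string, or just before a trailing newline (end of each line with re.MULTILINE)
\A
Start of string
\Z
End of string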
.
Any single character
\s
Any whitespace character
\S
Any non-whitespace character
\d
Any digit
\D
Any non-digit
\w
Any word character (letter, number, underscore)
\W
Any non-word character
\b
Any word boundary


(…)
Capture everything enclosed
(a|b)
a or b
a?
Zero or one of a
a*
Zero or more of a
a+
One or more of a
a{3}
Exactly 3 of a
a{3,}
3 or more of a
a{3,6}
Between 3 and 6 of a
a{,6}
Not more than 6 of a
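
A few of the patterns above can be tried directly with re.findall and re.search. This is only a sketch, and the sample strings are invented for the example.

```python
import re

# Character classes: [abc] accepts one listed character,
# [^abc] accepts any single character that is not listed.
vowel_hits = re.findall(r"[aeiou]", "regex")       # ['e', 'e']
non_digits = re.findall(r"[^0-9]", "a1b2")         # ['a', 'b']

# Shorthand classes: \d is a digit, \w is a word character.
digits = re.findall(r"\d+", "room 101, floor 7")   # ['101', '7']
words = re.findall(r"\w+", "snake_case and CamelCase")

# \b is a word boundary, so the "cat" inside "concatenate" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concatenate cat.")

# (a|b) picks one alternative and captures it as group 1.
pet = re.search(r"(cat|dog)s?", "two dogs")

print(vowel_hits, non_digits, digits, words, whole_words, pet.group(1))
```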







For any pattern:

? - A regex followed by ? matches zero or one occurrence
* - A regex followed by * matches zero or more occurrences (useful when the item is optional)
+ - A regex followed by + matches one or more occurrences
{x,y} - A regex followed by {x,y} repeats between x and y times
{x,} - A regex with only a lower bound repeats x or more times
{,y} - A regex with only an upper bound repeats at most y times
{x} - A regex followed by {x} repeats exactly x times

Write the braces without spaces: Python does not read them as a quantifier if they contain a space.
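
The quantifiers can be checked one by one with re.fullmatch, which succeeds only when the whole string matches. The helper name full and the test strings are assumptions made for this sketch.

```python
import re

def full(pattern, text):
    """Return True when the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

checks = {
    "colou?r on 'color'":   full(r"colou?r", "color"),    # ? lets the u be absent
    "colou?r on 'colour'":  full(r"colou?r", "colour"),   # ... or present once
    "ab* on 'a'":           full(r"ab*", "a"),            # * accepts zero b's
    "ab+ on 'a'":           full(r"ab+", "a"),            # + needs at least one b
    "a{3} on 'aaa'":        full(r"a{3}", "aaa"),         # exactly three
    "a{3,} on 'aa'":        full(r"a{3,}", "aa"),         # fewer than three fails
    "a{3,6} on 'aaaaaaa'":  full(r"a{3,6}", "aaaaaaa"),   # seven is too many
    "a{,6} on ''":          full(r"a{,6}", ""),           # zero is allowed
}

for label, ok in checks.items():
    print(f"{label}: {ok}")
```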

In VERBOSE mode, # starts a comment that runs to the end of the line.
Using \# in a regexp matches the literal character #.
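
A small sketch of the difference, assuming a made-up sentence with hashtags as input:

```python
import re

# With re.VERBOSE, an unescaped # begins a comment, so a literal
# hash sign has to be written as \#.
tag_pattern = re.compile(r"""
    \#        # a literal hash sign
    (\w+)     # the tag name, captured
""", re.VERBOSE)

tags = tag_pattern.findall("Learning #python and #regex")

# Without re.VERBOSE, # has no special meaning and matches itself.
plain = re.findall(r"#\w+", "Learning #python and #regex")

print(tags)    # ['python', 'regex']
print(plain)   # ['#python', '#regex']
```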

. : Matches any character except \n
\. : Matches only a literal dot
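
The following sketch compares the two on an invented sample string. The re.DOTALL flag is not covered in this post; it is included only to show how the newline exception can be lifted.

```python
import re

sample = "abc a.c a\nc"

# An unescaped dot matches any character except a newline.
any_char = re.findall(r"a.c", sample)                # ['abc', 'a.c']

# An escaped dot matches only a literal dot.
literal_dot = re.findall(r"a\.c", sample)            # ['a.c']

# With re.DOTALL the dot also matches a newline.
with_dotall = re.findall(r"a.c", sample, re.DOTALL)  # ['abc', 'a.c', 'a\nc']

print(any_char, literal_dot, with_dotall)
```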

Special case inside a character class:
In the regexp [-a-zA-Z], the hyphen (-) comes first, so it is treated as a literal character to match rather than as part of a range.
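
To see the effect, this sketch matches a hyphenated word (chosen for the example) against a class with and without the leading hyphen:

```python
import re

word = "well-known"

# Hyphen listed first: it is a literal member of the class.
with_hyphen = re.fullmatch(r"[-a-zA-Z]+", word)

# Hyphen only between two characters: it defines the ranges a-z and A-Z,
# so an actual hyphen in the text is not accepted.
without_hyphen = re.fullmatch(r"[a-zA-Z]+", word)

print(with_hyphen is not None)     # True
print(without_hyphen is not None)  # False
```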


With Python:

The r prefix with regular expressions in Python:

Writing a pattern as r'regexp' (a raw string) stops Python from interpreting the backslashes in the expression, so they reach the regex engine unchanged.
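
A short sketch of why this matters, using \b as the example: in a normal string Python turns \b into a backspace character before the regex engine ever sees it. The sample sentence is made up.

```python
import re

normal = "\bword\b"    # Python converts each \b into a backspace character
raw = r"\bword\b"      # backslashes are kept, so the regex sees word boundaries

print(len(normal))     # 6: backspace + "word" + backspace
print(len(raw))        # 8: every backslash and letter is kept

text = "a word here"
found_normal = re.search(normal, text)   # looks for backspace characters
found_raw = re.search(raw, text)         # looks for "word" as a whole word

print(found_normal)         # None
print(found_raw.group())    # word
```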

The re library:

To work with regular expressions in Python, we need to import the re library into our code, as below:
import re

If a regular expression looks like this:

pattern = r'^[A-Z][a-z]{2}\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$'

It might be difficult to revisit and debug this code later. Python helps here by allowing a multi-line expression in VERBOSE mode, where whitespace inside the pattern is ignored and comments are allowed.

pattern = r'''
    ^
    [A-Z][a-z]{2}
    \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}
    $
'''


exp = re.compile(pattern, re.VERBOSE)
# or, using the short alias of the same flag:
exp = re.compile(pattern, re.X)
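
As a quick check that the one-line and multi-line forms behave the same, the sketch below compiles both and runs them on a few strings. The sample strings and the names compact and verbose are assumptions made for illustration.

```python
import re

compact = re.compile(r'^[A-Z][a-z]{2}\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$')

verbose = re.compile(r'''
    ^
    [A-Z][a-z]{2}                        # one capital letter, two lowercase letters
    \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}   # four dot-separated groups of 1 to 3 digits
    $
''', re.VERBOSE)

samples = ["Web192.168.1.10", "web192.168.1.10", "Web192.168.1"]

# Each tuple holds (compact result, verbose result) for one sample.
results = [(bool(compact.search(s)), bool(verbose.search(s))) for s in samples]

for sample, result in zip(samples, results):
    print(sample, result)
```

Both forms should agree on every input, since VERBOSE changes only how the pattern is written, not what it matches.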

A simple example using a regular expression


import re

address_pattern = r'''
^
(?P<Address1>P\.*O\.*\s*(BOX|Box|box)\s\d{1,5})  # address1 field

(?P<City>\s*\w*\W*\s*\w*\W*\s*\w*\W*\s*)         # text between address and zip code

(?P<Zip>\d{5})                                   # zip code (assumed here so the sample text matches up to $)

$
'''

address_reg_exp = re.compile(address_pattern, re.VERBOSE)

text = "PO Box 1055 Jefferson City, MO 65102"


match = address_reg_exp.search(text)

# search() returns None when nothing matches, so check before reading groups
if match:
    g = match.groups()  # tuple of all captured groups
    print(match.group('Address1'))
    print(match.group('City'))
