Extended Global Regular Expressions Print (egrep)


Searching Text Files With Egrep

There are different ways to search text. You can use a text editor that supports regular-expression patterns, or you can use the egrep utility. I think egrep is the best way to search!

When we search using egrep with a regular expression, it attempts to match the expression against each line of each file and displays the lines in which a match is found. Egrep is freely available for Linux, Windows, and Mac.


egrep '^(from|Subject):' filename

Do you know exactly what happens when you invoke the egrep command above?

Egrep interprets the first command-line argument as a regular expression and the second as a file to search. Note that the single quotes in the command above are not part of the regular expression; the command shell needs them so that the expression reaches egrep unchanged.
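To make the two arguments concrete, here is a minimal sketch. The sample file and its four lines are hypothetical, made up only for illustration. It uses grep -E, which is the same tool under its current name (egrep is the older spelling and may print a deprecation warning on recent GNU grep).

```shell
# Hypothetical sample file standing in for "filename".
sample=$(mktemp)
printf '%s\n' \
  'from: alice@example.com' \
  'Subject: lunch on Friday' \
  'Date: Mon, 1 Jan 2024' \
  'I heard from: nobody' > "$sample"

# First argument: the regular expression, single-quoted so the shell
# does not interpret ^, (, | or ) itself.
# Second argument: the file to search.
matches=$(grep -E '^(from|Subject):' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

Only the first two lines are printed. The last line contains "from:" but not at the start of the line, so it is skipped.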

We’ll analyze what the various parts of the regex mean in a moment. In this case, the parentheses, the “^”, and the “|” characters are regular-expression metacharacters, which combine with the other characters to produce the desired result.

If your regular expression doesn’t use any metacharacters, egrep effectively treats it as a plain-text search.


egrep 'is' filename
Searching for ‘is’ in a file finds and displays all lines that contain the two letters i and s in a row. This includes lines containing This, is, dis, basis, or English. The key point is that regular-expression searching is not done on a “word” basis. egrep has no idea of English words; it knows only the concepts of bytes and lines.
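The substring behaviour is easy to check with a small experiment. This is only a sketch: the word list below is an assumed sample, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical word list, one word per line.
sample=$(mktemp)
printf '%s\n' 'This' 'basis' 'English' 'word' 'IS' > "$sample"

# 'is' contains no metacharacters, so this is a plain substring search.
matches=$(grep -E 'is' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

This, basis, and English are printed because each contains "is" somewhere inside it. The line IS is not printed, because the search is case sensitive.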


Egrep Metacharacters
Let’s start to explore some of the egrep metacharacters that supply its regular-expression power.

Starting (^) and Ending ($) of a line

Caret (^) and dollar ($) are the simplest metacharacters to understand.
Example 1: ^hai matches if the beginning of a line is followed immediately by h, then immediately by a, and then by i.
Example 2: is$ matches if the letter ‘i’ is immediately followed by ‘s’ at the end of the line.
Example 3: ^hello$ matches a line that consists of only hello; extra words, spaces, and punctuation are not allowed. It’s also case sensitive.
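The three anchor examples can be tried side by side. The sketch below assumes a made-up sample file; the variable names are illustrative only, and grep -E stands in for egrep.

```shell
# Hypothetical sample lines chosen to exercise each anchor.
sample=$(mktemp)
printf '%s\n' \
  'hai there' \
  'oh hai' \
  'this is' \
  'is it' \
  'hello' \
  'hello world' \
  'Hello' > "$sample"

# ^hai : "hai" must sit at the very start of the line.
starts=$(grep -E '^hai' "$sample")
# is$ : "is" must sit at the very end of the line.
ends=$(grep -E 'is$' "$sample")
# ^hello$ : the whole line must be exactly "hello".
whole=$(grep -E '^hello$' "$sample")

printf 'start: %s\nend: %s\nwhole: %s\n' "$starts" "$ends" "$whole"

rm -f "$sample"
```

Each pattern selects exactly one line here: "hai there", "this is", and "hello". The lines "oh hai", "is it", "hello world", and "Hello" fail because the text is in the wrong position, has extra characters, or differs in case.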

Character Classes ([])
Matching any one of several characters
Let’s say you want to search for the word grey in a file, and you also want to check whether gray is there. How would you do this? It’s a little tricky, but very useful once you know it. The regular-expression construct [ ], usually called a character class, lets you do this.

Example 1: To search for both grey and gray, use the following command:
egrep 'gr[ea]y' filename
This means: find the letter g, followed by r, followed by either an e or an a, followed by y.

Example 2: You can allow capitalization of a word’s first letter, as with [Nn]ithin, which matches both nithin and Nithin.

Example 3: Character classes are very useful when searching for HTML tags. For example, to select all header tags, just use '<H[123456]>', which matches the tags <H1>, <H2>, <H3>, <H4>, <H5>, and <H6>.

Note: Inside a character class, we can use ‘-’ to indicate a range of characters.

Example 4: Instead of '<H[123456]>' we can use '<H[1-6]>'. A few more examples are [0-9] and [a-zA-Z0-9]. Multiple ranges are fine, so [0123456789abcdefABCDEF] can be written as [0-9a-fA-F] (or, perhaps, [A-Fa-f0-9], since the order in which ranges are given doesn’t matter).
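The character-class examples above can be checked with a short script. This is a sketch under assumptions: the sample lines are invented for the test, the variable names are illustrative, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical sample file mixing the spellings and tags discussed above.
sample=$(mktemp)
printf '%s\n' \
  'grey' 'gray' 'griy' \
  '<H1>' '<H6>' '<H7>' \
  'Nithin' 'nithin' > "$sample"

# A class matches exactly one character out of the listed set.
colours=$(grep -E 'gr[ea]y' "$sample")
names=$(grep -E '[Nn]ithin' "$sample")

# The explicit list and the range describe the same set of digits.
listed=$(grep -E '<H[123456]>' "$sample")
ranged=$(grep -E '<H[1-6]>' "$sample")

printf '%s\n' "$colours" "$names" "$ranged"

rm -f "$sample"
```

The lines griy and <H7> are left out because i is not in [ea] and 7 is not in [1-6]. The listed and ranged forms return identical results.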

Negated Character Classes ([^])
If you use ^ as the first character inside a character class, the meaning is entirely different. If you use [^1-6] instead of [1-6], it matches a character that is not 1 through 6.
The rules about which characters are metacharacters differ inside and outside a class. We’ve already seen one example, the range-building dash: it is special only inside a character class (and then only when it is not the first character in the class). ^ is a line anchor outside a class, but a class metacharacter inside a class (and only when it comes immediately after the class’s opening bracket; otherwise it is not special inside a class either).


egrep 'ta[^z]' filename

The output will be (tap, tap, tape, table, tag).

Let’s have a look at the output above. The words “ted”, “ta”, and “tom” are not listed. In the case of “ta”, both the t and the a are present, yet the word is still not displayed.
Remember, a negated character class means “match a character that’s not listed” and not “don’t match what is listed.” These might seem the same, but the “ta” example shows the subtle difference. A convenient way to view a negated class is that it is simply shorthand for a normal class that includes all possible characters except those that are listed.
The “ta” case can be a little confusing. The regular expression calls for ‘t’, followed by ‘a’, followed by a character that’s not ‘z’. In “ta” there is no character at all after the a (assuming the word ends its line), so the class has nothing to match.
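The difference between “match a character that’s not listed” and “don’t match what is listed” shows up clearly in a small test. The word list here is an assumption, one word per line, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical word list, one word per line.
sample=$(mktemp)
printf '%s\n' 'tap' 'tape' 'table' 'tag' 'ted' 'ta' 'tom' 'taz' > "$sample"

# [^z] must consume one real character, and that character must not be z.
matches=$(grep -E 'ta[^z]' "$sample")
printf '%s\n' "$matches"

rm -f "$sample"
```

The line taz is rejected because the character after ta is z. The line ta is rejected for a different reason: nothing follows the a, so there is no character for [^z] to match.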

Matching Any Character with Dot

Metacharacter (.)
The dot is shorthand for a character class that matches any character. It is useful when you want an “any character here” placeholder in your expression.

For example, if you want to search for a date such as 21/12/83, 21.12.83, or even 21-12-83, you can use a regular expression like “21[-./]12[-./]83”.
Instead, you can also use “21.12.83”, in which each dot between the numbers is an any-character placeholder. This version is looser, though: it also matches text such as 21x12y83.
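A quick comparison shows how the character-class version and the dot version differ. This is a sketch with an assumed sample file; the names strict and loose are illustrative, and grep -E is used as the equivalent of egrep.

```shell
# Hypothetical sample file: three real dates and one look-alike.
sample=$(mktemp)
printf '%s\n' '21/12/83' '21.12.83' '21-12-83' '21x12y83' > "$sample"

# Inside [ ], the dot is an ordinary character and a leading dash is literal,
# so only -, . and / are accepted as separators.
strict=$(grep -E '21[-./]12[-./]83' "$sample")

# Outside a class, each dot accepts any single character.
loose=$(grep -E '21.12.83' "$sample")

printf 'strict:\n%s\nloose:\n%s\n' "$strict" "$loose"

rm -f "$sample"
```

The strict pattern returns the three real dates, while the loose one also returns 21x12y83. Which one to use depends on how much you know about the data being searched.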
