Regular Expressions in Scripting
No matter how beautiful and advanced the GUIs and window managers are, the real power of Linux remains in automation through the command line. Automation means having a program perform a task without any manual intervention. Such a program doesn't necessarily have to be written in a compiled language such as C. In fact, automation programs are often best written in interpreted languages, because interpreted programs are faster to modify and re-run. The other important requirements for such automated programs are:
* Simple and easy file operations (redirections).
* Simple and easy pattern operations (regular expressions).
* Ability to achieve the maximum functionality with minimum coding by having powerful functions / commands and dynamic variable typing.
* Ability to mix and match many such languages (invocations).
Languages that meet the above-mentioned requirements are generally referred to as scripting languages. Tcl/Tk, Perl, Python and Shell are a few of the popular scripting languages. All of these are almost equally powerful in computation. The difference lies in the power of expression of logic in each one of them, and that is the deciding factor when selecting a scripting language. The best part of these scripting languages is the ability to mix and match many such languages to leverage the best of each.
What do we mean by script or scripting? A script is a program written in any of the scripting languages. Initially, these programs were simple and repetitive in nature, but as things evolved, scripts started to be used for sophisticated tasks too. The term scripting is a colloquial term for the writing of such a program.
In this article, we will talk about one of the most important components of scripting: Regular Expressions.
The concept of regular expressions comes from Formal Automata Theory. For the sake of definition, a regular expression is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are usually used to give a concise description of a set, without having to list all its elements. In layman's terms, a regular expression is a shorthand notation for a particular set of patterns.
Listed below are the basic regular operators in Formal Automata Theory (FAT):
* OR denoted by `+'
* AND denoted by `.'
* POWER denoted by superscript
* Special cases of POWER are the superscripts * and +, denoting 0 or more and 1 or more powers, respectively
Their equivalents in scripting languages are:
* OR among single characters denoted by [ ] (OR between longer patterns is written with `|')
* AND denoted by juxtaposing (as in algebra)
* POWER denoted by { }
* * and + denoted by * and + but not superscripted
The regular operands are the symbols from a pre-defined symbol set S. For the simplest case, let S = {0, 1}. Here are a few regular expression examples based on what we have discussed so far:
Description | FAT Notation | Script Notation
All binary strings/numbers | (0 + 1)⁺ | [01]+
All even binary numbers | (0 + 1)*0 | [01]*0
All 5-digit binary numbers | (0 + 1)⁵ | [01]{5}
All 5-digit binary numbers not starting with 0 | 1(0 + 1)⁴ | 1[01]{4}
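The script-notation column of the table above can be tried out directly. The following is a minimal sketch in Python, assuming its standard `re` module as the regular expression engine; the dictionary keys and the helper name `matches` are made up for this illustration. It uses `re.fullmatch` so that a string counts as a match only if the whole string belongs to the set.

```python
import re

# Script-notation patterns copied from the table; the key names are hypothetical.
PATTERNS = {
    "binary": r"[01]+",
    "even": r"[01]*0",
    "five_digit": r"[01]{5}",
    "five_digit_no_leading_zero": r"1[01]{4}",
}

def matches(name, text):
    """Return True if the whole of `text` is in the set described by the named pattern."""
    return re.fullmatch(PATTERNS[name], text) is not None

print(matches("binary", "0110"))                       # True
print(matches("binary", "0120"))                       # False: 2 is not in S
print(matches("even", "110"))                          # True: ends in 0
print(matches("five_digit", "0101"))                   # False: only 4 digits
print(matches("five_digit_no_leading_zero", "10101"))  # True
```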
Some food for thought: try to write a regular expression for each of the following:
* All binary numbers containing 1s only in pairs
* All binary palindromes
* All binary numbers divisible by 3
Now, in real life, scripting for just binary patterns is not that useful. We need the complete character set as the set S. With that, we have too many symbols to use, apart from more complicated real-life patterns. So, additional notations were added. Some of the most often used ones are:
* RANGE denoted by `-', e.g., [ABC...Z] may be represented as [A-Z]; [abc...z] as [a-z] and [01...9] as [0-9]
* `.' denotes `Any single character'
* `^' denotes Start of line
* `$' denotes End of line
* {m,n} is shorthand for the OR of POWERS {m}, {m+1}, ..., {n}; n may be left blank to indicate infinity
In the scripting world, this extended set S and all such associated notations are the ones that form the regular expressions.
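As a rough illustration of these additional notations, here is a small sketch, again assuming Python's `re` module. The sample text and the function names are invented for the example. Note that in Python, `^` and `$` match at every line boundary only when the `re.MULTILINE` flag is given.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "alpha 42\nBeta\n7 gamma"

# RANGE, `.', `^' and `$' together: whole lines that start with a lowercase letter.
# re.MULTILINE makes ^ and $ match at each line boundary, not only at the ends of the string.
lowercase_lines = re.findall(r"^[a-z].*$", text, flags=re.MULTILINE)
print(lowercase_lines)  # ['alpha 42']

# {m,n}: between 2 and 4 digits, tested against the whole string.
def digits_between_2_and_4(s):
    return re.fullmatch(r"[0-9]{2,4}", s) is not None

# {m,}: n left blank means "m or more".
def at_least_2_digits(s):
    return re.fullmatch(r"[0-9]{2,}", s) is not None

print(digits_between_2_and_4("123"))    # True
print(digits_between_2_and_4("12345"))  # False
print(at_least_2_digits("12345"))       # True
```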
Examples from scripting:
* An identifier: [_A-Za-z][_A-Za-z0-9]*
* All hex values: [0-9A-Fa-f]+
* A complete line: ^.*$
* All decimal values with at most 10 digits: [0-9]{1,10}
* {0,} is identical to *
* {1,} is identical to +
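The examples above can be checked with a short sketch, assuming Python's `re` module; the constant names and the helper `is_full_match` are hypothetical. The last loop spot-checks, on a few sample strings only, that {0,} and {1,} behave like * and +.

```python
import re

# Patterns from the examples above; the constant names are made up for illustration.
IDENTIFIER = r"[_A-Za-z][_A-Za-z0-9]*"
HEX_VALUE = r"[0-9A-Fa-f]+"
DECIMAL_UP_TO_10 = r"[0-9]{1,10}"

def is_full_match(pattern, text):
    """Return True only if the entire string matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(is_full_match(IDENTIFIER, "_tmp1"))              # True
print(is_full_match(IDENTIFIER, "1abc"))               # False: starts with a digit
print(is_full_match(HEX_VALUE, "DeadBeef"))            # True
print(is_full_match(DECIMAL_UP_TO_10, "12345678901"))  # False: 11 digits

# Spot check: {0,} agrees with * and {1,} agrees with + on these samples.
for sample in ["", "a", "aaa", "ab"]:
    assert is_full_match(r"a{0,}", sample) == is_full_match(r"a*", sample)
    assert is_full_match(r"a{1,}", sample) == is_full_match(r"a+", sample)
```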
As an exercise, try to write a regular expression for an e-mail ID. One might wonder what these expressions will be used for. One of the most common and powerful uses is search and replace.
Another much needed use is extracting patterns and operating on them, e.g., extracting e-mail IDs from a file to send e-mails in groups.
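Both uses can be sketched in a few lines. The example below assumes Python's `re` module and, to leave the e-mail exercise open, works on hex values instead: it assumes they are written with a `0x` prefix, which is a convention chosen for this illustration and not something the patterns above require. The sample text is invented.

```python
import re

# Hypothetical sample text; the 0x prefix is an assumption made for this example.
text = "start=0x1F end=0xff size=32"
HEX_LITERAL = r"0x[0-9A-Fa-f]+"

# Search and replace: substitute every hex literal with a placeholder.
masked = re.sub(HEX_LITERAL, "HEX", text)
print(masked)  # start=HEX end=HEX size=32

# Extraction: pull out every hex literal, then operate on the results.
found = re.findall(HEX_LITERAL, text)
values = [int(h, 16) for h in found]
print(found)   # ['0x1F', '0xff']
print(values)  # [31, 255]
```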
In our next article, we will explore these abilities and will look at how to use these regular expressions in various shell commands.