Wednesday, September 25, 2013

Regular Expressions - Part 1

A complete version of this and the following blog entry will be added as an appendix to this leanpub book.

Update 3rd-Sep-2014: Published almost a year after Part 1, here is Regular Expressions - Part 2 (Much longer and more detailed than this entry!)

Introduction

A regular expression describes a text pattern. Regular expressions can be used to locate complex patterns within textual data and these patterns can then be replaced with new text. Regular expressions have a reputation for being difficult to write and even more difficult to maintain. There is at least a grain of truth in a witticism attributed to Jamie Zawinski:

Some people, when confronted with a problem, think
“I know, I'll use regular expressions.”   Now they have two problems.


Source of the quote is here.

Some factors that contribute to the complexity of regular expressions are:
  1. The cryptic and terse nature of the syntax.
  2. The same symbols can have different meanings in different contexts.
  3. Multiple ways specifying patterns for the same purpose.

To get a feel for how complex regular expressions can become, have a look at some of the examples here! This expression is supposed to test for a valid email address, I have not tested it so cannot say if it does what it is supposed to do but it is worth having a look at the to get a feel for how truly complex regular expressions can become.

Despite their potential complexity, regular expressions are really useful tools and some of the most useful ones are not at all difficult. They crop up in many different settings and can greatly simplify task that would be exceedingly difficult without them. There is lots of freely available information on them on the internet and a simple Google search can often be used to find a ready-made solution that someone  else has written and then posted on a website or blog for others to use. As with any code taken from the internet, it is really important to test it to ensure it works as advertised.

Regular expressions, from here on referred to as REs, first became popular in UNIX text editing tools such as ed and sed back in the 1970s. Perl greatly extended the power of the sed implementation and and Perl REs have become the de facto standard to such an extent that other languages frequently describe their own RE implementations as "Perl compatible". Although Perl has done much to both popularise and extend REs, other languages, including JavaScript, now have good RE implementations of their own as do many relational databases such as Oracle, MySQL and PostgreSQL.

JavaScript's RE syntax and semantics were borrowed directly from Perl but some advanced Perl RE techniques are absent.  Google Apps Script implements JavaScript REs in full so client side RE examples can often be used in Google Apps Scripts and vice versa.

Incidentally, REs have been available in Excel VBA for some time (on Windows but not on Mac OS X) via a VBScript library that has to be loaded from Visual Basic Editor (Tools->References->Microsoft VBScript Regular Expression 5.5) . This fact is often over-looked by VBA developers.

Google Spreadsheet Regular Expression Functions

Google Spreadsheets has an extensive range of built-in functions for text manipulation; The full list is here. Many of these are also found in Excel and in other spreadsheet applications but three RE-related ones are unique to Google Spreadsheets (to the best of my knowledge):

  1. REGEXMATCH
  2. REGEXEXTRACT
  3. REGEXREPLACE

These three functions provide a convenient starting point for introducing REs without having to jump straight into JavaScript.

REGEXMATCH is the easiest of the three, so examples of its use are given first. It simply returns a Boolean value indicating if the given pattern is found in the input text.

Figure 1: Testing function REGEXMATCH with input and regular expressions.

Figure 1 above is a screenshot of a spreadsheet where the REGEXMATCH function was tested. Column A contains the test input, column B contains the result of formula given in column C. The question the formula was written to answer is stated in column C.

The REGEXMATCH function takes two arguments: The test text and the regular expression enclosed in double quotes. The test text can be either a cell value, the return text from another spreadsheet function, or a literal value. If it is a literal value then it too must be enclosed in double quotes.

Here is a very brief description of the REs given above:

  • The first, "\d", matches any digit (0-9). If the input contains at least one digit, it returns TRUE. The "\d" character combination is what is known as a character class abbreviation. The backslash is a metacharacter which essentially means that it signifies something other than its literal meaning. When it precedes a "d" in an RE the combination means a single instance of any digit. This can also be specified using a character class where it is specified as "[0-9]". This is an example of the redundancy in REs, there are frequently equivalent alternative ways of specifying patterns.
  • The second RE, "[^a-zA-Z0-9.\s]", is a more complex example. The objective here is to determine if the test text contains any character that is not a digit, period (dot), or an English alphabet letter. It uses the a fore-mentioned character class. This is a negated character class, its negated nature is conferred by the caret (^). The caret is another metacharacter but it is a metacharacter with different meaning depending on its context. When present as the first character in a character class (that is within [ and ]), it means a match will be found if some character not specified within the character class is found in the test string. The example here uses a character class range, "a-z" and "A-Z" mean any lower case or upper case letter, respectively. 0-9 means any digit and is equivalent to "\d". The period is another interesting character because inside a character class it represents itself but outside one, it means any character except a new line character. The examples that return TRUE for the test RE do so because they contain currency symbols. The "\s" within the character class is another example of a character class abbreviation and it represents the space character, tab, and new line.
  • The "^\d" RE returns TRUE when its test text begins with a digit. The caret symbol in this context is an anchor; It does not represent a character but rather a position, the equivalent for the end of the string is the dollar ($) symbol. The caret character has a dual role in REs, in the earlier example, when it is the first character within a character class it negates the meaning.
  • The RE "colou?r" matches an input string if it contains the string "color" or "colour". This could either be as a word or as part of a word such as "de-colouring" for example. The "?" is another metacharacter with multiple meanings. In this context, the immediately preceding character the "u" in this example is optional, is made optional by the "?".
  • The final RE, "^\d+$", returns TRUE only if the test text is an integer. The caret and dollar metacharacters are both anchors in this context meaning start and end of a the string, respectively. The TO_TEXT spreadsheet function is necessary to convert the numbers into text. There is no automatic coercion of numbers into text and the REGEXMATCH function expects text.

The REGEXEXTRACT And REGEXREPLACE Functions

REs  can also be used to great effect with these two functions.

The REGEXREPLACE is the more useful of the two functions. It takes three arguments:
  1. The text input text. This can be from a spreadsheet cell, the result of a spreadsheet formula, or a string literal.
  2. The regular expression to be replaced which is enclosed in double quotes.
  3. The string literal to replace the regular expression, also enclosed in double quotes.

REGEXREPLACE executes a global replace, all occurrences of the regular expression are replaced.
Some examples are shown in Figure 2 below:

Figure 2: Testing the REGEXREPLACE Function

Figure 2 above is a screenshot of a spreadsheet where the REGEXREPLACE function was tested. Column A contains the test input, column B contains the result of formula given in column C. The target pattern to be replaced is described in column C.

Here is a brief description of the REs shown in figure 2 above.
  • The first RE "\d" simply deletes all digits from the input text where "\d" represents any digit as explained in the REGEXMATCH examples.
  • The second example deletes all lower-case letters. These are identified by the RE "[a-z]".
  • The third example ("[-]+") replaces one or more consecutive  dash characters with a single space. The plus sign (+) is another metacharacter and means one or more of the preceding character. The ? just means 0 or 1.
  • The RE "\b" in the fourth example introduces another anchor; The word boundary anchor \b. Like the string start and end ^ and $ anchors, \b does not represent characters. Instead it represents the non-word-word boundary. The RE's understanding of a word is broader than that of a human language. Any run of one or more alphanumeric characters or underscore are considered to be words. The interesting point in this example is that the anchor \b can be replaced with characters, in this example the caret (^) was used.
  • The last example replaces the RE "\d+$" with nothing. In effect, it deletes any trailing digits in the text. 

The REGEXEXTRACT function extracts only the first occurrence of an RE in the input text. For example, this invocation returns 23.

=REGEXEXTRACT("Total Cost € 23:", "\d+")

However, it would be much more useful if it were to return an array of matches. The SPLIT function can return an array elements so that this formula:

=SPLIT("123:456:789", ":")

returns the values of the resulting array in separate cells. Unfortunately, the SPILT function does not accept an RE as a delimiter as the split function in Perl, for example, does

This article has just introduced REs using the spreadsheet functions that Google provides. Part 2 will take REs further by showing how they can be written in Google Apps Script/JavaScript.



3 comments:

  1. This comment has been removed by a blog administrator.

    ReplyDelete
  2. What if I want to use RegExMatch as a condition for that cell? I'm doing a conditional formatting for color. So does the cell contain "Ham"? Then make row orange via =$D1= (does the cell contain the string ham?) ... how do I ask if "Ham" or "haMburger" returns true, thereby coloring the row? Thanks in advance for any help!!

    ReplyDelete
    Replies
    1. I can't figure out the conditional formatting for the rule (??), but I found that REGEXMATCH is inferior to ISNUMBER(SEARCH("string")), due to case sensitivity. Making a hidden column of binaries yields the conditional formatted result

      Delete