Wednesday, September 3, 2014

Regular Expressions - Part 2

Introduction

An earlier posting entitled Regular Expressions - Part 1, covered the use of regular expressions (regex) in Google Spreadsheet functions. This post covers the more conventional use of regexes in JavaScript by showing how they can be applied in Google Apps Scripts (GAS). Unless the code examples use spreadsheet-specific objects, most of the examples shown here should work in browser client JavaScript or in Node.js. I will review this blog entry and revise and extend it over time. I am also considering writing a longer piece and may post it on Leanpub. This entry assumes a reasonable level of competence in JavaScript. Where spreadsheet-specific examples are given, knowledge of GAS and the main Google Spreadsheet GAS objects is assumed. These topics are covered in the Google documentation an in my book Google Spreadsheet Programming.

This is quite a long blog entry and uses some more advanced GAS techniques such as closures and callbacks.

Note: This is still a rough version and is subject to review and editing.

Regexes In JavaScript

Some general points on regexes relevant to all versions/dialects of JavaScript:
  • Regex syntax borrowed from Perl
  • Support for a sub-set of Perl regexes
  • Regexes are objects
  • As with Function, Array, Object, etc there are two ways of creating regex objects:
    1. Constructor functions (var re = new RegExp("re_string");)
    2. Literal notation using forward slashes (var re = /re_code/;)
  • The literal notation is generally preferable
  • The constructor function approach is required when the regexes are built dynamically (an example of this is given towards the end of this entry)
  • Once a regex object has been created there are two ways of using it by invoking:
    1. Either of the regex methods test() and exec() that take strings as arguments
    2. One of the four string methods search(), match(),  replace() or split() that take regex objects as arguments
The regex syntax was covered in the earlier blog entry when regex spreadsheet functions were discussed but it will be reviewed again here in the example code. It is worth noting at the outset that any general material on JavaScript regexes is relevant to their application in Google Apps Script. A very good starting point for general reading on regexes in JavaScript is the Mozilla Developer Network entry on the topic.

Regexes In Google Apps Script

There are two ways to apply regexes in Google Spreadsheets
  1. As double-quoted string arguments to spreadsheet functions, REGEXMATCH , REGEXEXTRACT, REGEXREPLACE as discussed in the entry Regular Expressions - Part 1
  2. As regex objects in GAS where the target text is derived from spreadsheet cells, input forms, as a GAS variable or via some programmatic process.

GAS provides a complete implementation of JavaScript regexes. The syntax used to define them and the RegExp and string methods that take RegExp objects as arguments are the same as those provided by JavaScript in the browser or in Node.js on the server. Therefore, learning regexes in GAS is a good investment in time and effort since the knowledge gained can be used in other JavaScript environments and can be applied to other programming languages as well. Regexes are implemented by most programming languages and, analogous to SQL and relational databases, the dialects and level of functionality implemented may vary, but the underlying principles are the same.

Application Regexs

  1. Determine if patterns exist in a target string
  2. Extract sub-strings from the target string based on patterns
  3. Replace patterns
  4. Validate a target string by ensuring that it conforms to specified patterns
Note the word patterns is used in each of the four bullet points above. This was intentional; Regexes are all about finding patterns in text and then acting on what is found. If the aim is to just find and deal with straight literal text, for example, every occurrence of "John" in a target string, then regexes are not required. However, if the requirement is to find every occurrence of "John" and "Johnny", regardless of case but to ignore it when embedded in words like "Johnson", then the task becomes more difficult. It can of course be done without regexes but would require a lot more code than a solution that used regexes. Regexes are one of those skill sets that once you have started to use them, you wonder how you ever managed without them.

If spreadsheets dealt solely with numerical data, properly formatted dates and well defined text, there would be no need for regexes when manipulating them. However, in the real word, semi-structured or unstructured data is frequently handled in spreadsheets. Older versions of Excel limited text entries to 256 characters but this was subsequently increased to 32,767! Google Spreadsheets have a text limit of 32,000 characters per cell (I verified this by using the spreadsheet REPT() function because I was unable to find the official Google limit). This is slightly lower than the Excel limit but still very substantial. There is therefore ample opportunity to store large quantities of semi-structured or unstructured data in spreadsheets to the extent that manipulation of such data is beyond the capabilities of in-built spreadsheet text function. Regexes to the rescue!

Regex Examples In Google Spreadsheets

To summarize what has been stated previously; There are two aspects to learning regexes in any JavaScript environment:
  1. The regex syntax. The syntax is concise but quite cryptic
  2. The methods of RegExp objects that take string arguments and the string methods that take RegExp arguments. There are six methods in total; Two of the former and four of the latter.

The following examples will use each of the six methods and will describe the regex syntax for example

Overview Of Regex Concepts

Before delving in with regex examples in GAS, the following set of bullet points is given to summarize some key regex concepts.
  • Regexes are composed of literal characters and metacharacters. In a regex a literal character matches itself. Metacharacters have a special meaning within a regex. Their special meaning can be revoked by preceding them with a backslash; This operation is termed escaping. A consecutive occurrence of metacharacters is sometimes known as a metasequence.
  • The period (dot) . represents any character except a new line.
  • There are short-cut metasequences for representing different groups of character types. Prominent among these are \d for any digit, \s for any type of white space, and \w for any alphanumeric character or underscore. Each of these metasequences has an uppercase partner with the opposite meaning so that \D and \S match a non-digit and a non-whitespace character, respectively
  • Quantifiers mandate a minimum number of the preceding character or group of characters. They are *, + and  ? The star (*) signifies zero or more occurrences of the preceding character or group of characters if the characters are enclosed by parentheses (). + signifies one or more occurrences of the preceding character or group of characters if the characters are enclosed by parentheses (). ? signifies zero or one occurrence of the preceding character or group of characters if the characters are enclosed by parentheses ().
  • Braces ({}) specify another type of quantifier known as the interval quantifier. They can be used to specify a minimum, maximum, or range of occurrences of a preceding character or group of characters  if the characters are enclosed by parentheses ().
  • The ? metacharacter when preceded by any of the quantifiers changes the quantifier default behaviour of "match as much as possible" to a "match as little as possible". By default regex matches are "greedy" and ? makes them "parsimonious".
  • Alternations can be specified with the bar character (|). For example, "ab|cd" matches either an "ab" or "cd" in a target string. Parentheses can be used to constrain the extent of the alternation so that "a(b|cd)" will match either a "b" or a "cd".
  • Character classes are created within square brackets ([]) and specify the occurrence of  any one of the characters within the brackets. If the first character within the brackets is the caret symbol (^), it is called negated character class; That is, any character other than those listed within the square brackets constitutes a match. Different types of ranges are recognized within character classes so that "[0-9]" matches any digit and "[a-z]" matches any lowercase letter.
  • Parentheses perform three roles within regexes. As mentioned above (1) they constrain the reach of alternatives. (2) They also define groups that can be used be treated as a unit by the *, +, ?, and {} interval quantifiers. (3) They form groups that can be captured into variables.
  • Zero width assertions are metacharacters that match positions and not characters. They are the string start and string end metacharacters, ^ and $ respectively, and the \b metasequence that specifies a non-word/word boundary.
  • Modifiers are single letter flags that appear outside the regex that affect how the regex is applied. JavaScript has three modifiers: (1) The i flag makes the match case-insensitive (2) The m flag allows the ^ and $ zero-width assertions to match at the start and end of lines in multi-line strings. (3) The g flag ( g = global) affects the behaviour of certain RegExp and string methods when the regex can match at multiple locations within the target string. Its use will be explained in the examples that follow. Any combination of the i, g, and m modifiers can be used as required.

The points above cover the very basics of regexes as used in JavaScript/GAS. A relatively small number of characters can be used to compose complex text search patterns. A complicating factor in learning regexes is that certain metacharacters have multiple context-dependent meanings. For example, ^ has a different meaning depending on whether it is outside of or within a character class. The ? metacharacter can be either a quantifier or a modifier of quantifiers where it alters the default "greedy" matching. It is also used with other characters to form a metasequence for non-capturing parentheses and look ahead assertions. These uses will be explained in examples below.


Note On Non-ASCII Characters
I have only run examples on ASCII input strings. I will point out a few issues with the extended Latin character set but there is no coverage of characters from other alphabets such as Cyrillic. When I talk of a character, I am not talking about a Unicode character. I realise this is a somewhat parochial approach but I simply have no experience of matching non-Latin characters.

Test Data

Here is a screenshot of some test data that will be used to demonstrate RE techniques:




These 14 values are captured in a JavaScript array using this function:

// Get an array of all the values in range A1:A12
//  of the active sheet.
// The Range getValues() method returns an array-of-arrays
// The array map() method is used to extract the first element
//  of each inner array.

function getCellValuesAsArray() {
  var ss = SpreadsheetApp.getActiveSheet(),
      inputAddress = 'A1:A12',
      rng = ss.getRange(inputAddress),
      values = rng.getValues().map(function(innerArray) { return innerArray[0]; });
  return values;
}


The function given above uses techniques and array methods discussed in earlier blog entry.

Does A Pattern Exist? The RegExp test() Method

The easiest way to check if a pattern exists in a target string is to use the RegExp test() method. This method takes a single string argument and returns true if the pattern represented by the RegExp is present in the target string. Here are some examples where the array elements returned by getValuesAsArray() are each evaluated with different RE patterns using the test() method. The test() method is the easiest of all the methods involving REs but it can be used to fully explore RE syntax so it will be heavily used for this purpose in the examples that follow.

Note: To Execute the examples from the GAS editor, you need to select some function. For the examples below, I selected the getCellValuesAsArray() method but did not enclose the actual RE code snippets in a function. GAS executes them anyway. Check the View->logs window for the output

1: All Elements Containing The String Java
var testStrings = getCellValuesAsArray();
Logger.log(testStrings.filter(function(element) { return /Java/.test(element); }));

// Output: [Java, JavaScript, Javascript, JavaScript is not Java]

The above code snippet takes the array of cell values as input and filters the array using the RegExp test() method in the array filter() method. The regex is defined between the forward slashes. These forward slashes are the literal notation for creating RegExp objects. The test() method is called repeatedly on each array element by the function argument of the Array filter() method. Any element containing the string "Java" is returned.

2: All Elements Containing The Word Java
var testStrings = getCellValuesAsArray();
Logger.log(testStrings.filter(function(element) { return /\bJava\b/.test(element); }));

// Output: [Java, JavaScript is not Java]

This regex is more interesting than example 1. It only finds elements in the input array where Java appears as a word. This effect is achieved by the preceding and succeeding "\b" metasequence. Metacharacters and metasequences represent something other than their literal selves. The \b metasequence marks locations at word boundaries. The regex sense of a word is different from a human language understanding; Any run of alphanumeric characters and underscores can make a word. Contrast this with looking for the string "Java" bounded by spaces:
var testStrings = getCellValuesAsArray();
Logger.log(testStrings.filter(function(element) { return / Java /.test(element); }));

// Output: []

This returns nothing. The space-bounded string does not work when the word is at the beginning or the end of a string.

3: All elements Containing The Word JavaScript Regardless Of Case
var testStrings = getCellValuesAsArray();
Logger.log(testStrings.filter(function(element) { return /javascript/i.test(element); }));

// [JavaScript, Javascript, JavaScript is not Java]

This regex uses the i modifier to make the match pattern test case-insensitive. Two other modifiers m and g will be covered in later examples.

4: All Elements Containing The String "Perl" Regardless Of Case
var testStrings = getCellValuesAsArray();
Logger.log(testStrings.filter(function(element) { return /perl/i.test(element); }));

// Output: [Perl, Writing PHP properly, Perl, Python and Ruby are popular scripting languages]

This list includes the element "Writing PHP properly". The word "properly" contains the characters "perl" so providing the match. The \b word boundary metasequence can be used to guard against this type of match if it is not wanted.

5: All Elements Containing Either The Words "Perl" Or "Python"
var testStrings = getCellValuesAsArray();
Logger.log(testStrings.filter(function(element) { return /\bPython\b|\bPerl\b/i.test(element); }));

// Output: [Perl, Python, Perl, Python and Ruby are popular scripting languages]

The bar metacharacter (|) specifies alternation. The RE matches any elements wit the word "python" or perl" in any case.

6: All Elements Containing The Words  "Perl" And "Python" In Any Order
var testStrings = getCellValuesAsArray();
Logger.log(testStrings.filter(function(element) { return /\bPython\b/i.test(element)
                                                    && /\bPerl\b/i.test(element); }));

// Output: [Perl, Python and Ruby are popular scripting languages]

The easiest way to express an and condition where the order does not matter is to use two regex tests with the JavaScript and (&&) operator. The above regex will select any element in the input array that has both word "Perl" and "Python" in in it regardless of case (the i modifier) or the order of the words.

7: All Elements Beginning With C Followed By Optional # Or ++
var testStrings = getCellValuesAsArray();
Logger.log(testStrings.filter(function(element) { return /^C(#|\+\+)?/.test(element); }));

// Output: [C++, C#, C]

The regex here demonstrates the ^ anchor so that listed elements begin with "C". The C can be followed by optional # or ++. The bar symbol (|) signifies alternation and the parentheses set the limits of the alternation. The ? metacharacter is what makes the alternation optional. Note how the + signs are escaped with backslashes in order to convert them from metacharacters to literals. When testing this regex I first tried this /\bC(#|\+\+)?\b/ but this regex also returned "Objective-C". The reason for this is that the regex treats the hyphen as a non-word character so that the C is treated as a word. This is standard regex behaviour but it is a potential "gotcha".

These seven examples will suffice to introduce the regex syntax and describe the RegExp test() method. The RegExp test() method always returns a Boolean, the regex matches the string argument or it does not. The regex pattern can be as complex as you wish but the only information you will get from the test() method is a true or false. In order to do things like find all matches, split the string into an array based on a Regex match,  or extract or replace parts of matches, the RegExp exec() method or one of the string methods match(), split() or replace(). These will be discussed with examples but first there is one string method called search() to briefly discuss before moving on.

String search() Method

The string search() method takes either a string literal or a regex as an argument and returns the position at which the argument matches or -1 if there is no match. Two important points: Firstly, it converts string literal arguments into regexes and, secondly, the string count begins at zero. There is little else to be said about this method. Personally, I prefer the RegExp() test() method if I just want to check for a match but if information on the position of the match is important, use search().

Here is an example of its usage:
var testSearchStr = 'He paid 35 dollars';
var testSearchRe = /\b\d\d\b/;
Logger.log(testSearchStr.search(testSearchRe).toFixed(0));


This reports the first occurrence of two consecutive digits as a "word" (\b metasequence). Note the toFixed() method was called because GAS would otherwise print 8.0 and that looks a bit silly in this context.


String split() Method

Google spreadsheets come with a handy built-in function called SPLIT() that takes an input string and returns an array based on the split text given. This can be a single character or a number of characters. The following screenshot shows its usage:
 

What if you want to split on a pattern, you need to use the string split() method instead. The following screenshot shows some test data (a contrived example). The task is to split on colons or semicolons surrounded by optional white space:




The following GAS function reads each cell in range A1:A10 and writes the split() output to columns B, C, D, F:
// Takes the active selection of the active spreadsheet
//  and splits cell contents on colons and semi-colons
//  bounded by optional spaces.
// The resulting array is written to adjacent columns.
// For demo purposes only. This function would be
//  refactored in a real application.

function splitDemo() {
  var ss = SpreadsheetApp.getActiveSpreadsheet(),
      inputRng = ss.getActiveRange(),
      cell,
      i,
      re = /\s*[:;]\s*/,
      rowCount = inputRng.getNumRows(),
      cellValue,
      cellElements = [],
      j;
  for(i = 1; i <= rowCount; i += 1) {
    cell = inputRng.getCell(i, 1);
    cellValue = cell.getValue();
    cellElements = cellValue.split(re);
    for (j = 0; j < cellElements.length; j += 1) {
      cell.offset(0, j+1).setValue(cellElements[j]);
    }
  }
}

As noted in the function comments, this is not a production-ready function. It was just written to show how Google Spreadsheet cell values can be used to call the split() method in GAS using a RegExp object as an argument.

The splitDemo() function code uses techniques covered in earlier blog entries and in my book but, as a refresher, here is a short summary of the key points:
  • Assigns the active spreadsheet object to the ss variable
  • Assigns the selected range of the spreadsheet object to the variable inputRng
  • Creates a RegExp object and assigns it to variable re
  • Uses a simple for loop to iterate over all the selected cells
  • Uses the i loop variable as the row argument to the Range getCell() method and assigns the getCell() return value to the variable cell
  • Extracts the value in the cell object as a string using the Range getValue() method
  • Calls the split() method of the returned string value using the RegExp object as an argument
  • Prints the resulting array to adjacent columns by using a simple for loop to iterate over the array

The output is shown in the screenshot below:


A final point on the split() method, it takes a second optional integer argument that allows the number of elements from the resulting array to be given. Therefore if only the first two elements are of interest, a second argument, 2, would return an array without the superfluous elements.

String replace() Method

The string replace() method is extremely powerful. In its simplest application, it can be used to replace one string literal with another where the first argument is the string to replace and the second is the replacement string. Unlike the search() method covered earlier, it does not convert either of its arguments into RegExp objects. The replace() method is very powerful for the following reasons:
  • It is not limited to replacing literal values. Its first argument can be a RegExp object thereby allow complex patterns to be replaced
  • The full regex match and parts of the match defined by grouping parentheses can be manipulated using special variables
  • The second argument to replace() can be a function that can compute the replacement dynamically. This is extremely powerful!

Here is GAS function that converts dates from American format (MM/DD/YYYY) to ISO 8601 format YYYY-MM-DD. Well, it sort of does it, a later version will do it properly:
// Converts any dates in American Month/Day/Year to YYYY-MM-DD
// Works with either / or - as date delimiter
// Should report months like February as "02" Will fix in later version

function convertDateToYYYYMMDD(cellValue) {
  return cellValue.replace(/(\d{1,2})[/-](\d{1,2})[/-](\d{4})/g, "$3-$1-$2");
}


The function takes an input string and replaces any occurrences one or two digits followed by either a hyphen or forward slash followed by one to two digits followed once again by a hyphen or forward slash followed by four digits. The capturing parentheses are key to how the replacement works. There are three sets of them in the given regex. Going from left to right, each set of parentheses creates a special variable starting at $1. In this example, $1 is set to what ever was captured for the month, $2 to whatever was captured for the day and $3 to whatever was captured for the year. These special variable names have been taken from the Perl language. The replacement does a swap by putting the year first, then the month, then the day. Note also the g modifier (g for global); Its role is crucial here, if it is dropped only the first date would be replaced.

This screenshot shows how the function can be called as a user-defined function (UDF):


The input comes from column A and the output is shown in column B. There are a number of issues with this version of the function. Firstly, ISO 8601 dates should use two characters for all days and months so that February, for example, is given as "02". Secondly, in cell A5 the match is buried in a string so it may not be a date. The next version deals with these two issues and shows how the replacement string can be generated by a function.

Here is version 2:
// Perform proper ISO 8601 date conversion
// Ignore dates that are not "words" (bounded by \b)
// Use an inner function to do the conversion

function convertDateToYYYYMMDD_V2(cellValue) {
  var replacementFunction = function(fullMatch, month, day, year) {
    month = month.length === 1 ? '0' + month : month;
    day = day.length === 1 ? '0' + day : day;
    return year + '-' + month + '-' + day;
  };
  return cellValue.replace(/\b(\d{1,2})[/-](\d{1,2})[/-](\d{4})\b/g, replacementFunction);
}


This version is a good deal more complicated than the earlier version. It too can be called as a UDF as can be seen on the screenshot below. The interesting aspect of this version is the inner function expression. The inner function expression creates a function object and assigns it to the variable replacementFunction. Note the semicolon after the closing brace in the function expression. This is classic functions as first class objects in JavaScript. This function expression takes the entire match as its first argument (it is not used in this example), and then each of the three variables assigned to the matched text within the parentheses are passed as additional arguments. The JavaScript ternary operator is used to conditionally prepend a zero (0) to month and day if they have one character, in this way "2" for February becomes "02". The function variable is then used as a callback and passed as the second argument to the replace() method. The example given should convey the enormous potential and power of this approach to do some very complex text manipulation. It is also testament to the power and flexibility of JavaScript/GAS functions. Here is a screenshot showing how this function can be called from the spreadsheet as a UDF:





The above example is worth studying, testing with various inputs and modifying. The inner function expression will appear strange to those coming from VBA where functions and subroutines are not nearly so powerful as in JavaScript/GAS.

Here is one further example that converts dollar values to their equivalent euro values and replaces the $ symbol with the euro symbol:

// Given a cell value replace all occurrences of $ followed
//  by numbers or decimal points with the Euro symbol followed
//  by the amount in Euros (0.75 Euros to a Dollar at time of writing)

function convertDollarToEuro(cellValue) {
  var replacementFunction = function(fullMatch, dollarValue) {
    var euroValue = +dollarValue * 0.75;
    return "\u20AC" + euroValue.toFixed(2);
  };
  return cellValue.replace(/\$([\d.]+)/g, replacementFunction);
}


This example is very similar to the previous one in that it uses an inner function expression to do the replacement. The conversion rate is hard-coded at 0.75 (true when writing) but it could have been added as a parameter to the outer function. Note the unicode "\u20AC" for the Euro symbol. Here is a screenshot demonstrating how it can be called as a UDF:





The final example for the replace() function demonstrates how matches can be highlighted. It is a useful trick considerably easier to understand than the previous two examples.

Here is a function that that highlights any matches in the target string by enclosing them in hash (pound) symbols:

// Highlight any matches that are not "uppercase letter, one or
//  more digits, uppercase letter
function highlightNonStandard(cellValue) {
  return cellValue.replace(/[A-Z]\d+[a-z]/g, "#$&#");
}


It flags any matches where there is an uppercase letter immediately followed by one or more digits, immediately followed by a lowercase letter. It is given as an example to demonstrate the special variable $& in the replacement string. It represents the string matched by the entire regex. It is another one of those special Perl variables that JavaScript inherited.

Here is a screenshot showing this function called as a UDF:


Note at time of writing the regex /[A-Z]\d+[^A-Z]/ does not work as expected. This should mean match an uppercase letter, followed by one or more digits followed by a character that is not an uppercase digit. Oddly this doesn't work at all. I'll check it in Node.js and post an update. Perhaps I am missing something here or perhaps there is a bug in GAS! 

String match() Method

This is the final one of the four string methods that can take RegExp objects as arguments. Like the search() method, it can take a literal string as an argument and it similarly converts that string into a RegExp object. It always returns an array but the presence or absence of the g modifier determines the type of array returned. A GAS example should help to clarify this distinction:

// Given a DNA sequence match one to three characters
//  (non-global match).

function demoMatchNonGlobal() {
  var dna = 'TTTTACGTACGTCC';
  Logger.log(dna.match(/[ACGT]{1,3}/));
}
// Output [TTT]


Without the g modifier, this function printed an array with the first match as the single element "TTT". The interval quantifier specifies between one and three of the characters in the character class. You may ask why the match is three characters long: The answer is that quantifiers are greedy by default. They grab the maximum possible while still allowing the entire regex to match.

Note: The GAS output differs from what you get when trying the same code in Node.js
// Node.js
> var dna = 'TTTTACGTACGTCC';
undefined
> console.log(dna.match(/[ACGT]{1,3}/));
[ 'TTT', index: 0, input: 'TTTTACGTACGTCC' ]


Node.js and standard browser JavaScript both provide additional information. However, the Node.js array is also only composed of a single element (its length = 1).

When the global modifier is added, GAS returns all matches:
// Given a DNA sequence match one to three characters
//  (Global match).

function demoMatchGlobal() {
  var dna = 'TTTTACGTACGTCC';
  Logger.log(dna.match(/[ACGT]{1,3}/g));
}
// Output: [TTT, TAC, GTA, CGT, CC]


The same output is generated from Node.js.

Once an array of matches is generated, it can be accessed programmatically and manipulated in GAS.

RegExp exec() Method

This is the final of the six methods to be discussed. It is the second RegExp method, test() is the first.
I am only going to give it brief coverage here
Here is a GAS example:

function testRegExpExec() {
  var s = 'There were 22 sheep, 12 cows and 13 pigs',
      re = /\d+/g,
      match;
  match = re.exec(s);
  Logger.log(match);
  match = re.exec(s);
  Logger.log(match);
  match = re.exec(s);
  Logger.log(match);
  match = re.exec(s);
  Logger.log(match);
}


At each invocation it returns the index of the next match or null if there is no next match. That is the coverage exec() gets in this posting. It may merit further discussion in a future posting.

RegExp Constructor Functions

Earlier in this blog entry it was stated that there are two ways of creating RegExp objects:
  1. The literal notation (//)
  2. The constructor function notation (new RegExp())
It was stated that the former is generally preferable to the latter but that there is a specific instance when the latter is required. That is when the regex is not known until runtime. It may be taken as input from a user via a dialog entry textbox or from a spreadsheet cell. The important point is that the constructor function syntax allows users to specify regexes.

Here is a GAS function that builds a RegExp object at runtime:
// This function gets its input arguments from a spreadsheet cells
// It uses these arguments to create a RegExp using the 
//  RegExp constructor function.
function runtimeRegExp(searchString, pattern, modifiers) {
  var re = new RegExp(pattern, modifiers);
  return re.test(searchString);
}

It can be called as a UDF as shown in the following screenshot:


The interesting aspect of this UDF is that it takes the target string, the pattern, and the modifiers (only filled in in row 5, blanks are ok) as values from cells in the spreadsheet. It uses the give regex pattern and modifiers (if any) to create a RegExp object. It then uses the test() method to determine if there is a match with the target string value from column A. Some of the regexes above introduce syntax not discussed before so here is a short synopsis of the new material
  • Row 3 has the regex pattern ^$, this matches an empty string (or a blank cell in the spreadsheet setting).
  • Row 5 uses the back reference syntax. In the string replace() method the parentheses capture their contents into $1, $2, etc. In the actual regex itself, the contents of parentheses are available from left to right as back references and they are denoted by \1, \2, etc. This regex says capture a word and then see if you can match the same word again allowing only spaces between the occurrences. This is a trick shown in most regex tutorials and books.

Concluding Remarks

Regexes are very powerful tools for general spreadsheet programming. Google Spreadsheets support them through novel regex functions and through GAS. GAS implements JavaScript regexes which in turn are a subset of the Perl implementation. Many useful and some esoteric, but still useful, Perl features are missing from the JavaScript implementation. Two very useful modifiers that are absent are x and s. The x modifier allows regexes to be spread over lines with embedded comments. Use of this modifier makes longer regexes much more readable and supportable. The s modifier allows the . metacharacter to match embedded new lines. Its absence in JavaScript implementations requires care when using regexes with embedded new lines because such metasequences as .* and .+ will not match across embedded new lines. A workaround is to use [\s\S]+ or [\s\S]* instead. These mean match as many white space characters and non-white space characters as possible.
This tutorial covered mostly ASCII examples, there may be issues with non-ASCII characters in some situations. Here is a GAS test of a non-English character from the extended ASCII character set:var fr = 'vous êtes';
var re1 = /[ê]/;
var re2 = /\bê/;
Logger.log(re1.test(fr)); // returns true
Logger.log(re2.test(fr)); // returns true
GAS treats the "ê" correctly in both tests. However, here is what happens with the equivalent code in Node.js (v0.10.18):
> var fr = 'vous êtes';
> var re1 = /[ê]/;
> var re2 = /\bê/;
> console.log(re1.test(fr));
true
> console.log(re2.test(fr));
false

The Node.js version does not recognise the word boundary at the ê! The lesson here is that there may be inconsistencies between different JavaScript regex implementations, especially when non-ASCII characters are involved.

Suggestions, comments criticisms are welcome. As I said at the beginning, at some stage I would like to bring all this material together in an edited and expanded form and  make it available in PDF.

Cheers

Mick

email: mick@google-spreadsheet-programming.com



















1 comment:

  1. That post was amazingly helpful. Thank you.

    If you find a bit of time -- I'm going to continue looking around -- perhaps you could hint on this task I've set myself to.

    I want a long list of words and short phrases in an array, but the spreadsheet function you have there is perfect, so I can add and edit. This will be over 100 patterns. I want to put this list against my gDoc (I write fiction) had have each of those words found, and have the background highlighted so I can scan and find them. I have several tweeks I would like to add later, like different colors for different offenses, but right now a basic yellow highlight will work and save me several hours of editing time.

    What is necessary, is that this happens after a request, not automatically while I'm writing -- writing is never helped by editing as you go.

    Anyway, your post here is great, and thank you for taking the time to create it -- very cool indeed.

    Glenn

    ReplyDelete