Monday, January 10, 2011

Using the Google App Scripting API - Part 2

See chapters 4 and 5 in this book on Leanpub  

The previous blog post introduced the App Scripting API for Google Spreadsheets but the examples given were for demonstration purposes only.  In order to put some of the API classes previously discussed to practical use and to further explore JavaScript, here is a more detailed example that performs a “typical” spreadsheet programming type of task where an input range is processed and the script output is written to another sheet in the same spreadsheet file.



Task Description:  Data extracted from one-to-many database table relationships is frequently presented in a spreadsheet summary format where the column on the one side of the relationship is represented as single cell values while the values from the many side of the relationship are given in multi-value, delimited, format in another column.   


Here is an  example of data from such a relationship:



gene snps
CASP3 rs1049253,rs2705881,rs6948
CHAF1A rs11668886,rs243341,rs243356,rs73234
CIDEB rs2295307,rs4374085,rs4568
CSNK1E rs135750,rs1534891,rs2075984
CSPG6 rs3829193,rs884587
FANCF rs337484,rs3740615,rs448101
NT5C rs750844,rs899317
PCNA rs17349,rs4239761,rs6053149
POLL rs3095803,rs3730484,rs7874
POLR2G rs3829248,rs4061933,rs633742
RAD17 rs2010352,rs299087,rs3756399
RAD51C rs302864,rs302874,rs302877,rs6503874
TREX1 rs3774808,rs730566,rs734072
UBE2N rs11107025,rs12597,rs1483003



There are two columns, gene (single-valued) and snps (Single-Nucleotide Polymorphisms (SNPs)  RS IDs, multi-valued, comma-separated).  This describes a relationship where a gene is associated with multiple SNPs, SNPs can also be associated with multiple genes but, for illustrative purposes, this scenario can be ignored here.  Another more general example of this type of data might be a list of manager names or IDs with a concatenated list of the staff who work for them in another column. The data presented above is fine for reports or summary tables but multi-valued fields are difficult to query and are problematic for relational databases. However, they are a common format and sometimes have to be processed into something more usable.
How can the data be extracted such that the number of rows is expanded so that each SNP is represented in its own row and the relationship to gene is maintained.  


The desired output format is:
gene_name snp_rs_id
CASP3 rs1049253
CASP3 rs2705881
CASP3 rs6948
CHAF1A rs11668886
CHAF1A rs243341
CHAF1A rs243356
CHAF1A rs73234
CIDEB rs2295307
…........................................
etc.




Breaking the task into steps:

  1. Define the input range (here it is in a sheet named “sample_data” with the data in range “A2:B15”
  2. Add a sheet for output, here called “output”
  3. Process each row of input so that each SNP is given its own row and the link to its associated gene is maintained


Post-processing checks

  • The output row count should be the total number of SNP RS IDs.


JavaScript Code (heavily commented)

// Extract an input sheet summarising a one-to-many relationship so that a new sheet is created with the
// multiple concatenated values in the "many" column each extracted to their own row.
function extractMultivaluedCells() {
   // Get a spreadsheet object.
   var active_spreadsheet = SpreadsheetApp.getActiveSpreadsheet();
   // Using te spreadsheet object, define a sheet object using the sheet name.
   // Assumes that this sheet name exists in the spreadsheet object!
   var data_sheet = active_spreadsheet.getSheetByName("sample_data");
   // Get a range object from the sheet object using address notation.
   var input_rng = data_sheet.getRange("A2:B15");
   // The column number with single values associated with multiple values in another column.
   var one_side_column = 1;
   // The column number with multi-valued cells.
   var many_side_column = 2;
   // The delimiter used to separate the values in the multi-valued column.
   var delimiter = ',';
   // Add a new sheet and keep a reference to the added sheet object.
   var added_sheet = active_spreadsheet.insertSheet("output");
   // Get an object reference for a single cell in the added sheet for output column names.
   var header_row = added_sheet.getRange("A1");
   // Determine the number of rows in the input range
   var input_row_count = input_rng.getNumRows();
   // Obtain a range object for the first row of output data.
   var output_start = added_sheet.getRange("A2");
   // Declare a counter to be used to track the output row.
   var output_offset = 0;
   
   // Write the column header information for output.
   header_row.offset(0, 0).setValue("gene_name");
   header_row.offset(0, 1).setValue("snp_rs_id");
   
   // Process each row of the input range.
   for ( var i = 1; i <= input_row_count; i++ ) {
       // Get the gene name by first using the "getCell()" method of the input range object using
       // row and column coordinates (Note this is one-based!), then return the value for that cell.
       var gene = input_rng.getCell(i,one_side_column).getValue();
       // Return an array of SNP IDs by getting a range object representing one cell as above,
       // retrieve the multi-valued string from that cell,
       // call the "split()" string method passing in the "delimiter" value as an argument to
       // return an array of SNP IDs
       var snps = input_rng.getCell(i,many_side_column).getValue().split(delimiter);
       // Store the number of SNP IDs for the current input row.
       var snp_count = snps.length;
       // Enter an inner for loop to process each SNP ID.
       for ( var idx = 0; idx < snp_count; idx++ ) {
           // Write the gene ID to column 1 of the output sheet.
           output_start.offset(output_offset,0).setValue(gene);
           // Write each SNP ID in the array to column 2.
           output_start.offset(output_offset,1).setValue(snps[idx]);
           // Increment the variable used to track the output row so that the next
           // SNP in the array is written to a new row.
           output_offset++;
       }
   }
 
}

To use the above code
  • Paste the input data into a sheet called “sample_data”
  • Paste the JavaScript code into the Script Editor and run the function called “extractMultivaluedCells
  • This will produce output as shown above in a sheet simply called “output”.


Code Notes
  • The comments should explain what each line of the function is supposed to do.
  • The code shown could be improved by adding error handling and it could also be generalised by adding arguments for the sheet, range, and columns to process.  These measures would make it more robust and re-usable.  Future posts will try to cover such issues.  
  • For the moment, it suffices to demonstrate using JavaScript to achieve a typical data extraction task.
  • For comparison, here is Excel VBA code that takes the same input and writes the same output.

' Process a column with multi-valued cells to produce a new sheet where each value in the multi-valued column gets its own row.
'  See JavaScript notes above for explanation of the code
Sub ExtractMultivaluedCells()
       Dim input_rng As Range: Set input_rng = ActiveWorkbook.Worksheets("sample_data").Range("A2:B15")
       Dim one_side_column As Integer: one_side_column = 1
       Dim many_side_column As Integer: many_side_column = 2
       Dim delimiter As String: delimiter = ","
       Dim added_sheet As Worksheet: Set added_sheet = ActiveWorkbook.Worksheets.Add
       added_sheet.Name = "output"
       Dim header_row As Range: Set header_row = Worksheets("output").Range("A1")
       Dim input_row_count As Integer: input_row_count = input_rng.Rows.Count
       Dim output_start As Range: Set output_start = Worksheets("output").Range("A2")
       Dim output_offset As Integer: output_offset = 0
       Dim i As Integer, idx As Integer
       Dim gene As String
       Dim snps() As String
       
       header_row.Offset(0, 0).Value = "gene_name"
       header_row.Offset(0, 1).Value = "snp_rs_id"
       
       For i = 1 To input_row_count
           gene = input_rng.Cells(i, one_side_column).Value
           snps = Split(input_rng.Cells(i, many_side_column).Value, delimiter)
           For idx = 0 To UBound(snps)
               output_start.Offset(output_offset, 0).Value = gene
               output_start.Offset(output_offset, 1).Value = snps(idx)
               output_offset = output_offset + 1
           Next idx
       Next i
       
End Sub
  • The VBA code above is deliberately written to resemble the analagous JavaScript as closely as possible.  It is not how one would normally write Excel VBA, this code is for illustration purposes only.  The lack of error handling and the hard-coded worksheet, range addresses and column numbers are also noted here as deficiencies that would need to be rectified for a working application to be shared with others.
  • The JavaScript version is recognisably similar to the VBA one when one looks at method names, e.g. offset() in both versions (lowercase for JavaScript and Offset in VBA is actually a property) while the VBA Cells property is used here to return a range object consisting of a single cell in a manner similar to the “getCell()” method of the range object in the JavaScript version.
  • In the JavaScript version, one can add a sheet to a spreadsheet with the same name as a pre-existing one and, unlike Excel VBA, no error is generated!
  • Anyone who already knows Excel VBA and who is prepared to learn JavaScript, should not encounter too many difficulties in learning Google spreadsheet programming.


The above code shows how a typical spreadsheet data transformation task can be executed in Google spreadsheets using JavaScript.



The next post will concentrate on one of JavaScript’s best features, object literals, that can be used to write really powerful data munging code.  If the task is to reverse the process shown above, i.e., one wants to generate  the one-to-many summary as output, that task is made trivial by using object literals as associative arrays.  More on this soon!

3 comments:

  1. Is it possible to modify this example to dynamically create the 1-to-many relationship based on key input into the cell. In this example the static data is already in tabular format. I am wondering if it would have been possible to enter a new snps value in the first sheet and have the logic automatically create the many side, one entry at a time?

    ReplyDelete
  2. Hi mdc1, just picked up your comment. You can set what Google Spreadsheets calls a "trigger" for a given script by selecting the "Triggers" menu from the script editor window for a given spreadsheet, from the "events" you can then select "on edit". The script above inserts a new sheet with a hard-coded name but, unlike Excel, you can have more than one sheet with the same name! Not sure I've answered your question. I hope to do a post in the near future on forms that could cover triggering events.

    ReplyDelete
  3. HI there, I am trying to set up something similar to this but on the reverse logic.

    I have two separate sheets, let's say A and B, where A has unique values and B corresponding ones that reflect a one-to-many relationship (1 A --has--> Many Bs) and I would like to have a cell in each row of the A spreadsheet automatically show which rows of B match, all inside one cell in a concatenated form.

    How would you go around this?

    ReplyDelete