Tuesday, November 13, 2012

Enhancing Statistical Capabilities with User-Defined Functions

The code presented here was originally written in Excel VBA to calculate confidence intervals for sample means.  Although both Excel and Google spreadsheets have built-in functions to do this, these functions make certain assumptions about the underlying data distribution that may not always be valid so a well known alternative method was implemented in both JavaScript and VBA to do this.

Main Points
  • Use of bootstrapping to derive the confidence interval for the sample mean
  • User-defined JavaScript function to perform this task
  • Extensive use of JavaScript arrays and array methods to implement the bootstrapping function
  • Highlight some potential traps when copying and pasting JavaScript code from the internet
  • Present Excel VBA code to compare and contrast with the JavaScript version

Note: Use statistical packages like R for serious statistics and make sure you understand the tests you are using and the assumptions such tests make.  If possible, consult with a statistician at all stages from experimental design to data analysis.

Confidence intervals are widely used in statistics for evaluating the precision of a parameter estimated by sampling a population.  A common example is the confidence interval of a sample mean and a commonly reported confidence interval is 95%.  A 95% confidence interval of the sample mean tells us that if we were to repeatedly calculate sample means for the population we would expect our true population mean to be included within the confidence interval 95% of the time.  Confidence intervals of the mean can be calculated by both Excel and the Google spreadsheet application by using the CONFIDENCE function, see the Microsoft documentation for this function.  This function is fine for data that at least approximates to a Normal distribution.  There is an alternative method that can be used based on bootstrapping (Wikipedia link) that has only become feasible since computerised statistical applications and electronic spreadsheets became widely available.  It relies on hundreds of repeated estimations of the mean from the sample data so some programming is required to use it. Since I am not a statistician, I feel uncomfortable about making too many assumptions about the data so the bootstrapping method is appealing.  However, to use it in either Google spreadsheets or Excel requires some coding which is the subject of this blog.

Both VBA and JavaScript code are given here to show how the bootstrapping can be implemented for estimation of the confidence interval of a sample mean.  Comparing and contrasting the code in the Excel and Google versions highlights some important differences between the functionality of the two languages.


Sample Data
Firstly, for some sample data here are 12 human body temperature measurements:
37
36
37.1
37.1
36.2
37.3
36.8
37
36.3
36.9
36.7
36.8
The above data was taken from Intuitive Biostatistics by Harvey Motulsky, a highly recommended book if, like me, you are interested in statistics but are not so mathematically inclined.
If these data points are entered in either an Excel or a Google spreadsheet, the 95% confidence interval of the mean can be calculated using the formula:
=CONFIDENCE(0.05,STDEV(A1:A12),12)
See the Excel documentation link above for further information on the CONFIDENCE function.

To calculate the 95% confidence using the bootstrapping method, we need code that does the following:
  • Randomly select a sample of the same size from the 12 values allowing the same value to be chosen more than once, that is, sampling with replacement.
  • Calculate a mean for new sample
  • Repeat the previous two steps multiple times (1000 times in this example) and store all the 1000 means that were calculated by re-sampling the original sample set.
  • Sort the 1000 calculated means in ascending numerical order.
  • The confidence interval corresponds to the 26th and 975th values.  2.5% of the calculated means are below the lower value and 2.5% are above the upper value.

To code this in JavaScript, we need to:
  • Transfer the sample values from a spreadsheet range to a JavaScript array.
  • Perform a number of random re-samplings of this array, 1000 was chosen in the example code.
  • Calculate the means for each of the arrays, store the means, and, after the 1000 trials are complete, sort the array of means, identify the 26th and 975th elements, and return the result as a string.
// A function to calculate 95% confidence intervals of the sample mean by the bootstrapping method.
function calculateCI(values) {
 var n = values.length;
 // This check is required because if called as a user-defined function values will be an array of arrays!
 if( values[0] instanceof Array ) {
//Could also use Array.isArray(values[0]) above to check for “arayness”
    values = values.map( function(value) { return value[0]; } );
 }
 var rand_idx;
 var arr;
 var arr_permuted = [];

 var permute_times = 1000;

 for( var i = 1; i <= 1000; i++ ) {
   arr = [];
   for( var j = 0; j < n; j++ ) {
       rand_idx = Math.floor(Math.random() * n);
       arr[j] = values[rand_idx];
   }
   arr_permuted.push(mean(arr));
 }

 arr_permuted.sort(function(a,b){return a > b ? 1 : a < b ? -1 : 0;});

 return  arr_permuted[25].toFixed(2) + " - " + arr_permuted[974].toFixed(2);

}

// Return the arithmetic mean (average) for an array of numbers.
function mean(arr) {
   var element_count = arr.length;
   var total = 0;
   
   for( var idx = 0; idx < element_count; idx++ ) {
       total += arr[idx];
   }
   
   return total/element_count;
   
}

Notes On The Code Above
  • The function calculateCI(values)  can be called as a user-defined function that can take a range as of cell values as input.  To do this, the “instanceof” check on the first value is required because the range of input values is in the form of an array of arrays.  If the first input value is an array, this is taken to mean that the entire input is composed of an array of arrays so a map function is applied to build a new array of just the first elements of each sub-array.  A consequence of this is that the input values must all come from a single column.
  • Something to be aware of with JavaScript is that the code will run without error without the instanceof check but it will always return “NaN - NaN” when called as a user-defined function.  It runs fine when called like so:
function test() {
 var temps = [37,36,37.1,37.1,36.2,37.3,36.8,37,36.3,36.9,36.7,36.8];
 Browser.msgBox(calculateCI(temps));
}
  • It took me some time to figure out why it didn’t work as expected when called as a user-defined function though I had already identified the potential snag earlier in a posting about arrays.  This point also emphasizes the fact that JavaScript’s flexibility and dynamic typing come at a price.  Checking and testing is paramount!
  • The method used for generating the random indexes used to sample the array of values was taken from this source.  The linked article highlights the importance of choosing the correct Math method to avoid bias.  We all cobble together code from Google searches but there is an inherent danger in doing so if we do not fully understand what the code does.  I was unaware of the bias in using round() method until I read that blog.  Once again, testing output against expectations is crucial and all code, including code on this blog, should be treated as experimental until it has been rigorously tested for its intended purpose.  I cannot emphasize this point strongly enough: blithely copying and pasting code, for any programming language, into your project with proper scrutiny and testing, is a recipe for disaster!
  • After the 1000 iterations the re-sampled means are stored in an array that needs to be sorted numerically.  This brings up another issue because the JavaScript default sort is textual but a function can be passed to the sort() method to make the sort numeric ascending, the code used was taken from this source where a description is given of the sort() method.
  • Once the array is sorted, the 26th and 975th elements are taken as the lower and upper confidence limits of the 95% confidence interval.

 Excel VBA Code (comments in green, code in blue)
Option Explicit
'Return an array of numerical values from an input range
‘This could be simplified as a range can return an array of variants but the
‘more verbose code was preferred to ensure all input was numeric and the array generated
‘was one-dimensional and zero-based.
Function GetArrayFromRange(rng As Range) As Double()
    Dim cell As Range
    Dim ArrayFromRange() As Double
    Dim idx As Long: idx = 0
   
    For Each cell In rng.Cells
            ReDim Preserve ArrayFromRange(idx)
           ArrayFromRange(idx) = cell.Value
           idx = idx + 1
    Next cell
   
    GetArrayFromRange = ArrayFromRange
   
End Function
'Return the mean of an input array of numbers using the built-in spreadsheet function "AVERAGE"
Function GetArrayMean(arr() As Double) As Double

    GetArrayMean = Application.WorksheetFunction.Average(arr)

End Function
'Taking an input array of numbers, return a new array of the same size by re-sampling
'Check this link to see how the random indexes are generated:
' http://chandoo.org/wp/2008/08/13/simulate-dice-throws/
Function GetNewArrayBySampling(arr() As Double) As Double()
    Dim arrSampled() As Double
    Dim i As Long
    Dim randIndex As Long
    Dim lastElementIndex As Long: lastElementIndex = UBound(arr)
   
    For i = 0 To lastElementIndex
           ReDim Preserve arrSampled(i)
           randIndex = Int(Rnd() * lastElementIndex)
           arrSampled(i) = arr(randIndex)
    Next i
   
    GetNewArrayBySampling = arrSampled
   
End Function


'The subroutine that calculates the confidence interval
' It prompts for the range
' Extracts the numerical values from that range into an array
' Adds a new worksheet to be used to hold and sort the 1000 re-sampled means
' Generates the 1000 re-sampled means and writes them to the new worksheet.
' Perform a numerical ascending sort of the 1000 re-sampled means
' Displays a MsgBox with with the 26th and 975th cells in the sorted range
'Cleans up by deleting the new worksheet
Sub Main()
    Dim tmpSpreadsheetName As String: tmpSpreadsheetName = "tmpResampled"
    Dim resampleTimes As Long: resampleTimes = 1000
    Dim inputDataRng As Range
    Dim inputDataArray() As Double
    Dim newWorksheet As Worksheet
    Dim i As Long
    Dim mean As Double

    Set inputDataRng = Application.InputBox("Select data for Confidence Interval", "Confidence Interval Calculation", , , , , , 8)
    inputDataArray = GetArrayFromRange(inputDataRng)
    Set newWorksheet = ActiveWorkbook.Worksheets.Add
    newWorksheet.Name = tmpSpreadsheetName
   
    For i = 1 To resampleTimes
           mean = GetArrayMean(GetNewArrayBySampling(inputDataArray))
           newWorksheet.Range("A1").Offset(i - 1, 0).Value = mean
    Next i
   
    newWorksheet.Range("A1:A1000").Sort key1:=Cells(1, 1), order1:=xlAscending
   
    MsgBox "CI (95%): " & newWorksheet.Range("A26").Value & "-" & newWorksheet.Range("A975").Value
    Debug.Print "CI (95%): " & newWorksheet.Range("A26").Value & "-" & newWorksheet.Range("A975").Value
    newWorksheet.Delete
   
End Sub

Notes ON The VBA Code
  • The VBA code is considerably more verbose than the JavaScript version but it could be shortened.  
  • VBA is strongly typed (at least when we avoid the variant type) and this can help to pick up erroneous input.
  • There is no need to write a “mean()” function because in Excel VBA it is possible to call a worksheet function just as if it were a VBA function.  This is a nice feature that does not appear to be possible in Google spreadsheets using JavaScript. Please let me know if there is a way to do this aside from the hack of writing the formula to a cell and then retrieving the result!
  • There is no built-in VBA array sort function or method, various array sorting code is available but writing the re-sampled means to a temporary sheet and doing the sorting there is an acceptable workaround.

I'd like to re-emphasize that serious statistical analysis merits consideration of serious statistical packages.  Spreadsheets are fine for basic exploratory-type work but concerns have been raised repeatedly about Excel in this regard.  Many of the problems may have been fixed but it is worth reading this piece before doing any heavy analysis in Excel.  As for Google spreadsheets, I do not know how it fares in this regard but it is probably best to treat it with caution too.  R is an excellent free and open source statistical package worth exploring.  However, I found the learning curve quite steep though the R programming language does, especially in its functions and functional approach, bear some resemblance to JavaScript.  Finally, JStat which runs in the browser came to my attention recently.  This looks very promising although the documentation is sparse at the moment but it's certainly one to watch! 

2 comments:

  1. Although both Excel and Google spreadsheets have built-in functions to do this, feng shui

    ReplyDelete
  2. Yes, they do, but they do it in a different way and they make assumptions about the data that might not be true. The method shown here does not make assumptions about the data distribution.

    ReplyDelete