Stuart ML - Basic Statistics

Summary statistics

We provide column summary statistics for RDDs through the function colStats() available in the stuart-ml.stat.statistics module.

colStats() returns an instance of MultivariateStatisticalSummary, which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count.

local statistics = require 'stuart-ml.stat.statistics'
local Vectors = require 'stuart-ml.linalg.Vectors'
local sc = require 'stuart'.NewContext()

-- each Vector is one observation (row); statistics are computed per column
local observations = sc:parallelize({
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)
})
local summary = statistics.colStats(observations)

print(summary:mean()) -- a dense vector containing the mean value for each column
{2,20,200}

print(summary:variance()) -- column-wise variance
... TODO ...

print(summary:numNonzeros()) -- number of nonzeros in each column
{3,3,3}
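
To make the meaning of "column-wise" concrete, here is a minimal plain-Lua sketch that computes the same kinds of statistics over the three observations above, without using Stuart. The helper `columnStats` is hypothetical and is not part of Stuart ML. The variance uses the unbiased sample formula (dividing by n - 1), which is an assumption about the convention; this document does not state which one `summary:variance()` follows, so the result should not be read as that call's confirmed output.

```lua
-- Illustration only: plain Lua tables stand in for the RDD of dense vectors.
local rows = {
  {1.0, 10.0, 100.0},
  {2.0, 20.0, 200.0},
  {3.0, 30.0, 300.0},
}

-- Hypothetical helper (not a Stuart ML function).
-- Returns per-column mean, sample variance, and count of nonzero entries.
-- Assumes at least two rows, all of the same width.
local function columnStats(data)
  local n, width = #data, #data[1]
  local mean, variance, nonzeros = {}, {}, {}
  for j = 1, width do
    local sum, nnz = 0, 0
    for i = 1, n do
      sum = sum + data[i][j]
      if data[i][j] ~= 0 then nnz = nnz + 1 end
    end
    mean[j], nonzeros[j] = sum / n, nnz

    local squares = 0
    for i = 1, n do
      local d = data[i][j] - mean[j]
      squares = squares + d * d
    end
    variance[j] = squares / (n - 1) -- assumption: unbiased sample variance
  end
  return mean, variance, nonzeros
end

local mean, variance, nonzeros = columnStats(rows)
print(string.format('%g %g %g', mean[1], mean[2], mean[3]))
print(string.format('%g %g %g', variance[1], variance[2], variance[3]))
print(string.format('%d %d %d', nonzeros[1], nonzeros[2], nonzeros[3]))
```

The means and nonzero counts agree with the `{2,20,200}` and `{3,3,3}` shown above. Each statistic is taken down a column, across observations, never across the values of a single vector.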

Correlations

Calculating the correlation between two series of data is a common operation in statistics. Stuart ML can also calculate pairwise correlations among many series at once. Pearson’s correlation is currently the only supported method.

The stuart-ml.stat.statistics module provides methods to calculate correlations between series. Depending on the type of input, two RDD[Number]s or an RDD[Vector], the output will be a Number or the correlation Matrix respectively.

local statistics = require 'stuart-ml.stat.statistics'
local stuart = require 'stuart'
local Vectors = require 'stuart-ml.linalg.Vectors'

local sc = stuart.NewContext()
local seriesX = sc:parallelize({1, 2, 3, 3, 5})  
local seriesY = sc:parallelize({11, 22, 33, 33, 555})

-- compute the correlation using Pearson's method
local correlation = statistics.corr(seriesX, seriesY, 'pearson')
print('Correlation is', correlation)

local data = sc:parallelize({
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(5.0, 33.0, 366.0)
})  -- note that each Vector is a row and not a column

-- calculate the correlation matrix using Pearson's method
local correlMatrix = statistics.corr(data, 'pearson')
print(correlMatrix)
Correlation is	0.8500286768773
1                 0.97888346588947  0.99038956952757  
0.97888346588947  1                 0.99774832339861  
0.99038956952757  0.99774832339861  1
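
As a cross-check on the numbers above, the following is a minimal plain-Lua sketch of Pearson's correlation coefficient, r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² · Σ(y - ȳ)²). The helpers `pearson` and `column` are hypothetical names introduced for illustration. This is the textbook formula and is not claimed to be how `statistics.corr` is implemented internally.

```lua
-- Hypothetical helper (not a Stuart ML function): Pearson's r for two
-- equal-length Lua arrays of numbers.
local function pearson(xs, ys)
  assert(#xs == #ys and #xs > 1, 'series must have equal length of at least 2')
  local n = #xs
  local sumX, sumY = 0, 0
  for i = 1, n do
    sumX = sumX + xs[i]
    sumY = sumY + ys[i]
  end
  local meanX, meanY = sumX / n, sumY / n

  -- accumulate the co-deviation and the squared deviations
  local sxy, sxx, syy = 0, 0, 0
  for i = 1, n do
    local dx, dy = xs[i] - meanX, ys[i] - meanY
    sxy = sxy + dx * dy
    sxx = sxx + dx * dx
    syy = syy + dy * dy
  end
  return sxy / math.sqrt(sxx * syy)
end

-- Hypothetical helper: pull column j out of a list of rows.
local function column(rows, j)
  local out = {}
  for i = 1, #rows do out[i] = rows[i][j] end
  return out
end

-- Same two series as the example above.
local r = pearson({1, 2, 3, 3, 5}, {11, 22, 33, 33, 555})
print(string.format('%.4f', r)) --> 0.8500

-- Same rows as the matrix example; entry (1,2) correlates columns 1 and 2.
local rows = {
  {1.0, 10.0, 100.0},
  {2.0, 20.0, 200.0},
  {5.0, 33.0, 366.0},
}
local r12 = pearson(column(rows, 1), column(rows, 2))
print(string.format('%.4f', r12)) --> 0.9789
```

The second call shows why it matters that each Vector is a row. The matrix entry at (i, j) is the correlation between column i and column j, so the series being correlated run down the observations.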