
This document describes several of the major components of the 0.2 release of ScalaNLP-Data and ScalaNLP-Learn. It is intended as an overview rather than a reference; for a reference, see the scaladocs.

Getting started

We use SBT 0.11.2 to build ScalaNLP. Just run ./sbt publish-local to install.

To get a REPL, run ./sbt console, which will let you play around with ScalaNLP.

Important Packages

ScalaNLP-Data

Configuration

scalanlp.config.Configuration is a general mechanism for handling
program configuration. It relies on reflection and bytecode munging
to work. In particular, classes defined in the REPL won’t work.

Configurations can be made in three ways:

  1. from a Map[String,String]:
    Configuration.fromMap(m)
  2. from Java-style properties files:
    Configuration.fromPropertiesFiles(Seq(f1,f2,f3))
  3. from command line arguments:
    CommandLineParser.parseArguments(args)

For basic usage, you can read in a property with

config.readIn[T]("property.name", [optional default])

If a property named "property.name" cannot be found, Configuration will look for just "name". If that also fails, it will use the default, or throw an exception if no default is provided.
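
Here is a minimal sketch of putting these pieces together with Configuration.fromMap; the property names and values are purely illustrative:

import scalanlp.config.Configuration

val config = Configuration.fromMap(Map(
  "trainer.iterations" -> "100",  // hypothetical property names
  "useL1"              -> "true"))

val iters = config.readIn[Int]("trainer.iterations")         // 100
val useL1 = config.readIn[Boolean]("trainer.useL1")          // falls back to "useL1": true
val tol   = config.readIn[Double]("trainer.tolerance", 1E-4) // missing, so the default is used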

Configuration also supports reflectively processing case classes. For instance, the Trainer object in scalanlp.classify has the following parameters:


case class TrainerParams(
  @Help(text="The kind of classifier to train. {Logistic,SVM,Pegasos}") `type`: String = "Logistic",
  @Help(text="Input file in svm light format.") input: File = new java.io.File("train"),
  @Help(text="Output file for the serialized classifier.") output: File = new File("classifier.ser"),
  @Help(text="Prints this") help: Boolean = false)

We can read in a TrainerParams by saying config.readIn[TrainerParams]("prefix"). Nested parameters have ".${name}" appended to this prefix, so the input file above is read from the property "prefix.input".

The listing also illustrates a few other features. The Help annotation is used by the GenerateHelp object to display usage information. Configuration also supports Files natively. Finally, though it is not used here, recursive case classes are supported.
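
Putting it together, a main method for this trainer might look something like the following sketch. The object name MyTrainer and the "trainer" prefix are illustrative, properties files are passed as command-line arguments, and error handling is omitted:

import java.io.File
import scalanlp.config.Configuration

// TrainerParams is the case class from the listing above
// (assumed to be importable from scalanlp.classify).
object MyTrainer {
  def main(args: Array[String]) {
    // Build a Configuration from properties files named on the command line.
    val config = Configuration.fromPropertiesFiles(args.map(new File(_)))
    // "trainer.type", "trainer.input", etc. are read reflectively;
    // missing properties fall back to the case-class defaults.
    val params = config.readIn[TrainerParams]("trainer")
    if (params.help) {
      // GenerateHelp could print the @Help text here.
    }
    // ... train the requested classifier with params ...
  }
}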

ScalaNLP-Learn

Distributions

ScalaNLP also provides a fairly large number of built-in probability distributions. These come with a probability mass function (for discrete distributions) or a probability density function (for continuous distributions). Many distributions also have methods for computing the mean and the variance.


  scala> val poi = new Poisson(3.0);
  poi: scalanlp.stats.distributions.Poisson = <function1>

  scala> val samples = poi.sample(10);
  samples: List[Int] = List(3, 5, 5, 2, 2, 1, 1, 2, 4, 1)

  scala> samples map { poi.probabilityOf(_) }
  res23: List[Double] = List(0.6721254229661636, 0.504094067224622, 0.504094067224622, 0.44808361531077556, 0.44808361531077556, 0.1493612051035918, 0.1493612051035918, 0.44808361531077556, 0.6721254229661628, 0.1493612051035918)
  
  scala> val doublePoi = for(x <- poi) yield x.toDouble; // meanAndVariance requires doubles, but Poisson samples over Ints
  doublePoi: java.lang.Object with scalanlp.stats.distributions.Rand[Double] = scalanlp.stats.distributions.Rand$$anon$2@2c98070c

  scala> scalanlp.stats.DescriptiveStats.meanAndVariance(doublePoi.samples.take(1000));
  res29: (Double, Double) = (3.018,2.9666426426426447)

  scala> (poi.mean,poi.variance)
  res30: (Double, Double) = (3.0,3.0)

TODO: exponential families

Optimization

ScalaNLP’s optimization package includes several convex optimization routines and a simple linear program solver. Convex optimization routines typically take a DiffFunction[T], which is a Function1 extended with a gradientAt method that returns the gradient at a particular point. Most routines require a Scalala-enabled type, such as a Vector or a Counter.

Here’s a simple DiffFunction: a parabola along each vector’s coordinate.



scala> import scalanlp.optimize._
import scalanlp.optimize._

scala> import scalala.tensor.dense._
import scalala.tensor.dense._

scala> import scalala.library.Library.norm
import scalala.library.Library.norm

scala> val f = new DiffFunction[DenseVector[Double]] {
         def calculate(x: DenseVector[Double]) = {
           (norm((x - 3) :^ 2, 1), (x * 2) - 6)
         }
       }
f: java.lang.Object with scalanlp.optimize.DiffFunction[scalala.tensor.dense.DenseVector[Double]] = $anon$1@7593da36


Note that calculate returns both the value and the gradient at a point, and that this function takes its minimum when all components are 3. (It’s just a parabola along each coordinate.)




scala> f.valueAt(DenseVector(0,0,0))
res0: Double = 27.0

scala> f.valueAt(DenseVector(3,3,3))
res1: Double = 0.0

scala> f.gradientAt(DenseVector(3,0,1))
res2: scalala.tensor.dense.DenseVector[Double] = 
 0.00000
-6.00000
-4.00000

scala> f.calculate(DenseVector(0,0))
res3: (Double, scalala.tensor.dense.DenseVector[Double]) = 
(18.0,-6.00000
-6.00000)


If your function is cheap enough to evaluate, you can also use approximate (finite-difference) gradients:


scala> def g(x: DenseVector[Double]) = (x - 3.0):^2 sum
g: (x: scalala.tensor.dense.DenseVector[Double])Double

scala> g(DenseVector(0.,0.,0.))
res5: Double = 27.0

scala> val diffg = new ApproximateGradientFunction(g)
diffg: scalanlp.optimize.ApproximateGradientFunction[Int,scalala.tensor.dense.DenseVector[Double]] = <function1>

scala> diffg.gradientAt(DenseVector(3,0,1))
res6: scalala.tensor.dense.DenseVector[Double] = 
 1.00000e-05
-5.99999
-3.99999

Ok, now let’s optimize f. The easiest routine to use is just LBFGS, which is a quasi-Newton method that works well for most problems.



scala> val lbfgs = new LBFGS[DenseVector[Double]](maxIter=100, m=3) // m is the memory. anywhere between 3 and 7 is fine. The larger m, the more memory is needed.
lbfgs: scalanlp.optimize.LBFGS[scalala.tensor.dense.DenseVector[Double]] = scalanlp.optimize.LBFGS@c7d97d5

scala> val minimum = lbfgs.minimize(f,DenseVector(0,0,0))
minimum: scalala.tensor.dense.DenseVector[Double] = 
 3.00000
 3.00000
 3.00000


scala> f(minimum)
res8: Double = 0.0
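
The same routine can be pointed at the finite-difference wrapper from earlier. A quick sketch, assuming (as its use above suggests) that ApproximateGradientFunction can be passed wherever a DiffFunction is expected:

// minimize g via its approximate gradient; the result should again be
// close to (3, 3, 3), up to finite-difference error
val approxMinimum = lbfgs.minimize(diffg, DenseVector(0.0, 0.0, 0.0))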


You can also use a configurable optimizer, using FirstOrderMinimizer.OptParams. It takes several parameters:


case class OptParams(batchSize: Int = 512,
                     regularization: Double = 1.0,
                     alpha: Double = 0.5,
                     maxIterations: Int = -1,
                     useL1: Boolean = false,
                     tolerance: Double = 1E-4,
                     useStochastic: Boolean = false) {
  // ...
}

batchSize applies to BatchDiffFunctions, which support using small minibatches of a dataset. regularization integrates L2 or L1 regularization (depending on useL1) with constant lambda. alpha controls the initial step size for algorithms that need it. maxIterations is the maximum number of gradient steps to be taken, or -1 to run until convergence. tolerance controls the sensitivity of the convergence check. Finally, useStochastic determines whether batch functions are optimized with a stochastic gradient algorithm (using small batches) or with LBFGS (using the entire dataset).

OptParams can be controlled using scalanlp.config.Configuration, which we described earlier.
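
As a rough sketch of that wiring (the "opt" prefix, the properties file name, and the assumption that OptParams is importable as scalanlp.optimize.FirstOrderMinimizer.OptParams are all illustrative):

import java.io.File
import scalanlp.config.Configuration
import scalanlp.optimize.FirstOrderMinimizer.OptParams

// optimizer.properties might contain lines such as:
//   opt.useStochastic = true
//   opt.batchSize = 128
//   opt.maxIterations = 200
val config = Configuration.fromPropertiesFiles(Seq(new File("optimizer.properties")))
val optParams = config.readIn[OptParams]("opt")  // unspecified fields keep their defaults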
