Tour
This document describes several of the major components of the 0.2 release of ScalaNLP-Data and ScalaNLP-Learn. It’s intended as something of an overview rather than a reference; for that, see the scaladocs.
We use SBT 0.11.2 to build ScalaNLP. Just run ./sbt publish-local to install.
To get a REPL, run ./sbt console, which will let you play around with ScalaNLP.
scalanlp.config.Configuration is a general mechanism for handling
program configuration. It relies on reflection and bytecode munging
to work. In particular, classes defined in the REPL won’t work.
Configurations can be made in three ways:
- from a Map[String,String]:
Configuration.fromMap(m)
- from Java-style properties files:
Configuration.fromPropertiesFiles(Seq(f1,f2,f3))
- from command line arguments:
CommandLineParser.parseArguments(args)
For basic usage, you can read in a property with
config.readIn[T]("property.name", [optional default])
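For example, here is a small sketch (the property names are made up for illustration) of building a Configuration from a Map and reading values back out, falling back to the default when a key is missing:

import scalanlp.config.Configuration

val config = Configuration.fromMap(Map("trainer.iterations" -> "50"))
val iters = config.readIn[Int]("trainer.iterations", 10)  // present in the map, so 50
val alpha = config.readIn[Double]("trainer.alpha", 0.5)   // absent, so the default 0.5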
Configuration also supports reflectively processing case classes. For instance, the Trainer object in scalanlp.classify has the following parameters:
case class TrainerParams(
  @Help(text="The kind of classifier to train. {Logistic,SVM,Pegasos}") `type`: String = "Logistic",
  @Help(text="Input file in svm light format.") input: File = new java.io.File("train"),
  @Help(text="Output file for the serialized classifier.") output: File = new File("classifier.ser"),
  @Help(text="Prints this") help: Boolean = false)
We can read in a TrainerParams by saying config.readIn[TrainerParams]("prefix"). Nested parameters have ".${name}" appended to this prefix.
The listing also illustrates a few other features. We have the Help annotation for displaying usage information with the GenerateHelp object. Configuration also supports Files natively. Finally, though it’s not used here, recursive case classes are supported.
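Putting these pieces together, a main method might look like the following sketch. (The prefix "trainer", the command-line syntax in the comment, and the assumption that parseArguments returns a Configuration are illustrations based on the list above; it also assumes the TrainerParams case class from the listing is in scope.)

import scalanlp.config.{CommandLineParser, Configuration}

object TrainerSketch {
  def main(args: Array[String]) {
    // e.g. args like: --trainer.type SVM --trainer.input data.svml
    val config = CommandLineParser.parseArguments(args)
    val params = config.readIn[TrainerParams]("trainer")
    if (params.help) {
      // print usage, e.g. via the GenerateHelp object mentioned above
    }
    println("training a " + params.`type` + " classifier from " + params.input)
  }
}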
ScalaNLP also provides a fairly large number of built-in probability distributions. These come with access to either a probability mass function (for discrete distributions) or a probability density function (for continuous distributions). Many distributions also have methods for computing the mean and the variance.
scala> val poi = new Poisson(3.0);
poi: scalanlp.stats.distributions.Poisson = <function1>
scala> val samples = poi.sample(10);
samples: List[Int] = List(3, 5, 5, 2, 2, 1, 1, 2, 4, 1)
scala> samples map { poi.probabilityOf(_) }
res23: List[Double] = List(0.6721254229661636, 0.504094067224622, 0.504094067224622, 0.44808361531077556, 0.44808361531077556, 0.1493612051035918, 0.1493612051035918, 0.44808361531077556, 0.6721254229661628, 0.1493612051035918)
scala> val doublePoi = for(x <- poi) yield x.toDouble; // meanAndVariance requires doubles, but Poisson samples over Ints
doublePoi: java.lang.Object with scalanlp.stats.distributions.Rand[Double] = scalanlp.stats.distributions.Rand$$anon$2@2c98070c
scala> scalanlp.stats.DescriptiveStats.meanAndVariance(doublePoi.samples.take(1000));
res29: (Double, Double) = (3.018,2.9666426426426447)
scala> (poi.mean,poi.variance)
res30: (Double, Double) = (3.0,3.0)
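Since every distribution is also a Rand, distributions compose with for-comprehensions, as the doublePoi example above already shows for map. As a further sketch (not from the original transcript, and assuming Rand also supplies flatMap), the sum of independent Poisson(3.0) and Poisson(4.0) draws should have mean and variance near 7:

val sumOfTwo = for (a <- new Poisson(3.0); b <- new Poisson(4.0)) yield (a + b).toDouble
scalanlp.stats.DescriptiveStats.meanAndVariance(sumOfTwo.samples.take(1000)) // roughly (7.0, 7.0)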
TODO: exponential families
ScalaNLP’s optimization package includes several convex optimization routines and a simple linear program solver. Convex optimization routines typically take a
DiffFunction[T], which is a Function1 extended with a gradientAt method that returns the gradient at a particular point. Most routines require
a Scalala-enabled type: something like a Vector or a Counter.
Here’s a simple DiffFunction: a parabola along each vector’s coordinate.
scala> import scalanlp.optimize._
import scalanlp.optimize._
scala> import scalala.tensor.dense._
import scalala.tensor.dense._
scala> import scalala.library.Library.norm
import scalala.library.Library.norm
scala> val f = new DiffFunction[DenseVector[Double]] {
         def calculate(x: DenseVector[Double]) = {
           (norm((x - 3) :^ 2, 1), (x * 2) - 6);
         }
       }
f: java.lang.Object with scalanlp.optimize.DiffFunction[scalala.tensor.dense.DenseVector[Double]] = $anon$1@7593da36
Note that this function takes its minimum when all values are 3. (It’s just a parabola along each coordinate.)
scala> f.valueAt(DenseVector(0,0,0))
res0: Double = 27.0
scala> f.valueAt(DenseVector(3,3,3))
res1: Double = 0.0
scala> f.gradientAt(DenseVector(3,0,1))
res2: scalala.tensor.dense.DenseVector[Double] =
0.00000
-6.00000
-4.00000
scala> f.calculate(DenseVector(0,0))
res3: (Double, scalala.tensor.dense.DenseVector[Double]) =
(18.0,-6.00000
-6.00000)
You can also use approximate derivatives, if your function is easy enough to compute:
scala> def g(x: DenseVector[Double]) = (x - 3.0):^2 sum
g: (x: scalala.tensor.dense.DenseVector[Double])Double
scala> g(DenseVector(0.,0.,0.))
res5: Double = 27.0
scala> val diffg = new ApproximateGradientFunction(g)
diffg: scalanlp.optimize.ApproximateGradientFunction[Int,scalala.tensor.dense.DenseVector[Double]] = <function1>
scala> diffg.gradientAt(DenseVector(3,0,1))
res6: scalala.tensor.dense.DenseVector[Double] =
1.00000e-05
-5.99999
-3.99999
Ok, now let’s optimize f. The easiest routine to use is just LBFGS, which is a quasi-Newton method that works well for most problems.
scala> val lbfgs = new LBFGS[DenseVector[Double]](maxIter=100, m=3) // m is the memory. anywhere between 3 and 7 is fine. The larger m, the more memory is needed.
lbfgs: scalanlp.optimize.LBFGS[scalala.tensor.dense.DenseVector[Double]] = scalanlp.optimize.LBFGS@c7d97d5
scala> val minimum = lbfgs.minimize(f,DenseVector(0,0,0))
minimum: scalala.tensor.dense.DenseVector[Double] =
3.00000
3.00000
3.00000
scala> f(minimum)
res8: Double = 0.0
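Since diffg above provides gradientAt as well, it should work with the same optimizer (a sketch, assuming ApproximateGradientFunction is itself a DiffFunction, which its use of gradientAt suggests):

val minimumG = lbfgs.minimize(diffg, DenseVector(0.0, 0.0, 0.0)) // should also land near (3, 3, 3)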
You can also use a configurable optimizer, using FirstOrderMinimizer.OptParams. It takes several parameters:
case class OptParams(batchSize: Int = 512,
                     regularization: Double = 1.0,
                     alpha: Double = 0.5,
                     maxIterations: Int = -1,
                     useL1: Boolean = false,
                     tolerance: Double = 1E-4,
                     useStochastic: Boolean = false) {
  // ...
}
- batchSize applies to BatchDiffFunctions, which support using small minibatches of a dataset.
- regularization integrates L2 or L1 regularization (depending on useL1) with constant lambda.
- alpha controls the initial step size for algorithms that need it.
- maxIterations is the maximum number of gradient steps to be taken, or -1 to run until convergence.
- tolerance controls the sensitivity of the convergence check.
- useStochastic determines whether batch functions should be optimized using a stochastic gradient algorithm (using small batches) or using LBFGS (using the entire dataset).
OptParams can be controlled using scalanlp.config.Configuration, which we described earlier.
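As a sketch of how that might look (the properties file name and the prefix "opt" are arbitrary choices here, the OptParams values are wired into the LBFGS constructor by hand, and f and the imports are reused from the example above):

val config = Configuration.fromPropertiesFiles(Seq(new java.io.File("optimizer.properties")))
val params = config.readIn[FirstOrderMinimizer.OptParams]("opt")
val optimizer = new LBFGS[DenseVector[Double]](maxIter = params.maxIterations, m = 5)
val minimum = optimizer.minimize(f, DenseVector(0.0, 0.0, 0.0))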
ScalaNLP also contains an implementation of a Support Vector Machine (SVM) using the Pegasos optimizer.
Here is an example of how to use it:
import bundles.MutableInnerProductSpace
import scalala.library.Library._
import scalanlp.util._
import scalanlp.util.logging._
import scalanlp.data.Example
import scalala.operators._
import scalala.tensor._
import scalanlp.stats.distributions.Rand
import scalala.generic.math.CanNorm
import scalala.generic.collection.CanCreateZerosLike
import scalanlp.data._
import scalala.tensor.dense._
import scalanlp.classify._
val data = DataMatrix.fromURL(new java.net.URL("http://www-stat.stanford.edu/~tibs/ElemStatLearn/datasets/spam.data"),-1,dropRow = true);
var vectors = data.rows.map(e => e map ((a:Seq[Double]) => DenseVector(a:_*)) relabel (_.toInt));
vectors = vectors.map { _.map { v2 => v2 / norm(v2, 2) } };
vectors = Rand.permutation(vectors.length).draw.map(vectors) take 10;
println(vectors.length);
val trainer = new SVM.Pegasos[Int,DenseVector[Double]](100,batchSize=1000) with ConsoleLogging;
val classifier = trainer.train(vectors);
for (ex <- vectors.take(30)) {
  val guessed = classifier.classify(ex.features);
  println(guessed, ex.label);
}
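To turn the printed pairs into a quick accuracy number over this (tiny) training set, one can count how often the classifier agrees with the gold label; this is a small addition, not part of the original listing:

val correct = vectors.count(ex => classifier.classify(ex.features) == ex.label);
println("accuracy: " + correct.toDouble / vectors.length);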