Trying the Spark 1.5.0 multilayer perceptron sample code (Ubuntu Linux)

I jotted this article down around September 2015 and then left it sitting as a draft.
When I came back to it in March 2016, the Spark 1.5 tutorial had disappeared from the web, which made re-testing a pain.

Install Spark 1.5.0 on Ubuntu Linux and try the multilayer perceptron sample code in the Scala interactive shell. Reference:
Multilayer perceptron classifier - spark.ml - Spark 1.6.1 Documentation

Download spark-1.5.0-bin-cdh4.tgz, extract it, and place it under ~/spark/.
Then launch the Scala interactive shell:

cd spark/
bin/spark-shell

Where the sample code spanned multiple lines, I joined it into a single line before pasting:

val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
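
Alternatively, the Scala REPL's :paste mode (a standard REPL command) accepts multi-line snippets as-is, so the one-lining isn't strictly necessary:

scala> :paste
// Entering paste mode (ctrl-D to finish)

val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

// Exiting paste mode, now interpreting.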

Some error-looking messages appeared while fitting the model. The two BLAS warnings just mean that no native BLAS library was found, so netlib-java falls back to its pure-JVM implementation; that costs speed, not correctness.

scala> val model = trainer.fit(train)
15/09/13 00:30:32 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
15/09/13 00:30:32 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
15/09/13 00:30:32 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search failed
model: org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel = mlpc_1e50280e1dfd

println("Precision:" + evaluator.evaluate(predictionAndLabels))
の結果が、
Precision:0.9636363636363636
と出たので、とりあえず最後まで処理された?

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Load the data stored in LIBSVM format as a DataFrame.
val data = sqlContext.read.format("libsvm").load("data/mllib/sample_multiclass_classification_data.txt")
// Split the data into train and test
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network:
// input layer of size 4 (features), two intermediate of size 5 and 4
// and output of size 3 (classes)
val layers = Array[Int](4, 5, 4, 3)
// create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)
// train the model
val model = trainer.fit(train)
// compute precision on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Precision:" + evaluator.evaluate(predictionAndLabels))

March 2016

Before moving this from draft to published, I ran everything once more just to be safe.

The code I ran from PyCharm on macOS:

package main.scala

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row
import org.apache.spark.sql.SQLContext


object MultiLayerPerceptron {
  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("MultiLayerPerceptron").setMaster(args(0))
    val sc = new SparkContext(conf)

    // import sqlContext.implicits._ is needed to use .toDF()
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Load training data
    val data_path = "/Users/kubottt/Documents/spark/spark-1.5.0/data/mllib/sample_multiclass_classification_data.txt"
    val data = MLUtils.loadLibSVMFile(sc, data_path).toDF()
    // Split the data into train and test
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
    val train = splits(0)
    val test = splits(1)
    // specify layers for the neural network:
    // input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
    val layers = Array[Int](4, 5, 4, 3)
    // create the trainer and set its parameters
    val trainer = new MultilayerPerceptronClassifier()
      .setLayers(layers)
      .setBlockSize(128)
      .setSeed(1234L)
      .setMaxIter(100)
    // train the model
    val model = trainer.fit(train)
    // compute precision on the test set
    val result = model.transform(test)
    val predictionAndLabels = result.select("prediction", "label")
    val evaluator = new MulticlassClassificationEvaluator()
      .setMetricName("precision")
    println("Precision:" + evaluator.evaluate(predictionAndLabels))
    System.exit(0)
  }
}
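
The post doesn't show the build configuration, so here is a minimal build.sbt sketch for this app, assuming an sbt build and deployment via spark-submit (Spark 1.5.0 was built against Scala 2.10; the coordinates below are the stock Spark artifacts):

name := "MultiLayerPerceptron"

scalaVersion := "2.10.4"

// "provided" because spark-submit supplies these jars at runtime
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "1.5.0" % "provided",
  "org.apache.spark" %% "spark-sql"   % "1.5.0" % "provided"
)

Since the app reads the master from args(0), the first program argument would be something like local[*] when running it locally.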

The code I ran in spark-shell on Ubuntu:

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row

// Load training data
val data_path = "/home/kubottt/spark/data/mllib/sample_multiclass_classification_data.txt"
val data = MLUtils.loadLibSVMFile(sc, data_path).toDF()
// Split the data into train and test
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network:
// input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
val layers = Array[Int](4, 5, 4, 3)
// create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)
// train the model
val model = trainer.fit(train)
// compute precision on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Precision:" + evaluator.evaluate(predictionAndLabels))

Error #1

16/03/14 02:09:11 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
16/03/14 02:09:11 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:81)
org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
$iwC$$iwC.<init>(<console>:9)

Writing something like

val conf = new SparkConf().setAppName("MultiLayerPerceptron").setMaster(args(0))
val sc = new SparkContext(conf)

in spark-shell seems to be what triggers this error: the shell has already created a SparkContext (available as sc), so constructing a second one in the same JVM fails.

I tried sudo kill-ing the old process, but that wasn't it.
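
The actual fix, reflected in the session below, is simply to drop those two lines and reuse what spark-shell already created (the error message mentions spark.driver.allowMultipleContexts = true as an escape hatch, but reusing the existing context is the sane option):

// spark-shell has already created these for you:
//   sc: SparkContext, sqlContext: SQLContext
// so skip new SparkConf() / new SparkContext(conf) and start from:
val data = MLUtils.loadLibSVMFile(sc, data_path).toDF()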

Result

kubottt@kubottt-Diginnos:~/spark$ bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_76)
Type in expressions to have them evaluated.
Type :help for more information.
16/03/14 02:25:43 WARN Utils: Your hostname, kubottt-Diginnos resolves to a loopback address: 127.0.1.1; using 192.168.5.7 instead (on interface wlan0)
16/03/14 02:25:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/03/14 02:25:44 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
16/03/14 02:25:46 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/03/14 02:25:46 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/03/14 02:26:06 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/03/14 02:26:06 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/03/14 02:26:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/14 02:26:10 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/03/14 02:26:10 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.

scala> import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.{SparkContext, SparkConf}

scala> import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

scala> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

scala> import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.util.MLUtils

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val data_path = "/home/kubottt/spark/data/mllib/sample_multiclass_classification_data.txt"
data_path: String = /home/kubottt/spark/data/mllib/sample_multiclass_classification_data.txt

scala> val data = MLUtils.loadLibSVMFile(sc, data_path).toDF()
data: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
splits: Array[org.apache.spark.sql.DataFrame] = Array([label: double, features: vector], [label: double, features: vector])

scala> val train = splits(0)
train: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val test = splits(1)
test: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val layers = Array[Int](4, 5, 4, 3)
layers: Array[Int] = Array(4, 5, 4, 3)

scala> val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)
trainer: org.apache.spark.ml.classification.MultilayerPerceptronClassifier = mlpc_d9cc5263fe2a

scala> val model = trainer.fit(train)
16/03/14 02:28:01 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
16/03/14 02:28:01 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
16/03/14 02:28:01 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search failed
model: org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel = mlpc_d9cc5263fe2a

scala> val result = model.transform(test)
result: org.apache.spark.sql.DataFrame = [label: double, features: vector, prediction: double]

scala> val predictionAndLabels = result.select("prediction", "label")
predictionAndLabels: org.apache.spark.sql.DataFrame = [prediction: double, label: double]

scala> val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
evaluator: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_172c6eb8dc15

scala> println("Precision:" + evaluator.evaluate(predictionAndLabels))
Precision:0.9636363636363636

scala>