Trying the Spark 1.5.0 multilayer perceptron sample code (Ubuntu Linux)
I wrote these notes around September 2015 and left them sitting as a draft.
When I came back to this in March 2016, the Spark 1.5 tutorial had disappeared from the web, which made retrying it a pain.
Install Spark 1.5.0 on Ubuntu Linux and run the multilayer perceptron sample code in the Scala interactive shell.
Multilayer perceptron classifier - spark.ml - Spark 1.6.1 Documentation
Download spark-1.5.0-bin-cdh4.tgz, extract it, and place it under ~/spark/.
Start the Scala interactive shell with:
cd spark/
bin/spark-shell
Where the sample code spanned multiple lines, I joined it into a single line before pasting it into the shell.
val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
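As an aside, instead of joining the lines by hand, the Scala REPL's :paste mode (also available in spark-shell) accepts a multi-line block in one go. A minimal sketch, assuming layers is already defined as in the sample:

scala> :paste
// Entering paste mode (ctrl-D to finish)

import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

// The builder chain can now stay on separate lines.
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

// Exiting paste mode, now interpreting.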
An error-like message appeared. The BLAS warnings seem to mean only that a native BLAS library could not be loaded and Spark falls back to the pure-Java implementation, and despite the LBFGS line-search failure a model was still produced:
scala> val model = trainer.fit(train)
15/09/13 00:30:32 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
15/09/13 00:30:32 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
15/09/13 00:30:32 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search failed
model: org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel = mlpc_1e50280e1dfd
println("Precision:" + evaluator.evaluate(predictionAndLabels))
The output was
Precision:0.9636363636363636
so the whole sample appears to have run to completion despite the warnings.
For reference, the full sample code from the documentation:
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Load the data stored in LIBSVM format as a DataFrame.
val data = sqlContext.read.format("libsvm").load("data/mllib/sample_multiclass_classification_data.txt")
// Split the data into train and test
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network:
// input layer of size 4 (features), two intermediate of size 5 and 4
// and output of size 3 (classes)
val layers = Array[Int](4, 5, 4, 3)
// create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)
// train the model
val model = trainer.fit(train)
// compute precision on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Precision:" + evaluator.evaluate(predictionAndLabels))
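As far as I understand, the "precision" metric of MulticlassClassificationEvaluator is the overall fraction of correctly classified test rows, so the reported value can be cross-checked by hand from predictionAndLabels. A hypothetical check, not part of the sample:

// Count test rows where the predicted class matches the true label.
val correct = predictionAndLabels.filter("prediction = label").count()
val total = predictionAndLabels.count()
println("Fraction correct: " + correct.toDouble / total) // should roughly match the reported Precision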
March 2016
Before publishing the draft, I ran everything once more just to be sure.
Code run in PyCharm on macOS:
package main.scala

import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row
import org.apache.spark.sql.SQLContext

object MultiLayerPerceptron {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("MultiLayerPerceptron").setMaster(args(0))
    val sc = new SparkContext(conf)

    // import sqlContext.implicits._ is needed to use .toDF()
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Load training data
    val data_path = "/Users/kubottt/Documents/spark/spark-1.5.0/data/mllib/sample_multiclass_classification_data.txt"
    val data = MLUtils.loadLibSVMFile(sc, data_path).toDF()

    // Split the data into train and test
    val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
    val train = splits(0)
    val test = splits(1)

    // specify layers for the neural network:
    // input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
    val layers = Array[Int](4, 5, 4, 3)

    // create the trainer and set its parameters
    val trainer = new MultilayerPerceptronClassifier()
      .setLayers(layers)
      .setBlockSize(128)
      .setSeed(1234L)
      .setMaxIter(100)

    // train the model
    val model = trainer.fit(train)

    // compute precision on the test set
    val result = model.transform(test)
    val predictionAndLabels = result.select("prediction", "label")
    val evaluator = new MulticlassClassificationEvaluator()
      .setMetricName("precision")
    println("Precision:" + evaluator.evaluate(predictionAndLabels))

    System.exit(0)
  }
}
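Note that this version reads the Spark master URL from args(0), so the run configuration has to pass one (for example local[*]). A hypothetical tweak, not in the original code, that falls back to local mode when no argument is given (it would replace the first two lines of main):

// Use a local master if no argument was passed on the command line.
val master = if (args.nonEmpty) args(0) else "local[*]"
val conf = new SparkConf().setAppName("MultiLayerPerceptron").setMaster(master)
val sc = new SparkContext(conf)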
Code run in spark-shell on Ubuntu:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.Row

// Load training data
val data_path = "/home/kubottt/spark/data/mllib/sample_multiclass_classification_data.txt"
val data = MLUtils.loadLibSVMFile(sc, data_path).toDF()
// Split the data into train and test
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// specify layers for the neural network:
// input layer of size 4 (features), two intermediate of size 5 and 4 and output of size 3 (classes)
val layers = Array[Int](4, 5, 4, 3)
// create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)
// train the model
val model = trainer.fit(train)
// compute precision on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
println("Precision:" + evaluator.evaluate(predictionAndLabels))
Error 1
16/03/14 02:09:11 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
16/03/14 02:09:11 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:81)
org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
$iwC$$iwC.<init>(<console>:9)
Writing something like
val conf = new SparkConf().setAppName("MultiLayerPerceptron").setMaster(args(0))
val sc = new SparkContext(conf)
in spark-shell seems to cause this error.
I tried sudo kill on the process, but that was not the cause.
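The actual cause seems to be that spark-shell already creates a SparkContext (sc) and a SQLContext (sqlContext) at startup, so constructing another SparkContext in the same JVM hits SPARK-2243. The fix is simply to drop the SparkConf/SparkContext lines and reuse the shell's objects, which is what the shell version above does. A minimal sketch of the idea, assuming the shell was started from the Spark home directory:

// sc and sqlContext already exist inside spark-shell; do not create new ones.
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._ // already in scope in the shell, repeated here for clarity

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt").toDF()
data.printSchema() // label: double, features: vector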
Result
kubottt@kubottt-Diginnos:~/spark$ bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_76)
Type in expressions to have them evaluated.
Type :help for more information.
16/03/14 02:25:43 WARN Utils: Your hostname, kubottt-Diginnos resolves to a loopback address: 127.0.1.1; using 192.168.5.7 instead (on interface wlan0)
16/03/14 02:25:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
16/03/14 02:25:44 WARN MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
Spark context available as sc.
16/03/14 02:25:46 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/03/14 02:25:46 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/03/14 02:26:06 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/03/14 02:26:06 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/03/14 02:26:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/14 02:26:10 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/03/14 02:26:10 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.
scala> import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.{SparkContext, SparkConf}

scala> import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

scala> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

scala> import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.util.MLUtils

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val data_path = "/home/kubottt/spark/data/mllib/sample_multiclass_classification_data.txt"
data_path: String = /home/kubottt/spark/data/mllib/sample_multiclass_classification_data.txt

scala> val data = MLUtils.loadLibSVMFile(sc, data_path).toDF()
data: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
splits: Array[org.apache.spark.sql.DataFrame] = Array([label: double, features: vector], [label: double, features: vector])

scala> val train = splits(0)
train: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val test = splits(1)
test: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val layers = Array[Int](4, 5, 4, 3)
layers: Array[Int] = Array(4, 5, 4, 3)

scala> val trainer = new MultilayerPerceptronClassifier().setLayers(layers).setBlockSize(128).setSeed(1234L).setMaxIter(100)
trainer: org.apache.spark.ml.classification.MultilayerPerceptronClassifier = mlpc_d9cc5263fe2a

scala> val model = trainer.fit(train)
16/03/14 02:28:01 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
16/03/14 02:28:01 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
16/03/14 02:28:01 ERROR LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search failed
model: org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel = mlpc_d9cc5263fe2a

scala> val result = model.transform(test)
result: org.apache.spark.sql.DataFrame = [label: double, features: vector, prediction: double]

scala> val predictionAndLabels = result.select("prediction", "label")
predictionAndLabels: org.apache.spark.sql.DataFrame = [prediction: double, label: double]

scala> val evaluator = new MulticlassClassificationEvaluator().setMetricName("precision")
evaluator: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_172c6eb8dc15

scala> println("Precision:" + evaluator.evaluate(predictionAndLabels))
Precision:0.9636363636363636

scala>