Spark¶
Overview¶
Configuration is similar to Hadoop 1
Execution engine keeps data in RAM, therefore lightning fast
Can integrate with YARN
Can use HDFS, S3, HBase, Cassandra, local files…
Easier programming interface using resilient distributed datasets (RDDs) - immutable lists distributed over the cluster
Installation¶
Unzip the latest Spark prebuilt for the latest Hadoop (can also be used standalone)
echo "root@some-slave-host" >> /opt/spark/conf/slaves
/opt/spark/sbin/start-all.sh
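Spelled out end to end for Spark 1.1.1 prebuilt for Hadoop 2.4 (the version the example job below uses); the download URL and install path are assumptions, adjust to your release:

```shell
# Download and unpack a prebuilt Spark (version and URL assumed)
wget https://archive.apache.org/dist/spark/spark-1.1.1/spark-1.1.1-bin-hadoop2.4.tgz
tar -xzf spark-1.1.1-bin-hadoop2.4.tgz
mv spark-1.1.1-bin-hadoop2.4 /opt/spark

# Register worker hosts, then start master and workers via passwordless SSH
echo "root@some-slave-host" >> /opt/spark/conf/slaves
/opt/spark/sbin/start-all.sh
```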
Status Overview¶
Point your web browser to http://<yourmasternode>:8080
Run an example
bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://127.0.0.1:7077 \
--executor-memory 1G \
--total-executor-cores 1 \
lib/spark-examples-1.1.1-hadoop2.4.0.jar \
1000
Get a Python shell on the cluster¶
/opt/spark/bin/pyspark
Sample code¶
Grep for failures in kern.log
log = sc.textFile("/var/log/kern.log")
rdd = log.filter(lambda x: "error" in x.lower() or "failure" in x.lower())
for x in rdd.collect(): print x
Sum up the bytes of an Apache access log
log = sc.textFile("/var/log/httpd/access.log")
log.map(lambda x: x.split(" ")[9]).filter(lambda x: "-" not in x).map(lambda x: int(x)).sum()
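Field index 9 is the response-size column of the common Apache log format; a "-" there means no body was sent, so those entries are dropped before summing. The per-line logic can be checked in plain Python (the sample lines are made up):

```python
# Two made-up access-log lines in common log format; the last field is bytes sent
lines = [
    '127.0.0.1 - - [10/Oct/2014:13:55:36 -0700] "GET / HTTP/1.1" 200 2326',
    '127.0.0.1 - - [10/Oct/2014:13:55:37 -0700] "GET /x HTTP/1.1" 304 -',
]

# Same chain as the Spark job: take field 10, drop "-" entries, sum as integers
total = sum(int(f) for f in (l.split(" ")[9] for l in lines) if "-" not in f)
print(total)  # 2326
```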
Add slave nodes to a running cluster¶
On the slave node execute
/opt/spark/sbin/start-slave.sh some-worker-id spark://master-node:7077
Troubleshooting¶
When submitting jobs remotely, be sure to use the master's DNS name and not its IP address, otherwise the job may hang with:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
The same message also appears when the requested executor memory or core count exceeds what the registered workers can offer.