######
Spark
######

Overview
========

* Configured much like Hadoop 1
* Execution engine works in memory (RAM), which makes it very fast
* Can integrate with YARN
* Can use HDFS, S3, HBase, Cassandra, local files...
* Easier programming interface built on the resilient distributed dataset
  (RDD), an immutable collection distributed over the cluster

Installation
============

* Unzip the latest Spark build prebuilt for the latest Hadoop (it can also be
  used standalone)

.. code:: bash

    # Register a worker host, then start the master and all listed workers
    echo "root@some-slave-host" >> /opt/spark/conf/slaves
    /opt/spark/sbin/start-all.sh

Status Overview
===============

* Point your web browser to the master web UI at ``http://:8080``
* Run the example

.. code:: bash

    bin/spark-submit \
        --class org.apache.spark.examples.SparkPi \
        --master spark:// \
        --executor-memory 1G \
        --total-executor-cores 1 \
        lib/spark-examples-1.1.1-hadoop2.4.0.jar \
        1000

Get a Python shell on the cluster
=================================

.. code:: bash

    /opt/spark/bin/pyspark

Sample code
===========

* Grep for failures in kern.log

.. code:: python

    # sc is the SparkContext provided by the pyspark shell
    log = sc.textFile("/var/log/kern.log")
    # Test each keyword separately; '"error" or "failure" in s' is always true
    rdd = log.filter(lambda x: "error" in x.lower() or "failure" in x.lower())
    for x in rdd.collect():
        print(x)

* Sum up the bytes of an Apache access log

.. code:: python

    log = sc.textFile("/var/log/httpd/access.log")
    # Field 9 is the response size; "-" means no body was sent
    (log.map(lambda x: x.split(" ")[9])
        .filter(lambda x: "-" not in x)
        .map(int)
        .sum())

Add slave nodes to a running cluster
====================================

* On the slave node execute

.. code:: bash

    /opt/spark/sbin/start-slave.sh some-worker-id spark://master-node:7077

Troubleshooting
===============

* When submitting jobs remotely, be sure to use the master's DNS name and not
  its IP address
* ``TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory``

Monitoring
==========

* http://www.hammerlab.org/2015/02/27/monitoring-spark-with-graphite-and-grafana/
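
Checking the sample logic without a cluster
===========================================

The two samples in the "Sample code" section rely on small per-line
predicates, and those can be tried in plain Python before running them on
Spark. The following is a minimal sketch, not a definitive implementation: the
helper names ``is_failure_line`` and ``bytes_field`` and the sample log lines
are made up for illustration, and it assumes an access log format in which the
response size is the space-separated field at index 9, as in the sample.

.. code:: python

    def is_failure_line(line):
        """Return True if the line mentions 'error' or 'failure' (any case)."""
        lowered = line.lower()
        # Each keyword needs its own 'in' test; a bare non-empty string
        # on the left of 'or' would make the whole expression always true.
        return "error" in lowered or "failure" in lowered


    def bytes_field(line, index=9):
        """Return the byte count of one access-log line, or None if absent."""
        fields = line.split(" ")
        if len(fields) <= index or not fields[index].isdigit():
            return None
        return int(fields[index])


    # Hypothetical kernel log lines
    sample_kern = [
        "kernel: usb 1-1: new high-speed USB device",
        "kernel: EXT4-fs error (device sda1): bad block",
        "kernel: Failure to read sector 42",
    ]
    matches = [line for line in sample_kern if is_failure_line(line)]

    # Hypothetical access log lines; "-" marks a response without a body
    sample_access = [
        '127.0.0.1 - - [27/Feb/2015:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
        '127.0.0.1 - - [27/Feb/2015:10:00:01 +0000] "GET /x HTTP/1.1" 304 -',
        '127.0.0.1 - - [27/Feb/2015:10:00:02 +0000] "GET /y HTTP/1.1" 200 1024',
    ]
    sizes = [bytes_field(line) for line in sample_access]
    total_bytes = sum(n for n in sizes if n is not None)

    print(len(matches), total_bytes)

In the pyspark shell the same helpers could be passed straight to the RDD, for
example ``log.filter(is_failure_line)`` or
``log.map(bytes_field).filter(lambda n: n is not None).sum()``.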