This guide explains how to use OpenStack Sahara as configured in the MOC. Sahara provides virtualized Big Data as a Service. The guide walks through both the cluster and job operations that Sahara provides, with plenty of helpful tips (and warnings) sprinkled throughout. If you have questions or discover bugs, please open a ticket.

Accessing the Sahara UI: Sahara actions are contained within the “Data Processing” section in the “Project” section of Horizon.

Sahara End-User Guide

Cluster creation

You may also use the following wizard:

You can also watch this video:

Plugin selection

Three plugins are offered. Each is considered “simple”.

  • Vanilla: Offers Hadoop 2.7.1 plus Pig, Hive, Oozie, and YARN
  • Spark: Offers Spark 2.1.0 (without YARN)
  • Storm: Offers Storm 1.0.1

New plugin versions are offered each time MOC’s OpenStack version is upgraded. (Last upgrade: January 2018.)

Node group template design

Navigate to Data Processing → Clusters → Node Group Templates → Create Template

  • Name: Must follow rules for valid hostname
  • Flavor: Change this from the default m1.nano, which will probably cause a kernel panic and crash the Sahara engine
  • Availability Zone: Set to “nova”
  • Floating IP pool: Optional (you can assign one later through the “Instances” area); you probably only need it for the master node group
  • Storage location: choice between Ephemeral and Cinder
    • Cinder can be much bigger than ephemeral disk
    • Make sure to set the Cinder Availability Zone and Volume Type
  • Make sure to select base image
  • Node processes: typical topology…
    • Vanilla: master=[namenode, resourcemanager, historyserver, oozie], worker=[datanode, nodemanager]
    • Spark: master=[namenode, master], worker=[datanode, slave]
    • Storm: master=[nimbus], worker=[supervisor, zookeeper]
  • Parameters: Available settings depend on which processes have been selected; you may wish to set dfs.datanode.du.reserved and the Heap Size for various processes
  • Security: Auto Security Group feature should handle opening the proper ports
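
The typical topologies listed above can be written down as plain data, which makes it easy to check a planned node group against them before filling in the form. This is only a sketch: the names TYPICAL_TOPOLOGY and unexpected_processes are hypothetical and not part of Sahara, and the process names are copied from the list above.

```python
# Typical process layout per plugin, copied from the list in this guide.
# The dictionary and helper are illustrative only; Sahara does not expose them.
TYPICAL_TOPOLOGY = {
    "vanilla": {
        "master": {"namenode", "resourcemanager", "historyserver", "oozie"},
        "worker": {"datanode", "nodemanager"},
    },
    "spark": {
        "master": {"namenode", "master"},
        "worker": {"datanode", "slave"},
    },
    "storm": {
        "master": {"nimbus"},
        "worker": {"supervisor", "zookeeper"},
    },
}


def unexpected_processes(plugin, role, processes):
    """Return the processes that are not in the typical layout for this role."""
    typical = TYPICAL_TOPOLOGY[plugin][role]
    return sorted(set(processes) - typical)
```

An empty result means the planned node group matches the typical layout. A non-empty result is not necessarily an error, just a deviation worth a second look.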

Cluster template design

Navigate to Data Processing → Clusters → Cluster Templates → Create Template

  • Name: Must follow rules for valid hostname
  • Anti-Affinity (optional): Ensures that instances start on different compute hosts, which makes HDFS replication more reliable. The number of compute hosts limits how many instances this can cover
  • Node Groups:
    • Plugin must match
  • Parameters: Available settings depend on which processes have been selected; you may wish to set dfs.replication (defaults to number of data/worker nodes) and dfs.blocksize
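
The two sizing limits mentioned above can be checked with simple arithmetic before launching. The sketch below is not Sahara's own validation. It assumes anti-affinity needs one compute host per instance, as described above, and relies on the fact that HDFS places each replica of a block on a different datanode. Both function names are hypothetical.

```python
def anti_affinity_fits(instance_count, compute_host_count):
    """True if every anti-affinity instance can be placed on its own compute host."""
    if instance_count < 0 or compute_host_count < 0:
        raise ValueError("counts must be non-negative")
    return instance_count <= compute_host_count


def replication_is_satisfiable(replication, datanode_count):
    """True if dfs.replication can be met: each replica needs a distinct datanode."""
    return 1 <= replication <= datanode_count
```

For example, a cluster with two datanodes cannot satisfy dfs.replication=3, so blocks would stay under-replicated.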

Cluster launching and scaling

Navigate to Data Processing → Clusters → Cluster Templates → Launch Cluster

  • Name: Must follow rules for valid hostname
  • Cluster Template: Plugin must match
  • Cluster Count: Creates that many separate clusters. If you only want more nodes in one cluster, scale the cluster later or edit the cluster template instead
  • Base Image: Usually there is just one choice of OS/version per plugin, but sometimes more
  • Keypair: Highly recommended, since images do not have default passwords
  • Neutron Management Network: make sure to choose a private (tenant) network, not the floating network
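
Template and cluster names must be valid hostnames, so it can help to check a name before submitting the form. The sketch below assumes the rule is a single RFC 1123 hostname label (letters, digits, and hyphens, no leading or trailing hyphen, at most 63 characters). Sahara's own check may differ in detail, and looks_like_hostname is a hypothetical helper.

```python
import re

# One RFC 1123 hostname label: starts and ends with a letter or digit,
# hyphens allowed in between, 1 to 63 characters in total.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def looks_like_hostname(name):
    """Return True if the name is a valid single hostname label."""
    return _LABEL.fullmatch(name) is not None
```

Names with underscores or spaces, such as my_cluster, fail this check.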

Cluster access

Job execution

Swift integration

Jobs on Sahara clusters support Swift I/O. To be more specific, this means you can write a Swift path instead of an HDFS path in any MapReduce or Spark job. Here are some helpful hints regarding the use of Swift:

  • Paths: Format is swift://<container>.sahara/path/to/file
  • Authentication: Username and password must be passed at runtime. If you are using the Sahara API/UI to submit jobs, creating a “Data Source” may handle this for you (more about this in the next section). Username and password are set by fs.swift.service.sahara.username and fs.swift.service.sahara.password. The value fs.swift.service.sahara.tenant is automatically set when the cluster is first created, but you can override it if needed.
  • Easy transfer: It may be easier to move data from Swift into HDFS first, before running a job. You can use the “distributed copy” command, e.g.
    • hadoop distcp -D fs.swift.service.sahara.username=USERNAME -D fs.swift.service.sahara.password=PASSWORD INPUT_PATH OUTPUT_PATH
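
The path format and the distcp command above can be assembled programmatically, which avoids typos in the long property names. This is a minimal sketch. The helper names swift_url and distcp_command are hypothetical, as are the container, object path, and HDFS target in the example. The property names and the command itself come from the text above.

```python
import shlex


def swift_url(container, path):
    """Build swift://<container>.sahara/path/to/file from a container and object path."""
    return "swift://{}.sahara/{}".format(container, path.lstrip("/"))


def distcp_command(username, password, src, dst):
    """Return the distcp invocation as an argument list."""
    return [
        "hadoop", "distcp",
        "-D", "fs.swift.service.sahara.username={}".format(username),
        "-D", "fs.swift.service.sahara.password={}".format(password),
        src, dst,
    ]


# Hypothetical container, object path, and HDFS target directory.
cmd = distcp_command(
    "USERNAME", "PASSWORD",
    swift_url("mydata", "input/books.txt"),
    "/user/ubuntu/input",
)
# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(part) for part in cmd))
```

On a cluster node, the same list could be passed to subprocess.run. Keep in mind that a password given on the command line is visible to other users of that machine in the process list.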

Job submission through UI

  • Data Sources: A useful abstraction of Swift paths (credentials applied automatically, too!) or HDFS paths
    • Name: no spaces
    • Data Source Type: Swift or HDFS. Sahara doesn’t check that your path or credentials are valid, so be careful
    • URL: Just regular HDFS path, or swift://<container>.sahara/path
  • Job Binaries: Any jar files, scripts, etc. that you submit with your job.
    • Storing them in Swift is advised
  • Job Templates: This is where the actual job is designed
    • Name: no spaces
    • Main Binary: for Hive, Shell, Spark, and Storm
    • Libs: any additional files, or more commonly the main binary for Java/MapReduce
    • Interface Arguments: Allows you to add custom fields to the job launching form, for args and config parameters (will come in handy in next section)
  • Jobs: (launch from Data Processing → Job Templates → Launch on Existing Cluster)
    • Cluster: Make sure to choose the right one!
    • Input/Output: These only show up for MapReduce and Hive jobs. Otherwise you will set your input and output paths as args (or they could be hardcoded in your job)
    • Main Class: Required for Java and Spark jobs; equivalent to setting it in the Configuration section (or through an interface argument)
    • Configuration – Swift: “Use Data Source Substitution for Names and UUID” allows you to use datasource://<source_name> in args. Otherwise you can pass fs.swift.service.sahara.username and fs.swift.service.sahara.password and use regular swift:// style paths
    • Arguments: These are command line arguments, often used for input and output args
    • Interface Arguments: If you designed these in the Job Template step, you will see them here. If they are set as required, you will be reminded to fill them in if you forget. This is very helpful, since jobs with Swift I/O and similar features can need many parameters.
    • Status (after launching): PENDING, RUNNING, etc. are not very descriptive, so using the web console for your cluster is recommended. Oozie is good for this on a Vanilla cluster. For Spark and Storm, you can also look in /tmp/<PLUGIN>-edp/<JOB_TEMPLATE>/<JOB_ID>/, where files like launch_command.log and stderr are very useful.
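
For Spark and Storm jobs, the log directory pattern above can be filled in with a small helper. This is a sketch under assumptions: the function names are hypothetical, the example values are made up, and the exact spelling of the plugin part of the directory name should be confirmed on your cluster.

```python
import posixpath


def edp_job_dir(plugin, job_template, job_id):
    """Build /tmp/<PLUGIN>-edp/<JOB_TEMPLATE>/<JOB_ID> for a launched job."""
    return posixpath.join("/tmp", "{}-edp".format(plugin), job_template, job_id)


def edp_log_files(plugin, job_template, job_id):
    """Return the paths of the two log files named in this guide."""
    base = edp_job_dir(plugin, job_template, job_id)
    return [posixpath.join(base, name) for name in ("launch_command.log", "stderr")]
```

The returned paths can then be inspected over SSH on the node where the job was launched.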

Manual interaction

Users who already know Hadoop and related tools well may prefer to use their cluster “manually”.

  • Where programs live
    • Vanilla: /opt/hadoop, /opt/hive
    • Spark: /opt/spark
    • Storm: /usr/local/
  • Permissions
    • Spark and Storm: the “ubuntu” user is used for everything
    • Vanilla: Many Hadoop files and programs belong to the “hadoop” user