Engage1 Infrastructure


Overview

The Big Data Testbed is a large, collaborative cluster which serves as a test site for a wide variety of innovative high performance computing technologies.  It is located at the Massachusetts Green High Performance Computing Center (MGHPCC).

MOC’s industry partners in the Big Data project include Brocade, Intel, Lenovo, and Red Hat. MIT Research Computing contributes the cluster’s compute servers.


Architecture

The Big Data Testbed consists of 18 compute racks, one storage rack, and an experimental OpenStack rack.  MIT provides compute servers for the cluster, which are housed in MIT racks.  At scheduled times, MIT nodes are made available for provisioning via the MOC network, and are then returned to the MIT compute pool when the window ends.  We are currently exploring rapid node deployment using Bare Metal Imaging (BMI).  Future plans include moving nodes between MIT and MOC more dynamically via requests to MIT’s provisioning system.

Each compute rack will eventually feature an Intel cache server with two high-performance SSDs, configured as a cache tier for the cluster’s Ceph storage backend.  Five such servers are currently deployed.
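
As a rough illustration of what that configuration involves, below is a minimal sketch of attaching an SSD-backed pool as a writeback cache tier in front of an existing Ceph pool using the standard ceph CLI (driven from Python here).  The pool names (rbd for the SATA-backed base pool, ssd-cache for the SSD pool) and the sizing values are assumptions for the example only; the testbed’s actual pool layout and CRUSH rules are not documented here.

import subprocess

def ceph(*args):
    """Run one ceph CLI command, echoing it first for visibility."""
    cmd = ("ceph",) + args
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

BASE_POOL = "rbd"         # assumed name of the SATA-backed base pool
CACHE_POOL = "ssd-cache"  # assumed name of the SSD-backed cache pool

# Attach the SSD pool as a cache tier, put it in writeback mode,
# and route client traffic for the base pool through the cache.
ceph("osd", "tier", "add", BASE_POOL, CACHE_POOL)
ceph("osd", "tier", "cache-mode", CACHE_POOL, "writeback")
ceph("osd", "tier", "set-overlay", BASE_POOL, CACHE_POOL)

# Basic cache behaviour and sizing knobs (placeholder values).
ceph("osd", "pool", "set", CACHE_POOL, "hit_set_type", "bloom")
ceph("osd", "pool", "set", CACHE_POOL, "target_max_bytes", str(500 * 10**9))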

The cluster is backed by Red Hat Ceph Storage.  The base Ceph tier, containing the majority of the storage capacity, is installed on ten Lenovo x3650 servers.  Each server is equipped with nine 4TB SATA drives, plus three faster SSDs for journaling.  Because Ceph keeps three replicas of each object, the total working capacity of the base storage tier is about 103TB.  Three small Quanta servers serve as monitors for the cluster.
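
For reference, here is a back-of-the-envelope sketch of that capacity figure; the small gap between the ~120TB (≈109 TiB) computed below and the ~103TB quoted above presumably reflects filesystem and journal overhead.

# Rough usable-capacity estimate for the base tier described above:
# 10 servers x 9 x 4TB SATA drives, three-way replication.
servers = 10
drives_per_server = 9
drive_tb = 4.0            # marketed (decimal) terabytes per drive
replicas = 3

raw_tb = servers * drives_per_server * drive_tb    # 360 TB raw
usable_tb = raw_tb / replicas                      # 120 TB before overhead
usable_tib = raw_tb * 1e12 / 2**40 / replicas      # ~109 TiB before overhead

print("raw capacity: %.0f TB" % raw_tb)
print("usable at 3x replication: %.0f TB (~%.0f TiB)" % (usable_tb, usable_tib))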

The cluster network is a special ‘bifurcated ring’ architecture designed for this project by engineers at Brocade.  The ring is designed to create short paths between any two points while remaining expandable with minimal disruption to the existing infrastructure.  Currently, we have 22 Brocade VDX6740 switches in the cluster – one for each of the 18 compute racks, one in the OpenStack rack, and three in the storage rack.  The ten storage servers are split among the three storage-rack switches, so the storage backend is reachable from three different points in the wider ‘ring’ architecture.

The last piece of the cluster is an MOC compute rack with 7 Quanta servers and 6 Lenovo System x3550 servers. This rack features a small OpenStack Liberty installation, an experimental OpenShift environment, and a multinode setup used for SDN and OpenFlow research. This rack also houses the HIL master, which controls networking for the Engage1 deployment.


Diagrams

Big Data Testbed Infrastructure Diagram (Engage1)

RGW-Proxy Overview (RGWs SSD cache in MOC)


Project Team

Core Project Team

  • Chris Hill, Principal Research Scientist, Earth, Atmospheric, and Planetary Sciences (MIT)
  • Radoslav Milanov, Senior Infrastructure Engineer (Boston University)
  • Laura Kamfonik, Junior Infrastructure Engineer (Boston University) 
  • Paul Hsi (MIT)
  • Rahul Sharma, Co-op Intern (Boston University) 
  • Piyanai Saowarattitada, MOC Director of Engineering (Boston University) 

Contributors

  • Jon Bell (Boston University)
  • Dave Cohen (Intel)
  • Rob Montgomery (Brocade)
  • Mark Presti (Brocade)
  • Bob Newton (Lenovo)
  • Jon Proulx (MIT – CSAIL)
  • Garrett Wollman (MIT – CSAIL)
  • Tyler Brekke (Red Hat)
  • Huitae Kim (Red Hat)
  • Joe Fontecchio (University of Massachusetts) 

Timeline

  • July 2015
    • Engineers from Lenovo install the 10 storage servers at MGHPCC.
    • Extensive planning of the Brocade fabric begins.
  • September 2015 – The 22 VDX6740 switches are configured by Rob M. (Brocade) and later installed in the racks.
  • October 2015 – The Brocade Fabric is installed, with over 100 cables connecting switches across two datacenter pods.
  • November 2015 – 10 small Quanta servers are installed, which later become admin/service nodes, Ceph monitors, and OpenStack nodes.
  • December 2015 – Brocade conducts two training sessions at BU.
  • January 2016 – Red Hat conducts a weeklong Ceph training at BU.  Tyler B. and H. Kim (Red Hat) are on-site to help with the Ceph installation.
  • February 2016 – An initial set of Intel Cache servers are configured, and an experimental radosgw-proxy service is set up.  This service will be tested on the cluster with Hadoop in the near future.
  • March 2016 – A handful of physical systems are deployed for a Hadoop bare metal deployment POC.  Exploring the possibility of joining the RH Ceph High Touch Beta program for early access/support of the upstream Infernalis release.
  • April 2016 – Hadoop bare metal deployment against additional hardware loaners from Lenovo, as well as against 200+ MIT MRI systems.  Evaluate Infernalis ahead of a possible deployment.
  • May 2016 – Depending on the roadmap of the Big Data use cases, deploy Infernalis.
  • Summer 2016 – Big Data and HPC use case support.  Testing of BMI deployment.
  • Fall 2016 – Deploy Anycast setup for cache tiering experiments in the Brocade environment.
  • Winter 2017 – Upgrade to Mitaka.
  • Spring 2017 – Small staging area with 1 cache server and 3 compute nodes added to the MOC rack.
  • Summer 2017 (Planned)
    • Deploy User Automation
    • Upgrade to Newton
    • Upgrade to Ocata (pending RHEL release)
