Big Data Enablement @MOC

Overview

MOC is part of the Mass Big Data Initiative and is, at its core, a BigData project. One of MOC's main goals is to level the playing field for BigData Platform (BDP) innovation by enabling research and innovation in a production environment. More specifically, its aims are (i) to enable BigData Platform (BDP) research in the Commonwealth region, (ii) to enable users and government agencies to easily upload and share public datasets, and (iii) to exploit innovative infrastructure such as SSDs, low-latency networking, RDMA, and accelerators in BDP research.

Currently we are focused on several studies that will establish the foundation for the above goals: building a public dataset repository, building an on-demand bare-metal BigData provisioning system, building an efficient storage and datacenter-scale caching mechanism that utilizes SSDs and enables fast processing of public datasets in BigData platforms provisioned over MOC, providing container-based cloud services using OpenShift, and exposing innovative infrastructure such as GPUs and FPGAs to cloud users as a service.


Sub-Projects

On-Demand Bare-Metal BigData @ MOC:
Increasingly, there is demand for bare-metal BigData solutions for applications that cannot tolerate the unpredictability and performance degradation of virtualized systems. Existing bare-metal solutions can introduce delays of tens of minutes to provision a cluster by installing operating systems and applications on the local disks of servers. This has motivated recent research into sophisticated mechanisms to optimize this installation. These approaches assume that using network-mounted boot disks incurs unacceptable run-time overhead. Our analysis suggests that while this assumption is true for application data, it is incorrect for operating systems and applications: network mounting the boot disk and applications results in negligible run-time impact while leading to much faster provisioning times. We have developed a prototype network-mounted BigData provisioning system and are working on turning it into an on-demand bare-metal BigData provisioning system.

For more details, see:

Ata Turk, Ravi S. Gudimetla, Emine Ugur Kaynar, Jason Hennessey, Sahil Tikale, Peter Desnoyers, and Orran Krieger. An experiment on bare-metal bigdata provisioning. In 8th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 16), Denver, CO, 2016. USENIX Association.

MOC Dataset Repository (Cloud Dataverse):
In collaboration with the Harvard Dataverse team, we are building a dataset repository, similar to AWS Public Datasets, that will initially host public datasets of the Commonwealth of Massachusetts and will later expand to host public and community datasets of researchers in the Massachusetts region. The core project through which we provide a dataset repository with added computational abilities is the Cloud Dataverse project. The MOC Dataset Repository (MOC-DR) is the first implementation of Cloud Dataverse.

MOC-DR will host a variety of datasets. Large datasets from the Commonwealth of Massachusetts, academia, and industry (e.g., daily Massachusetts traffic congestion and public transportation information, CERN physics experiment datasets, human genome datasets, and stock market exchange transaction archives) that normally require hours or days to locate, download, customize, and analyze will be available for access. Dataset metadata and associated provenance information will be hosted on MOC-DR, a Dataverse portal [1].

Usage scenarios of the dataset repository include:

(1) Users will be able to upload datasets into MOC-DR: (a) a user logs into MOC-DR, (b) is authenticated against the MOC authentication system (e.g., Keystone), (c) is presented an interface for uploading datasets and providing the metadata information (the current Dataverse data upload interface), and (d) datasets are stored in the MOC object storage backend (CEPH), with which Dataverse communicates via Swift calls.
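
For illustration, here is a minimal sketch of step (d) using python-swiftclient against the Swift API exposed by the CEPH RADOS Gateway; the Keystone endpoint, credentials, and container name are placeholders rather than the actual MOC-DR configuration.

    # Store a dataset file in the Swift-compatible CEPH backend (via RGW).
    # Endpoint, credentials, and names below are illustrative placeholders.
    from swiftclient.client import Connection

    conn = Connection(
        authurl="https://keystone.example.org:5000/v3",  # hypothetical Keystone endpoint
        user="dataverse-uploader",
        key="SECRET",
        auth_version="3",
        os_options={
            "project_name": "moc-dr",
            "user_domain_name": "Default",
            "project_domain_name": "Default",
        },
    )

    container = "mbta-ridership"   # one container per dataset (an assumption)
    conn.put_container(container)

    with open("ridership-2016-10.csv", "rb") as f:
        conn.put_object(container, "ridership-2016-10.csv",
                        contents=f, content_type="text/csv")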

(2) Users will be able to download datasets from MOC-DR: (a) a user logs into MOC-DR (using Keystone credentials), (b) browses through public datasets, (c) selects a dataset to work on, (d) obtains the Swift object store endpoints for the selected dataset, and (e) from there the user can either use the given endpoints to download the dataset or use the MOC Big Data processing system to analyze the data on MOC without needing to worry about the cost of storing the data or the time required to download it.
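
Similarly, a sketch of steps (d) and (e): the storage URL returned by the authentication call serves as the dataset endpoint, and the objects of a dataset can then be listed and fetched (reusing the connection and placeholder names from the upload sketch above).

    # Resolve the Swift endpoint for a dataset and download its objects.
    storage_url, auth_token = conn.get_auth()            # (d) object store endpoint
    print("Dataset endpoint:", storage_url + "/mbta-ridership")

    headers, objects = conn.get_container("mbta-ridership")
    for obj in objects:                                   # (e) fetch each object locally
        hdrs, body = conn.get_object("mbta-ridership", obj["name"])
        with open(obj["name"], "wb") as f:
            f.write(body)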

MOC Big Data as a Service (BDaaS) on Engage1:
In collaboration with MIT, Northeastern, Lenovo, Intel, Brocade, Red Hat, and CEPH teams, we are building an on-demand Big Data processing system that is closely tied to the MOC public dataset repository. The service will run on the servers of the Engage1 cluster.

One envisioned standard use case for this system is as follows:

(1) The user selects a dataset to work on in the MOC Big Data processing environment: after selecting a dataset, (a) the user is directed to the MOC On-Demand BigData processing UI, (b) provisions a BigData environment to work on the selected dataset (by setting environment configuration parameters, e.g., the number of nodes, the storage, processing, and memory capacity, and the set of Big Data processing applications (e.g., Hadoop, Pig, Spark) to include), (c) using the user's credentials and desired configuration, the "MOC Imaging System" creates a Big Data processing cluster for the user, and (d) the user can then use Hadoop-ecosystem applications to process and analyze the selected dataset using the provided resources.
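
To make step (b) more concrete, the provisioning request might look roughly like the following; the field names and the submit_cluster_request() helper are hypothetical placeholders, not an existing MOC Imaging System API.

    # Hypothetical provisioning request for steps (b) and (c); none of the field
    # names or the helper below correspond to a real MOC API.
    cluster_request = {
        "dataset": "mbta-ridership",        # dataset selected in MOC-DR
        "flavor": "bare-metal",             # or "virtualized" / "container"
        "nodes": 8,
        "memory_gb_per_node": 64,
        "storage_gb_per_node": 500,
        "applications": ["hadoop", "spark", "pig"],
    }

    def submit_cluster_request(request, keystone_credentials):
        """Placeholder for handing the request and the user's Keystone
        credentials to the MOC Imaging System."""
        raise NotImplementedError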

We envision providing a number of options for the Big Data processing environment: (A) The Big Data processing system can be:
(I) a bare-metal environment provisioned via the MOC Bare Metal Imaging (BMI) system, (II) a virtualized environment provisioned via OpenStack Sahara, or (III) a containerized environment (possibly provisioned via BlueData's container-based solutions).

(B) The dataset (or a subset of the dataset, e.g., imagery data from the last week) selected by the user can be prefetched into the caching servers in the processing cluster. This prefetching prevents many small object requests to the storage backend and guards against variance that may be caused by input fetching during computation.

Example dataset types: We plan to host a number of datasets. A representative dataset type is large, read-only datasets that require an extraction and transformation phase to enable further analysis, such as the CMS Primary Datasets [2], the MBTA Ridership, Alerts & Service Quality datasets [3], or the 3,000 Rice Genomes Project [4]. Datasets of this type are by nature well divided into separately operable chunks, they receive incremental updates (e.g., daily, weekly, or monthly), the latest increments are considered 'hot', and most processing tasks can be done in a single pass over a number of incremental units of data (e.g., compute or compare daily aggregates for the last 10 days).
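
As a concrete illustration of such a single-pass task, the short PySpark sketch below computes daily aggregates over the last 10 daily increments of a ridership-style dataset; the object URIs and column names are assumptions (and the Hadoop Swift connector is assumed to be configured), not the actual MBTA schema.

    # Single-pass daily aggregate over incremental chunks of a read-only dataset.
    # URIs and column names ("service_date", "rides") are illustrative only.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("daily-aggregates").getOrCreate()

    # Each daily increment is assumed to live in its own object, e.g.
    # swift://mbta-ridership.moc/2016-10-01.csv (placeholder URIs).
    paths = ["swift://mbta-ridership.moc/2016-10-{:02d}.csv".format(day)
             for day in range(1, 11)]

    rides = spark.read.csv(paths, header=True, inferSchema=True)

    daily = (rides.groupBy("service_date")
                  .agg(F.sum("rides").alias("total_rides"))
                  .orderBy("service_date"))
    daily.show()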

[1] http://dataverse.org
[2] http://opendata.cern.ch/collection/CMS-Primary-Datasets
[3] http://www.mbta.com/about_the_mbta/document_library/?search=blue+book&submit_document_search=Search+Library
[4] http://gigascience.biomedcentral.com/articles/10.1186/2047-217X-3-7

Existing studies indicate that in HDFS a few files account for a very high number of accesses, i.e., there is a skew in access frequency across HDFS data. Thus, we believe that a more sophisticated caching architecture, with policies that keep the frequently accessed files cached, will bring considerable benefit to big data analytics.

We are working on improving the performance of Big Data processing frameworks (e.g., Hadoop, Spark) in MOC BDaaS by implementing a two-level cache architecture. We currently concentrate on improving analytics over read-only datasets that have high re-use.

In the Engage1 environment, each compute rack has 18 compute servers and one caching server with two 1.6TB SSD drives. MOC-DR will be stored in the CEPH backend storage, which has 10 OSD servers, each with 9 HDD drives.

On the cache server, we will deploy a two-level caching mechanism. For the first-level cache, we plan to use a Varnish cache (another reverse proxy could be used as well), and to enable a second-level cache we plan to implement data caching capabilities in the Rados Gateway (RGW) implementation. Currently, the RGW Swift gateway only caches metadata; we will modify it so that data is also cached along with the metadata.

In the figure below you can see more details of the caching design we envision for the caching servers:

(Figure: envisioned two-level caching architecture on the caching servers)
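
To summarize the intended read path, here is a minimal sketch of the two-level lookup (Varnish in front of an RGW-side data cache, falling back to the CEPH OSDs); the classes and methods are stand-ins for illustration, not actual Varnish or RGW interfaces.

    # Illustrative read path through the envisioned two-level cache.
    # The cache objects below are stand-ins; they only model the order in
    # which the layers are consulted, not the real Varnish or RGW code.
    class TwoLevelCache:
        def __init__(self, l1_varnish, l2_rgw_cache, ceph_backend):
            self.l1 = l1_varnish          # level 1: Varnish (memory / SSD-backed)
            self.l2 = l2_rgw_cache        # level 2: data cache inside RGW (SSD)
            self.backend = ceph_backend   # CEPH OSD servers (HDDs)

        def get(self, object_name):
            data = self.l1.get(object_name)
            if data is not None:                  # level-1 hit
                return data
            data = self.l2.get(object_name)
            if data is None:                      # miss in both levels
                data = self.backend.get(object_name)
                self.l2.put(object_name, data)    # populate level 2 on the way back
            self.l1.put(object_name, data)        # promote to level 1
            return data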

Besides the previously noted advantage of reusing cached data, one of the main features of this project is prefetching the entire dataset into the secondary cache before, or while, the cluster is executing the aggregation phase (the Mapper in MapReduce). By applying a prefetching mechanism, we can hide or decrease two types of delay: (1) the backend network delay, and (2) the backend hard drive delay.

We propose to:

  • Prefetch all data into the secondary cache (see the sketch after this list).
  • Cache data chunks in the Varnish cache.
  • Configure Varnish storage appropriately: Varnish can store data in memory, in a file, in persistent memory, or in its Massive Storage Engine. We suggest storing metadata in Varnish and using a shared data memory between the level-2 and level-1 caches as the data cache memory.
  • Because of security concerns, cached data chunks in the Varnish data cache are duplicated once per user of the cache.
  • Have the secondary cache store prefetched data in a distributed fashion.
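
A minimal sketch of the prefetching step, assuming the secondary cache can be warmed simply by issuing Swift GETs for every object of the selected dataset before (or while) the Mapper phase runs; the container name and the warm_secondary_cache() helper are assumptions for illustration.

    # Warm the secondary (RGW-side) cache by fetching every object of a dataset
    # before the aggregation phase starts. Names are placeholders; a real
    # implementation would parallelize the GETs (one connection per worker) to
    # overlap backend network and disk latency.
    from swiftclient.client import Connection

    def warm_secondary_cache(conn, container):
        """Issue a GET for each object so the cache layers keep a copy of it."""
        _, objects = conn.get_container(container)
        for obj in objects:
            # The body is discarded; the useful side effect is that the caching
            # layers in front of the OSDs store the object on the way through.
            conn.get_object(container, obj["name"])

    # Example (reusing the placeholder credentials from the earlier sketches):
    # warm_secondary_cache(conn, "mbta-ridership")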


Project Team

Core Project Team

  • Prof. Orran Krieger (Boston University) 
  • Prof. Peter Desnoyers (Northeastern University) 
  • Ata Turk, PhD (Boston University) 
  • Ugur Kaynar (Boston University) 
  • Ravisantosh Gudimetla (Red Hat)
  • Sahil Tikale Nikhil (Boston University) 
  • Mania Abdi (Northeastern University)
  • Mohammad Hossein Hajkazemi (Northeastern University)

Collaborators

  • David Cohen (Intel)
  • Gary Berger (Brocade)
  • William Nelson (Lenovo)
  • Dataverse Team (Harvard)
  • Mercè Crosas (Harvard)
  • Leonid Andreev (Harvard)
  • Zhidong Yu (Intel)
  • Jingyi Zhang (Intel)
  • Piyanai Saowarattitada (MOC)
  • Tom Nadeau (Brocade)
  • Mark Presti (Brocade)
  • Robert Montgomery (Brocade)
  • Pete Moyer (Brocade)

Timeline

  • April–August 2017: 2-tiered caching: upstreaming developed code (CEPH).
  • August–November 2017: Caching write-back support.
  • August–November 2017: Containers as an MOC service: production OpenShift setup.
  • November 2017 – 2018: Production MOC BigData as a Service solution.

Planning and Getting Involved

To get involved in this project, please send email to moc-team-list@bu.edu and/or join the #moc IRC channel on freenode.
