Micro-talks III

March 3, 2020, 2:00 – 3:10 PM

Challenges and Opportunities for AI Industrialization

Abstract: We are witnessing a great awakening of AI, thanks to the proliferation of specialized chips, breakthroughs in machine learning techniques, and the explosion of digital data. Although many enterprises consider AI strategically important, only a very small portion of them have revenue-generating AI systems running today. Advancing from machine learning prototypes in a lab setting to enterprise AI systems in production poses many challenges and requires a systematic approach. In this talk, I will sample some of the challenges an enterprise faces in adopting AI at scale and realizing its full benefits. I will also discuss research opportunities associated with these challenges.

Presenter: Hui Lei is Vice President and CTO of Cloud and Big Data at Futurewei Technologies. Previously he was Director and CTO of Watson Health Cloud at IBM, an IBM Distinguished Engineer, and an IBM Master Inventor. He is a Fellow of the IEEE, a past Editor-in-Chief of the IEEE Transactions on Cloud Computing, a past Chair of the IEEE Technical Committee on Business Informatics and Systems, and an author of over 80 patents. He received a Ph.D. in Computer Science from Columbia University.

Bayesian Learning for Online Parameter Tuning of Complex Scientific Applications

Abstract: As scientific applications and the hardware they are deployed on become increasingly complex, the number of tunable parameters is growing rapidly. Auto-tuning these parameters for optimal performance is therefore a challenging and time-consuming task. In this talk, I will present our recent work on a Bayesian Optimization-based approach that offers a theoretically grounded, practical, and efficient solution for auto-tuning complex scientific applications. Our solution makes multiple novel enhancements to minimize auto-tuning time without requiring prior domain-specific information, binary instrumentation, or high-overhead offline training.
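The Bayesian Optimization loop the abstract refers to can be illustrated with a minimal sketch: fit a surrogate model to the parameter-to-runtime samples gathered so far, then choose the next configuration to try by maximizing expected improvement. This is a generic, self-contained toy (a quadratic stand-in objective, a Gaussian-process surrogate, and invented names such as `tune`), not the presenters' actual system:

```python
import numpy as np
from scipy.stats import norm

# Toy stand-in for one run of a scientific application: runtime as a
# function of a single tunable parameter, with a minimum at x = 2.0.
# (Purely illustrative; in practice each evaluation is a real run.)
def runtime(x):
    return (x - 2.0) ** 2 + 0.5

def rbf_kernel(a, b, length=1.0):
    d = a.reshape(-1, 1) - b.reshape(1, -1)
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(X, y, Xs, noise=1e-4):
    """Gaussian-process posterior mean/stddev at candidate points Xs."""
    K = rbf_kernel(X, X) + noise * np.eye(len(X))
    Ks = rbf_kernel(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    v = np.linalg.solve(K, Ks)
    # diagonal of the RBF kernel is 1, so posterior variance is 1 - diag(Ks^T K^-1 Ks)
    var = np.clip(1.0 - np.sum(Ks * v, axis=0), 1e-10, None)
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, best):
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def tune(n_init=3, n_iters=15, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 5.0, n_init)          # a few random initial runs
    y = np.array([runtime(x) for x in X])
    grid = np.linspace(0.0, 5.0, 200)          # candidate parameter values
    for _ in range(n_iters):
        mu, sigma = gp_posterior(X, y, grid)
        x_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
        X = np.append(X, x_next)               # "run" the application once more
        y = np.append(y, runtime(x_next))
    return X[np.argmin(y)]

best = tune()
print(f"best parameter found: {best:.2f}")
```

Each iteration spends one (expensive) application run where the surrogate predicts the greatest expected improvement, which is what lets such approaches converge with far fewer runs than exhaustive or random search.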

Presenter: Rohan is a Ph.D. student in Computer Engineering who joined Northeastern University in Fall 2019 after completing his undergraduate degree in Electronics and Communication Engineering at the University of Calcutta, India. His research areas include serverless computing and high-performance computing systems and architecture. He is especially interested in exploring the trade-offs between performance, fairness, energy efficiency, and resiliency of large-scale computing systems.

Hybrid Cloud Storage

Abstract: The value offered by public cloud services is clear for analytic workloads. Specialized hardware such as GPUs for AI/ML may make more sense to lease as operating expense (Opex) than to acquire as capital expense (Capex) on infrastructure that is not continuously utilized. However, it may not make sense to build large data sets inside public clouds, due both to the cost multiple compared to building out and maintaining private infrastructure and to the lock-in nature of public cloud services. These drivers lead toward a hybrid architecture in which large data sets are built and maintained in private clouds, while compute/analytic clusters are spun up in public clouds to run the actual analytics on these data sets. Such a hybrid architecture introduces latency and bandwidth challenges between the private data lake and the public cloud compute cluster. In this presentation we describe research being done by Mass Open Cloud and Red Hat researchers to build caching solutions that maximize the throughput of these leased analytics clusters and avoid re-reading the same data from the private data lake.
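The caching idea described here can be sketched as a read-through cache co-located with the public-cloud compute cluster: object reads are served locally when possible, and only misses cross the WAN to the private data lake. The following is an illustrative toy with invented names, not the MOC/Red Hat design:

```python
from collections import OrderedDict

class ReadThroughCache:
    """Minimal read-through cache: serve object reads from local
    (public-cloud-side) storage when possible; on a miss, fetch from
    the remote private data lake, retain a copy, and evict LRU."""
    def __init__(self, fetch_remote, capacity=4):
        self.fetch_remote = fetch_remote  # callable: key -> bytes (slow WAN read)
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.store:
            self.store.move_to_end(key)   # mark as most recently used
            self.hits += 1
            return self.store[key]
        self.misses += 1
        data = self.fetch_remote(key)     # cross the WAN to the data lake
        self.store[key] = data
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict least recently used
        return data

# Usage: repeated analytics passes over the same objects hit the cache,
# so only the first pass pays the WAN latency/bandwidth cost.
lake = {f"obj{i}": bytes(10) for i in range(6)}  # stand-in for the data lake
cache = ReadThroughCache(lambda k: lake[k], capacity=4)
for _ in range(2):
    for k in ["obj0", "obj1", "obj2"]:
        cache.get(k)
print(cache.hits, cache.misses)  # second pass served locally: 3 hits, 3 misses
```

Real designs must additionally handle multi-node cache sharing, consistency with writes, and object sizes far larger than memory, but the hit/miss economics driving the research are the same.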

Presenter: Emine Ugur Kaynar is a Ph.D. candidate in the Department of Computer Science at Boston University (BU) working with Prof. Orran Krieger, Prof. Larry Rudolph, and Prof. Peter Desnoyers. She is a member of the Systems Research Group and is associated with the Mass Open Cloud (MOC).
Her research interests lie broadly in storage systems, cloud computing, and data-intensive computing. Currently, her research focuses on designing and building cache architectures for object storage systems in data centers, and on exploring the performance characteristics of erasure-coded storage systems.

Impact of OS Design and Hardware Configuration on the Power Performance Tradeoff

Abstract: Energy-proportional computing is a challenging feat when managing modern I/O-intensive datacenter workloads with strict service-level agreements (SLAs). We focus on saving power in the context of key-value stores whose 99th-percentile tail latencies are in the hundreds of microseconds. Typically, additional power savings can be gained when server nodes are underutilized during periods of low to medium traffic by using CPU power limiting or frequency tuning. In this work, we tackle power consumption in the joint operation of system software stacks and network interface cards (NICs). We first explore manipulating a time-based interrupt throttling mechanism on modern NICs to tune interrupt rates and control the batching of packet processing. By exploiting this feature and artificially increasing packet processing delay, a system can buffer more packets and process fewer interrupts. On Linux, this method achieved over 22% power savings compared to using CPU power limiting features.

We also compared and contrasted CPU and NIC tuning in the context of a bare-metal library OS specialized for Memcached, in order to understand their behavior in the performance-power tradeoff. Our bare-metal library OS achieved 2.5X higher peak throughput than Linux while using 3X less power. Moreover, we find that our library OS is 2-3X more efficient in instructions per watt and is therefore more sensitive to CPU frequency changes. Our results demonstrate that optimizing the dataplane code path in system software stacks, paired with the right hardware tuning, can scale up I/O-intensive workloads while effectively lowering power consumption.
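As one concrete illustration of the kind of knob the abstract describes: on Linux, many NIC drivers expose time-based interrupt coalescing through `ethtool`, where raising `rx-usecs` delays interrupt delivery so more packets are batched per interrupt. This is a generic example (the interface name `eth0` is a placeholder, and the exact mechanism and measurement setup used in this work may differ):

```shell
# Inspect the current interrupt-coalescing settings of a NIC.
ethtool -c eth0

# Raise the time-based interrupt throttling interval: the NIC buffers
# packets for up to 200 us before raising an interrupt, trading a small
# added latency for fewer interrupts and lower CPU activity.
sudo ethtool -C eth0 rx-usecs 200

# Compare package power at each setting, e.g. via the RAPL energy
# counter exposed by perf (available on recent Intel CPUs).
sudo perf stat -a -e power/energy-pkg/ -- sleep 10
```

Sweeping `rx-usecs` while monitoring tail latency and power is one practical way to explore the batching-versus-latency tradeoff discussed above.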
Presenter: Han Dong is a Ph.D. candidate in the Department of Computer Science at Boston University (BU) working with Prof. Orran Krieger and Prof. Jonathan Appavoo. He is a member of the Scalable Elastic Systems Architecture Group (SESA). His main interests lie in application-specific optimization of the entire software systems stack and hardware. He is also interested in the power-performance tradeoff in diverse systems software.

Using ESI in (De)-Centralized Environments: Why and How?

Abstract: Today many organizations choose to host their physically deployed clusters outside the cloud for security, price, or performance reasons. Such organizations form a large section of the economy, including financial companies, medical institutions, and government agencies. They host these clusters in private data centers or rent colocation facilities. Clusters are typically stood up with sufficient capacity to handle peak demand, resulting in silos of under-utilized hardware. Elastic Secure Infrastructure (ESI) is a platform, created at the Mass Open Cloud, that enables physically deployed clusters to break these silos through rapid multiplexing of bare-metal servers between clusters with different security requirements. In this talk we present BareShala and FLOCX, which leverage ESI to improve aggregate resource efficiency in centralized and decentralized environments, respectively. BareShala is a system architecture for centralized bare-metal resource management; an organization can use it to improve aggregate resource efficiency in its data center while ensuring that the service-level objectives (SLOs) of clusters are not violated. FLOCX is a decentralized, market-based incentive system in which organizations hosting clusters in a co-located data center can trade their servers to meet demand fluctuations. We will describe the preferences and constraints of different clusters and how they can benefit from such a marketplace, then discuss the economic model that drives the market, followed by the initial design of our prototype.

Sahil Tikale is a Ph.D. candidate in the Department of Electrical and Computer Engineering at Boston University, co-advised by Prof. Orran Krieger, Prof. Larry Rudolph, Prof. David Starobinski, and Prof. Peter Desnoyers. He is broadly interested in the design, building, and evaluation of cloud-scale systems. Currently his research is focused on building a multi-provider cloud that uses market-based economic models for trading bare-metal servers within a co-located data center.

He has been a core member of the Elastic Secure Infrastructure (ESI) group at the Mass Open Cloud and has also spent two summers interning at the Boston office of the Red Hat Open Innovation Labs. 

Before starting his Ph.D., he worked in industry for six years, designing and managing critical IT infrastructure for clients in the banking, government, and research sectors in India and Singapore.

Apoorve Mohan is a doctoral candidate at Northeastern University co-advised by Prof. Gene Cooperman and Prof. Orran Krieger. Broadly, he is interested in systems and networking; his current research revolves around improving the efficiency, security, and operation of bare-metal clouds. As a doctoral candidate, he has been a core member of the Elastic Secure Infrastructure (ESI) group at the Mass Open Cloud and has spent two summers interning at IBM Research T.J. Watson in Yorktown Heights, NY. Previously, he was involved in developing Baadal, an IaaS cloud platform used by the government of India to host various academic services. (http://apoorve.com/)