March 2, 2020, 11:30 – 12:30 PM
The Open Cloud FPGA Testbed – Supporting Research on Emerging Datacenter Configurations
Abstract: FPGAs are now ubiquitous in datacenters performing system tasks like SDN and metrology, system applications like encryption and compression, and applications-as-a-service like machine learning and big data analysis. But as is the case universally with commercial clouds, these FPGAs are not accessible to outside users; as a result, there is no existing infrastructure that supports research on emerging datacenter configurations. To address this deficiency, the NSF has funded the Open Cloud Testbed (OCT) to be built and operated by the University of Massachusetts, Boston University, and Northeastern University.
In this talk we first describe current datacenter hardware configurations and the many places that FPGAs (potentially) reside within each node. We then give an overview of how these are supported in the OCT. We end with a very brief catalog of potential research projects and an invitation for feedback and participation.
Martin Herbordt is Professor of Electrical and Computer Engineering at Boston University where he directs the Computer Architecture and Automated Design Lab. His research spans Architecture and High Performance Computing. He and his group have been working for many years in accelerating HPC applications with FPGAs and GPUs, and in building systems integrating FPGAs. More recently their focus has been on middleware and system aspects of large-scale FPGA clusters and clouds, the latter especially in Bump-in-the-Wire configurations.
Miriam Leeser is Professor of Electrical and Computer Engineering at Northeastern University. She has been doing research in hardware accelerators, including FPGAs and GPUs, for decades, and has done groundbreaking research in floating point implementations, unsupervised learning, medical imaging, privacy-preserving data processing and wireless networking. She received her BS degree in Electrical Engineering from Cornell University, and Diploma and Ph.D. degrees in Computer Science from Cambridge University in England. She has been a faculty member at Northeastern since 1996, where she is head of the Reconfigurable Computing Laboratory and a member of the Computer Engineering group. She is a senior member of ACM, IEEE and SWE. She is the recipient of an NSF Young Investigator Award. Throughout her career she has been funded by both government agencies and companies, including DARPA, NSF, Google, MathWorks and Microsoft. She received the prestigious Fulbright Scholar Award in 2018.
Programming FPGAs – The Open Source Way
Abstract: FPGAs provide the performance, power efficiency and flexibility needed to meet increasing complexity, diversity and dynamicity of data center workloads. However, despite FPGAs becoming increasingly ubiquitous, this promise has thus far been largely unfulfilled. This is due to proprietary tooling, which not only limits functionality and performance, but also exposes numerous attack surfaces. In this talk, we will present a completely open source toolchain for FPGAs that can perform Synthesis, Place & Route and Board Programming without using any vendor software. This toolchain will not only allow developers to better target the underlying device architecture, but will also open the floodgates for innovation as we can start to leverage FPGAs in a more effective and secure manner.
Presenter: Ahmed Sanaullah is a Senior Data Scientist for the Red Hat Office of the CTO, working on all things FPGA. This includes a number of open source projects across the hardware and software stacks, such as compilers, drivers and hardware operating systems. He has a PhD in Computer Engineering from Boston University, an MSc in Electrical and Electronics Engineering from the University of Nottingham, and a BS in Electrical Engineering from Lahore University of Management Sciences.
Leveraging Distributed Research Cloud Infrastructures for Domain Science Research and Experimentation
Abstract: Computational science today depends on complex, data-intensive applications operating on datasets from a variety of scientific instruments. A major challenge is the integration of data into the scientist’s workflow. Recent advances in dynamic, networked cloud resources provide the building blocks to construct reconfigurable, end-to-end infrastructure that can increase scientific productivity. However, applications have not adequately taken advantage of these advanced capabilities. In the context of the DyNamo project funded under the NSF Campus CyberInfrastructure program, we have developed a novel network-centric platform, Mobius, which enables high-performance, adaptive data flows and coordinated access to distributed multi-cloud resources (cloud research infrastructure like ExoGENI, Chameleon Cloud, XSEDE JetStream, Mass Open Cloud, etc.), and data repositories for atmospheric scientists.
We have demonstrated the effectiveness of our approach by evaluating time-critical, adaptive weather sensing workflows, which utilize advanced networked infrastructure to ingest live weather data from radars and compute data products on dynamically provisioned resources on hybrid, multi-cloud platforms, which are used for timely response to weather events. The workflows are orchestrated by the Pegasus workflow management system. We have shown that our approach results in timely processing of CASA weather workflows under different infrastructure configurations and network conditions. Our findings show that our network-centric platform, powered by advanced Layer 2 networking techniques, delivers faster, more reliable data throughput, makes multi-cloud resources easier to provision, and makes the workflows easier to configure for operational use and automation.
We are extending our work such that domain science data flows can be effectively adapted and optimized by leveraging Software-Defined Exchanges (SDX), and the Quality of Service of the end-to-end provisioned infrastructure can be transparently maintained by active monitoring and control. Our current plans also include supporting a wider federation of cloud infrastructure (public clouds like Amazon EC2 and other cloud resources like Open Cloud Testbed, CloudLab, etc.) using Mobius. Using the connected, distributed multi-cloud federation enabled by DyNamo, we are continuing to support (a) a wider range of adaptive weather sensing workflows performing wind computations and hail formation, and (b) ingest of streaming data and on-demand computations for workflows employing data from the Ocean Observatories Initiative (OOI) NSF Large Facility.
Presenter: Dr. Anirban Mandal serves as the Assistant Director for network research and infrastructure at RENCI, UNC-Chapel Hill. He leads efforts in science cyberinfrastructures. His research interests include resource provisioning, scheduling, performance analysis, and anomaly detection for distributed systems, cloud computing, and data-driven scientific workflows. He serves as the PI for the NSF CC* DyNamo project, and is a co-PI on the NSF Cyberinfrastructure Center of Excellence (CI CoE) Pilot project.
Strategic Management of Shared Cloud Resources
Abstract: We provide insights into user behavior from a game-theoretic perspective in various settings where shared cloud resources are available for purchase. One setting involves the shared/buy-in paradigm for cloud computing, where users choose between two tiers of services: shared services and buy-in services. An important feature of shared/buy-in computing consists of making idle buy-in resources available to other users, which has been shown to enhance the utilization rate of the cloud. The other main setting we investigate is related to advance reservation, in which priority access to cloud resources may be requested in advance of when they will be needed.
Jonathan Chamberlain is a first-year PhD student in Computer Engineering at Boston University, where he also received his BA in Mathematics and MS in Systems Engineering. He is a member of the Laboratory of Networking & Information Systems (NISLAB) at BU, with research interests including game theory, network systems, and cloud computing. He is currently working on analyses of user behavior and the social welfare under various pricing schemes applicable to cloud markets (whether open or otherwise).
Zhenpeng Shi is a second-year Ph.D. student in Electrical Engineering at Boston University. He is currently a member of the Laboratory of Networking & Information Systems (NISLAB) at BU. His research interests include game theory, mechanism design, cloud computing, and network economics. He has been working on analyzing strategic cloud user behaviors from a game-theoretic perspective and evaluating outcomes of different pricing schemes accordingly. He received his B.Eng degree from Harbin Institute of Technology, Harbin, China.
Secure and Customized Hypervisor with Qemu
Abstract: Hypervisors are the backbone of current virtual machine (VM) based Clouds, as they enable VM booting, device abstraction and sharing, as well as hardware virtualization support. They are also projected to play a central role in next-generation container environments for security, by supporting micro virtual machine (microVM) technologies. In either case, the security and performance of the hypervisor are at the center of state-of-the-art research in Cloud infrastructure. The most critical hypervisor software component is the virtual machine monitor (VMM), a user-space process that can access the largest number of kernel APIs and has direct control over devices and their drivers. Because of its central role, any system is only as safe as the deployed VMM. VMM projects include the widely used standard Qemu, which can also act as an operating system emulator, and more recent efforts based on the Rust language, such as Firecracker, Cloud Hypervisor, and CrosVM. The goal of Qemu is generality, i.e. the ability to support any type of hardware platform, including legacy devices for backward compatibility. Rust-based VMMs rely on the security properties of the base language to claim improved security compared to Qemu, which is developed in C. In some cases (e.g. Firecracker) they target high performance (start-up time, density) for a reduced class of applications (serverless) and only on specific hardware platforms.
In this talk we will start by showing how Qemu is modularized and configurable to deliver a reduced runtime attack surface. With these mechanisms, we will show how it can be customized to a specific Cloud hardware platform, what the impact on the attack surface is, and how this compares to Rust-based VMMs. We will then show that further reduction is possible, by analyzing coverage traces on realistic Cloud workloads. Our conclusion is a set of compile- and run-time mechanisms that can be used to deploy a minimal Qemu to a production-level Cloud infrastructure.
Presenter: Daniele Buono is a Research Staff Member at the IBM T.J. Watson Research Center. He joined the Data-Centric Systems group at IBM Research in 2014, where he focused on High-Performance Computing, specifically Big Data Analytics and linear algebra-based graph algorithms.
He was deeply involved in the early testing of Summit and Sierra, two supercomputers currently placed at number 1 and 2 of the TOP500 list. His work on the project ranged from pre-silicon performance verification of the POWER9 processor to porting benchmarking applications to the supercomputers. He then shifted his interest to Cloud Infrastructure and Security. He is currently working on hardening KVM-based hypervisors for Cloud Applications.