March 3, 2020, 4:00 – 5:10 PM
Providing fast and meaningful insights into enterprise datacenters
Abstract: In this talk I will present an overview of SnailTrail, a system that leverages existing datacenter logging pipelines to ingest and process logs of events in real time and provide datacenter administrators with timely insights into the functionality of the running systems. The talk will focus on two use cases: (i) online reconstruction of user sessions from individual logs, which is often the first step in many datacenter management tasks, and (ii) online critical path analysis of long-running applications, which can be used to identify performance bottlenecks at runtime.
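As a concrete illustration of use case (i), here is a minimal sketch of online session reconstruction: per-user log events arrive as a stream, and a session is emitted once an inactivity gap closes it. This is a toy version written for this summary, not SnailTrail's implementation; the 30-second gap and the (user_id, timestamp) event format are assumptions made for the example.

    from collections import defaultdict

    SESSION_GAP = 30.0  # seconds of inactivity that close a session (assumed value)

    open_sessions = defaultdict(list)  # user_id -> timestamps in the current session

    def on_event(user_id, timestamp, emit_session):
        """Process one log event as it arrives; emit a session once it closes."""
        events = open_sessions[user_id]
        if events and timestamp - events[-1] > SESSION_GAP:
            emit_session(user_id, events)         # session closed by inactivity
            open_sessions[user_id] = [timestamp]  # this event starts a new session
        else:
            events.append(timestamp)              # event extends the current session

    # Example: feed events in arrival order and print sessions as they close.
    for uid, ts in [("u1", 0.0), ("u1", 10.0), ("u1", 55.0)]:
        on_event(uid, ts, lambda u, s: print(u, s))  # prints: u1 [0.0, 10.0]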

Presenter: Ioannis (John) Liagouris is a research scientist at the Hariri Institute for Computing and an adjunct assistant professor at Boston University. His research interests lie in distributed systems and databases. Before joining BU, he was a visiting scholar at the RISELab, UC Berkeley, a senior researcher (Oberassistent) at the Systems Group, ETH Zurich, a visiting research fellow at the University of Hong Kong (HKU), and a research assistant at the Information Management Systems Institute (IMSI) of the “Athena” Research Center, Greece. John obtained a 5-year diploma in Electrical and Computer Engineering in 2008, and a PhD in 2015, both from the National Technical University of Athens (NTUA).
A Just-in-Time Framework for Tracing Distributed Cloud Applications
Abstract: It is extremely difficult to know a priori where to enable instrumentation to help diagnose problems that may occur in the future. We present Pythia, an automated cross-layer instrumentation framework that explores the space of possible instrumentation choices and enables the instrumentation needed to diagnose a newly observed problem in production systems. Pythia builds on distributed tracing and uses statistical techniques to identify where instrumentation is needed. This talk will discuss: (1) the scalable design of Pythia; (2) our progress on identifying promising data structures to represent the instrumentation search space across multiple datacenter stack layers (e.g., application and kernel), structures that must trade off compactness, exhaustiveness, and accuracy; and (3) algorithms to search this space quickly while staying under a specific instrumentation budget.
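To make the instrumentation search concrete, below is a minimal sketch of budget-constrained selection: greedily enable the choices with the best expected diagnostic value per unit of overhead. The choice names, overhead numbers, and information-gain scores are hypothetical illustrations, not Pythia's actual design or API.

    def select_instrumentation(choices, budget):
        """choices: list of (name, overhead, expected_info_gain); greedy knapsack."""
        selected, spent = [], 0.0
        # Prefer choices with the best information gain per unit of overhead.
        for name, overhead, gain in sorted(
                choices, key=lambda c: c[2] / c[1], reverse=True):
            if spent + overhead <= budget:
                selected.append(name)
                spent += overhead
        return selected

    # Example: pick tracepoints across stack layers under a fixed overhead budget.
    print(select_instrumentation(
        [("app:rpc_entry", 1.0, 0.9), ("kernel:syscall", 3.0, 1.2),
         ("app:lock_wait", 2.0, 0.4)], budget=5.0))
    # -> ['app:rpc_entry', 'kernel:syscall']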

Presenter: Emre Ates (emreates.github.io) is a Ph.D. candidate with the Department of Electrical & Computer Engineering at Boston University. He received his B.Sc. in Electrical and Electronics Engineering from the Middle East Technical University, Turkey. His current research interests include automated analytics on large-scale computing systems and distributed systems.
Workflow motif: an abstraction for debugging distributed systems
Abstract: Abstractions, such as APIs and libraries, enable complex distributed applications to be built. This is because they allow developers to build more complicated applications out of smaller building blocks (i.e., smaller applications) without necessarily understanding the latter’s implementation details. But when it comes to problem diagnosis, there are very few abstractions to help engineers. In many cases, engineers must diagnose problems using no abstractions whatsoever (i.e., using flat logs of raw application events). In this proposal, we discuss the workflow motif, which we argue is a powerful abstraction that is useful for a variety of diagnosis tasks. Workflow motifs describe frequent processing patterns observed across distributed application requests. They represent the building blocks of how distributed applications process different requests. Understanding their properties provides significant insight into a distributed application’s performance and correctness.
To enable motif extraction, we first need to capture how requests are processed within a distributed application and then find the frequent motifs within those executions. The former can be enabled by end-to-end tracing infrastructures; the latter can be extracted using frequent pattern mining algorithms. End-to-end tracing infrastructures such as X-Trace and Dapper capture the workflow of how a request is serviced during its execution and represent these traces as execution graphs. However, the raw output of these tracing infrastructures does not meet all of our requirements, because (1) to extract patterns, LeitMotif needs to know the boundaries of a pattern within the execution graph, and (2) it needs to be aware of synchronization points; both Dapper and X-Trace fall short of these requirements. Likewise, existing approaches for finding frequent patterns in graphs are too general and do not meet our requirements: (1) they do not respect the causal dependencies that exist within execution graphs, (2) they do not consider the boundaries required for patterns, and (3) they are time-consuming and do not scale. In this research, we show how LeitMotif is useful for debugging and diagnosing distributed applications, and how we address the shortcomings of its enablers to make it a practical approach.
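For intuition about the mining step, the sketch below counts frequent fixed-length causal paths across execution graphs, following edge direction so that causality is respected. It is a deliberately simplified illustration, not LeitMotif itself; the graph and label representation is an assumption made for the example.

    from collections import Counter
    from itertools import chain

    def causal_paths(graph, labels, length):
        """Yield label sequences along directed paths of exactly `length` nodes."""
        def walk(node, path):
            if len(path) == length:
                yield tuple(labels[n] for n in path)
                return
            for nxt in graph.get(node, []):   # follow causal successors only
                yield from walk(nxt, path + [nxt])
        for node in graph:
            yield from walk(node, [node])

    def frequent_motifs(executions, length=3, min_support=2):
        """executions: list of (graph, labels); keep motifs seen min_support times."""
        counts = Counter(chain.from_iterable(
            causal_paths(g, lab, length) for g, lab in executions))
        return {motif: n for motif, n in counts.items() if n >= min_support}

    # Example: two executions sharing the motif ('recv', 'auth', 'db').
    g = {1: [2], 2: [3], 3: []}
    lab = {1: "recv", 2: "auth", 3: "db"}
    print(frequent_motifs([(g, lab), (g, lab)]))  # -> {('recv', 'auth', 'db'): 2}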

Presenter: Mania Abdi is a Ph.D. candidate in the Khoury College of Computer Sciences at Northeastern University. She received her M.Sc. in Computer Engineering from the Sharif University of Technology, Iran. Her research interests are in distributed storage, debugging distributed systems, graph mining, and machine learning.
Challenges and opportunities in large-scale stream processing
Abstract: Recent efforts by academia and open-source communities have established stream processing as a principal data analysis technology across industry. All major cloud vendors offer streaming dataflow pipelines and online analytics as managed services. Notable use cases include real-time fault detection in space networks, city traffic management, dynamic pricing for car-sharing, and anomaly detection in financial transactions. At the same time, streaming dataflow systems are increasingly being used for event-driven applications beyond analytics, such as orchestrating microservices and model serving. In this talk, I will discuss recent advances, trends, and open problems in large-scale stream processing.

Presenter: Vasiliki (Vasia) Kalavri is an Assistant Professor in the Department of Computer Science at Boston University. She works on distributed stream processing and large-scale graph analytics, and she is a PMC member of Apache Flink. Before joining BU, Vasia was a postdoctoral fellow at ETH Zurich and received a joint PhD from KTH Stockholm and UCLouvain.
Fuzzing Virtual Devices in Cloud Hypervisors
Abstract: The market for public cloud platforms is valued in the hundreds of billions of dollars. Hypervisors form the backbone of the cloud and are, therefore, security-critical applications that are attractive targets for potential attackers. Past vulnerabilities demonstrate that the implementations of virtual devices are the most common source of security bugs in hypervisors. In my talk, I will present our novel approach for fuzzing virtual devices in the popular open-source QEMU hypervisor. Our fuzzer combines a standard coverage-guided strategy with further guidance based on hypervisor-specific behaviors. It guarantees reproducible input execution and can, optionally, take advantage of existing virtual-device test cases. Using our fuzzer, we have already found, reported, and helped patch bugs in devices such as virtio-net and the serial device.
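For readers unfamiliar with the technique, the skeleton of a coverage-guided fuzzing loop is sketched below: keep any input that exercises new coverage and mutate the corpus further. The real QEMU virtual-device fuzzer layers reproducible input execution and hypervisor-specific guidance on top of this, and run_device with its coverage-set feedback is a hypothetical stand-in for the device under test.

    import random

    def mutate(data):
        """Flip one random bit; real fuzzers use many more mutation strategies."""
        if not data:
            return bytes([random.randrange(256)])
        i = random.randrange(len(data))
        return data[:i] + bytes([data[i] ^ (1 << random.randrange(8))]) + data[i+1:]

    def fuzz(run_device, seeds, iterations=10_000):
        """run_device(input) -> set of coverage edges hit (hypothetical feedback)."""
        corpus = list(seeds)   # inputs that reached new coverage so far
        seen = set()           # coverage edges observed so far
        for _ in range(iterations):
            candidate = mutate(random.choice(corpus))
            new_edges = run_device(candidate) - seen
            if new_edges:      # keep inputs that explore new device behavior
                seen |= new_edges
                corpus.append(candidate)
        return corpus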

Presenter: Alexander Bulekov is a PhD student studying Computer Engineering at Boston University, advised by Prof. Manuel Egele. Alexander is an intern at Red Hat, where he is researching hypervisor security.