  1. PROVaaS: Provenance-as-a-Service. In Theory and Practice of Provenance (TaPP), poster, 2014. [ PDF]

  2. H. Meng, R. Kommineni, Q. Pham, R. Gardner, T. Malik, and D. Thain. An Invariant Framework for Conducting Reproducible Computational Science. In Journal of Computational Science, Elsevier, 2015. Invited [ PDF], [ Software]

    Computational reproducibility depends on the ability to not only isolate necessary and sufficient computational artifacts but also to preserve those artifacts for later re-execution. Both isolation and preservation present challenges, in large part due to the complexity of existing software and systems as well as the implicit dependencies, resource distribution, and shifting compatibility of systems that result over time, all of which conspire to break the reproducibility of an application. Sandboxing is a technique that has been used extensively in OS environments in order to isolate computational artifacts. Several tools were proposed recently that employ sandboxing as a mechanism to ensure reproducibility. However, none of these tools preserve the sandboxed application for re-distribution to a larger scientific community, aspects that are equally crucial for ensuring reproducibility as sandboxing itself. In this paper, we describe a framework of combined sandboxing and preservation, which is not only efficient and invariant, but also practical for large-scale reproducibility. We present case studies of complex high-energy physics applications and show how the framework can be useful for sandboxing, preserving, and distributing applications. We report on the completeness, performance, and efficiency of the framework, and suggest possible standardization approaches.
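
    The framework captures an application's dependencies during a reference run and preserves them for later re-execution. A minimal sketch of that isolate-then-preserve idea, assuming a Linux host with strace available, is shown below; the paper's framework builds on dedicated sandboxing tools rather than strace log parsing, and the function names and package layout here are invented for illustration.

    ```python
    # Illustrative only: trace a reference run, then preserve the files it used.
    import os
    import re
    import shutil
    import subprocess
    import sys

    def trace_file_accesses(cmd, log="trace.log"):
        """Run cmd under strace and return the set of regular files it opened."""
        subprocess.run(
            ["strace", "-f", "-e", "trace=open,openat", "-o", log, "--"] + cmd,
            check=True,
        )
        opened = set()
        quoted_path = re.compile(r'"([^"]+)"')
        with open(log) as fh:
            for line in fh:
                match = quoted_path.search(line)
                if match and os.path.isfile(match.group(1)):
                    opened.add(os.path.realpath(match.group(1)))
        return opened

    def preserve(paths, package_dir="package"):
        """Copy each accessed file into package_dir, preserving its layout."""
        for path in sorted(paths):
            dest = os.path.join(package_dir, path.lstrip("/"))
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.copy2(path, dest)

    if __name__ == "__main__":
        # e.g. python preserve.py ./analysis input.dat
        preserve(trace_file_accesses(sys.argv[1:]))
    ```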

  3. Q. Pham, S. Thaler, T. Malik, B. Glavic, I. Foster. Light-weight Database Virtualization. In IEEE International Conference on Data Engineering (ICDE), 2015. [ PDF], [ PPT], [ Software]

    We present a light-weight database virtualization (LDV) system that allows users to share and re-execute applications that operate on a relational database (DB). Previous methods for sharing DB applications, such as companion websites and virtual machine images (VMIs), support neither easy and efficient re-execution nor the sharing of only a relevant DB subset. LDV addresses these issues by monitoring application execution, including DB operations, and using the resulting execution trace to create a lightweight re-executable package. A LDV package includes, in addition to the application, either the DB management system (DBMS) and relevant data or, if the DBMS and/or data cannot be shared, just the application-DBMS communications for replay during re-execution. We introduce a linked DB-operating system provenance model and show how to infer data dependencies based on temporal information about the DB operations performed by the application's process(es). We use this model to determine the DB subset that needs to be included in a package in order to enable re-execution. We compare LDV with other sharing methods in terms of package size, monitoring overhead, and re-execution overhead. We show that LDV packages are often more than an order of magnitude smaller than a VMI for the same application, and have negligible re-execution overhead.
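
    A toy sketch of the temporal inference step, assuming a flat list of timestamped events: an output written by a process is conservatively taken to depend on every DB read that the same process completed before the write began. The event kinds and fields below are invented for illustration and are not LDV's actual provenance schema.

    ```python
    # Illustrative event trace and dependency inference; not LDV's schema.
    from dataclasses import dataclass

    @dataclass
    class Event:
        pid: int        # process that performed the operation
        kind: str       # "db_read", "db_write", "file_write", ...
        target: str     # table or file touched
        start: float    # operation start time
        end: float      # operation end time

    def infer_dependencies(events):
        """Map each written file to the DB reads that temporally precede it."""
        deps = {}
        for out in (e for e in events if e.kind == "file_write"):
            deps[out.target] = [
                e.target for e in events
                if e.kind == "db_read" and e.pid == out.pid and e.end <= out.start
            ]
        return deps

    trace = [
        Event(100, "db_read", "samples", 0.0, 0.4),
        Event(100, "file_write", "report.csv", 0.9, 1.0),
    ]
    print(infer_dependencies(trace))   # {'report.csv': ['samples']}
    ```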

  4. Q. Pham, T. Malik, I. Foster. SOLE: Towards Descriptive and Interactive Publications. In Implementing Reproducible Research, Editors: Victoria Stodden et al., CRC Press, 2014. [ PDF], [ Software]

    Prior to the computation-driven revolution in science, research papers provided the primary mechanism for sharing novel methods and data. Papers described experiments involving small amounts of data, derivations on that data, and associated methods and algorithms. Readers reproduced the results by repeating the physical experiment, performing hand calculations, and/or following the logical argument. The scientific method in this decade has become decisively computational, involving large quantities of data, complex data manipulation tasks, and large, and often distributed, software stacks. The research paper, in its current text form, is only able to summarize the associated data and computation rather than reproduce it computationally. While papers corroborate descriptions through indirect means, such as by building companion websites that share data and software packages, these external websites remain disconnected from the content within the paper, making it difficult to verify claims and reproduce results. There is a critical need for systems that minimize this disconnect. We describe Science Object Linking and Embedding (SOLE), a framework for creating descriptive and interactive publications by linking them with associated science objects, such as source code, datasets, annotations, workflows, re-playable packages, and virtual machine images. SOLE provides a suite of tools that assist the author in creating and hosting science objects that can then be linked with research papers for the purpose of assessment, repeatability, and verification of research. The framework also creates a linkable representation of the science object with the publication and manages a bibliography-like specification of science objects. In this chapter, we introduce SOLE and describe its use for augmenting the content of computation-based scientific publications. We present examples from climate science, chemistry, biology, and computer science.
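
    A hypothetical example of what a bibliography-like specification of science objects might look like, and how it could be rendered as a citation-style listing; the tags, fields, and output format are invented for illustration and are not SOLE's actual representation.

    ```python
    # Invented specification format; SOLE's real representation differs.
    science_objects = {
        "replay-package-v1": {
            "type": "re-executable package",
            "uri": "https://example.org/objects/replay-package-v1.tar.gz",
            "cited_in": ["Section 4.2", "Figure 3"],
        },
        "climate-run-2013": {
            "type": "dataset",
            "uri": "https://example.org/objects/climate-run-2013",
            "cited_in": ["Table 1"],
        },
    }

    def bibliography(objects):
        """Render the specification as a numbered, citation-like listing."""
        lines = []
        for i, (tag, obj) in enumerate(sorted(objects.items()), start=1):
            where = ", ".join(obj["cited_in"])
            lines.append(f"[SO{i}] {tag} ({obj['type']}): {obj['uri']} -- cited in {where}")
        return "\n".join(lines)

    print(bibliography(science_objects))
    ```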

  5. Q. Pham, T. Malik, I. Foster. Auditing and Maintaining Provenance in Software Packages. In International Provenance and Annotation Workshop (IPAW), 2014.

    [ PDF], [ PPT], [ Software]

    Science projects are increasingly investing in computational reproducibility. Constructing software pipelines to demonstrate reproducibility is also becoming increasingly common. To aid the process of constructing pipelines, science project members often adopt reproducible methods and tools. One such tool is CDE, a software packaging tool that encapsulates source code, datasets, and environments. However, CDE does not include information about the origins of dependencies. Consequently, when multiple CDE packages are combined and merged to create a software pipeline, several issues arise that require an author to manually verify the compatibility of distributions, environment variables, software dependencies, and compiler options. In this work, we propose that software provenance be included as part of CDE so that the resulting provenance-included CDE packages can be easily used for creating software pipelines. We describe the provenance attributes that must be included and how they can be efficiently stored in a light-weight CDE package. Furthermore, we show how the provenance in a package can be used for creating software pipelines and maintained as new packages are created. We experimentally evaluate the overhead of auditing and maintaining provenance and compare it with heavyweight approaches for reproducibility such as virtualization. Our experiments indicate minimal overheads.
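
    A small sketch of the kind of compatibility check that package-level provenance enables when two packages are merged into a pipeline; the attribute names below (distro, libraries, env) are placeholders standing in for the provenance attributes described in the paper.

    ```python
    # Placeholder provenance attributes; the paper defines its own set.
    def find_conflicts(prov_a, prov_b):
        """Return human-readable conflicts between two packages' provenance."""
        conflicts = []
        if prov_a["distro"] != prov_b["distro"]:
            conflicts.append(f"distribution: {prov_a['distro']} vs {prov_b['distro']}")
        for lib in sorted(prov_a["libraries"].keys() & prov_b["libraries"].keys()):
            if prov_a["libraries"][lib] != prov_b["libraries"][lib]:
                conflicts.append(
                    f"{lib}: {prov_a['libraries'][lib]} vs {prov_b['libraries'][lib]}"
                )
        for var in sorted(prov_a["env"].keys() & prov_b["env"].keys()):
            if prov_a["env"][var] != prov_b["env"][var]:
                conflicts.append(f"${var} differs between packages")
        return conflicts

    pkg1 = {"distro": "Ubuntu 12.04", "libraries": {"libc": "2.15"}, "env": {"LC_ALL": "C"}}
    pkg2 = {"distro": "Ubuntu 12.04", "libraries": {"libc": "2.19"}, "env": {"LC_ALL": "C"}}
    print(find_conflicts(pkg1, pkg2))   # ['libc: 2.15 vs 2.19']
    ```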

  6. D. Zhao, C. Shou, T. Malik, I. Raicu. Distributed Data Provenance for Large-Scale Data-Intensive Computing. In IEEE Cluster, 2013. [ PDF]

    It has become increasingly important to capture and understand the origins and derivation of data (its provenance). A key issue in evaluating the feasibility of data provenance is its performance, overheads, and scalability. In this paper, we explore the feasibility of a general metadata storage and management layer for parallel file systems, in which metadata includes both file operations and provenance metadata. We experimentally investigate whether provenance metadata should be loosely coupled with or tightly integrated into the file metadata storage system. We consider two systems that have applied similar distributed concepts to metadata management, but each focusing on a single kind of metadata: (i) FusionFS, which implements distributed file metadata management based on distributed hash tables, and (ii) SPADE, which uses a graph database to store audited provenance data and provides a distributed module for querying provenance. Our results on a 32-node cluster show that FusionFS+SPADE is a promising prototype with negligible provenance overhead and promise to scale to petascale and beyond. Furthermore, FusionFS with its own storage layer for provenance capture is able to scale up to 1K nodes on a BlueGene/P supercomputer.
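
    A toy sketch of the tightly integrated design, assuming hash-partitioned placement: a file's provenance records are written into the same key space as its file metadata, so both land on the node responsible for that file and a provenance lookup needs no extra routing step. The node count and record layout are invented for illustration.

    ```python
    # Invented node count and record layout; illustrates co-located placement.
    import hashlib
    from collections import defaultdict

    NODES = 32   # e.g., one metadata partition per cluster node

    def node_for(path):
        """Hash-partition a file path onto a node, DHT-style."""
        return int(hashlib.sha1(path.encode()).hexdigest(), 16) % NODES

    store = defaultdict(list)   # node id -> list of (path, record) entries

    def record(path, record_type, payload):
        store[node_for(path)].append((path, {"type": record_type, **payload}))

    # File metadata and its provenance land on the same node, so a
    # provenance lookup needs no extra routing step.
    record("/data/run42/output.h5", "file_meta", {"size_bytes": 1 << 20})
    record("/data/run42/output.h5", "provenance", {"generated_by": "simulate.py"})
    print(store[node_for("/data/run42/output.h5")])
    ```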

  7. Q. Pham, T. Malik, I. Foster. Using Provenance for Repeatability. In USENIX NSDI Workshop on Theory and Practice of Provenance (TaPP), 2013. [ PDF], [ PPT], [ Software]

    We present Provenance-to-use (PTU), a tool that minimizes computation time during repeatability testing. Authors can use PTU to build a package of their software program and include a provenance trace of an initial, reference execution. Testers can perform a partial deterministic replay of the package by choosing a subset of the processes based on the processes' compute, memory, and I/O utilization obtained during the reference execution. Using the provenance trace, PTU guarantees that events are processed in the same order using the same data from one execution to the next. We show the efficiency of PTU for conducting repeatability testing of workflow-based scientific programs.
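
    A hypothetical sketch of the selection step, assuming per-process utilization figures recorded during the reference run: expensive processes are skipped and their recorded outputs reused, while cheap ones are re-executed. The threshold and field names are illustrative and are not PTU's actual criteria.

    ```python
    # Illustrative utilization figures and threshold; not PTU's criteria.
    processes = [
        {"pid": 1, "name": "preprocess", "cpu_s": 2.0,   "io_mb": 10},
        {"pid": 2, "name": "simulate",   "cpu_s": 940.0, "io_mb": 300},
        {"pid": 3, "name": "plot",       "cpu_s": 1.5,   "io_mb": 5},
    ]

    def plan_replay(procs, cpu_budget_s=60.0):
        """Split processes into those to re-execute and those replayed from the trace."""
        rerun, reuse = [], []
        for p in procs:
            (reuse if p["cpu_s"] > cpu_budget_s else rerun).append(p["name"])
        return rerun, reuse

    rerun, reuse = plan_replay(processes)
    print("re-execute:", rerun)              # ['preprocess', 'plot']
    print("reuse recorded outputs:", reuse)  # ['simulate']
    ```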

  8. T. Malik, A. Gehani, D. Tariq, F. Zaffar. Managing and Querying Distributed Data Provenance in SPADE. In Data Provenance and Data Management for eScience, Springer, 2012. [ PDF], [ Software]

    Users can determine the precise origins of their data by collecting detailed provenance records. However, auditing at a finer grain produces large amounts of metadata. To efficiently manage the collected provenance, several provenance management systems, including SPADE, record provenance on the hosts where it is generated. Distributed provenance raises the issue of efficient reconstruction during the query phase. Recursively querying provenance metadata or computing its transitive closure is known to have limited scalability and cannot be used for large provenance graphs. We present matrix filters, which are novel data structures for representing graph information, and demonstrate their utility for improving query efficiency with experiments on provenance metadata gathered while executing distributed workflow applications.
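
    A simplified sketch in the spirit of pre-computed reachability summaries, assuming edges are processed in topological order: each vertex carries a Bloom-filter-like bit signature of its ancestors, so an ancestry check becomes a set-membership test instead of a recursive query. This illustrates the general approach only and is not the matrix filter structure itself.

    ```python
    # Simplified ancestor signatures; not the paper's matrix filter structure.
    import hashlib

    BITS = 256

    def bits_for(vertex):
        """Map a vertex id to a small set of bit positions (two hash functions)."""
        digest = hashlib.sha1(vertex.encode()).digest()
        return {digest[0] % BITS, digest[1] % BITS}

    def build_signatures(edges):
        """edges: (parent, child) pairs in topological order -> child signatures."""
        sig = {}
        for parent, child in edges:
            inherited = sig.get(parent, set()) | bits_for(parent)
            sig[child] = sig.get(child, set()) | inherited
        return sig

    def maybe_ancestor(sig, ancestor, descendant):
        """False means definitely not an ancestor; True means possibly one."""
        return bits_for(ancestor) <= sig.get(descendant, set())

    edges = [("raw.dat", "clean.dat"), ("clean.dat", "figure.png")]
    sig = build_signatures(edges)
    print(maybe_ancestor(sig, "raw.dat", "figure.png"))   # True
    print(maybe_ancestor(sig, "figure.png", "raw.dat"))   # False
    ```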

  9. A. Gehani, D. Tariq, B. Baig, T. Malik. Policy-Based Integration of Provenance Metadata. In IEEE International Symposium on Policies for Distributed Systems and Networks (POLICY), 2011. [ PDF]

    Reproducibility has been a cornerstone of the scientific method for hundreds of years. The range of sources from which data now originates, the diversity of the individual manipulations performed, and the complexity of the orchestrations of these operations all limit the reproducibility that a scientist can ensure solely by manually recording their actions.

  10. T. Malik, L. Nistor, A. Gehani. Tracking and Sketching Distributed Data Provenance. In IEEE eScience, 2010. [ PDF], [ Software]

    Current provenance collection systems typically gather metadata on remote hosts and submit it to a central server. In contrast, several data-intensive scientific applications require a decentralized architecture in which each host maintains an authoritative local repository of the provenance metadata gathered on that host. The latter approach allows the system to handle the large amounts of metadata generated when auditing occurs at fine granularity, and allows users to retain control over their provenance records. The decentralized architecture, however, increases the complexity of auditing, tracking, and querying distributed provenance. We describe a system for capturing data provenance in distributed applications, and the use of provenance sketches to optimize subsequent data provenance queries. Experiments with data gathered from distributed workflow applications demonstrate the feasibility of a decentralized provenance management system and improvements in the efficiency of provenance queries.
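
    A toy sketch of summary-based host pruning, assuming each host publishes which artifact identifiers appear in its local provenance store: an ancestry query consults the summaries before each hop and contacts only hosts that may hold the artifact. Plain sets stand in for the compact sketch structures, and the host and artifact names are made up.

    ```python
    # Made-up hosts and artifacts; plain sets stand in for compact sketches.
    local_stores = {
        "host-a": {"raw.dat": ["sensor-feed"], "clean.dat": ["raw.dat"]},
        "host-b": {"figure.png": ["clean.dat"]},
        "host-c": {"log.txt": []},
    }

    summaries = {host: set(store) for host, store in local_stores.items()}

    def hosts_to_query(artifact):
        """Contact only hosts whose summary may hold the artifact."""
        return [host for host, summary in summaries.items() if artifact in summary]

    def lineage(artifact):
        """Walk ancestry across hosts, consulting summaries before each hop."""
        ancestors, frontier = [], [artifact]
        while frontier:
            current = frontier.pop()
            for host in hosts_to_query(current):
                for parent in local_stores[host].get(current, []):
                    ancestors.append(parent)
                    frontier.append(parent)
        return ancestors

    print(lineage("figure.png"))   # ['clean.dat', 'raw.dat', 'sensor-feed']
    ```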

  11. D. That, G. Fils, Z. Yuan, T. Malik. Sciunits: Reusable Research Objects. In IEEE eScience, 2017. [ PDF], [ PPT]

    Science is conducted collaboratively, often requiring knowledge sharing about computational experiments. When experiments include only datasets, they can be shared using Uniform Resource Identifiers (URIs) or Digital Object Identifiers (DOIs). An experiment, however, seldom includes only datasets, but more often includes software, its past execution, provenance, and associated documentation. The Research Object has recently emerged as a comprehensive and systematic method for aggregation and identification of diverse elements of computational experiments. While a necessary method, mere aggregation is not sufficient for the sharing of computational experiments. Other users must be able to easily recompute on these shared research objects. In this paper, we present the sciunit, a reusable research object in which aggregated content is recomputable. We describe a Git-like client that efficiently creates, stores, and repeats sciunits. We show through analysis that sciunits repeat computational experiments with minimal storage and processing overhead. Finally, we provide an overview of sharing and reproducible cyberinfrastructure based on sciunits gaining adoption in the domain of geosciences.
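
    A minimal sketch of the Git-like, content-addressed storage idea that keeps repeated packaging cheap, assuming a local objects/ directory: a file already stored under its content hash is not stored again, so sciunits with overlapping contents share blobs. The layout and function names are invented and are not the sciunit client's actual on-disk format.

    ```python
    # Invented layout; illustrates deduplication via content addressing.
    import hashlib
    import os
    import shutil

    STORE = "objects"

    def add_blob(path):
        """Store a file under the SHA-256 of its contents; skip if already present."""
        with open(path, "rb") as fh:
            digest = hashlib.sha256(fh.read()).hexdigest()
        dest = os.path.join(STORE, digest[:2], digest[2:])
        if not os.path.exists(dest):                 # deduplication happens here
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            shutil.copy2(path, dest)
        return digest

    def package(paths):
        """A package is just a manifest mapping each path to its content hash."""
        return {p: add_blob(p) for p in paths}

    # Packaging a second experiment that reuses the same inputs adds no new
    # blobs for those inputs; only the manifest differs.
    ```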

Presentations

  1. GeoDataspace: Better Tools for Metadata Management. The EarthCube All Hands Meeting, May 2015.

  2. A Reproducible Framework Powered By Globus. GlobusWorld, Apr. 2015. (Presented by Kyle Chard)

  3. GeoDataspace: Simplifying Data Management Tasks with Globus. The American Geophysical Union (AGU), Dec. 2014.

  4. Reproducibility is hard. Not NP-hard. The Notre Dame DASPOS Workshop, Sept. 2014.

  5. Towards Verifiable Publications. The SIAM Annual Meeting, Jul. 2014.

  6. Active Publications. IEEE eScience, 2013.