Documentation

A Brief Introduction to Sciunit

From time to time, you may find it hard to verify or reproduce someone else’s research, even though it consists only of programs, data, and output. Programs may be built in different ways, may behave differently when not running in the author’s environment, and can accept different arguments at runtime. All these factors make it difficult to try out others’ work. How about, starting from your next paper, you publish the article along with a “sciunit” research object that encapsulates all the workflows you plan to demonstrate, allowing your readers and reviewers to try out your work?

Basic concepts

Sciunit

“Sciunit” is both the name of the reusable research object we defined and the name of the command-line tool that creates, manages, and shares sciunits. A sciunit consists of multiple executions. Each execution refers to an execution of a command under Linux. The command may be a single binary, may start with the name of a specific virtual machine for managed languages, such as “java,” or may be a shell script that contains multiple commands. An execution may also be a series of terminal inputs that capture your interaction with a UNIX shell. In all cases, the execution has its runtime dependencies determined during an “auditing” phase and saved to a sciunit, and it can be “repeated” later on a different machine without pulling in any dependencies of the execution.

Each execution is assigned an execution id, starting from e1. In your paper, you describe each workflow you want to discuss and reference it using its corresponding execution id in a sciunit – just like referencing a figure using a numerical figure id.

Sciunit is accessible both via a command-line utility and via the sciunit API available in Python. The following tutorial primarily focuses on operating sciunit through the command line. At the end, we show how to use the same commands through the sciunit Python API.

Command-line Tool

The sciunit(1) command-line tool owns the directory ~/sciunit. A workspace for a sciunit is a subdirectory under ~/sciunit. The command

> sciunit create Project1
Opened empty sciunit at ~/sciunit/Project1

creates and opens an empty sciunit called “Project1,” where ~/sciunit/Project1 is its workspace.

You can switch among multiple existing workspaces with the open command:

> sciunit open Project2
Switched to sciunit 'Project2'

You can also bring a sciunit into a workspace and then open it. The following chapters describe more forms of the open command.

Capturing and testing your program

Let’s start with a “Hello World” program in a shell script.

> cat hello.sh
#!/bin/sh
echo 'hello, world'

We can run this program:

> ./hello.sh
hello, world

Now let’s try to capture this program with sciunit. Assume that we just created a new sciunit called “Project1.” Rerun the program with sciunit exec:

> sciunit exec ./hello.sh
hello, world

Now this program, along with all its dependencies, has been captured as e1 in “Project1.” The “show” command can display the details of a captured execution, here the one we just created:

> sciunit show e1
     id: e1
sciunit: Project1
command: ./hello.sh
   size: 2.82 MB
started: 2020-10-10 08:42

As we claimed, this execution can be repeated on a different machine. We will do so in the remaining chapters, but before that, we should test it locally:

> sciunit repeat e1
hello, world

If you investigate the workspace

> ls ~/sciunit/Project1/
cde-package  e1.json  sciunit.db  vvpkg.bin  vvpkg.db

, you will find a directory named “cde-package” and a few other files. The “cde-package” is a temporary directory that consists of all necessary files for this execution to repeat; for example, you can even find a “libc.so.6” somewhere under this directory. The rest of the files are the underlying implementation of the conceptual “sciunit.”
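
To check what was captured without browsing the directory by hand, a short Python sketch can search the workspace. The helper name `find_captured` is hypothetical and not part of sciunit; the only details taken from the text above are the workspace location and the “cde-package” directory name.

```python
from pathlib import Path


def find_captured(workspace, pattern):
    """Return sorted paths under <workspace>/cde-package whose names match pattern."""
    package = Path(workspace) / "cde-package"
    if not package.is_dir():
        # Nothing captured yet, or the workspace path is wrong.
        return []
    return sorted(package.rglob(pattern))


# Workspace path as described above; change "Project1" to your sciunit's name.
workspace = Path.home() / "sciunit" / "Project1"
for path in find_captured(workspace, "libc.so*"):
    print(path)
```

If the workspace does not exist, the loop simply prints nothing.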

Now let’s try a different way to capture an execution – capture as you go:

> sciunit exec -i
Interactive capturing started; press Ctrl-D to end.

Wait, that’s it? No, you are merely inside a subshell: all the commands you run from now on will be captured. For example:

> echo 'hello'
hello
> ./hello.sh
hello, world

Now press “Ctrl-D”:

> exit
Interactive capturing ended

These commands all become execution e2, and you can repeat it as well.

So far, we created two executions within the sciunit. You can list them with the list command

> sciunit list
   e1 Oct 10 08:42 ./hello.sh
   e2 Oct 10 11:00

, or remove one of them with the rm command. Note that after a removal all the remaining executions retain their current execution ids, and new executions will be assigned ids which are higher than the remaining ones.
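
The numbering rule can be modeled in a few lines of Python. This is only a sketch of the behavior described above, not sciunit's implementation: it assumes the next id is one more than the highest remaining id, and `next_execution_id` is a hypothetical helper.

```python
def next_execution_id(existing_ids):
    """Return the id a new execution would get under the assumed rule.

    Assumption: the new number is one more than the highest remaining number.
    """
    numbers = [int(eid[1:]) for eid in existing_ids]
    return "e{}".format(max(numbers, default=0) + 1)


ids = ["e1", "e2", "e3"]
ids.remove("e2")                    # e1 and e3 keep their ids
ids.append(next_execution_id(ids))  # the new execution is numbered after e3
print(ids)  # ['e1', 'e3', 'e4']
```

Because ids are never reassigned to the surviving executions, a reference such as “e3” in your paper keeps pointing at the same execution after a removal.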

Continue your work on another machine

While developing your paper, you might want to capture more executions on another machine, test your sciunit in another environment, or share the sciunit with a coauthor. Conceptually, you want to copy and paste, remotely. The easiest way is to use the copy command:

> sciunit copy
mSLLTj#

Give it a second, and it returns a code. You can then “paste” the sciunit over the Internet by running

sciunit open mSLLTj#

on the target machine. The heavy lifting utilizes the file.io service. The code is only valid for one day. Once pasted, the code is gone.

If you investigate the ~/sciunit directory on the machine in which you initiated the copy,

> ls ~/sciunit
Project1  Project1.zip

you will find a new zip file. As you can imagine, it is a zipped version of the sciunit “Project1.” The open command can also open a zipped sciunit. So if you do not want to use file.io, you can instead use the sciunit copy -n command to generate this file, and deliver the file to some other machine or to someone else.

Prepare your paper for review

The zip file mentioned above is the research object you are going to publish along with your paper. You can manually select and upload such a file on websites that host research objects. However, if you are using HydroShare, maintaining and updating your draft articles can be drastically simplified with the sciunit(1) tool.

Issue the following command to create a new article for the current sciunit:

sciunit push my_new_article --setup hs

“my_new_article” is a codename for your article. Codenames are useful for maintaining multiple articles you created from the same sciunit, and you should pick a codename that describes an article’s use, such as “debug.” “hs” is short for “hydroshare” to indicate the service you are talking to.

The above command prompts you for authorization:

Please go to the following link and authorize access:

https://www.hydroshare.org/o/authorize/?response_type=code&client_id=vG5R4zZFO6uJZBj3m0DWtUK6Va44jTQ4KoqtaLpn&redirect_uri=https:%2F%2Fsciunit.run%2Fcb&state=RCeAb6zxbEuw6yHQPDpWu26iHQIcan

Paste the authorization code:

Here we are running the OAuth2 flow for HydroShare. After you have authorized the sciunit tool in a web browser and pasted in the authorization code,

Paste the authorization code: AoxTbXnjzTfIa3OP5d5unxImPn0Noc
Logged in as "Yuan, Zhihao <lichray@gmail.com>"
Title for the new article: New Article for Project1
new: 8.93MB [00:01, 4.72MB/s]

input the title for the article and wait for the upload to finish. Now you can go to https://www.hydroshare.org/my-resources/ to view your new article on HydroShare. A newly-created article lacks information for publication and is private.

After each successful “push,” the codename involved is recorded for the next

sciunit push

command to pick up. So after you make a few changes to the current sciunit, such as capturing a new execution, the above command can silently keep your article on HydroShare up-to-date. However, if you run

> sciunit push my_new_article --setup hs
Logged in as "Yuan, Zhihao <lichray@gmail.com>"
Create a new version of the article "New Article for Project1"? [y/N]

again, you are creating another new article rather than updating the existing one. If you answer ‘y’, the article will be a new “version” (HydroShare feature) of the existing one; if you answer ‘n’, the new article can have a different title. In case you accidentally run into this query and cannot answer it, just press “Ctrl-D” to cancel the operation.

Sciunit API

All the sciunit commands available through the command-line tool can also be accessed from within your Python program through the sciunit API. The following shows the basic syntax for using the API with a few commands. The rest of the commands are invoked in the same way:


	> from sciunit2.api import Sciunit
	> sciunit = Sciunit()
	> sciunit.create("Project1")
	Opened empty sciunit at ~/sciunit/Project1
	> sciunit.open("Project2")
	Switched to sciunit 'Project2'
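
When a script only needs the command-line behavior described in the tutorial, the same commands can also be driven from Python with the standard library's subprocess module. The sketch below is an alternative to the sciunit API, not part of it: `sciunit_argv` and `run_sciunit` are hypothetical helpers, and the code assumes the sciunit executable is on your PATH.

```python
import subprocess


def sciunit_argv(command, *args):
    """Build the argument vector for one sciunit command-line invocation."""
    return ["sciunit", command, *args]


def run_sciunit(command, *args):
    """Run a sciunit command and return its standard output as text.

    Assumes the sciunit executable is on PATH; raises CalledProcessError
    if the command exits with a nonzero status.
    """
    result = subprocess.run(
        sciunit_argv(command, *args),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Example calls (they require sciunit to be installed):
#   print(run_sciunit("exec", "./hello.sh"))
#   print(run_sciunit("list"))
```

Passing the arguments as a list avoids a shell, which mirrors how sciunit exec itself launches the captured command.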

Manual

sciunit [--version] [--help]

sciunit <command> [<args...>]

DESCRIPTION

A command line utility to create, manage, and share sciunits. A sciunit is a lightweight and portable unit that contains captured, repeatable program executions.

OPTIONS

General Options

--version show program's version number and exit

-h, --help show help message and exit

Commands

sciunit create <name>: Create a new sciunit under ~/sciunit/<name> and open it. If the directory already exists, exit with an error.
sciunit open <name>|<token#>|<path to sciunit.zip>: Open the sciunit under ~/sciunit/<name> or designated by a token# obtained from sciunit copy, or one in a zipped sciunit package by extracting it to a temporary directory.
sciunit open -m <name>: Rename the currently-opened sciunit to <name> and open it.
sciunit exec <executable> [<args...>]: Capture the execution of the given executable with the command line arguments args. The newly-created execution is added to the currently-opened sciunit and assigned execution id "eN", where N is a monotonically-increasing decimal. The first execution created in a sciunit has execution id "e1". Note that the command line is launched using execvp(3) rather than interpreted by a shell.
sciunit exec -i: Launch the current user's shell and capture the user's interactions with the shell. This may involve executing multiple commands. A new execution is created on exiting the shell.
sciunit list: List the existing executions in the currently-opened sciunit.
sciunit show <execution id>: Show detailed information about a specific execution in the currently-opened sciunit.
sciunit repeat <execution id>: Repeat the execution of execution id from the currently-opened sciunit exactly as it happened earlier.
sciunit given <glob> 'repeat' <execution id> [<%|args...>]: Repeat the execution of execution id with additional files or directories specified by glob. The command expands glob into a list of filenames in the style of glob(3), substitutes the first occurrence of %, if any, in the optional args for the 'repeat' mini-command with those filenames, and repeats the execution as if those files or directories are available relative to its current working directory at capture time.
sciunit commit: Commit the last repetition done by the repeat or the given command in the currently-opened sciunit as a new execution.
sciunit rm <execution id>: Remove an existing execution from the currently-opened sciunit. A malformed execution id causes an error. Removing a nonexistent execution has no effect.
Note: the execution is removed from the records, but its data remains and may be shared with other executions.
sciunit rm <eN>-[M]: Remove executions within a range, from eN to eM, inclusive. M may be omitted for a range from eN to the most recent execution.
sciunit diff <execution id1> <execution id2>: Show the differences between the two given executions in terms of their directory structures.
sciunit sort <execution ids...>: Reorder the executions in the currently-opened sciunit to ensure that the executions specified in the arguments appear consecutively in the sciunit list command.
sciunit push <codename> --setup <service>: Create an article on a research object sharing service and attach the currently opened sciunit to the article. Assign different codenames to track multiple articles or multiple versions of an article created from a sciunit. The supported services include Figshare (fs) and HydroShare (hs).
sciunit push [<codename>]: Update the last pushed article with the latest sciunit data if no argument is present. Otherwise, update the article referred to by codename.
sciunit copy: Copy the currently-opened sciunit to file.io and obtain a token for remotely opening it. The token is invalidated after being accessed or after one day, whichever happens first.
sciunit copy <remote>|<name>: Copy a sciunit to a remote server or ~/sciunit/<name>.
sciunit copy -n: Archive the currently-opened sciunit to ~/sciunit/<name>.zip.
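
The glob expansion and % substitution performed by the given command can be illustrated with a small Python model. It is a sketch under stated assumptions rather than sciunit's own code: `expand_given` is a hypothetical helper, % is assumed to stand alone as one argument, and the matches are sorted because glob(3) sorts its results by default.

```python
import glob


def expand_given(pattern, args):
    """Model how the arguments for a 'given ... repeat' call might be built.

    Assumption: '%' appears as a separate argument and is replaced by all
    matched filenames; only its first occurrence is substituted.
    """
    filenames = sorted(glob.glob(pattern))
    args = list(args)
    if "%" in args:
        i = args.index("%")
        args[i:i + 1] = filenames  # splice the filenames in place of '%'
    return filenames, args


# Expands against the current directory; the result depends on its contents.
print(expand_given("*.csv", ["--input", "%"]))
```

With files a.csv and b.csv in the current directory, the arguments `--input %` would become `--input a.csv b.csv` under this model.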