Documentation
A Brief Introduction to Sciunit
From time to time, you may find that it is hard to verify or reproduce someone
else’s research, even though it is only programs, data, and output.
Programs may be built in different ways, may behave differently if not running
in the author’s environment, and can accept different arguments at runtime.
All these factors contribute to the difficulties of trying out others’ work.
How about, starting from your next paper, you publish the article along with a
“sciunit” research object to encapsulate all the workflows you plan to
demonstrate, allowing your readers and reviewers to try out your work?
Basic concepts
Sciunit
“Sciunit” is the name of both the reusable research object we defined and
the command-line tool that creates, manages, and shares sciunits. A
sciunit consists of multiple executions. Each execution refers to an
execution of a command under Linux. The command may be a single binary, may
start with the name of a specific virtual machine for managed languages, such
as “java,” or may be a shell script that contains multiple commands. An
execution may also be a series of terminal inputs that capture your
interaction with a UNIX shell. In all cases, the runtime dependencies of the
execution are determined during an “auditing” phase and saved to the sciunit,
so the execution can be “repeated” later on a different machine without
pulling in any dependencies.
Each execution is assigned an execution id, starting from e1. In your
paper, you describe each workflow you want to discuss and reference it using
its corresponding execution id in a sciunit – just like referencing a figure
using a numerical figure id.
Sciunit is accessible both through a command-line utility and through the sciunit API available in Python. The following tutorial focuses primarily on operating sciunit from the command line. At the end, we show how to use the same commands through the sciunit Python API.
Command-line Tool
The sciunit(1) command-line tool owns the directory ~/sciunit. A
workspace for a sciunit is a subdirectory under ~/sciunit. The command
> sciunit create Project1
Opened empty sciunit at ~/sciunit/Project1
creates and opens an empty sciunit called “Project1,” where
~/sciunit/Project1 is its workspace.
You can switch among multiple existing workspaces with the open command:
> sciunit open Project2
Switched to sciunit 'Project2'
You can also bring a sciunit into a workspace and then open it. The following
chapters describe more forms of the open command.
Capturing and testing your program
Let’s start with a “Hello World” program in a shell script.
> cat hello.sh
#!/bin/sh
echo 'hello, world'
We can run this program
> ./hello.sh
hello, world
Now let’s try to capture this program with sciunit. Assume that we just
created a new sciunit called “Project1.” Rerun the program with sciunit exec:
> sciunit exec ./hello.sh
hello, world
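The manual notes that sciunit exec launches its command with execvp(3) rather than through a shell, so pipes and redirections in the command line are not interpreted. If you drive the tool from a script, one possible way to handle both cases is sketched below. The helper name exec_argv and the use_shell switch are hypothetical, and wrapping a command in sh -c is a general UNIX technique, not something the manual prescribes.

```python
import shlex


def exec_argv(command_line, use_shell=False):
    """Build an argument vector for `sciunit exec` (hypothetical helper)."""
    if use_shell:
        # Hand the whole string to sh, which interprets pipes and redirects.
        return ["sciunit", "exec", "sh", "-c", command_line]
    # Otherwise split into words only; no shell features are interpreted.
    return ["sciunit", "exec", *shlex.split(command_line)]
```

The returned list can be passed to subprocess.run on a machine where the sciunit tool is installed, for example subprocess.run(exec_argv("./hello.sh")).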
Now this program, along with all its dependencies, has been captured as
e1 in “Project1.” The “show” command can display the details of the last
captured execution:
> sciunit show e1
id: e1
sciunit: Project1
command: ./hello.sh
size: 2.82 MB
started: 2020-10-10 08:42
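If you want to use these details in a script, the output shown above is simple enough to parse. The following is a minimal sketch that assumes the output always consists of “key: value” lines like the sample; the function name parse_show_output is ours and is not part of sciunit.

```python
def parse_show_output(text: str) -> dict[str, str]:
    """Parse 'key: value' lines, as printed by `sciunit show`, into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps such as 08:42 survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# The sample output from this tutorial.
SAMPLE = """\
id: e1
sciunit: Project1
command: ./hello.sh
size: 2.82 MB
started: 2020-10-10 08:42
"""
```

Calling parse_show_output(SAMPLE) returns a dictionary with the keys id, sciunit, command, size, and started.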
As we claimed, this execution can be repeated on a different machine. We will
do so in the remaining chapters, but before that, we should test it locally:
> sciunit repeat e1
hello, world
If you investigate the workspace
> ls ~/sciunit/Project1/
cde-package e1.json sciunit.db vvpkg.bin vvpkg.db
, you will find a directory named “cde-package” and a few other files. The
“cde-package” is a temporary directory that consists of all necessary files
for this execution to repeat; for example, you can even find a “libc.so.6”
somewhere under this directory. The rest of the files are the underlying
implementation of the conceptual “sciunit.”
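To check for yourself which copies of a library were captured, you can search the package directory. The sketch below uses only the Python standard library; the helper name find_in_package is hypothetical, and the layout inside “cde-package” is not assumed beyond it being an ordinary directory tree.

```python
from pathlib import Path


def find_in_package(package_dir, filename):
    """Return sorted paths under package_dir whose base name is filename."""
    root = Path(package_dir).expanduser()
    # Captured libraries may be symlinks, so accept those as well.
    return sorted(
        p for p in root.rglob(filename) if p.is_file() or p.is_symlink()
    )
```

For example, find_in_package("~/sciunit/Project1/cde-package", "libc.so.6") lists every captured copy of that library, or returns an empty list if there is none.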
Now let’s try a different way to capture an execution – capture as you go:
> sciunit exec -i
Interactive capturing started; press Ctrl-D to end.
Wait, that’s it? No, you are merely inside a subshell: all the commands you
run from now on will be captured. For example:
> echo 'hello'
hello
> ./hello.sh
hello, world
Now press “Ctrl-D”:
> exit
Interactive capturing ended
These commands all become execution e2, and you can repeat it as well.
So far, we have created two executions within the sciunit. You can list them
with the list command
> sciunit list
e1 Oct 10 08:42 ./hello.sh
e2 Oct 10 11:00
, or remove one of them with the rm command. Note that after a removal all
the remaining executions retain their current execution ids, and new
executions will be assigned ids which are higher than the remaining ones.
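The id rule described above can be illustrated with a toy model. This is only a sketch of the documented behavior, not sciunit's actual storage: the class ExecutionLog is hypothetical, and it assumes that the next id is one above the highest remaining id, which is one reading of “higher than the remaining ones.”

```python
class ExecutionLog:
    """Toy model of execution ids in one sciunit (not sciunit's real storage)."""

    def __init__(self):
        self._commands = {}  # maps id number N -> captured command line

    def add(self, command):
        # Assumption: the next id is one above the highest remaining id.
        n = max(self._commands, default=0) + 1
        self._commands[n] = command
        return f"e{n}"

    def remove(self, exec_id):
        # Removing a nonexistent execution has no effect, as in the manual.
        self._commands.pop(int(exec_id[1:]), None)

    def ids(self):
        return [f"e{n}" for n in sorted(self._commands)]
```

With three executions captured, removing e2 leaves e1 and e3 unchanged, and the next capture in this model becomes e4.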
Continue your work on another machine
While developing your paper, you might want to capture more executions on
another machine, test your sciunit in another environment, or share
the sciunit with a coauthor. Conceptually, you want to copy & paste,
remotely. The easiest way is to use the copy command:
> sciunit copy
mSLLTj#
Give it a second, and it returns a code. You can then “paste” the sciunit
over the Internet by running
sciunit open mSLLTj#
on the target machine. The heavy lifting is done by the
file.io service. The code is valid for only one day and can be used only
once: after the sciunit has been pasted, the code is gone.
If you investigate the ~/sciunit directory on the machine on which you
initiated the copy,
> ls ~/sciunit
Project1 Project1.zip
you will find a new zip file. As you can imagine, it is a zipped version of
the sciunit “Project1.” The open command can also open a zipped sciunit.
So if you do not want to use file.io, you can instead use
the sciunit copy -n command to generate this file, and deliver the file to
some other machine or to someone else.
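Before delivering the file by hand, you may want to confirm that the archive is intact. The following sketch assumes only that the file is a standard zip archive, as described above; the helper name list_archive is hypothetical, and nothing is assumed about the names of the members inside.

```python
import zipfile


def list_archive(zip_path):
    """Return the member names of a zip archive, after an integrity check."""
    with zipfile.ZipFile(zip_path) as archive:
        bad = archive.testzip()  # name of the first corrupt member, or None
        if bad is not None:
            raise ValueError(f"corrupt member in archive: {bad}")
        return archive.namelist()
```

The function takes the path to the zip file and returns the member names, or raises ValueError if any member fails its checksum.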
Prepare your paper for review
The zip file mentioned above is the research object you are going to publish
along with your paper. You can manually select and upload such a file on
websites that host research objects. However, if you are using
HydroShare, the sciunit(1) tool can drastically simplify maintaining and
updating your draft articles.
Issue the following command to create a new article for the current sciunit:
sciunit push my_new_article --setup hs
“my_new_article” is a codename for your article. Codenames are useful for
maintaining multiple articles you created from the same sciunit, and you
should pick a codename that describes an article’s use, such as “debug.” “hs”
is short for “hydroshare” to indicate the service you are talking to.
The above command prompts you to authorize access:
Please go to the following link and authorize access:
https://www.hydroshare.org/o/authorize/?response_type=code&client_id=vG5R4zZFO6uJZBj3m0DWtUK6Va44jTQ4KoqtaLpn&redirect_uri=https:%2F%2Fsciunit.run%2Fcb&state=RCeAb6zxbEuw6yHQPDpWu26iHQIcan
Paste the authorization code:
Here we are running the OAuth2 flow for HydroShare. After you have authorized
the sciunit tool in a web browser and pasted in the auth code,
Paste the authorization code: AoxTbXnjzTfIa3OP5d5unxImPn0Noc
Logged in as "Yuan, Zhihao <lichray@gmail.com>"
Title for the new article: New Article for Project1
new: 8.93MB [00:01, 4.72MB/s]
input the title for the article and wait for the upload to finish. Now you
can go to https://www.hydroshare.org/my-resources/ to view your new article
on HydroShare. A newly-created article lacks information for publication and
is private.
After each successful “push,” the codename involved is recorded for the next
sciunit push command to pick up. So after you make a few changes to the
current sciunit, such as capturing a new execution, running sciunit push
without arguments silently brings your article on HydroShare up to date.
However, if you run
> sciunit push my_new_article --setup hs
Logged in as "Yuan, Zhihao <lichray@gmail.com>"
Create a new version of the article "New Article for Project1"? [y/N]
again, you are creating another new article rather than updating the existing
one. If you answer ‘y’, the article will be a new “version” (HydroShare
feature) of the existing one; if you answer ‘n’, the new article can have a
different title. In case you accidentally run into this query and cannot
answer it, just press “Ctrl-D” to cancel the operation.
Sciunit API
All the sciunit commands available through the command-line tool can also be accessed from within your Python program through the sciunit API. The following shows the basic syntax for using the API with a few commands. The rest of the commands are invoked in the same way:
> from sciunit2.api import Sciunit
> sciunit = Sciunit()
> sciunit.create("Project1")
Opened empty sciunit at ~/sciunit/Project1
> sciunit.open("Project2")
Switched to sciunit 'Project2'
Manual
sciunit [--version] [--help]
sciunit <command> [<args...>]
DESCRIPTION
A command line utility to create, manage, and share sciunits. A sciunit is a lightweight and portable unit that contains captured, repeatable program executions.
OPTIONS
General Options
--version
show program's version number and exit
-h, --help
show help message and exit
Commands
sciunit create <name>
- Create a new sciunit under ~/sciunit/<name> and open it. If the directory already exists, exit with an error.
sciunit open <name>|<token#>|<path to sciunit.zip>
- Open the sciunit under ~/sciunit/<name> or designated by a token# obtained from sciunit copy, or one in a zipped sciunit package by extracting it to a temporary directory.
sciunit open -m <name>
- Rename the currently-opened sciunit to <name> and open it.
sciunit exec <executable> [<args...>]
- Capture the execution of the given executable with the command line arguments args. The newly-created execution is added to the currently-opened sciunit and assigned execution id "eN", where N is a monotonically-increasing decimal. The first execution created in a sciunit has execution id "e1". Note that the command line is launched using execvp(3) rather than interpreted by a shell.
sciunit exec -i
- Launch the current user's shell and capture the user's interactions with the shell. This may involve executing multiple commands. A new execution is created on exiting the shell.
sciunit list
- List the existing executions in the currently-opened sciunit.
sciunit show <execution id>
- Show detailed information about a specific execution in the currently-opened sciunit.
sciunit repeat <execution id>
- Repeat the execution of execution id from the currently-opened sciunit exactly as it happened earlier.
sciunit given <glob> 'repeat' <execution id> [<%|args...>]
- Repeat the execution of execution id with additional files or directories specified by glob. The command expands glob into a list of filenames in the style of glob(3), substitutes the first occurrence of %, if any, in the optional args for the 'repeat' mini-command with those filenames, and repeats the execution as if those files or directories are available relative to its current working directory at capture time.
sciunit commit
- Commit the last repetition done by the repeat or the given command in the currently-opened sciunit as a new execution.
sciunit rm <execution id>
- Remove an existing execution from the currently-opened sciunit. A malformed execution id causes an error. Removing a nonexistent execution has no effect.
Note: the execution is removed from the records, but its data remains and may be shared with other executions.
sciunit rm <eN>–[M]
- Remove executions within a range, from eN to eM, inclusive. M may be omitted for a range from eN to the most recent.
sciunit diff <execution id1> <execution id2>
- Show the differences between the two given executions in terms of their directory structures.
sciunit sort <execution ids...>
- Reorder the executions in the currently-opened sciunit to ensure that the executions specified in the arguments appear consecutively in the output of the sciunit list command.
sciunit push <codename> --setup <service>
- Create an article on a research object sharing service and attach the currently-opened sciunit to the article. Assign different codenames to track multiple articles or multiple versions of an article created from a sciunit. The supported services include Figshare (fs) and HydroShare (hs).
sciunit push [<codename>]
- Update the last pushed article with the latest sciunit data if no argument is present. Otherwise, update the article referred to by codename.
sciunit copy
- Copy the currently-opened sciunit to file.io and obtain a token for remotely opening it. The token is invalidated after being accessed or after one day, whichever happens first.
sciunit copy <remote>|<name>
- Copy a sciunit to a remote server or ~/sciunit/<name>.
sciunit copy -n
- Archive the currently-opened sciunit to ~/sciunit/<name>.zip.
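The argument substitution performed by sciunit given can be pictured with a short Python sketch. It only mimics the description in the entry above and is not the tool's implementation: the function name expand_given is hypothetical, and we assume that % appears as a standalone argument.

```python
import glob


def expand_given(pattern, args):
    """Expand pattern and substitute the first '%' argument with the matches."""
    # glob(3) returns sorted names by default; Python's glob does not.
    matches = sorted(glob.glob(pattern))
    if "%" not in args:
        return matches, list(args)
    i = args.index("%")
    # Only the first '%' is replaced; later occurrences are kept as-is.
    return matches, [*args[:i], *matches, *args[i + 1:]]
```

The function returns the matched filenames together with the argument list after substitution; if no % is present, the arguments are returned unchanged.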
SEE ALSO
Globus: https://www.globus.org/