Pipeline 5.8

Cloud Storage Module


You can specify data stored in the cloud as input to and output from any Pipeline workflow. Currently supported cloud storage vendors are Amazon S3, Box, and Dropbox. To get started, link your cloud account to the LONI Pipeline under Tools > Database & Cloud Storage > Cloud Storage tab. For more information, please refer to the user guide.

NDAR Support


The Pipeline supports integration with the National Database for Autism Research (NDAR). For more information, please refer to the user guide.

Server Library Testing

Server library testing instructions are available in the Server Terminal. They provide commands for launching the CLI to validate and execute workflows in the server library.

Multiple Clients

Multiple clients (GUI and CLI) can now connect to the same session using the same credentials.

Session Sharing

This version includes a read-only workflow session sharing feature for guests. It is a server preference and requires the server admin to specify a sharing username. Once enabled, guests can see all workflows owned by that user but cannot modify (stop/pause/restart) them.

Automatic Metadata Augmentation

It is common for an application to generate a table of values as output. To extract the values from every cell and append them to the metadata file, we would normally have to define a specific rule for each cell, which can be time consuming. To expedite this, an automatic extraction feature asks the user to define the characteristics of the output table and uses this information to append elements to the metadata file. Learn more

MacBook Pro Retina Display Support

If you have a MacBook Pro with a Retina display, this version will look much sharper on your notebook than previous versions.

(August 21, 2012) Pipeline version 5.5 is available for download. This update adds a new Metadata Augmentation feature and fixes many bugs. For more details, read the Release Notes.

  1. Introduction
  2. Installation
    1. Requirements
    2. Downloading
    3. Setup and launching
  3. Interface overview
    1. Connecting to Pipeline servers
  4. Building a workflow
    1. Dragging in modules
    2. Connecting modules
    3. Setting parameter values
    4. Processing multiple inputs
    5. Enable/Disable parameters
    6. Saving a workflow
  5. Execution
    1. Executing a workflow
    2. Viewing output

1. Introduction

This Quick Start Guide to the LONI Pipeline covers the fundamentals of building a Pipeline. For a more detailed description of Pipeline features, please see the User Guide.

2. Installation

2.1 Requirements

The only requirement for the Pipeline client is an installation of JRE 1.6 or higher, which can be downloaded from Oracle. Memory consumption is modest, so it is unlikely that you will need to worry about having sufficient RAM to run the Pipeline.

2.2 Downloading

To get the latest version of the LONI Pipeline, go to the Pipeline web site and click on the download link in the navbar at the top. A LONI account is required to download LONI software; you can fill out an application here.

2.3 Setup and launching

OS X: To install the program, double-click the disk image file you downloaded, and drag the LONI Pipeline application into the Applications folder. Once the program is done copying, you can unmount (eject) the disk image and throw it in the trash. To start the Pipeline, just go to your Applications folder and double-click on the LONI Pipeline application.

Windows: To install on Windows, double-click the installer and follow the on-screen instructions. Once installation finishes, you can delete the installer and launch the program from Start menu -> Programs -> LONI Pipeline.

Linux/Unix: Extract the contents of the file to a location on disk, and execute the PipelineGUI script. Make sure you have the java binary in your PATH.

3. Interface overview


3.1 Connecting to Pipeline servers

If you need to connect to different Pipeline servers, go to the ‘Window’ menu and click on ‘Connections…’. Alternatively, you can click on the disconnected circles at the bottom right of the window, and in the popup menu click on ‘Connections…’.


Here you can add a connection to any Pipeline server that you want to access. If you don’t know of any servers, you can add the LONI Pipeline server (cranium.loni.usc.edu), but you will need to apply for a LONI cranium account to actually connect to it. Please note this account is different from the general LONI account. Once you’ve entered the connection, go ahead and click ‘Connect’, then close the dialog. After 30 seconds or so you’ll notice that your server library has been populated with tools from the server.

4. Building a workflow

Open a new workflow by going to File->New.

4.1 Dragging in modules

Go to the server library at the left and expand the desired package. Click on a module and drag it into the workflow canvas that you just opened. Repeat this step for all other modules that you need.


4.2 Connecting modules

Each module in a workflow can have some inputs and outputs. The inputs are on the top, and the outputs on the bottom. Connect the modules by clicking on the output parameter of a module and then dragging the mouse pointer to the following module’s input parameter.

Module connection

When you attempt to make a connection, the Pipeline does some initial checking to make sure the connection is valid. For example, it won’t let you connect a file-type parameter to a number-type parameter, or connect an output to another output, among other checks.

4.3 Setting parameter values

Now, specify values for the input parameters of each module which do NOT have a connection to a previous module. Double click on the input parameter and select the input value, making sure to choose an input that correctly matches the parameter type (File, Directory, String, Number or Enumerated). Also, File parameters can require a specific file type, so make sure to check this too if necessary.

Once you’ve set the inputs, you’ll want to specify a destination for the output of the final module. Double-click on the output parameter and specify the path where you want the output(s) to be written to.


Note that you can mix data located on your computer with data on the computer where the server resides, and the Pipeline will take care of moving data back and forth for you. For example, the input to a module could be located on your local drive, while the output is written to some location on the Pipeline server, or vice versa.

4.4 Processing multiple inputs

One of the strengths of the LONI Pipeline is its ability to simplify processing of multiple pieces of data, by using the same workflow you use to process a single input. In order to do this, you can create a Data Source and use it to feed a list of inputs into the first module. Right click on any blank space in the workflow canvas and select ‘Add Data Source’. In the dialog that opens enter some information about the data source, and then click on the ‘Data’ tab. From here, you can click on ‘Add files’ at the bottom of the dialog and add multiple files into the list, or you can just type in the path to a file manually. Note that at the bottom there is an option for a server in case you want the data source to represent data on another computer.


4.5 Enable/Disable parameters

Most modules have two or three required parameters and several more optional ones. If you want to exercise any of those additional options, simply double-click on the module and you’ll see a list of all the required and optional parameters for that module. For each additional option you want to use, just click the box to the left of its name to enable it; to disable it, click the box again. Note that required parameters cannot be disabled.

4.6 Saving a workflow

In order to save a workflow, go to File->Save.

5. Execution

5.1 Executing a workflow

Once you’ve completed your workflow, you can execute the workflow by simply clicking on the ‘Play’ button at the bottom of the workflow area. If the program needs a connection to a server, it will prompt you for a username and password. If you’ve already stored a username and password to the server in your list of connections, then it will automatically connect for you.

Once all necessary connections have been made, the workflow will begin to execute.


5.2 Viewing output

As the modules execute, you can view the output and error streams of any completed module. You can bring up the log viewer by going to Window->Log Viewer or, more easily, by right-clicking on the module that you want to view information about and clicking ‘Show Output Logs.’ This will bring up the log viewer and set its focus on the module that was clicked.

LONI Pipeline Hands-On Training Day, UCLA

When: Monday, May 21, 2012, 9AM – 3PM
Where: LONI DIVE Theater, NRB #225, UCLA (Map)

Are you interested in learning more about the LONI Pipeline? Are you a programming expert wanting to know more about integrating your tools with the Pipeline? Are you in charge of a compute cluster and think that the Pipeline might be right for you and your lab? Then you should attend one or more of these hands-on Pipeline training sessions! Please bring your own laptop.

9am-11am – Pipeline Workflow Basics: We’ll teach you how to get started creating your own Pipeline workflows and put you on your way to becoming a Pipeline expert in no time. This session is specifically focused on the hands-on construction of Pipeline workflows from the ground up, ideal for students and trainees with limited programming experience but a need to analyze large data sets using existing processing tools. Bring your PC/Mac laptop. Pipeline software will be provided. Topics covered include parts of the user guide and handbook.

11am-1pm – Pipeline Module Library Construction: Got a set of Linux command-line C, Java, Python, or Bash programs that you would like to see as Pipeline modules? We’ll help you create a personal library of Pipeline modules and workflows from your own tools, making them sharable with members of your lab and others! Have your laptop handy with access to your software routines and we’ll get you started. Topics covered include parts of the user guide and handbook.

1pm-3pm – Pipeline Server Installation, User, and Job Management: Are you a lab or center IT manager looking for a grid-enabled compute environment for your users? Want user and job management tools to help maximize cluster utility? Are you a lab PI with a new cluster and want to put it to good use? Then this session is for you. We’ll describe all of the steps for deploying the Pipeline to your own compute cluster, how to manage users, how to track job submissions, and more. Topics covered include the server guide.

Registration: Registration is now closed.

The event flier can be downloaded here (pdf format, 4 MB).

Web Service Module


Users can now use web services in a workflow by creating a Web Service module. SOAP (Simple Object Access Protocol)-based web services are supported; all you have to do is provide the WSDL file for the web service, and the Pipeline will parse it and generate an appropriate web service module. For more information, check our User Guide – Web Service Module.

Workflow Comparison Utility

The Workflow Comparison (diff) Utility can compare workflows within the Pipeline interface and show the differences. To launch this component, look for the Diff Workflows item under the Tools menu. For more information, check our User Guide – Workflow Diff Utility.

Data Extraction



Metadata from Study modules can be read and written by any execution module. Data Extraction enables extracting (reading) content from the metadata and feeding it to the executable/module. Any value from the metadata can be pulled and placed alongside an input or output parameter on the command line. For more information, check our User Guide – Data Extraction.

Metadata Augmentation


Metadata from Study modules can be read and written by any execution module. Metadata Augmentation allows the modification (writing) of metadata with content generated by the underlying executable. You can add, modify, or remove elements of the metadata file using values from input parameters or from the output and error streams of the executable. For more information, check our User Guide – Metadata Augmentation.

Previewer

You can preview the image output files of your completed workflows in two ways: hover the mouse pointer over the output parameter node on the workflow canvas, or hover over the output files panel of the module. If there are multiple instances, pointing the mouse at the output parameter node on the canvas lets the previewer scroll through all of the instances. Commonly used image file formats (e.g. .img, .nii, .mnc) are supported.

Copy Output

Copy output is a handy feature that allows you to copy any completed module’s outputs. When you paste, each output parameter is converted into a data source with all of its output files listed. If the module has multiple output parameters, multiple data sources are created with their corresponding files. This feature is helpful when you want to carry completed output files from one workflow into a new workflow.

Cancel Instance

 
Pipeline 5.3 allows users to cancel any pending instance of a module. All related instances of subsequent modules will be canceled as well.

Custom Grid and Environment Variables

Server administrators can now control the Pipeline’s grid engine variable usage and set restrictions on the variables and their values. This also allows arbitrary variable names and values. In addition, users can now define environment variables for any module. For more information, check our Server Guide – Grid Variables Policy.

For detailed change logs, please check the release notes.


Module Creation Utility

The Pipeline can now create new modules from a help string, a manual file, or a webpage showing usage and parameter descriptions. All you have to do is paste the text, and the Pipeline will attempt to convert this textual description of the tool’s execution syntax into a module definition. It covers most standard man page formats, and you can modify any parameter and the module will be updated on the fly. More info…

Instance Restart

You can now restart any completed or errored instance in a workflow under Execution Information. The instance and its successor instances will be resubmitted to run. Their output files will be deleted as well, to avoid possible conflicts on the subsequent run.

Tools Usage Stats

The Pipeline server now stores executable usage information for all tools/packages. The Usage Stats page shows the usage statistics of the LONI Pipeline Cranium server. It contains daily and monthly usage information, including CPU hours of all executions, the number of jobs submitted to the grid, and the number of connections to the server. It also has usage information on individual packages, identified by the Package field in the module’s definition.

Local Workflow Persistence


Your local workflow statuses are now stored for later retrieval. Upon starting the Pipeline client, you’ll notice all your locally-run sessions under Active Sessions. You can reconnect to any session to check the statuses of its modules.

Copy with Input

Copy with input is a handy feature that allows you to copy modules together with their inputs from a completed workflow. You will see the option when you select completed module(s) and right-click. It copies everything you selected and, additionally, the necessary input files from preceding modules that are not selected. These intermediate inputs are included as data sources.

Remote Update for Server

Pipeline server admins can now update and restart the Pipeline server, as well as update external tools/packages for the Pipeline server, remotely using the Server Terminal utility. When a new version is released, the Server Terminal utility will inform the server admin, and the server can be updated with the click of a button.

Server Status Charts

Pipeline 5.2 shows the status of the connected server in graphical charts. The charts show how busy the server is, how many resources you are currently using, and how many free resources are available to you. All this information appears in the bottom right corner of the Pipeline window when you connect to a server.

Error Log for Data Sinks

 
Failed data sink modules now include detailed information on the error messages. Just right-click on an errored data sink and choose View Error Logs.
 

Execution Log Pages

 
For modules with many instances, the execution log dialog now divides the single list into pages.
 

Annotation Features

 
New annotation features include resizable image annotations, collapsible/expandable annotations, and a workflow toolbar with annotation shortcuts.
 


Pipeline Web Start (PWS) allows users to start and run the LONI Pipeline application from a web browser without installation. Users who choose this application will have access to all of the functions and features included in the downloadable version. PWS is accessible via an anonymous guest login or user-authentication to connect to remote Pipeline servers.

Alternatively, you can download the installer for the LONI Pipeline desktop application here. The desktop application and Web Start have the same features.

Click here to launch the Pipeline Web Start

Once the JNLP file is downloaded, please double-click the file to launch the program.

The examples below will launch the PWS client, load specific predefined workflow protocols, and submit workflows to a dedicated LONI Pipeline server using guest privileges.

Neuro Imaging

AIR: Automated Image Registration
SVPASEG: Sub-Volume Probabilistic Atlas Segmentation

Bioinformatics and Genomics

MAQ, SAMTools, Bowtie (Integrated)
miBLAST: Basic Local Alignment Search Tool

Notes

  • Java version 7 or higher is required. The latest Java is recommended.
  • On Mac, the Firefox browser is required. Please also make sure your Mac OS X settings allow apps downloaded from anywhere. Click here for more information.
  • Once the JNLP file is downloaded, you may have to double-click the file if it doesn’t start automatically.
  • You have to allow access when it asks for permission to run on your computer.
  • You can launch any publicly accessible workflow with this format: http://users.loni.usc.edu/~pipeline/serverlib/jnlp_writer.php?url=[URL_TO_WORKFLOW]&server=[SERVER_NAME]&execute=true (see the example after this list). For more information, please check the instructions for setting up Pipeline Web Start.
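
For illustration, a hypothetical launch URL filled in with placeholder values; the workflow URL below is not a real endpoint, and cranium.loni.usc.edu is used as the server name:

http://users.loni.usc.edu/~pipeline/serverlib/jnlp_writer.php?url=http://example.org/workflows/brain_extraction.pipe&server=cranium.loni.usc.edu&execute=true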

For more information on the LONI Pipeline application, please check our Quick Start Guide and User Guide.

Want to host Pipeline Web Start on your own server? See the instructions for setting up Pipeline Web Start on your server.

FAQs

I get a “pipeline.jnlp can’t be opened because it is from an unidentified developer.” message when trying to run Pipeline Web Start on Mac OS X. How can I fix this?


It is likely that OS X’s Gatekeeper is preventing you from running the LONI Pipeline. Open System Preferences and click on the “Security & Privacy” icon. Unlock the panel if it is locked and select the “Anywhere” option. Now try to open the LONI Pipeline package again; it should work. After the Pipeline installation is done, you can change the Gatekeeper option back to its previous value. For more information, please refer to Apple’s documentation.

  1. Editing Preferences.xml
  2. Server Configuration Tool
    1. General
      1. Hostname and port
      2. Temporary directory
      3. Scratch directory
      4. Log file location
      5. Use privilege escalation and Enable guests
      6. Persistence
      7. Days to persist session status
      8. History directory
      9. Crawler persistence URL
      10. Server library
    2. Grid
      1. Grid engine native specification
      2. Job name prefix
      3. Grid complex resource attributes
      4. Grid Variables Policy
      5. Max number of parallel submission threads
      6. Grid total slots
      7. Max number of resubmissions for “error stated” jobs
      8. Array jobs
      9. Grid plugin
      10. Finished job retrieval method
      11. Pipeline user is a grid engine admin
      12. Grid job accounting
    3. Access
      1. Server admins
      2. Directory access control
      3. User management
      4. Non-user-based Job management
      5. Workflow management
    4. Mappings
      1. Packages
      2. Executables
      3. Utilities
    5. Advanced
      1. Failover
      2. Log email
      3. Network
      4. Maximum number of threads for active jobs
      5. HTTP query server
      6. Automatically clean up old files
      7. Maximum number of metadata threads
      8. Warn when free disk space is low
      9. Server status
      10. Directory source recursive timeout
      11. External network access queue
      12. Validation warning
      13. NFS directories for validation
      14. Check and verify output files
      15. Test server library

You can customize your Pipeline server in two ways: by editing the preferences.xml file, or by using the Server Configuration Tool.

3.1 Editing Preferences.xml

You can find the XML tags for a particular preference, or read the old version of the server guide, here: Server Preferences Configuration.

3.2 Server Configuration Tool

This tool is included in the Pipeline distribution and can be opened by typing:

$ java -classpath Pipeline.jar server.config.Main

The Server Configuration Tool will open with default values. If you already have a Pipeline server preferences file, you can load it by pressing the Load button at the lower right.

You can view and change any of the fields (explained in detail below). When you are done, click the Save button in the lower right corner; the tool first validates your inputs and then saves them to a file. If a preferences file is already loaded, it will be overwritten; otherwise, you will be asked where to save the newly created preferences file. At any time, you can restore the default values by clicking the Restore button in the lower right corner; all changes made since the last save will be discarded.

When you are done configuring preferences, you can start the Pipeline server with the saved preferences file by running:

$ java -classpath Pipeline.jar server.Main -preferences preferences.xml

3.2.1 General

Interface Overview
The General tab lets you specify the most basic information needed for the Pipeline server.

3.2.1.1 Hostname and port

The server hostname is the hostname of the computer that you want the server to run on; it requires the fully qualified domain name of that computer.

The default port number is 8001. If you are using the default port, you can leave the field blank and identify the server by hostname alone. If you use a non-default port, you must identify the server in hostname:port format.

3.2.1.2 Temporary directory

The temporary directory is where all intermediate files for all the executed programs are stored on the server. This directory should be accessible from the Pipeline server as well as compute nodes. The Pipeline server will create a structure under there, and the compute nodes will read from and write to that directory. For example if you specify /ifs/tmp, the Pipeline server will create a directory /ifs/tmp/username/timestamp and put all the working files there.

Here username is the user running the server and timestamp is the time at which each workflow gets translated before execution. Inside each of those ‘timestamp’ folders will be all the intermediate files produced by executables from submitted workflows. Depending on the number of users on your server and the kind of work they do, this directory can balloon very quickly.

If you check the Secure checkbox, the Pipeline server will store the temporary directory more securely: each workflow’s directory will be accessible only to the user who started the workflow.

If you are starting your Pipeline server for the first time, it is suggested that you specify a non-existent directory as the temporary directory. The Pipeline server will create the appropriate files at startup.

3.2.1.3 Scratch directory

The scratch directory is utilized in workflows to make data sink paths portable across Pipeline environments. In particular, this field is linked to the {$tempdir} special variable available to all Pipeline workflows. The server administrator should configure this to be the path to a directory to which all Pipeline users can write. If the grid is utilizing NFS, this directory must be available to all the compute nodes that the Pipeline server manages. When configured, this value is stored in the Pipeline user’s home directory, inside the userdata.xml file. Upon connecting to the Pipeline server, the client requests this value and stores it in the local userdata.xml file. Any references to the {$tempdir} variable in workflows are replaced with the configured value.
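
For example, a data sink path written as {$tempdir}/study1/output.nii (a hypothetical path) resolves against whatever scratch directory each server has configured, so the same workflow stays portable across Pipeline environments.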

3.2.1.4 Log file location

If you want to explicitly set the directory location that your log files will write to, you can specify the path here. In order to define the prefix in which the log files will be named, simply add that to the end of the directory path. The unique number denoting the log file will be appended onto the file name. For example, if we specify log file location as  /nethome/users/pipelnv4/server/events.log, then log files will be created in the /nethome/users/pipelnv4/server/ directory, and will be named events.log.0, events.log.1, and so forth.
Note: the log file directory MUST exist before starting the server.

3.2.1.5 Use privilege escalation and Enable guests

When you have different users connecting to your Pipeline server, you might want to enforce different access restrictions on each user. If you’re running your Pipeline server on a Linux/Unix based system (including OS X), you can enable privilege escalation, which makes the Pipeline server issue commands as the user who submits a workflow for execution. For example, if user ‘jdoe’ connects to a Pipeline server with privilege escalation enabled, any command issued on behalf of that user will be prefixed with ‘sudo -u jdoe ‘. This way, all files accessed and written on the Pipeline server by that user are handled on behalf of ‘jdoe’.

Remember, there is no harm in not enabling privilege escalation on your Pipeline server. All files will simply be created and read as the Pipeline server user, and you will be giving uniform access to your system to all users. This also makes it easy to lock down the access of all Pipeline users, because you only have to lock down the access of one actual user on your system: the Pipeline user.

In order to enable this feature in the Pipeline, you need to do two things: 1) modify your system’s sudoers file to allow the user that runs the Pipeline server to sudo as any user that will be allowed to connect to the system, and 2) check “Use privilege escalation” in the Server Configuration Tool. A minimal sudoers sketch is shown below.
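
For illustration only, a minimal sudoers entry, assuming the Pipeline server runs as a hypothetical user named pipeline (edit with visudo; in practice you should restrict the target users and permitted commands rather than allowing all of them):

pipeline ALL=(ALL) NOPASSWD: ALL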

The Enable guests option lets you grant guest login to the server. The guest username is randomly generated by the client in the format “guest-[random 8-character string]”. If enabled, a guest will be able to connect to the server, submit workflows as the mapped guest user, and retrieve workflows submitted by that guest. Regular users will not be affected in any way, and each guest will not have access to other guests’ workflows either. Pipeline Web Start and the Workflow Library Navigator are set up with guests enabled.

3.2.1.6 Persistence

The Pipeline server uses hsqldb to store information, including workflow status and module status. By default, this is stored in the Pipeline server’s memory and is removed when the Pipeline server stops. Alternatively, you can start an hsqldb server and have it save to an external file. You can download the jar file from the hsqldb website. To start an hsqldb process, run something like this:

java -cp ./lib/hsqldb.jar org.hsqldb.Server -database.0 file:/user/foo/mydb -dbname.0 xdb

After successfully starting hsqldb, you can put the persistence database information into the server’s preferences. The URL for the above example would be:

jdbc:hsqldb:hsql://localhost/xdb

A username and password can be configured as well; refer to the hsqldb documentation for more information. Putting it together, a minimal startup sequence might look like the sketch below.
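
This sketch simply combines the commands above; the paths and database names are illustrative:

$ java -cp ./lib/hsqldb.jar org.hsqldb.Server -database.0 file:/user/foo/mydb -dbname.0 xdb &
$ java -classpath Pipeline.jar server.Main -preferences preferences.xml

Here preferences.xml has its persistence URL set to jdbc:hsqldb:hsql://localhost/xdb via the Server Configuration Tool.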

3.2.1.7 Days to persist session status

This item specifies the number of days a completed workflow is stored on the server. The Pipeline server periodically checks for and cleans up completed workflow sessions older than the specified number of days (counting from the workflow session’s completion time). The default value is 30 days. When a session is cleared, all its temporary files under the temporary directory are removed as well.

3.2.1.8 History directory

The history directory is a directory for storing history/logging information on the server. If specified, each submitted workflow is stored with timestamps and user information, even when a user resets the workflow or the workflow is automatically removed from active sessions. Pipeline admins can see the history lists under the Server Terminal’s Workflows tab and Users tab (check Show full history).

3.2.1.9 Crawler persistence URL

This item specifies the hsqldb URL of the optional workflow crawler. If a valid hsqldb URL is specified, the Pipeline server starts a workflow crawler thread, which crawls every submitted workflow (and history workflows, if any exist) for module usage statistics. The statistics are used to help users select and explore modules. Features such as Module Suggest, ranked server library search results, and the Workflow Miner all use statistics gathered by the crawler.

The crawler persistence database uses hsqldb, like the main persistence database. Setting it up is similar to setting up the main persistence database, with slight changes in db name and port, for example:
Main: java -cp ./lib/hsqldb.jar org.hsqldb.Server -database.0 file:/user/foo/maindb -dbname.0 xdb -port 9002
Crawler: java -cp ./lib/hsqldb.jar org.hsqldb.Server -database.1 file:/user/foo/crawlerdb -dbname.1 cdb -port 9003

URLs
Main: jdbc:hsqldb:hsql://localhost:9002/xdb
Crawler: jdbc:hsqldb:hsql://localhost:9003/cdb

3.2.1.10 Server library

When Pipeline client users connect to a server, the client syncs up the library of module definitions available on that server. The location of that library on the server is specified by the Server library location field in the preferences. By default, the location is set to one of the following (based on OS), so you don’t need to set this preference if the default suits you:

  • Linux/Unix – $HOME/documents/Pipeline/ServerLib/
  • OS X – $HOME/Documents/Pipeline/ServerLib/
  • Windows – %HOME%\Application Data\LONI\pipeline\ServerLib\
  • Windows Vista/Seven – %HOME%\Documents\Pipeline\ServerLib\

Put all the module definitions that you want to make available to users into this directory; when the server starts up, it reads all the .pipe files in the directory (and all its subdirectories). When clients connect, they obtain a copy of the library on their local system. The server monitors the directory for changes/additions while it runs. If a change occurs, the server automatically rechecks the files (no restart required) and synchronizes clients again when they reconnect. Clients that are connected during the change get the new version of the server library instantly.

By default, the server monitors the timestamp of the server library directory. However, if you make changes inside subdirectories, the server will not see them, because the library directory’s timestamp does not change. You can solve this by checking “Monitor library update file” and specifying a path (by default, a file called .monitorFile under your library directory). The Pipeline server will then monitor this particular file, instead of the library directory, to know when the library is updated. So whenever you update your server library, update this file to inform the server, for example by touching it as shown below.
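
For example, after editing module definitions inside a subdirectory, you could refresh the monitor file’s timestamp (the path shown is the OS X default from the list above; adjust it to your configured monitor file):

$ touch $HOME/Documents/Pipeline/ServerLib/.monitorFile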

3.2.2 Grid

Interface Overview
The Grid tab lets you specify the grid-related parameters for the Pipeline server. If the grid is not enabled, all jobs will run locally on the Pipeline server machine.

3.2.2.1 Grid engine native specification

This field lets you specify a native specification string that goes along with your job submission (if none of that makes any sense, just skip this preference because you don’t need to use it on your server). On the LONI Pipeline server we use the following native specification preference:

-shell y -S /bin/csh -w n -q pipeline.q -b y _pmem _pstack _pcomplex _pmpi -N _pjob

The native spec you should use for your installation will vary, but if you’re using an Oracle Grid Engine (previously known as Sun Grid Engine) installation and you want to use the same string, you’ll want to change the -q pipeline.q to reflect the submission queue (if any) that you will be using.

Optionally, you can add _pmem and _pstack to the native specification. _pmem lets the user define the maximum memory per module, and _pstack lets the user define the stack size. Both can be configured by the user in the latest Pipeline client, and both fall back to the grid engine’s defaults unless the user specifies values.

You can also add the following:
_pcomplex, which refers to Grid complex resource attributes.
_pgridvars, which refers to the Grid Variables Policy (available since version 5.3).

3.2.2.2 Job name prefix

In some environments, server admins may want to see at a glance which jobs were submitted by the Pipeline and which were not. This can be done by setting a prefix for all jobs started by the Pipeline. The “Job name prefix” field expects a string value that will be prepended to each job’s name. For example, you could set it to “pipeline_” and all job names will start with it (e.g., pipeline_echo, pipeline_sleep, pipeline_bet, pipeline_realign, etc.).

IMPORTANT: This change takes effect ONLY if the Grid Engine Native Specification contains the _pjob value. The _pjob value sets each job’s name to the executable filename. If _pjob is not specified, the prefix won’t be prepended to job names.

3.2.2.3 Grid complex resource attributes

Grid complex resource attributes let the Pipeline server identify jobs that were submitted by the Pipeline but are no longer monitored by it. This happens when the server shuts down while a job is being submitted: the submission completes while the Pipeline is down, so the job id is never written to the Pipeline database. The job occupies a slot, but the Pipeline does not “remember” its id. When the server restarts, it gets the list of running jobs on the cluster and compares it with its database, using the grid complex resource attributes to determine which jobs were submitted by the current server. When the Pipeline finds jobs that were submitted by the current Pipeline but are out of its control, it deletes them to free up the slots.

This tag lets you assign custom complex attributes to all jobs submitted by the server, which makes the jobs identifiable. You can have multiple values in the tag, separated by commas. For example:
pipeline, serverId=server1
This defines two attributes: 1) pipeline, which is equal to TRUE, and 2) serverId, which is equal to server1.

This tag is just a definition of complex attributes. In order to use them you have to define _pcomplex in Grid engine native specification. In our case, the _pcomplex will be replaced with -l pipeline -l serverId=server1 when submitting the job to the grid.

Note that the Grid manager has to be configured properly to accept jobs with given resource attributes.

3.2.2.4 Grid Variables Policy

Starting from Pipeline version 5.3, server administrators are able to control grid engine variable usage and set restrictions to them and their values.

By default, if nothing is set, Pipeline will not allow any grid variables to be used by users.

To configure Grid Variable Policy, open the Configuration tool’s Grid Panel and follow the example instructions below:

Let’s say we want to give users permission to use the variables h_vmem and h_stack. We select “Disallow all variables, except specified” from the Grid Variables policy drop-down box and write semicolon-separated values in the textbox:

h_vmem; h_stack

As each grid engine is different, grid variables may be specified in different formats. For example, for SGE we need a “-l” prefix before each resource value, so we put “-l” in the “Prefix before each tab” field.

Any of these variables may require a suffix after the value. For example, in SGE h_vmem needs to be written in the format h_vmem=5G. As you can see, there is a G suffix, and we need to have that in our policy.

To introduce the suffix, we will change the “Grid Variables policy” to:

h_vmem [G]; h_stack [m]

where “G” is the suffix for the h_vmem variable and “m” is the suffix for h_stack. After saving this change, without any server restart, the Pipeline will not allow any grid variable to be used other than these two.

Alternatively, Pipeline administrators can set limits on values. For example, if we want h_vmem to range from 1GB to 8GB and h_stack from 0MB to 256MB, we need the following configuration:

h_vmem [1-8G]; h_stack [0-256m]

If you want to allow specific values rather than ranges of numbers, use the following syntax:

h_specific_value ( valueX, valueY, valueZ )

If you combine this with the previous configuration, you get:

h_vmem [1-8G]; h_stack [0-256m]; h_specific_value ( valueX, valueY, valueZ )

So as you can see, ranges must be specified in the format [Min-Max Suffix], and specific values as ( Comma, Separated, Values ).

Another option is to allow everything except certain variables; this uses the same format. To use it, select the “Allow all variables, except specified” item from the drop-down box.

Finally, you can set multiple ranges for each variable. This is useful when you want to give users a choice of suffixes, for example, letting them specify h_vmem in both gigabytes and megabytes. The configuration would look like this:

h_vmem [1-8G, 1000-8000M]; h_stack [0-256m]; h_specific_value ( valueX, valueY, valueZ )

If the user doesn’t specify a suffix on the value (for example, 4 instead of 4G), the Pipeline automatically uses the configured suffix. If there are multiple suffixes, the Pipeline uses the first one as the default.

NOTE: The Grid Variables Policy is enabled only when the _pgridvars variable is defined in the Grid Engine Native Specification field. Otherwise, the Pipeline will only validate grid variables but will not include them in the job submission specification. Here is an example of a native specification that enables the Grid Variables Policy:

-shell y -S /bin/csh -w n -q pipeline.q -b y _pgridvars _pcomplex _pmpi -N _pjob

So if you want to use the Grid Variables Policy, make sure you have the _pgridvars variable defined in the native specification.

Please see Grid engine native specification for more details.

3.2.2.5 Max number of parallel submission threads

This field lets you specify the number of parallel job submissions. For example, if you specify 1, jobs are submitted one by one; if you specify 10, a maximum of 10 parallel submissions are allowed at a time.

3.2.2.6 Grid total slots

The Pipeline client lets users see the activity of the Pipeline server: how busy the server is and how many total slots are available on the grid. The total slots fields tell the server how many total slots are available on the grid, either dynamically (by running a command) or statically (by providing a number).

The Grid total slots command field takes a command-line query that returns the total number of available slots for the queue the Pipeline server is using. Refer to your cluster management documentation for the appropriate query. With this set, the server queries the grid engine periodically for the latest number of available slots, updates the number automatically, and broadcasts it to clients.

The Grid total slots field lets you give a static number of total grid slots for the cluster. This number will be less accurate if the slot count on the grid changes regularly.

You can provide both a dynamic command and a static number; the server tries the dynamic command first and falls back to the static number if the command fails.

3.2.2.7 Max number of resubmissions for “error stated” jobs

When submitting jobs to the grid, jobs can fail before starting to run. For example, if a job is submitted with sudo but the node where execution started doesn’t have information about that user, the grid engine will complain and put the job in an error state. In this case, resubmitting the job may place it on another node that doesn’t have the same problem.

This property tells the Pipeline how many times to resubmit a job after its state is set to “error”. Note that this does not mean the Pipeline will resubmit every failed execution: it will not resubmit a job whose execution started successfully but whose executable exited with a non-zero code. For example, if you run the cp command with one argument, it will fail, but the Pipeline won’t resubmit it, because the execution completed regardless of its result.

3.2.2.8 Array jobs

Job submission can take a long time when each instance of a module is submitted as an individual job. Array jobs solve this problem and dramatically improve job submission speed. Starting from version 5.1, you can enable array jobs on the Pipeline server so that instead of submitting individual jobs it submits arrays of jobs. This especially improves submission speed for modules with hundreds or thousands of instances. Depending on the module’s number of instances, there is a 10%-65% speed improvement when using array jobs versus individual jobs.

Parameters

When a workflow containing a module with a large number of instances is started, it is more time-efficient to submit array jobs with a small number of tasks at the beginning.

Before submitting a job array, the Pipeline has to create a special script and configure it for each instance. This procedure takes time, and the grid engine is idle during the preparation. For example, for a module with 1000 instances, it takes less time to prepare 50 jobs for submission than 1000. You can therefore configure the value of “break into chunks when number of jobs exceeds”, which tells the Pipeline to split into chunks the instances of modules whose cardinality is X or more. It is a positive integer indicating the minimum cardinality a module must have in order to be split into chunks.

The chunk size tells the Pipeline server the size of an array job. So if we set break into chunks when number of jobs exceeds to 200 and chunk size to 50, and a user submits a workflow with 200 instances, the Pipeline server divides them into 4 array jobs, each with 50 instances.

Sometimes it is better to gradually increase the chunk size after the first chunk: once the first chunk is submitted, the grid engine is no longer idle and we can afford to submit larger array jobs. If this option is enabled, after the first submission the Pipeline server doubles the chunk size on each iteration.

For example, if we have a module with 1000 instances, chunk size 50, and break into chunks when number of jobs exceeds 200, it will submit array jobs with the following sizes:

Array Job #   Number of instances   Total submitted so far
1             50                    50
2             100                   150
3             200                   350
4             400                   750
5             250                   1000

The maximum chunk size sets the upper limit of the increase. The Pipeline server will double the chunk size for each array job until the maximum chunk size is reached.

There is one special case: during submission the Pipeline checks how many instances remain to be submitted. When the remainder exceeds the maximum chunk size but the excess is less than 10% of the total, the remaining instances are all carried in the last array job. Here is an example with a total of 768 instances.

Array Job #   Number of instances   Total submitted so far / Remaining
1             50                    50 / 718
2             100                   150 / 618
3             200                   350 / 418
4             418                   768 / 0

The last array job of 418 exceeds the specified maximum of 400, but 418-400=18, which is less than 76 (10% of the total 768 instances).

3.2.2.9 Grid plugin

The Pipeline server lets you create your own plugins to communicate with various grid managers (see also the Pipeline Grid Plugin API Developers Guide). The Pipeline package contains two built-in plugins for Oracle Grid Engine (previously known as Sun Grid Engine): JGDIPlugin and DRMAAPlugin. In the installed Pipeline package, under the lib directory, there is a directory called plugins where you can find these two plugins.

If you are using either JGDI or DRMAA, just select it in the drop-down and everything will be filled in for you. If you are using a custom plugin, you need to specify the jar file location and the class name of the plugin for the Pipeline to use.

IMPORTANT: Some plugins need to be on the classpath. For example, the DRMAA plugin requires the path of drmaa.jar on the classpath when starting the server. So to start the server with the DRMAA plugin you would run:

$ java -cp .:/usr/pipeline/dist/lib/plugins/drmaa.jar:Pipeline.jar server.Main

3.2.2.10 Finished job retrieval method

This option is mostly for advanced cases. When the Pipeline server restarts, it is possible that some jobs that continued to run finished while the server was offline. In that case, when the server starts up, it needs to get information about the finished jobs from the grid engine’s accounting database, files, or some other place. Different grid plugins have different options for retrieving information about finished jobs. For example, the JGDI plugin supports retrieving information from SGE’s ARCo database. The same plugin may have several methods of getting that information, in which case the Pipeline server administrator will want to choose the method.

This property takes a string value that is sent to the grid plugin. Most grid plugins have a default method for retrieving job information, in which case this property does not need to be set. For example, to use the ARCo database with the JGDI plugin, we could leave this property empty (the ARCo database is the default method for the JGDI plugin) or simply set it to “arco”. Please refer to your grid plugin’s documentation for the methods it supports and how to configure them.

3.2.2.11 Pipeline user is a grid engine admin

When privilege escalation is enabled, the Pipeline submits jobs on behalf of the users who submit the workflows. If the Pipeline user is a grid engine admin (configured by the grid engine administrators) and is able to delete any user’s jobs, it is recommended to leave this property unchecked, as performance will be faster.

When this checkbox is selected, the Pipeline tells the grid plugin to kill jobs on behalf of the user. For example, in SGE that call would be sudo -u #user qdel #jobId, where #user is the username and #jobId is the job’s id.

3.2.2.12 Grid job accounting

After the Pipeline server restarts, some jobs may have already finished or changed status. These events were not caught because the Pipeline server was not running at the time. To recover the status of these “missed” events, the Pipeline gets information from a configured Sun Accounting and Reporting Console (ARCo) database. Note this feature has only been tested with Oracle Grid Engine (previously known as Sun Grid Engine) with the JGDI and DRMAA plugins; if you are using another grid manager and it does not work, please report it on our Pipeline forum.

Assuming the ARCo database is configured and running (refer to the ARCo website and your system administrator for help), to configure it in the Pipeline you need to provide the database URL, username, and password. The URL looks like this:

jdbc:mysql://hostname/db_name

3.2.3 Access

Interface Overview
The Access tab lets you configure user-access and workflow management for the Pipeline server.

3.2.3.1 Server admins

This field lets you specify a list of user names that are Pipeline server admins. Pipeline server admins can connect to the server via the Pipeline Server Terminal utility to monitor and manage the server. They can view, stop, and delete anyone’s workflows on the server; view a list of connected users and their IP addresses; view the server’s memory and thread usage; and edit all the server preferences discussed in this section.

If no server admin is specified yet, you need to edit the server’s preferences.xml file directly. The XML element is called ServerAdmins. For example, if you want john and jane to be server admins, the preferences.xml file should look like this:

<preferences>
...(other items)...
<ServerAdmins>john, jane</ServerAdmins>
</preferences>

After saving changes to the preferences.xml file, you need to restart the Pipeline server.

3.2.3.2 Directory access control

The Pipeline server lets you control which executables/modules users can run and which files they can browse on the server. Below is a matrix of the different modes:

Mode  Executables Access Control   Remote File Browser Access Control
0     Never                        Never
1     No with exceptions           Never
2     Yes with exceptions          Never
3     No with exceptions           No with exceptions
4     Yes with exceptions          Yes with exceptions
5     No with exceptions           Same as shell permissions
6     Yes with exceptions          Same as shell permissions
7     Same as shell permissions    Same as shell permissions
8     Users with shell access: Same as shell permissions; users without shell access: Yes with exceptions (applies to both columns)

Never means the Pipeline server will not apply any access control restrictions to any user. Note this does not affect the operating system’s authentication and access control; in other words, the credentials required to connect to the Pipeline server and the rights required to execute programs are not affected by these settings. No with exceptions means access control is disabled for all users, except that those listed in Controlled users will be restricted. Yes with exceptions means all users are restricted, except that those listed in Controlled users will be allowed. Same as shell permissions means the remote file browser acts as if the user had logged in to the server using a shell.

Controlled users
This is a comma-separated list of users (e.g., john,bob,mike) indicating the conditional users. Depending on the control mode, these users will be restricted or allowed.

Controlled directories
This is a comma-separated list of directories (e.g., /usr/local,/usr/bin), which will be the only directories allowed for restricted users.

Examples
If we want to restrict users john, bob, and mike to executing programs only in /usr/local and /usr/bin, while letting every user browse with the remote file browser as a shell would, we would use this configuration:
Mode: 5
Controlled users: john,bob,mike
Controlled directories: /usr/local,/usr/bin

As another example, if we want to restrict all users to executing programs only in /usr/local and /usr/bin, but allow users john, bob, and mike to run without restrictions, while letting every user browse with the remote file browser as a shell would, we would use this configuration:
Mode: 6
Controlled users: john,bob,mike
Controlled directories: /usr/local,/usr/bin

3.2.3.3 User management

Pipeline version 5.1 introduces a new user management feature. It has a special algorithm that provides a fair share of resources to all users.

How it works
When enabled, you set a percentage that limits each user’s jobs. The percentage is applied to the number of free slots at submission time.

For example: the Pipeline server has 150 slots available and the user management percentage is set to 50%.

The first user, A, can use 50% of 150, so 75 slots.
User B can then use 50% of the remaining free slots: 50% of (150 total – 75 used by A) = 50% of 75 free = 37 slots, and so on.

The Pipeline server constantly monitors user usage and adjusts each user’s limit.

3.2.3.4 Non-user-based Job management

Pipeline version 7.0 introduces a new job management feature with limits independent of the limits on the user who submitted the job. The job management algorithm has been updated to consider both user management and job management factors when deciding whether to run a job.

How it works

<EnableNodeManagement>true</EnableNodeManagement>
<ControlledNodes>
<node name="ControlledModule1" limit="200" />
<node name="ControlledModule2" limit="150" />
</ControlledNodes>

When enabled, you set the module name and a limit on the number of jobs that can be run by modules with this name. The limit is global across all users and all jobs submitted via the Pipeline.
For example, a user runs a module named ControlledModule1 that submits 180 jobs. Other users running modules with the same name (name only; the module’s actual definition is not considered) will find that this module can submit only 20 more jobs, and the rest are backlogged. This assumes their user quota (see above) has not been reached yet; if the user quota is already reached, the Pipeline will not submit any jobs.

This feature is useful for jobs with special limitations that need to be enforced. For example, a job may communicate with an external service that you don’t want to overload with thousands of requests. You can create a uniquely named module to describe this job and limit it across the server to protect the resource.

3.2.3.5 Workflow management

Workflow management lets you manage active (running and paused) workflows per user on the Pipeline server. If it is enabled and a limit has been set, the number of active workflows for any single user will not exceed the limit. Any workflow submitted after the limit is reached gets a backlogged status and is queued until one of the user’s running workflows completes.

3.2.4 Mappings

Updated: The Packages and Executables tabs are now disabled when running the configuration tool. Instead, you can edit them by connecting to the running Pipeline server via the Server Terminal > Preferences tab. All values under the Packages and Executables tabs are updated dynamically, without restarting the server.

The Packages and Executables tabs give the administrator the ability to configure mappings that allow for Pipeline workflow portability between Pipeline servers. In particular, information about an executable (in the form of the executable package, version, or name) can be used to determine where that executable resides on a given file system and whether it requires the setting of certain environment variables. The following sections describe this interface in greater detail.

3.2.4.1 Packages

Interface Overview
The Packages tab allows you to set up a list of available packages/tools and their locations on the server. This allows the Pipeline server to automatically correct a user-submitted module’s executable location based on its package name and version.

There are five columns in the table: package name, version, location, variables, and sources. Package name and version are compared against the user-submitted module definition; together they should uniquely identify a package on the server. The version value can be empty, which corresponds to an empty version in the module’s definition. Package location is the local path to the package where all the executables are located. Variables and sources are optional; they define, respectively, the environment variables needed for the package (in name=value format) and the scripts to be sourced before running.

To add a package, click Add and enter the name and version of the package. You can double-click any of the columns to edit them; for Variables and Sources, a dialog pops up after the double-click to let you easily input multiple entries. To delete a package, just select it and click Remove. Make sure to hit the ‘Save Server Preferences’ button to save the settings.

3.2.4.2 Executables

Interface Overview

The Executables tab improves the portability of Pipeline modules in the following sense. Suppose we have a Pipeline module that runs /usr/bin/java -jar /usr/local/some_package-1.0/bin/program.jar. Using the package mapping utility (described above), we can declare that program.jar belongs to the package ‘some_package’, version ‘1.0’, include this information in both the Pipeline module definition and the Packages tab in the server terminal GUI, and then share this module with a user connecting to a different Pipeline server (with the proper configuration). However, we are still assuming that both systems have /usr/bin/java. With the executables mapping, the executable name ‘java’ can be configured to map to any location, making the module fully portable.

Much like the Packages tab, the Executables tab allows the administrator to create a correspondence between executables and their location in the relevant filesystem. In this case, the interface has three columns. In the first column, the user needs to specify the executable name (e.g., java, perl). The second column requires the user to specify a version number for the executable. This can be useful if Pipeline users are utilizing several different versions of an executable and need to be able to specify the version needed by a particular Pipeline module. If the version number is not relevant, an asterisk (*) can be used to indicate any version. Finally, the last column, labeled ‘Location,’ is used to indicate the path on the grid’s execution hosts where the executable resides (e.g., /usr/bin/java).

To add an executable, click on Add and enter the executable filename, version, and full path. To delete an executable, just select it and click Remove. Again, make sure to hit the ‘Save Server Preferences’ button to save the settings.

3.2.4.3 Utilities

The Pipeline makes use of several utility plugins, and the number of these plugins will undoubtedly increase as the Pipeline grows. Currently, the three independent applications that the Pipeline uses are Smartline, XNAT, and IDA. These Java applications are completely portable; they are packaged as jar files and are included in the standard Pipeline distribution. When setting up a Pipeline server, the administrator needs to perform three steps to ensure that these tools can be used by Pipeline clients which connect to it. First, the xnat, smartline, and idaget directories need to be moved to a common location. Second, this directory needs to be configured into the Package Mapping for the server. In particular, the package name “Pipeline Utilities” with the generic version “*” should map to /path/to/utilities, where /path/to/utilities is a directory containing the xnat, smartline, and idaget directories. This can be done very easily under the Preferences->Packages tab in the Server Terminal Utility. Finally, the path to Java needs to be configured using the Preferences->Executables tab in the Server Terminal Utility. The Package Name should be “java”, the Version should specify the version of java (or you can enter “*”), and the Location should be the path to java on your grid’s execution hosts.
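
Putting the two mappings together, the resulting entries would look something like this (with /path/to/utilities and /usr/bin/java replaced by your own paths):

Package:    Pipeline Utilities    Version: *    Location: /path/to/utilities
Executable: java                  Version: *    Location: /usr/bin/java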

3.2.5 Advanced

Interface Overview
The Advanced tab allows you to configure a number of more advanced features.

3.2.5.1 Failover

The Pipeline server has an automatic failover feature, available on servers running UNIX/Linux/Mac OS X. Failover improves robustness and minimizes service disruptions in the case of a single Pipeline server failure. This is achieved by using two actual servers, a primary and a secondary, a virtual Pipeline server name, and by de-coupling the persistence database and running it on a separate system. Each server monitors the state of its counterpart. In the event that the primary server holding the virtual Pipeline server name has a catastrophic failure, the secondary server will assume the virtual name, establish a connection to the persistence database, and take ownership of all current Pipeline jobs dynamically.

Requirements
  • 3 separate hosts (1 master, 1 slave, 1 persistence).
  • A virtual IP address for the server.
  • The user who runs Pipeline must have full access to execute the ifconfig command.

How it works
The server on Host B pings the server on Host A every 5 seconds (the interval is configurable). When there is no response (timeout), it retries the ping 3 times (the number of retries is configurable); if all retries are unsuccessful, Host B creates an IP alias on the network interface specified by the alias interface and alias sub interface number settings below, and switches to Master mode.
Alias interface
This specifies the name of the interface on which the Pipeline server will create a sub interface to do IP aliasing.
Alias sub interface number
This specifies the sub interface number on which Pipeline should create the alias IP address. If nothing is specified, Pipeline will automatically find the first available sub interface number and add the IP alias on it. For example, if your primary interface is eth0 and eth0:0 and eth0:1 are already in use by other IP addresses, Pipeline will use eth0:2.
WARNING: If one of the sub-interfaces already holds the IP address of the specified hostname, Pipeline will give an error and exit.
Post failover script
This is the path to a script which will be called after the master server’s process terminates.
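
For reference, the IP aliasing that the failover feature performs is conceptually equivalent to running a command like the following (the interface, sub interface number, and IP address here are examples only):

ifconfig eth0:2 192.168.1.100 netmask 255.255.255.0 up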

Instructions for configuring failover

  1. Copy the Pipeline server files and preferences file to two different hosts, say Host A and Host B. The persistence database needs to be on a third host (Host C).
  2. Put the hostname and persistence information into the server preferences. Under the Advanced tab, check Enable failover, provide the alias interface, and change other parameters if needed.
  3. Start the persistence database on Host C.
  4. Start the Pipeline server on Host A. It is the master server.
  5. Start the Pipeline server on Host B. It is the slave server. When Host A goes down, Host B will take over.

3.2.5.2 Log email

The Pipeline server can send log messages via email. Specify the recipient email addresses (separate multiple entries with commas), the sender, and the SMTP host. The Pipeline server will not send more than 1 message within 10 minutes; if there are many log messages, it will combine them and send one email every 10 minutes.

3.2.5.3 Network

The packet size is the size in bytes of the Pipeline communication packets between the server and the client. The timeout specifies the timeout in seconds of the Pipeline communication protocol.

3.2.5.4 Maximum number of threads for active jobs

As the server becomes busier, at times users may submit more jobs at once than the server has capacity to handle. You can set a maximum number of threads for active jobs to prevent this. Active jobs are jobs in the submitting, queued, or running state. When the number of active jobs reaches the maximum limit, the server will put new jobs into a backlogged queue. When a slot becomes available for execution, the first job in the backlogged queue will be submitted. For grid setups, you should probably set the limit higher than the number of compute nodes available to you, because submitting to the grid takes a non-negligible amount of time, and it’s best to keep your compute nodes crunching at all times.

3.2.5.5 HTTP query server

The Pipeline server provides an API for querying workflow data, including the session list, session status, and output files. It is helpful when you (or your program) want to query workflows on a Pipeline server without the need for a Pipeline client. Please note: once enabled, it does not require any login authorization to see any workflows on the server. By default, this feature is not enabled on the Pipeline server. To enable it, check Enable HTTP query server under the Advanced tab and enter a port number, for example 8021.

When the server is running, you can go to http://cerebro-rsn2.loni.usc.edu:8021/ and it shows an XML file listing all the APIs. Currently there are five functions:

  • getSessionsList
  • getSessionWorkflow
  • getSessionStatus
  • getInstanceCommand
  • getOutputFiles

getSessionsList returns all the active sessions on this Pipeline server. It does not take any arguments, and the query URL looks like this:
http://cerebro-rsn2.loni.usc.edu:8021/getSessionsList

The Pipeline server returns an XML file listing all the active sessions, with their session IDs.

<sessions count="1">
<session>
cerebro-rsn2.loni.usc.edu:8020-453da129-c81b-4473-9fc0-8fe03481e492
</session>
</sessions>

getSessionWorkflow returns the workflow file (.pipe file). It takes session ID as argument. The query URL looks like this:

http://cerebro-rsn2.loni.usc.edu:8021/getSessionWorkflow?sessionID=cerebro-rsn2.loni.usc.edu:8020-453da129-c81b-4473-9fc0-8fe03481e492

getSessionStatus returns the status of the workflow execution: when it started, whether it has finished, what time it finished, what nodes and instances are in this workflow, and, for each node, whether it finished successfully. The query URL looks like this:

http://cerebro-rsn2.loni.usc.edu:8021/getSessionStatus?sessionID=cerebro-rsn2.loni.usc.edu:8020-453da129-c81b-4473-9fc0-8fe03481e492

getInstanceCommand returns the command of the execution. It takes a session ID, node name (which can be found by calling getSessionStatus), and instance number (which can also be found by calling getSessionStatus). The query URL looks like this:

http://cerebro-rsn2.loni.usc.edu:8021/getInstanceCommand?sessionID=cerebro-rsn2.loni.usc.edu:8020-453da129-c81b-4473-9fc0-8fe03481e492&nodeName=BET_0&instanceNumber=0

getOutputFiles returns the paths of output files generated by a node. It takes a session ID, node name, instance number, and parameter ID. The query URL looks like this:

http://cerebro-rsn2.loni.usc.edu:8021/getOutputFiles?sessionID=cerebro-rsn2.loni.usc.edu:8020-453da129-c81b-4473-9fc0-8fe03481e492&nodeName=BET_0&instanceNumber=0&parameterID=BET.OutputFile_0
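
Since these are plain HTTP GET requests, you can script against the API without a Pipeline client. Below is a minimal sketch using curl; the hostname, port, and session ID are the example values from above, and the URLs are quoted so the shell does not interpret the & separators:

# List all active sessions (returns the XML session list)
curl "http://cerebro-rsn2.loni.usc.edu:8021/getSessionsList"

# Query the status of one of the listed sessions
curl "http://cerebro-rsn2.loni.usc.edu:8021/getSessionStatus?sessionID=cerebro-rsn2.loni.usc.edu:8020-453da129-c81b-4473-9fc0-8fe03481e492"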

3.2.5.6 Automatically clean up old files

If Automatically clean up old files in temporary directory is checked, then any sessions in the temporary directory that are older than two times the ‘days to persist status’ value will be removed. This will not happen under normal circumstances, because the persistence database keeps track of all sessions, and no temporary directories older than the ‘days to persist status’ value should exist. It happens in rare situations, such as when the Pipeline server restarts with its persistence database manually deleted, or when the temporary directory was used by other programs.

3.2.5.7 Maximum number of metadata threads

This specifies the maximum number of metadata generator threads used in workflows with a Study module.

3.2.5.8 Warn when free disk space is low

If enabled, Pipeline will warn when the free space of the temp and home directories falls below the specified percentage.

3.2.5.9 Server status

The server status refresh interval specifies how often the server checks grid usage counts and sends the information to the clients. All connected clients will receive the server status information.

3.2.5.10 Directory source recursive timeout

This specifies the maximum time, in seconds, allowed for the recursive directory source listing feature. When a user specifies a directory that contains many subdirectories and files, the recursive directory source listing may take a long time. This timeout value limits the time allowed for a recursive listing on a directory.

3.2.5.11 External network access queue

This specifies a queue that has external network access enabled on its compute nodes. Some modules may require external network access, and the general queue for the Pipeline server may not have it. By setting up a special queue with external network access, users can submit modules with external network access checked in their Pipeline client, and the server will submit those jobs to that queue.

3.2.5.12 Validation warning

When there is a missing input file or executable path in a workflow and the user tries to submit the workflow, Pipeline gives a validation error and asks the user to correct the invalid file. But in some circumstances the file may not exist at validation time and only become valid later, and as a result the workflow can’t be used as is. Validation warning solves this issue: if enabled, such an invalid path results in a warning message, and the user can proceed to submit the workflow by dismissing the warning.

3.2.5.13 NFS directories for validation

If the Pipeline server is used in a grid environment, a shared file system such as NFS is often used. If a user specifies a valid file under a non-NFS directory, it will pass validation but likely cause problems, as the file is not accessible on the compute nodes. This issue can be solved by specifying all NFS directories of the system; validation will then catch files that are not under these specified directories.

3.2.5.14 Check and verify output files

If this option is enabled, the Pipeline server will check for the existence of output files after a module completes successfully. If an output file is missing even though the executable exited normally (return value 0), Pipeline will report the module as failed and will not continue to the subsequent modules.

3.2.5.15 Test server library

Pressing the button generates instructions for testing the server library workflows. It uses the command line interface to batch-validate workflows, so you know if there are any issues with your server library workflows.

  1. Distributed Pipeline Server Installation Utility
    1. Requirements
    2. Warning
    3. Downloading
    4. GUI
      1. Start the Installer
      2. Select Components
      3. Install Grid Engine
      4. Install Pipeline
      5. Install Neuro Imaging Tools
      6. Install Neuro Bioinformatics Tools
      7. Finish Install
      8. Start the Server
    5. Command Line Installation
    6. Troubleshoot
  2. Conventional Installation (without DPS utility)
    1. Requirements
    2. Downloading
    3. Starting the server

2.1 Distributed Pipeline Server Installation Utility

The Distributed Pipeline Server Installer is a GUI installer that allows you to install and configure 3 types of resources – backend grid management resources (Grid Engine), the Pipeline server, and a number of computational imaging and informatics software tools. After successfully running the installer, you will have a running Pipeline server with grid engine managing jobs on your machine(s), imaging and informatics software tools installed, as well as a set of predefined workflows and modules in your server library.

2.1.1 Requirements

The requirements for the Pipeline server installation can be found on Distributed Pipeline Server Installer page.

Warning: If any of the requirements are not met, there may be unexpected behavior in the installer (e.g. hanging, crashing). If you have any questions, please contact pipeline@loni.usc.edu

A complete installation (including grid engine, the Pipeline server, and all software tools) can take several hours. However, this is mostly because some of the tools take a long time to download (e.g. FSL can take up to 6 hours, depending on your internet speed). If you skip the tools or have already downloaded the ones that require manual download, the total installation time is less than 30 minutes.

2.1.2 Warning

When you run the DPS installation utility to install the Pipeline server, the underlying scripts will edit the firewall rules to open up the Pipeline port for connections from clients. Be forewarned that these changes can cause unexpected results on your system. We recommend backing up your iptables before starting the installation. In the future, this automatic configuration step will be made more robust.

2.1.3 Downloading

Download the installer from the Pipeline website, under Downloads > Distributed Pipeline Server Installer.

2.1.4 GUI

The graphical interface of the DPS utility simplifies the installation experience for the user by hiding unessential details and only asking the user for minimal configuration preferences. The steps are documented below and are accompanied by screenshots.

2.1.4.1 Start the Installer

To start the installer, open a terminal, change directories to the directory where the installer file is located, and type

su root (how to become root)
tar -zxvf pipelineServerInstaller.tar.gz
cd pipelineServerInstaller
./launchInstaller.sh

2.1.4.2 Select Components

After reading and agreeing to the license, you will be asked for an installation location and what components you want to install:

You can select any* or all of the components; the installer will guide you through all the steps needed for the installation.

* For example, if you have already installed SGE before launching this installer, then deselect the Oracle Grid Engine component. Likewise, if you only want to install the latest tools, you can select the Neuro Imaging Tools component and uncheck the rest.

The installer will verify the Shared File System Location you provide. It must be on NFS if the server is set to use a grid. The shared file system is used by the Pipeline server to store intermediate files of workflows and to install Grid Engine and Tools.

2.1.4.3 Install Grid Engine

In this section you can configure the Grid Engine installation. You can specify an installation location, cluster name (which uniquely identifies a specific Grid Engine cluster), spool directory (for spooling data), and execution hosts (hosts that execute the tasks/jobs). You can leave the installation location, cluster name, and spool directory as they are, but you must provide a list of hostnames. You must provide fully qualified domain names, so something like “host1”, “localhost” or “127.0.0.1” is not allowed.

2.1.4.4 Install Pipeline

In this section you can configure the Pipeline server. You can specify the installation directory, Pipeline server address, port, and the user to run the Pipeline server process. The username must already exist, and you have the option to have the sudoers file modified to accommodate privilege escalation.

User authentication lets you specify the authentication mechanism for the Pipeline server. If you already have NIS configured (there are plenty of online help resources, e.g. configure NIS server and client), it’s recommended to select the NIS option. Otherwise, you can select the SSH Based option, which runs the ssh command to test the provided credentials. You can also choose No Authentication to let anybody connect to your server. This option should only be used for testing and on a server with limited internal network access.

If the modify sudoers file option is selected, the installer will modify the operating system’s sudoers file so that the Pipeline server user will be able to sudo as any user, except root and the optional list of users provided. For example, if you have some user that can sudo as root, then this user should be listed as an exception, so that the Pipeline user will not be able to gain root access.

Install Pipeline with SGE already installed

If you already have SGE installed and the SGE_ROOT variable is defined on your system, you can skip SGE installation by unchecking the Oracle Grid Engine checkbox from step 3 (General Configuration). The Pipeline configuration window will now have an additional checkbox to “Enable Grid Submission” which needs to be selected if you want to use Pipeline with your pre-installed SGE.

Upon checking the “Enable Grid Submission” checkbox, you will need to select a grid plugin. In order to communicate with SGE, Pipeline uses Grid Plugins. LONI provides two plugins for SGE: the JGDI Plugin and the DRMAA Plugin. If you are using SGE, we highly recommend the JGDI Plugin, as it supports more Pipeline features and is more reliable. Choose the DRMAA Plugin if you have another DRMAA-supported grid manager installed and want to integrate Pipeline with it.

The last step is to choose the submission queue. The installer will list all of your available queues, and you have to pick one for Pipeline. If you don’t have a special queue already set up for Pipeline, you can use the default SGE queue (all.q). If you do not have any queues defined in SGE, you will have to create one yourself, as sketched below.
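
If you do need to create a queue, a minimal sketch using standard SGE commands follows (the queue name pipeline.q is only an example):

qconf -sql             # list the cluster queues that already exist
qconf -aq pipeline.q   # add a new queue (opens an editor pre-filled with a queue template)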

Installing Pipeline without SGE

If you don’t have SGE installed and you uncheck the Oracle Grid Engine checkbox in step 3 (General Configuration), the installer will install Pipeline without Grid Engine. All jobs submitted to the Pipeline server will run locally on the server. You have to be careful with the number of jobs submitted to the server, as a high number of jobs will negatively affect the server’s performance. Please see Maximum number of threads for active jobs if you want to set limits on the number of parallel running jobs.

2.1.4.5 Install Neuro Imaging Tools

In this section you can select which imaging software tools and server library files to install.

There are two components that can be selected for each NeuroImaging tool:
     • the tool itself (binaries, executables, and scripts)
     • the modules/workflows (.pipe files) associated with that tool.

You may select either or both options for any tool, but please note that you can only install workflows for tools that are already installed or that you have selected to install. If you select to install the workflows for a tool but not the tool itself, and the tool cannot be found in the default installation directory (shared file system path + “tools”), then you will be prompted to provide the location where that tool is installed. If you find yourself here by mistake, click back and modify your selection.

If the installation type for a tool is “Automatic”, it will be installed automatically without the need for user input. Some tools are marked as “Semi-Automatic” (e.g. FSL and FreeSurfer), which means that they require you to manually download the installer files for that tool from the developer’s website. This is because of the licensing restriction imposed on the software. For these types of tools, you will be shown a window after clicking ‘Next’ which contains instructions on what website to visit, which files to install, and any other requirements for that tool.

When you satisfy all the requirements for a tool, it will begin installing in the background immediately. A green check mark will appear next to that tool in the drop-down menu, indicating that you have provided the necessary information and can move on to the next tool. You may preemptively cancel the installation of a tool by clicking ‘Don’t install’ at the bottom of the window. When all tools are either installing or cancelled, this window will close automatically.

Install the tools without installing Pipeline or SGE

If, at a later time, you want to install updated versions of some tools, you can install them without installing the Pipeline and/or SGE. Simply check only the NeuroImaging Tools in the general configuration section of the installer, then click Next; it will go directly to the tools installation step, skipping the Pipeline and SGE installation steps.

Please note that NeuroImaging tools can only be installed if you also selected to install the Pipeline server or already have the server installed. If you select to install these tools without selecting to install the server, and the preferences.xml file cannot be found in its default location, a browse button will appear so that you can provide the location of your preferences.xml file. If you don’t have a preferences file, it means you have not installed the server yet, and the server should be selected during the installation process.

2.1.4.6 Install Bioinformatics Tools

The process for installing Bioinformatics Tools is the same as for NeuroImaging (outlined in the previous step), except there are currently no “Semi-Automatic” tools in this section. Note that this is the final step before the Pipeline installation utility takes over and starts to download/install files, so only hit ‘Install’ if you are sure that all of your previous settings are correct.

Install the tools without installing Pipeline or SGE

Just as with NeuroImaging tools, Bioinformatics tools can only be installed if you also selected to install the Pipeline Server or already have the server installed. If you select to install these tools without selecting to install the server, and the preferences.xml file cannot be found in its default location, a browse button will appear so that the location to your preferences.xml file can be provided. If you selected NeuroImaging tools as well and already indicated the path to your preferences file in the NI Tools Configuration panel, you will not see this button.

2.1.4.7 Finish Install

After the installation has successfully completed, you will be shown a summary screen. Clicking the Finish button with “Start the LONI Pipeline Server” checked will exit the installer and launch the Pipeline server. You can also check the “Start Client to validate the installation” option to launch the client and test run a workflow.

Additionally, you may want to configure advanced server preferences by clicking on “Configure the server with advanced options…”. This will automatically open the server configuration tool, where you can edit the details of your server.

If you have any questions, please contact pipeline@loni.usc.edu

2.1.4.8 Start the Server

If you checked the “Start the LONI Pipeline Server” option on the summary page of the installation, the Pipeline server process will be started. To check the logs of the Pipeline server, go to the Pipeline server’s directory (/usr/pipeline by default), specified in the Install Pipeline step. You will find files called outputStream.log and errorStream.log, which store the output and error stream information. You can verify that the server started successfully by checking the contents of the outputStream.log file. It should look something like this:

[ 1/6 ] Connecting to Persistence Database..............DONE [117ms]
[ 2/6 ] Starting server on port 8001....................DONE [1152ms]
[ 3/6 ] Loading server library..........................DONE [31ms]
[ 4/6 ] Loading server packages info....................DONE [7ms]
[ 5/6 ] Checking to resume backlogged workflows.........DONE [0ms]
[ 6/6 ] Checking to resume active workflows.............DONE [0ms]
[ SUCCESS ] Server started.

You can stop and start the Pipeline server by calling (root access required):

/etc/init.d/pipeline stop
/etc/init.d/pipeline start

The Pipeline and persistence database will be started/stopped in order, and the pipeline user will run these processes.

If you don’t have root access, you can stop and start the Pipeline server as the pipeline user; this is equivalent to the init.d method above. To stop and start the Pipeline server, go to the Pipeline server’s directory and type:

./killServer.sh
./launchServer.sh

Always check that the server has started successfully by viewing the outputStream.log file. If it shows a persistence database error, you can stop and start the persistence database process by typing:

./db/stopDB.sh
./db/startDB.sh

After the persistence database has been restarted, restart the Pipeline server as noted above.

2.1.5 Command Line Installation

An alternative to using the GUI to install the Pipeline server is an automated method that relies on a configuration file. All of the fields that are entered via the GUI are represented within a hierarchical XML file. A default configuration file is included in the distribution directory (dist/install_files) of the installer, which you can download here. After you set up your configuration file, you can run the installation in automatic mode by typing the following into your shell:

tar -zxvf pipelineServerInstaller.tar.gz
cd pipelineServerInstaller
./launchInstaller.sh -auto dist/install_files/DefaultInstallationPreferencesFile.xml

A complete template for the XML file can be found here. If you use this template as a starting point, note that it has a lot of placeholders and is not set up to run “as is”, so you will have to make many modifications; a minimal sketch is shown after the tag list. For reference, each of the tags is documented below:

  • DistributedPipelineServerInstaller: root tag, contains all other tags
  • SharedFileSystemPath: path to a directory that is shared (via NFS) between the host running the Pipeline server and qmaster, admin, and execution hosts of SGE
  • JDKLocation: only include this tag if you don’t already have Oracle JDK running on the host where you’re installing the Pipeline server; the value should be the path to the JDK RPM, which you can install from the Oracle page
  • PipelineServer: use attribute enabled=”true” to indicate that you would like to install the Pipeline server; the children of this element will specify information about the server installation
    • InstallLocation: specifies location where Pipeline server is to be installed
    • Hostname: specifies the hostname of the host where Pipeline server is being installed
    • Port: specifies port on which the Pipeline server will be accepting connections from clients
    • Username: specifies user that will be running the Pipeline server
    • TempDir: specifies a directory where Pipeline modules will write intermediate files
    • ScratchDir: specifies a scratch directory where sample workflows will write their outputs; this value then becomes available to users through the pre-defined ${tempdir} variable, documented here
    • GridSubmission: use attribute enabled=”true” to indicate that you would like the Pipeline to submit jobs via grid engine to execution hosts; otherwise, the jobs will be run locally on the host running the Pipeline server
      • GridPlugin: options are JGDI or DRMAA
      • GridSubmissionQueue: the SGE queue where Pipeline should submit its jobs
    • UsePrivilegeEscalation: options are true or false; privilege escalation is documented here
    • DBInstallLocation: path to a directory where you would like to install the Pipeline database; if it doesn’t exist, it will be created by the installer
    • StartPipelilneOnSystemStartup: set value to true if you would like to configure the system to start the Pipeline server on startup; false, otherwise
    • AuthenticationModule: options are SSH, NIS, and NoAuth; these are documented here
    • ModifySudoers: use attribute enabled=”true” to indicate that you want to add the Pipeline user to the sudoers list
      • SuperUsers: comma-separated list of users that you don’t want the Pipeline server to sudo as (default: root)
    • MemoryAllocation: specify the amount of memory you would like to allocate to the Pipeline server/database, in megabytes
  • PreferencesPath: if the Pipeline server is not being installed (i.e., the PipelineServer element is missing or has attribute enabled=”false”), then the user must specify the path to the Pipeline server preferences file (by default, the path is /usr/pipeline/preferences.xml); if the Pipeline server is being installed, you can omit this element.
  • SGE: use attribute enabled=”true” to indicate that you would like to install Son of Grid Engine; the tags that follow will describe some of the preferences for the installation; you can find documentation on SGE here
    • SGERoot: path to directory where you would like to install SGE (default: /usr/local/sge)
    • SGECluster: name of cluster that you would like to install (default: cluster)
    • SubmitHosts: specify hostnames of machines which will be configured to handle job submission and control; you can do this using one hostname per Host element, as children of the SubmitHosts element
    • ExecHosts: specify hostnames of machines which will be execution hosts; use same format as for SubmitHosts
    • AdminHosts: specify hostnames of machines that will be used for SGE administration purposes; use same format as for SubmitHosts
    • AdminUsername: user that will serve as SGE administrator
    • SpoolDir: path to a directory that will be used for spooling during installation
    • Queue: use attribute configure=”true” to indicate that you would like to configure a queue at the end of SGE installation; this is documented here
      • Name: the name of the new queue that you would like to configure
      • Hosts: the hosts that you would like to add to the queue
      • Slots: the slots that you would like to add to the queue (the difference between hosts and slots is documented here)
  • Tools: use the attribute enabled=”true” to indicate that you would like to install some tools; also use the path attribute to specify the directory where you would like to install the tools (note that this should be in an NFS-shared directory)
    • NeuroImagingTools: use the attribute enabled=”true” to indicate that you would like to install one or more NeuroImaging tools; true/false values for the all_executables and all_serverlibs tags indicate that you want to install the executables and/or .pipe files for all NeuroImaging tools, regardless of what values each tool is set to.
      • Available neuroimaging tools: AFNI, AIR, BrainSuite, FSL, FreeSurfer, LONI, MINC, ITK, DTK, GAMMA, and SPM; for each of these, the executables=”true” attribute is used to activate the tool installation and the serverlib=”true” attribute is used to activate the .pipe files for that tool; note that FSL, FreeSurfer, and DTK require that the user specify a sub element, namely ArchivePath, whose value is the path to the archive file, downloaded manually from the software website.
    • BioinformaticsTools: same attributes as NeuroImagingTools tag
      • Available bioinformatics tools: EMBOSS, Picard, MSA, BATWING, BayesAss, Formatomatic, GENEPOP, Migrate, GWASS, MrFAST, Bowtie, SamTools, PLINK, MAQ, miBLAST; again, the enabled attributes can be used to indicate activation or deactivation of installation for each of these elements
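
To make the structure concrete, here is a minimal sketch assembled from the tags documented above. It configures only the Pipeline server, with grid submission against an existing SGE queue; the hostname, paths, and values are placeholders, and the exact set of required elements should be verified against the shipped template:

<DistributedPipelineServerInstaller>
  <SharedFileSystemPath>/nfs/pipeline-shared</SharedFileSystemPath>
  <PipelineServer enabled="true">
    <InstallLocation>/usr/pipeline</InstallLocation>
    <Hostname>pipeline.example.com</Hostname>
    <Port>8001</Port>
    <Username>pipeline</Username>
    <TempDir>/nfs/pipeline-shared/temp</TempDir>
    <GridSubmission enabled="true">
      <GridPlugin>JGDI</GridPlugin>
      <GridSubmissionQueue>all.q</GridSubmissionQueue>
    </GridSubmission>
    <AuthenticationModule>SSH</AuthenticationModule>
  </PipelineServer>
</DistributedPipelineServerInstaller>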

2.1.6 Troubleshoot

The following is a list of common problems and explanations:

– The provided directory seems not to be a network file shared (NFS) directory.
The installer will verify the Shared File System location given. It is required to be on NFS if the server is set to use a grid. The shared file system is used for the Pipeline server to store intermediate files of workflows and to install Grid Engine, NeuroImaging, and Bioinformatics Tools.

– For a Grid Engine installation, the local hostname cannot be “localhost” and/or the IP address is like 127.0.*.*
You must provide fully qualified domain names as hostnames (e.g., “host1.example.com”); names like “host1”, “localhost”, or “127.0.0.1” are not allowed.

– Cannot enable Grid submission as SGE doesn’t have any queue.
If you do not have any queues defined in SGE, you have to create one yourself, then recheck the “Enable Grid Submission” checkbox and select the queue.

– Why can’t I connect to the server?
If you have the Pipeline server running but your client can’t connect to it (it shows a “Server not found” message), check your firewall settings and enable port 8001.
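
On a typical Linux server using iptables, opening the port might look like the following sketch (adapt it to your distribution; the save step is RHEL/CentOS-style):

iptables -I INPUT -p tcp --dport 8001 -j ACCEPT   # accept client connections on the Pipeline port
service iptables save                             # persist the rule across reboots (RHEL/CentOS)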

– Why is my first workflow taking so long?
When you have SGE installed and you submit jobs for the first time, it may take a long time for the jobs to start running. This is because SGE initially sees the compute nodes as heavily loaded; as time passes, the load information is updated more accurately.

2.2 Conventional Installation (without DPS utility)

If you’d like to install the Pipeline server by hand, here are some instructions on how to get started. Note that if you choose this route, you’ll have to carry out quite a bit of configuration on your own. This is only recommended if you’ve done it before or have a thorough understanding of the inner workings of the Pipeline server. Otherwise, use the DPS utility.

2.2.1 Requirements

The Pipeline server can run on any system that is supported by JRE 1.6 or higher, so the first thing to do is head over to the official Java website to download the latest JRE/JDK. If you run the server on Windows, you will not be able to use privilege escalation (you might not even need/want it). Also, the failover feature is only supported on Unix/Linux systems. All other features are available on all platforms.

The amount of memory required varies based on the load you expect on the server, but for a reference point, as of summer 2010, the main Pipeline server running at LONI has been set to accept a maximum load of 620 jobs, and its memory footprint hovers between 50 and 300 MB depending on the load and garbage collection scheme.

2.2.2 Downloading

Head over to the Pipeline download page and download the latest version of the program for Linux/Unix. The server and the client are both in the same jar file, so you only need to change the Main entry point when starting up the server. Extract the contents of the download to the location where you want to install the server.

2.2.3 Starting the server

Now let’s start the server for the first time. Get to a prompt and switch to the directory where you copied the Pipeline.jar and lib directory and type:

$ java -classpath Pipeline.jar server.Main

Assuming you have java in your path, you should see output like the following in your terminal window:

[ 1/6 ] Connecting to Persistence Database..............DONE [61ms]
[ 2/6 ] Starting server on port 8001....................DONE [747ms]
[ 3/6 ] Loading server library..........................DONE [336ms]
[ 4/6 ] Loading server packages info....................DONE [2ms]
[ 5/6 ] Checking to resume backlogged workflows.........DONE [46ms]
[ 6/6 ] Checking to resume active workflows.............DONE [0ms]
[ SUCCESS ] Server started.

That’s not enough for a fully functional server yet, but we’re a step closer. Go ahead and break out of the process by hitting Ctrl-C, and then let’s begin the configuration process.
