

                  COMMAND LINE USERS MANUAL
                  
                     stackPACK(tm) v2.2
        Clustering, Alignment and Expression Analysis System


 ==============================================================================
 TABLE OF CONTENTS
 ==============================================================================

1. stackPACK Conventions
2. Input Data Format
    2.1 GenBank flatfile format
    2.2 FASTA format
        2.2.1  Simple FASTA format 
        2.2.2  STACK FASTA format 
        2.2.3  NCBI FASTA format
        2.2.4  Mixed or Unknown FASTA format 
3. Running stackPACK
    3.1  Setting the Environment
    3.2  Managing Data
    3.3  Creating a Project
        3.3.1 Creating a project from the command line
        3.3.2 Creating a project using the menu system
    3.4 Importing Data 
    3.5 Masking Data
        3.5.1 Masking Parameters   
    3.6 Clustering Data
        3.6.1 Cluster Parameters
    3.7 Assembling Data
        3.7.1 Assembly Parameters and Configuration
    3.8 Analyzing Data
        3.8.1 Analysis Configuration
    3.9 Linking Data
    3.10 Incremental Addition of Data
    3.11 Undoing steps in the pipeline
    3.12 Restarting steps in the pipeline
    3.13 Handling projects with the same name
    3.14 Converting stackPACK v2.1 projects to v2.2
4. Web-based Viewing and Output of Data
5. Exporting Data from the Command Line
6. Expert Configuration
7. Support/Questions
8. About stackPACK
 
 ==============================================================================



1. stackPACK Conventions

The following conventions are used in the stackPACK command line interface
and in this Command Line Users Manual:

- When usage is provided for command line programs, insert valid operations 
  in command line statements using the following conventions:

    <operation>             Compulsory command line operations
    [operation]             Optional command line operations
    [operation1|operation2] Choose operation1 OR operation2
 
- When using stackPACK from the command line, the user can view the
  command line usage of a program by typing the program name without any 
  operations or parameters specified. (e.g. stack_ProjectManager)



2. Input Data Format

The stackPACK system accepts input in GenBank flatfile format or FASTA format.
In addition to these, stackPACK also accepts PHRED quality score file formats.


2.1 GenBank flatfile format

GenBank flatfile format is defined as the format of sequence entries in the
GenBank database or as downloaded from the NCBI web site (e.g. Entrez search 
results) when GenBank format is specified. The full GenBank format specification
is found in section 3.4 of the GenBank release notes. Use GenBank format 
when you want the maximum amount of annotation information parsed on import.
 

2.2 FASTA format

The stackPACK system accepts FASTA files with several different header line 
formats. The system attempts to parse annotations such as direction, clone ID 
and library name from the FASTA header line when available and in a recognized 
format.


2.2.1 Simple FASTA format (SIMPLE)

>[accession].[direction] [clone ID] 

Where direction is either "r1" for a 3-Prime clone or "f1" for a 5-Prime clone.

e.g.       >37463.f1 g83244
           ACGTGACTGCTACGTACGGGCGTTACGACTGCTACGATCGCATGC
           TATGTCGTAGCAGCCGTGTACACGTGTTTATTCGTAGGGCTTCTA
           CTCAGTCGTACGTACGTACGT

Note: A sequence with an accession like "R.C.3746" cannot be specified as 
      Simple FASTA format, as the program will parse "R" for the accession and 
      "C" for the direction. For accessions beginning with "R.C.", the GUESS 
      option should be used.
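
The SIMPLE header layout, and the pitfall described in the note above, can be
sketched in Python. This is an illustration only and not stackPACK's own
parser: the function name parse_simple_header and the rule of splitting the
first token at its dots are assumptions inferred from the note.

```python
def parse_simple_header(line):
    """Split '>[accession].[direction] [clone ID]' into its three parts.

    Hypothetical helper: the first token is split at its dots, and the
    first two pieces are taken as accession and direction.
    """
    fields = line.lstrip(">").split()
    parts = fields[0].split(".")
    accession = parts[0]
    direction = parts[1] if len(parts) > 1 else None
    clone_id = fields[1] if len(fields) > 1 else None
    return accession, direction, clone_id


print(parse_simple_header(">37463.f1 g83244"))   # ('37463', 'f1', 'g83244')
print(parse_simple_header(">R.C.3746 g83244"))   # ('R', 'C', 'g83244')
```

The second call shows why an accession such as "R.C.3746" is misread under
the SIMPLE layout.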


2.2.2 STACK FASTA format (STACK)

>[accession] [gi] | [accession] CLONE: [clone] CLONE_LIB: [clonelib] LEN: [len] 
FILE [source file] [direction<5-PRIME|3-PRIME>] DEFN: [descriptive text] 

e.g.       >T27877 g609975 | T27877 CLONE: 17194 CLONE_LIB: Human Eye LEN: 505
            bp FILE gbest3.seq 5-PRIME DEFN: EST19137 Homo sapiens cDNA 5'end
            ACGTGACTGCTACGTACGGGCGTTACGACTGCTACGATCGCATGC
            TATGTCGTAGCAGCCGTGTACACGTGTTTATTCGTAGGGCTTCTA
            CTCAGTCGTACGTACGTACGT
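
As a rough illustration of the fields carried by a STACK header, the layout
above can be matched with a regular expression. The pattern and the function
name parse_stack_header are assumptions written for this manual; the sketch
also assumes the header occupies a single line in the input file (it is
wrapped above only for display).

```python
import re

# Hypothetical pattern derived from the header layout shown above.
STACK_HEADER = re.compile(
    r"^>(?P<accession>\S+)\s+(?P<gi>\S+)\s+\|\s+\S+\s+"
    r"CLONE:\s*(?P<clone>.*?)\s+CLONE_LIB:\s*(?P<clone_lib>.*?)\s+"
    r"LEN:\s*(?P<length>\d+)(?:\s*bp)?\s+FILE\s+(?P<source_file>\S+)\s+"
    r"(?P<direction>5-PRIME|3-PRIME)\s+DEFN:\s*(?P<definition>.*)$"
)


def parse_stack_header(line):
    """Return the header fields as a dict, or None if the line does not match."""
    match = STACK_HEADER.match(line.strip())
    return match.groupdict() if match else None


header = (">T27877 g609975 | T27877 CLONE: 17194 CLONE_LIB: Human Eye "
          "LEN: 505 bp FILE gbest3.seq 5-PRIME DEFN: EST19137 Homo sapiens "
          "cDNA 5'end")
fields = parse_stack_header(header)
print(fields["clone"], fields["clone_lib"], fields["direction"])
```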


2.2.3 NCBI FASTA format (NCBI)

As retrieved from NCBI's Entrez when the FASTA option is selected. 
A basic description can be found at http://www.ncbi.nlm.nih.gov/BLAST/fasta.html

e.g.       >gi|4468770|emb|AJ009167.1|TSAJ9167 Trypanosoma sp. 18S rRNA gene, 
            isolate K&A
            ACGTGACTGCTACGTACGGGCGTTACGACTGCTACGATCGCATGC
            TATGTCGTAGCAGCCGTGTACACGTGTTTATTCGTAGGGCTTCTA
            CTCAGTCGTACGTACGTACGT

Note: Clone information in NCBI format is part of the free form text in the 
      definition line. Clone information is thus not parsed when using NCBI 
      format and therefore data imported in NCBI format cannot be clone linked.
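
The following sketch shows why clone information is unavailable in NCBI
format: the pipe-delimited identifier block carries only database
identifiers, and everything else is free text. The function name
parse_ncbi_header is hypothetical, and the sketch handles only the
"gi|...|db|accession|locus" layout of the example above; other NCBI header
layouts exist and are not covered.

```python
def parse_ncbi_header(line):
    """Split an NCBI header into identifier fields and free-form definition."""
    token, _, definition = line.lstrip(">").partition(" ")
    parts = token.split("|")
    if len(parts) < 4 or parts[0] != "gi":
        return None  # not the layout this sketch understands
    return {
        "gi": parts[1],
        "database": parts[2],
        "accession": parts[3],
        "locus": parts[4] if len(parts) > 4 else "",
        # Any clone name is buried somewhere in this free text.
        "definition": definition.strip(),
    }


record = parse_ncbi_header(
    ">gi|4468770|emb|AJ009167.1|TSAJ9167 Trypanosoma sp. 18S rRNA gene, isolate K&A"
)
print(record["accession"], "|", record["definition"])
```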


2.2.4  Mixed or Unknown FASTA format (GUESS)

 - Files with mixed FASTA header line formats or files with FASTA header lines
   not described above can also be imported.

 - If stackPACK does not identify one of its pre-defined FASTA headers, the 
   program determines an accession number for the entry by extracting all 
   valid characters (alphanumeric and punctuation such as "_" and ".") found 
   between the > and the first space. If the resulting accession is longer 
   than 255 characters, the sequence is not imported. Where possible, other 
   details on the line are parsed as well; otherwise, the remainder of the 
   line is ignored. 

 - Users should therefore ensure that the first 255 characters of each header 
   line entered for processing are unique.  
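
The fallback rule described in the list above can be sketched as follows.
This is not stackPACK's implementation: the function name guess_accession is
hypothetical, and the set of valid characters is assumed to be letters,
digits, "_" and "." only, whereas the manual says "punctuation such as", so
the real set may be larger.

```python
import re

MAX_ACCESSION_LENGTH = 255
# Assumption: anything other than letters, digits, "_" and "." is dropped.
INVALID_CHARS = re.compile(r"[^A-Za-z0-9_.]")


def guess_accession(header):
    """Return the accession for a header line, or None if it is unusable."""
    token = header.lstrip(">").split(" ", 1)[0]
    accession = INVALID_CHARS.sub("", token)
    if not accession or len(accession) > MAX_ACCESSION_LENGTH:
        return None  # such a sequence would not be imported
    return accession


print(guess_accession(">my_seq.1 any other text on the line"))  # my_seq.1
print(guess_accession(">" + "A" * 256))                         # None
```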

The minimum requirement for a FASTA format input file to stackPACK is:
  >[accession number]
   


3. Running stackPACK


3.1  Setting the Environment

Once stackPACK is installed, you must set your path for the location of the 
stackPACK programs. This allows you to work from anywhere on the system.  If 
stackPACK has been installed in the standard location, type the following
at the system prompt to set your paths:

C-shell users:
setenv PATH ${PATH}:/usr/local/stackpack/bin
setenv LD_LIBRARY_PATH /usr/local/stackpack/lib

Bash-shell users:
export PATH=$PATH:/usr/local/stackpack/bin
export LD_LIBRARY_PATH=/usr/local/stackpack/lib

You can verify whether or not your paths have been set correctly by 
typing the following from anywhere on the system (e.g., home directory):
  which stack_ProjectManager
You should see:
  /usr/local/stackpack/bin/stack_ProjectManager
Then typing:
  which d2_cluster
You should see:
  /usr/local/stackpack/bin/d2_cluster
Then typing:
  echo ${LD_LIBRARY_PATH}
You should see:
  /usr/local/stackpack/lib

NOTE:  If your system administrator has installed stackPACK in a directory
       other than the standard one given in installation instructions, the paths 
       used above should be replaced by the actual path for the programs.


3.2  Managing Data

Each clustering run performed with stackPACK is associated with a single project. 
Projects are created and managed by the stack_ProjectManager program.
  
The stack_ProjectManager program consists of a number of operations which allow
the user to list, delete and create projects as well as to display summary 
information on specified projects. stack_ProjectManager also allows the 
conversion of stackPACK v2.1 projects to stackPACK v2.2 projects - this enables 
the user to continue data processing and analysis on a project created with 
stackPACK v2.1.  

Each operation has its own defined set of necessary parameters.

--------------------------------------------------------------------------------
Command:   stack_ProjectManager 

Usage:     stack_ProjectManager <operation> [parameters]

Valid operations, with their parameters are:
  -menu                                    run with a simple interactive menu 
  -list [owner] [description]              list all projects
  -summary <project> <owner>               display summary statistics on 
                                           specified project
  -delete <project> <owner>                delete specified project
  -create <project> [description] [owner]  create a new project
  -convert <project> <old_dsn> <old_dsn_login> <old_dsn_password> <1|0>   
                                           convert a v2.1 project to v2.2

NOTE:  <old_dsn> for stackPACK v2.1.1 = stacksys
       <old_dsn_login> for stackPACK v2.1.1 = stackpack
       <old_dsn_password> for stackPACK v2.1.1 = stackpack
  
-------------------------------------------------------------------------------

                                          
3.3 Creating a Project

A project must be created before the clustering pipeline can process data.
One project is normally created per data input file, and the project name 
usually reflects the type of data to be processed. Project names may consist 
of alphanumerics and underscores ("_"). Projects can be created from the 
command line or through the basic menu system.  

Note: stackPACK is NOT case-sensitive. Please be aware of this when naming 
projects.  For example, "UPPER" and "Upper" will be considered duplicate project
names if they have the same owner.
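
The naming rules above can be checked before a project is created. The
sketch below is an assumption-laden illustration, not part of stackPACK:
both function names are hypothetical, and owners are compared exactly
because the manual does not say whether owner names are case-sensitive.

```python
import re


def is_valid_project_name(name):
    """True if the name uses only alphanumerics and underscores."""
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None


def is_duplicate(name, owner, existing):
    """True if (name, owner) clashes with an existing project.

    existing is an iterable of (name, owner) pairs. Names are compared
    without regard to case; owners are compared exactly (an assumption).
    """
    return any(n.lower() == name.lower() and o == owner for n, o in existing)


projects = [("Upper", "liza")]
print(is_valid_project_name("testolf"))         # True
print(is_valid_project_name("test olf"))        # False
print(is_duplicate("UPPER", "liza", projects))  # True
print(is_duplicate("UPPER", "bob", projects))   # False
```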


3.3.1 Creating a project from the command line:
--------------------------------------------------------------------------------
Command:       stack_ProjectManager -create

Info:          creates projects

Usage:         stack_ProjectManager -create <project> [description] [owner]

Where:

Project =       Brief one-word project name.
                Multiple projects may be given the same name provided that they 
                are owned by different users.
                Valid characters are alphanumerics and underscore ("_").
Description =   One-line project description for your reference.  
                Put multi-word descriptions in quotations.
                Description is displayed in subsequent project listings.
                All characters are valid.
Owner =         Email address or name of owner of project
                Valid characters are alphanumerics, dot ("."), dash ("-"),
                underscore ("_") and "@".

Example: 
      stack_ProjectManager -create  testolf  "clustering olfactory data"  liza
--------------------------------------------------------------------------------

The above example returns the following from the system:

     Creating project: testolf
          Description: clustering olfactory data
                Owner: liza

     Created project 'testolf'

NOTE: If no project owner is specified, stackPACK will use the logged in 
      username as the project owner.


3.3.2 Creating a project using the menu system:

The same project can be created through the simple menu system of
stack_ProjectManager.  An example is given below. Comments on the right, preceded 
by an arrow, are for the user's information.

--------------------------------------------------------------------------------
 Command:  stack_ProjectManager -menu


   Stackpack Project Manager
   =========================

     1... List all projects
     2... Create a project

     9... Delete a project

     q... Exit project manager

    > 2                                       <---select #2 Create a project



    Create Project
     --------------
       1... Project Owner:       liza         <---default owner is the name of the 
                                                  person who is logged on
       2... Project Name:        
       3... Project Description: 

       c... Create Project

       q... Return to main menu

     > 1                                     <---select option 1 to enter project owner
     Project Owner: liza                     <---enter name of project owner

     Create Project
     --------------
       1... Project Owner:       liza
       2... Project Name:        
       3... Project Description: 

       c... Create Project
       q... Return to main menu

     > 2                                    <---select option 2 to enter project name
     Project Name: testolf                  <---enter project name

     Create Project
     --------------
       1... Project Owner:       liza
       2... Project Name:        testolf
       3... Project Description: 

       c... Create Project

       q... Return to main menu

     > 3                                    <---select option 3 to enter one-line 
                                                description
     Project Description: clustering olfactory data   <----enter description

     Create Project
     --------------
       1... Project Owner:       liza
       2... Project Name:        testolf
       3... Project Description: clustering olfactory data

       c... Create Project

       q... Return to main menu

 > c                                       <----IMPORTANT: after entering details, 
                                                  you must now type "c" to create 
                                                  your project

-------------------------------------------------------------------------------
     Creating project: 'testolf'...
     Project created successfully
-------------------------------------------------------------------------------

Once your project has been created successfully, type "q" to exit the project 
manager.


3.4 Importing Data 

The sequences from the input data file must be imported into stackPACK's 
database before the clustering engine can process them.  Data in GenBank or a 
range of FASTA formats may be imported. PHRED quality score files may also be 
imported.

Non-alphabetic characters (including * , and numbers) in the sequence
lines are automatically stripped out when the file is read in.
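
A minimal sketch of this clean-up step, assuming that "non-alphabetic" means
anything other than the letters A-Z in either case. The function name
clean_sequence_line is hypothetical and the code is not taken from
stack_Import.

```python
import re


def clean_sequence_line(line):
    """Drop every character that is not a letter (digits, '*', ',', spaces)."""
    return re.sub(r"[^A-Za-z]", "", line)


print(clean_sequence_line("  61 ACGTGACTGC*TACG, TACGG\n"))  # ACGTGACTGCTACGTACGG
```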

NOTE: 
 - The stack_Import output reports the number of sequences imported, as well 
   as the number of sequences that could not be imported into the project.
 - The stack_Import output indicates the progress of stack_Import. 
   
   e.g. ...(99/2600)..
   Each imported sequence is represented by a dot ("."), and "(99/2600)" 
   indicates that 99 sequences out of a total of 2600 sequences have been 
   imported.

--------------------------------------------------------------------------------
Command:        stack_ImportGenbank 

Info:           imports GenBank format input sequences


Usage:          stack_ImportGenbank <project> <Genbank File> [Organism]


Where:
Project      =  Brief one-word project name.
Genbank file =  Input data file name and path.
Organism     =  Organism under study. If specified, stackPACK imports only 
                sequences with that organism designation. If no organism 
                is specified, all sequences are imported. All or part of 
                an organism name, as it appears in GenBank, may be used. 
                If, for example, the term "apien" is used, all sequences 
                with "apien" somewhere in the GenBank flatfile ORGANISM 
                field, such as "Homo sapiens" and "homo sapien", will be 
                imported.

Example: 
 stack_ImportGenbank testolf /foo/gbest29.seq "Homo sapiens"
--------------------------------------------------------------------------------

The above example returns the following from the system:

 Importing Genbank data.
   Project:  testolf
   Filename: /foo/gbest29.seq
   Organism: Homo sapiens
   Importing: 2600 records

   .........................................................
   .........................................................

   stack_ImportGenbank completed.
   2600 imported.
   2600 sequences processed.
   Total sequences in project: 2600
 
--------------------------------------------------------------------------------
Command:       stack_ImportFasta  

Info:          imports FASTA format input sequences


Usage:         stack_ImportFasta <project> <FastaFile> [GUESS|SIMPLE|STACK|NCBI]
               stack_ImportFasta <project> <FastaFile> PHRED <PHRED Quality File>
                  

Where:
Project      = Brief one-word project name.
Fasta file   = Input data file name and path.
Format       = Type of FASTA format file input.  
               Options:  GUESS|SIMPLE|STACK|NCBI|PHRED
               Default format is GUESS.
               See section 2 for a more detailed description of each format.
Quality file = Input quality file name and path.
               Required when the PHRED format is specified; not valid with 
               any other format.

Examples:       
 stack_ImportFasta testolf /foo/olf.fasta SIMPLE
 stack_ImportFasta testolf /foo/olf.fasta PHRED /foo/olf.fasta.qual

--------------------------------------------------------------------------------

The first of the above two stack_ImportFasta examples returns the following from 
the system:
  
 Importing Fasta data.
   Project:  testolf
   Filename: /foo/olf.fasta
   Format:   SIMPLE
   Importing: 2600 records

   ....................................................
   ....................................................

 stack_ImportFasta completed.
 2600 imported.
 2600 sequences processed.
 Total sequences in project: 2600


The second example, when importing PHRED quality scores, returns the following 
from the system:

 Importing Fasta data.
   Project:  testolf
   Filename: /foo/olf.fasta
   Format:   PHRED
   QualFile: /foo/olf.fasta.qual
   Importing: 2600 records
               
   ....................................................

   Importing quality data from: '/foo/olf.fasta.qual'
   ....................................................

 stack_ImportFasta completed.
 2600 imported.
 2600 sequences processed.
 Total sequences in project: 2600


NOTE:  
 - If the GUESS format is selected and stackPACK does not identify one of its 
   pre-defined FASTA headers, the program determines an accession number for 
   the sequence entry by extracting all valid characters (alphanumeric and
   punctuation such as "_" or ".") found between the > and the first space. 
 - Sequences will ONLY be imported if there are 255 or fewer valid characters
   between the > and the first space. Users should thus ensure that these 
   characters are unique for each header line entered for processing. 
 - Clone linking will not occur when the PHRED or NCBI format is selected, as 
   these formats do not parse any clone information.


Multiple imports in multiple formats may be made into the same project, so long 
as no processing has yet taken place. This makes it possible to retain the 
maximum annotation information when sequences come from a variety of sources.  
These data files will then be processed as one project. 

Example:

 stack_ImportFasta olftest /foo/olfactory.fasta
 stack_ImportGenbank olftest /foo/olfactory.genbank

Both the olfactory.fasta and the olfactory.genbank data files will be imported 
into the project olftest, and will be processed through the rest of the pipeline 
as one dataset.



3.5 Masking Data

The clustering procedure is intended to group together those sequences which 
share identical regions. A common problem in EST clustering is contamination 
with a sequence common to several members of the input EST data set but not 
representing valid gene data. The masking step helps ensure that ESTs submitted
for clustering are free of artifacts before clustering begins.  

Users can choose to mask input sequences either with CrossMatch (Green, 1999)
or RepeatMasker (Smit, AFA & Green, P). This choice must be specified in 
the stackPACK configuration file.
 
CrossMatch masks input sequences against a database containing: 
  - Repeat sequences. For STACKdb production Electric Genetics uses RepBase 
    (Jurka, 1995). Your system administrator may have installed RepBase or  
    another repeat database more pertinent to your data. 
  - Common vector sequences, such as those distributed by NCBI.
  - Other potential contaminants such as rodent, mitochondrial and ribosomal 
    DNA. 
  
RepeatMasker masks input sequences against a database containing:
  - RepBase database, specially formatted for use with RepeatMasker 
    (available from the RepBase web site). 
  - Any FASTA formatted file containing repeat sequences and other contaminants. 
  
NOTE: Vector sequences are not included in the RepBase database used by 
      RepeatMasker.  If you wish to screen for vector contamination as well, 
      using RepeatMasker, you must perform this task prior to importing your 
      data into stackPACK, or you must create your own FASTA formatted repeat 
      database that includes vector sequences.

In both programs, the sequences are masked by replacing the contaminated
portions of the sequence with x's, which are ignored by further steps in the
clustering pipeline.  This ensures that only valid sequence data contributes 
to the associations made to generate a cluster.

Masked regions are retained during the clustering pipeline and are visible 
in the Sequence View, PHRAP Alignment View and CRAW Alignment View. Portions 
of consensus sequences with only masked bases ("x") in their constituent 
sequences will have consensus of "n".  Regions of consensus sequences with 
long strings of "n"s will have these truncated to no more than 10 "n"s.
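
The two conventions described above, replacing contaminated regions with "x"
and capping runs of "n" in a consensus at 10, can be sketched as follows.
Both function names are hypothetical, the regions are assumed to be
half-open (start, end) index pairs, and neither function is taken from
CrossMatch, RepeatMasker or stackPACK.

```python
import re


def mask_regions(sequence, regions):
    """Replace each half-open (start, end) region of the sequence with 'x'."""
    masked = list(sequence)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "x"
    return "".join(masked)


def truncate_n_runs(consensus, max_run=10):
    """Shorten every run of more than max_run 'n' characters to max_run."""
    return re.sub(r"n{%d,}" % (max_run + 1), "n" * max_run, consensus)


print(mask_regions("ACGTACGTAC", [(2, 5)]))            # ACxxxCGTAC
print(truncate_n_runs("acgt" + "n" * 25 + "acgt"))     # acgtnnnnnnnnnnacgt
```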

Alternatively, users may choose to cluster with masked sequences and assemble 
with unmasked sequences to maximize consensus information.  See section 3.7, 
Assembling Data, for more details.  In this case, the masked data is used 
only for the initial cluster groupings, and unmasked data is used for the 
remainder of the steps in the pipeline.

When stackPACK is installed, the system administrator performing the
installation should place a repeat database for the generic, system-wide
configuration in /usr/local/stackpack/supporting/
This repeat database will then be used by default and need not be specified 
on the command line. 

Alternatively, any of the system wide configurations can be overridden in the
user's home directory in the file .stackpackrc. See section 6, Expert 
Configuration, for more details.

--------------------------------------------------------------------------------
Command:    stack_Mask

Info:       masks an input file of sequences against common contaminants

Requires:   CrossMatch or RepeatMasker. Third-party software. 
            RepBase database or FASTA file of sequences to mask against.

Usage:      stack_Mask <project> [Mask File] 
            stack_Mask <project> skip
            stack_Mask <project> --conf=<conffile>

Where:    
Project   = Brief one-word project name.
Mask file = Name and location of FASTA file of sequences to mask against.
skip      = Option used to omit masking.
conffile  = configuration file
  
Examples:     
 stack_Mask testolf /foo/repeat.seq
 stack_Mask testolf skip
 stack_Mask testolf --conf=Conf_Primates
--------------------------------------------------------------------------------

The first of the above three stack_Mask examples returns the following from the 
system when CrossMatch is used:

 Masking sequence data
   Project:    testolf
   Program:    cross_match
   Mask file:  /foo/repeat.seq
   Num cpus:   2
   Batch size: 500 sequences
   Processing: 2600 sequences in total
   Flags:      /foo/repeat.seq -minmatch 12 -minscore 20 -screen
   

   ...........
   stack_Mask finished
   Processed 2600 sequences


The first of the above three stack_Mask examples returns the following from the 
system when RepeatMasker is used:

 Masking sequence data
   Project:    testolf
   Program:    RepeatMasker
   Mask file:  /foo/repeat.seq
   Num cpus:   1 (RepeatMasker using 2 cpus)
   Batch size: 500 sequences
   Processing: 2600 sequences in total
   Flags:         -x -xm -nolow -pa 2 -lib /foo/repeat.seq      
         
   ...........
   stack_Mask finished
   Processed 2600 sequences


NOTE: RepeatMasker, unlike CrossMatch, uses its own parallelization independent 
      of stack_Mask. Thus when using RepeatMasker to mask, the number of CPUs in 
      the output may not correspond to the num cpu value specified in the 
      /etc/stackpack configuration file.


3.5.1 Masking Parameters

Users may generate multiple data-specific configuration files by copying the 
/etc/stackpack configuration file and editing the parameters. The name and 
location of the data-specific configuration file should be specified in the 
stack_Mask command with the --conf= option.

RepeatMasker requires the following default flags within stackPACK:
 -xm    Creates an additional output file in cross_match format.
 -x     Returns repetitive regions masked with Xs rather than Ns.

FAQ:     Why change the batchsize?
ANSWER:  The larger the batchsize, the more RAM is required to process the
         data.  If your computer has limited RAM or you find the system
         running out of RAM during the stack_Mask process, re-run your
         data with a smaller batchsize.  Conversely, if your computer
         has sufficient RAM you may speed processing by increasing the 
         batch size.
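
         To see how the batch size affects the amount of data handled at
         once, the sketch below splits a sequence list into consecutive
         batches. It assumes batches are simple consecutive chunks; how
         stack_Mask actually forms its batches is not described in this
         manual, and split_into_batches is a hypothetical name.

```python
def split_into_batches(sequences, batch_size=500):
    """Yield consecutive slices holding at most batch_size sequences."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(sequences), batch_size):
        yield sequences[start:start + batch_size]


# 2600 sequences with the batch size of 500 shown in the example output.
sizes = [len(batch) for batch in split_into_batches(list(range(2600)), 500)]
print(sizes)  # [500, 500, 500, 500, 500, 100]
```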

FAQ:     How do I use RepeatMasker instead of CrossMatch?
ANSWER:  - Uncomment 'program=RepeatMasker' in your stackpack 
           configuration file.
         - Comment out 'program=cross_match' in your stackpack 
           configuration file.
         - If you would like RepeatMasker to use its own mask file,
           uncomment 'mask_file=none' in your stackpack configuration 
           file. There should be NO SPACES before 'mask_file=none'.

         e.g.
         [stack_Mask]
         ; You can choose between program=cross_match and program=RepeatMasker
         ;program=cross_match
         program=RepeatMasker

         ; If you wish RepeatMasker to use its own masking database, set
         ; mask_file=none
         ;mask_file=/usr/local/stackpack/supporting/full_repeat.seq
         mask_file=none
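
         To check which masking program a configuration file selects, the
         section can be read with Python's standard configparser module.
         This assumes the stackPACK configuration file is compatible with
         the INI dialect that configparser accepts, which this manual does
         not state; read_mask_settings is a hypothetical helper.

```python
import configparser

EXAMPLE = """\
[stack_Mask]
; You can choose between program=cross_match and program=RepeatMasker
;program=cross_match
program=RepeatMasker
;mask_file=/usr/local/stackpack/supporting/full_repeat.seq
mask_file=none
"""


def read_mask_settings(text):
    """Return (program, mask_file) from the [stack_Mask] section."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["stack_Mask"]
    return section.get("program"), section.get("mask_file")


print(read_mask_settings(EXAMPLE))  # ('RepeatMasker', 'none')
```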



3.6 Clustering Data

The clustering step of stackPACK uses d2_cluster, a high-performance comparison
algorithm that rapidly determines the relative similarity of large datasets of
genetic sequences (Hide W., Burke J., Davison D.; Biological Evaluation of d2, 
an Algorithm for High-Performance Sequence Comparison; Journal of 
Computational Biology 1 (3) 199-215). An update of the algorithm, produced by 
Electric Genetics in January 2001, significantly improved the speed of d2_cluster. 

d2_cluster implements a loose approach to sequence clustering by identifying
and counting matching n-length words (n=6), in contrast with stricter
approaches in which clusters are built up based on matching entire sequence
fragments. While the strict methodology yields cluster members that are 
highly related, the loose approach presents the opportunity to detect clusters
which are related by re-arrangement or alternative splicing. Although the 
resulting clusters are likely to be more "noisy", the combination with 
verification tools for multiple sequence alignments eliminates this noise 
and produces networks of highly related sequences for further analysis. 
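
The idea of comparing sequences by counting n-length words can be sketched
as below. The score used here is the plain sum of squared differences in
word counts over two whole sequences, which is the basic form of a d2-style
measure. It is a simplified illustration only: d2_cluster compares windows
of the sequences, and how its score relates to the similarity cutoff is not
covered by this sketch. Both function names are hypothetical.

```python
from collections import Counter


def word_counts(sequence, word_size=6):
    """Count every overlapping word of length word_size in the sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + word_size] for i in range(len(seq) - word_size + 1))


def d2_distance(seq_a, seq_b, word_size=6):
    """Sum of squared word-count differences; 0 means identical word content."""
    a = word_counts(seq_a, word_size)
    b = word_counts(seq_b, word_size)
    return sum((a[word] - b[word]) ** 2 for word in set(a) | set(b))


print(d2_distance("ACGTACGTACGT", "ACGTACGTACGT"))  # 0
print(d2_distance("AAAAAAA", "CCCCCCC"))            # 8
```

Because only word counts are compared, two sequences that contain the same
words in a different order still score as similar, which is what allows
re-arranged or alternatively spliced transcripts to be grouped.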

d2_cluster, a word-based, greedy clustering algorithm, is discrete from the
assembly tool (PHRAP) and identifies sequences that are greater than 96%
identical over a window of 150 bases.

d2_cluster is a word multiplicity comparison method built on an agglomerative 
algorithm developed specifically for rapidly and accurately partitioning 
transcript databases into index classes. It clusters ESTs and full-length 
sequences according to minimal linkage or "transitive closure" rules. 
Agglomerative clustering means that every sequence begins in its own cluster 
and the final clustering is constructed through a series of mergers. These 
mergers follow minimal linkage, sometimes called single linkage or 
"transitive closure": any two sequences with the required level of similarity 
are placed in the same cluster. As a consequence, sequences A and B end up in 
the same cluster even if they share no similarity with each other, provided 
there is a sequence C with enough similarity to both A and B. 
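
The transitive closure rule can be illustrated with a small union-find
sketch: every item starts in its own cluster, and each similar pair merges
two clusters. This shows the linkage rule only. It is not d2_cluster's
implementation, and the function name single_linkage and its inputs are
hypothetical.

```python
def single_linkage(items, similar_pairs):
    """Group items so that any chain of similar pairs ends up in one cluster."""
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links to the cluster root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), set()).add(item)
    return sorted(clusters.values(), key=sorted)


# A and B are never compared directly, yet C links them into one cluster.
result = single_linkage(["A", "B", "C", "D"], [("A", "C"), ("C", "B")])
print([sorted(cluster) for cluster in result])  # [['A', 'B', 'C'], ['D']]
```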

NOTE: d2_cluster ignores sequences with fewer than 50 valid bases. Only 
      A, T, C and G are valid bases; masked positions, represented by "x", 
      are not. Sequences that have fewer than 50 valid bases, whether 
      originally or after masking, are not included in the clustering step 
      and are treated as singletons. The threshold is defined by the 
      d2_cluster minimum_sequence_size parameter and can be overridden in 
      the user's home directory in the file .stackpackrc. See section 6, 
      Expert Configuration, for more details.
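
      A quick way to predict which sequences will be skipped is to count
      their valid bases. The sketch below assumes that lower-case a, c, g
      and t also count as valid, which this manual does not state; both
      function names are hypothetical.

```python
def count_valid_bases(sequence):
    """Count A, C, G and T; masked 'x' and ambiguous 'n' are not counted."""
    return sum(1 for base in sequence.upper() if base in "ACGT")


def is_clusterable(sequence, minimum_sequence_size=50):
    """True if the sequence has enough valid bases to be clustered."""
    return count_valid_bases(sequence) >= minimum_sequence_size


print(is_clusterable("ACGT" * 13))             # True  (52 valid bases)
print(is_clusterable("ACGT" * 12 + "x" * 20))  # False (48 valid bases)
```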

--------------------------------------------------------------------------------
Command:   stack_Cluster

Info:      runs d2_cluster on the input data file 

Requires:  enc_db. Distributed with stackPACK
           d2_cluster. Distributed with stackPACK

Usage:     stack_Cluster <project>
           stack_Cluster <project> undo 
           stack_Cluster <project> --conf=<conffile>
       
Where:
Project  =  Brief one-word project name.
undo     =  Reversal of all steps subsequent to and including clustering. 
conffile =  configuration file

Examples: 
 stack_Cluster testolf
 stack_Cluster testolf undo
 stack_Cluster testolf --conf=Conf_Primates
--------------------------------------------------------------------------------

The first of the above three stack_Cluster examples returns the following from 
the system:

 Clustering sequence data
   Project:    testolf
   Exporting:  DONE
   Clustering: 2600 sequences
   Using:      2 cpus
   Parameters: word_size=6
               similarity_cutoff=0.96
               minimum_sequence_size=50
               window_size=100
               reverse_comparison=1

   ...
   Parsing cluster table and fixing accessions...DONE
   Importing clustering results...DONE

   stack_Cluster finished

   Created 268 clusters.
   886 sequences were members of a cluster.

   Generating some statistics...

   ============================================================
                        CLUSTER STATISTICS                    
   ============================================================
    There are 1714 singletons
    There are 199 clusters with 2 sequences
    There are 35 clusters with 3 sequences
    There are 7 clusters with 4 sequences
    There are 7 clusters with 5 sequences
    There are 8 clusters with 6 sequences
    There are 3 clusters with 7 sequences
    There are 1 clusters with 8 sequences
    There are 3 clusters with 9 sequences
    There are 1 clusters with 12 sequences
    There are 2 clusters with 13 sequences
    There are 1 clusters with 53 sequences
    There are 1 clusters with 125 sequences
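
The statistics above are consistent with the summary lines earlier in the
output, which can be checked with a few lines of arithmetic. The dictionary
below simply transcribes the table (cluster size mapped to number of
clusters).

```python
# cluster size -> number of clusters, copied from the statistics above
histogram = {2: 199, 3: 35, 4: 7, 5: 7, 6: 8, 7: 3, 8: 1,
             9: 3, 12: 1, 13: 2, 53: 1, 125: 1}
singletons = 1714

clusters = sum(histogram.values())
clustered = sum(size * count for size, count in histogram.items())
total = clustered + singletons
print(clusters, clustered, total)  # 268 886 2600
```

That is, 268 clusters containing 886 sequences, plus 1714 singletons, account
for all 2600 imported sequences.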


NOTE: The clustering step is more efficient if long input data sequences, such 
      as mRNAs, are at the top of the imported input data file.


3.6.1 Cluster Parameters

Users may generate multiple data-specific configuration files by copying the 
/etc/stackpack configuration file and editing the parameters. The name and 
location of the data-specific configuration file should be specified in the
stack_Cluster command with the --conf= option.

d2_cluster parameters:
 - Word_size: Sequences are clustered together based on matching n-length words. 
   The default value for n=6. This value may not exceed 9.
 - Similarity_cutoff: Percentage of sequence similiarity required for match. 
   The default value is 96% similarity (96 bases over the window size, 100).
 - minimum_sequence_size: Minimum sequence length, in bases, of sequences 
   processed by d2_cluster. The default value is 50.
 - window_size: Number of base pairs which are compared at one time. The default 
   value is 100.
 - reverse_comparison: Enables d2_cluster to also look at both forward and
   reverse strands of sequences and recognize them as such. This flag is 
   switched on by default.
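
The relationship between these parameters can be sketched in Python. This is
an illustration under stated assumptions, not stackPACK code: the defaults
are copied from the list above, the only limit checked is the documented
word_size maximum of 9, and the function names are hypothetical.

```python
# Defaults as documented above (similarity_cutoff written as a fraction).
D2_DEFAULTS = {
    "word_size": 6,
    "similarity_cutoff": 0.96,
    "minimum_sequence_size": 50,
    "window_size": 100,
    "reverse_comparison": 1,
}

def d2_parameters(**overrides):
    """Merge overrides into the defaults and check the documented limit."""
    unknown = set(overrides) - set(D2_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    params = {**D2_DEFAULTS, **overrides}
    if params["word_size"] > 9:
        raise ValueError("word_size may not exceed 9")
    return params

def bases_required(params):
    """Matching bases needed per window: similarity_cutoff x window_size."""
    return round(params["similarity_cutoff"] * params["window_size"])

print(bases_required(d2_parameters()))  # 96 with the default values
```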


3.7 Assembling Data

To take advantage of the benefits of looser clustering, it is necessary to
further align and analyze the clusters generated by d2_cluster. The related but
loose clusters are thus subsequently processed by PHRAP to identify, 
characterize and isolate any sequence divergence. 

PHRAP aligns and assembles the sequences grouped together by d2_cluster, and  
improves alignment quality by removing particularly distinct sequences as 
singletons.  Any PHRAP singletons remain associated with the original cluster.
While masked data is used in the assembly step by default, users are given 
the option to carry out stack_Assemble on the original unmasked
sequence data as is described below. In cases where sequences are imported 
without quality files, the PHRAP alignment and PHRAP consensus sequence will 
always be in lower case due to the default quality parameter.

StackPACK retains the PHRAP alignment and consensus sequence, even though it 
is further processed and regenerated by the stack_Analysis step.  

NOTE: 
  - The stack_Assemble output reports the number of clusters processed, the 
    number of contigs generated in this run, the number of clusters with 
    multiple contigs, the number of clusters that did not have a contig as 
    well as the total number of contigs in the database/project.
  - The stack_Assemble output indicates the progress of stack_Assemble.
 
    e.g. ...(9/268)..
    Each assembled cluster is represented by a dot ("."), and "(9/268)" 
    indicates that 9 clusters out of a total of 268 clusters have been 
    processed.

--------------------------------------------------------------------------------
Command:       stack_Assemble

Info:          runs PHRAP on clusters generated by d2_cluster

Requires:      PHRAP    Third-party software.
               ace2gde  Distributed with stackPACK.

Usage:         stack_Assemble <project>
               stack_Assemble <project> --use-unmasked              
               stack_Assemble <project> undo
               stack_Assemble <project> --conf=<conffile>
              
Where:
project      = Brief one-word project name
use-unmasked = Option used if stack_Assemble is to be carried out on the 
               original unmasked input sequences
undo         = Reversal of all steps subsequent to and including assembly
conffile     = configuration file

Examples: 
 stack_Assemble testolf
 stack_Assemble testolf --use-unmasked
 stack_Assemble testolf undo
 stack_Assemble testolf --conf=Conf_Primates
--------------------------------------------------------------------------------

The first of the above four stack_Assemble examples returns the following 
from the system:

 Assembling cluster data
   Project:    testolf
   Num cpus:   16
   Processing: 268 clusters
   Parameters: old_ace=1
               vector_bound=0
               forcelevel=0
               trim_score=20
               penalty=-2
               gap_init=-4
               gap_ext=-3
               ins_gap_ext=-3
               del_gap_ext=-3
               maxgap=30
               flags=-retain_duplicates

   ........................
   Cluster: 241 generated 2 sub-contigs


   stack_Assemble finished
   Processed 268 clusters
   Total contigs generated in this run: 269
   Total clusters that had multiple contigs: 1
   Total clusters that did not have a contig: 0
   Total contigs in database: 269


NOTE: 
 - The cluster generated by d2_cluster may be split into one or more 
   contigs by PHRAP.  PHRAP may also split out particularly divergent
   singletons that do not align well to any of the contigs.  We refer to
   these as PHRAP singletons.
   While each contig is further processed in the stack_Analysis step,    
   PHRAP singletons do not receive further processing and so are not
   seen in the Alignment Analysis or CRAW Alignment views and are 
   not used in the generation of the final consensus sequences. 
 - If the "--use-unmasked" parameter is specified, unmasked sequences
   will be displayed in all alignment and consensus views.


3.7.1 Assembly Parameters and Configuration

 - Users may generate multiple data-specific configuration files by copying 
   the /etc/stackpack configuration file and editing the parameters. The name 
   and location of the data-specific configuration file should be specified in 
   the stack_Assemble command with the --conf= option.

 - With the exception of the retain_duplicates flag, default PHRAP parameters 
   are used within stackPACK unless otherwise specified. Any PHRAP parameter 
   can be set as a flag in the .stackpackrc configuration file.

 - Please ensure that there are no spaces between "flags=" and the first flag 
   in your configuration file.

 - If the PHRAP retain_duplicates flag is not set, and the input data file
   contains two or more sequences with 100% sequence identity, PHRAP processes 
   only the first sequence. These identical sequences are processed by d2_cluster 
   and may thus be included in cluster generation. They are however not present 
   in the subsequent alignment and analysis steps. The retain_duplicates flag 
   is included by default in the assembly step.

 - While masked data is used in the assembly step by default, users are given
   the option to carry out stack_Assemble on the original unmasked sequence data 
   by using the --use-unmasked parameter. If this parameter is specified, 
   unmasked sequences will be displayed in all alignment and consensus views.
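
The rule about spaces after "flags=" can be checked before a run. The
following Python sketch is not part of stackPACK. It assumes the
configuration file contains a line of the form "flags=-retain_duplicates",
as shown in the stack_Assemble output above, and the function name is
hypothetical.

```python
def flags_line_problems(config_text):
    """Return 1-based line numbers where 'flags=' is followed by whitespace."""
    bad = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("flags="):
            rest = stripped[len("flags="):]
            # A space or tab directly after "flags=" is what must be avoided.
            if rest[:1].isspace():
                bad.append(number)
    return bad

print(flags_line_problems("old_ace=1\nflags= -retain_duplicates\n"))
```

An empty list means that no offending line was found.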

Using different PHRAP versions:
 - PHRAP v 0.990319, as well as the older version of PHRAP, v 0.960731, may 
   both be used in stack_Assemble. 
 - In order to use PHRAP version 0.960731, the PHRAP old_ace parameter needs to
   be set to 0, and the executable needs to point at the correct version of 
   PHRAP in the configuration file. 
 - The retain_duplicates flag is not supported by PHRAP version 0.960731.
 - PHRAP has limits of 64K bases for each sequence, and 64K sequences per 
   cluster. PHRAP can be compiled with a .longreads or .manyreads option that 
   allows assembly of sequences with more than 64,000 bp or clusters with more 
   than 64,000 sequences. These options can only be used with PHRAP v 0.990319. 
   It should thus be ensured that the old_ace parameter is set to 1 in the 
   configuration file. Please refer to section iv in the Installation 
   Instructions (Part I) of the Documentation For PHRAP and 
   CROSS_MATCH (Version 0.990319) for further details. 
 - Case provided in an input sequence will be retained in the older version of 
   PHRAP, PHRAP v 0.960731, and may be used as an indicator of sequence quality.
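
Clusters that would exceed the standard PHRAP limits can be identified before
assembly. The sketch below is an assumption-laden illustration: it takes
"64K" to mean 64,000, as in the text above, and it expects the caller to
supply the sequence lengths of one cluster. The names are hypothetical.

```python
PHRAP_MAX_BASES = 64_000       # per sequence, as stated above
PHRAP_MAX_SEQUENCES = 64_000   # per cluster, as stated above

def exceeds_phrap_limits(sequence_lengths):
    """True if a cluster has too many sequences or any sequence is too long."""
    too_many = len(sequence_lengths) > PHRAP_MAX_SEQUENCES
    too_long = any(n > PHRAP_MAX_BASES for n in sequence_lengths)
    return too_many or too_long

print(exceeds_phrap_limits([420, 515, 70_000]))
```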


3.8 Analyzing Data

Aligned clusters, particularly those generated by a loose clustering engine, 
need to be further processed for errors, such as those inherent in single-pass
sequences, and alignments analyzed for alternate forms of expressed sequences.  

Although PHRAP aligns sequences, these alignments are lacking information about
variation within the cluster and do not help users distinguish alternative 
splice or other scientifically interesting events from alignment problems
induced by low sequence quality or experimental artifacts.

CRAW (CRAWview: for viewing splicing variation, gene families, and polymorphism
in clusters of ESTs and full-length sequences ; Chou A., Burke J.;
Bioinformatics 15 (5) 376-381) is thus employed to analyze alignments, 
partition sub-assemblies and provide a simple means to view clusters. After
CRAW processing, stackPACK further analyzes clusters to refine consensus 
sequences, maximize consensus sequence length, create final alignments and 
to select the best consensus sequence.  

CRAW works by verifying agreement along the columns of a multiple sequence 
alignment, using the data to sort related sequences within each cluster and 
to generate IUPAC-compliant consensus sequences for each subcluster. Using
default parameters, a subcluster is generated if 50% or more of a 100 base 
window differs from the remaining sequences of a cluster, excluding the
initial 100 bases of any read. The approach depends fundamentally on the 
alignment quality of each assembly. A poor alignment will yield erroneous 
sub-clusters and too low a gap penalty may yield too many columns in agreement 
and thus not create sub-clusters where they would be appropriate.
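
The window test described above can be illustrated with a simplified Python
sketch. This is not CRAW's implementation. It assumes that the read and a
reference row (standing in for the remaining sequences of the cluster) are
already aligned and of equal length, and the parameter names simply mirror
those printed in the stack_Analysis output (sig, window_size, ignore_first).

```python
def diverges(read, reference, window_size, sig, ignore_first):
    """True if some full window has a mismatch fraction of at least sig."""
    if len(read) != len(reference):
        raise ValueError("aligned rows must have equal length")
    mismatches = [a != b for a, b in zip(read, reference)]
    # Windows may not start inside the ignored prefix of the read.
    for start in range(ignore_first, len(read) - window_size + 1):
        window = mismatches[start:start + window_size]
        if sum(window) / window_size >= sig:
            return True
    return False

# Toy example with a 10-base window so that the rows stay readable.
read = "A" * 10 + "C" * 10
reference = "A" * 20
print(diverges(read, reference, window_size=10, sig=0.5, ignore_first=5))
```

In the toy example, the window starting at position 5 contains 5 mismatches
out of 10 columns, which reaches the 0.5 threshold.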

NOTE:
  - The stack_Analysis output reports the number of contigs processed, the
    number of consensus sequences generated in the run and the total number 
    of consensus sequences in the database/project.
  - The stack_Analysis output indicates the progress of stack_Analysis.   

    e.g. ...(9/269)..
    Each analyzed contig is represented by a dot ("."), and "(9/269)"
    indicates that 9 contigs out of a total of 269 contigs have been
    processed.   

--------------------------------------------------------------------------------
Command:      stack_Analysis

Info:         runs CRAW on aligned sequence data; further analyzes CRAW 
              subassemblies

Requires:     CRAW. Distributed with stackPACK.

Usage:        stack_Analysis <project>
              stack_Analysis <project> undo 
              stack_Analysis <project> --conf=<conffile>

Where: 
Project   =  Brief one-word project name.
undo      =  Reversal of all steps subsequent to and including analysis.
conffile  =  configuration file

Examples: 
 stack_Analysis testolf
 stack_Analysis testolf undo
 stack_Analysis testolf --conf=Conf_Primates
--------------------------------------------------------------------------------

The first of the above three stack_Analysis examples returns the following from 
the system:

 Analyzing contig data
   Project:    testolf
   Processing: 282 contigs
   Parameters: sig=0.5
               window_size=100
               ignore_first=50


   reassigning lone singleton 1871 from 0 to 1

   reassigning lone singleton 2416 from 0 to 1

   stack_Analysis finished.
   Total contigs processed: 269
   Total consensi generated in this run: 271
   Total consensi in database: 271

Note: CRAW has a limit of 20,000 sequences per cluster and a maximum sequence 
      length of 100,000 base pairs. A consensus sequence will not be generated 
      for clusters if these limits are exceeded.


3.8.1 Analysis Configuration

Users may generate multiple data-specific configuration files by copying the   
/etc/stackpack configuration file and editing the parameters. The name and
location of the data-specific configuration file should be specified in the 
stack_Analysis command with the --conf= option.


FAQ: What does "reassigning lone singleton" refer to in the output?
ANSWER: These refer to the d2_cluster singletons. stack_Analysis re-assigns 
        singletons to a larger subalignment with which they have the most 
        sequence similarity.



3.9 Linking Data

All ESTs generated from the same cDNA clone correspond to a single gene. Each 
EST is searched for clone identification so that non-overlapping clusters 
corresponding to the same gene can be identified and linked. Only a proportion 
of ESTs in GenBank currently have documented clone information. This information 
is utilized to extend the length of the cluster consensus sequences by joining
clusters that contain ESTs that share clone IDs. Thus only if the input 
sequences contain clone information will the program create linked clusters.

Given that the clone ID information is solely annotation-based and may have
namespace overlaps depending on the data source(s), this step has been placed
near the end of the processing pipeline. Furthermore, unless a specific 
5'-3' pair can be identified as a seed for each linked consensus, the procedure 
is transitive in nature and may lead to extensive clone-linked networks whose 
biological significance remains to be explored.  To avoid spurious linking,
the program, by default, requires that at least two independent clone ID
matches must be made before two clusters will link. The number of required 
links is a parameter that can be changed in stackPACK.
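
The effect of the required number of links can be sketched as follows. This
is an illustration only, not the stack_Link implementation. It models
"independent clone ID matches" as distinct clone IDs shared by two clusters,
which is an assumption, and all names are hypothetical.

```python
from itertools import combinations

def linkable_pairs(cluster_clones, required_links=2):
    """Return cluster pairs that share at least required_links clone IDs."""
    pairs = []
    for (a, clones_a), (b, clones_b) in combinations(sorted(cluster_clones.items()), 2):
        if len(set(clones_a) & set(clones_b)) >= required_links:
            pairs.append((a, b))
    return pairs

clusters = {
    "cl1": {"cloneA", "cloneB", "cloneC"},
    "cl2": {"cloneB", "cloneC"},
    "cl3": {"cloneC"},
}
print(linkable_pairs(clusters))
```

With the default of two required links, only cl1 and cl2 are linked. Setting
required_links=1 would also link cl3 to both of them, which shows how a lower
threshold produces larger clone-linked networks.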

To form a final consensus sequence, the non-redundant best cluster consensus 
sequences are joined by linker segments of 20 Xs. This choice was made based 
on the word size employed by BLAST, so that alignment breaks would be 
preferentially inserted at these linker regions.
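
The construction of a linked consensus can be shown in a few lines of Python.
This is a minimal sketch of the joining rule described above, with
hypothetical names; the order of the consensus sequences is assumed to be
decided elsewhere.

```python
LINKER = "X" * 20  # linker segment of 20 Xs, as described above

def join_consensi(consensi):
    """Join cluster consensus sequences with the linker segment."""
    return LINKER.join(consensi)

linked = join_consensi(["ACGTACGT", "TTGACCA"])
print(linked)
print(len(linked))  # 8 + 20 + 7 = 35
```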

--------------------------------------------------------------------------------
Command:      stack_Link

Info:         creation of linked clusters

Usage:        stack_Link <project>
              stack_Link <project> undo
              stack_Link <project> --conf=<conffile>

Where: 
project   =   Brief one-word project name.
undo      =   Reversal of the linking step.
conffile  =   configuration file

Examples: 
 stack_Link testolf
 stack_Link testolf undo
 stack_Link testolf --conf=Conf_Primates
--------------------------------------------------------------------------------

The first of the above three stack_Link examples returns the following 
from the system:

 Linking cluster data
   Project:        testolf
   Redundancy:     2

   Pass 1 - identifying internal links and potential external  
            links .................

   Pass 2 - identifying external cluster links   

   stack_Link finished.
   Created 23 clonelinks, consisting of: 46 clusters.


NOTE: Users may generate multiple data-specific configuration files by copying 
      the /etc/stackpack configuration file and editing the parameters. The 
      name and location of the data-specific configuration file should be 
      specified in the stack_Link command with the --conf= option.


FAQ: How can I ensure that clonelinking will be performed on my sequence data?
ANSWER: - Clonelinking will only be performed if the clone ID information is 
          parsed when data is imported into stackPACK. This depends on whether 
          the clone ID information is in a recognisable format within the 
          sequence header line. 
        - Clone information in NCBI format is part of the free form text in the 
          definition line, and is thus not parsed when using NCBI format. 
        - Clone ID information is never provided in PHRED format and therefore
          no linking can be performed on this type of data.
        - Clone ID information will only sometimes be parsed when using GUESS 
          FASTA format. 
        - Clone ID information should always be parsed when importing with 
          simple FASTA, STACK FASTA and GenBank formats, provided that the 
          formats are used as per definition.



3.10 Incremental Addition of Data

New sequence data may be added incrementally to existing clusters. The cluster 
history of new, outdated and changed clusters is maintained and reflected 
in WebProbe. New sequences are simply imported into the existing project of 
choice and processed through the stackPACK pipeline as described in sections 
3.4 - 3.9 of the Command Line manual. The output of all the stackPACK commands 
reflects the fact that incremental addition is occurring. Some examples are given 
below.

 e.g. stack_ImportFasta olftest /foo/olf2.fasta GUESS

        The following output is returned from the system: 

         Importing Fasta data.
         Project:  olftest
         Filename: /foo/olf2.fasta   Format:   GUESS
         Importing: 229 records

         ......................

         stack_ImportFasta completed.
         229 imported.
         229 sequences processed.
         Total sequences in project: 493

NOTE: The "Total sequences in project" in the stack_Import output refers to the 
      newly imported sequences added to the total number of sequences in the 
      existing project.


 e.g. stack_Mask olftest

        The following output is returned from the system:

         Masking sequence data.
         Project:       Liver
         Program:       cross_match
         Mask file:     /usr/local/stackpack/supporting/repeat.seq
         Num cpus:      2
         Batch size: 115 sequences
         Processing: 229 sequences, continuing from sequence: 265
         Flags:         /usr/local/stackpack/supporting/repeat.seq -minmatch 12 
                     -minscore 20 -screen

         ..
         stack_Mask finished.
         Processed 229 sequences.

NOTE: "continuing from sequence: 265"  in the stack_Mask output indicates that 
      only the newly added 229 sequences are masked.


 e.g. stack_Cluster Liver

        The following output is returned from the system:

         Clustering sequence data
         Project:    Liver9Jan

         Existing clusters detected, performing an add.
         Exporting:  DONE
         Clustering: 493 sequences
         Using:      2 cpus
         Parameters: word_size=6
                     similarity_cutoff=0.96
                     minimum_sequence_size=50
                     window_size=100
                     reverse_comparison=1

         ...
         Parsing cluster table and fixing accessions...DONE
         Importing clustering results...DONE

         stack_Cluster finished.

         Added 44 new clusters.
         Deprecated 22 clusters.
         Joined 3 clusters into 1 new cluster.
         Left 25 clusters unchanged.

         Generating some statistics...

         ========================================================
                           CLUSTER STATISTICS                    
         ========================================================
        There are 127 singletons
        There are 47 clusters with 2 sequences
        There are 14 clusters with 3 sequences
        There are 12 clusters with 4 sequences
        There are 4 clusters with 5 sequences
        There are 3 clusters with 6 sequences
        There are 3 clusters with 8 sequences
        There are 1 clusters with 10 sequences
        There are 1 clusters with 13 sequences
        There are 1 clusters with 15 sequences
        There are 1 clusters with 18 sequences
        There are 1 clusters with 24 sequences
        There are 1 clusters with 40 sequences


The project may be processed through the rest of the stackPACK pipeline as 
described in sections 3.7 (stack_Assemble), 3.8 (stack_Analysis) and 3.9 
(stack_Link) of the command line manual. Processing will only be performed on
those clusters or contigs that have been altered due to the incremental 
addition of the new sequence data.

NOTE: PHRED quality scores may not be added to converted projects that have 
      been created with stackPACK v2.1. 


3.11  Undoing steps in the pipeline

Users have the capability to "undo" certain steps in the stackPACK
pipeline.  This undo option applies to the clustering, assembly, analysis 
and linking steps. You may NOT undo import or masking steps.  Undo is 
executed as a flag or parameter to the other pipeline steps.

The undo option will reverse all the steps subsequent to and including 
the step being undone in the stackPACK pipeline.  For example, undoing 
stack_Assemble will undo the stack_Assemble, stack_Analysis and stack_Link
steps in the project specified.

NOTE:  If incremental addition has been carried out on an existing project, 
the undo option will reverse the step for all data contained in the project, 
not just for the latest additions.

For example, you may wish to use the undo option in order to change the 
parameters or the number of CPUs used for a particular step in a project.

----------------------------------------------------------------------------
Undo option

Info:   reverses all steps subsequent to and including
        the step being undone in the stackPACK pipeline. 

Usage:  <stackPACK_program> <project> undo

Where:
stackPACK_program = stack_Cluster, stack_Assemble, stack_Analysis
                    or stack_Link
project           = project from which you want to undo pipeline steps.

Examples:
  stack_Cluster <project> undo
  stack_Assemble <project> undo
---------------------------------------------------------------------------
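
The cascade of the undo option follows directly from the order of the
pipeline. The Python sketch below is an illustration and not stackPACK code;
the function name steps_undone is hypothetical.

```python
# Pipeline steps that support undo, in processing order.
PIPELINE = ["stack_Cluster", "stack_Assemble", "stack_Analysis", "stack_Link"]

def steps_undone(step):
    """Return the steps reversed when 'undo' is applied to the given step."""
    if step not in PIPELINE:
        raise ValueError(f"undo is not available for {step}")
    return PIPELINE[PIPELINE.index(step):]

print(steps_undone("stack_Assemble"))
```

For stack_Assemble this lists stack_Assemble, stack_Analysis and stack_Link,
matching the example given above.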



3.12  Restarting steps in the pipeline

Some of the steps in the stackPACK pipeline may be interrupted and restarted 
with little risk.  The steps that may be restarted include: stack_Import, 
stack_Mask, stack_Assemble and stack_Analysis.

To restart a process that has been interrupted, simply reissue the command 
for that process.  The program will not reprocess all the data, but will 
begin where it left off.  

NOTE:  If the program was interrupted while writing results back to the 
       database, there is a slight risk that the cluster or sequence being 
       written at that time will not be complete and will not be re-run when 
       you restart the program.  In most cases, restarting will not cause any 
       problems.  However, if you would like to be 100% certain that no data is 
       missed, you may undo the process and reprocess all data for that step. 



3.13 Handling projects with the same name

Projects may have the same name, provided that they are owned by different users. 
In such cases, stackPACK will process the project owned by the user that is 
currently logged in, unless otherwise specified.

If, for example, there are two projects in the database called "testolf" owned 
by two different users, the following output is returned from the system upon
stack_Import:


Command: stack_ImportFasta testolf /foo/olfactoryseq.fasta  

Output:  There are multiple projects called 'testolf'
         Id: 420       Project: testolf       Owner: liza
         Id: 421       Project: testolf       Owner: gary

         I am going to assume that you want to run the following project:
         Id: 420       Project: testolf       Owner: liza

         If you want to use a different project, please try again using the
         --user=<username> option

         e.g. stack_Assemble --user=<username> <project>


The following command should thus be used to specify the project called 
"testolf" owned by gary:

Command: stack_ImportFasta testolf --user=gary /foo/olfactoryseq.fasta

The --user=<username> option can be specified for all the stack commands in the 
processing pipeline.



3.14 Converting stackPACK v2.1 projects to v2.2

Projects that have been created with stackPACK v2.1 should be converted if users 
wish to have access to these projects on stackPACK v2.2. Further data processing 
as well as data output and analysis may be carried out on converted projects 
within stackPACK v2.2. Project conversion is performed with the stack_ProjectManager.  

-------------------------------------------------------------------------------
Command:           stack_ProjectManager -convert

Info:              Converts a v2.1 project to v2.2

Usage:             stack_ProjectManager -convert <project> <old_dsn> <old_dsn_login> <old_dsn_password> [1|0] 

Where: 
project          = project to be converted
old_dsn          = old data source name
old_dsn_login    = old data source name login
old_dsn_password = old data source name password
1|0              = This argument specifies whether sequences
                   in the stackPACK 2.1 project have been clustered
                   or not. If this argument is set to 1, sequences
                   will be assumed to be clustered. If this argument
                   is set to 0 (or if this argument is not specified)
                   sequences will be assumed to be unclustered.


Example: stack_ProjectManager -convert testolf stacksys stackpack stackpack 1
-------------------------------------------------------------------------------

The above example returns the following output from the system:

 Converting project: testolf
 2.1 System Database: stacksys
 2.1 System Login: stackpack
 
 sequences.........................
 clusters........................
 contigs.............
 consensi................
 clonelinks.......................
 annotation............

 Conversion completed successfully, you can now access your project as:  'testolf_converted'


NOTE: 
 - The old_dsn will always be "stacksys". This DSN entry can be found in the 
   /etc/odbc.ini file
 - For stackPACK v2.1.1 the old_dsn_login and old_dsn_password will always be 
   stackpack and stackpack. StackPACK v2.1, however, may have been set up with a 
   different dsn_login and dsn_password. Please refer to the [DATABASE] section 
   of your stackPACK v2.1 /etc/stackpack configuration file to determine what 
   these are.
 - ACE format output is not supported from converted projects.
 - PHRED format input files may not be added to converted projects.


4. Web-based Viewing and Output of Data

The stackPACK results are stored in a relational database and are viewed 
and exported by using the web interface components WebProbe(tm) and 
WebReport(tm).

WebProbe provides viewing tools that link consensus sequences, alignments,
expression analysis and external data sources like UniGene. 

WebReport provides access to a list of predefined reports which can be
selected and downloaded for further data evaluation or to create searchable
databases of your clustering results.

The web-based interface is typically invoked by opening the following location
in your browser:
  http://www.yourhostname.com/stackpack/

The hostname can be confirmed by viewing the [WEB] entry in the file:
/etc/stackpack.  For example, if the hostname is "myhost.egenetics.com", you
invoke stackPACK by opening the following URL:
  http://myhost.egenetics.com/stackpack/
and the [WEB] entry will look like this:

 [WEB]
 WEBSERVER  = hostname
 MAILSERVER = myhost.egenetics.com
 HTTP_URL   = /stackpack
 CGI_URL    = /cgi-bin/stackpack
 HTTP_PATH  = /home/httpd/html/stackpack
 CGI_PATH   = /home/httpd/cgi-bin/stackpack
 FULLVER    = yes
 PAGESIZE   = 25
 DEBUG      = no

The [WEB] entry contains configuration information such as the location of 
files. The number of projects listed per page in the WebProjectManager is set 
by the PAGESIZE parameter.
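
Values such as PAGESIZE can be read programmatically. The sketch below uses
the Python standard library module configparser on an excerpt of the [WEB]
entry shown above. It is an illustration only: it assumes that the entry
follows INI conventions closely enough for configparser, which may not hold
for every section of /etc/stackpack.

```python
import configparser

# Excerpt of the [WEB] entry shown above, written without indentation.
sample = """\
[WEB]
WEBSERVER  = hostname
MAILSERVER = myhost.egenetics.com
HTTP_URL   = /stackpack
PAGESIZE   = 25
DEBUG      = no
"""

parser = configparser.ConfigParser()
parser.optionxform = str          # keep option names in upper case as written
parser.read_string(sample)

web = parser["WEB"]
page_size = web.getint("PAGESIZE")    # projects listed per page
debug = web.getboolean("DEBUG")       # "no" is read as False
print(web["HTTP_URL"], page_size, debug)
```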


5. Exporting Data from the Command Line

A series of scripts are provided that allow stackPACK users to export their
data in a number of predefined reports. These reports correspond to the 
reports found in the WebReport section of the web-based interface.

The export scripts are found in the "bin" subdirectory of the stackPACK
installation directory (typically /usr/local/stackpack/bin) and are named in 
such a way that users can easily deduce their function.
-------------------------------------------------------------------------------


5.1 Selecting the appropriate command line report:


stack_ReportAllSequences.py             - All masked or original unmasked input 
                                          sequences, in FASTA format.

stack_ReportConsensus.py                - All clonelinked consensus sequences, 
                                          multi-sequence cluster primary consensus 
                                          sequences and/or multi-sequence cluster 
                                          alternate consensus sequences, in 
                                          FASTA format.

stack_ReportAllSingleton.py             - All d2_cluster and/or PHRAP singleton 
                                          sequences, in FASTA format.

stack_ReportClusterMemberEst.py         - List of all constituent EST or mRNA   
                                          sequences per cluster accession or 
                                          for the whole project, in FASTA 
                                          or CSV format.

stack_ReportAlignment.py                - Initial PHRAP or Final CRAW sequence 
                                          alignments, per accession or for the 
                                          whole project, in MSF, ClustalW or ACE 
                                          formats.

stack_ReportClusterAlignmentAnalysis.py - The Alignment Analysis CRAW logs, per 
                                          accession or for the whole project. 
                        
stack_ReportNonRedundant.py             - Non-redundant output of the entire 
                                          project, in FASTA or CSV format.


Please refer to WebReport in stackPACK Support for a detailed description 
of each report.
-------------------------------------------------------------------------------


5.2 Usage of the command line reports:

All optional fields are placed in square brackets "[]", and all compulsory fields 
are placed in "<>". A pipe "|" indicates where one or the other option 
should be selected.


5.2.1 stack_ReportAllSequences.py
------------------------------------------------------------------------------- 
Command:   stack_ReportAllSequences.py

Info:      Outputs all masked or original unmasked input sequences, in FASTA format.

Usage:     stack_ReportAllSequences.py --Owner=<ProjectOwner> --Project=<ProjectName> <--Masked|--Original> [--Output=<OutputFilename>]

Where:
Owner    = Owner of the specified project.
Project  = Name of project from which data is to be extracted.
Masked   = Option to be selected if the input sequences in masked format are required.
Original = Option to be selected if the input sequences in original format are required.
Output   = File into which data is exported.

Examples: stack_ReportAllSequences.py --Owner=liza --Project=testolf --Masked --Output=OlfMasked.fasta 
          stack_ReportAllSequences.py --Owner=liza --Project=testolf --Original --Output=OlfUnmasked.fasta
-------------------------------------------------------------------------------
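
When reports are generated from another script, the command line can be
assembled from its parts. The following Python sketch is an illustration
only. It uses only the options documented above, and the helper name
all_sequences_command is hypothetical.

```python
import shlex

def all_sequences_command(owner, project, masked=True, output=None):
    """Build the argument list for stack_ReportAllSequences.py."""
    argv = [
        "stack_ReportAllSequences.py",
        f"--Owner={owner}",
        f"--Project={project}",
        "--Masked" if masked else "--Original",   # compulsory: one or the other
    ]
    if output:
        argv.append(f"--Output={output}")         # optional output file
    return argv

argv = all_sequences_command("liza", "testolf", output="OlfMasked.fasta")
print(shlex.join(argv))
```

The printed line is identical to the first example above. The argument list
can be passed to subprocess.run(argv, check=True) on a system where the
stackPACK bin directory is on the PATH.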

5.2.2 stack_ReportConsensus.py
-------------------------------------------------------------------------------
Command:           stack_ReportConsensus.py

Info:              Outputs all clonelinked consensus sequences, multi-sequence 
                   cluster primary consensus sequences and/or multi-sequence 
                   cluster alternate consensus sequences, in FASTA format.
  
Usage:             stack_ReportConsensus.py --Owner=<ProjectOwner> --Project=<ProjectName> <ConsensusOptions> [--Output=<OutputFilename>]       

                   Consensus Options:    
                   --Clonelink
                   --Primary
                   --Alternate

                   Note: One or more options may be selected simultaneously

Where:  
Owner            = Owner of the specified project.
Project          = Name of project from which data is to be extracted.
ConsensusOptions = One or more of the three consensus options (explained in the 
                   usage) to be selected.
Output           = File into which data is exported.

Examples: stack_ReportConsensus.py --Owner=liza --Project=testolf --Clonelink --Alternate --Output=Olf_Clonelink_Alternate.fasta 
          stack_ReportConsensus.py --Owner=liza --Project=testolf --Clonelink --Output=Olf_Clonelink.fasta
-------------------------------------------------------------------------------

5.2.3 stack_ReportAllSingleton.py
-------------------------------------------------------------------------------
Command:           stack_ReportAllSingleton.py

Info:              Outputs all d2_cluster and/or PHRAP singleton sequences, 
                   in FASTA format.

Usage:             stack_ReportAllSingleton.py --Owner=<ProjectOwner> --Project=<ProjectName> <SingletonOptions> [--Output=<OutputFilename>]
  
                   Singleton Options:
                   --Singletons           Sequences not included in 
                                          multi-sequence clusters.     
                   --PHRAP                Clustered sequences excluded from the 
                                          PHRAP alignment. 
                   --MinimumSeqLength[=#] If no value is specified, the 
                                          d2_cluster minimum_sequence_size 
                                          parameter value is used by default.
  
                   Note: More than one singleton option may be used simultaneously

Where:
Owner            = Owner of the specified project.
Project          = Name of project from which data is to be extracted.
SingletonOptions = One or both of the singleton options (explained in the usage) 
                   to be selected.
Output           = File into which data is exported.

Examples: stack_ReportAllSingleton.py --Owner=liza --Project=testolf --Singletons --PHRAP --Output=Olf_d2_PHRAP.fasta
          stack_ReportAllSingleton.py --Owner=liza --Project=OlfData --Singletons --Output=Olf_d2.fasta
-------------------------------------------------------------------------------

5.2.4. stack_ReportClusterMemberEst.py
-------------------------------------------------------------------------------
Command:    stack_ReportClusterMemberEst.py

Info:       Lists the constituent EST or mRNA sequences of a single cluster or 
            of all clusters in the project, in FASTA or CSV format.

Usage:      stack_ReportClusterMemberEst.py --Owner=<ProjectOwner> --Project=<ProjectName> --Format=<FASTA|CSV> [--Accession=<cl#>] [--Output=<OutputFilename>]

            Note: 
            If the accession number is omitted, the cluster members for all 
            clusters within the project are output.

Where:
Owner     = Owner of the specified project.
Project   = Name of project from which data is to be extracted.
Format    = Format of output file.
Accession = Cluster accession number of the data to be output.
Output    = File into which data is exported.

Examples: stack_ReportClusterMemberEst.py --Owner=liza --Project=testolf --Format=FASTA --Accession=cl8 --Output=Olfcl8.fasta
          stack_ReportClusterMemberEst.py --Owner=liza --Project=testolf --Format=FASTA --Output=Olf_cl_project.fasta
-------------------------------------------------------------------------------

5.2.5. stack_ReportAlignment.py
-------------------------------------------------------------------------------
Command:    stack_ReportAlignment.py

Info:       Outputs initial PHRAP or final CRAW sequence alignments, per 
            accession or for the whole project, in MSF, ClustalW or ACE format.

Usage:      stack_ReportAlignment.py --Owner=<ProjectOwner> --Project=<ProjectName> --Format=<ACE|MSF|CLUSTALW> [--Accession=<cl#|ct#|cn#>] [--Alignment=<Final|PHRAP>] [--Output=<filename>]

            Note:
            - The accession number must be specified if the alignment for a 
              particular cluster, contig or consensus is required. If omitted, 
              all specified alignments for the project are output.
            - Cluster accession numbers (cl#) must be used when specifying ACE 
              format. ACE format is valid only for the PHRAP alignment.
            - Contig accession numbers (ct#) must be used when the PHRAP 
              alignment is required.
            - Consensus accession numbers (cn#) must be used when the final 
              alignment is required.
            - Alignment type is only required if no accession is specified. 

Where:
Owner     = Owner of the specified project.  
Project   = Name of project from which data is to be extracted.
Format    = Format of output file.
Accession = Cluster, contig or consensus accession number of the alignment to be 
            output.
Alignment = Alignment type to be outputted.
Output    = File into which data is exported.

Examples: stack_ReportAlignment.py --Owner=liza --Project=testolf --Format=ACE --Accession=cl8 --Output=Olf_cl8.ace
          stack_ReportAlignment.py --Owner=liza --Project=testolf --Format=MSF --Accession=cn8 --Alignment=Final --Output=Olf_cn8.msf

NOTE: If the ACE format option is selected, sequence information will be output 
      in the old ACE format.
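
The accession and alignment rules in the notes above can be checked before a 
command is run. The following Python sketch is illustrative only and is not 
part of stackPACK: the function name check_alignment_args is hypothetical, 
and it encodes only the rules stated in this section, assuming accessions are 
written in lower case as in the examples.

```python
import re

def check_alignment_args(fmt, accession=None, alignment=None):
    """Return a list of problems; an empty list means no documented rule is broken."""
    problems = []
    fmt = fmt.upper()
    align = alignment.upper() if alignment else None  # "FINAL" or "PHRAP"

    if fmt not in ("ACE", "MSF", "CLUSTALW"):
        problems.append("Format must be ACE, MSF or CLUSTALW")
    if align not in (None, "FINAL", "PHRAP"):
        problems.append("Alignment must be Final or PHRAP")
    if fmt == "ACE" and align == "FINAL":
        problems.append("ACE format is valid only for the PHRAP alignment")

    if accession is None:
        # Without an accession, the alignment type has to be named explicitly.
        if align is None:
            problems.append("Alignment type is required when no accession is given")
        return problems

    match = re.fullmatch(r"(cl|ct|cn)\d+", accession)
    if match is None:
        problems.append("Accession must look like cl#, ct# or cn#")
        return problems

    prefix = match.group(1)
    if fmt == "ACE" and prefix != "cl":
        problems.append("ACE format requires a cluster accession (cl#)")
    if prefix == "ct" and align == "FINAL":
        problems.append("Contig accessions (ct#) refer to the PHRAP alignment")
    if prefix == "cn" and align == "PHRAP":
        problems.append("Consensus accessions (cn#) refer to the final alignment")
    return problems
```

Both commands in the Examples above pass this check, whereas combining 
--Format=ACE with a cn# accession is reported as a problem.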

-------------------------------------------------------------------------------

5.2.6. stack_ReportClusterAlignmentAnalysis.py
-------------------------------------------------------------------------------
Command:    stack_ReportClusterAlignmentAnalysis.py

Info:       Outputs the Alignment Analysis CRAW logs, per accession or for the 
            whole project.

Usage:      stack_ReportClusterAlignmentAnalysis.py --Owner=<ProjectOwner> --Project=<ProjectName> [--Accession=<cl#|ct#>] [--Output=<OutputFilename>] 

            Note:
            If the accession number is omitted, the alignment analyses for the 
            whole project are output.

Where:
Owner     = Owner of the specified project.
Project   = Name of project from which data is to be extracted.
Accession = Cluster or contig accession number of the alignment analysis to be
            output.
Output    = File into which data is exported.

Examples: stack_ReportClusterAlignmentAnalysis.py --Owner=liza --Project=testolf --Accession=cl8 --Output=Olf_cl8.craw
          stack_ReportClusterAlignmentAnalysis.py --Owner=liza --Project=testolf --Output=Olf_project.craw
-------------------------------------------------------------------------------

5.2.7. stack_ReportNonRedundant.py
-------------------------------------------------------------------------------
Command:          stack_ReportNonRedundant.py

Info:             Non-redundant output of the entire project, in FASTA or CSV 
                  format.
  
Usage:            stack_ReportNonRedundant.py --Owner=<ProjectOwner> --Project=<ProjectName> --Format=<FASTA|CSV> [SequenceOptions] [--Output=<OutputFilename>]
 
                  Sequence Options:
                  --Clonelink          Consensus sequences for those multi-sequence 
                                       clusters joined by virtue of clone ID.
                  --Primary            Primary consensus sequences for all
                                       multi-sequence clusters NOT present in
                                       clonelinked clusters.
                  --Singletons         Singleton sequences NOT present in 
                                       clonelinked clusters.

                  Note:
                  - When the CSV format is specified, all three sequence options 
                    are selected by default.
                  - When the FASTA format is specified, a sequence option must 
                    be specified. More than one sequence option may be specified 
                    simultaneously.  

Where:
Owner           = Owner of the specified project.
Project         = Name of project from which data is to be extracted.
Format          = Format of output file.
SequenceOptions = One or more of the three sequence options (explained in the 
                  usage) to be selected.  
Output          = File into which data is exported.

Examples: stack_ReportNonRedundant.py --Owner=liza --Project=testolf --Format=FASTA --Primary --Singletons --Output=Olf_Primary_Singletons.fasta
          stack_ReportNonRedundant.py --Owner=liza --Project=testolf --Format=CSV --Output=Olf.csv

NOTE: 
 - For a comprehensive non-redundant report, all three sequence options should 
   be included.
 - Alternate consensus sequences are not included in the non-redundant output 
   report. 
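
The format and sequence option rules above can be applied when a command line 
is assembled by a script. The following Python sketch is illustrative only 
and is not part of stackPACK: the function name build_nonredundant_command 
and the output filename Olf_nonredundant.fasta are hypothetical, and only 
the options documented in this section are used.

```python
SEQUENCE_OPTIONS = ("Clonelink", "Primary", "Singletons")

def build_nonredundant_command(owner, project, fmt, options=(), output=None):
    """Return the argument list for one stack_ReportNonRedundant.py call."""
    fmt = fmt.upper()
    if fmt not in ("FASTA", "CSV"):
        raise ValueError("Format must be FASTA or CSV")
    unknown = [opt for opt in options if opt not in SEQUENCE_OPTIONS]
    if unknown:
        raise ValueError("Unknown sequence option(s): " + ", ".join(unknown))
    if fmt == "FASTA" and not options:
        # FASTA needs at least one sequence option; CSV selects all three by default.
        raise ValueError("FASTA format requires at least one sequence option")

    command = [
        "stack_ReportNonRedundant.py",
        f"--Owner={owner}",
        f"--Project={project}",
        f"--Format={fmt}",
    ]
    command += [f"--{opt}" for opt in options]
    if output is not None:
        command.append(f"--Output={output}")
    return command

# A comprehensive FASTA report includes all three sequence options.
comprehensive = build_nonredundant_command(
    "liza", "testolf", "FASTA", SEQUENCE_OPTIONS, "Olf_nonredundant.fasta"
)
```

On a machine where stackPACK is installed, the returned list could be passed 
to subprocess.run(comprehensive, check=True).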
--------------------------------------------------------------------------------
 

6. Expert Configuration

The stackPACK software has a system-wide configuration file in the 
following location:   /etc/stackpack

Users who want to configure stackPACK differently for their own use can create 
an individual configuration file named ".stackpackrc" in their home directory. 
Key parameters that can be adjusted in .stackpackrc include the repeat masking 
file, the masking program, the number of processors for the masking, clustering 
and assembly steps and, for expert users, the parameters of each program called 
externally by stackPACK. Although only the most commonly used parameters for 
each of these programs are listed in the configuration file, any parameter can 
be set as a flag.

NOTE: Please ensure that there are no spaces between "flags=" and the first 
      flag in your configuration file.
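
A configuration file can be scanned for this mistake before stackPACK is run. 
The following Python sketch is illustrative only and is not part of 
stackPACK: the function name find_flag_spacing_problems is hypothetical, the 
flags "-a -b" are placeholders, and the check assumes that each "flags=" 
setting sits on a single line.

```python
import re

# Matches "flags=" followed by whitespace and then the first flag.
_FLAGS_WITH_SPACE = re.compile(r"flags=[ \t]+\S")

def find_flag_spacing_problems(lines):
    """Return the 1-based numbers of lines with a space after "flags="."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if _FLAGS_WITH_SPACE.search(line)
    ]

# Placeholder lines: the first has the unwanted space, the second does not.
sample = ["flags= -a -b", "flags=-a -b"]
problem_lines = find_flag_spacing_problems(sample)
```

For a real file, the lines could be read with open(path).read().splitlines().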

stackPACK first sources /etc/stackpack for parameters. It then sources 
~/.stackpackrc in the user's home directory, and any setting declared there 
overrides the corresponding setting in /etc/stackpack. Thus, the user can 
override any parameter in /etc/stackpack by setting it in ~/.stackpackrc.
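
The override order described above can be illustrated with a short Python 
sketch. It is not stackPACK's own loader. It assumes an INI-style layout 
(suggested by the [WEB] section and the key=value settings mentioned in this 
manual), and the PAGESIZE values 20 and 50 are invented for the example. 
Temporary files stand in for /etc/stackpack and ~/.stackpackrc.

```python
import configparser
import os
import tempfile

def load_settings(system_path, user_path):
    """Read the system-wide file first, then let the user file override it."""
    parser = configparser.ConfigParser()
    # Files are read in order; later values replace earlier ones.
    # Files that do not exist are skipped silently.
    parser.read([system_path, user_path])
    return parser

with tempfile.TemporaryDirectory() as tmp:
    system_path = os.path.join(tmp, "stackpack")    # stands in for /etc/stackpack
    user_path = os.path.join(tmp, ".stackpackrc")   # stands in for ~/.stackpackrc
    missing_path = os.path.join(tmp, "no_such_file")

    with open(system_path, "w") as handle:
        handle.write("[WEB]\nPAGESIZE=20\n")
    with open(user_path, "w") as handle:
        handle.write("[WEB]\nPAGESIZE=50\n")

    # No user file: the system-wide value applies.
    system_only = load_settings(system_path, missing_path).get("WEB", "PAGESIZE")
    # User file present: its value overrides the system-wide one.
    with_user = load_settings(system_path, user_path).get("WEB", "PAGESIZE")
```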

The easiest way to create the .stackpackrc file is to copy /etc/stackpack 
to the user's home directory as .stackpackrc and further edit it.  
Example:

  cp /etc/stackpack ~/.stackpackrc
  vi ~/.stackpackrc

NOTE: As a safety precaution, the "System Configuration" section of the 
      /etc/stackpack configuration file should be removed from .stackpackrc. 
      Users should edit only those parameters in the "User editable 
      parameters" section.


Useful Configuration and Parameter Information:

 - CrossMatch or RepeatMasker can be specified for masking purposes in the 
   .stackpackrc file. 

 - To make RepeatMasker use its own specially formatted repeat database, set 
   mask_file=none

 - The batchsize of CrossMatch may be increased from its default value of 100 
   when masking large datasets. Increasing the batch size can increase the speed 
   of masking, but also increases the memory requirements.  Therefore, increases 
   in batch size should be considered carefully and in conjunction with the 
   total memory available.

 - Using different PHRAP versions:
   - PHRAP v 0.990319, as well as the older version of PHRAP, v 0.960731, may 
     both be used in stack_Assemble. PHRAP v 0.990319 is used by default.
   - In order to use PHRAP version 0.960731, the PHRAP old_ace parameter needs 
     to be set to 0, and the executable needs to point at the correct version of 
     PHRAP in the configuration file.
   - The retain_duplicates flag is not supported by PHRAP version 0.960731.
   - PHRAP has limits of 64K bases for each sequence, and 64K sequences per 
     cluster. PHRAP can be compiled with a .longreads or .manyreads option that 
     allows assembly of sequences with more than 64,000 bp or clusters with more 
     than 64,000 sequences. Please refer to section iv in the Installation 
     Instructions (Part I) of the Documentation For PHRAP and 
     CROSS_MATCH (Version 0.990319) for further details.

 - Window size of d2_cluster: If your input sequences are all long, e.g., mRNA 
   sequences, increasing the window size of the d2_cluster algorithm will speed 
   the clustering process.

 - Batchsize and number of processors: The batch size parameter of the masking 
   step adjusts itself automatically according to the number of input sequences 
   in the dataset and the number of processors specified.  Therefore, the log 
   file may not reflect the batch size set in the .stackpackrc or 
   /etc/stackpack configuration file.  For example, if a batch size of 1000 and 
   2 processors are specified for the masking step, the batch size reported in 
   the resulting log file will be 1000/2, or 500.

 - The number of projects listed in the project manager, both from the command 
   line and from the web interface, can be set in the [WEB] section of the 
   configuration file by changing the value of "PAGESIZE".

 - stackPACK temp directory: During processing, stackPACK uses a temporary 
   directory (usually found in /usr/local/stackpack/tmp) to store intermediate 
   results and log files from the stackPACK pipeline.  Large projects may 
   require a large amount of temporary space. If there is insufficient 
   temporary space, the steps of the stackPACK pipeline will not complete.  
   Additionally, if stackPACK processing is interrupted in any way, the 
   contents of the temp directory may not be deleted.  It is important to check 
   the temporary directory periodically and delete outdated temporary files 
   manually.
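
The periodic check of the temporary directory can be scripted. The following 
Python sketch is illustrative only and is not part of stackPACK: the function 
name find_stale_files and the 7-day threshold are hypothetical, and the path 
is the usual location given above, which may differ on your installation. 
The sketch only lists candidate files and deletes nothing.

```python
import os
import time

def find_stale_files(directory, max_age_days, now=None):
    """Return paths under `directory` last modified more than `max_age_days` ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    stale = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    stale.append(path)
            except OSError:
                # The file disappeared or is unreadable; skip it.
                continue
    return sorted(stale)

if __name__ == "__main__":
    # Assumed location and threshold; adjust both for your installation.
    for path in find_stale_files("/usr/local/stackpack/tmp", max_age_days=7):
        print(path)
```

Review the listed files before deleting any of them, and make sure that no 
stackPACK step is running at the time.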




7. Support/Questions

For more information about stackPACK or answers to technical questions, please
contact the Electric Genetics team on:
     phone    +27 (0)21 959-3964
     fax      +27 (0)21 959-2512
     e-mail   support@egenetics.com
     web      http://www.egenetics.com/



8. About stackPACK

stackPACK v2.2 is Copyright(C) 1999 - 2002 Electric Genetics PTY Ltd. 
                              All Rights Reserved.

stackPACK(tm), WebProbe(tm) and WebReport(tm) are trademarks of Electric
Genetics PTY Ltd.

d2_cluster(tm) and CRAW(tm) are trademarks of the University of Houston.

All other trademarks are property of their respective owners.