COMMAND LINE USERS MANUAL
stackPACK(tm) v2.2
Clustering, Alignment and Expression Analysis System
==============================================================================
TABLE OF CONTENTS
==============================================================================
1. stackPACK Conventions
2. Input Data Format
2.1 GenBank flatfile format
2.2 FASTA format
2.2.1 Simple FASTA format
2.2.2 STACK FASTA format
2.2.3 NCBI FASTA format
2.2.4 Mixed or Unknown FASTA format
3. Running stackPACK
3.1 Setting the Environment
3.2 Managing Data
3.3 Creating a Project
3.3.1 Creating a project from the command line
3.3.2 Creating a project using the menu system
3.4 Importing Data
3.5 Masking Data
3.5.1 Masking Parameters
3.6 Clustering Data
3.6.1 Cluster Parameters
3.7 Assembling Data
3.7.1 Assembly Parameters and Configuration
3.8 Analyzing Data
3.8.1 Analysis Configuration
3.9 Linking Data
3.10 Incremental Addition of Data
3.11 Undoing steps in the pipeline
3.12 Restarting steps in the pipeline
3.13 Handling projects with the same name
3.14 Converting stackPACK v2.1 projects to v2.2
4. Web-based Viewing and Output of Data
5. Exporting Data from the Command Line
6. Expert Configuration
7. Support/Questions
8. About stackPACK
==============================================================================
1. stackPACK Conventions
The following conventions are used in the stackPACK command line interface
and in this Command Line Users Manual:
- When usage is provided for command line programs, insert valid operations
in command line statements using the following conventions:
<operation> Compulsory command line operations
[operation] Optional command line operations
[operation1|operation2] Choose operation1 OR operation2
- When using stackPACK from the command line, the user can view the
command line usage of a program by typing the program name without any
operations or parameters specified. (e.g. stack_ProjectManager)
2. Input Data Format
The stackPACK system accepts input in GenBank flatfile format or FASTA format.
In addition to these, stackPACK also accepts PHRED quality score file formats.
2.1 GenBank flatfile format
GenBank flatfile format is defined as the format of sequence entries in the
GenBank database or as downloaded from the NCBI web site (e.g. Entrez search
results) when GenBank format is specified. The full GenBank format specification
is found in section 3.4 of the GenBank release notes. GenBank format
should be used to parse maximum annotation information upon import.
2.2 FASTA format
The stackPACK system accepts FASTA files with several different header line
formats. The system attempts to parse annotations such as direction, clone ID
and library name from the FASTA header line when available and in a recognized
format.
2.2.1 Simple FASTA format (SIMPLE)
>[accession].[direction] [clone ID]
Where direction is either "r1" for a 3-Prime clone or "f1" for a 5-Prime clone
e.g. >37463.f1 g83244
ACGTGACTGCTACGTACGGGCGTTACGACTGCTACGATCGCATGC
TATGTCGTAGCAGCCGTGTACACGTGTTTATTCGTAGGGCTTCTA
CTCAGTCGTACGTACGTACGT
Note: A sequence with an accession like "R.C.3746" cannot be specified as
Simple FASTA format, as the program will parse "R" for the accession and
"C" for the direction. For accessions beginning with "R.C.", the GUESS
option should be used.
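The note above can be illustrated with a minimal Python sketch. It is not stackPACK's own parser; the function names are hypothetical and the splitting rule (text before the first "." is the accession, text after it is the direction) is assumed from the description above.

```python
def parse_simple_header(line):
    """Split a Simple FASTA header into accession, direction and clone ID.

    Assumed rule: the text before the first "." is the accession and the
    text up to the next "." (or the end of the name) is the direction.
    """
    if not line.startswith(">"):
        raise ValueError("FASTA header lines must start with '>'")
    fields = line[1:].split()
    if not fields:
        raise ValueError("header has no accession")
    name = fields[0]
    clone_id = fields[1] if len(fields) > 1 else None
    accession, _, rest = name.partition(".")
    direction = rest.split(".", 1)[0] if rest else None
    return {"accession": accession, "direction": direction, "clone_id": clone_id}


def is_valid_simple(parsed):
    """A Simple header needs an accession and a direction of f1 or r1."""
    return bool(parsed["accession"]) and parsed["direction"] in ("f1", "r1")
```

Under these assumptions ">37463.f1 g83244" parses cleanly, while ">R.C.3746" yields accession "R" and direction "C", which is why the GUESS option is the safer choice for such accessions.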
2.2.2 STACK FASTA format (STACK)
>[accession] [gi] | [accession] CLONE: [clone] CLONE_LIB: [clonelib] LEN: [len]
FILE [source file] [direction<5-PRIME|3-PRIME>] DEFN: [descriptive text]
e.g. >T27877 g609975 | T27877 CLONE: 17194 CLONE_LIB: Human Eye LEN: 505
bp FILE gbest3.seq 5-PRIME DEFN: EST19137 Homo sapiens cDNA 5'end
ACGTGACTGCTACGTACGGGCGTTACGACTGCTACGATCGCATGC
TATGTCGTAGCAGCCGTGTACACGTGTTTATTCGTAGGGCTTCTA
CTCAGTCGTACGTACGTACGT
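As a rough illustration of the fields carried by a STACK header, the Python sketch below splits the example header with a regular expression. The pattern and field names are assumptions made for this example (including the optional "bp" after the length, as seen in the example header); stackPACK's own parser may be more permissive.

```python
import re

# Hypothetical pattern for the STACK header layout shown above.
STACK_HEADER = re.compile(
    r"^>(?P<accession>\S+)\s+(?P<gi>\S+)\s+\|\s+\S+\s+"
    r"CLONE:\s*(?P<clone>.*?)\s+CLONE_LIB:\s*(?P<clone_lib>.*?)\s+"
    r"LEN:\s*(?P<length>\d+)(?:\s*bp)?\s+FILE\s+(?P<source_file>\S+)\s+"
    r"(?P<direction>5-PRIME|3-PRIME)\s+DEFN:\s*(?P<definition>.*)$"
)


def parse_stack_header(line):
    """Return the header fields as a dict, or None if the line does not match."""
    match = STACK_HEADER.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["length"] = int(fields["length"])
    return fields
```

The header must be passed as a single line; in the example above it is wrapped only for display.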
2.2.3 NCBI FASTA format (NCBI)
As retrieved from NCBI's Entrez when the FASTA option is selected.
A basic description can be found at http://www.ncbi.nlm.nih.gov/BLAST/fasta.html
e.g. >gi|4468770|emb|AJ009167.1|TSAJ9167 Trypanosoma sp. 18S rRNA gene,
isolate K&A
ACGTGACTGCTACGTACGGGCGTTACGACTGCTACGATCGCATGC
TATGTCGTAGCAGCCGTGTACACGTGTTTATTCGTAGGGCTTCTA
CTCAGTCGTACGTACGTACGT
Note: Clone information in NCBI format is part of the free form text in the
definition line. Clone information is thus not parsed when using NCBI
format and therefore data imported in NCBI format cannot be clone linked.
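The Python sketch below shows what can be recovered from an NCBI header of the form used in the example: the identifier fields are separated by "|" and everything after the first space is free-form definition text. The function and key names are hypothetical, and only the "gi|...|db|accession|locus" layout of the example is handled.

```python
def parse_ncbi_header(line):
    """Split a gi-style NCBI FASTA header into its identifier fields."""
    if not line.startswith(">"):
        raise ValueError("FASTA header lines must start with '>'")
    identifiers, _, definition = line[1:].strip().partition(" ")
    parts = identifiers.split("|")
    if len(parts) < 4 or parts[0] != "gi":
        raise ValueError("not a gi-style NCBI header")
    locus = parts[4] if len(parts) > 4 and parts[4] else None
    return {
        "gi": parts[1],
        "database": parts[2],
        "accession": parts[3],
        "locus": locus,
        # Clone details, if any, are buried in this free text and not parsed.
        "definition": definition,
    }
```

Because any clone name sits inside the free-text definition, there is no dedicated field to link clones on, which matches the note above.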
2.2.4 Mixed or Unknown FASTA format (GUESS)
- Files with mixed FASTA header line formats or files with FASTA header lines
not described above can also be imported.
- If stackPACK does not identify one of its pre-defined FASTA headers, the
program determines an accession number for the entry by extracting all
valid characters (alphanumeric and punctuation such as "_" and ".") found
between the > and the first space, where this string is 255 characters or fewer.
If the accession is longer than 255 characters, the sequence will not be
imported. Where possible, other details on the line may be parsed in as well.
Otherwise, the remainder of the line is ignored.
- It should therefore be ensured that the first 255 characters of each header
line entered for processing are unique.
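Before importing with GUESS, it can help to check a file for over-long or repeated accessions. The Python sketch below follows the rule described above; it is an illustration only. The exact set of "valid characters" is an assumption (letters, digits, "_" and "."), as the manual only gives examples, and the function names are hypothetical.

```python
import re

MAX_ACCESSION_LENGTH = 255
# Assumption: anything other than letters, digits, "_" and "." is dropped.
INVALID_CHARS = re.compile(r"[^A-Za-z0-9_.]")


def guess_accession(header):
    """Return the accession taken from a header line, or None if unusable."""
    token = header.rstrip("\n")[1:].split(" ", 1)[0]
    accession = INVALID_CHARS.sub("", token)
    if not accession or len(accession) > MAX_ACCESSION_LENGTH:
        return None
    return accession


def find_duplicate_accessions(headers):
    """List accessions that occur on more than one header line."""
    seen, duplicates = set(), []
    for header in headers:
        accession = guess_accession(header)
        if accession is None:
            continue
        if accession in seen and accession not in duplicates:
            duplicates.append(accession)
        seen.add(accession)
    return duplicates
```

Running find_duplicate_accessions over all lines beginning with ">" gives a quick list of headers that should be made unique before import.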
Minimum requirement for FASTA format input file to stackPACK is:
>[accession number]
3. Running stackPACK
3.1 Setting the Environment
Once stackPACK is installed, you must set your path for the location of the
stackPACK programs. This allows you to work from anywhere on the system. If
stackPACK has been installed in the standard location, type the following
at the system prompt to set your paths:
C-shell users:
setenv PATH ${PATH}:/usr/local/stackpack/bin
setenv LD_LIBRARY_PATH /usr/local/stackpack/lib
Bash-shell users:
export PATH=$PATH:/usr/local/stackpack/bin
export LD_LIBRARY_PATH=/usr/local/stackpack/lib
You can verify whether or not your paths have been set correctly by
typing the following from anywhere on the system (e.g., home directory):
which stack_ProjectManager
You should see:
/usr/local/stackpack/bin/stack_ProjectManager
Then typing:
which d2_cluster
You should see:
/usr/local/stackpack/bin/d2_cluster
Then typing:
echo ${LD_LIBRARY_PATH}
You should see:
/usr/local/stackpack/lib
NOTE: If your system administrator has installed stackPACK in a directory
other than the standard one given in installation instructions, the paths
used above should be replaced by the actual path for the programs.
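The same checks can be scripted. The Python sketch below uses only the standard library to look up the two programs named above on the search path and to report LD_LIBRARY_PATH; the function name and the optional path argument are hypothetical conveniences for this example.

```python
import os
import shutil


def check_environment(programs=("stack_ProjectManager", "d2_cluster"), path=None):
    """Return ({program: location or None}, LD_LIBRARY_PATH or None).

    `path` defaults to the PATH of the current process, mirroring `which`.
    """
    locations = {name: shutil.which(name, path=path) for name in programs}
    return locations, os.environ.get("LD_LIBRARY_PATH")


if __name__ == "__main__":
    found, lib_path = check_environment()
    for name, location in found.items():
        print(name, "->", location or "NOT FOUND on PATH")
    print("LD_LIBRARY_PATH ->", lib_path or "NOT SET")
```

A program reported as not found usually means the PATH setting above was not applied in the current shell.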
3.2 Managing Data
Each clustering run performed with stackPACK is associated with a single project.
Projects are created and managed by the stack_ProjectManager program.
The stack_ProjectManager program consists of a number of operations which allow
the user to list, delete and create projects as well as to display summary
information on specified projects. stack_ProjectManager also allows the
conversion of stackPACK v2.1 projects to stackPACK v2.2 projects - this enables
the user to continue data processing and analysis on a project created with
stackPACK v2.1.
Each operation has its own defined set of necessary parameters.
--------------------------------------------------------------------------------
Command: stack_ProjectManager
Usage: stack_ProjectManager <operation> [parameters]
Valid operations, with their parameters are:
-menu run with a simple interactive menu
-list [owner] [description] list all projects
-summary <project> <owner> display summary statistics on
specified project
-delete <project> <owner> delete specified project
-create <project> [description] [owner] create a new project
-convert <project> <old_dsn> <old_dsn_login> <old_dsn_password> <1|0>
convert a v2.1 project to v2.2
NOTE: <old_dsn> for stackPACK v2.1.1 = stacksys
<old_dsn_login> for stackPACK v2.1.1 = stackpack
<old_dsn_password> for stackPACK v2.1.1 = stackpack
-------------------------------------------------------------------------------
3.3 Creating a Project
Projects must be created before the clustering pipeline can process data.
One project is created per data input file and it is usual for the project
name to reflect the type of data to be processed. Project names may consist of
alphanumerics and underscores ("_"). Projects can be created through the command
line or through the basic menuing system.
Note: stackPACK is NOT case-sensitive. Please be aware of this when naming
projects. For example, "UPPER" and "Upper" will be considered duplicate project
names if they have the same owner.
3.3.1 Creating a project from the command line:
--------------------------------------------------------------------------------
Command: stack_ProjectManager -create
Info: creates projects
Usage: stack_ProjectManager -create <project> [description] [owner]
Where:
Project = Brief one-word project name.
Multiple projects may be given the same name provided that they
are owned by different users.
Valid characters are alphanumerics and underscore ("_").
Description = One-line project description for your reference.
Put multi-word descriptions in quotations.
Description is displayed in subsequent project listings.
All characters are valid.
Owner = Email address or name of owner of project
Valid characters are alphanumerics, dot ("."), dash ("-"),
underscore ("_") and "@".
Example:
stack_ProjectManager -create testolf "clustering olfactory data" liza
--------------------------------------------------------------------------------
The above example returns the following from the system:
Creating project: testolf
Description: clustering olfactory data
Owner: liza
Created project 'testolf'
NOTE: If no project owner is specified, stackPACK will use the logged-in
username as the project owner.
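The naming rules above can be checked before a project is created. The Python sketch below is a hypothetical helper, not part of stackPACK. It assumes that owners are compared without regard to case in the same way as project names; the manual states case-insensitivity only in general terms.

```python
import re

PROJECT_NAME = re.compile(r"[A-Za-z0-9_]+")   # alphanumerics and underscore
OWNER_NAME = re.compile(r"[A-Za-z0-9._@-]+")  # plus dot, dash and "@"


def is_valid_project_name(name):
    return PROJECT_NAME.fullmatch(name) is not None


def is_valid_owner(owner):
    return OWNER_NAME.fullmatch(owner) is not None


def is_duplicate_project(name, owner, existing):
    """True if (name, owner) clashes with an existing project, ignoring case.

    `existing` is an iterable of (project, owner) pairs.
    """
    taken = {(p.lower(), o.lower()) for p, o in existing}
    return (name.lower(), owner.lower()) in taken
```

For example, with an existing project ("UPPER", "liza"), the name "Upper" is a duplicate for owner liza but not for a different owner.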
3.3.2 Creating a project using the menu system:
The same project can be created through the simple menuing system for
stack_ProjectManager. An example is given below. Comments on the right preceded
by an arrow are for the user's information.
--------------------------------------------------------------------------------
Command: stack_ProjectManager -menu
Stackpack Project Manager
=========================
1... List all projects
2... Create a project
9... Delete a project
q... Exit project manager
> 2 <---select #2 Create a project
Create Project
--------------
1... Project Owner: liza <---default owner is the name of the
person who is logged on
2... Project Name:
3... Project Description:
c... Create Project
q... Return to main menu
> 1 <---select option 1 to enter project owner
Project Owner: liza <---enter name of project owner
Create Project
--------------
1... Project Owner: liza
2... Project Name:
3... Project Description:
c... Create Project
q... Return to main menu
> 2 <---select option 2 to enter project name
Project Name: testolf <---enter project name
Create Project
--------------
1... Project Owner: liza
2... Project Name: testolf
3... Project Description:
c... Create Project
q... Return to main menu
> 3 <---select option 3 to enter one-line
description
Project Description: clustering olfactory data <----enter description
Create Project
--------------
1... Project Owner: liza
2... Project Name: testolf
3... Project Description: clustering olfactory data
c... Create Project
q... Return to main menu
> c <----IMPORTANT: after entering details,
you must now type "c" to create
your project
-------------------------------------------------------------------------------
Creating project: 'testolf'...
Project created successfully
-------------------------------------------------------------------------------
Once your project has been created successfully, type "q" to exit the project
manager.
3.4 Importing Data
The sequences from the input data file must be imported into stackPACK's
database before the clustering engine can process them. Data in GenBank or a
range of FASTA formats may be imported. PHRED quality score files may also be
imported.
Non-alphabetic characters (including * , and numbers) in the sequence
lines are automatically stripped out when the file is read in.
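The effect of this clean-up on a single sequence line can be pictured with the short Python sketch below. It is an illustration of the rule as stated, not stackPACK's import code, and it assumes that "alphabetic" means the ASCII letters A-Z and a-z.

```python
import re

NON_ALPHABETIC = re.compile(r"[^A-Za-z]")


def clean_sequence_line(line):
    """Drop every character that is not an ASCII letter."""
    return NON_ALPHABETIC.sub("", line)
```

A line such as "ACGT*12 acgt," would therefore be read as "ACGTacgt".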
NOTE:
- The stack_Import output reports the number of sequences imported, as well
as the number of sequences that could not be imported into the project.
- The stack_Import output indicates the progress of stack_Import.
e.g. ...(99/2600)..
Each imported sequence is represented by a dot ("."), and "(99/2600)"
indicates that 99 sequences out of a total of 2600 sequences have been
imported.
--------------------------------------------------------------------------------
Command: stack_ImportGenbank
Info: imports GenBank format input sequences
Usage: stack_ImportGenbank <project> <Genbank File> [Organism]
Where:
Project = Brief one-word project name.
Genbank file = Input data file name and path.
Organism = Organism under study; If used stackPACK will only import
sequences with that specific organism designation. If no
organism is specified, all sequences will be imported. All
or part of an organism name, as it appears in GenBank, may
be used. If, for example, the term "apien" is used, all
sequences with "apien" somewhere in the GenBank flatfile
ORGANISM field, such as "Homo sapiens" and "homo sapien"
will be imported.
Example:
stack_ImportGenbank testolf /foo/gbest29.seq "Homo sapiens"
--------------------------------------------------------------------------------
The above example returns the following from the system:
Importing Genbank data.
Project: testolf
Filename: /foo/gbest29.seq
Organism: Homo sapiens
Importing: 2600 records
.........................................................
.........................................................
stack_ImportGenbank completed.
2600 imported.
2600 sequences processed.
Total sequences in project: 2600
--------------------------------------------------------------------------------
Command: stack_ImportFasta
Info: imports FASTA format input sequences
Usage: stack_ImportFasta <project> <FastaFile> [GUESS|SIMPLE|STACK|NCBI]
stack_ImportFasta <project> <FastaFile> PHRED <PHRED Quality File>
Where:
Project = Brief one-word project name.
Fasta file = Input data file name and path.
Format = Type of FASTA format file input.
Options: GUESS|SIMPLE|STACK|NCBI|PHRED
Default format is GUESS.
See section 2 for a more detailed description of each format.
Quality file = Input quality file name and path.
Only valid when PHRED format specified.
Required if PHRED format specified.
Examples:
stack_ImportFasta testolf /foo/olf.fasta SIMPLE
stack_ImportFasta testolf /foo/olf.fasta PHRED /foo/olf.fasta.qual
--------------------------------------------------------------------------------
The first of the above two stack_ImportFasta examples returns the following from
the system:
Importing Fasta data.
Project: testolf
Filename: /foo/olf.fasta
Format: SIMPLE
Importing: 2600 records
....................................................
....................................................
stack_ImportFasta completed.
2600 imported.
2600 sequences processed.
Total sequences in project: 2600
The second example, when importing PHRED quality scores, returns the following
from the system:
Importing Fasta data.
Project: testolf
Filename: /foo/olf.fasta
Format: PHRED
QualFile: /foo/olf.fasta.qual
Importing: 2600 records
....................................................
Importing quality data from: '/foo/olf.fasta.qual'
....................................................
stack_ImportFasta completed.
2600 imported.
2600 sequences processed.
Total sequences in project: 2600
NOTE:
- If the GUESS format is selected and stackPACK does not identify one of its
pre-defined FASTA headers, the program determines an accession number for
the sequence entry by extracting all valid characters (alphanumeric and
punctuation such as "_" or ".") found between the > and the first space.
- Sequences will ONLY be imported if there are 255 or fewer valid characters
between the > and the first space. It should thus be ensured that these 255
characters of each header line entered for processing are unique.
- Clonelinking will not occur when PHRED or NCBI format is selected as these
formats do not parse any clone information.
Multiple imports in multiple formats may be made into the same project, so long
as no processing has yet taken place. This provides a mechanism for taking
advantage of maximum annotative information when sequences come from a variety of
sources. These data files will then be processed as one project.
Example:
stack_ImportFasta olftest /foo/olfactory.fasta
stack_ImportGenbank olftest /foo/olfactory.genbank
Both the olfactory.fasta and the olfactory.genbank data files will be imported
into the project olftest, and will be processed through the rest of the pipeline
as one dataset.
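When several files are imported into one project, the import commands can be assembled by a small script. The Python sketch below only builds argument lists that follow the usage lines given in this section; the function names are hypothetical, and actually running the commands (for example with subprocess.run) is left to the reader.

```python
FASTA_FORMATS = ("GUESS", "SIMPLE", "STACK", "NCBI", "PHRED")


def genbank_import(project, path, organism=None):
    """Argument list for stack_ImportGenbank <project> <Genbank File> [Organism]."""
    command = ["stack_ImportGenbank", project, path]
    if organism:
        command.append(organism)
    return command


def fasta_import(project, path, fasta_format="GUESS", quality_file=None):
    """Argument list for stack_ImportFasta; PHRED needs a quality file."""
    if fasta_format not in FASTA_FORMATS:
        raise ValueError("unknown FASTA format: " + fasta_format)
    if (fasta_format == "PHRED") != (quality_file is not None):
        raise ValueError("a quality file is required for, and only valid with, PHRED")
    command = ["stack_ImportFasta", project, path, fasta_format]
    if quality_file is not None:
        command.append(quality_file)
    return command
```

The two imports in the example above correspond to fasta_import("olftest", "/foo/olfactory.fasta") followed by genbank_import("olftest", "/foo/olfactory.genbank").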
3.5 Masking Data
The clustering procedure is intended to group together those sequences which
share identical regions. A common problem in EST clustering is contamination
with a sequence common to several members of the input EST data set but not
representing valid gene data. The masking step helps ensure that ESTs submitted
for clustering are free of artifacts before clustering begins.
Users can choose to mask input sequences either with CrossMatch (Green, 1999)
or RepeatMasker (Smit, AFA & Green, P). This choice must be specified in
the stackPACK configuration file.
CrossMatch masks input sequences against a database containing:
- Repeat sequences. For STACKdb production Electric Genetics uses RepBase
(Jurka, 1995). Your system administrator may have installed RepBase or
another repeat database more pertinent to your data.
- Common vector sequences, such as those distributed by NCBI.
- Other potential contaminants such as rodent, mitochondrial and ribosomal
DNA.
RepeatMasker masks input sequences against a database containing:
- RepBase database, specially formatted for use with RepeatMasker
(available from the RepBase web site).
- Any FASTA formatted file containing repeat sequences and other contaminants.
NOTE: Vector sequences are not included in the RepBase database used by
RepeatMasker. If you wish to screen for vector contamination as well,
using RepeatMasker, you must perform this task prior to importing your
data into stackPACK, or you must create your own FASTA formatted repeat
database that includes vector sequences.
In both programs, the sequences are masked by replacing the contaminated
portions of the sequence with x's, which are ignored by further steps in the
clustering pipeline. This ensures that only valid sequence data contributes
to the associations made to generate a cluster.
Masked regions are retained during the clustering pipeline and are visible
in the Sequence View, PHRAP Alignment View and CRAW Alignment View. Portions
of consensus sequences with only masked bases ("x") in their constituent
sequences will have consensus of "n". Regions of consensus sequences with
long strings of "n"s will have these truncated to no more than 10 "n"s.
Alternatively, users may select to cluster with masked sequences and assemble
with unmasked sequences to maximize consensus information. Please see Assembling
Data for more details. In this event, the masked data will only be used
for initial cluster groupings and unmasked data will be used for the
remainder of the steps in the pipeline.
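The two text transformations described above, replacing contaminated regions with "x" and limiting runs of "n" in a consensus to 10, can be sketched in Python as follows. This is an illustration only: the masking programs decide which regions to mask, and the region coordinates used here (0-based, end-exclusive) are an assumption of the example.

```python
import re


def mask_regions(sequence, regions):
    """Replace each (start, end) region of the sequence with "x" characters."""
    bases = list(sequence)
    for start, end in regions:
        for index in range(max(start, 0), min(end, len(bases))):
            bases[index] = "x"
    return "".join(bases)


def truncate_n_runs(consensus, max_run=10):
    """Shorten any run of more than max_run "n" characters to max_run."""
    pattern = r"n{%d,}" % (max_run + 1)
    return re.sub(pattern, "n" * max_run, consensus)
```

For example, masking positions 2 to 4 of "ACGTACGT" gives "ACxxxCGT", and a consensus containing 25 consecutive "n"s is shortened to 10.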
When stackPACK is installed, the system administrator performing the
installation should place a repeat database for the generic, system-wide
configuration in /usr/local/stackpack/supporting/
This repeat database will then be used by default and need not be specified
on the command line.
Alternatively, any of the system wide configurations can be overridden in the
user's home directory in the file .stackpackrc. See section 6, Expert
Configuration, for more details.
--------------------------------------------------------------------------------
Command: stack_Mask
Info: masks an input file of sequences against common contaminants
Requires: CrossMatch or RepeatMasker. Third-party software.
RepBase database or FASTA file of sequences to mask against.
Usage: stack_Mask <project> [Mask File]
stack_Mask <project> skip
stack_Mask <project> --conf=<conffile>
Where:
Project = Brief one-word project name.
Mask file = Name and location of FASTA file of sequences to mask against.
skip = Option used to omit masking.
conffile = configuration file
Examples:
stack_Mask testolf /foo/repeat.seq
stack_Mask testolf skip
stack_Mask testolf --conf=Conf_Primates
--------------------------------------------------------------------------------
The first of the above three stack_Mask examples returns the following from the
system when CrossMatch is used:
Masking sequence data
Project: testolf
Program: cross_match
Mask file: /foo/repeat.seq
Num cpus: 2
Batch size: 500 sequences
Processing: 2600 sequences in total
Flags: /foo/repeat.seq -minmatch 12 -minscore 20 -screen
...........
stack_Mask finished
Processed 2600 sequences
The first of the above three stack_Mask examples returns the following from the
system when RepeatMasker is used:
Masking sequence data
Project: testolf
Program: RepeatMasker
Mask file: /foo/repeat.seq
Num cpus: 1 (RepeatMasker using 2 cpus)
Batch size: 500 sequences
Processing: 2600 sequences in total
Flags: -x -xm -nolow -pa 2 -lib /foo/repeat.seq
...........
stack_Mask finished
Processed 2600 sequences
NOTE: RepeatMasker, unlike CrossMatch, uses its own parallelization independent
of stack_Mask. Thus when using RepeatMasker to mask, the number of CPUs in
the output may not correspond to the num cpu value specified in the
/etc/stackpack configuration file.
3.5.1 Masking Parameters
Users may generate multiple data-specific configuration files by copying the
/etc/stackpack configuration file and editing the parameters. The name and
location of the data-specific configuration file should be specified in the
stack_Mask command with the --conf= option.
RepeatMasker requires the following default parameters within stackPACK:
-xm Creates an additional output file in cross_match format.
-x Returns repetitive regions masked with Xs rather than Ns.
FAQ: Why change the batchsize?
ANSWER: The larger the batchsize, the more RAM is required to process the
data. If your computer has limited RAM or you find the system
running out of RAM during the stack_Mask process, re-run your
data with a smaller batchsize. Conversely, if your computer
has sufficient RAM you may speed processing by increasing the
batch size.
FAQ: How do I use RepeatMasker instead of CrossMatch?
ANSWER: - Uncomment 'program=RepeatMasker' in your stackpack
configuration file.
- Comment out 'program=cross_match' in your stackpack
configuration file.
- If you would like RepeatMasker to use its own mask file,
uncomment 'mask_file=none' in your stackpack configuration
file. There should be NO SPACES before 'mask_file=none'.
e.g. [stack_Mask]
; You can choose between program=cross_match and program=RepeatMasker
;program=cross_match
program=RepeatMasker
; If you wish RepeatMasker to use its own masking database, set
; mask_file=none
;mask_file=/usr/local/stackpack/supporting/full_repeat.seq
mask_file=none
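To confirm which masking program an edited configuration selects, the fragment above can be read with Python's standard configparser module. This is a sketch under the assumption that the file is INI-compatible (section headers in brackets, ";" comment lines, key=value pairs); stackPACK reads the file with its own code, which may differ in detail. The function name is hypothetical.

```python
import configparser

EXAMPLE_CONFIG = """\
[stack_Mask]
; You can choose between program=cross_match and program=RepeatMasker
;program=cross_match
program=RepeatMasker
;mask_file=/usr/local/stackpack/supporting/full_repeat.seq
mask_file=none
"""


def read_mask_settings(text):
    """Return (program, mask_file) from the [stack_Mask] section, or None values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["stack_Mask"]
    # Lines starting with ";" are comments, so only uncommented keys are seen.
    return section.get("program"), section.get("mask_file")
```

With the fragment above the result is ("RepeatMasker", "none"), which shows that the commented-out cross_match line is ignored.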
3.6 Clustering Data
The clustering step of stackPACK uses d2_cluster, a high-performance comparison
algorithm that rapidly determines the relative similarity of large datasets of
genetic sequences (Hide W., Burke J., Davison D.: Biological Evaluation of d2,
an Algorithm for High-Performance Sequence Comparison. Journal of Computational
Biology 1 (3) 199-215). An update of the algorithm, produced by
Electric Genetics in January 2001, significantly improved the speed of d2_cluster.
d2_cluster implements a loose approach to sequence clustering by identifying
and counting matching n-length words (n=6), in contrast with stricter
approaches in which clusters are built up based on matching entire sequence
fragments. While the strict methodology yields cluster members that are
highly related, the loose approach presents the opportunity to detect clusters
which are related by re-arrangement or alternative splicing. Although the
resulting clusters are likely to be more "noisy", the combination with
verification tools for multiple sequence alignments eliminates this noise
and produces networks of highly related sequences for further analysis.
d2_cluster, a word-based, greedy clustering algorithm, is separate from the
assembly tool (PHRAP). With the default parameters it identifies sequences that
are greater than 96% identical over a window of 100 bases.
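The idea of comparing sequences by counting words can be illustrated with a short Python sketch. It counts overlapping 6-letter words in two sequences and sums the squared differences of the counts, which is one common formulation of a word-multiplicity measure and is assumed here for illustration. It is not the d2_cluster implementation, which works over windows and is heavily optimized; the function names are hypothetical.

```python
from collections import Counter


def word_counts(sequence, word_size=6):
    """Count every overlapping word of length word_size in the sequence."""
    sequence = sequence.upper()
    last_start = len(sequence) - word_size + 1
    return Counter(sequence[i:i + word_size] for i in range(last_start))


def word_difference_score(first, second, word_size=6):
    """Sum of squared differences in word counts; 0 means identical word content."""
    counts_a = word_counts(first, word_size)
    counts_b = word_counts(second, word_size)
    words = counts_a.keys() | counts_b.keys()
    return sum((counts_a[word] - counts_b[word]) ** 2 for word in words)
```

A low score means the two sequences share most of their words, even if those words occur in a different order, which is how a word-based method can relate sequences that differ by re-arrangement.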
d2_cluster is a word multiplicity comparison method that utilizes an
agglomerative algorithm that has been specifically developed for rapidly and
accurately partitioning transcript databases into index classes by clustering
ESTs and full-length sequences according to minimal linkage or "transitive
closure" rules. An agglomerative clustering method is one in which every sequence
begins in its own cluster and the final clustering is constructed through a
series of mergers that may be described in terms of minimal linkage, sometimes
called single linkage or "transitive closure". The term transitive closure refers
to the property that any two sequences with a given level of similarity will be
in the same cluster. Hence A and B are placed in the same cluster even if they
share no similarity with each other, provided there exists a sequence C with
enough similarity to both A and B.
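Transitive closure can be demonstrated with a small union-find sketch in Python. Given a list of sequence names and the pairs judged similar, it merges clusters until no pair spans two clusters. This illustrates the linkage rule only and says nothing about how d2_cluster finds the similar pairs or organizes its merges; the function name is hypothetical.

```python
def single_linkage_clusters(names, similar_pairs):
    """Group names so that every similar pair ends up in the same cluster."""
    parent = {name: name for name in names}  # every sequence starts alone

    def find(name):
        # Follow parent links to the cluster representative, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for first, second in similar_pairs:
        root_a, root_b = find(first), find(second)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())
```

With names A, B, C and D and similar pairs (A, C) and (C, B), the result is one cluster containing A, B and C, plus D on its own, even though A and B were never directly matched.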
NOTE: d2_cluster ignores sequences with fewer than 50 valid bases. Only
A, T, C and G are valid bases. These sequences are not included in the
clustering step and are considered as singletons. Masked bases,
represented by "x", are not considered valid bases. Sequences masked to
fewer than 50 valid bases will not be processed by d2_cluster. This value
is defined by the d2_cluster minimum_sequence_size parameter and can be
overridden in the user's home directory in the file .stackpackrc. See
section 6, Expert Configuration, for more details.
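To predict which sequences will be left out of clustering, the valid bases can be counted before the run. The Python sketch below is a hypothetical helper; it assumes that lower-case a, t, c and g count as valid bases, which the manual does not state explicitly.

```python
VALID_BASES = frozenset("ACGT")


def count_valid_bases(sequence):
    """Count A, T, C and G; masked "x", "n" and other symbols are not counted."""
    return sum(1 for base in sequence.upper() if base in VALID_BASES)


def split_by_valid_length(sequences, minimum_sequence_size=50):
    """Return (clusterable, skipped) lists of names from a {name: sequence} dict."""
    clusterable, skipped = [], []
    for name, sequence in sequences.items():
        if count_valid_bases(sequence) >= minimum_sequence_size:
            clusterable.append(name)
        else:
            skipped.append(name)
    return clusterable, skipped
```

A sequence of 70 characters in which 30 have been masked to "x" has only 40 valid bases and would be treated as a singleton under the default setting.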
--------------------------------------------------------------------------------
Command: stack_Cluster
Info: runs d2_cluster on the input data file
Requires: enc_db. Distributed with stackPACK
d2_cluster. Distributed with stackPACK
Usage: stack_Cluster <project>
stack_Cluster <project> undo
stack_Cluster <project> --conf=<conffile>
Where:
Project = Brief one-word project name.
undo = Reversal of all steps subsequent to and including clustering.
conffile = configuration file
Examples:
stack_Cluster testolf
stack_Cluster testolf undo
stack_Cluster testolf --conf=Conf_Primates
--------------------------------------------------------------------------------
The first of the above three stack_Cluster examples returns the following from
the system:
Clustering sequence data
Project: testolf
Exporting: DONE
Clustering: 2600 sequences
Using: 2 cpus
Parameters: word_size=6
similarity_cutoff=0.96
minimum_sequence_size=50
window_size=100
reverse_comparison=1
...
Parsing cluster table and fixing accessions...DONE
Importing clustering results...DONE
stack_Cluster finished
Created 268 clusters.
886 sequences were members of a cluster.
Generating some statistics...
============================================================
CLUSTER STATISTICS
============================================================
There are 1714 singletons
There are 199 clusters with 2 sequences
There are 35 clusters with 3 sequences
There are 7 clusters with 4 sequences
There are 7 clusters with 5 sequences
There are 8 clusters with 6 sequences
There are 3 clusters with 7 sequences
There are 1 clusters with 8 sequences
There are 3 clusters with 9 sequences
There are 1 clusters with 12 sequences
There are 2 clusters with 13 sequences
There are 1 clusters with 53 sequences
There are 1 clusters with 125 sequences
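The figures in this example output are consistent with each other, and the same arithmetic is a useful sanity check on any run. The Python sketch below recomputes the totals from the size table; the function name is hypothetical and the numbers are those of the example above.

```python
# Cluster size -> number of clusters of that size, copied from the example.
EXAMPLE_SIZE_COUNTS = {
    2: 199, 3: 35, 4: 7, 5: 7, 6: 8, 7: 3,
    8: 1, 9: 3, 12: 1, 13: 2, 53: 1, 125: 1,
}
EXAMPLE_SINGLETONS = 1714


def summarize_clusters(size_counts, singletons):
    """Total clusters, sequences inside clusters, and sequences overall."""
    clusters = sum(size_counts.values())
    clustered = sum(size * count for size, count in size_counts.items())
    return {
        "clusters": clusters,
        "clustered_sequences": clustered,
        "total_sequences": clustered + singletons,
    }
```

For the example, this gives 268 clusters holding 886 sequences, and 886 plus 1714 singletons accounts for all 2600 imported sequences.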
NOTE: The clustering step is more efficient if long input data sequences, such
as mRNAs, are at the top of the imported input data file.
3.6.1 Cluster Parameters
Users may generate multiple data-specific configuration files by copying the
/etc/stackpack configuration file and editing the parameters. The name and
location of the data-specific configuration file should be specified in the
stack_Cluster command with the --conf= option.
d2_cluster parameters:
- word_size: Sequences are clustered together based on matching n-length words.
The default value is n=6. This value may not exceed 9.
- similarity_cutoff: Percentage of sequence similarity required for a match.
The default value is 96% similarity (96 bases over the window size, 100).
- minimum_sequence_size: Minimum sequence length, in bases, of sequences
processed by d2_cluster. The default value is 50.
- window_size: Number of base pairs which are compared at one time. The default
value is 100.
- reverse_comparison: Enables d2_cluster to compare both the forward and
reverse strands of sequences and recognize them as such. This flag is
switched on by default.
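The defaults and limits listed above can be collected in one place when preparing a data-specific configuration. The Python sketch below is a hypothetical helper that holds the five parameters, rejects a word size above 9, and works out how many bases the cutoff corresponds to within one window. The range checks other than the word size limit are assumptions of the example.

```python
from dataclasses import dataclass


@dataclass
class D2ClusterParameters:
    word_size: int = 6
    similarity_cutoff: float = 0.96
    minimum_sequence_size: int = 50
    window_size: int = 100
    reverse_comparison: int = 1

    def validate(self):
        """Raise ValueError for settings outside the documented or assumed range."""
        if not 1 <= self.word_size <= 9:
            raise ValueError("word_size may not exceed 9")
        if not 0 < self.similarity_cutoff <= 1:
            raise ValueError("similarity_cutoff must be a fraction up to 1")
        if self.window_size < self.word_size:
            raise ValueError("window_size must be at least word_size")

    def matching_bases_per_window(self):
        """Number of bases the cutoff represents within one window."""
        return round(self.similarity_cutoff * self.window_size)
```

With the defaults, matching_bases_per_window() returns 96, the figure quoted for similarity_cutoff above.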
3.7 Assembling Data
To take advantage of the benefits of looser clustering, it is necessary to
further align and analyze the clusters generated by d2_cluster. The related but
loose clusters are thus subsequently processed by PHRAP to identify,
characterize and isolate any sequence divergence.
PHRAP aligns and assembles the sequences grouped together by d2_cluster, and
improves alignment quality by removing particularly distinct sequences as
singletons. Any PHRAP singletons remain associated with the original cluster.
While masked data is used in the assembly step by default, users are given
the option to carry out stack_Assemble on the original unmasked
sequence data as is described below. In cases where sequences are imported
without quality files, the PHRAP alignment and PHRAP consensus sequence will
always be in lower case due to the default quality parameter.
stackPACK retains the PHRAP alignment and consensus sequence, even though it
is further processed and regenerated by the stack_Analysis step.
NOTE:
- The stack_Assemble output reports the number of clusters processed, the
number of contigs generated in this run, the number of clusters with
multiple contigs, the number of clusters that did not have a contig as
well as the total number of contigs in the database/project.
- The stack_Assemble output indicates the progress of stack_Assemble.
e.g. ...(9/268)..
Each assembled cluster is represented by a dot ("."), and "(9/268)"
indicates that 9 clusters out of a total of 268 clusters have been
processed.
--------------------------------------------------------------------------------
Command: stack_Assemble
Info: runs PHRAP on clusters generated by d2_cluster
Requires: PHRAP Third-party software.
ace2gde Distributed with stackPACK.
Usage: stack_Assemble <project>
stack_Assemble <project> --use-unmasked
stack_Assemble <project> undo
stack_Assemble <project> --conf=<conffile>
Where:
project = Brief one-word project name
use-unmasked = Option used if stack_Assemble is to be carried out on the
original unmasked input sequences
undo = Reversal of all steps subsequent to and including assembly
conffile = configuration file
Examples:
stack_Assemble testolf
stack_Assemble testolf --use-unmasked
stack_Assemble testolf undo
stack_Assemble testolf --conf=Conf_Primates
--------------------------------------------------------------------------------
The first of the above four stack_Assemble examples returns the following
from the system:
Assembling cluster data
Project: testolf
Num cpus: 16
Processing: 268 clusters
Parameters: old_ace=1
vector_bound=0
forcelevel=0
trim_score=20
penalty=-2
gap_init=-4
gap_ext=-3
ins_gap_ext=-3
del_gap_ext=-3
maxgap=30
flags=-retain_duplicates
........................
Cluster: 241 generated 2 sub-contigs
stack_Assemble finished
Processed 268 clusters
Total contigs generated in this run: 269
Total clusters that had multiple contigs: 1
Total clusters that did not have a contig: 0
Total contigs in database: 269
NOTE:
- The cluster generated by d2_cluster may be split into one or more
contigs by PHRAP. PHRAP may also split out particularly divergent
singletons that do not align well to any of the contigs. We refer to
these as PHRAP singletons.
While each contig is further processed in the stack_Analysis step,
PHRAP singletons do not receive further processing and so are not
seen in the Alignment Analysis or CRAW Alignment views and are
not used in the generation of the final consensus sequences.
- If the "--use-unmasked" parameter is specified, unmasked sequences
will be displayed in all alignment and consensus views.
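The summary counts at the end of the stack_Assemble output can be collected
by a script for record keeping. The following Python sketch is not part of
stackPACK; it assumes only the "Total ...: N" line layout shown in the sample
output above, and the function name is hypothetical. In the sample, 268
clusters with one cluster split into 2 contigs and no cluster lacking a
contig gives 267 + 2 = 269 contigs, which matches the reported total.

```python
import re

# Summary lines copied from the stack_Assemble sample output above.
SAMPLE_OUTPUT = """\
stack_Assemble finished
Processed 268 clusters
Total contigs generated in this run: 269
Total clusters that had multiple contigs: 1
Total clusters that did not have a contig: 0
Total contigs in database: 269
"""


def parse_assemble_summary(text):
    """Return a dict mapping each 'Total ...' label to its integer count."""
    summary = {}
    for line in text.splitlines():
        match = re.match(r"^\s*(Total [^:]+):\s*(\d+)\s*$", line)
        if match:
            summary[match.group(1)] = int(match.group(2))
    return summary


if __name__ == "__main__":
    for label, count in parse_assemble_summary(SAMPLE_OUTPUT).items():
        print(f"{label}: {count}")
```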
3.7.1 Assembly Parameters and Configuration
- Users may generate multiple data-specific configuration files by copying
the /etc/stackpack configuration file and editing the parameters. The name
and location of the data-specific configuration file should be specified in
the stack_Assemble command with the --conf= option.
- With the exception of the retain_duplicates flag, default PHRAP parameters
are used within stackPACK unless otherwise specified. Any PHRAP parameter
can be set as a flag in the .stackpackrc configuration file.
- Please ensure that there are no spaces between "flags=" and the first flag
in your configuration file.
- If the PHRAP retain_duplicates flag is not set, and the input data file
contains two or more sequences with 100% sequence identity, PHRAP processes
only the first sequence. These identical sequences are processed by d2_cluster
and may thus be included in cluster generation. They are however not present
in the subsequent alignment and analysis steps. The retain_duplicates flag
is included by default in the assembly step.
- While masked data is used in the assembly step by default, users are given
the option to carry out stack_Assemble on the original unmasked sequence data
by using the --use-unmasked parameter. If this parameter is specified,
unmasked sequences will be displayed in all alignment and consensus views.
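The "flags=" spacing rule above can be checked before a run. The following
Python sketch is not part of stackPACK. It assumes that parameters are written
one per line in name=value form, as in the stack_Assemble parameter listing;
the function name and the sample text are hypothetical.

```python
import re

# Hypothetical configuration fragment; line 2 breaks the spacing rule.
SAMPLE_CONFIG = "penalty=-2\nflags= -retain_duplicates\nflags=-retain_duplicates\n"


def find_bad_flags_lines(config_text):
    """Return (line_number, line) pairs where whitespace follows 'flags='."""
    bad = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        # Matches "flags= -x" but not "flags=-x".
        if re.match(r"^\s*flags=\s+\S", line):
            bad.append((number, line))
    return bad


if __name__ == "__main__":
    for number, line in find_bad_flags_lines(SAMPLE_CONFIG):
        print(f"line {number}: remove the space after 'flags=': {line!r}")
```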
Using different PHRAP versions:
- PHRAP v 0.990319, as well as the older version of PHRAP, v 0.960731, may
both be used in stack_Assemble.
- In order to use PHRAP version 0.960731, the PHRAP old_ace parameter needs to
be set to 0, and the executable needs to point at the correct version of
PHRAP in the configuration file.
- The retain_duplicates flag is not supported by PHRAP version 0.960731.
- PHRAP has limits of 64K bases for each sequence, and 64K sequences per
cluster. PHRAP can be compiled with a .longreads or .manyreads option that
allows assembly of sequences with more than 64,000 bp or clusters with more
than 64,000 sequences. These options can only be used with PHRAP v 0.990319.
It should thus be ensured that the old_ace parameter is set to 1 in the
configuration file. Please refer to section iv in the Installation
Instructions (Part I) of the Documentation For PHRAP and
CROSS_MATCH (Version 0.990319) for further details.
- Case provided in an input sequence will be retained in the older version of
PHRAP, PHRAP v 0.960731, and may be used as an indicator of sequence quality.
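Clusters can be screened against the PHRAP limits quoted above before
assembly. The following Python sketch is not part of stackPACK or PHRAP. It
uses the 64,000 figure given in the text for both limits; the function name
and the input layout (a dict of sequence ID to sequence string) are
assumptions made for illustration.

```python
MAX_BASES_PER_SEQUENCE = 64000     # per-sequence limit quoted in the text
MAX_SEQUENCES_PER_CLUSTER = 64000  # per-cluster limit quoted in the text


def phrap_limit_problems(sequences):
    """Return a list of readable problems for one cluster.

    `sequences` maps a sequence ID to its sequence string.
    """
    problems = []
    if len(sequences) > MAX_SEQUENCES_PER_CLUSTER:
        problems.append(
            f"cluster has {len(sequences)} sequences; "
            "a .manyreads PHRAP build may be needed"
        )
    for seq_id, seq in sequences.items():
        if len(seq) > MAX_BASES_PER_SEQUENCE:
            problems.append(
                f"{seq_id} has {len(seq)} bases; "
                "a .longreads PHRAP build may be needed"
            )
    return problems
```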
3.8 Analyzing Data
Aligned clusters, particularly those generated by a loose clustering engine,
need to be further processed for errors, such as those inherent in single-pass
sequences, and alignments analyzed for alternate forms of expressed sequences.
Although PHRAP aligns sequences, these alignments are lacking information about
variation within the cluster and do not help users distinguish alternative
splice or other scientifically interesting events from alignment problems
induced by low sequence quality or experimental artifacts.
CRAW (CRAWview: for viewing splicing variation, gene families, and polymorphism
in clusters of ESTs and full-length sequences; Chou A., Burke J.;
Bioinformatics 15 (5) 376-381) is thus employed to analyze alignments,
partition sub-assemblies and provide a simple means to view clusters. After
CRAW processing, stackPACK further analyzes clusters to refine consensus
sequences, maximize consensus sequence length, create final alignments and
to select the best consensus sequence.
CRAW works by verifying agreement along the columns of a multiple sequence
alignment, using the data to sort related sequences within each cluster and
to generate IUPAC-compliant consensus sequences for each subcluster. Using
default parameters, a subcluster is generated if 50% or more of a 100 base
window differs from the remaining sequences of a cluster, excluding the
initial 100 bases of any read. The approach depends fundamentally on the
alignment quality of each assembly. A poor alignment will yield erroneous
sub-clusters and too low a gap penalty may yield too many columns in agreement
and thus not create sub-clusters where they would be appropriate.
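The window criterion described above can be illustrated with a simplified
Python sketch. This is not CRAW's implementation: it compares one aligned
read against a single reference string of equal length, whereas CRAW works on
the columns of a multiple alignment. The function name is hypothetical, and
the defaults mirror the sig, window_size and ignore_first values printed in
the stack_Analysis sample output in this section.

```python
def has_divergent_window(read, reference, sig=0.5, window_size=100,
                         ignore_first=50):
    """Return True if any full window of the read disagrees with the
    reference in at least `sig` of its columns, skipping the first
    `ignore_first` columns."""
    if len(read) != len(reference):
        raise ValueError("aligned sequences must have equal length")
    # 1 where the two sequences disagree, 0 where they agree.
    mismatches = [int(a != b) for a, b in zip(read, reference)]
    start = ignore_first
    if len(mismatches) - start < window_size:
        return False  # too short to hold one full window
    threshold = sig * window_size
    count = sum(mismatches[start:start + window_size])
    if count >= threshold:
        return True
    # Slide the window one column at a time, updating the running count.
    for i in range(start + window_size, len(mismatches)):
        count += mismatches[i] - mismatches[i - window_size]
        if count >= threshold:
            return True
    return False
```

A read that differs in 60 consecutive columns beyond the ignored region is
flagged under these defaults, while one that differs only inside the ignored
region is not.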
NOTE:
- The stack_Analysis output reports the number of contigs processed, the
number of consensus sequences generated in the run and the total number
of consensus sequences in the database/project.
- The stack_Analysis output indicates the progress of stack_Analysis.
e.g. ...(9/269)..
Each analyzed contig is represented by a dot ("."), and "(9/269)"
indicates that 9 contigs out of a total of 269 contigs have been
processed.
--------------------------------------------------------------------------------
Command: stack_Analysis
Info: runs CRAW on aligned sequence data; further analyzes CRAW
subassemblies
Requires: CRAW. Distributed with stackPACK.
Usage: stack_Analysis <project>
stack_Analysis <project> undo
stack_Analysis <project> --conf=<conffile>
Where:
project = Brief one-word project name.
undo = Reversal of all steps subsequent to and including analysis.
conffile = configuration file
Examples:
stack_Analysis testolf
stack_Analysis testolf undo
stack_Analysis testolf --conf=Conf_Primates
--------------------------------------------------------------------------------
The first of the above three stack_Analysis examples returns the following from
the system:
Analyzing contig data
Project: testolf
Processing: 282 contigs
Parameters: sig=0.5
window_size=100
ignore_first=50
reassigning lone singleton 1871 from 0 to 1
reassigning lone singleton 2416 from 0 to 1
stack_Analysis finished.
Total contigs processed: 269
Total consensi generated in this run: 271
Total consensi in database: 271
Note: CRAW has a limit of 20,000 sequences per cluster and a maximum sequence
length of 100,000 base pairs. A consensus sequence will not be generated
for clusters if these limits are exceeded.
3.8.1 Analysis Configuration
Users may generate multiple data-specific configuration files by copying the
/etc/stackpack configuration file and editing the parameters. The name and
location of the data-specific configuration file should be specified in the
stack_Analysis command with the --conf= option.
FAQ: What does "reassigning lone singleton" refer to in the output?
ANSWER: These refer to the d2_cluster singletons. stack_Analysis re-assigns
singletons to a larger subalignment with which they have the most
sequence similarity.
3.9 Linking Data
All ESTs generated from the same cDNA clone correspond to a single gene. Each
EST is searched for clone identification so that non-overlapping clusters
corresponding to the same gene can be identified and linked. Only a proportion
of ESTs in GenBank currently have documented clone information. This information
is utilized to extend the length of the cluster consensus sequences by joining
clusters that contain ESTs that share clone IDs. Thus only if the input
sequences contain clone information will the program create linked clusters.
Given that the clone ID information is solely annotation-based and may have
namespace overlaps depending on the data source(s), this step has been placed
near the end of the processing pipeline. Furthermore, unless a specific
5'-3' pair can be identified as a seed for each linked consensus, the procedure
is transitive in nature and may lead to extensive clone-linked networks whose
biological significance remains to be explored. To avoid spurious linking,
the program, by default, requires that at least two independent clone ID
matches must be made before two clusters will link. The number of required
links is a parameter that can be changed in stackPACK.
To form a final consensus sequence, the non-redundant best cluster consensus
sequences are joined by linker segments of 20 Xs. This choice was made based
on the word size employed by BLAST, so that alignment breaks would be
preferentially inserted at these linker regions.
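The linking rule and the linker segment described above can be sketched in
Python. This is an illustration, not the stack_Link implementation. It
assumes that "two independent clone ID matches" means two shared clone IDs
between a pair of clusters, and that the caller supplies the consensus
sequences in the desired order; all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

LINKER = "X" * 20  # linker segment of 20 Xs described in the text


def link_clusters(cluster_clones, min_shared=2):
    """Group clusters that share at least `min_shared` clone IDs.

    `cluster_clones` maps a cluster ID to the set of clone IDs found in it.
    Linking is transitive: if A links to B and B links to C, all three
    clusters end up in one group.
    """
    parent = {c: c for c in cluster_clones}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b in combinations(sorted(cluster_clones), 2):
        if len(cluster_clones[a] & cluster_clones[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for c in sorted(cluster_clones):
        groups[find(c)].append(c)
    # Report only groups that actually link two or more clusters.
    return sorted(g for g in groups.values() if len(g) > 1)


def join_consensi(consensus_sequences):
    """Join consensus sequences with the 20-X linker segment."""
    return LINKER.join(consensus_sequences)
```

The transitive grouping also shows why the text warns about extensive
clone-linked networks: each additional pairwise link can merge two existing
groups into one.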
--------------------------------------------------------------------------------
Command: stack_Link
Info: creation of linked clusters
Usage: stack_Link <project>
stack_Link <project> undo
stack_Link <project> --conf=<conffile>
Where:
project = Brief one-word project name.
undo = Reversal of the linking step.
conffile = configuration file
Examples:
stack_Link testolf
stack_Link testolf undo
stack_Link testolf --conf=Conf_Primates
--------------------------------------------------------------------------------
The first of the above three stack_Link examples returns the following
from the system:
Linking cluster data
Project: testolf
Redundancy: 2
Pass 1 - identifying internal links and potential external
links .................
Pass 2 - identifying external cluster links
stack_Link finished.
Created 23 clonelinks, consisting of: 46 clusters.
NOTE: Users may generate multiple data-specific configuration files by copying
the /etc/stackpack configuration file and editing the parameters. The
name and location of the data-specific configuration file should be
specified in the stack_Link command with the --conf= option.
FAQ: How can I ensure that clonelinking will be performed on my sequence data?
ANSWER: - Clonelinking will only be performed if the clone ID information is
parsed when data is imported into stackPACK. This depends on whether
the clone ID information is in a recognisable format within the
sequence header line.
- Clone information in NCBI format is part of the free form text in the
definition line, and is thus not parsed when using NCBI format.
- Clone ID information is never provided in PHRED format and therefore
no linking can be performed on this type of data.
- Clone ID information will only sometimes be parsed when using GUESS
FASTA format.
- Clone ID information should always be parsed when importing with
simple FASTA, STACK FASTA and GenBank formats, provided that the
formats are used as per definition.
3.10 Incremental Addition of Data
New sequence data may be added incrementally to existing clusters. The cluster
history of new, outdated and changed clusters is maintained and reflected
in WebProbe. New sequences are simply imported into the existing project of
choice and processed through the stackPACK pipeline as described in sections
3.4 - 3.9 of the Command Line manual. The output of each stackPACK command
reflects the fact that incremental addition is occurring. Some examples are given
below.
e.g. stack_ImportFasta olftest /foo/olf2.fasta GUESS
The following output is returned from the system:
Importing Fasta data.
Project: olftest
Filename: /foo/olf2.fasta Format: GUESS
Importing: 229 records
......................
stack_ImportFasta completed.
229 imported.
229 sequences processed.
Total sequences in project: 493
NOTE: The "Total sequences in project" in the stack_Import output refers to the
newly imported sequences added to the total number of sequences in the
existing project.
e.g. stack_Mask olftest
The following output is returned from the system:
Masking sequence data.
Project: Liver
Program: cross_match
Mask file: /usr/local/stackpack/supporting/repeat.seq
Num cpus: 2
Batch size: 115 sequences
Processing: 229 sequences, continuing from sequence: 265
Flags: /usr/local/stackpack/supporting/repeat.seq -minmatch 12
-minscore 20 -screen
..
stack_Mask finished.
Processed 229 sequences.
NOTE: "continuing from sequence: 265" in the stack_Mask output indicates that
only the newly added 229 sequences are masked.
e.g. stack_Cluster Liver
The following output is returned from the system:
Clustering sequence data
Project: Liver9Jan
Existing clusters detected, performing an add.
Exporting: DONE
Clustering: 493 sequences
Using: 2 cpus
Parameters: word_size=6
similarity_cutoff=0.96
minimum_sequence_size=50
window_size=100
reverse_comparison=1
...
Parsing cluster table and fixing accessions...DONE
Importing clustering results...DONE
stack_Cluster finished.
Added 44 new clusters.
Deprecated 22 clusters.
Joined 3 clusters into 1 new cluster.
Left 25 clusters unchanged.
Generating some statistics...
========================================================
CLUSTER STATISTICS
========================================================
There are 127 singletons
There are 47 clusters with 2 sequences
There are 14 clusters with 3 sequences
There are 12 clusters with 4 sequences
There are 4 clusters with 5 sequences
There are 3 clusters with 6 sequences
There are 3 clusters with 8 sequences
There are 1 clusters with 10 sequences
There are 1 clusters with 13 sequences
There are 1 clusters with 15 sequences
There are 1 clusters with 18 sequences
There are 1 clusters with 24 sequences
There are 1 clusters with 40 sequences
The project may be processed through the rest of the stackPACK pipeline as
described in sections 3.7 (stack_Assemble), 3.8 (stack_Analysis) and 3.9
(stack_Link) of the command line manual. Processing will only be performed on
those clusters or contigs that have been altered due to the incremental
addition of the new sequence data.
NOTE: PHRED quality scores may not be added to converted projects that have
been created with stackPACK v2.1.
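The CLUSTER STATISTICS block in the stack_Cluster output above can be checked
against the number of sequences clustered. The following Python sketch is not
part of stackPACK; it assumes only the line layout shown in the sample
output, and the function name is hypothetical. For the sample, the cluster
sizes sum to 493 sequences, which agrees with "Clustering: 493 sequences" and
with "Total sequences in project: 493" in the import output.

```python
import re

# Statistics lines copied from the stack_Cluster sample output above.
SAMPLE_STATS = """\
There are 127 singletons
There are 47 clusters with 2 sequences
There are 14 clusters with 3 sequences
There are 12 clusters with 4 sequences
There are 4 clusters with 5 sequences
There are 3 clusters with 6 sequences
There are 3 clusters with 8 sequences
There are 1 clusters with 10 sequences
There are 1 clusters with 13 sequences
There are 1 clusters with 15 sequences
There are 1 clusters with 18 sequences
There are 1 clusters with 24 sequences
There are 1 clusters with 40 sequences
"""


def total_sequences(stats_text):
    """Sum the sequences accounted for by a CLUSTER STATISTICS block."""
    total = 0
    for line in stats_text.splitlines():
        single = re.match(r"^\s*There are (\d+) singletons\s*$", line)
        multi = re.match(
            r"^\s*There are (\d+) clusters with (\d+) sequences\s*$", line
        )
        if single:
            total += int(single.group(1))
        elif multi:
            # Number of clusters multiplied by sequences per cluster.
            total += int(multi.group(1)) * int(multi.group(2))
    return total


if __name__ == "__main__":
    print(total_sequences(SAMPLE_STATS))
```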
3.11 Undoing steps in the pipeline
Users have the capability to "undo" certain steps in the stackPACK
pipeline. This undo option applies to the clustering, assembly, analysis
and linking steps. You may NOT undo import or masking steps. Undo is
executed as a flag or parameter to the other pipeline steps.
The undo option will reverse all the steps subsequent to and including
the step being undone in the stackPACK pipeline. For example, undoing
stack_Assemble will undo the stack_Assemble, stack_Analysis and stack_Link
steps in the project specified.
NOTE: If incremental addition has been carried out on an existing project,
the undo option will reverse the step for all data contained in the project,
not just for the latest additions.
For example, you may wish to use the undo option in order to change the
parameters or the number of CPUs used for a particular step in a project.
----------------------------------------------------------------------------
Undo option
Info: reverses all steps subsequent to and including
the step being undone in the stackPACK pipeline.
Usage: <stackPACK_program> <project> undo
Where:
stackPACK_program = stack_Cluster, stack_Assemble, stack_Analysis
or stack_Link
project = project from which you want to undo pipeline steps.
Examples:
stack_Cluster <project> undo
stack_Assemble <project> undo
---------------------------------------------------------------------------
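The cascading behaviour of undo can be summarised in a short Python sketch.
It is not part of stackPACK; it simply encodes the pipeline order of the four
steps that support undo, and the function name is hypothetical.

```python
# Pipeline steps that support undo, in processing order.
PIPELINE = ["stack_Cluster", "stack_Assemble", "stack_Analysis", "stack_Link"]


def steps_reversed_by_undo(step):
    """Return the steps reversed when `step` is undone: the step itself
    and every later step in the pipeline."""
    if step not in PIPELINE:
        raise ValueError(f"{step} does not support undo")
    return PIPELINE[PIPELINE.index(step):]


if __name__ == "__main__":
    print(steps_reversed_by_undo("stack_Assemble"))
```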
3.12 Restarting steps in the pipeline
Some of the steps in the stackPACK pipeline may be interrupted and restarted
with little risk. The steps that may be restarted include: stack_Import,
stack_Mask, stack_Assemble and stack_Analysis.
To restart a process that has been interrupted, simply reissue the command
for that process. The program will not reprocess all the data, but will
begin where it left off.
NOTE: If the program was interrupted while writing results back to the
database, there is a slight risk that the cluster or sequence being
written at that time will not be complete and will not be re-run when
you restart the program. In most cases, restarting will not cause any
problems. However, if you would like to be 100% certain that no data is
missed, you may undo the process and reprocess all data for that step.
3.13 Handling projects with the same name
Projects may have the same name, provided that they are owned by different users.
In such cases, stackPACK will process the project owned by the user that is
currently logged in, unless otherwise specified.
If for example, there are two projects in the database called "testolf" owned
by two different users, the following output is returned from the system upon
stack_Import:
Command: stack_ImportFasta testolf /foo/olfactoryseq.fasta
Output: There are multiple projects called 'testolf'
Id: 420 Project: testolf Owner: liza
Id: 421 Project: testolf Owner: gary
I am going to assume that you want to run the following project:
Id: 420 Project: testolf Owner: liza
If you want to use a different project, please try again using the
--user=<username> option
e.g. stack_Assemble --user=<username> <project>
The following command should thus be used to specify the project called
"testolf" owned by gary:
Command: stack_ImportFasta testolf --user=gary /foo/olfactoryseq.fasta
The --user=<username> option can be specified for all the stack commands in the
processing pipeline.
3.14 Converting stackPACK v2.1 projects to v2.2
Projects that have been created with stackPACK v2.1 should be converted if users
wish to have access to these projects on stackPACK v2.2. Further data processing
as well as data output and analysis may be carried out on converted projects
within stackPACK v2.2. Project conversion is performed with stack_ProjectManager.
-------------------------------------------------------------------------------
Command: stack_ProjectManager -convert
Info: Converts a v2.1 project to v2.2
Usage: stack_ProjectManager -convert <project> <old_dsn> <old_dsn_login> <old_dsn_password> [1|0]
Where:
project = project to be converted
old_dsn = old data source name
old_dsn_login = old data source name login
old_dsn_password = old data source name password
0|1 = This argument specifies whether sequences
in the stackPACK 2.1 project have been clustered
or not. If this argument is set to 1, sequences
will be assumed to be clustered. If this argument
is set to 0 (or if this argument is not specified)
sequences will be assumed to be unclustered.
Example: stack_ProjectManager -convert testolf stacksys stackpack stackpack 1
-------------------------------------------------------------------------------
The above example returns the following output from the system:
Converting project: testolf
2.1 System Database: stacksys
2.1 System Login: stackpack
sequences.........................
clusters........................
contigs.............
consensi................
clonelinks.......................
annotation............
Conversion completed successfully, you can now access your project as: 'testolf_converted'
NOTE:
- The old_dsn will always be "stacksys". This DSN entry can be found in the
/etc/odbc.ini file
- For stackPACK v2.1.1 the old_dsn_login and old_dsn_password will always be
stackpack and stackpack. StackPACK v2.1, however, may have been set up with a
different dsn_login and dsn_password. Please refer to the [DATABASE] section
of your stackPACK v2.1 /etc/stackpack configuration file to determine what
these are.
- ACE format output is not supported from converted projects.
- PHRED format input files may not be added to converted projects.
4. Web-based Viewing and Output of Data
The stackPACK results are stored in a relational database and are viewed
and exported by using the web interface components WebProbe(tm) and
WebReport(tm).
WebProbe provides viewing tools that link consensus sequences, alignments,
expression analysis and external data sources like UniGene.
WebReport provides access to a list of predefined reports which can be
selected and downloaded for further data evaluation or to create searchable
databases of your clustering results.
The web-based interface is typically invoked by opening the following location
in your browser:
http://www.yourhostname.com/stackpack/
The hostname can be confirmed by viewing the [WEB] entry in the file:
/etc/stackpack. For example, if the hostname is "myhost.egenetics.com", you
invoke stackPACK by opening the following URL:
http://myhost.egenetics.com/stackpack/
and the [WEB] entry will look like this:
[WEB]
WEBSERVER = hostname
MAILSERVER = myhost.egenetics.com
HTTP_URL = /stackpack
CGI_URL = /cgi-bin/stackpack
HTTP_PATH = /home/httpd/html/stackpack
CGI_PATH = /home/httpd/cgi-bin/stackpack
FULLVER = yes
PAGESIZE = 25
DEBUG = no
The [WEB] entry contains configuration information such as the location of
files. The number of projects listed per page in the WebProjectManager is set
by the PAGESIZE parameter.
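The [WEB] entry can also be read by a script, for example to confirm the URL
and page size. The following Python sketch is not part of stackPACK. It
assumes that the entry can be parsed as an INI-style section by the standard
configparser module and that WEBSERVER holds the host name; the sample values
and the function names are hypothetical.

```python
import configparser

# Hypothetical [WEB] entry using the parameter names shown above.
SAMPLE_WEB_ENTRY = """\
[WEB]
WEBSERVER = myhost.egenetics.com
HTTP_URL = /stackpack
PAGESIZE = 25
"""


def read_web_entry(config_text):
    """Parse configuration text and return its [WEB] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    return parser["WEB"]


def web_url(config_text):
    """Build the browser URL from WEBSERVER and HTTP_URL."""
    web = read_web_entry(config_text)
    return f"http://{web['WEBSERVER']}{web['HTTP_URL']}/"


def page_size(config_text):
    """Return the PAGESIZE parameter as an integer."""
    return read_web_entry(config_text).getint("PAGESIZE")
```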
5. Exporting Data from the Command Line
A series of scripts are provided that allow stackPACK users to export their
data in a number of predefined reports. These reports correspond to the
reports found in the WebReport section of the web-based interface.
The export scripts are found in the "bin" subdirectory of the stackPACK
installation directory (typically /usr/local/stackpack/bin) and are named in
such a way that users can easily deduce their function.
-------------------------------------------------------------------------------
5.1 Selecting the appropriate command line report:
stack_ReportAllSequences.py - All masked or original unmasked input
sequences, in FASTA format.
stack_ReportConsensus.py - All clonelinked consensus sequences,
multi-sequence cluster primary consensus
sequences and/or multi-sequence cluster
alternate consensus sequences, in
FASTA format.
stack_ReportAllSingleton.py - All d2_cluster and/or PHRAP singleton
sequences, in FASTA format.
stack_ReportClusterMemberEst.py - List of all constituent EST or mRNA
sequences per cluster accession or
for the whole project, in FASTA
or CSV format.
stack_ReportAlignment.py - Initial PHRAP or Final CRAW sequence
alignments, per accession or for the
whole project, in MSF, ClustalW or ACE
formats.
stack_ReportClusterAlignmentAnalysis.py - The Alignment Analysis CRAW logs, per
accession or for the whole project.
stack_ReportNonRedundant.py - Non-redundant output of the entire
project, in FASTA or CSV format.
Please refer to WebReport in stackPACK Support for a detailed description
of each report.
-------------------------------------------------------------------------------
5.2 Usage of the command line reports:
All optional fields are placed in square brackets "[]", and all compulsory fields
are placed in "<>". A pipe "|" indicates where one or the other option
should be selected.
5.2.1 stack_ReportAllSequences.py
-------------------------------------------------------------------------------
Command: stack_ReportAllSequences.py
Info: Outputs all masked or original unmasked input sequences, in FASTA format.
Usage: stack_ReportAllSequences.py --Owner=<ProjectOwner> --Project=<ProjectName> <--Masked|--Original> [--Output=<OutputFilename>]
Where:
Owner = Owner of the specified project.
Project = Name of project from which data is to be extracted.
Masked = Option to be selected if the input sequences are required in masked format.
Original = Option to be selected if the input sequences are required in original format.
Output = File into which data is exported.
Examples: stack_ReportAllSequences.py --Owner=liza --Project=testolf --Masked --Output=OlfMasked.fasta
stack_ReportAllSequences.py --Owner=liza --Project=testolf --Original --Output=OlfUnmasked.fasta
-------------------------------------------------------------------------------
5.2.2 stack_ReportConsensus.py
-------------------------------------------------------------------------------
Command: stack_ReportConsensus.py
Info: Outputs all clonelinked consensus sequences, multi-sequence
cluster primary consensus sequences and/or multi-sequence
cluster alternate consensus sequences, in FASTA format.
Usage: stack_ReportConsensus.py --Owner=<ProjectOwner> --Project=<ProjectName> <ConsensusOptions> [--Output=<OutputFilename>]
Consensus Options:
--Clonelink
--Primary
--Alternate
Note: One or more options may be selected simultaneously
Where:
Owner = Owner of the specified project.
Project = Name of project from which data is to be extracted.
ConsensusOptions = One or more of the three consensus options (explained in the
usage) to be selected.
Output = File into which data is exported.
Examples: stack_ReportConsensus.py --Owner=liza --Project=testolf --Clonelink --Alternate --Output=Olf_Clonelink_Alternate.fasta
stack_ReportConsensus.py --Owner=liza --Project=testolf --Clonelink --Output=Olf_Clonelink.fasta
-------------------------------------------------------------------------------
5.2.3 stack_ReportAllSingleton.py
-------------------------------------------------------------------------------
Command: stack_ReportAllSingleton.py
Info: Outputs all d2_cluster and/or PHRAP singleton sequences,
in FASTA format.
Usage: stack_ReportAllSingleton.py --Owner=<ProjectOwner> --Project=<ProjectName> <SingletonOptions> [--Output=<OutputFilename>]
Singleton Options:
--Singletons Sequences not included in
multi-sequence clusters.
--PHRAP Clustered sequences excluded from the
PHRAP alignment.
--MinimumSeqLength[=#] If no value is specified, the
d2_cluster minimum_sequence_size
parameter value is used by default.
Note: More than one singleton option may be used simultaneously
Where:
Owner = Owner of the specified project.
Project = Name of project from which data is to be extracted.
SingletonOptions = One or both of the singleton options (explained in the usage)
to be selected.
Output = File into which data is exported.
Examples: stack_ReportAllSingleton.py --Owner=liza --Project=testolf --Singletons --PHRAP --Output=Olf_d2_PHRAP.fasta
stack_ReportAllSingleton.py --Owner=liza --Project=OlfData --Singletons --Output=Olf_d2.fasta
-------------------------------------------------------------------------------
5.2.4 stack_ReportClusterMemberEst.py
-------------------------------------------------------------------------------
Command: stack_ReportClusterMemberEst.py
Info: Lists all constituent EST or mRNA sequences of all clusters, in
FASTA or CSV format.
Usage: stack_ReportClusterMemberEst.py --Owner=<ProjectOwner> --Project=<ProjectName> --Format=<FASTA|CSV> [--Accession=<cl#>] [--Output=<OutputFilename>]
Note:
If the accession number is omitted, the cluster members for all
clusters within the project will be outputted.
Where:
Owner = Owner of the specified project.
Project = Name of project from which data is to be extracted.
Format = Format of output file.
Accession = Cluster accession number of the data to be outputted.
Output = File into which data is exported.
Examples: stack_ReportClusterMemberEst.py --Owner=liza --Project=testolf --Format=FASTA --Accession=cl8 --Output=Olfcl8.fasta
stack_ReportClusterMemberEst.py --Owner=liza --Project=testolf --Format=FASTA --Output=Olf_cl_project.fasta
-------------------------------------------------------------------------------
5.2.5 stack_ReportAlignment.py
-------------------------------------------------------------------------------
Command: stack_ReportAlignment.py
Info: Outputs initial PHRAP or Final CRAW sequence alignments, per
accession or for the whole project, in MSF, ClustalW or ACE formats.
Usage: stack_ReportAlignment.py --Owner=<ProjectOwner> --Project=<ProjectName> --Format=<ACE|MSF|CLUSTALW> [--Accession=<cl#|ct#|cn#>] [--Alignment=<Final|PHRAP>] [--Output=<filename>]
Note:
- The accession number must be specified if the alignment for a
particular cluster, contig or consensus is required. If omitted,
all specified alignments for the Project will be outputted.
- Cluster accession numbers (cl#) must be used when specifying ACE
format. ACE format is valid only for the PHRAP alignment.
- Contig accession numbers (ct#) must be used when the PHRAP
alignment is required.
- Consensus accession numbers (cn#) must be used when the final
alignment is required.
- Alignment type is only required if no accession is specified.
Where:
Owner = Owner of the specified project.
Project = Name of project from which data is to be extracted.
Format = Format of output file.
Accession = Cluster, contig or consensus accession number of the alignment to be
outputted.
Alignment = Alignment type to be outputted.
Output = File into which data is exported.
Examples: stack_ReportAlignment.py --Owner=liza --Project=testolf --Format=ACE --Accession=cl8 --Output=Olf_cl8.ace
stack_ReportAlignment.py --Owner=liza --Project=testolf --Format=MSF --Accession=cn8 --Alignment=Final --Output=Olf_cn8.msf
NOTE: If the ACE format option is selected, sequence information will be output
in the old ACE format.
-------------------------------------------------------------------------------
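The accession and format rules listed in the note for stack_ReportAlignment.py
can be expressed as a pre-flight check. The following Python sketch is not
part of stackPACK and is one possible reading of those rules: ACE output
takes a cl# accession, other formats take ct# for the PHRAP alignment and cn#
for the final alignment. The function name is hypothetical.

```python
import re


def check_alignment_request(fmt, accession=None, alignment=None):
    """Return a list of rule violations; an empty list means none found."""
    errors = []
    if fmt not in ("ACE", "MSF", "CLUSTALW"):
        errors.append("format must be ACE, MSF or CLUSTALW")
    if alignment not in (None, "Final", "PHRAP"):
        errors.append("alignment must be Final or PHRAP")
    if fmt == "ACE" and alignment == "Final":
        errors.append("ACE format is valid only for the PHRAP alignment")
    if accession is None:
        # Without an accession, the alignment type must be stated.
        if alignment is None:
            errors.append("alignment type is required when no accession is given")
        return errors
    if not re.fullmatch(r"(cl|ct|cn)\d+", accession):
        errors.append("accession must look like cl#, ct# or cn#")
        return errors
    prefix = accession[:2]
    if fmt == "ACE":
        if prefix != "cl":
            errors.append("ACE format requires a cluster accession (cl#)")
    elif alignment == "PHRAP" and prefix != "ct":
        errors.append("the PHRAP alignment requires a contig accession (ct#)")
    if alignment == "Final" and prefix != "cn":
        errors.append("the final alignment requires a consensus accession (cn#)")
    return errors
```

Both examples given for stack_ReportAlignment.py pass this check: ACE format
with accession cl8, and MSF format with accession cn8 and the Final alignment.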
5.2.6 stack_ReportClusterAlignmentAnalysis.py
-------------------------------------------------------------------------------
Command: stack_ReportClusterAlignmentAnalysis.py
Info: Outputs the Alignment Analysis CRAW logs, per accession or for the
whole project.
Usage: stack_ReportClusterAlignmentAnalysis.py --Owner=<ProjectOwner> --Project=<ProjectName> [--Accession=<cl#|ct#>] [--Output=<OutputFilename>]
Note:
If the accession number is omitted, the alignment analyses for the
whole project will be outputted.
Where:
Owner = Owner of the specified project.
Project = Name of project from which data is to be extracted.
Accession = Cluster or contig accession number of the alignment analysis to be
outputted.
Output = File into which data is exported.
Examples: stack_ReportClusterAlignmentAnalysis.py --Owner=liza --Project=testolf --Accession=cl8 --Output=Olf_cl8.craw
stack_ReportClusterAlignmentAnalysis.py --Owner=liza --Project=testolf --Output=Olf_project.craw
-------------------------------------------------------------------------------
5.2.7 stack_ReportNonRedundant.py
-------------------------------------------------------------------------------
Command: stack_ReportNonRedundant.py
Info: Non-redundant output of the entire project, in FASTA or CSV
format.
Usage: stack_ReportNonRedundant.py --Owner=<ProjectOwner> --Project=<ProjectName> --Format=<FASTA|CSV> [SequenceOptions] [--Output=<OutputFilename>]
Sequence Options:
--Clonelink Consensus sequences for multi-sequence
clusters joined by virtue of clone ID
--Primary Primary consensus sequences for all
multi-sequence clusters NOT present in
clonelinked clusters
--Singletons Singleton sequences NOT present in
clonelinked clusters.
Note:
- When the CSV format is specified, all three sequence options
are selected by default.
- When the FASTA format is specified, a sequence option must
be specified. More than one sequence option may be specified
simultaneously.
Where:
Owner = Owner of the specified project.
Project = Name of project from which data is to be extracted.
Format = Format of output file.
SequenceOptions = One or more of the three sequence options (explained in the
usage) to be selected.
Output = File into which data is exported.
Examples: stack_ReportNonRedundant.py --Owner=liza --Project=testolf --Format=FASTA --Primary --Singletons --Output=Olf_Primary_Singletons.fasta
stack_ReportNonRedundant.py --Owner=liza --Project=testolf --Format=CSV --Output=Olf.csv
NOTE:
- For a comprehensive non-redundant report all three sequence options should be
included.
- Alternate consensus sequences are not included in the non-redundant output
report.
--------------------------------------------------------------------------------
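When the report scripts are driven from another program, the command line for
stack_ReportNonRedundant.py can be assembled and validated first. The
following Python sketch is not part of stackPACK. It only builds an argument
list from the options documented in section 5.2.7 and enforces the rule that
FASTA output needs at least one sequence option; the function name is
hypothetical.

```python
SEQUENCE_OPTIONS = ("Clonelink", "Primary", "Singletons")


def nonredundant_command(owner, project, fmt, options=(), output=None):
    """Return the argument list for a stack_ReportNonRedundant.py call."""
    if fmt not in ("FASTA", "CSV"):
        raise ValueError("format must be FASTA or CSV")
    unknown = [o for o in options if o not in SEQUENCE_OPTIONS]
    if unknown:
        raise ValueError(f"unknown sequence option(s): {unknown}")
    if fmt == "FASTA" and not options:
        # CSV selects all three options by default; FASTA does not.
        raise ValueError("FASTA format needs at least one sequence option")
    argv = [
        "stack_ReportNonRedundant.py",
        f"--Owner={owner}",
        f"--Project={project}",
        f"--Format={fmt}",
    ]
    argv += [f"--{o}" for o in options]
    if output:
        argv.append(f"--Output={output}")
    return argv
```

The returned list can be passed to subprocess.run() on a system where
stackPACK is installed.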
6. Expert Configuration
The stackPACK software has a system-wide configuration file in the
following location: /etc/stackpack
Users wishing to configure stackPACK differently for their own use may do
so through creation of an individual configuration file placed in their
home directory named ".stackpackrc". Key parameters that can be adjusted by
the user using .stackpackrc include the repeat masking file, masking program,
the number of processors for the masking, clustering and assembly steps and,
for expert users, parameters for each of the programs called externally by
stackPACK. Even though the most commonly used parameters for each of these
programs are listed in the configuration file, any parameter can be set as a
flag.
NOTE: Please ensure that there are no spaces between "flags=" and the first
flag in your configuration file.
stackPACK first reads /etc/stackpack for parameters. It then reads
~/.stackpackrc in the user's home directory, and any setting declared
there overrides the corresponding setting in /etc/stackpack. Thus, the
user can override any parameter in /etc/stackpack through ~/.stackpackrc.
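The override order can be illustrated with a minimal Python sketch. It is not the stackPACK loader: it assumes an INI-style layout (suggested only by the [WEB] section mentioned in this manual), and the [MASK] section, its key names and all values shown are hypothetical stand-ins for the contents of /etc/stackpack and ~/.stackpackrc.

```python
import configparser

# Hypothetical stand-in for /etc/stackpack. Only [WEB] and PAGESIZE are
# named in this manual; everything else here is illustrative.
SYSTEM_CONFIG = """
[MASK]
processors = 1
batchsize = 100

[WEB]
PAGESIZE = 20
"""

# Hypothetical stand-in for ~/.stackpackrc: it changes one setting only.
USER_CONFIG = """
[MASK]
processors = 4
"""


def load_layered(system_text, user_text):
    """Read system-wide settings first, then let user settings override them."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key case, e.g. PAGESIZE
    parser.read_string(system_text)
    parser.read_string(user_text)  # later reads replace earlier values
    return parser


settings = load_layered(SYSTEM_CONFIG, USER_CONFIG)
print(settings["MASK"]["processors"])  # value from the user file
print(settings["MASK"]["batchsize"])   # value kept from the system file
print(settings["WEB"]["PAGESIZE"])     # value kept from the system file
```

Settings that the user file does not mention keep their system-wide values, which is why a .stackpackrc only needs to contain the parameters that differ.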
The easiest way to create the .stackpackrc file is to copy /etc/stackpack
to the user's home directory as .stackpackrc and further edit it.
Example:
cp /etc/stackpack ~/.stackpackrc
vi ~/.stackpackrc
NOTE: As a safety precaution, the "System Configuration" section copied from
/etc/stackpack should be removed from the .stackpackrc file. Users should
only edit parameters in the "User editable parameters" section.
Useful Configuration and Parameter Information:
- CrossMatch or RepeatMasker can be specified for masking purposes in the
.stackpackrc file.
- RepeatMasker can use its own specially formatted repeat database by setting
mask_file=none
- The batchsize of CrossMatch may be increased from its default value of 100
when masking large datasets. Increasing the batch size can increase the speed
of masking, but also increases the memory requirements. Therefore, increases
in batch size should be considered carefully and in conjunction with the
total memory available.
- Using different PHRAP versions:
- PHRAP v 0.990319, as well as the older version of PHRAP, v 0.960731, may
both be used in stack_Assemble. PHRAP v 0.990319 is used by default.
- In order to use PHRAP version 0.960731, the PHRAP old_ace parameter needs
to be set to 0, and the executable needs to point at the correct version of
PHRAP in the configuration file.
- The retain_duplicates flag is not supported by PHRAP version 0.960731.
- PHRAP has limits of 64K bases for each sequence, and 64K sequences per
cluster. PHRAP can be compiled with a .longreads or .manyreads option that
allows assembly of sequences with more than 64,000 bp or clusters with more
than 64,000 sequences. Please refer to section iv in the Installation
Instructions (Part I) of the Documentation For PHRAP and
CROSS_MATCH (Version 0.990319) for further details.
- Window size of d2_cluster: If your input sequences are all long, e.g., mRNA
sequences, increasing the window size of the d2_cluster algorithm will speed
the clustering process.
- Batchsize and number of processors: The batch size parameter of the masking
step automatically adjusts itself according to the number of input sequences
in the dataset and the number of processors specified. Therefore, the log
file may not reflect the batch size in the .stackpackrc or /etc/stackpack
configuration file. For example, if a batch size of 1000 and 2 processors
are specified for the masking step, the batch size reported in the resulting
log file will be 1000/2, or 500.
- The number of projects listed in the project manager, both from the command
line and from the web interface, can be set in the [WEB] section of the
configuration file by changing the value of "PAGESIZE".
- stackPACK temp directory: During processing, stackPACK uses a temporary
directory (usually found in /usr/local/stackpack/tmp) to store intermediate
results and log files from the stackPACK pipeline. Large projects may
require a large temporary space. If there is insufficient temporary space,
the steps of the stackPACK pipeline will not complete. Additionally, if
stackPACK processing is interrupted in any way, the contents of the
temp directory may not be deleted. It is important to check the temporary
directory periodically and delete outdated temporary files manually.
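The periodic check of the temporary directory described in the last note above can be sketched in Python. This is an illustrative helper, not part of stackPACK: the function name stale_files and the 7-day threshold are assumptions, and the path is the usual location given above, which may differ on a particular installation. The sketch only lists candidate files; deleting them is left to the user.

```python
import os
import time


def stale_files(tmp_dir, max_age_days=7, now=None):
    """Return files under tmp_dir last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    stale = []
    for root, _dirs, files in os.walk(tmp_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # Usual location according to this manual; adjust for the local setup.
    for path in stale_files("/usr/local/stackpack/tmp"):
        print(path)
```

Files belonging to a pipeline step that is still running should not be removed, so the age threshold should be longer than the longest expected run.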
7. Support/Questions
For more information about stackPACK or answers to technical questions, please
contact the Electric Genetics team on:
phone +27 (0)21 959-3964
fax +27 (0)21 959-2512
e-mail support@egenetics.com
web http://www.egenetics.com/
8. About stackPACK
stackPACK v2.2 is Copyright(C) 1999 - 2002 Electric Genetics PTY Ltd.
All Rights Reserved.
stackPACK(tm), WebProbe(tm) and WebReport(tm) are trademarks of Electric
Genetics PTY Ltd.
d2_cluster(tm) and CRAW(tm) are trademarks of the University of Houston.
All other trademarks are property of their respective owners.