Tips and Tools for Configuring and Parameterizing stackPACK™



 
 
Looking for ways to:
  • optimize your stackPACK usage?
  • better understand how stackPACK works?
  • understand all the data types created by stackPACK?
  • learn how to customize stackPACK for your site?
  • change the parameters in stackPACK for your personal use?
You have come to the right place...


 

Table of Contents

  1. Description of stackPACK processing pipeline and resulting data
  2. Configuration and Parameterization of stackPACK on a system-wide basis
  3. Parameterization of stackPACK by an individual user
  4. Detailed description of all configuration items and parameters accessible in stackPACK

StackPACK processing

StackPACK processes sequence data through a series of steps.  Data is collected and retained at each step in the process.  This data is stored in stackPACK's MySQL database and is retrieved via the various views available in WebProbe or through the output reports.  The table below shows each step in the stackPACK process, the resulting data and the accession number or name for that data when seen later in stackPACK's output views and reports.
 
 
Step | Command-line | Web-based | Results in... | Results name
Project Creation | stack_ProjectManager | WebPipe | Project created in database with project name, owner and description | N/A
Data Import | stack_ImportFasta or stack_ImportGenbank | WebPipe | Sequences with original accession numbers stored in database | Sequences' original accession numbers
Masking | stack_Mask | WebPipe | Original sequence overwritten with masked version of sequence. What is masked depends on the makeup of each site's masking file. We recommend that each site create a masking file appropriate for their use. | Sequences' original accession numbers
Clustering | stack_Cluster | WebPipe | Large, loose groupings of sequences brought together because they share similar words. These are grouped using the d2_cluster algorithm. | cl#: group of members; no consensus or alignment is generated
Cluster Assembly | stack_Assemble | WebPipe | One or more contigs created by attempting to assemble a loose cluster using PHRAP. | ct#: PHRAP assembly; PHRAP consensus (also known as contig consensus)
Assembly Analysis | stack_Analysis | WebPipe | Each contig is analyzed using CRAW as well as stack_Analysis to identify possible subassemblies (potential alternate forms). Final consensus sequences are created, one for each subassembly. The primary consensus is chosen from all possible subassembly consensus sequences. | cn#: final consensus sequence(s) for the contig; Alignment Analysis View (CRAW output); CRAW alignment (final processed alignment resulting in final consensus)
Linking | stack_Link | WebPipe | Non-overlapping clusters are grouped based on sharing clone IDs (e.g., 3' and 5' clusters that do not overlap but come from the same clone). A consensus sequence is generated by concatenating the primary consensus sequences from each cluster in the linked group. These are separated by a series of 10 N's. | ln#: linked consensus sequence
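
As an illustration of the naming scheme and the linking step described in this section, the following Python sketch maps an accession prefix to the step that produced it and builds a linked consensus by joining primary consensus sequences with 10 N's. It is not part of stackPACK: the function names are hypothetical, and it assumes that an accession is a two-letter prefix followed by digits.

```python
import re

# Accession prefixes as documented for each pipeline step.
PREFIX_TO_STEP = {
    "cl": "Clustering",
    "ct": "Cluster Assembly",
    "cn": "Assembly Analysis",
    "ln": "Linking",
}

# Assumption: an accession is one of the prefixes followed only by digits.
_ACCESSION = re.compile(r"^(cl|ct|cn|ln)(\d+)$")


def step_for_accession(accession):
    """Return the pipeline step that produced an accession, or None."""
    match = _ACCESSION.match(accession)
    return PREFIX_TO_STEP[match.group(1)] if match else None


def linked_consensus(primary_consensus_seqs, spacer_len=10):
    """Join primary consensus sequences with a run of N's between them."""
    return ("N" * spacer_len).join(primary_consensus_seqs)
```

Anything that does not match a known prefix (for example, an original sequence accession number) returns None, which reflects that import and masking keep the original accession numbers.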

The stackPACK system and STACKdb database of clustered human ESTs and mRNAs are described in more detail in two publications:

  1. STACK: sequence tag alignment and consensus knowledgebase.; Christoffels A, Gelder Av, Greyling G, Miller R, Hide T, Hide W.; Nucleic Acids Research Vol. 29, 2001; pg 234-238
  2. A Comprehensive Approach to Clustering of Expressed Human Gene Sequence: The Sequence Tag Alignment and Consensus Knowledge Base.; Miller RT, Christoffels AG, Gopalakrishnan C, Burke J, Ptitsyn AA, Broveak TR, Hide WA; Genome Research Vol. 9(11), 1999; pg 1143-1155

StackPACK System-wide Configuration and Parameters

StackPACK is configured on a system-wide basis through the /etc/stackpack configuration file.
The /etc/stackpack file governs all web-based use of stackPACK and any command-line use where the individual user has not created their own configuration file.

The /etc/stackpack configuration file contains three sections:

  1. System Configuration
  2. User Editable Parameters
  3. Detailed Program Parameters


A full description of each parameter in the /etc/stackpack file is provided below.
 
 

StackPACK User Configuration and Parameters

StackPACK is designed so that an individual user can adjust and experiment with parameters without affecting other users on the stackPACK system.  This is accomplished through a .stackpackrc file placed in the user's home directory.  Expert configuration is covered in our Commandline User Manual.

Note that the .stackpackrc file takes effect only when the user runs stackPACK from the command line.  Any web-based use of stackPACK is governed by the system-wide /etc/stackpack file.
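
The precedence rule can be sketched in Python as follows. This is an illustration only, not stackPACK's own lookup code: the function name and its arguments are hypothetical, and the sketch assumes that the presence of a .stackpackrc file in the home directory is what selects it for command-line runs.

```python
import os

SYSTEM_CONFIG = "/etc/stackpack"
USER_CONFIG_NAME = ".stackpackrc"


def config_path(web_based, home_dir):
    """Return the configuration file that governs a run."""
    # Web-based use is always governed by the system-wide file.
    if web_based:
        return SYSTEM_CONFIG
    # Command-line use prefers the user's own file when one exists.
    user_config = os.path.join(home_dir, USER_CONFIG_NAME)
    return user_config if os.path.isfile(user_config) else SYSTEM_CONFIG
```

For a command-line run, config_path(False, os.path.expanduser("~")) returns the user's file if it has been created and /etc/stackpack otherwise.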

The .stackpackrc file contains two sections:

  1. User Editable Parameters
  2. Detailed Program Parameters

A full description of each parameter in the .stackpackrc file is provided below.

To construct a .stackpackrc file, do the following:

  1. Copy the system-wide /etc/stackpack configuration file into the user's home directory and name the copy .stackpackrc.
  2. Edit the file to remove the first section, "System Configuration".
  3. Edit the parameters in the remaining file to suit your clustering needs.
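
Steps 1 and 2 can also be scripted. The sketch below is a hypothetical helper, not a stackPACK tool. It assumes that the second section of the file begins at a line containing the text "User Editable Parameters"; check how your copy of /etc/stackpack marks its sections before relying on it.

```python
def strip_system_section(config_text, marker="User Editable Parameters"):
    """Return config_text starting at the first line that contains marker.

    Assumption: the system section is everything before that line.
    """
    lines = config_text.splitlines(keepends=True)
    for index, line in enumerate(lines):
        if marker in line:
            return "".join(lines[index:])
    # Refuse to guess if the marker is missing.
    raise ValueError("marker %r not found" % marker)
```

To use it, read /etc/stackpack, pass the text to strip_system_section, and write the result to .stackpackrc in the user's home directory. The function raises an error rather than writing a file that still contains system settings.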

Detailed Parameter Descriptions

The following table gives a detailed description of each of the parameters accessible via stackPACK's /etc/stackpack and .stackpackrc configuration files.
 
 
Parameter | Settings | Effects
System Configuration
STACKPACK_BASE | Directory location where stackPACK installation sits | Change if stackPACK installation is moved to a different location
STACKPACK_BIN | Directory location where stackPACK executables are stored |
STACKPACK_LIB | Directory location where stackPACK library files are stored |
STACKPACK_LIB_EXTERNAL | Directory location where third-party library files are stored |
STACKPACK_TMP | Location of temporary directory used to store intermediate results while stackPACK processing occurs. | If this directory runs out of space, stackPACK will not run. Ensure the temp directory is in a location with plenty of space. More users and larger projects require more temp space.
STACKPACK_LIB_PYTHON | Directory location where stackPACK's python libraries are stored |
STACKPACK_SUPPORTING | Directory location where supporting files are stored. E.g., this is the default location for the repeats file distributed with stackPACK and the script used to upgrade projects from 2.0 to 2.1 format. |
STACKPACK_LOG | Location of stackCORBAd log file. |
[DATABASE]
ODBCSYSINI | location of odbc files that identify location and type of database used behind stackPACK |
DSN_NAME | table containing list of stackPACK projects |
DSN_LOGIN | login used by stackPACK when accessing the backend database |
DSN_PASSWORD | password used by stackPACK when accessing the backend database |
[WEBPROBE]
HTTP_SERVER | web server used for WebProbe viewing software |
HTTPD_LOCATION | location of cgi and html directories where WebProbe is stored |
HTML_LOCATION | subdirectory where WebProbe html files are stored |
CGI_LOCATION | subdirectory where WebProbe cgi files are stored |
[WEBPIPE]
SMTP_SERVER | mail server used to send confirmation and conclusion messages at the beginning and end of each web-based stackPACK run |
HTTP_SERVER | web server used by WebPipe clustering submission software |
HTTPD_LOCATION | location of cgi and html directories where WebPipe is stored |
HTML_LOCATION | subdirectory containing WebPipe html files |
CGI_LOCATION | Full path of directory containing WebPipe cgi files |
INSTALLED | Yes/No: whether or not WebPipe is installed. | This is set to "No" for sites that have purchased STACKdb and not stackPACK. If "No", WebPipe will not appear in the menus for web-based stackPACK.
[ILU]
ILU_LIB | location of ilu libraries |
ILU_BINDING | location of ilubinding directory |
ILU_SERVERNAME | name of the ilu CORBA server |
ILU_MAXREQUESTS | number of client requests the server will handle before shutting itself down. | Increase if you experience crashes under heavy usage; decrease if stackCORBAd uses too much memory.
ILU_GARBAGESIZE | Number of client connections that can be maintained simultaneously. | Increase if you experience crashes due to heavy use.
User Editable Parameters
[stack_Mask]
program | either cross_match or RepeatMasker |
mask_file | full path for location of repeats file to be used for masking |
num_cpus | number of CPUs on which the masking should be run | Increasing the number of CPUs can speed processing, but will also use more memory. These two factors must be balanced.
batch_size | for cross_match, number of sequences processed in each 'batch' | A higher number speeds cross_match, but uses more memory. If you are running out of memory, reduce this number.
[stack_Cluster]
num_cpus | number of CPUs on which the d2_cluster algorithm should be run | More CPUs = faster speed
[stack_Assemble]
num_cpus | number of CPUs on which the assembly should be run | Increasing the number of CPUs can speed processing, but will also use more memory. These two factors must be balanced.
[stack_Analysis]
[stack_Link]
redundancy | number of independent clone IDs that must match before two clusters are considered "linked" | If this number is low (e.g., 1) and you are using public data, you may experience spurious linking due to errors in the public data annotations.
max_seq_per_clone | Maximum number of sequences that may have the same clone ID | This is used as a check-and-balance to ensure that erroneous clone ID information doesn't slip through and cause, e.g., all data to link together.
External Programs
[cross_match]
executable | full path to program executable |
flags | you can set all flags (parameters) for cross_match by entering them here | See cross_match documentation for details about the effects of changing cross_match parameters.
[RepeatMasker]
executable | full path to program executable |
flags | you can set all flags (parameters) for RepeatMasker by entering them here | See RepeatMasker documentation for details about the effects of changing RepeatMasker parameters.
[enc_db]
executable | full path to program executable |
[d2_cluster]
executable | full path to program executable | See d2_cluster paper for more details on parameters and parameterization. Remember, d2_cluster is NOT an alignment-based similarity algorithm, so results and parameterization are different from alignment-based methods.
word_size | size of words used to calculate comparison | Increased word size increases selectivity. Maximum word size is 9. Default word size has been optimized; change this parameter with care.
similarity_cutoff | percent similarity within window required for positive match | Lower similarity can create looser clusters and increase cluster membership. If similarity is too low, the assembly stage may have difficulty aligning the sequences.
minimum_sequence_size | sequences below this length (prior to masking) are not processed through d2_cluster |
window_size | the window within which comparisons are made | Window size can be increased if your sequences are all longer (e.g., clustering only mRNAs). Clustering with longer window sizes, however, runs the risk of missing regions of variation that are shorter than the window.
reverse_comparisons | whether reverse complement sequences should also be compared | If value = 1, both the sequence and its reverse complement are used in the d2_cluster comparison. If value = 0, only the sequence in its original orientation is used.
[phrap]
executable | full path to program executable |
old_ace | controls phrap output format | For older versions of PHRAP, set to 0; for the current version of PHRAP, set to 1.
vector_bound | Number of potential vector bases at beginning of each read. Matches that lie entirely within this region are assumed to represent vector matches and are ignored. | Currently set to 0 so no bases are ignored. Increasing this number will decrease the number of bases used in calculating the phrap alignment.
trim_score | Minimum score for identifying degenerate sequence at beginning & end of read. |
forcelevel | Relaxes stringency to varying degree during final contig merge pass. Allowed values are integers from 0 (most stringent) to 10 (least stringent), inclusive. | Increasing this will bring more sequences into the contig.
penalty | Mismatch (substitution) penalty for SWAT comparisons. | Increasing this will make alignment comparisons more stringent (i.e., fewer mismatches allowed).
gap_init | Gap initiation penalty for SWAT comparisons. | Increasing penalty will decrease the number of gaps allowed.
gap_ext | Gap extension penalty for SWAT comparisons. | Increasing penalty will decrease the size/length of gaps allowed.
ins_gap_ext | Insertion gap extension penalty for SWAT comparisons (insertion in subject relative to query). |
del_gap_ext | Deletion gap extension penalty for SWAT comparisons (deletion in subject relative to query). |
maxgap | Maximum permitted size of an unmatched region in merging contigs, during first (most stringent) merging pass. | Increased size will allow large unmatching regions and may allow alternate forms to be aligned more readily.
flags | you can set all remaining/other flags (parameters) for phrap by entering them here | See PHRAP documentation for more details on parameters for PHRAP.
[ace2gde]
executable | full path to program executable |
[craw]
executable | full path to program executable |
sig | |
window_size | Window used for CRAW calculation |
ignore_first | Number of bases at beginning of sequence that are ignored |
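
Several parameters above have documented limits: word_size has a maximum of 9, reverse_comparisons is 0 or 1, and forcelevel is an integer from 0 to 10. The Python sketch below checks those limits before a run. It is a hypothetical helper, not part of stackPACK, and it assumes the configuration file uses INI-style "[section]" headers with "name = value" lines that Python's configparser can read. The exact file syntax is not specified here, so adapt the parsing to your file.

```python
import configparser


def check_parameters(config_text):
    """Return a list of problems found against the documented limits."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)  # assumes INI-style "name = value" lines
    problems = []

    if parser.has_section("d2_cluster"):
        d2 = parser["d2_cluster"]
        if "word_size" in d2 and d2.getint("word_size") > 9:
            problems.append("d2_cluster.word_size exceeds the maximum of 9")
        if ("reverse_comparisons" in d2
                and d2.getint("reverse_comparisons") not in (0, 1)):
            problems.append("d2_cluster.reverse_comparisons must be 0 or 1")

    if parser.has_section("phrap"):
        phrap = parser["phrap"]
        if "forcelevel" in phrap and not 0 <= phrap.getint("forcelevel") <= 10:
            problems.append("phrap.forcelevel must be between 0 and 10")

    return problems
```

An empty list means none of the checked limits were violated. Parameters without a documented range, such as similarity_cutoff, are deliberately not checked.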