WebPipe™
Completion of WebPipe is necessary for initialization of the clustering process.
WebPipe currently accepts one dataset or sequence file per project.
All details must be completed for clustering to proceed.
Entering Project Details
Project name
- Enter a brief one-word project name.
- Legal characters for project names are alphanumerics and underscores ("_").
- More than one project may be given the same name provided that they are owned by different owners.
- StackPACK is not case-sensitive. Therefore, project names such as "UPPER" and "Upper" will be considered duplicates if they have the same owner.
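The naming rules above can be sketched as a small check. This is an illustration only, not stackPACK's own validation code: the function names are hypothetical, and comparing the owner case-insensitively is an assumption drawn from the statement that stackPACK is not case-sensitive.

```python
import re

# One word made only of letters, digits and underscores, as the rules above state.
_PROJECT_NAME_RE = re.compile(r"[A-Za-z0-9_]+")


def is_legal_project_name(name):
    """Return True if the name uses only the legal characters."""
    return _PROJECT_NAME_RE.fullmatch(name) is not None


def is_duplicate_project(name, owner, existing):
    """Return True if (name, owner) clashes with an existing project.

    `existing` is an iterable of (name, owner) pairs. Names are compared
    without regard to case; doing the same for owners is an assumption.
    """
    taken = {(n.lower(), o.lower()) for n, o in existing}
    return (name.lower(), owner.lower()) in taken
```

With these helpers, "Upper" submitted by an owner who already has a project named "UPPER" is reported as a duplicate, while the same name under a different owner is accepted.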
Owner
- Enter the project owner's full email address.
- Legal characters for project owner are alphanumerics, dot ("."), underscore ("_"), dash ("-") and "@".
- This e-mail address will be used to send the owner notification of project submission and completion.
Note:
Upon completion of the project, a log report of the clustering process will be e-mailed to the user.
This log contains details of each step in the pipeline, including parameter settings and any errors
that may have occurred. If the pipeline, and thus the project, fails to run to completion, the log
report is still e-mailed to the user.
Description
- Enter a detailed description of the project or data for the user's reference.
- All characters are legal in the description line.
- Description information is displayed in the WebProjectManager project listing (truncated to two lines) as well as at the top of the project summary report, accessed via either WebProjectManager or WebProbe.
- Project descriptions may be edited or added to via WebProjectManager's "Edit_Description" button.
Data Format
- Select the appropriate sequence data format from the pop-up box.
- StackPACK accepts any of the listed formats.
- Clicking on the Data Format link takes the user to detailed format descriptions.
- Every EST or mRNA sequence must have a unique accession number (per project) in order for all the data to be processed.
- If the file contains two or more sequences with the same accession number, the program will only process the first sequence. The duplicate(s) will not be imported or processed, regardless of their primary sequence data.
- For maximum processing efficiency and best results, place long sequences, such as mRNAs, at the top of the input data file.
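To check a file before submission, the duplicate rule and the ordering advice above can be mimicked with a short script. This is a sketch under assumptions, not stackPACK's importer: records are taken to be (accession, sequence) pairs that have already been parsed, accessions are compared as exact strings, and the function names are hypothetical.

```python
def keep_first_by_accession(records):
    """Split records into those that would be processed and the skipped duplicates.

    Only the first record for each accession is kept, mirroring the rule above.
    """
    seen = set()
    kept, skipped = [], []
    for accession, sequence in records:
        if accession in seen:
            skipped.append(accession)
            continue
        seen.add(accession)
        kept.append((accession, sequence))
    return kept, skipped


def longest_first(records):
    """Order records so that long sequences, such as mRNAs, come first."""
    return sorted(records, key=lambda record: len(record[1]), reverse=True)
```

Running `keep_first_by_accession` before upload shows which accessions would be dropped, and `longest_first` gives an order for writing the input file.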
Note: Mixed or Unknown FASTA Formats
- When data is submitted using the "Mixed or Unknown FASTA Format", stackPACK first attempts to identify one of its pre-defined FASTA headers. If these cannot be found, stackPACK then determines an accession number for the sequence entry by extracting all valid characters (alphanumeric and punctuation, such as "_" or ".") found between the > and the first space.
- Sequences will ONLY be imported if there are 255 or fewer valid characters between the > and the first space. Make sure that these characters are unique for each sequence submitted for processing. Sequences with accession numbers longer than 255 characters will not be imported.
- If possible, other details in the header may be parsed in as well. Otherwise, the remainder of the line is ignored.
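The fallback accession rule for the "Mixed or Unknown FASTA Format" can be approximated as follows. This is a minimal sketch, not stackPACK's parser: the exact set of valid punctuation is not listed in this manual, so the set used here (alphanumerics plus "_" and ".") is an assumption, and the function name is hypothetical.

```python
import string

# Assumption: only the punctuation named as examples in the text is treated as valid.
VALID_CHARS = set(string.ascii_letters + string.digits + "_.")
MAX_ACCESSION_LENGTH = 255


def fallback_accession(header):
    """Derive an accession from a FASTA header line, or return None.

    Takes the text between '>' and the first space, keeps the valid
    characters, and rejects results that are empty or longer than 255.
    """
    if not header.startswith(">"):
        return None
    token = header[1:].split(" ", 1)[0]
    accession = "".join(ch for ch in token if ch in VALID_CHARS)
    if not accession or len(accession) > MAX_ACCESSION_LENGTH:
        return None
    return accession
```

Applying this to every header in a file and looking for repeated or missing results is a quick way to find entries that would not be imported.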
Click the "Next" button to move to the next step: selecting the data file(s).
Selecting the Data File for GenBank or any of the FASTA Data Formats
- After clicking "Next", a window similar to the following will be displayed:

- Use the browse option to specify the name and location of the input data file. If the browse
option is not used, make sure that the exact drive and full path are specified.
- Use the "Back" button to change the project details that have been entered. The data format must
be re-selected when this button is used.
Selecting the PHRED quality scores and FASTA Data Format Files
- After clicking "Next", a window similar to the following will be displayed:

- Use the browse option to specify the name and location of the input data and quality files.
If the browse option is not used, make sure that the exact drive and full path are
specified.
- Use the "Back" button to change the project details that have been entered. The data format must
be re-selected when this button is used.
Note:
Because clone information is not included in the header line of the input sequence
data, clonelinking will not occur when PHRED quality files and the corresponding sequence data
files are processed.
Upload
- Click this button to create the project and start the clustering process.
- Upon initiation, the following window is displayed:

Below is a brief description of each step in the clustering pipeline:
- Importing:
- The input sequences are checked for the correct format, and are imported into stackPACK's database for processing by the clustering engine.
- Annotative information is parsed from the entries and stored in the database. The information collected varies depending on the input format specified.
- Masking:
- Users can choose to mask input sequences with either CrossMatch (Green, 1996) or RepeatMasker (Smit, AFA & Green, P., unpublished results). Alternatively, users may choose to skip the masking step.
- The system administrator must specify the choice of masking algorithm in the system-wide stackPACK configuration file.
- CrossMatch masks input sequences against a FASTA formatted file containing any DNA sequences. Typical masking files include any or all of the following:
- Repeat sequences. For STACKdb production Electric Genetics uses RepBase (Jurka, 1995). Your Systems Administrator may have installed RepBase or another repeat database more pertinent to your work.
- Common vector sequences, such as those distributed by NCBI.
- Other potential contaminants such as rodent, mitochondrial and ribosomal DNA.
- RepeatMasker masks input sequences against a single database only, such as one of the following:
- RepeatMasker-formatted version of RepBase. In this case, the user may select all repeats or limit masking to pre-defined taxonomic categories.
- Any FASTA file containing repeat sequences, vectors or other contaminants. In this case the entire file is used for masking.
- Sequences are masked by replacing the contaminated portions of the sequence with x's.
- Masked regions are retained throughout the clustering pipeline and are visible in the EST or mRNA Sequence View, PHRAP Alignment view and CRAW Alignment view.
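The effect of masking on a single sequence can be illustrated with a few lines of code. This only shows the replacement of matched regions with x's; it does not reproduce how CrossMatch or RepeatMasker find those regions. The function name and the 0-based, half-open (start, end) coordinates are assumptions made for the example.

```python
def mask_regions(sequence, regions):
    """Replace each (start, end) region of the sequence with 'x' characters.

    Coordinates are 0-based and half-open; parts of a region that fall
    outside the sequence are ignored.
    """
    chars = list(sequence)
    for start, end in regions:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "x"
    return "".join(chars)


masked = mask_regions("ACGTACGT", [(2, 5)])  # "ACxxxCGT"
```

The masked sequence keeps its original length, which is why masked regions remain visible at the same positions in the later views.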
Note: Vector sequences or other contaminants are not included in the RepeatMasker-formatted version of RepBase.
Because only one masking database can be specified in RepeatMasker, any concern about vector or contaminant
sequences should be addressed by masking for these artifacts prior to stackPACK processing.
Note: Vector sequences within the repeat.seq distributed by Electric Genetics are obtained from the vector.z file at NCBI. Some users have reported the presence of fragments of real genes within this file. Any comments or corrections regarding this should be sent to the curator at NCBI.
FAQ: Am I able to import pre-masked sequences?
Answer:
- Pre-masked data may be imported into stackPACK. However, if data is submitted via WebPipe, it will also be run through the masking step.
- The masking option can only be omitted from the pipeline when using command line data processing.
- Please refer to section 3.5 of the Command Line manual for further details.
- Clustering:
- Clustering employs d2_cluster, a high-performance comparison algorithm that rapidly determines the relative similarity of large datasets of genetic sequences using a non-alignment based method well-suited to single-pass sequences such as Expressed Sequence Tags (ESTs).
- d2_cluster implements a loose approach to sequence clustering by identifying and counting matching n-length words (n=6) - this loose approach presents the opportunity to identify splice variants and alternate expression forms.
- The d2_cluster algorithm ignores sequences with fewer than 50 valid bases. Only A, T, C and G are considered valid bases. Masked bases, represented by "x", are not valid and do not count toward the minimum required for processing by d2_cluster. Sequences below the minimum are not included in the clustering step and are considered singletons. The minimum of 50 is set by the d2_cluster minimum_sequence_size parameter and can be changed when using command line data processing. Please refer to section 3.6.1, Cluster Parameters, of the command line manual for further details.
- Sequences may be added incrementally to existing clusters via the WebProjectManager. The d2_cluster algorithm is also used for the addition of clusters.
- The clustering step is more efficient if long input data sequences, such
as mRNAs, are at the top of the imported input data file.
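The minimum-length rule for d2_cluster can be checked on a sequence before submission. The sketch below counts valid bases in the way the text describes; it is not part of d2_cluster, the function names are hypothetical, and treating lowercase a, t, c and g as valid is an assumption.

```python
MIN_VALID_BASES = 50  # default value of minimum_sequence_size given in the text


def count_valid_bases(sequence):
    """Count A, T, C and G; masked 'x' and any other symbol are not counted."""
    return sum(1 for base in sequence.upper() if base in "ATCG")


def is_clusterable(sequence, minimum=MIN_VALID_BASES):
    """Return True if the sequence has enough valid bases to be clustered."""
    return count_valid_bases(sequence) >= minimum
```

A sequence of 58 characters with 10 masked positions has only 48 valid bases, so under the default setting it would be treated as a singleton.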
- Alignment:
- The related but loose clusters are subsequently aligned and assembled by PHRAP in order to identify, characterize and isolate any sequence divergence.
- Contigs within clusters are formed when PHRAP finds such divergent groups within a cluster.
- Users may alternatively choose to use the original unmasked input sequences for alignment and assembly by PHRAP
via the command line. Please refer to section 3.7.1, Assembly Parameters and Configuration, of the
Command Line User Manual for further details.
- PHRAP has a limit of 64K bases for each sequence. If a cluster contains sequences that exceed this limit, a consensus sequence will not be generated for this cluster.
- PHRAP has a limit of 64K sequences per cluster. If the cluster members exceed this limit, a consensus sequence will not be generated for this cluster.
FAQ: Why are some sequences missing from my contig(s)?
Answer: There are two reasons why sequences may be missing from your contigs:
- If PHRAP cannot align all the sequences in the cluster
created by d2_cluster, it may create multiple contigs or
even singleton sequences. The accession number of these
singletons will be displayed in the cluster tree in the
left panel, but not within a contig.
- If stackPACK's default PHRAP flag "retain_duplicates" has been
removed from the configuration file, you may see similar behavior.
Without the retain_duplicates flag, if your cluster contains two or
more sequences with 100% sequence identity, PHRAP processes only the
first sequence. The accession number of these duplicate sequences will
be displayed in the cluster tree in the left panel, but not within a contig.
FAQ: How do I ensure sequences exceeding PHRAP's 64,000 base pair limit are included in the assembly step?
Answer:
- PHRAP can be compiled with a .longreads option that allows assembly of sequences with more than 64,000 bp.
- Please ask your Systems Administrator to compile this version of PHRAP.
- Please refer to section iv in the Installation Instructions (Part I) of the Documentation For PHRAP and CROSS_MATCH (Version 0.990319).
FAQ: How do I ensure that clusters exceeding PHRAP's reads limit are included in the assembly step?
Answer:
- PHRAP can be compiled with a .manyreads option that allows assembly of clusters with more than 64,000 members.
- Please ask your Systems Administrator to compile this version of PHRAP.
- Please refer to section iv in the Installation Instructions (Part I) of the Documentation For PHRAP and CROSS_MATCH (Version 0.990319).
FAQ: My assembly is running but no contigs are generated. Why?
Answer:
- During stackPACK processing, some results are temporarily stored in a "temp" directory
(usually /usr/local/stackpack/tmp). When this temp directory becomes full the assembly
step (stack_Assemble) will continue to run, but no contigs will be generated.
- To solve this, establish what temp items are no longer "active" and can be deleted.
- If this becomes a frequent problem, ask your Systems Administrator to configure stackPACK
to use a temp directory location with more available disk space.
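A simple way to notice a filling temp directory before it affects assembly is to check its free space periodically. The sketch below uses Python's standard `shutil.disk_usage`; the 5% threshold and the function names are assumptions, and the path must be replaced with the temp location configured on your system.

```python
import shutil


def free_fraction(path):
    """Return the fraction of the file system holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def temp_dir_is_low(path, min_free_fraction=0.05):
    """Return True if less than `min_free_fraction` of the space is free."""
    return free_fraction(path) < min_free_fraction
```

For example, `temp_dir_is_low("/usr/local/stackpack/tmp")` could be run from a scheduled job, assuming that is the temp directory in use.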
- Analysis:
- stackPACK uses all available sequence data to analyze contigs for variation and to produce the longest representative consensus sequence(s).
- The CRAW program is employed to generate consensus sequences, maximize consensus length, partition sub-assemblies and to annotate polymorphic regions and alternative splicing forms in the clusters.
- After CRAW has identified sub-assemblies, stackPACK maximizes the consensus sequence for each sub-cluster and prepares a separate alignment, visible in the CRAW alignment view, for each sub-assembly.
- CRAW has a sequence length limit of 100,000 bp and will analyze contigs with no more than 20,000 members. Contigs that exceed these limits will not be processed by CRAW.
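The PHRAP and CRAW size limits described in this section can be checked against a list of sequence lengths. This is a sketch for planning purposes, not part of stackPACK: the names are hypothetical, and "64K" is read as 64,000 on the basis of the FAQ wording, which is an assumption.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Limits:
    max_sequence_length: int
    max_members: int


# Assumption: "64K" is interpreted as 64,000 for both PHRAP limits.
PHRAP_LIMITS = Limits(max_sequence_length=64_000, max_members=64_000)
CRAW_LIMITS = Limits(max_sequence_length=100_000, max_members=20_000)


def limit_violations(lengths: Sequence[int], limits: Limits) -> List[str]:
    """Return a description of each limit that the given lengths exceed."""
    problems = []
    if len(lengths) > limits.max_members:
        problems.append("too many members")
    if any(n > limits.max_sequence_length for n in lengths):
        problems.append("sequence too long")
    return problems
```

An empty result means neither limit is exceeded; a cluster containing one 70,000 bp sequence, for instance, is flagged for PHRAP but not for CRAW.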
FAQ: How does stackPACK select the primary consensus sequence for a contig?
Answer:
- StackPACK first selects the longest consensus sequence.
- If two or more consensus sequences are the same length, it selects the consensus with the largest number of member sequences.
- If these are the same, it selects the consensus with the greatest number of "good" bases (A, C, G or T).
- If these are the same it selects the consensus sequence with the least number of N's and IUPAC codes.
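The four tie-breaking rules above amount to an ordered comparison, which can be written as a sort key. This is an illustrative sketch rather than stackPACK's code: the `Consensus` record and function names are hypothetical, and the set of IUPAC ambiguity codes counted in the last rule is an assumption about what "N's and IUPAC codes" covers.

```python
from dataclasses import dataclass

# N plus the standard IUPAC nucleotide ambiguity codes.
AMBIGUITY_CODES = set("NRYSWKMBDHV")


@dataclass
class Consensus:
    name: str
    sequence: str
    member_count: int


def selection_key(consensus):
    """Build a key so that the preferred consensus compares as the largest."""
    seq = consensus.sequence.upper()
    good = sum(1 for base in seq if base in "ACGT")
    ambiguous = sum(1 for base in seq if base in AMBIGUITY_CODES)
    # Longest first, then most members, then most good bases,
    # then fewest ambiguity codes (negated so that fewer ranks higher).
    return (len(seq), consensus.member_count, good, -ambiguous)


def primary_consensus(candidates):
    """Pick the primary consensus from a non-empty list of candidates."""
    return max(candidates, key=selection_key)
```

Because tuples compare element by element, each later rule is consulted only when all earlier ones are tied.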
- Linking:
- All ESTs generated from the same cDNA clone correspond to a single gene. Upon sequence import stackPACK attempts to locate clone identification details for each sequence so that the transcripts corresponding to the same gene can be identified.
- The linking algorithm connects non-overlapping sequences that belong to the same clone in order to maximize final consensus sequence length.
- The linking criterion is defined by the redundancy parameter, which defines the number of matching clone IDs
required to link two non-overlapping clusters. This parameter can be specified in the stackPACK configuration
file by your Systems Administrator. By default, stackPACK links clusters that share at least two independent
clone IDs in order to reduce false-positives and accommodate annotative errors found in public sequence data.
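The linking criterion can be illustrated by counting the clone IDs that two clusters share. The sketch below is not the stackPACK linking algorithm: the input layout (a mapping from cluster name to a set of clone IDs) and the function name are assumptions, and only the default redundancy of two comes from the text.

```python
from itertools import combinations


def linkable_pairs(cluster_clones, redundancy=2):
    """Return pairs of clusters sharing at least `redundancy` clone IDs.

    `cluster_clones` maps a cluster name to the set of clone IDs found
    among its member sequences.
    """
    pairs = []
    for first, second in combinations(sorted(cluster_clones), 2):
        shared = cluster_clones[first] & cluster_clones[second]
        if len(shared) >= redundancy:
            pairs.append((first, second))
    return pairs
```

Lowering `redundancy` to 1 links any two clusters with a single clone ID in common, which shows why the default of two reduces false positives caused by annotation errors.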
FAQ: How can I ensure that clonelinking is performed on my data?
Answer:
- Clonelinking can only be performed if the clone ID information is parsed from the header lines of the input sequence data.
- Clone ID information can only be parsed if it is in a defined field within the header line.
- Formats with defined fields for clone ID information, allowing clonelinking to occur, include the GenBank, Simple
FASTA and STACK FASTA formats. Clonelinking may sometimes occur with NCBI FASTA and Mixed or Unknown FASTA formats,
and will never occur when using PHRED quality scores.
FAQ: If the processing is interrupted or if I want to change any of the program parameter settings, can I re-run any portion of the steps in the pipeline?
Answer:
- You may "undo" the clustering, assembly, analysis and linking steps of a project.
- This option will reverse all the steps including and subsequent to the step being undone in the stackPACK pipeline.
- This option can only be run via command line data processing.
- Please refer to section 3.11 of the command line manual for details.
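The scope of an "undo" can be pictured as a slice of the ordered step list. The sketch below only models which steps are reversed; it does not invoke any stackPACK command, and the step labels and function name are hypothetical.

```python
# Undoable steps in pipeline order, as listed in the answer above.
UNDOABLE_STEPS = ["clustering", "assembly", "analysis", "linking"]


def steps_reversed_by_undo(step):
    """Return the step being undone together with every later step.

    Raises ValueError if `step` is not one of the undoable steps.
    """
    index = UNDOABLE_STEPS.index(step)
    return UNDOABLE_STEPS[index:]
```

Undoing "assembly", for example, also reverses analysis and linking, while clustering results are kept.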
FAQ: When submitting a clustering job using Internet Explorer as the web browser, WebPipe gives me an error that the server has "timed out". Does this affect the integrity of my data?
Answer:
- No, your data will still undergo all steps in the pipeline and your results will be unaffected by this.
- If you want to avoid this, please use Netscape as your web browser or use command line data processing.