Input data format descriptions
Files used as input for stackPACK may be in the following formats:
- GenBank flatfile format
GenBank flatfile format is defined as the format of sequence entries
in the GenBank database or as downloaded from the NCBI web site (e.g., Entrez search
results) when GenBank format is specified. The full GenBank format specification is found
in section 3.4 of the GenBank release
notes.
- Use GenBank format to capture the maximum amount of annotation information on import.
- NCBI entries that have been deleted and replaced by a new entry with a different
accession number may have the old accession number appended to the new accession number
within the NCBI GenBank ACCESSION field, e.g. U15570 L36804. In these cases stackPACK
strips everything after the space and keeps only the first accession number.
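The ACCESSION rule above can be illustrated with a minimal Python sketch. The function name `primary_accession` is hypothetical and is not part of stackPACK; it only mirrors the described behaviour of keeping the text before the first space.

```python
def primary_accession(accession_field: str) -> str:
    """Return the part of an ACCESSION value that precedes the first space.

    Illustrative only: this mimics the behaviour described above and is
    not stackPACK's own code.
    """
    return accession_field.strip().split(" ", 1)[0]


print(primary_accession("U15570 L36804"))
```

For the example in the text, the sketch keeps U15570 and drops L36804.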
- Simple FASTA format
- >[accession].[direction] [clone ID] (where direction is either "r1" for a 3-Prime clone or "f1" for a 5-Prime clone)
- e.g.
>37463.f1 g83244
ACGTGACTGCTACGTACGGGCGTTACGACTGCTACGATCGCATGC
TATGTCGTAGCAGCCGTGTACACGTGTTTATTCGTAGGGCTTCTA
CTCAGTCGTACGTACGTACGT
Note: A sequence with the accession "R.C.3746" cannot be supplied in simple FASTA format, because the program will parse "R" as the accession and "C" as the direction. In such cases, use the GUESS option.
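To make the simple FASTA header layout and the "R.C.3746" caveat concrete, here is a rough Python sketch. It assumes the header token is split on "." with the first piece taken as the accession and the second as the direction; the function name `parse_simple_header` and the returned keys are hypothetical, and stackPACK's real parser may differ in detail.

```python
def parse_simple_header(line: str) -> dict:
    """Split a simple FASTA header of the form >[accession].[direction] [clone ID].

    Assumption: the accession is everything before the first "." and the
    direction is the piece that follows it.
    """
    token, _, rest = line[1:].strip().partition(" ")
    parts = token.split(".")
    direction = parts[1] if len(parts) > 1 else None
    return {
        "accession": parts[0],
        "direction": direction,
        # "f1" marks a 5-Prime clone, "r1" a 3-Prime clone; anything else is unknown.
        "end": {"f1": "5-PRIME", "r1": "3-PRIME"}.get(direction),
        "clone_id": rest.strip() or None,
    }


print(parse_simple_header(">37463.f1 g83244"))
print(parse_simple_header(">R.C.3746"))
```

Under this assumption the second call yields accession "R" and direction "C", which is the misparse the note describes.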
- STACK FASTA format
- >[accession] [gi] | [accession] CLONE: [clone] CLONE_LIB: [clonelib] LEN: [len] FILE [source file] [direction<5-PRIME|3-PRIME>] DEFN: [descriptive text]
- e.g.
>T27877 g609975 | T27877 CLONE: 17194 CLONE_LIB: Human Eye LEN: 505 bp FILE gbest3.seq 5-PRIME DEFN: EST19137 Homo sapiens cDNA 5'end
ACGTGACTGCTACGTACGGGCGTTACGACTGCTACGATCGCATGC
TATGTCGTAGCAGCCGTGTACACGTGTTTATTCGTAGGGCTTCTA
CTCAGTCGTACGTACGTACGT
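As an illustration of how the STACK FASTA fields line up, the following Python sketch parses the example header with a regular expression. The pattern, the function name `parse_stack_header` and the group names are assumptions derived from the template above, not stackPACK's own implementation.

```python
import re

# Hypothetical pattern built from the header template shown above.
STACK_HEADER = re.compile(
    r"^>(?P<accession>\S+)\s+(?P<gi>\S+)\s+\|\s+\S+\s+"
    r"CLONE:\s*(?P<clone>.*?)\s+"
    r"CLONE_LIB:\s*(?P<clone_lib>.*?)\s+"
    r"LEN:\s*(?P<length>\d+)(?:\s*bp)?\s+"
    r"FILE:?\s*(?P<source_file>\S+)\s+"
    r"(?P<direction>5-PRIME|3-PRIME)\s+"
    r"DEFN:\s*(?P<definition>.*)$"
)


def parse_stack_header(line: str):
    """Return the header fields as a dict, or None if the line does not match."""
    match = STACK_HEADER.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["length"] = int(fields["length"])
    return fields


header = (">T27877 g609975 | T27877 CLONE: 17194 CLONE_LIB: Human Eye "
          "LEN: 505 bp FILE gbest3.seq 5-PRIME DEFN: EST19137 Homo sapiens cDNA 5'end")
print(parse_stack_header(header))
```

Unlike the NCBI FASTA format, the clone and clone library appear here as labelled fields, so they can be extracted directly.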
- NCBI FASTA format
>gi|4468770|emb|AJ009167.1|TSAJ9167 Trypanosoma sp. 18S rRNA gene, isolate K&A
ACGTGACTGCTACGTACGGGCGTTACGACTGCTACGATCGCATGC
TATGTCGTAGCAGCCGTGTACACGTGTTTATTCGTAGGGCTTCTA
CTCAGTCGTACGTACGTACGT
Note: In NCBI format, clone information is part of the free-form text in the definition line, so it
is not parsed. Data imported as NCBI format therefore cannot be clone
linked.
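The following Python sketch shows which fields can be recovered from an NCBI FASTA header like the one above, and why clone data is not among them. It assumes the pipe-delimited gi|number|database|accession.version|locus layout seen in the example; the function name `parse_ncbi_header` and the returned keys are hypothetical.

```python
def parse_ncbi_header(line: str):
    """Split an NCBI-style header into identifier fields and a definition.

    Assumption: the identifier looks like gi|<gi>|<db>|<accession.version>|<locus>.
    The definition is kept as free text, so no clone fields are extracted.
    """
    ident, _, definition = line[1:].strip().partition(" ")
    fields = ident.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        return None
    accession, _, version = fields[3].partition(".")
    locus = fields[4] if len(fields) > 4 and fields[4] else None
    return {
        "gi": fields[1],
        "database": fields[2],
        "accession": accession,
        "version": version or None,
        "locus": locus,
        "definition": definition,
    }


print(parse_ncbi_header(
    ">gi|4468770|emb|AJ009167.1|TSAJ9167 Trypanosoma sp. 18S rRNA gene, isolate K&A"
))
```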
- Mixed or unknown FASTA formats
Files with mixed FASTA header line formats or files with FASTA header lines not described above can also be imported.
- If stackPACK does not identify one of its pre-defined FASTA headers,
the program determines an accession number for the sequence entry by
extracting all valid characters (alphanumeric and punctuation such as "_" or ".") found
between the > and the first space.
- Sequences will ONLY be imported if there are 255 or fewer valid characters
between the > and the first space. Make sure that this string of up to 255
characters is unique for each sequence entered for processing.
- Where possible, other details in the header are parsed as well.
Otherwise, the remainder of the line is ignored.
- The minimum requirement for a FASTA format input file to stackPACK is:
>[accession number]
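The fallback rules for unrecognised headers can be sketched in Python as follows. The exact set of valid punctuation is not listed beyond "_" and ".", so `VALID_CHARS` is an assumption, as are the function names `fallback_accession` and `duplicate_accessions`; this is an illustration of the described rules, not stackPACK's code.

```python
import string
from collections import Counter

# Assumption: only ASCII letters, digits, "_" and "." count as valid characters.
VALID_CHARS = set(string.ascii_letters + string.digits + "_.")
MAX_ACCESSION_LENGTH = 255


def fallback_accession(line: str):
    """Extract an accession from the text between ">" and the first space.

    Returns None when the line is not a header, when no valid characters
    remain, or when more than 255 valid characters are found.
    """
    if not line.startswith(">"):
        return None
    token = line[1:].split(" ", 1)[0]
    accession = "".join(ch for ch in token if ch in VALID_CHARS)
    if not accession or len(accession) > MAX_ACCESSION_LENGTH:
        return None
    return accession


def duplicate_accessions(header_lines):
    """List accessions that occur more than once, since they must be unique."""
    counts = Counter(fallback_accession(line) for line in header_lines)
    return sorted(acc for acc, n in counts.items() if acc is not None and n > 1)


print(fallback_accession(">my_seq.1 free text that is ignored"))
print(duplicate_accessions([">seqA one", ">seqB two", ">seqA three"]))
```

Running a check like `duplicate_accessions` over the header lines before import is one way to confirm the uniqueness requirement described above.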