Schema for N-SCAN - N-SCAN Gene Predictions
  Database: equCab2    Primary Table: nscanGene    Row Count: 19,612   Data last updated: 2008-08-19
Format description: A gene prediction with some additional info.
On download server: MariaDB table dump directory
| field | example | SQL type | info | description |
|---|---|---|---|---|
| bin | 585 | smallint(5) unsigned | range | Indexing field to speed chromosome range queries. |
| name | chr1.001.1 | varchar(255) | values | Name of gene (usually transcript_id from GTF) |
| chrom | chr1 | varchar(255) | values | Reference sequence chromosome or scaffold |
| strand | + | char(1) | values | + or - for strand |
| txStart | 6739 | int(10) unsigned | range | Transcription start position (or end position for minus strand item) |
| txEnd | 16275 | int(10) unsigned | range | Transcription end position (or start position for minus strand item) |
| cdsStart | 7126 | int(10) unsigned | range | Coding region start (or end position for minus strand item) |
| cdsEnd | 16275 | int(10) unsigned | range | Coding region end (or start position for minus strand item) |
| exonCount | 13 | int(10) unsigned | range | Number of exons |
| exonStarts | 6739,11198,11908,12331,1300... | longblob | | Exon start positions (or end positions for minus strand item) |
| exonEnds | 7196,11261,11968,12406,1304... | longblob | | Exon end positions (or start positions for minus strand item) |
| score | 0 | int(11) | range | score |
| name2 | chr1.001 | varchar(255) | values | Alternate name (e.g. gene_id from GTF) |
| cdsStartStat | cmpl | enum('none', 'unk', 'incmpl', 'cmpl') | values | Status of CDS start annotation (none, unknown, incomplete, or complete) |
| cdsEndStat | cmpl | enum('none', 'unk', 'incmpl', 'cmpl') | values | Status of CDS end annotation (none, unknown, incomplete, or complete) |
| exonFrames | 0,1,1,1,1,1,2,2,0,1,2,2,0, | longblob | | Reading frame of the start of the CDS region of the exon, in the direction of transcription (0,1,2), or -1 if there is no CDS region. |
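The field list above can be read into a small record type. The following is a minimal sketch, assuming each row arrives as one tab-separated line with the 16 columns in schema order (the usual layout of a table dump, but not confirmed here). The names `GenePred`, `int_list`, `parse_row`, and `cds_blocks` are illustrative and not part of any UCSC tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GenePred:
    """One nscanGene row; attribute order mirrors the schema fields."""
    bin: int
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_count: int
    exon_starts: List[int]
    exon_ends: List[int]
    score: int
    name2: str
    cds_start_stat: str
    cds_end_stat: str
    exon_frames: List[int]


def int_list(field: str) -> List[int]:
    """Parse a comma-separated blob such as '366271,366883,' (trailing comma allowed)."""
    return [int(x) for x in field.split(",") if x]


def parse_row(line: str) -> GenePred:
    """Parse one tab-separated row in schema column order (assumed layout)."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 16:
        raise ValueError(f"expected 16 fields, got {len(f)}")
    gene = GenePred(
        int(f[0]), f[1], f[2], f[3],
        int(f[4]), int(f[5]), int(f[6]), int(f[7]), int(f[8]),
        int_list(f[9]), int_list(f[10]),
        int(f[11]), f[12], f[13], f[14], int_list(f[15]),
    )
    lengths = (len(gene.exon_starts), len(gene.exon_ends), len(gene.exon_frames))
    if any(n != gene.exon_count for n in lengths):
        raise ValueError("exonCount disagrees with the exon lists")
    return gene


def cds_blocks(gene: GenePred) -> List[Tuple[int, int]]:
    """Clip each exon to [cdsStart, cdsEnd); exons with no coding bases are dropped."""
    blocks = []
    for start, end in zip(gene.exon_starts, gene.exon_ends):
        s, e = max(start, gene.cds_start), min(end, gene.cds_end)
        if s < e:
            blocks.append((s, e))
    return blocks


# Row chr1.009.1 from the Sample Rows on this page, written tab-separated.
row = "\t".join([
    "587", "chr1.009.1", "chr1", "+", "366271", "367317", "366343", "367317",
    "2", "366271,366883,", "366782,367317,", "0", "chr1.009", "cmpl", "cmpl", "0,1,",
])
gene = parse_row(row)
coding_length = sum(e - s for s, e in cds_blocks(gene))
print(gene.name, cds_blocks(gene), coding_length)
# prints: chr1.009.1 [(366343, 366782), (366883, 367317)] 873
```

The first coding block is 439 bases long, and 439 mod 3 = 1, which agrees with the exonFrames value of 1 recorded for the second exon of this plus-strand gene.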

Connected Tables and Joining Fields
        equCab2.nscanPep.name (via nscanGene.name)
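As a rough illustration of the join above, the sketch below uses an in-memory SQLite database as a stand-in for the MariaDB tables. The `nscanPep` columns (`name`, `seq`) are an assumption, only a few `nscanGene` columns are created, and the peptide value is a placeholder string rather than real N-SCAN output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE nscanGene (name TEXT, chrom TEXT, strand TEXT,
                            txStart INTEGER, txEnd INTEGER);
    CREATE TABLE nscanPep (name TEXT, seq TEXT);  -- assumed columns
""")
# First sample row of nscanGene (subset of columns).
conn.execute("INSERT INTO nscanGene VALUES ('chr1.001.1', 'chr1', '+', 6739, 16275)")
# Placeholder peptide, for illustration only.
conn.execute("INSERT INTO nscanPep VALUES ('chr1.001.1', 'PLACEHOLDER')")

# Join the predicted peptide to its gene model on the shared name field.
rows = conn.execute("""
    SELECT g.name, g.chrom, g.txStart, g.txEnd, p.seq
    FROM nscanGene AS g
    JOIN nscanPep AS p ON p.name = g.name
""").fetchall()
print(rows)
```

The same `SELECT ... JOIN ... ON p.name = g.name` statement should carry over to a MariaDB client, provided the real `nscanPep` table has the assumed columns.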

Sample Rows
 
| bin | name | chrom | strand | txStart | txEnd | cdsStart | cdsEnd | exonCount | exonStarts | exonEnds | score | name2 | cdsStartStat | cdsEndStat | exonFrames |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 585 | chr1.001.1 | chr1 | + | 6739 | 16275 | 7126 | 16275 | 13 | 6739,11198,11908,12331,13000,13299,14082,14420,15297,15446,15640,15879,16197, | 7196,11261,11968,12406,13048,13354,14172,14484,15364,15570,15751,15967,16275, | 0 | chr1.001 | cmpl | cmpl | 0,1,1,1,1,1,2,2,0,1,2,2,0, |
| 585 | chr1.002.1 | chr1 | - | 26182 | 79537 | 26182 | 76729 | 10 | 26182,29598,32905,33246,55771,66546,70706,71047,76714,79171, | 26274,30201,32998,33373,55833,67068,70799,71174,76773,79537, | 0 | chr1.002 | cmpl | cmpl | 1,1,1,0,1,1,1,0,0,-1, |
| 586 | chr1.003.1 | chr1 | - | 132458 | 143876 | 132458 | 143850 | 9 | 132458,133461,134140,136900,138150,138683,139178,142530,143673, | 132643,133534,134328,137042,138327,138844,139328,142690,143876, | 0 | chr1.003 | cmpl | cmpl | 1,0,1,0,0,1,1,0,0, |
| 586 | chr1.004.1 | chr1 | - | 171785 | 210215 | 171785 | 210186 | 5 | 171785,183810,200411,201850,209836, | 172632,184727,200819,201989,210215, | 0 | chr1.004 | cmpl | cmpl | 2,0,0,2,0, |
| 73 | chr1.005.1 | chr1 | - | 234227 | 274259 | 234227 | 274051 | 11 | 234227,237017,258760,259225,266776,268647,269283,271614,272080,272773,273948, | 235295,237104,259100,259547,266971,268983,269601,271935,272386,273082,274259, | 0 | chr1.005 | cmpl | cmpl | 0,0,2,1,1,1,1,1,1,1,0, |
| 587 | chr1.006.1 | chr1 | + | 293524 | 294551 | 293594 | 294551 | 1 | 293524, | 294551, | 0 | chr1.006 | cmpl | cmpl | 0, |
| 587 | chr1.007.1 | chr1 | - | 332217 | 333396 | 332217 | 333153 | 1 | 332217, | 333396, | 0 | chr1.007 | cmpl | cmpl | 0, |
| 587 | chr1.008.1 | chr1 | + | 356462 | 358918 | 357982 | 358918 | 2 | 356462,357892, | 356726,358918, | 0 | chr1.008 | cmpl | cmpl | -1,0, |
| 587 | chr1.009.1 | chr1 | + | 366271 | 367317 | 366343 | 367317 | 2 | 366271,366883, | 366782,367317, | 0 | chr1.009 | cmpl | cmpl | 0,1, |
| 73 | chr1.010.1 | chr1 | - | 383043 | 397171 | 383043 | 396889 | 11 | 383043,383438,387319,387801,388467,390099,390550,391090,393238,396000,396864, | 383207,383551,387401,387898,388529,390190,390607,391171,393330,396075,397171, | 0 | chr1.010 | cmpl | cmpl | 1,2,1,0,1,0,0,0,1,1,0, |
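The bin values in the sample rows can be reproduced from txStart and txEnd. The sketch below assumes the standard (non-extended) UCSC binning scheme, with 128 kb bins at the finest level and each coarser level eight times larger; the constant and function names are illustrative.

```python
# Offsets of the first bin at each level, finest (128 kb) level first.
BIN_OFFSETS = (512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0)
BIN_FIRST_SHIFT = 17  # 2**17 = 128 kb, size of the finest bins
BIN_NEXT_SHIFT = 3    # each coarser level is 8 times larger


def bin_from_range(start: int, end: int) -> int:
    """Return the smallest bin containing the 0-based, half-open range [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError(f"range {start}-{end} is outside the standard binning scheme")


print(bin_from_range(6739, 16275))     # chr1.001.1, prints 585
print(bin_from_range(234227, 274259))  # chr1.005.1, prints 73
```

chr1.005.1 lands in bin 73 rather than a bin near 586 because it crosses the 128 kb boundary at position 262144, so it is assigned to the next level up, where bins are 1 Mb wide.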

Note: all start coordinates in our database are 0-based, not 1-based. See explanation here.
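To make the note concrete, here is a minimal sketch of converting a stored range to a 1-based, fully closed position string. It assumes, as is conventional for UCSC tables, that the end coordinate is exclusive, so only the start needs shifting; the function name is illustrative.

```python
def to_one_based_position(chrom: str, start0: int, end: int) -> str:
    """Convert a 0-based, half-open range to a 1-based, inclusive position string."""
    return f"{chrom}:{start0 + 1}-{end}"


# txStart and txEnd of chr1.001.1 from the sample rows.
print(to_one_based_position("chr1", 6739, 16275))  # prints chr1:6740-16275
print(16275 - 6739)  # feature length in bases, prints 9536
```

With half-open storage, the length of a feature is simply end minus start, with no off-by-one correction.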

N-SCAN (nscanGene) Track Description
 

Description

This track shows gene predictions using the N-SCAN gene structure prediction software provided by the Computational Genomics Lab at Washington University in St. Louis, MO, USA.

Methods

N-SCAN PASA-EST

N-SCAN combines biological-signal modeling in the target genome sequence with information from a multiple-genome alignment to generate de novo gene predictions. It extends the TWINSCAN target-informant genome pair to allow for an arbitrary number of informant sequences as well as richer models of sequence evolution. N-SCAN models the phylogenetic relationships between the aligned genome sequences, context-dependent substitution rates, insertions, and deletions.

For creating predictions on horse, N-SCAN uses human (hg18) as the informant.

N-SCAN PASA-EST incorporates EST alignments into N-SCAN. Similar to the conservation sequence models in TWINSCAN, separate probability models are developed for EST alignments to genomic sequence in exons, introns, splice sites, and UTRs, reflecting the EST alignment patterns in these regions. N-SCAN PASA-EST is more accurate than N-SCAN while retaining the ability to discover novel genes to which no ESTs align.

No manual annotation was performed to generate any of the gene models.

Credits

Thanks to Michael Brent's Computational Genomics Group at Washington University in St. Louis for providing these data.

Special thanks for this implementation of N-SCAN to Aaron Tenney in the Brent lab, and Robert Zimmermann, currently at Max F. Perutz Laboratories in Vienna, Austria.

References

Gross SS, Brent MR. Using multiple alignments to improve gene prediction. J Comput Biol. 2006 Mar;13(2):379-93. PMID: 16597247

Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003 Oct 1;31(19):5654-66. PMID: 14500829; PMC: PMC206470

Korf I, Flicek P, Duan D, Brent MR. Integrating genomic homology into gene structure prediction. Bioinformatics. 2001;17 Suppl 1:S140-8. PMID: 11473003

van Baren MJ, Brent MR. Iterative gene prediction and pseudogene removal improves genome annotation. Genome Res. 2006 May;16(5):678-85. PMID: 16651666; PMC: PMC1457044