Schema for Web Sequences - DNA Sequences in Web Pages Indexed by Bing.com / Microsoft Research
Database: hg19
Primary Table: pubsBingBlat
Row Count: 313,510
Data last updated: 2013-11-21
Format description: publications blat feature table, in bed12+ format
On download server: MariaDB table dump directory
field | example | SQL type | info | description |
bin | 585 | smallint(5) unsigned | range | Indexing field to speed chromosome range queries. |
chrom | chr1 | varchar(255) | values | chromosome |
chromStart | 14789 | int(10) unsigned | range | start position on chromosome |
chromEnd | 15004 | int(10) unsigned | range | end position on chromosome |
name | 3500336380 | varchar(255) | values | internal articleId, article that matches here |
score | 75 | int(10) unsigned | range | score of feature |
strand | | char(1) | values | strand of feature |
thickStart | 14789 | int(10) unsigned | range | start of exons |
thickEnd | 15004 | int(10) unsigned | range | end of exons |
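reserved | 8421504 | int(10) unsigned | range | reserved field, not documented for this table; it occupies the itemRgb position of standard BED12 (8421504 = 0x808080, i.e. RGB 128,128,128) |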
blockCount | 2 | int(10) unsigned | range | number of blocks |
blockSizes | 40,35 | longblob | | size of blocks |
chromStarts | 0,180 | longblob | | A comma-separated list of block starts |
tSeqTypes | g | varchar(255) | values | comma-separated list of matching sequence db (g=genome, p=protein, c=cDNA) |
seqIds | 350033638000000000 | blob | values | comma-separated list of matching seqIds |
seqRanges | 0-75 | blob | values | ranges start-end on sequence that matched, one for each seqId |
publisher | | varchar(255) | values | publisher of article, for hgTracks feature filter |
pmid | | varchar(255) | values | PMID of article, for annoGrator output, avoids table join |
doi | | varchar(255) | values | doi of article, for annoGrator output, avoids table join |
issn | | varchar(255) | values | issn of journal |
journal | | varchar(255) | values | name of journal |
title | Tophat, Cufflinks and repl... | varchar(255) | values | title of article, for genome browser mouseover |
firstAuthor | seqanswers.com | varchar(255) | values | first author family name of article, for genome browser |
year | 0 | varchar(255) | values | year of article, for genome browser |
impact | 0 | varchar(255) | values | impact factor of journal, for genome browser coloring, derived from official impact factors: max impact is 25, value is scaled to 0-255 |
classes | | varchar(255) | values | classes assigned to journal article, for genome browser coloring |
locus | WASH2P,WASH7P | varchar(255) | values | closest gene symbols, one or two, comma-separated |
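The bin field can be recomputed from chromStart and chromEnd. The following is a minimal sketch, assuming this table uses the standard UCSC binning scheme (five levels, smallest bins of 128 kb); the function name bin_from_range is illustrative and not part of the schema. The sample rows on this page (bins 585, 586 and 587) are consistent with this scheme.

```python
# Constants of the standard UCSC binning scheme (assumed to apply to this table).
BIN_OFFSETS = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0]
BIN_FIRST_SHIFT = 17  # smallest bins cover 2**17 bases (128 kb)
BIN_NEXT_SHIFT = 3    # each higher level covers 8x more bases


def bin_from_range(start: int, end: int) -> int:
    """Return the smallest bin that fully contains the 0-based,
    half-open range [start, end)."""
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    raise ValueError("range does not fit the standard scheme (limit 512 Mb)")


if __name__ == "__main__":
    # chromStart/chromEnd of three sample rows from this table.
    for start, end in [(14789, 15004), (137603, 138008), (352265, 352290)]:
        print(start, end, bin_from_range(start, end))
```

A range query can restrict itself to the bins that may overlap the region of interest before comparing chromStart and chromEnd, which is what makes the field useful as an index.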
Connected Tables and Joining Fields
Sample Rows
bin | chrom | chromStart | chromEnd | name | score | strand | thickStart | thickEnd | reserved | blockCount | blockSizes | chromStarts | tSeqTypes | seqIds | seqRanges | publisher | pmid | doi | issn | journal | title | firstAuthor | year | impact | classes | locus |
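---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|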
585 | chr1 | 14789 | 15004 | 3500336380 | 75 | | 14789 | 15004 | 8421504 | 2 | 40,35 | 0,180 | g | 350033638000000000 | 0-75 | | | | | | Tophat, Cufflinks and replicates - Page 2 - SEQanswers | seqanswers.com | 0 | 0 | | WASH2P,WASH7P |
585 | chr1 | 15017 | 15590 | 3500327042 | 381 | | 15017 | 15590 | 8421504 | 2 | 326,55 | 0,518 | g | 350032704200000008 | 0-747 | | | | | | Research Technologies at Indiana University | biomedapp.iu.edu | 0 | 0 | | WASH7P |
585 | chr1 | 68858 | 68895 | 3500020489 | 37 | | 68858 | 68895 | 8421504 | 1 | 37 | 0 | g | 350002048900000000,350002048900000001 | 0-36,0-36 | | | | | | Genome mapability - Musings from a PhD candidate | davetang.org | 0 | 0 | | OR4F5 |
585 | chr1 | 69170 | 69479 | 3500359797 | 142 | | 69170 | 69479 | 8421504 | 2 | 76,66 | 0,243 | c | 350035979700000000,350035979700000002 | 0-76,10-76 | | | | | | CRAM compression and TLEN SAM's field - SEQanswers | seqanswers.com | 0 | 0 | | OR4F5 |
585 | chr1 | 70013 | 70230 | 3500427570 | 150 | | 70013 | 70230 | 8421504 | 2 | 75,75 | 0,142 | g | 350042757000000000,350042757000000001 | 0-75,0-75 | | | | | | Inconsistency with SAM flag output? - SEQanswers | seqanswers.com | 0 | 0 | | OR4F5 |
585 | chr1 | 98860 | 98888 | 3500207083 | 26 | | 98860 | 98888 | 8421504 | 3 | 5,7,14 | 0,6,14 | g | 350020708300000108,350020708300000060,350020708300000239 | 0-24,0-21,0-21 | | | | | | Method For The Simultaneous Determination Of Blood Group And Platelet Antigen Genotypes | .freshpatents.com | 0 | 0 | | OR4F5 |
586 | chr1 | 137603 | 138008 | 3500170315 | 405 | | 137603 | 138008 | 8421504 | 1 | 405 | 0 | p | 350017031500015076,350017031500015074 | 0-135,0-270 | | | | | | Balding D. (2007) Handbook of Statistical Genetics | www.scribd.com | 0 | 0 | | OR4F5 |
586 | chr1 | 139485 | 143008 | 3500419332 | 1794 | | 139485 | 143008 | 8421504 | 2 | 65,1729 | 0,1794 | g | 350041933200000004,350041933200000000,350041933200000001,350041933200000002,350041933200000003 | 0-1263,0-1859,0-1852,0-1860,0-576 | | | | | | PPT – Evolution by Genome Duplication PowerPoint presentation \| free to view | www.powershow.com | 0 | 0 | | OR4F5 |
586 | chr1 | 141535 | 143008 | 3500270480 | 1372 | | 141535 | 143008 | 8421504 | 24 | 57,60,58,59,61,59,60,59,59,62,61,58,60,58,16,59,59,59,59,57,57,59,58,58 | 0,61,125,187,250,314,377,441,503,566,631,695,756,819,881,919,981,1044,1107,1170,1230,1291,1353,1415 | g | 350027048000000003,350027048000000002 | 0-902,0-525 | | | | | | Chen-Kung Chou 3-22-2004 | www.dls.ym.edu.tw | 0 | 0 | | OR4F5 |
587 | chr1 | 352265 | 352290 | 3500427583 | 25 | | 352265 | 352290 | 8421504 | 1 | 25 | 0 | g | 350042758300000000 | 74-99 | | | | | | 2010-11-10.GENSIPS.Assembly in the Clouds | chatzlab.cshl.edu | 0 | 0 | | OR4F29 |
Note: all start coordinates in our database are 0-based, not 1-based. End coordinates are exclusive, so a feature covers chromEnd - chromStart bases.
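As an illustration of how the coordinate fields fit together, here is a minimal sketch that expands blockSizes and chromStarts into absolute block coordinates, converts a 0-based range to the 1-based form usually shown to users, and pairs each seqId with its seqRange. The function names are hypothetical, and the code assumes the comma-separated layout seen in the sample rows.

```python
from typing import List, Tuple


def block_ranges(chrom_start: int, block_sizes: str,
                 chrom_starts: str) -> List[Tuple[int, int]]:
    """Return absolute 0-based, half-open (start, end) pairs for each block.
    chromStarts holds offsets relative to chromStart."""
    sizes = [int(s) for s in block_sizes.split(",") if s]
    offsets = [int(s) for s in chrom_starts.split(",") if s]
    if len(sizes) != len(offsets):
        raise ValueError("blockSizes and chromStarts differ in length")
    return [(chrom_start + off, chrom_start + off + size)
            for off, size in zip(offsets, sizes)]


def to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open range to a 1-based, closed range."""
    return start + 1, end


def pair_seq_matches(seq_ids: str, seq_ranges: str) -> List[Tuple[str, int, int]]:
    """Pair each seqId with its 'start-end' range on the matched sequence."""
    ids = seq_ids.split(",")
    ranges = seq_ranges.split(",")
    if len(ids) != len(ranges):
        raise ValueError("seqIds and seqRanges differ in length")
    result = []
    for seq_id, rng in zip(ids, ranges):
        start, end = rng.split("-")
        result.append((seq_id, int(start), int(end)))
    return result


if __name__ == "__main__":
    # Values taken from the first and fourth sample rows.
    print(block_ranges(14789, "40,35", "0,180"))
    print(to_one_based(14789, 15004))
    print(pair_seq_matches("350035979700000000,350035979700000002", "0-76,10-76"))
```

For the first sample row, the last block ends at 14789 + 180 + 35 = 15004, which equals chromEnd, as expected for a BED12-style feature.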
Web Sequences (pubsBingBlat) Track Description
Description
This track is powered by Bing! and Microsoft Research. UCSC collaborators at
Microsoft Research (Bob Davidson, David Heckerman) implemented a DNA sequence
detector and processed thirty days of web crawler updates, which cover
roughly 40 billion web pages. The detected sequences were mapped to the genome with BLAT.
Display Convention and Configuration
The track indicates the location of sequences found on web pages and
mapped to the genome, labelled with the web page URL. If the web page includes
invisible metadata, the first author and year of publication
are shown instead of the URL. All
matches of one web page are grouped ("chained") together.
Web page titles are shown when you move the mouse cursor over the features.
Thicker parts of the features (exons) represent matching sequences;
thin lines connect them to other matches from the same web page within 30 kbp.
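The grouping described above can be sketched as follows. This is only an illustration, not the code used to build the track: the function name chain_matches, the tuple layout, and the exact gap rule (distance between the end of one match and the start of the next, at most 30,000 bp, on the same chromosome) are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

MAX_GAP = 30_000  # 30 kbp, as stated in the track description

Match = Tuple[str, str, int, int]            # (article_id, chrom, start, end)
Chain = Tuple[str, str, List[Tuple[int, int]]]


def chain_matches(matches: Iterable[Match], max_gap: int = MAX_GAP) -> List[Chain]:
    """Group matches of the same web page on the same chromosome into
    chains whose consecutive members are at most max_gap apart."""
    by_key: Dict[Tuple[str, str], List[Tuple[int, int]]] = defaultdict(list)
    for article_id, chrom, start, end in matches:
        by_key[(article_id, chrom)].append((start, end))

    chains: List[Chain] = []
    for (article_id, chrom), blocks in sorted(by_key.items()):
        blocks.sort()
        current = [blocks[0]]
        for start, end in blocks[1:]:
            last_end = max(e for _, e in current)
            if start - last_end <= max_gap:
                current.append((start, end))
            else:
                chains.append((article_id, chrom, current))
                current = [(start, end)]
        chains.append((article_id, chrom, current))
    return chains


if __name__ == "__main__":
    # Hypothetical matches: page "A" has two nearby hits and one distant hit.
    demo = [("A", "chr1", 100, 140), ("A", "chr1", 280, 315),
            ("A", "chr1", 50000, 50040), ("B", "chr1", 200, 230)]
    for chain in chain_matches(demo):
        print(chain)
```

Each resulting chain corresponds to one feature in the table, with its members stored as blocks in blockSizes and chromStarts.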
Methods
All file types (PDFs and various Microsoft Office formats) were converted to
text. The results were processed to find groups of words that look like DNA/RNA
sequences. These were then mapped with BLAT to the human genome using the same
software as used in the Publication track.
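The detector itself is not described on this page. Purely to illustrate the idea of finding groups of words that look like DNA/RNA sequences, here is a simplified sketch: it joins consecutive words made only of nucleotide letters and keeps runs above a length threshold. The function name, the letter set and the threshold of 20 are assumptions, not the parameters used by Microsoft Research or UCSC.

```python
import re
from typing import List

# Words consisting only of nucleotide letters (N allowed as a wildcard).
DNA_WORD = re.compile(r"^[ACGTUNacgtun]+$")
MIN_LENGTH = 20  # hypothetical minimum length of a reported sequence


def find_dna_like_runs(text: str, min_length: int = MIN_LENGTH) -> List[str]:
    """Return uppercase sequences built from runs of consecutive
    DNA-like words whose joined length is at least min_length."""
    runs: List[str] = []
    current: List[str] = []

    def flush() -> None:
        seq = "".join(current).upper()
        if len(seq) >= min_length:
            runs.append(seq)
        current.clear()

    for word in text.split():
        token = word.strip(".,;:()[]")
        if token and DNA_WORD.match(token):
            current.append(token)
        else:
            flush()
    flush()
    return runs


if __name__ == "__main__":
    sample = "The primer was ACGTACGTAC GTTTGACCAA used at a high dose"
    print(find_dna_like_runs(sample))
```

The length threshold matters because short English words such as "a", "at" or "tag" also consist only of nucleotide letters; without it they would be reported as sequences.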
Credits
DNA sequence detection by Bob Davidson at Microsoft Research.
HTML parsing and sequence mapping by Maximilian Haeussler at UCSC.
References
Aerts S, Haeussler M, van Vooren S, Griffith OL, Hulpiau P, Jones SJ, Montgomery SB, Bergman CM, Open Regulatory Annotation Consortium.
Text-mining assisted regulatory annotation.
Genome Biol. 2008;9(2):R31.
PMID: 18271954; PMC: PMC2374703
Haeussler M, Gerner M, Bergman CM.
Annotating genes and genomes with DNA sequences extracted from biomedical articles.
Bioinformatics. 2011 Apr 1;27(7):980-6.
PMID: 21325301; PMC: PMC3065681
Van Noorden R.
Trouble at the text mine.
Nature. 2012 Mar 7;483(7388):134-5.