#####################################
SNiPA v3 data release
#####################################

Folders:

====================================================================
1. annotation/
====================================================================
This folder contains the annotation data for variants as well as the
SNiPA gene and regulatory build files.
Holds one folder for each supported assembly of the human genome.

1.1 annotation/grch37/
--------------------------------------------------------------------
This folder contains all annotation data mapped to genome assembly
GRCh37 (currently the only assembly supported by SNiPA).
Holds one folder for each SNiPA annotation release as designated by
the used version of the Ensembl database.

1.1.1 annotation/grch37/ensembl*/
--------------------------------------------------------------------
These folders contain the following files:

\
 -| chr*.tabix.gz
 -| chr*.tabix.gz.tbi
--------------------------------------------------------------------
Tabix-indexed annotation files for all variants on this chromosome.
The *.tbi-files hold the tabix indices. The *.gz files list the
actual data.
Columns:
  CHR         - chromosome
  POS         - position of the variant
  RSID        - the variant's dbSNP rs-identifier (if available)
  RSALIAS     - previous dbSNP rs-identifier(s) for this variant
                (if available)
  FUNC        - field used for plotting
  MULTIPLE    - field used for plotting
  DISEASE     - field used for plotting
  ANNOTATION  - tooltip used in plots
  PHPARRAY    - serialized PHP array of predicted effects. This array
                contains all data listed in the SNiPA cards. Can be
                accessed in PHP and other scripting languages (e.g.
                Perl using module PHP::Serialization)
  COMPEFFECTS - comma-separated list of assigned SNiPA effect
                categories (see Supplemental Text)

\
 -| snipa_v*.genes.txt
--------------------------------------------------------------------
SNiPA gene build.
Columns:
  ID         - Ensembl gene ID
  NAME       - Ensembl assigned gene name
  CHR        - chromosome
  START      - start position
  STOP       - end position
  SIZE       - size of gene location
  HIGHLIGHT  - field used for plotting
  ANNOTATION - tooltip used in plots
  LINK       - outlink to Ensembl gene. IMPORTANT: links for older
               releases may not work after Ensembl updates
  PHPARRAY   - serialized PHP array containing synonyms for the gene
               and its description

\
 -| snipa_v*.genes.synonyms.txt
--------------------------------------------------------------------
Synonym mapping of Ensembl gene IDs to many other gene databases.
  COL1 - Ensembl gene ID
  COL2 - external gene ID (e.g. Entrez ID, HGNC ID, VEGA ID, etc.)

\
 -| snipa_v*.regulatory.%.txt
--------------------------------------------------------------------
SNiPA regulatory build, split into three files according to the
source.
  % = encode     : ENCODE regulatory clusters obtained at Ensembl
  % = encode.dhs : DNase hypersensitive sites from Thurman et al.
  % = fantom5    : expressed promoters/enhancers obtained from FANTOM5
Format is the same for all three files:
  COL1/NAME  - name of the element
  COL2/CHR   - chromosome
  COL3/START - start position
  COL4/STOP  - end position
  SIZE       - element size
  ANNOTATION - tooltip used in plots
  LINK       - outlink to Ensembl (either to the element if available
               or to the genomic region in the Ensembl genome browser)
  PHPARRAY   - serialized PHP array containing further information

====================================================================
2. genomic/
====================================================================
This folder contains the basic variant data including pairwise LD
values.

2.1 genomic/grch37/
--------------------------------------------------------------------
This folder contains all variant data mapped to genome assembly
GRCh37 (currently the only assembly supported by SNiPA).
Holds one folder for each 1000 Genomes release.
Folder name format:
  1kgp = 1000 Genomes Project; p_ = phase _; v_ = version _.
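
The folder name format above can be decoded with a small regular
expression. The following Python sketch is illustrative only and is
not part of the SNiPA release: the function name parse_release_folder
and the Release tuple are hypothetical, and it assumes that phase and
version are always plain integers, as in the folder names 1kgpp1v3
and 1kgpp3v5 mentioned in this file.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the documented format "1kgp" + "p<phase>" + "v<version>".
# Assumption: phase and version are non-negative integers.
_RELEASE_RE = re.compile(r"^1kgpp(\d+)v(\d+)$")


class Release(NamedTuple):
    """Phase and version of a 1000 Genomes release folder (hypothetical type)."""
    phase: int
    version: int


def parse_release_folder(name: str) -> Optional[Release]:
    """Return the release encoded in a folder name, or None if it does not match."""
    match = _RELEASE_RE.match(name)
    if match is None:
        return None
    return Release(phase=int(match.group(1)), version=int(match.group(2)))


if __name__ == "__main__":
    for folder in ("1kgpp1v3", "1kgpp3v5", "ensembl75"):
        print(folder, parse_release_folder(folder))
```

Returning None for non-matching names makes it easy to skip unrelated
folders when scanning a directory listing.
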

2.1.1 genomic/grch37/1kgpp*v*/
--------------------------------------------------------------------
Variant data for the 1000 genomes release specified by the folder
name. Contains variant data for each super-population.
Populations:
  afr - African
  amr - American
  asn - Asian (until 1kgpp1v3)
  eas - East Asian (from 1kgpp3v5 onwards)
  eur - European
  sas - South Asian (from 1kgpp3v5 onwards)

Each population subfolder contains three folders:

\
 -| ld/
--------------------------------------------------------------------
Contains for each chromosome one *.gz (data) and one *.gz.tbi (tabix
index) file for LD data.
Data file format:
  CHR           - chromosome
  POS1          - position variant 1
  POS2          - position variant 2
  R2, D, DPRIME - pairwise LD measures
  RSID          - dbSNP rs-identifier for variant 2
  RSALIAS       - dbSNP rs-identifier synonym(s) for variant 2
  MINOR         - minor allele of variant 2 in this population
  MAF           - minor allele frequency of variant 2 in this
                  population
  MAJOR         - major allele of variant 2 in this population
  CMMB          - recombination rate [cM per MB] of variant 2
  CM            - genetic position [cM] of variant 2

\
 -| mapping/
--------------------------------------------------------------------
Contains one *.gz (data) and one *.gz.tbi (tabix index) file for
rs-identifier to position mapping.
Data file format:
  COL1 - chromosome
  COL2 - position
  COL3 - dbSNP rs-identifier

\
 -| self/
--------------------------------------------------------------------
Contains for each chromosome one *.gz (data) and one *.gz.tbi (tabix
index) file for variant data.
The data file format is the same as for LD data, but these files only
store the information for each original bi-allelic variant in this
population:
  CHR           - chromosome
  POS1          - position of the variant
  POS2          - same as POS1
  R2, D, DPRIME - 1 (each variant perfectly correlates with itself)
  RSID          - the variant's dbSNP rs-identifier
  RSALIAS       - dbSNP rs-identifier synonym(s) for the variant
  MINOR         - the variant's minor allele in this population
  MAF           - the variant's minor allele frequency in this
                  population
  MAJOR         - the variant's major allele in this population
  CMMB          - the variant's recombination rate [cM per MB]
  CM            - the variant's genetic position [cM]
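
Because the ld/ and self/ files share one column layout, a single
reader can serve both. The Python sketch below is a minimal example
under stated assumptions, not an official SNiPA tool: it assumes the
fields are tab-separated and appear in exactly the order listed
above (with R2, D, DPRIME as three consecutive columns), and all
function names are hypothetical. It streams a whole file with the
standard gzip module and does not use the tabix index.

```python
import gzip
from typing import Dict, Iterator, Union

Value = Union[str, int, float]

# Column order as listed in this README. Tab separation is an assumption.
LD_COLUMNS = [
    "CHR", "POS1", "POS2", "R2", "D", "DPRIME", "RSID",
    "RSALIAS", "MINOR", "MAF", "MAJOR", "CMMB", "CM",
]
_INT_COLUMNS = {"POS1", "POS2"}
_FLOAT_COLUMNS = {"R2", "D", "DPRIME", "MAF", "CMMB", "CM"}


def _convert(name: str, value: str) -> Value:
    """Convert numeric columns; keep the raw text if conversion fails."""
    try:
        if name in _INT_COLUMNS:
            return int(value)
        if name in _FLOAT_COLUMNS:
            return float(value)
    except ValueError:
        pass  # e.g. an empty or non-numeric field
    return value


def parse_ld_line(line: str) -> Dict[str, Value]:
    """Split one data line into a dict keyed by the documented column names."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(LD_COLUMNS):
        raise ValueError(
            f"expected {len(LD_COLUMNS)} fields, got {len(fields)}"
        )
    return {name: _convert(name, value) for name, value in zip(LD_COLUMNS, fields)}


def is_self_record(record: Dict[str, Value]) -> bool:
    """True if the record pairs a variant with itself, as in the self/ files."""
    return record["POS1"] == record["POS2"]


def iter_ld_records(path: str) -> Iterator[Dict[str, Value]]:
    """Stream all records from a *.gz data file, skipping blank and '#' lines."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if not line.strip() or line.startswith("#"):
                continue
            yield parse_ld_line(line)


if __name__ == "__main__":
    # Invented example line for illustration only; not taken from the release.
    example = "1\t1000\t1000\t1\t1\t1\trs0000\t\tA\t0.25\tG\t1.5\t0.01\n"
    record = parse_ld_line(example)
    print(record["RSID"], record["MAF"], is_self_record(record))
```

For region queries on large chromosomes, a tabix-aware reader that
uses the *.gz.tbi index would avoid scanning the whole file.
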