#####################################
        SNiPA v3 data release
#####################################

Folders:
====================================================================
1. annotation/
====================================================================
This folder contains the annotation data for variants as well as
the SNiPA gene and regulatory build files. It holds one subfolder
for each supported assembly of the human genome.

1.1 annotation/grch37/
--------------------------------------------------------------------
This folder contains all annotation data mapped to genome assembly
GRCh37 (currently the only assembly supported by SNiPA). It holds
one subfolder for each SNiPA annotation release, named after the
version of the Ensembl database that was used.

1.1.1 annotation/grch37/ensembl*/
--------------------------------------------------------------------
These folders contain the following files:

\
-| chr*.tabix.gz
-| chr*.tabix.gz.tbi
--------------------------------------------------------------------
Tabix-indexed annotation files for all variants on this chromosome.
The *.tbi files hold the tabix indices; the *.gz files hold the
actual data. Columns:
CHR - chromosome
POS - position of the variant
RSID - the variant's dbSNP rs-identifier (if available)
RSALIAS - previous dbSNP rs-identifier(s) for this variant (if available)
FUNC - field used for plotting
MULTIPLE - field used for plotting
DISEASE - field used for plotting
ANNOTATION - tooltip used in plots
PHPARRAY - serialized PHP array of predicted effects. This array
           contains all data listed in the SNiPA cards. Can be
           accessed in PHP and other scripting languages (e.g.
           Perl using module PHP::Serialization)
COMPEFFECTS - comma-separated list of assigned SNiPA effect
           categories (see Supplemental Text)
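
Example (Python sketch): splitting one record of a chr*.tabix.gz
file into the columns listed above. This assumes that fields are
tab-separated and appear in the listed order; this README does not
state either explicitly. The function name and the sample values
are hypothetical. PHPARRAY is kept as a raw string.

```python
ANNOTATION_COLUMNS = [
    "CHR", "POS", "RSID", "RSALIAS", "FUNC", "MULTIPLE",
    "DISEASE", "ANNOTATION", "PHPARRAY", "COMPEFFECTS",
]


def parse_annotation_line(line):
    """Split one annotation line into a dict keyed by column name.

    Assumption: tab-separated fields in the order given in the README.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(ANNOTATION_COLUMNS):
        raise ValueError(
            "expected %d fields, got %d" % (len(ANNOTATION_COLUMNS), len(fields))
        )
    record = dict(zip(ANNOTATION_COLUMNS, fields))
    record["POS"] = int(record["POS"])
    # COMPEFFECTS is a comma-separated list of effect categories.
    record["COMPEFFECTS"] = [
        item.strip() for item in record["COMPEFFECTS"].split(",") if item.strip()
    ]
    return record


# Made-up sample values, for illustration only.
sample_line = "\t".join(
    ["1", "12345", "rs_example", "", "0", "0", "0",
     "tooltip text", "a:0:{}", "effect_a,effect_b"]
)
print(parse_annotation_line(sample_line))
```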

\
-| snipa_v*.genes.txt
--------------------------------------------------------------------
SNiPA gene build. Columns:
ID - Ensembl gene ID
NAME - Ensembl assigned gene name
CHR - chromosome
START - start position
STOP - end position
SIZE - size of gene location
HIGHLIGHT - field used for plotting
ANNOTATION - tooltip used in plots
LINK - outlink to Ensembl gene. IMPORTANT: links for older releases
           may not work after Ensembl updates
PHPARRAY - serialized PHP array containing synonyms for the gene and
           its description
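
Example (Python sketch): reading the gene build and listing genes
that overlap a position. This assumes tab-separated fields in the
listed order and inclusive START/STOP coordinates; neither is
stated in this README. Rows without numeric START/STOP (e.g. a
possible header line) are skipped. All names and sample rows are
hypothetical.

```python
GENE_COLUMNS = [
    "ID", "NAME", "CHR", "START", "STOP", "SIZE",
    "HIGHLIGHT", "ANNOTATION", "LINK", "PHPARRAY",
]


def read_genes(lines):
    """Yield one dict per gene row from an iterable of text lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(GENE_COLUMNS):
            continue  # not a gene row under the assumed layout
        gene = dict(zip(GENE_COLUMNS, fields))
        if not (gene["START"].isdigit() and gene["STOP"].isdigit()):
            continue  # e.g. a header line
        gene["START"] = int(gene["START"])
        gene["STOP"] = int(gene["STOP"])
        yield gene


def genes_overlapping(genes, chrom, pos):
    """Return genes on `chrom` whose [START, STOP] interval contains `pos`."""
    return [
        g for g in genes
        if g["CHR"] == chrom and g["START"] <= pos <= g["STOP"]
    ]
```

Usage: pass an open text file handle to read_genes() and the
resulting list to genes_overlapping().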

\
-| snipa_v*.genes.synonyms.txt
--------------------------------------------------------------------
Synonym mapping of Ensembl gene IDs to many other gene databases.
COL1 - Ensembl gene IDs
COL2 - external gene ID (e.g. Entrez ID, HGNC ID, VEGA ID, etc.)
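
Example (Python sketch): building a reverse lookup from external
gene IDs to Ensembl gene IDs. This assumes two tab-separated
columns per line; the delimiter is not stated in this README. The
function name and the IDs used for testing are hypothetical.

```python
from collections import defaultdict


def load_synonyms(lines):
    """Map each external gene ID (COL2) to the set of Ensembl IDs (COL1)."""
    by_external = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        ensembl_id, external_id = fields[0], fields[1]
        by_external[external_id].add(ensembl_id)
    return by_external
```

A set is used per external ID because one external ID may map to
more than one Ensembl gene.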

\
-| snipa_v*.regulatory.%.txt
--------------------------------------------------------------------
SNiPA regulatory build, split into three files according to the
source.
% = encode : ENCODE regulatory clusters obtained from Ensembl
% = encode.dhs : DNase hypersensitive sites from Thurman et al.
% = fantom5 : expressed promoters/enhancers obtained from FANTOM5
Format is the same for all three files:
COL1/NAME - name of the element
COL2/CHR - chromosome
COL3/START - start position
COL4/STOP - end position
SIZE - element size
ANNOTATION - tooltip used in plots
LINK - outlink to Ensembl (either to the element if available or to
           the genomic region in the Ensembl genome browser)
PHPARRAY - serialized PHP array containing further information
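
Example (Python sketch): deriving the source of a regulatory build
file from its name, following the snipa_v*.regulatory.%.txt
pattern above. The function name and the file names used for
testing are hypothetical; the part matched by * is returned as an
uninterpreted string.

```python
import re

# Source labels as described in this README.
REGULATORY_SOURCES = {
    "encode": "ENCODE regulatory clusters obtained from Ensembl",
    "encode.dhs": "DNase hypersensitive sites from Thurman et al.",
    "fantom5": "expressed promoters/enhancers obtained from FANTOM5",
}

_REGULATORY_NAME = re.compile(
    r"snipa_v(.+?)\.regulatory\.(encode\.dhs|encode|fantom5)\.txt"
)


def regulatory_source(filename):
    """Return (version_string, source_key), or None if the name does not match."""
    match = _REGULATORY_NAME.fullmatch(filename)
    if match is None:
        return None
    return match.group(1), match.group(2)
```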


====================================================================
2. genomic/
====================================================================
This folder contains the basic variant data including pairwise LD
values.

2.1 genomic/grch37/
--------------------------------------------------------------------
This folder contains all variant data mapped to genome assembly
GRCh37 (currently the only assembly supported by SNiPA). It holds
one subfolder for each 1000 Genomes release. Folder name format:
1kgp = 1000 Genomes Project; p_ = phase _; v_ = version _.
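
Example (Python sketch): decoding a release folder name such as
1kgpp3v5 into phase and version. This assumes that phase and
version are plain integers, as in the folder names mentioned in
this README. The function name is hypothetical.

```python
import re

# "1kgp" + "p<phase>" + "v<version>", e.g. 1kgpp3v5
_RELEASE_NAME = re.compile(r"1kgpp(\d+)v(\d+)")


def parse_release(folder_name):
    """Return (phase, version) for a 1000 Genomes release folder name."""
    match = _RELEASE_NAME.fullmatch(folder_name)
    if match is None:
        raise ValueError("not a release folder name: %r" % folder_name)
    return int(match.group(1)), int(match.group(2))


print(parse_release("1kgpp3v5"))
```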

2.1.1 genomic/grch37/1kgpp*v*/
--------------------------------------------------------------------
Variant data for the 1000 Genomes release specified by the folder
name. Contains variant data for each super-population. Populations:
afr - African
amr - American
asn - Asian (until 1kgpp1v3)
eas - East Asian (from 1kgpp3v5 onwards)
eur - European
sas - South Asian (from 1kgpp3v5 onwards)
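
Example (Python sketch): listing the population subfolders to
expect for a given release. This assumes that releases can be
ordered by (phase, version) and that "until 1kgpp1v3" includes
that release. For releases between 1kgpp1v3 and 1kgpp3v5 this
README makes no statement, so the sketch returns only the three
populations common to all releases. The function name is
hypothetical.

```python
COMMON_POPULATIONS = ("afr", "amr", "eur")


def expected_populations(phase, version):
    """Return the sorted population codes expected for a release."""
    release = (phase, version)
    populations = set(COMMON_POPULATIONS)
    if release <= (1, 3):
        populations.add("asn")  # Asian, until 1kgpp1v3
    if release >= (3, 5):
        populations.update({"eas", "sas"})  # from 1kgpp3v5 onwards
    return sorted(populations)
```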

Each population subfolder contains three folders:
\
-| ld/
--------------------------------------------------------------------
Contains one *.gz (data) and one *.gz.tbi (tabix index) file per
chromosome for LD data. Data file format:
CHR - chromosome
POS1 - position variant 1
POS2 - position variant 2
R2, D, DPRIME - pairwise LD measures
RSID - dbSNP rs-identifier for variant 2
RSALIAS - dbSNP rs-identifier synonym(s) for variant 2
MINOR - minor allele of variant 2 in this population
MAF - minor allele frequency of variant 2 in this population
MAJOR - major allele of variant 2 in this population
CMMB - recombination rate [cM per Mb] of variant 2
CM - genetic position [cM] of variant 2
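
Example (Python sketch): parsing LD records and selecting the
variants correlated with a given variant 1. This assumes
tab-separated fields in the listed order, with R2, D and DPRIME
as three separate columns. The function names are hypothetical and
the default threshold of 0.8 is an arbitrary choice for
illustration. Only POS1, POS2 and R2 are converted to numbers;
all other fields stay strings.

```python
LD_COLUMNS = [
    "CHR", "POS1", "POS2", "R2", "D", "DPRIME", "RSID",
    "RSALIAS", "MINOR", "MAF", "MAJOR", "CMMB", "CM",
]


def parse_ld_line(line):
    """Split one LD line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(LD_COLUMNS):
        raise ValueError(
            "expected %d fields, got %d" % (len(LD_COLUMNS), len(fields))
        )
    record = dict(zip(LD_COLUMNS, fields))
    record["POS1"] = int(record["POS1"])
    record["POS2"] = int(record["POS2"])
    record["R2"] = float(record["R2"])
    return record


def correlated_variants(lines, pos1, min_r2=0.8):
    """Yield records for variant 1 at `pos1` with r2 >= `min_r2`."""
    for line in lines:
        record = parse_ld_line(line)
        if record["POS1"] == pos1 and record["R2"] >= min_r2:
            yield record
```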

\
-| mapping/
--------------------------------------------------------------------
Contains one *.gz (data) and one *.gz.tbi (tabix index) file for
rs-identifier to position mapping. Data file format:
COL1 - chromosome
COL2 - position
COL3 - dbSNP rs-identifier
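
Example (Python sketch): looking up the position of an
rs-identifier in the mapping data file. This assumes
tab-separated columns and reads the *.gz file with Python's gzip
module as a plain linear scan; it does not use the tabix index.
The function name is hypothetical.

```python
import gzip


def find_position(mapping_gz_path, rsid):
    """Return (chromosome, position) for `rsid`, or None if not found."""
    with gzip.open(mapping_gz_path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and fields[2] == rsid:
                return fields[0], int(fields[1])
    return None
```

The scan stops at the first matching line, so the cost grows with
the size of the file.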

\
-| self/
--------------------------------------------------------------------
Contains one *.gz (data) and one *.gz.tbi (tabix index) file per
chromosome for variant data. The data file format is the same as
for LD data, but each record only stores the information for one
original bi-allelic variant in this population:
CHR - chromosome
POS1 - position of the variant
POS2 - same as POS1
R2, D, DPRIME - 1 (each variant perfectly correlates with itself)
RSID - the variant's dbSNP rs-identifier 
RSALIAS - dbSNP rs-identifier synonym(s) for the variant
MINOR - the variant's minor allele in this population
MAF - the variant's minor allele frequency in this population
MAJOR - the variant's major allele in this population
CMMB - the variant's recombination rate [cM per Mb]
CM - the variant's genetic position [cM]
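
Example (Python sketch): checking that a record from self/ has
the properties described above (POS2 equal to POS1; R2, D and
DPRIME equal to 1). This assumes tab-separated fields in the
listed order. The function name is hypothetical.

```python
def is_self_record(line):
    """Return True if `line` looks like a self/ record as described above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 13:
        return False
    pos1, pos2 = fields[1], fields[2]
    if pos1 != pos2:
        return False
    try:
        # R2, D and DPRIME are documented as 1 for self records.
        return all(float(value) == 1.0 for value in fields[3:6])
    except ValueError:
        return False
```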