Biowine - Documentation

Biowine is a knowledge base of genomic data of Vitis vinifera, containing NGS data from several samples of grapevine. Biowine can be consulted through three types of search.

Modules

Single gene
Multiple genes
Pathways

The search sections can depict their results with Cytoscape web network visualization (http://www.cytoscape.org). A single gene search shows the protein-protein interactions (PPI) between the gene and its neighbors, together with all miRNAs targeting that gene. A miRNA search shows the protein-protein interactions among all of its target genes and their adjacent genes. Similarly, the multiple genes search and the pathway search show the protein-protein interactions among the selected genes and their neighbors.

The PPI network was downloaded from STRING (http://string-db.org) and mapped to the 12x genome. It contains interactions that have been experimentally validated in other species. Interactions with a score (probability of occurrence) below 0.4 are discarded; further details on the score computation can be found in [1]. The network was mapped from the 8x genome to the 12x genome using the correspondence provided in [2] (Additional file 2). An interaction involving an 8x gene that corresponds to multiple 12x genes is mapped to all the corresponding genes. Interactions among 8x genes that do not correspond to any 12x gene are discarded.
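
The filtering and mapping rules described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the tuple layout of an interaction and the gene identifiers are hypothetical, and the actual Biowine pipeline may differ.

```python
MIN_SCORE = 0.4  # score threshold stated in the text


def map_interactions(interactions, gene_map, min_score=MIN_SCORE):
    """Map (gene_a, gene_b, score) edges from 8x IDs to 12x IDs.

    gene_map maps an 8x gene ID to a list of 12x gene IDs
    (hypothetical layout; an unmapped gene is simply absent).
    """
    mapped = set()
    for gene_a, gene_b, score in interactions:
        if score < min_score:
            continue  # discard low-score interactions
        # A gene with several 12x counterparts yields one edge per
        # counterpart; a gene with no counterpart yields no edge.
        for new_a in gene_map.get(gene_a, []):
            for new_b in gene_map.get(gene_b, []):
                edge = (min(new_a, new_b), max(new_a, new_b), score)
                mapped.add(edge)
    return sorted(mapped)


# Hypothetical identifiers, for illustration only.
example_interactions = [
    ("g8_a", "g8_b", 0.9),  # kept, g8_a maps to two 12x genes
    ("g8_a", "g8_c", 0.3),  # dropped, score below 0.4
    ("g8_b", "g8_d", 0.7),  # dropped, g8_d has no 12x counterpart
]
example_map = {
    "g8_a": ["g12_1", "g12_2"],
    "g8_b": ["g12_3"],
    "g8_c": ["g12_4"],
}
example_edges = map_interactions(example_interactions, example_map)
```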

Single gene

Genes are identified by the Ensembl gene name. The information about a single gene can be obtained by typing the gene name in the corresponding text box and pressing the button Search. Biowine visualizes several sections with information on the requested gene. Sections can be expanded or collapsed by the dedicated links ("show" and "hide"). A description of each section follows.

Gene

In this section, the following data are given:

Ensembl Gene ID
Entrez Gene ID
Chr: chromosome
Strand: sense (+) or antisense (-)
Start: start nucleotide in the chromosome
End: end nucleotide in the chromosome

The nucleotide sequence of the gene can be shown or hidden through the dedicated links ("show" and "hide"). The gene can also be visualized by using GBrowse. Documentation on GBrowse can be found here.

miRNAs

Show a list of known microRNAs that target the given gene.

GoTerms

Visualize all annotations (GO Terms) of the gene.

mRNAs

Show the coding sequence (CDS) and the UTRs of the gene's transcripts. For each sequence, Type, Chr (chromosome), Strand (+ or -) and the limits of the sequence are shown.

Proteins

Show a list of the gene's products. Each protein is identified by its UniProt ID. The amino acid sequence can be shown or hidden by the dedicated links ("show" and "hide").

Pathways

This section gives a list of pathways that contain the given gene. Pathways are imported from KEGG.

SNPs and INDELs

This section reports the list of SNPs and INDELs found in the sequenced genomes. The variant calling pipeline used in this project is based on Samtools (please refer to this documentation for more details). Each SNP or INDEL is reported with a "Quality" value, which measures how confident Samtools is that the variant is a real variant. The GQ (Genotype Quality) value is numeric and encodes the genotype quality as a phred quality score, -10log_10 p(genotype call is wrong). The user can limit the displayed SNPs by choosing a range of GQ values through the dedicated combo boxes at the beginning of the section.
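
As a rough illustration of the phred encoding described above, the sketch below converts between a quality value and the corresponding error probability, and keeps only the variants whose GQ falls in a chosen range. The function names and the dictionary layout of a variant are hypothetical and are not part of Biowine.

```python
import math


def phred_to_error_prob(quality):
    """Probability that the call is wrong, given a phred quality."""
    return 10 ** (-quality / 10)


def error_prob_to_phred(prob):
    """Phred quality -10*log10(p) for an error probability p > 0."""
    return -10 * math.log10(prob)


def filter_by_gq(variants, gq_min, gq_max):
    """Keep the variants whose GQ lies in [gq_min, gq_max]."""
    return [v for v in variants if gq_min <= v["GQ"] <= gq_max]
```

For example, a GQ of 20 corresponds to an error probability of 0.01, and a GQ of 30 to 0.001.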

The section is composed of three parts:

Info about samples: shows a list of samples with associated information. For each sample, the following attributes are given:

a. Sample: a short identifier of the sample
b. Origin: where the cultivar is located (town)
c. Classification: the kind of cultivar (Nero D'avola or Nerello Mascalese)
d. Typology: environment conditions (normal, iron chlorosis, water stress etc.)
e. Phenological Phase
f. Root stock
g. Cultivar
h. Age (in years)

GQ frequency over samples: shows a plot that represents the distribution of GQ values for each sample. Each line corresponds to a different sample.

SNPs/INDELs results: a list of variants for each sample. For each variant, the following properties are given:

a. Type: SNP or INDEL
b. Start: the starting position in the chromosome
c. End: the ending position in the chromosome
d. Quality: -10log_10 prob(call in Alt is wrong) (bigger is more confident)
e. Ref: reference sequence at the "Start" position involved in the variant. For a SNP, it is a single base
f. Alt: comma-delimited list of alternative sequence(s)
g. GT: genotype, encoded as allele values separated by either "/" or "|". The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in Alt, 2 for the second allele listed in Alt and so on
h. PL: likelihood of the data given that the true genotype is X/Y (bigger is less confident). For example, if Ref is G, Alt is A, GT is 1/1 and PL is 255,205,0, the values correspond to the genotypes GG:255, GA:205, AA:0. Since 0 is the smallest value, AA is the most likely genotype given the data
i. GQ: Genotype Quality, encoded as a phred quality -10log_10 p(genotype call is wrong)
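
The GT and PL encodings described above can be decoded with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, it handles diploid genotypes without missing alleles, and it assumes that PL values follow the usual VCF genotype ordering (0/0, 0/1, 1/1, 0/2, 1/2, 2/2, ...).

```python
import re


def decode_gt(gt, ref, alt):
    """Translate a GT string such as "1/1" into allele sequences."""
    alleles = [ref] + alt.split(",")
    return [alleles[int(index)] for index in re.split(r"[/|]", gt)]


def genotype_order(n_alleles):
    """Diploid genotypes in the assumed VCF order: 0/0, 0/1, 1/1, 0/2, ..."""
    return [(i, j) for j in range(n_alleles) for i in range(j + 1)]


def most_likely_genotype(ref, alt, pl):
    """Return the genotype with the smallest PL value."""
    alleles = [ref] + alt.split(",")
    values = [int(v) for v in pl.split(",")]
    order = genotype_order(len(alleles))
    if len(values) != len(order):
        raise ValueError("PL does not match the number of genotypes")
    best = min(range(len(values)), key=values.__getitem__)
    i, j = order[best]
    return alleles[i] + alleles[j]
```

With the values used in the PL example (Ref G, Alt A, PL 255,205,0), `most_likely_genotype("G", "A", "255,205,0")` returns "AA", in agreement with GT 1/1.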

Genes Expression

Here the expression levels of the selected gene in the samples are compared and, for each pair of samples, the negative log of the fold change is given. These results are based on RNA-seq data. The pipeline used for the differential gene and transcript expression analysis can be found here. Please refer to this work for more details. The fold change is the gene expression ratio between two samples. All possible pairs of samples are compared. For each pair, the following data are shown:

Samples: the pair of samples that are compared
Genes: the gene name
Exp Samp 1: the absolute expression value of sample 1
Exp Samp 2: the absolute expression value of sample 2
log2(fold_change): the base-2 logarithm of the ratio between Exp Samp 1 and Exp Samp 2
Test stat: the value of the test statistic used to compute the significance of the observed change in FPKM
P-value: the p-value of the fold change (probability that the computed fold change is observed by chance)
Q-value: the q-value of the fold change (the FDR-adjusted p-value of the test statistic)
Significant: can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple testing
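
To make the columns above more concrete, the following sketch computes a log2 fold change and Benjamini-Hochberg adjusted p-values (q-values). It is not the code of the Biowine pipeline. The function names are hypothetical, the direction of the ratio (sample 2 over sample 1) is an assumption, and so are the 0.05 FDR threshold and the rule that a test is flagged "yes" when its q-value does not exceed that threshold.

```python
import math


def log2_fold_change(exp_samp_1, exp_samp_2):
    """Base-2 log of exp_samp_2 / exp_samp_1 (direction is assumed)."""
    # Both expression values are assumed to be strictly positive.
    return math.log2(exp_samp_2 / exp_samp_1)


def benjamini_hochberg(p_values):
    """Return the FDR-adjusted p-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values never decrease as the p-values grow.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        q_values[index] = running_min
    return q_values


def significant_flags(p_values, fdr=0.05):
    """Flag each test "yes" or "no" (threshold and rule are assumed)."""
    return ["yes" if q <= fdr else "no" for q in benjamini_hochberg(p_values)]
```

For instance, expression values of 2.0 and 8.0 give a log2 fold change of 2.0 under the assumed direction.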

Multiple Genes

By searching for multiple genes you can visualize information and statistics that involve several genes. Genes can be typed one by one, or a list of genes (one row per gene) can be pasted in the dedicated text box. After pressing the button Search, a page with multiple sections containing information about the genes is shown. Sections can be expanded or collapsed by the dedicated links ("show" and "hide"). A description of each section follows.

Gene

In this section, the following data are given:

Ensembl Gene ID
Entrez Gene ID
Chr: chromosome
Strand: sense (+) or antisense (-)
Start: start nucleotide in the chromosome
End: end nucleotide in the chromosome

Information about a gene can be visualized by clicking on "see more". The single gene page is then visualized.

miRNAs

Show a list of known microRNAs that target the given gene.

Genes Expression

Here the expressions of the selected genes in the various samples are compared and, for each pair of samples, the negative log of the fold change is given. The fold change is the gene expression ratio between two samples.

This section is composed of two subsections, which can be shown and hidden by the dedicated links. "Info about samples" shows information about all samples. The following information is given:

a. Sample: a short identifier of the sample
b. Origin: where the cultivar is located (town)
c. Classification: the kind of cultivar (Nero D'avola or Nerello Mascalese)
d. Typology: environment conditions (normal, iron chlorosis, water stress etc.)
e. Phenological Phase
f. Root stock
g. Cultivar
h. Age (in years)

Next, the subsection "more info" shows a heat map with the log odds of gene expression for each pair of samples and each gene. Rows represent genes, while columns represent sample pairs. Cells are colored according to their value.
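
The layout of the heat map (genes as rows, sample pairs as columns) can be illustrated by arranging a flat list of values into a matrix. This is a minimal sketch; the function name and the record layout are hypothetical, and missing combinations are simply left as None.

```python
def heatmap_matrix(records):
    """Arrange (gene, sample_pair, value) records into a matrix.

    Returns the row labels (genes), the column labels (sample pairs)
    and a list of rows, one per gene.
    """
    records = list(records)
    genes = sorted({gene for gene, _, _ in records})
    pairs = sorted({pair for _, pair, _ in records})
    lookup = {(gene, pair): value for gene, pair, value in records}
    matrix = [[lookup.get((gene, pair)) for pair in pairs] for gene in genes]
    return genes, pairs, matrix
```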

Genes Enrichment Analysis

This section shows all annotations (GO Terms) and pathways of the given genes and computes a statistical p-value for each GO term/pathway. It is composed of the following four subsections:

Process: GO terms regarding biological processes
Function: GO terms regarding molecular functions
Component: GO terms regarding cellular components
Pathway: list of pathways

Every section contains a table with the terms/pathways that are enriched (have a significant p-value) and a table with non-significant terms/pathways. Every table of the first three subsections (Process, Function and Component) has the following columns:

GO term
Description
Genes: for each gene it indicates whether such gene is annotated with a specific GO term or not

The table with enriched terms also contains the statistically corrected P-value of the enrichment. The subsection Pathway contains two tables with the following columns:

Pathway
Description
Genes: for each gene it indicates whether such gene is annotated with a specific pathway or not

The first table contains enriched pathways and has a further column with the corrected P-value of the enrichment.
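
The text does not state which statistical test is used for the enrichment. One common choice, assumed here purely for illustration, is the one-sided hypergeometric test: given the number of genes annotated with a term in the whole genome, it gives the probability of finding at least the observed number of annotated genes in the selected list. The function name and parameter names below are hypothetical, and the multiple-testing correction is not included.

```python
from math import comb


def hypergeom_enrichment_p(total_genes, annotated_genes,
                           selected_genes, selected_annotated):
    """Upper-tail hypergeometric p-value P(X >= selected_annotated).

    total_genes:        genes in the background set
    annotated_genes:    background genes annotated with the term
    selected_genes:     genes in the user's list
    selected_annotated: genes in the list annotated with the term
    """
    upper = min(annotated_genes, selected_genes)
    favourable = sum(
        comb(annotated_genes, k)
        * comb(total_genes - annotated_genes, selected_genes - k)
        for k in range(selected_annotated, upper + 1)
    )
    return favourable / comb(total_genes, selected_genes)
```

A small p-value suggests that the term is over-represented in the selected genes; the raw p-values of all terms would then be corrected for multiple testing.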

Pathways

Here, information about the selected pathways is given. For each pathway, the page shows the Description, Name (Pathway) and source (e.g. KEGG) of the pathway. Next, the page shows a list of genes that are shared among all given pathways.

Gene

In this section, a table of genes belonging to at least one input pathway is given. The table has the following columns:

Ensembl Gene ID
Entrez Gene ID
Chr: chromosome
Strand: sense (+) or antisense (-)
Start: start nucleotide in the chromosome
End: end nucleotide in the chromosome

Information about a gene can be visualized by clicking on "see more". The single gene page is then visualized.
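
The two gene lists described above (genes shared among all given pathways, and genes belonging to at least one input pathway) correspond to a set intersection and a set union. A minimal sketch, with hypothetical function names and a hypothetical mapping from pathway name to gene IDs:

```python
def shared_genes(pathway_to_genes):
    """Genes that appear in every selected pathway."""
    gene_sets = [set(genes) for genes in pathway_to_genes.values()]
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


def genes_in_any_pathway(pathway_to_genes):
    """Genes that appear in at least one selected pathway."""
    return set().union(*pathway_to_genes.values())
```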

miRNAs

Show a list of known microRNAs that target at least one of the given genes.

Genes Expression

Here, gene expressions are compared between samples. For each gene in the input pathways and for each pair of samples, the negative log of the fold change is given. The fold change is the ratio of the gene expression between two samples.

This section is composed of three subsections, which can be shown and hidden through the dedicated links ("show" and "hide").

Info about samples: shows information about all samples. The following information is given:

a. Sample: a short identifier of the sample
b. Origin: where the cultivar is located (town)
c. Classification: the kind of cultivar (Nero D'avola or Nerello Mascalese)
d. Typology: environment conditions (normal, iron chlorosis, water stress etc.)
e. Phenological Phase
f. Root stock
g. Cultivar
h. Age (in years)

Heat map and volcano plot.
The heat map shows the negative log of the fold change of each pair of samples for each gene. Rows represent genes, while columns represent sample pairs. Cells are colored according to their value.

The volcano plot represents pairs of samples in a scatter plot of significance (y-axis) vs. fold change (x-axis). Points at the top of the plot are more significant, while points at the left and right sides of the plot have a higher absolute fold change. Further information about volcano plots can be found here.
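
As an illustration of how a point of a volcano plot is usually placed, the sketch below maps a fold change and a p-value to plot coordinates. It assumes the common convention of plotting the log2 fold change on the x-axis and -log10 of the p-value on the y-axis; the text does not state the exact transformation used by Biowine, and the function name is hypothetical.

```python
import math


def volcano_point(log2_fold_change, p_value):
    """Return (x, y) coordinates under the assumed convention.

    x is the log2 fold change, y is -log10(p-value), so smaller
    p-values end up higher in the plot. p_value must be > 0.
    """
    return (log2_fold_change, -math.log10(p_value))
```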

More info: the fold changes of all possible pairs of samples are given in a table. For each pair, the following data are shown:
1. Samples: the pair of samples that are compared
2. Genes: the gene name
3. Exp Samp 1: the absolute expression value of sample 1
4. Exp Samp 2: the absolute expression value of sample 2
5. log2(fold_change): the base-2 logarithm of the ratio between Exp Samp 1 and Exp Samp 2
6. Test stat: the value of the test statistic used to compute the significance of the observed change in FPKM
7. P-value: the p-value of the fold change (probability that the computed fold change is observed by chance)
8. Q-value: the q-value of the fold change (the FDR-adjusted p-value of the test statistic)
9. Significant: can be either "yes" or "no", depending on whether p is greater than the FDR after Benjamini-Hochberg correction for multiple testing

Genes Enrichment Analysis

This section shows all annotations (GO Terms) and pathways of the given genes and computes a statistically corrected p-value for each GO term/pathway. It is composed of the following four subsections:

Process: GO terms regarding biological processes
Function: GO terms regarding molecular functions
Component: GO terms regarding cellular components
Pathway: list of pathways

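Every section contains a table with the terms/pathways that are enriched (have a significant p-value) and a table with non-significant terms/pathways. Every table of the first three subsections (Process, Function and Component) has the following columns:

GO term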
Description
Genes: for each gene it indicates whether such gene is annotated with a specific GO term or not

The table with enriched terms also contains the statistically corrected P-value of the enrichment. The subsection Pathway contains two tables with the following columns:

Pathway
Description
Genes: for each gene it indicates whether such gene is annotated with a specific pathway or not

The first table contains enriched pathways and has a further column with the corrected P-value of the enrichment.

miRNA

miRNAs are identified by the miRBase miRNA name. The information about a single miRNA can be obtained by typing the miRNA name in the corresponding text box and pressing the button Search. Biowine visualizes several sections with information on the requested miRNA. Sections can be expanded or collapsed by the dedicated links ("show" and "hide"). A description of each section follows.

[1] Franceschini, Andrea, et al. "STRING v9.1: protein-protein interaction networks, with increased coverage and integration." Nucleic Acids Research 41.D1 (2013): D808-D815.
[2] Grimplet, Jérôme, et al. "Comparative analysis of grapevine whole-genome gene predictions, functional annotation, categorization and integration of the predicted gene sequences." BMC Research Notes 5.1 (2012): 213.