RNA-Seq Differential Gene Expression Workshop
Tutorial Overview
In this tutorial we compare the performance of three statistically-based expression analysis tools:
- CuffDiff
- EdgeR
- DESeq2
This tutorial builds on top of the basic RNA-seq DGE tutorial. It is recommended to have some familiarity of RNA-seq before beginning this tutorial.
Learning Objectives
At the end of this tutorial you should:
- Be familiar with multiple workflows for RNA-seq differential expression analysis
- Be able to process raw RNA sequence data from samples of two conditions into a list of differentially expressed genes
- Be able to compare the results of gene lists generated by different methods
Background
Where does the data in this tutorial come from?
The data for this tutorial is from the paper, A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae by Nookaew et al. [1] which studies S.cerevisiae strain CEN.PK 113-7D (yeast) under two different metabolic conditions: glucose-excess (batch) or glucose-limited (chemostat).
The RNA-Seq data has been uploaded in NCBI, short read archive (SRA), with accession SRS307298. There are 6 samples in total-- two treatments with three biological replicates each. The data is paired-end.
We have extracted chromosome I reads from the samples to make the tutorial a suitable length. This has implications for downstream analysis, as discussed in the optional extension section.
Section 1: Preparation
1. Register as a new user in Galaxy if you don’t already have an account
- Open a browser and go to a Galaxy server. For this workshop we will be using the server at game-1.genome.edu.au. The public Galaxy Melbourne server can also be used. Recommended browsers include Firefox and Chrome. Internet Explorer is not supported.
- Register as a new user by clicking User > Register on the top dark-grey bar. Alternatively, if you already have an account, login by clicking User > Login.
2. Import the RNA-seq data for the workshop.
If you are using the public game-1 Galaxy server or Galaxy Melbourne server, you can import the data directly from Galaxy. You can do this by going to Shared Data > Published Histories on the top toolbar, and selecting the history called GAMe2017_RNA-seq_Start. Then click on "Import History" on the top right and "start using this history" to switch to the newly imported history.
Alternatively, if you are using your own personal Galaxy server, you can import the data by:
- In the tool panel located on the left, under Basic Tools select Get Data > Upload File. Click on the Paste/Fetch data button on the bottom section of the pop-up window.
- Upload the sequence data by pasting the following links into the text
input area.
These six files are three paired-end samples from the batch condition
(glucose-excess). Make sure the type is specified as 'fastqsanger'
when uploading.
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_ADVNCD/batch1_chrI_1.fastqThese six files are three paired-end samples from the chem condition (glucose-limited). Make sure the type is specified as 'fastqsanger' when uploading.
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_ADVNCD/batch1_chrI_2.fastq
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_ADVNCD/batch2_chrI_1.fastq
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_ADVNCD/batch2_chrI_2.fastq
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_ADVNCD/batch3_chrI_1.fastq
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_ADVNCD/batch3_chrI_2.fastq
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_ADVNCD/chem1_chrI_1.fastqThen, upload this file of gene definitions. You don't need to specify the type for this file as Galaxy will auto-detect the file as a GTF file.
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_ADVNCD/chem1_chrI_2.fastq
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_ADVNCD/chem2_chrI_1.fastq
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_ADVNCD/chem2_chrI_2.fastq
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_ADVNCD/chem3_chrI_1.fastq
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_ADVNCD/chem3_chrI_2.fastq
https://swift.rc.nectar.org.au:8888/v1/AUTH_a3929895f9e94089ad042c9900e1ee82/RNAseqDGE_ADVNCD/genes.gtf
You should now have these 13 files in your history:
- batch1_chrI_1.fastq
- batch1_chrI_2.fastq
- batch2_chrI_1.fastq
- batch2_chrI_2.fastq
- batch3_chrI_1.fastq
- batch3_chrI_2.fastq
- chem1_chrI_1.fastq
- chem1_chrI_2.fastq
- chem2_chrI_1.fastq
- chem2_chrI_2.fastq
- chem3_chrI_1.fastq
- chem3_chrI_2.fastq
- genes.gtf
These files can be renamed by clicking the pen icon if you wish.
These 12 sequencing files are in FASTQ format and have the file extension: .fastq. If you are not familiar with the FASTQ format, click here for an overview.
Note: The reads are paired end; for example batch1_chrI_1.fastq and batch1_chrI_2.fastq are paired reads from one sequencing run. Low quality reads have already been trimmed.
Click on the eye icon to the top right of a FASTQ file to view the first part of the file.
The gene annotation file (genes.gtf) is in GTF format. This file describes where the genes are located in the S. cerevisiae reference genome. Each feature is defined by a chromosomal start and end point, feature type (CDS, gene, exon etc), and parent gene and transcript. More information on the GTF format can be found here.
Section 2: Alignment with TopHat
In this section we map the reads in our FASTQ files to a reference genome. As these reads originate from mRNA, we expect some of them will cross exon/intron boundaries when we align them to the reference genome. Tophat is a splice-aware mapper for RNA-seq reads that is based on Bowtie. It uses the mapping results from Bowtie to identify splice junctions between exons. More information on Tophat can be found here.
1. Map/align the reads with Tophat to the S. cerevisiae reference
In the left tool panel menu, under NGS Analysis, select NGS: RNA Analysis > Tophat and set the parameters as follows:
- Is this single-end or paired-end data? Paired-end (as individual datasets)
-
RNA-Seq FASTQ file, forward reads:
(Click on the multiple datasets icon and select all six of the forward FASTQ files ending in *1.fastq. This should be correspond to every second file (1,3,5,7,9,11). This can be done by holding down the ctrl key (Windows) or the command key (OSX) to select multiple files.)- batch1_chrI_1.fastq
- batch2_chrI_1.fastq
- batch3_chrI_1.fastq
- chem1_chrI_1.fastq
- chem2_chrI_1.fastq
- chem3_chrI_1.fastq
-
RNA-Seq FASTQ file, reverse reads:
(Click on the multiple datasets icon and select all six of the reverse FASTQ files ending in *2.fastq.)- batch1_chrI_2.fastq
- batch2_chrI_2.fastq
- batch3_chrI_2.fastq
- chem1_chrI_2.fastq
- chem2_chrI_2.fastq
- chem3_chrI_2.fastq
- Use a built in reference genome or own from your history: Use built-in genome
- Select a reference genome: S. cerevisiae June 2008 (SGD/SacCer2) (sacCer2)
- Use defaults for the other fields
- Execute
Note: This may take a few minutes, depending on how busy the server is.
2. Rename the output files
You should have 5 output files for each of the FASTQ input files:
- Tophat on data 2 and data 1: accepted_hits: This is a BAM file containing sequence alignment data of the reads. This file contains the location of where the reads mapped to in the reference genome. We will examine this file more closely in the next step.
- Tophat on data 2 and data 1: splice junctions: This file lists all the places where TopHat had to split a read into two pieces to span an exon junction.
- Tophat on data 2 and data 1 deletions and Tophat on data 2 and data 1: insertions: These files list small insertions or deletions found in the reads. Since we are working with synthetic reads we can ignore Tophat for Illumina data 1:insertions Tophat for Illumina data 1:deletions for now.
- Tophat on data 2 and data 1: align_summary: This file gives some mapping statistics including the number of reads mapped and the mapping rate.
You should have a total of 30 Tophat output files in your history.
Rename the 6 accepted_hits files into a more meaningful name (e.g. 'Tophat on data 2 and data 1: accepted_hits' to 'batch1-accepted_hits.bam') by using the pen icon next to the file.
3. Visualise the aligned reads with Trackster
Before we visualise our alignments in Trackster, we need to slightly modify our GTF file. The genes.gtf file for S. cerevisae (as downloaded from UCSC Table Browser) has a slightly different naming convention for one of the contigs (2-micron) than the reference genome used by Galaxy (2micron), which will cause an error to be thrown by Trackster if you try to add it. This is very typical of genomics currently! For this tutorial, we can fix this by removing the rows beginning with '2-micron' since we are only interested in chrI genes anyway.
- In the left tool panel menu, under Filter and Sort, select
Select lines that match an expression and set the parameters as follows:
- Select lines from: genes.gtf
- that: Matching
- the partern:
^chrI\t
The result will be a GTF file only containing the features belonging to chrI. Rename this file to something like 'chrI_genes.gtf'.
We can now visualise our data in Trackster.
- On the top bar of Galaxy, select Visualization > New Track Browser.
- Name your new visualization and select S. cerevisiae (sacCer2) as the reference genome build.
- Click the Add Datasets to Visualization button and select 'batch1-accepted_hits.bam', 'chem1-accepted_hits.bam', and 'chrI_genes.gtf' by using the checkboxes on the left.
- Select chrI from the dropdown box. You can zoom in and out using the buttons on the top toolbar. Can you find any genes that look differentially expressed between the two BAM files?
- You can also add more tracks using the Add Tracks icon located on the top right. Load one of the splice junction files such as 'Tophat on data 2 and data 1: splice junctions'.
- Explore the data and try to find a splice junction. Next to the drop down list, click on the chromosomal position number display and specify the location chrI:86985-87795 to view an intron junction.
Before starting the next section, leave the Trackster interface and return to the analysis view of Galaxy by clicking 'Analyze Data' on the top Galaxy toolbar.
Section 3: Cuffdiff
The aim in this section is to statistically test for differential expression using Cuffdiff and obtain a list of significant genes.
1. Run Cuffdiff to identify differentially expressed genes and transcripts
Cuffdiff takes a long time to run (even when using a subset of data) so in this workshop we'll be importing a history with finished Cuffdiff results.
You can do this by going to Shared Data > Published Histories on the top toolbar, and selecting the history called GAMe2017_RNA-seq_Cuffdiff. Then click on "Import History" on the top right and "start using this history" to switch to the newly imported history.
In the left tool panel menu, under NGS Analysis, select NGS: RNA Analysis > Cuffdiff and set the parameters as follows:
- Transcripts: genes.gtf
- Condition:
- 1: Condition
- name batch
- Replicates:
- batch1-accepted_hits.bam
- batch2-accepted_hits.bam
- batch3-accepted_hits.bam
(Multiple datasets can be selected by holding down the shift key or the ctrl key (Windows) or the command key (OSX).)
- 2: Condition
- name chem
- Replicates:
- chem1-accepted_hits.bam
- chem2-accepted_hits.bam
- chem3-accepted_hits.bam
- 1: Condition
- Use defaults for the other fields
- Execute
Note: This step may take a while, depending on how busy the server is.
2. Explore the Cuffdiff output files
There should be 11 output files from Cuffdiff. These files should all begin with something like "Cuffdiff on data 43, data 38, and others".
FPKM tracking files:
- transcript FPKM tracking
- gene FPKM tracking
- TSS groups FPKM tracking
- CDS FPKM tracking
These 4 files contain the FPKM (a unit of normalised expression taking into account the transcript length for each transcript and the library size of the sample) for each of the two conditions.
Differential expression testing files:
- gene differential expression testing
- transcript differential expression testing
- TSS groups differential expression testing
- CDS FPKM differential expression testing
- CDS overloading differential expression testing
- promoters differential expression testing
- splicing differential expression testing
These 7 files contain the statistical results from testing the level of expression between the two conditions.
-
Examine the tables of normalised gene counts View the Cuffdiff file "Cuffdiff on data x, data x, and others: gene FPKM tracking" by clicking on the eye icon. The file consists of one row for each gene from the reference transcriptome, with columns containing the normalised read counts for each of the two conditions. Note:
- Cuffdiff gives each gene it’s own ‘tracking_id’, which equates to a gene. Multiple transcription start sites are aggregated under a single tracking_id.
- A gene encompasses a chromosomal locus which covers all the features that make up that gene (exons, introns, 5’ UTR, etc).
-
Inspect the gene differential expression testing file View the Cuffdiff file "Cuffdiff on data x, data x, and others: gene differential expression testing" by clicking on the eye icon. The columns of interest are: gene (c3), locus (c4), log2(fold_change) (c10), p_value (c12), q_value (c13) and significant (c14).
-
Filter based on column 14 (‘significant’) - a binary assessment of q_value > 0.05, where q_value is p_value adjusted for multiple testing. Under Basic Tools, click on Filter and Sort > Filter:
- Filter: "Cuffdiff on data....: gene differential expression testing"
- With following condition: c14=='yes'
- Execute
This will keep only those entries that Cuffdiff has marked as significantly differentially expressed. There should be 53 differentially expressed genes in this list.
We can rename this file by clicking on the pencil icon of the outputted file and change the name from "Filter on data x" to "Cuffdiff_Significant_DE_Genes".
Section 4: Another workflow with HISAT2
Now we've seen an example of a RNA-seq workflow using TopHat for alignment and Cuffdiff for quantification and analysis, we can try another workflow with more modern software.
Start with a new history by importing the same history as before. Go to Shared Data > Published Histories on the top toolbar, and selecting the history called GAMe2017_RNA-seq_Start. Then click on "Import History" on the top right and "start using this history" to switch to the newly imported history.
HISAT2 is a fast alignment program that is a successor to TopHat2. Running HISAT2 is similar to running TopHat.
In the left tool panel menu, under NGS Analysis, select NGS: RNA Analysis > HISAT2 and set the parameters as follows:
- Input data format FASTQ
- Single end or paired reads? Individual paired reads
-
Forward reads:
(Click on the multiple datasets icon and select all six of the forward FASTQ files ending in *1.fastq. This should be correspond to every second file (1,3,5,7,9,11). This can be done by holding down the ctrl key (Windows) or the command key (OSX) to select multiple files.)- batch1_chrI_1.fastq
- batch2_chrI_1.fastq
- batch3_chrI_1.fastq
- chem1_chrI_1.fastq
- chem2_chrI_1.fastq
- chem3_chrI_1.fastq
-
Reverse reads:
(Click on the multiple datasets icon and select all six of the reverse FASTQ files ending in *2.fastq.)- batch1_chrI_2.fastq
- batch2_chrI_2.fastq
- batch3_chrI_2.fastq
- chem1_chrI_2.fastq
- chem2_chrI_2.fastq
- chem3_chrI_2.fastq
- Source for the reference genome to align against: Use built-in genome
- Select a reference genome: S. cerevisiae June 2008 (SGD/SacCer2) (sacCer2)
- Use defaults for the other fields
- Execute
HISAT2 outputs one bam file for each set of paired-end read files. Rename the 6 files into a more meaningful name (e.g. 'HISAT on data 2 and data 1' to 'batch1.bam') by using the pen icon next to the file.
Section 5: Count reads in features
HTSeq-count creates a count matrix using the number of the reads from each bam file that map to the genomic features in the genes.gtf. For each feature (a gene for example) a count matrix shows how many reads were mapped to this feature.
-
Use HTSeq-count to count the number of reads for each feature.
In the left tool panel menu, under NGS Analysis, select NGS: RNA Analysis > SAM/BAM to count matrix and set the parameters as follows:- Gene model (GFF) file to count reads over from your current history: genes.gtf
- bam/sam file from your history:
(Select all six bam files using the shift key.)- batch1.bam
- batch2.bam
- batch3.bam
- chem1.bam
- chem2.bam
- chem3.bam
- Use defaults for the other fields
- Execute
-
Examine the outputted matrix by using the eye icon.
Each column corresponds to a sample and each row corresponds to a gene. By sight, see if you can find a gene you think is differentially expressed from looking at the counts.
We now have a count matrix, with a count against each corresponding sample. We will use this matrix in later sections to calculate the differentially expressed genes.
Section 6: edgeR
edgeR is an R package, that is used for analysing differential expression of RNA-Seq data and can either use exact statistical methods or generalised linear models.
1. Generate a list of differentially expressed genes using edgeR
In the Galaxy tool panel, under NGS Analysis, select NGS: RNA > Differential_Count and set the parameters as follows:
- Select an input matrix - rows are contigs, columns are counts for each sample: bams to DGE count matrix_htseqsams2mx.xls
- Title for job outputs: Differential_Counts_edgeR
- Treatment Name: Batch
- Select columns containing treatment:
- batch1.bam
- batch2.bam
- batch3.bam
- Control Name: Chem
- Select columns containing control:
- chem1.bam
- chem2.bam
- chem3.bam
- Run this model using edgeR: Run edgeR
- Use defaults for the other fields
- Execute
2. Examine the outputs from the previous step
- Examine the Differential_Counts_edgeR_topTable_edgeR.xls file by clicking on the eye icon. This file is a list of genes sorted by p-value from using EdgeR to perform differential expression analysis.
- Examine the Differential_Counts_edgeR.html file. This file has some
output logs and plots from running edgeR. If you are familiar with R,
you can examine the R code used for analysis by scrolling to the bottom
of the file, and clicking Differential_Counts.Rscript to download the
Rscript file.
If you are curious about the statistical methods edgeR uses, you can read the edgeR user's guide at Bioconductor.
3. Extract the significant differentially expressed genes.
Under Basic Tools, click on Filter and Sort > Filter:
- Filter: "Differential_Counts_edgeR_topTable_edgeR.xls"
- With following condition: c6 < 0.05
- Execute
This will keep the genes that have an adjusted p-value of less than 0.05. There should be 55 genes in this file. Rename this file by clicking on the pencil icon of and change the name from "Filter on data x" to "edgeR_Significant_DE_Genes".
Section 7: Degust
Degust is an interactive visualiser for analysing RNA-seq data. It runs as a web service and can be found at vicbioinformatics.com/degust/.
1. Load count data into Degust
- In Galaxy, download the count data "bams to DGE count matrix_htseqsams2mx.xls" generated in Section 5 using the disk icon.
- Go to vicbioinformatics.com/degust/ and click on "Upload your counts file".
- Click "Choose file" and upload the recently downloaded Galaxy tabular file containing your RNA-seq counts.
2. Configure your uploaded data
- Give your visualisation a name.
- For the Info column, select Contig.
- Add two conditions: batch and chem. For each condition, select the three samples which correspond with the condition.
- Click Save changes and view your data.
3. Explore your data
Read through the Degust tour of features. Explore the parallel coordinates plot, MA plot, MDS plot, heatmap and gene list. Each is fully interactive and influences other portions on the display depending on what is selected.
On the right side of the page is an options module which can set thresholds to filter genes using statistical significance or absolute-fold-change.
On the left side is a dropdown box you can specify the method (Voom/Limma or edgeR) used to perform differential expression analysis on the data. You can also view the R code by clicking "Show R code" under the options module on the right.
4. (Optional) Explore the demo data
Degust also provides an example dataset with 4 conditions and more genes. You can play with the demo dataset by clicking on the "Try the demo" button on the Degust homepage. The demo dataset includes a column with an EC number for each gene. This means genes can be displayed on Kegg pathways using the module on the right.
PCA plots are useful for exploratory data analysis. Samples which are more similar to each other are expected to cluster together. A count matrix often has thousands of dimensions (one for each feature) and our PCA plot generated in the previous step transforms the data so the most variability is represented in principal components 1 and 2 (PC1 and PC2 represented by the x-axis and y-axis respectively).
Take note of the scales on the x-axis and the y-axis. The x-axis representing the first principal component accounts for 96% of the variance and ranges from approximately -6 to +6, while the y-axis ranges from approximately -1 to +1.
For both conditions, the 3 replicates tend to be closer to each other than they are to replicates from the other condition.
Additionally, within conditions, the lower glucose (chem) condition shows more variability between replicates than the higher glucose (batch) condition.
Section 8: DESeq2 workflow
The workshop instructors will demonstrate how to use DESeq2 for differential expression analysis.
If you are interested in looking at the outputs, you can view the files by going to Shared Data > Published Histories on the top toolbar, and selecting the history called GAMe2017_RNA-seq_DESeq2. Then click on "Import History" on the top right and "start using this history" to switch to the newly imported history.
Note: These tools may not be installed in Galaxy. You may need to install or ask an administrator to install htseq-count and DESeq2 via the Galaxy Tool Shed.
-
Use HTSeq-count to count the number of reads for each feature and generate a count file for each alignment file.
In the left tool panel menu, under NGS Analysis, select NGS: RNA Analysis > htseq-count and set the parameters as follows:- Aligned SAM/BAM File:
(Select all six bam files using the shift key.)- batch1-accepted_hits.bam
- batch2-accepted_hits.bam
- batch3-accepted_hits.bam
- chem1-accepted_hits.bam
- chem2-accepted_hits.bam
- chem3-accepted_hits.bam
- Stranded: No
- ID Attribute: gene_name
- Use defaults for the other fields
- Execute
- Aligned SAM/BAM File:
-
Rename the outputs from htseq-count to sensible names. For example: "htseq-count on data 13 and data 18" to "batch1_counts".
-
Use DESeq2 to find differentially expressed features from count tables.
In the left tool panel menu, under NGS Analysis, select NGS: RNA Analysis > DESeq2 and set the parameters as follows:- 1: Factor
- Specify a factor name: condition
- 1: Factor level:
- Specify a factor level: batch
(Select the three batch htseq-count files.)- batch1_counts
- batch2_counts
- batch3_counts
- Specify a factor level: batch
- 2: Factor level:
- Specify a factor level: chem
(Select the three chem htseq-count files.)- chem1_counts
- chem2_counts
- chem3_counts
- Specify a factor level: chem
- 1: Factor
-
Have a look at the outputs of DESeq2. We will now filter significant (adjusted p-value < 0.05) genes from the DESeq2 result file.
Under Basic Tools, click on Filter and Sort > Filter:- Filter: "DESeq2 result file on data ..."
- With following condition: c7 < 0.05
- Execute
There should be 53 genes in this file. Rename this file by clicking on the pencil icon of and change the name from "Filter on data x" to "DESeq2_Significant_DE_Genes".
Section 9: How much concordance is there between methods?
We are interested in how similar the identified genes are between the different statistical methods used by Cuffdiff, edgeR, and DESeq2. We can generate a Venn diagram to visualise the amount of overlap.
You can download a shared history with all three significant gene lists by going to Shared Data > Published Histories on the top toolbar, and selecting the history called GAMe2017_RNA-seq_concordance. Then click on "Import History" on the top right and "start using this history" to switch to the newly imported history.
The output has already been generated in this history, but here are the steps to generate the Venn diagram if you want to run it yourself:
-
Generate a Venn diagram of the output of the 3 differential expression tools.
Note that column index 2 (or c3) contains the gene name in the Cuffdiff output. Similarly column index 0 (or c1) in EdgeR and DESeq2 contain the gene names.
In the Galaxy tool panel, under Statistics and Visualisation, select Graph/Display Data > proportional venn and set the parameters as follows:- title: Common genes
- input file 1: Cuffdiff_Significant_DE_Genes
- column index: 2
- as name: Cuffdiff
- input file 2: DESeq2_Significant_DE_Genes
- column index file 2: 0
- as name file 2: DESeq2
- two or three: three
- input file 3: edgeR_Significant_DE_Genes
- column index file 3: 0
- as name file 3: edgeR
- Execute
-
View the generated Venn diagram. Agreement between the tools is good: the vast majority of genes identified as differentially expressed are agreed upon with all three programs, and only a handful that are exclusive to each tool.
Optional extension: Gene set enrichment analysis
The biological question being asked in the original paper is essentially:
"What is the global response of the yeast transcriptome in the shift from
growth at glucose excess conditions (batch) to glucose-limited conditions
(chemostat)?"
We can address this question by attempting to interpret our differentially expressed gene list at a higher level, perhaps by examining the categories of gene and protein networks that change in response to glucose.
For example, we can input our list of differentially expressed genes to a Gene Ontology (GO) enrichment analysis tool such as GOrilla to find out the GO enriched terms.
NOTE: Because of time-constraints in this tutorial, the analysis was confined to a single chromosome (chromosome I) and as a consequence we don’t have sufficient information to look for groups of differentially expressed genes (simply because we don’t have enough genes identified from the one chromosome to look for statistically convincing over-representation of any particular gene group).
-
Download the list of genes here in a plain-text file to your local computer by right clicking on the link and selecting Save Link As...
Note that there are ~2500 significantly differentially expressed genes identified in the full analysis. Also note that the genes are ranked in order of statistical significance. This is critical for the next step.
-
Explore the data using gene set enrichment analysis (GSEA) using the online tool GOrilla
- Go to cbl-gorilla.cs.technion.ac.il
- Choose Organism: Saccharomyces cerevisiae
- Choose running mode: Single ranked list of genes
- Open the gene list you downloaded in the previous step in a text editor. Select the full list, then copy and paste the list into the text box.
- Choose an Ontology: Process
- Search Enriched GO terms
-
Once the analysis has finished running, you will be redirected to a page depicting the GO enriched biological processes and its significance (indicated by colour), based on the genes you listed.
Scroll down to view a table of GO terms and their significance scores. In the last column, you can toggle the [+] Show genes to see the list of associated genes.
-
Experiment with different ontology categories (Function, Component) in GOrilla.
- Go to cbl-gorilla.cs.technion.ac.il
At this stage you are interpreting the experiment in different ways, potentially discovering information that will lead you to further lab experiments. This is driven by your biological knowledge of the problem space. There are an unlimited number of methods for further interpretation of which GSEA is just one.
References
[1] Nookaew I, Papini M, Pornputtpong N, Scalcinati G, Fagerberg L, Uhlén M, Nielsen J: A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Res 2012, 40 (20):10084 – 10097. doi:10.1093/nar/gks804. Epub 2012 Sep 10