Large-scale biology projects such as the sequencing of the human genome and gene expression surveys using RNA-seq, microarrays and other technologies have created a wealth of data for biologists. However, the challenge facing scientists is analyzing and even accessing these data to extract useful information pertaining to the system being studied. This highly hands-on course focuses on employing existing bioinformatic resources – mainly web-based programs and databases – to access these data and answer questions relevant to the average biologist.
Topics covered include multiple sequence alignments, phylogenetics, gene expression data analysis, and protein interaction networks, in two separate parts.
The first part, Bioinformatic Methods I, dealt with databases, BLAST, multiple sequence alignments, phylogenetics, selection analysis and metagenomics.
This, the second part, Bioinformatic Methods II, will cover motif searching, protein-protein interactions, structural bioinformatics, gene expression data analysis, and cis-element predictions.
This pair of courses is useful to any student considering graduate school in the biological sciences, as well as students considering molecular medicine.
These courses are based on one taught at the University of Toronto to upper-level undergraduates who have some understanding of basic molecular biology. If you're not familiar with this, something like https://learn.saylor.org/course/bio101 might be helpful. No programming is required for this course, although some command-line work (within a web browser) occurs in the 5th module.
Bioinformatic Methods II is regularly updated, and was last updated for February 2022.
In this module we'll be exploring conserved regions within protein families. Such regions can help us understand the biology of a sequence, in that they are likely important for biological function, and they can also be used to help ascribe function to sequences where we can't identify any homologs in the databases. There are various ways of describing the conserved regions, from simple regular expressions to profiles to profile hidden Markov models (HMMs).
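As a rough illustration of the simplest of these descriptions, the sketch below translates a PROSITE-style pattern into a Python regular expression and reports where it occurs in a protein sequence. The function names and the test sequence are made up for this example, and only the basic pattern syntax is handled (allowed residues in square brackets, forbidden residues in braces, `x` for any residue, repeat counts in parentheses, `<` and `>` anchors). It is not how any particular course tool is implemented.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # {P} means "any residue except P"
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) means "2 to 4 of any residue"
        element = element.replace("(", "{").replace(")", "}")
        # lowercase x is the wildcard; uppercase X is left untouched
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- or C-terminus
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


def find_motif(sequence: str, pattern: str):
    """Return (1-based start, matched text) for every hit, including overlapping ones."""
    # The lookahead lets the scan report hits that overlap each other.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern;
    # the sequence is an invented toy example.
    print(find_motif("MKNGSAANPTLNVTQ", "N-{P}-[ST]-{P}."))
```

A regular expression like this either matches or does not, which is why profiles and profile HMMs, which score each position, are generally more sensitive for divergent family members.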
In this module we'll be exploring protein-protein interactions (PPIs). Protein-protein interactions are important as proteins don't act in isolation, and often an examination of the interaction partners (determined in an unbiased, perhaps high-throughput way) of a given protein can tell us a lot about its biology. We'll talk about some different methods used to determine PPIs and go over their strengths and weaknesses. In the lab we'll use three different tools and two different databases to examine interaction partners of BRCA2, a protein that we examined in last module's lab. Finally, we'll touch on a "foundational" concept, Gene Ontology (GO) term enrichment analysis, to help us get an overview of the proteins interacting with our example.
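One common way to frame GO term enrichment is as a hypergeometric (one-sided Fisher) test: given how many genes in the background carry a term, how surprising is it to see at least this many among the interactors? The sketch below shows that calculation under this assumption. The function name and the numbers in the usage example are invented for illustration, and real tools also correct for testing many terms at once, which is omitted here.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes in the sample.

    population: number of genes in the background set
    annotated:  background genes that carry the GO term
    sample:     number of genes in our list (e.g. interaction partners)
    hits:       genes in our list that carry the GO term
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie between 0 and population")
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the hypergeometric probabilities for hits, hits + 1, ..., upper.
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Invented numbers: 20000 background genes, 200 carry the term,
    # 50 genes in our list, 8 of which carry the term.
    print(enrichment_p_value(20000, 200, 50, 8))
```

A small p-value suggests the term is over-represented among the interactors relative to the background, which is the kind of summary an enrichment tool reports for each term.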
Determining a protein's three-dimensional (tertiary) structure can tell us a lot about the biology of that protein. In this module's mini-lecture, we'll talk about some different methods used to determine a protein's tertiary structure and cover the main database for protein structure data, the PDB. In the lab we'll explore the PDB and an online tool for searching for structural (as opposed to sequence) similarity, VAST. We'll then use a nice piece of stand-alone software, PyMOL, to explore several protein structures in more detail.
Review: Protein Motifs, Protein-Protein Interactions, and Protein Structure
Gene Expression Analysis I
When and where genes are expressed (active) in tissues or cells is one of the main determinants of what makes that tissue or cell the way it is, both in terms of morphology and in terms of response to external stimuli. Several different methods exist for generating gene expression levels for all of the genes in the genome in tissues or even at cell-type-specific resolution. In this class we'll be processing and then examining some gene expression data generated using RNA-seq. We'll explore one of the main databases for RNA-seq expression data, the Sequence Read Archive (SRA), and then use an open-source suite of programs in R called Bioconductor to process the raw reads from four RNA-seq data sets, to summarize their expression levels, to select significantly differentially expressed genes, and finally to visualize these as a heat map.
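To give a feel for what "summarizing expression levels" involves, the sketch below scales raw read counts to counts per million (CPM) so that samples with different sequencing depths are comparable, and then computes a log2 fold change between two samples. This is a simplified illustration in Python rather than the R/Bioconductor workflow used in the lab. The function names, gene names, and counts are invented, and it does no statistical testing, which is what dedicated differential expression packages add.

```python
from math import log2


def counts_per_million(counts: dict) -> dict:
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {gene: 1e6 * count / total for gene, count in counts.items()}


def log2_fold_change(cpm_a: dict, cpm_b: dict, pseudocount: float = 1.0) -> dict:
    """log2(B / A) per gene; the pseudocount avoids dividing by zero."""
    return {
        gene: log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))
        for gene in cpm_a
    }


if __name__ == "__main__":
    # Invented counts for two samples and three hypothetical genes.
    control = {"geneA": 120, "geneB": 3000, "geneC": 0}
    treated = {"geneA": 480, "geneB": 2900, "geneC": 15}
    changes = log2_fold_change(counts_per_million(control), counts_per_million(treated))
    for gene, change in sorted(changes.items()):
        print(gene, round(change, 2))
```

Fold changes alone say nothing about significance; with replicates, a statistical model is needed to decide which differences exceed the expected variability.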
Gene Expression Analysis II
When and where genes are expressed (active) in tissues or cells is one of the main determinants of what makes that tissue or cell the way it is, both in terms of morphology and in terms of response to external stimuli. Several different methods exist for generating gene expression levels for all of the genes in the genome in tissues or even at cell-type-specific resolution. In this class we'll be hierarchically clustering our significantly differentially expressed genes from last time using Bioconductor and the built-in function of an online tool called Expression Browser. Then we'll be using another online tool that uses a similarity metric, the Pearson correlation coefficient, to identify genes responding in a similar manner to our gene of interest, in this case AP3. We'll use a second tool, ATTED-II, to corroborate our gene list. We'll also be exploring some online databases of gene expression and an online tool for doing a Gene Ontology enrichment analysis.
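The Pearson correlation coefficient mentioned above can be sketched in a few lines: it compares two expression profiles across the same set of samples and returns a value between -1 and 1. The example below ranks genes by their correlation with a gene of interest. The function names, gene labels, and expression values are hypothetical, and the online tools may apply additional filtering or weighting not shown here.

```python
from math import sqrt


def pearson(x, y) -> float:
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / sqrt(var_x * var_y)


def rank_coexpressed(bait: str, profiles: dict):
    """Sort all other genes by decreasing correlation with the bait gene."""
    scores = {
        gene: pearson(profiles[bait], values)
        for gene, values in profiles.items()
        if gene != bait
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Invented expression values across four samples.
    profiles = {
        "bait": [1.0, 2.0, 3.0, 4.0],
        "geneA": [2.0, 4.0, 6.0, 8.0],
        "geneB": [4.0, 3.0, 2.0, 1.0],
        "geneC": [1.0, 3.0, 2.0, 4.0],
    }
    for gene, r in rank_coexpressed("bait", profiles):
        print(gene, round(r, 3))
```

Because the coefficient depends on the shape of the profile rather than its magnitude, a gene expressed at a much lower level can still score close to 1 if it rises and falls in the same samples as the gene of interest.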
Cis Regulatory Systems
When and where genes are expressed in tissues or cells is one of the main determinants of what makes that tissue or cell the way it is, both in terms of morphology and in terms of response to external stimuli. Gene expression is controlled in part by the presence of short sequences in the promoters (and other parts) of genes, called cis-elements, which permit transcription factors and other regulatory proteins to bind to direct the patterns of expression in certain tissues or cells or in response to environmental stimuli. We'll explore a couple of sets of promoters of genes that are coexpressed with AP3 from Arabidopsis, and with INSULIN from human, for the presence of known cis-elements, and we'll also try to predict some new ones using a couple of different methods.
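Searching a promoter for a known cis-element can be sketched as a consensus scan over both DNA strands. The example below expands an IUPAC consensus string into a regular expression and reports hits on the forward and reverse strands in forward-strand coordinates. The function names, the `GATAR` consensus, and the promoter sequence are arbitrary choices for illustration. Real cis-element tools typically use position weight matrices and background models rather than exact consensus matching.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def scan_promoter(promoter: str, consensus: str):
    """Return sorted (1-based start, strand, site) tuples for both strands."""
    promoter = promoter.upper()
    body = "".join(IUPAC[base] for base in consensus.upper())
    regex = re.compile("(?=(" + body + "))")  # lookahead keeps overlapping hits
    hits = [(m.start() + 1, "+", m.group(1)) for m in regex.finditer(promoter)]
    length = len(promoter)
    for m in regex.finditer(reverse_complement(promoter)):
        site = m.group(1)
        # Convert the position on the reverse complement to the leftmost
        # coordinate of the site on the forward strand.
        start = length - (m.start() + len(site)) + 1
        hits.append((start, "-", site))
    return sorted(hits)


if __name__ == "__main__":
    # Toy promoter and an arbitrary example consensus (R = A or G).
    print(scan_promoter("TTGATAAGGCTTATCAA", "GATAR"))
```

Short consensus sequences occur frequently by chance, so hits like these are usually judged by whether they are over-represented in a set of coexpressed promoters compared with a background set.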
Review: Gene Expression Analysis and Cis Regulatory Systems + Final Assignment
Adelyne Chan completed this course, spending 2 hours a week on it and found the course difficulty to be easy.
I took this course as a follow-up after completing Bioinformatic Methods I and finding it extremely useful. Although Bioinformatic Methods I is not listed as a prerequisite, I think it would be helpful if they were taken together, as Prof Provart introduces some of the simpler tools in the first course and then follows up with this one. The way in which the exercises are designed can sometimes be confusing to follow, but there are plenty of discussion forums where help can be easily obtained - Prof Provart himself follows the forums very closely and is not hesitant to step in with further explanations.
Afraa Mhmad completed this course, spending 8 hours a week on it and found the course difficulty to be medium.
Detailed and appropriate for beginners. The most beneficial and at the same time most interesting part was the lab section. Despite the relative complexity of the algorithms and concepts of bioinformatics, they were reviewed in the course in a simple, straightforward way.