Teaching Data Carpentry
I taught one of the first genomics data carpentry workshops at JMU last summer, using this R material that I modified from my own and some of Jenny Bryan's existing material. For this lesson I used the gapminder data and examples similar to Jenny's for teaching things like dplyr and ggplot2.
I want to update this lesson using a similar dataset more relevant to genomics or biology in general, rather than the population/socioeconomic data in gapminder. This is directly in response to participant criticism that the workshop should actually use real biological/genomic data. Specifically, I'm looking for a biological/genomic dataset that has similar characteristics:
- At least two continuous variables (demonstrating scatter plots, examining distributions)
- At least two discrete/factor variables with several levels (for teaching dplyr group_by() %>% summarize() operations and mapping aesthetics like color or faceting using ggplot2).
- Related to a second dataset by some identifier column (so as to demonstrate joins and other dplyr/SQL two-table verbs).
- At least a few thousand observations (so we can call the data "big").
- Preferably something published, or at least not proprietary/copyrighted, so there won't be any issues if I wanted to host the data externally or publish as an R data package.
- Bonus points if it's human health-related or at least mammalian data - I work and teach mostly in a medical center, so this kind of data is more relevant than ecology or other similar data.
The gapminder data had all of this (life expectancy (cont) vs GDP per capita (cont) by continent (factor) or country (factor)). I'd like a genomics/biological dataset with similar features.
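To make the wish list concrete, here is a rough sketch of the three kinds of operations I mean, run against gapminder. It assumes the dplyr, ggplot2, and gapminder packages are installed; the `continent_labels` lookup table is made up purely to stand in for a "second dataset" and is not part of gapminder.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Grouped summary: median life expectancy per continent per year.
life_by_continent <- gapminder %>%
  group_by(continent, year) %>%
  summarize(median_lifeExp = median(lifeExp)) %>%
  ungroup()

# Hypothetical lookup table keyed on continent (illustration only).
continent_labels <- data.frame(
  continent = c("Africa", "Americas", "Asia", "Europe", "Oceania"),
  label = c("AF", "AM", "AS", "EU", "OC"),
  stringsAsFactors = FALSE
)

# Two-table verb: attach the lookup columns by the shared identifier.
joined <- gapminder %>%
  mutate(continent = as.character(continent)) %>%
  left_join(continent_labels, by = "continent")

# Two continuous variables, one factor mapped to color, another to facets.
p <- ggplot(gapminder, aes(gdpPercap, lifeExp, color = continent)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  facet_wrap(~ year)
```

A replacement dataset would need to support all three of these steps without contortions.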
I've asked this question before in other forums with little success. Answers like "there's lots of data in NCBI GEO" or "check out the UCI machine learning repository" or "there's lots of free data online just google or read that one Quora post" aren't helpful.
DC has been actively developing genomics material over the last year. It's still somewhat under development, but it's been taught several times now. The relevant repos are here:
and there is an organized list of lessons here (in the Genomics section):
@tracyteal should be able to provide more info about the current state of development.
Thanks. Yes, I'm seeing some undocumented fastq files in datadescription-genomics, but I'm looking specifically for individual-level data with 1000s of observations and both multiple continuous and categorical variables for dplyr/group_by and ggplot aesthetic operations. Unfortunately most sequencing datasets are of the n=3 type, and don't have that multi-dimensional complexity.
We've taught it four more times since then, and now have R materials! It's scattered, so I'm in the process of bringing it all into the repo.
This is the most recent version of the whole workshop
And this is the R lesson
It's using a subset of data from the Lenski E. coli LTE experiment.
I made up a 'genome_size' column and values so that the data could be more multivariate. And I kept it relatively simple so it would be easier to work with. A bad thing about that, though, is that it's not a good example of the RNA-Seq-like tabular data they'll work with.
One idea we had was to actually run through the SNP analysis with this dataset, and then use that output in part of the R lesson.
Comments, changes or PRs would be very welcome.
Thanks Tracy. I don't have any suggestions that warrant a PR, but the "scatter plots" mentioned in the ggplot2 lesson aren't really scatter plots. They're just showing the genome size for each observation in the data. What I was looking for had truly multivariate continuous data (1000s of observations) so as to be able to show things like scatter plots faceted by some other level of another factor variable.
I'd like to recommend the microarray gene expression data from Brauer et al 2008: Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast. Its supplemental data is published (though for some reason the link to the dataset is broken on the journal's site: I have reposted it here).
As for your points:
- It has two continuous variables: the dilution rate that the yeast was grown at (higher means faster growth rate) and the resulting microarray expression of each gene. (It's already been processed and normalized)
- It has two factor variables: the gene ID and the nutrient that was the limiting factor: glucose (G), ammonium (N), sulfate (S), phosphate (P), uracil (U) or leucine (L). You can pick six genes, facet by the geneID, color by the nutrient, and show how expression varied with growth rate for each of those limiting nutrients. For example, here's a plot I made with it.
- It can be joined with the Gene Ontology data so that you can group by functional classifications of genes. For instance, join it with the data from:
yeast_GO <- org.Sc.sgdGO %>%
(You can match that to toTable(GO.db::GOTerm) for the GO terms themselves.) Then see whether, for example, expression for genes in leucine transport changes with growth rate when leucine is a limiting factor.
- It has ~186,000 observations. Big-ish data, but still easily workable in R.
- Published in Molecular Biology of the Cell.
- Sorry, not human/mammalian, but at least molecular bio and genomics.
- Perfect dataset to teach tidyr on, since it requires cleaning: both gather (it comes in one-row-per-gene, not one-row-per-gene-per-sample) and separate (gene names are stuck in the middle of a column, and column names concatenate the nutrient and the growth rate). But if you'd rather skip that you can simply pre-tidy it.
- Perfect dataset to teach my broom package on: perform one linear regression per nutrient per gene (with group_by), then see which genes have significant growth-rate-to-expression relationships in each nutrient. (Or, once joined with GO data, see which groups!) You'd be surprised how accommodating the data is: most relationships are pretty linear. (Gooooo Team Yeast!)
- I've really been impressed with the clarity of the biological story you can tell. When a gene's expression is positively related to dilution (growth) rate, that often means it's involved in that nutrient's metabolism. When it's negatively related to growth rate, that often means it's involved in cell transport of that limiting nutrient: it has to work extra hard to pull that nutrient in when it's rare and not coming in through diffusion.
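To give a feel for the tidying, plotting, and per-group regression steps described in the points above, here is a minimal sketch on a tiny made-up table. It is not the real Brauer file: the `NAME` column layout, the `||` delimiter, sample column names like `G0.05`, and all the values are assumptions chosen only to mimic the shape described (gene ID in the middle of a column, nutrient and growth rate concatenated in the column names). It assumes dplyr, tidyr, broom, and ggplot2 are installed, and uses `group_modify()` as one of several ways to fit a model per group.

```r
library(dplyr)
library(tidyr)
library(broom)
library(ggplot2)

# Toy stand-in for the wide table: one row per gene, one column per sample.
# All names and numbers here are invented for illustration.
wide <- data.frame(
  NAME  = c("alpha || GENE1 || note", "beta || GENE2 || note"),
  G0.05 = c(-0.5, 0.6),
  G0.1  = c(-0.1, 0.2),
  G0.2  = c(0.4, -0.3),
  N0.05 = c(-0.4, 0.5),
  N0.1  = c(0.0, 0.1),
  N0.2  = c(0.5, -0.4),
  stringsAsFactors = FALSE
)

tidy_expr <- wide %>%
  # Split the delimited field; the gene ID sits in the middle.
  separate(NAME, c("label", "gene", "note"), sep = " \\|\\| ") %>%
  # Reshape to one row per gene per sample.
  gather(sample, expression, G0.05:N0.2) %>%
  # First character is the nutrient, the remainder is the dilution rate.
  separate(sample, c("nutrient", "rate"), sep = 1, convert = TRUE)

# Expression against growth rate, colored by nutrient, faceted by gene.
p <- ggplot(tidy_expr, aes(rate, expression, color = nutrient)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ gene)

# One linear regression per gene per nutrient; keep only the slope term.
fits <- tidy_expr %>%
  group_by(gene, nutrient) %>%
  group_modify(~ tidy(lm(expression ~ rate, data = .x))) %>%
  ungroup() %>%
  filter(term == "rate")
```

With the real data, the sign of `estimate` in `fits` is what you would inspect to separate genes that rise with growth rate from genes that fall with it.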
If you're interested I can send you the homework assignment I wrote using this dataset that teaches ggplot2 (I didn't post it here in case you end up drawing from it. Also it was pre-tidyr/dplyr). Only warning is that I may blog about teaching tidyr/dplyr/broom with this dataset if I ever get around to it. Let me know if you have other questions.