Introduction
BioMart is a web service provided by Ensembl that can be used to interrogate various datasets. For an introduction, see the biomaRt package page on Bioconductor. Because biomaRt is a Bioconductor package, it must be installed with BiocManager::install("biomaRt") and then loaded with library(biomaRt).
In this practical session, we'll also use the tools provided by the tidyverse, loaded with library(tidyverse).
Using the biomaRt package
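BioMart organises information hierarchically: the service hosts several marts, each mart contains datasets, and each dataset exposes filters and attributes that can be queried.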
Marts
  biomart     version
1 ensembl     Ensembl Genes 95
2 snp         Ensembl Variation 95
3 regulation  Ensembl Regulation 95
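The command that produced this table is not shown. A minimal sketch, assuming biomaRt is installed and the Ensembl servers are reachable: listEnsembl() lists the Ensembl marts and useEnsembl() connects to one of them. The GRCh = 37 argument and the variable names are assumptions; the argument is inferred from the GRCh37 human assembly that appears later in this session, and recent biomaRt releases may label the gene mart "genes" rather than "ensembl".

```r
library(biomaRt)

# List the marts hosted by Ensembl.
# GRCh = 37 targets the GRCh37 archive (an assumption, not stated in the text).
marts <- listEnsembl(GRCh = 37)
print(marts)

# Connect to the gene mart; `ensembl` is an illustrative variable name.
ensembl <- useEnsembl(biomart = "ensembl", GRCh = 37)
```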
List the datasets
Each mart contains several datasets, which listDatasets() enumerates for a selected mart.
There are 66 available datasets. Here is a sample:
  dataset                          description
1 amexicanus_gene_ensembl          Cave fish genes (AstMex102)
2 etelfairi_gene_ensembl           Lesser hedgehog tenrec genes (TENREC)
3 pvampyrus_gene_ensembl           Megabat genes (pteVam1)
4 itridecemlineatus_gene_ensembl   Squirrel genes (spetri2)
5 sharrisii_gene_ensembl           Tasmanian devil genes (Devil_ref v7.0)
6 tguttata_gene_ensembl            Zebra Finch genes (taeGut3.2.4)
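A sketch of how such a listing might be obtained, under the same assumptions as before (GRCh37 archive, illustrative variable names). The number of datasets and the rows drawn will differ between Ensembl releases and between runs, since the sample is random.

```r
library(biomaRt)

# Connect to the gene mart (GRCh = 37 is an assumption).
ensembl <- useEnsembl(biomart = "ensembl", GRCh = 37)

# One row per dataset, with columns `dataset`, `description` and `version`.
datasets <- listDatasets(ensembl)

# How many datasets does this mart offer?
nrow(datasets)

# Draw six rows at random to get a feel for the content.
datasets[sample(nrow(datasets), 6), c("dataset", "description")]
```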
Selecting a dataset
With such a long list, the practical way to find a dataset is to filter on a pattern, for instance “sapiens” for datasets related to Homo sapiens. The grep function does exactly that.
Applied to the “dataset” column of the data frame, grep reports a “sapiens” match at the 26th value:
[1] 26
Asking grep for the value rather than the position (value = TRUE) returns the match itself:
[1] "hsapiens_gene_ensembl"
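The two grep calls can be sketched on a small stand-in table. The data frame below is hypothetical: it reuses three of the rows shown in this session so that the example runs offline, which is why the match lands at position 2 rather than 26.

```r
# Stand-in for the result of listDatasets(); only three rows, copied from the
# listings in this session, so the example needs no network access.
datasets <- data.frame(
  dataset = c("amexicanus_gene_ensembl",
              "hsapiens_gene_ensembl",
              "tguttata_gene_ensembl"),
  description = c("Cave fish genes (AstMex102)",
                  "Human genes (GRCh37.p13)",
                  "Zebra Finch genes (taeGut3.2.4)"),
  stringsAsFactors = FALSE
)

# Position of the matching element in the `dataset` column.
idx <- grep("sapiens", datasets$dataset)

# The matching string itself.
hit <- grep("sapiens", datasets$dataset, value = TRUE)

print(idx)
print(hit)
```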
Another approach relies on the dplyr and stringr packages (both part of the tidyverse). The resulting syntax is more legible, and I would tend to recommend it:
  dataset                description
1 hsapiens_gene_ensembl  Human genes (GRCh37.p13)
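The tidyverse variant might look like the sketch below, which keeps the rows whose dataset name contains the pattern. The data frame is again a hypothetical three-row stand-in for the real listing, so the example runs offline.

```r
library(dplyr)
library(stringr)

# Stand-in for the result of listDatasets() (rows copied from this session).
datasets <- data.frame(
  dataset = c("amexicanus_gene_ensembl",
              "hsapiens_gene_ensembl",
              "tguttata_gene_ensembl"),
  description = c("Cave fish genes (AstMex102)",
                  "Human genes (GRCh37.p13)",
                  "Zebra Finch genes (taeGut3.2.4)"),
  stringsAsFactors = FALSE
)

# Keep every row whose `dataset` value contains "sapiens".
human <- datasets %>%
  filter(str_detect(dataset, "sapiens"))

print(human)
```

Unlike grep, this returns whole rows, so the description travels with the dataset name.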
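Once identified, the dataset can be selected, for instance with useDataset(). The query further down expects the result to be stored in a variable named data.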
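A possible way to make that selection, assuming the same GRCh37 connection as in the earlier sketches. The name `data` is taken from the getBM() call later in this session; `ensembl` is an illustrative name.

```r
library(biomaRt)

# Connect to the gene mart (GRCh = 37 is an assumption).
ensembl <- useEnsembl(biomart = "ensembl", GRCh = 37)

# Attach the human gene dataset to the mart object.
data <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
```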
Querying the dataset
To extract information from the dataset, a query needs two things: a filter criterion, for instance that the Ensembl Gene ID is “ENSG00000151726”, and a list of attributes to return, for instance the chromosome name.
The available filters can be listed with listFilters():
  name             description
1 chromosome_name  Chromosome/scaffold name
2 start            Start
3 end              End
4 band_start       Band Start
5 band_end         Band End
6 marker_start     Marker Start
The available attributes can be listed with listAttributes():
  name                           description                   page
1 ensembl_gene_id                Gene stable ID                feature_page
2 ensembl_gene_id_version        Gene stable ID version        feature_page
3 ensembl_transcript_id          Transcript stable ID          feature_page
4 ensembl_transcript_id_version  Transcript stable ID version  feature_page
5 ensembl_peptide_id             Protein stable ID             feature_page
6 ensembl_peptide_id_version     Protein stable ID version     feature_page
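Both listings can be sketched as follows. The connection is made in a single step here, passing the dataset directly to useEnsembl(); the GRCh = 37 argument and the names `filters` and `attrs` are assumptions. Since both tables are long, the last line shows one way to narrow them down with grep.

```r
library(biomaRt)

# Connect to the human gene dataset in one call (GRCh = 37 is an assumption).
data <- useEnsembl(biomart = "ensembl",
                   dataset = "hsapiens_gene_ensembl",
                   GRCh = 37)

# Filters restrict which records are returned.
filters <- listFilters(data)
head(filters)

# Attributes are the columns that a query can return.
attrs <- listAttributes(data)
head(attrs)

# Narrow the attribute list to names mentioning "transcript".
attrs[grep("transcript", attrs$name), ]
```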
Below is one example:
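# `data` is the dataset selected earlier in this session.
query <- getBM(mart = data,
               filters = "ensembl_gene_id",      # field to filter on
               values = "ENSG00000151726",       # value the filter must match
               attributes = c("ensembl_gene_id", # columns to return
                              "chromosome_name",
                              "ensembl_transcript_id"))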
Note that the “values” argument can also be a vector of several values.
Now “query” contains 12 rows, one for each unique combination of the requested attributes. The first rows are:
  ensembl_gene_id  chromosome_name  ensembl_transcript_id
1 ENSG00000151726  4                ENST00000454703
2 ENSG00000151726  4                ENST00000513001
3 ENSG00000151726  4                ENST00000515030
4 ENSG00000151726  4                ENST00000503407
5 ENSG00000151726  4                ENST00000281455
6 ENSG00000151726  4                ENST00000507295
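Passing several values at once works the same way. In this sketch the second gene ID (ENSG00000139618, BRCA2) is added purely for illustration and does not come from this session; the GRCh = 37 connection and the names `gene_ids` and `query_multi` are likewise assumptions.

```r
library(biomaRt)

# Connect to the human gene dataset (GRCh = 37 is an assumption).
data <- useEnsembl(biomart = "ensembl",
                   dataset = "hsapiens_gene_ensembl",
                   GRCh = 37)

# A character vector of gene IDs; the second one is illustrative only.
gene_ids <- c("ENSG00000151726", "ENSG00000139618")

# One query for all IDs; rows for both genes come back in a single data frame.
query_multi <- getBM(mart = data,
                     filters = "ensembl_gene_id",
                     values = gene_ids,
                     attributes = c("ensembl_gene_id",
                                    "chromosome_name"))

print(query_multi)
```

Sending all identifiers in one call is generally preferable to looping over getBM(), since each call is a separate request to the server.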
Transcription start sites
- How many Ensembl genes are in the database?
- How many Ensembl transcripts are in the database?
- How many Ensembl genes are associated with ten or more Ensembl transcripts?
- Produce a histogram of the number of transcripts per gene.
- Produce a histogram of the range of TSS positions among genes.
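As a starting point for these questions, here is a sketch of the group-and-summarise pattern on a toy table. Every value below is made up; in practice the table would come from a getBM() call returning gene IDs, transcript IDs and a TSS attribute (the column name transcription_start_site is assumed here, so check it against the attribute list of your dataset).

```r
library(dplyr)

# Toy stand-in for a getBM() result: three genes, six transcripts.
tss <- data.frame(
  ensembl_gene_id = c("geneA", "geneA", "geneA", "geneB", "geneB", "geneC"),
  ensembl_transcript_id = c("tx1", "tx2", "tx3", "tx4", "tx5", "tx6"),
  transcription_start_site = c(100, 150, 400, 2000, 2000, 9000),
  stringsAsFactors = FALSE
)

# One row per gene: number of transcripts and spread of TSS positions.
per_gene <- tss %>%
  group_by(ensembl_gene_id) %>%
  summarise(
    n_transcripts = n_distinct(ensembl_transcript_id),
    tss_range = max(transcription_start_site) - min(transcription_start_site)
  )

print(per_gene)

# Distribution of the number of transcripts per gene.
hist(per_gene$n_transcripts,
     main = "Transcripts per gene (toy data)",
     xlab = "Number of transcripts")
```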
Cross analysis
Now use the grex package to look at the range of TSS positions among genes, broken down by gene biotype. The grex package vignette describes how to use the package.