Using BioMart with R

Introduction

BioMart is a web service provided by Ensembl that can be used to interrogate various datasets. For an introduction, see the biomaRt page on the Bioconductor website. Because biomaRt is a Bioconductor package, it must be installed using

BiocManager::install("biomaRt")

then loaded using

library("biomaRt")

In this practical session, we’ll also use the tools provided by the tidyverse:

library("tidyverse")

Using the biomaRt package

BioMart organises the information in a hierarchical way: a mart contains datasets, and each dataset is queried through filters and attributes.

Marts

listEnsembl(GRCh=37)

     biomart               version
1    ensembl      Ensembl Genes 95
2        snp  Ensembl Variation 95
3 regulation Ensembl Regulation 95

ensembl = useEnsembl(biomart="ensembl", GRCh=37)

List the datasets

The datasets available in the selected mart can be listed with:

sets = listDatasets(ensembl)

There are 66 available datasets. Here’s a sample:

                         dataset                            description
1        amexicanus_gene_ensembl            Cave fish genes (AstMex102)
2         etelfairi_gene_ensembl  Lesser hedgehog tenrec genes (TENREC)
3         pvampyrus_gene_ensembl                Megabat genes (pteVam1)
4 itridecemlineatus_gene_ensembl               Squirrel genes (spetri2)
5         sharrisii_gene_ensembl Tasmanian devil genes (Devil_ref v7.0)
6          tguttata_gene_ensembl        Zebra Finch genes (taeGut3.2.4)

Selecting a dataset

In such a long list of datasets, one must filter on a pattern, for instance “sapiens” to find datasets related to Homo sapiens. The grep function allows you to do that.

The following command, for instance, tells you that there’s a “sapiens” match in the 26th element of the column “dataset” of the data frame:

grep("sapiens", sets$dataset)

[1] 26

With value=TRUE, the command returns the matching string itself rather than its position:

grep("sapiens", sets$dataset, value=TRUE)

[1] "hsapiens_gene_ensembl"

Another approach relies on the dplyr and stringr packages (part of the tidyverse). The following syntax is more legible and returns the whole matching row, so I would tend to recommend it:

sets %>% filter(str_detect(dataset, "sapiens"))

                dataset              description
1 hsapiens_gene_ensembl Human genes (GRCh37.p13)
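
For comparison, the same row-wise filtering can be done in base R with grepl(), which returns one TRUE/FALSE value per row. The sketch below is only an illustration: it runs on a small hand-made data frame (sets_toy, a hypothetical name) whose rows are copied from the listings above, so it works without a connection to Ensembl.

```r
# Toy stand-in for the data frame returned by listDatasets(); the rows are
# copied from the listings in this document, not fetched from Ensembl.
sets_toy <- data.frame(
  dataset = c("amexicanus_gene_ensembl",
              "hsapiens_gene_ensembl",
              "tguttata_gene_ensembl"),
  description = c("Cave fish genes (AstMex102)",
                  "Human genes (GRCh37.p13)",
                  "Zebra Finch genes (taeGut3.2.4)"),
  stringsAsFactors = FALSE
)

# grepl() gives a logical vector, which can index the rows directly
hits <- sets_toy[grepl("sapiens", sets_toy$dataset), ]

# The description column can be searched too, here ignoring case
human <- sets_toy[grepl("human", sets_toy$description, ignore.case = TRUE), ]

hits
human
```

Both calls select the same single row here; searching the description is handy when the species prefix of the dataset name is not known in advance.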

The dataset, once identified, can be selected:

data = useDataset("hsapiens_gene_ensembl", mart=ensembl)

Querying the dataset

In order to extract information from the dataset, one must provide a filter criterion for the query, for instance: keep only the entries whose Ensembl Gene ID is “ENSG00000151726”. One must also provide the list of attributes to be returned, for instance the chromosome name.

In order to know which filters are available:

listFilters(data)

             name              description
1 chromosome_name Chromosome/scaffold name
2           start                    Start
3             end                      End
4      band_start               Band Start
5        band_end                 Band End
6    marker_start             Marker Start

In order to know which attributes are available:

listAttributes(data)

                           name                  description         page
1               ensembl_gene_id               Gene stable ID feature_page
2       ensembl_gene_id_version       Gene stable ID version feature_page
3         ensembl_transcript_id         Transcript stable ID feature_page
4 ensembl_transcript_id_version Transcript stable ID version feature_page
5            ensembl_peptide_id            Protein stable ID feature_page
6    ensembl_peptide_id_version    Protein stable ID version feature_page

Below is one example:

query = getBM(mart=data,
              filters="ensembl_gene_id",
              values="ENSG00000151726",
              attributes=c("ensembl_gene_id",
                           "chromosome_name",
                           "ensembl_transcript_id"))

Note that the “values” argument can also be a vector of several values.

Now “query” contains 12 rows, one for each unique combination of the requested attributes. The first six are shown here:

  ensembl_gene_id chromosome_name ensembl_transcript_id
1 ENSG00000151726               4       ENST00000454703
2 ENSG00000151726               4       ENST00000513001
3 ENSG00000151726               4       ENST00000515030
4 ENSG00000151726               4       ENST00000503407
5 ENSG00000151726               4       ENST00000281455
6 ENSG00000151726               4       ENST00000507295
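
As noted above, “values” may hold several IDs. The following is a minimal sketch of such a query, not a definitive recipe. It assumes network access to the Ensembl servers, selects the dataset directly in useEnsembl(), and adds a second gene ID (ENSG00000139618, BRCA2) purely for illustration; the names mart, ids and query_multi are not part of the session above.

```r
library("biomaRt")

# Connect directly to the human gene dataset of the GRCh37 archive
# (this call needs network access to the Ensembl servers)
mart <- useEnsembl(biomart = "ensembl",
                   dataset = "hsapiens_gene_ensembl",
                   GRCh = 37)

# Two gene IDs: the one used above plus BRCA2, added here as an example
ids <- c("ENSG00000151726", "ENSG00000139618")

# Same query as before, but "values" is now a character vector
query_multi <- getBM(mart = mart,
                     filters = "ensembl_gene_id",
                     values = ids,
                     attributes = c("ensembl_gene_id",
                                    "chromosome_name",
                                    "ensembl_transcript_id"))

# Number of returned rows (one per transcript) for each gene
table(query_multi$ensembl_gene_id)
```

When several filters are combined, getBM() expects “values” to be a list with one element per filter, in the same order as “filters”.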

Transcription start sites

How many Ensembl genes are in the database?
How many Ensembl transcripts are in the database?
How many Ensembl genes are associated with ten or more Ensembl transcripts?
Produce a histogram of the number of transcripts per gene.
Produce a histogram of the range of TSS positions among genes.
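
One possible way to approach these questions is to group a query result by gene and summarise it. The sketch below shows only that pattern, on a made-up toy table: the gene names, transcript names and positions are invented, and the threshold is lowered from ten to two so that the toy data gives a non-trivial answer. It assumes the real query returns the columns ensembl_gene_id, ensembl_transcript_id and transcription_start_site; check listAttributes(data) for the exact attribute names.

```r
library("dplyr")

# Made-up toy table standing in for a getBM() result (not real Ensembl data)
toy <- data.frame(
  ensembl_gene_id = c("geneA", "geneA", "geneA", "geneB", "geneC", "geneC"),
  ensembl_transcript_id = paste0("tx", 1:6),
  transcription_start_site = c(100, 150, 400, 1000, 2000, 2050),
  stringsAsFactors = FALSE
)

# Counts of distinct genes and transcripts
n_genes <- n_distinct(toy$ensembl_gene_id)
n_transcripts <- n_distinct(toy$ensembl_transcript_id)

# One row per gene: number of transcripts and spread of TSS positions
per_gene <- toy %>%
  group_by(ensembl_gene_id) %>%
  summarise(
    n_transcripts = n_distinct(ensembl_transcript_id),
    tss_range = max(transcription_start_site) - min(transcription_start_site)
  )

# Genes with at least two transcripts (use 10 on the real data)
n_many <- sum(per_gene$n_transcripts >= 2)

# Histogram of the number of transcripts per gene
hist(per_gene$n_transcripts,
     main = "Transcripts per gene (toy data)",
     xlab = "Number of transcripts")
```

The same hist() call applied to per_gene$tss_range gives the histogram of TSS ranges.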

Cross analysis

Now use the grex package to look at the range of TSS positions among genes based on their biotypes. See the grex package vignette for details.