library("tidyverse")

Reading FASTA files with Biostrings

Biostrings is one of the core packages of Bioconductor. Use it to import the sequences of a FASTA file into a variable sequences

library("Biostrings")

What’s the class of the sequences object?
What’s the difference between sequences[1] and sequences[[1]]
How many sequences are there in the set?
What’s the reverse sequence of the 4th sequence
What’s the complementary sequence of the 5th sequence
Calculate the GC content for each sequence?
Plot a histogram of the lengths of the sequences (using ggplot2)

Have a look at the documentation of the DNAStringSet class.

Reference genomes

You need to install the Bioconductor package “BSgenome.Scerevisiae.UCSC.sacCer3”.

BiocManager::install("BSgenome.Scerevisiae.UCSC.sacCer3")

Load it using something like

library("BSgenome.Scerevisiae.UCSC.sacCer3")
yeast = BSgenome.Scerevisiae.UCSC.sacCer3

Have a look at the documentation of the BSgenome class.

What are the ten most represented \(k\)-mers for \(k = 4\) for each chromosome. Hint: you may want to use the following functions: oligonucleotideFrequency() and bsapply().
Compare the frequencies of those \(k\)-mers with what would be expected from a genome without nucleotide regularities. Hint: Use a Monte Carlo approach.

Variant call file (VCF)

Use the vcfR package to investigate allele frequencies of some bi-allelic SNPs of the file diff.vcf.

library("vcfR")

Create a data frame containing all SNPs and the AFR, EAS and EUR allele frequencies.

Analysing SAM files 1

There are several packages that can handle BAM/SAM files (for instance, Rsamtools). Here we’ll use GenomicAlignments (a Bioconductor package) .

library("GenomicAlignments")

The SAM data file corresponds to reads from a liquid biopsy mapped on to a lncRNA gene. Alternatively, you can start the exercise using the CSV file.

Compare the output of readGAlignments() and readGAlignmentPairs().

aln = readGAlignments("assets/CATG00000105396_BH01.bam")

aln = readGAlignmentPairs("assets/CATG00000105396_BH01.bam")

length(aln) returns the number of alignment pairs in x
seqlevels(aln) returns the sequence names of the reference genome
strand(aln) gets the strand for each alignment pair
isProperPair(aln) gets the “isProperPair” flag bit
aln[[k]] will return the \(k\)-th alignment pair
start(aln[[k]]) and end(aln[[k]]) extract the start and end positions of the alignment
granges(aln) turns the alignment pairs into a GRanges object

What the distribution of the fragment size?
Compute the average sequencing depth
Compute the sequencing depth at each genomic position
Compute the depth of inserts at each genomic position

The window protection score is defined for a position \(i\) and a window size \(w\) as \(+1\) increments for each read that is fully contained in the window centred about position \(i\) and \(-1\) decrements for each read only partially contained in the window.

Compute and plot the window protection score on the genomic region using \(w = 121\)
Analyse the region using a Fourier Transform

wps = as.ts(wps)
wps = stats::filter(c(window(wps, 1, 300), wps),
                    filter=1/seq(5, 100, 4),
                    method="recursive")[-(1:300)]
wps = wps - mean(wps, trim=.1)
fft = spec.pgram(wps, pad=.3, tap=.3, span=2,
                 plot=FALSE, detrend=TRUE)
plot(fft$freq[fft$freq < .02],
     fft$spec[fft$freq < .02], type="l")