Reading FASTA files with Biostrings
Biostrings is one of the core packages of Bioconductor. Use it to import the sequences of a FASTA file into a variable sequences
- What’s the class of the
sequences
object? - What’s the difference between
sequences[1]
andsequences[[1]]
- How many sequences are there in the set?
- What’s the reverse sequence of the 4th sequence
- What’s the complementary sequence of the 5th sequence
- Calculate the GC content for each sequence?
- Plot a histogram of the lengths of the sequences (using ggplot2)
Have a look at the documentation of the DNAStringSet class.
Reference genomes
You need to install the Bioconductor package “BSgenome.Scerevisiae.UCSC.sacCer3”.
Load it using something like
Have a look at the documentation of the BSgenome class.
What are the ten most represented \(k\)-mers for \(k = 4\) for each chromosome. Hint: you may want to use the following functions: oligonucleotideFrequency() and bsapply().
Compare the frequencies of those \(k\)-mers with what would be expected from a genome without nucleotide regularities. Hint: Use a Monte Carlo approach.
Variant call file (VCF)
Use the vcfR package to investigate allele frequencies of some bi-allelic SNPs of the file diff.vcf.
Create a data frame containing all SNPs and the AFR, EAS and EUR allele frequencies.
Analysing SAM files 1
There are several packages that can handle BAM/SAM files (for instance, Rsamtools). Here we’ll use GenomicAlignments (a Bioconductor package) .
The SAM data file corresponds to reads from a liquid biopsy mapped on to a lncRNA gene. Alternatively, you can start the exercise using the CSV file.
- Compare the output of
readGAlignments()
andreadGAlignmentPairs()
.
length(aln)
returns the number of alignment pairs in xseqlevels(aln)
returns the sequence names of the reference genomestrand(aln)
gets the strand for each alignment pairisProperPair(aln)
gets the “isProperPair” flag bitaln[[k]]
will return the \(k\)-th alignment pairstart(aln[[k]])
andend(aln[[k]])
extract the start and end positions of the alignmentgranges(aln)
turns the alignment pairs into a GRanges object
What the distribution of the fragment size?
Compute the average sequencing depth
Compute the sequencing depth at each genomic position
Compute the depth of inserts at each genomic position
The window protection score is defined for a position \(i\) and a window size \(w\) as \(+1\) increments for each read that is fully contained in the window centred about position \(i\) and \(-1\) decrements for each read only partially contained in the window.
Compute and plot the window protection score on the genomic region using \(w = 121\)
Analyse the region using a Fourier Transform