Adapted from Martin Jones’s Python for Biologists. (Check out his three books: Python for Biologists, Advanced Python for Biologists and Effective Python Development for Biologists.)
Simple regular expressions
Let’s consider the following strings of characters: ‘xkn59438’, ‘yhdck2’, ‘eihd39d9’, ‘chdsye847’, ‘hedle3455’, ‘xjhd53e’, ‘45da’ and ‘de37dp’.
# R
ids = c('xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847',
'hedle3455', 'xjhd53e', '45da', 'de37dp')
# Python
ids = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847',
'hedle3455', 'xjhd53e', '45da', 'de37dp']
Which identifiers
- contain the number 5
- contain the letter d or e
- contain the letters d and e in that order
- contain the letters d and e in that order with a single letter between them
- contain both the letters d and e in any order
- start with x or y
- start with x or y and end with e
- contain three or more digits in a row
- end with d followed by either a, r or p
Use grep() or the stringr library, if you’re using R.
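For the Python version of ids, one possible approach is the standard re module with one pattern per question. This is only a sketch: the labels and the helper name matching are made up for illustration, and "d and e in that order" is assumed to mean the two letters are adjacent (use d.*e instead if anything may sit between them).

```python
import re

ids = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847',
       'hedle3455', 'xjhd53e', '45da', 'de37dp']

# One candidate pattern per question, in the order the questions are listed.
patterns = {
    "contains 5": r"5",
    "contains d or e": r"[de]",
    "d then e (assumed adjacent)": r"de",
    "d, one letter, e": r"d[a-z]e",
    "both d and e, any order": r"d.*e|e.*d",
    "starts with x or y": r"^[xy]",
    "starts with x or y, ends with e": r"^[xy].*e$",
    "three or more digits in a row": r"\d{3,}",
    "ends with d then a, r or p": r"d[arp]$",
}

def matching(pattern, strings):
    """Return the strings in which the pattern matches somewhere."""
    return [s for s in strings if re.search(pattern, s)]

results = {label: matching(p, ids) for label, p in patterns.items()}
for label, hits in results.items():
    print(f"{label}: {hits}")
```

re.search() looks for a match anywhere in the string, so the anchors ^ and $ are only needed for the "start with" and "end with" questions.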
Identifying restriction sites
First, let’s download a DNA sequence from Martin Jones’s Python for Biologists. The curl package allows the user to open connections over the internet.
library("curl")
URL = "https://pythonforbiologists.com/s/dna.txt"
req = curl_fetch_memory(URL)
seq = rawToChar(req$content)
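For readers working in Python, a rough equivalent using only the standard library might look like the sketch below. The function name fetch_sequence is hypothetical, the file is assumed to be UTF-8 text, and whether the URL still serves the file has not been checked here.

```python
from urllib.request import urlopen

URL = "https://pythonforbiologists.com/s/dna.txt"

def fetch_sequence(url=URL):
    """Download a text file and return its content without surrounding whitespace."""
    with urlopen(url) as response:
        # Assumption: the file is plain UTF-8 text.
        return response.read().decode("utf-8").strip()

# Uncomment to download (requires network access):
# seq = fetch_sequence()
```

Stripping whitespace removes a trailing newline, if the file has one, so that it does not end up inside the sequence.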
The goal is to find all restriction sites with the sequence GCRW/TG, where / is the cleavage site.
The IUPAC_CODE_MAP vector (package Biostrings) gives the mapping from the IUPAC nucleotide ambiguity codes to their meaning.
- Find out what R and W stand for.
- Use str_locate_all() and str_match_all() (package stringr) to identify all restriction sites.
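The same search can be sketched in Python with the re module. In the IUPAC code, R is a purine (A or G) and W is a weak base (A or T), so the site becomes the character-class pattern GC[AG][AT]TG. The names find_sites and CUT_OFFSET and the toy sequence are made up for illustration; positions are 0-based, unlike in R. A lookahead is used so that sites sharing a terminal G would also be reported.

```python
import re

# R = A or G, W = A or T (IUPAC ambiguity codes).
# The lookahead (?=...) matches without consuming, so overlapping sites are kept.
SITE = re.compile(r"(?=(GC[AG][AT]TG))")
CUT_OFFSET = 4  # GCRW/TG: the cut falls after the fourth base of the site

def find_sites(dna):
    """Return (start, end, matched_text, cut_position) per site, 0-based."""
    hits = []
    for m in SITE.finditer(dna):
        text = m.group(1)
        start = m.start()
        hits.append((start, start + len(text), text, start + CUT_OFFSET))
    return hits

# Hypothetical toy sequence, not the downloaded one.
toy = "AAGCATTGCCGCGATGTTGCCTTG"
sites = find_sites(toy)
for site in sites:
    print(site)
```

The cut positions can then be used to slice the sequence into fragments, for example by pairing each cut with the next one.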
Repeated As
Find all subsequences in seq made of any number (\(>0\)) of adenosines.
- How many are there?
- Plot a histogram with the number of As in each hit.
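A minimal Python sketch of this exercise, using a made-up toy sequence rather than the downloaded one: the pattern A+ matches each maximal run of one or more As, and a Counter tallies the run lengths. The function name a_runs is hypothetical, and the text histogram stands in for a real plot (matplotlib's plt.hist(run_lengths) would be one option if that library is installed).

```python
import re
from collections import Counter

def a_runs(dna):
    """Return every maximal run of one or more consecutive As."""
    return re.findall(r"A+", dna)

# Hypothetical toy sequence for illustration.
toy = "AATGCAAAATTACA"
runs = a_runs(toy)
run_lengths = [len(r) for r in runs]
length_counts = Counter(run_lengths)

print("number of hits:", len(runs))
# Crude text histogram: one '#' per run of the given length.
for length in sorted(length_counts):
    print(f"{length:>3} {'#' * length_counts[length]}")
```

Because + is greedy, a stretch such as AAAA is reported once as a run of four, not as several shorter hits.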
Tryptic cleavages in a protein
- Create a random protein sequence of a thousand amino acid residues (assume uniform probability).
- Trypsin will cleave the protein right after each lysine or arginine, unless it is followed by a proline.
- What is the distribution of the lengths of the fragments?
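The three steps above can be sketched in Python as follows. The cleavage rule is written as a zero-width pattern: a lookbehind for K or R combined with a negative lookahead for P. The function names, the seed 42 and the choice of the 20 standard amino acids are assumptions made for illustration, and splitting on an empty match requires Python 3.7 or later.

```python
import random
import re
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues (assumption)

def random_protein(length, rng):
    """Draw residues independently and uniformly from AMINO_ACIDS."""
    return "".join(rng.choices(AMINO_ACIDS, k=length))

def tryptic_digest(protein):
    """Cut after every K or R that is not followed by P."""
    # The pattern matches the empty position after K/R, so no residue is lost.
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    # A C-terminal K or R produces a trailing empty piece; drop it.
    return [p for p in pieces if p]

rng = random.Random(42)  # arbitrary seed, only for reproducibility
protein = random_protein(1000, rng)
fragments = tryptic_digest(protein)
fragment_lengths = Counter(len(f) for f in fragments)

print("number of fragments:", len(fragments))
for length in sorted(fragment_lengths):
    print(f"{length:>3} {'#' * fragment_lengths[length]}")
```

Since the split consumes no characters, joining the fragments gives back the original protein, which is a convenient sanity check.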