Regular expressions

Adapted from Martin Jones’s Python for biologists. (Check out his three books: Python for biologists, Advanced Python for biologists, and Effective Python development for biologists.)

Simple regular expressions

Let’s consider the following strings of characters: ‘xkn59438’, ‘yhdck2’, ‘eihd39d9’, ‘chdsye847’, ‘hedle3455’, ‘xjhd53e’, ‘45da’ and ‘de37dp’.

# R
ids = c('xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847',
        'hedle3455', 'xjhd53e', '45da', 'de37dp')

# Python
ids = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847',
       'hedle3455', 'xjhd53e', '45da', 'de37dp']

Which identifiers

contain the number 5
contain the letter d or e
contain the letters d and e in that order
contain the letters d and e in that order with a single letter between them
contain both the letters d and e in any order
start with x or y
start with x or y and end with e
contain three or more digits in a row
end with d followed by either a, r or p

Use grep() or the stringr library, if you’re using R.
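
For Python users, the questions above can be approached with re.search(). The sketch below proposes one candidate pattern per question; the patterns dictionary and the matching() helper are illustrative names, not part of the original exercise, and other patterns would work equally well.

```python
import re

ids = ['xkn59438', 'yhdck2', 'eihd39d9', 'chdsye847',
       'hedle3455', 'xjhd53e', '45da', 'de37dp']

# One candidate pattern per question (a possible answer, not the only one).
patterns = {
    "contains 5": r"5",
    "contains d or e": r"[de]",
    "d then e": r"de",
    "d, one letter, e": r"d[a-z]e",
    "d and e, any order": r"d.*e|e.*d",
    "starts with x or y": r"^[xy]",
    "starts with x or y, ends with e": r"^[xy].*e$",
    "three or more digits in a row": r"\d{3,}",
    "ends with d then a, r or p": r"d[arp]$",
}

def matching(pattern, strings):
    """Return the strings in which the pattern is found anywhere."""
    return [s for s in strings if re.search(pattern, s)]

for label, pattern in patterns.items():
    print(f"{label:35s} {pattern:12s} {matching(pattern, ids)}")
```

re.search() looks for the pattern anywhere in the string, so the anchors ^ and $ are needed only for the "start with" and "end with" questions.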

Identifying restriction sites

First let’s download a DNA sequence from Martin Jones’s Python for biologists. The curl package allows the user to open connections through the internet.

library("curl")
URL = "https://pythonforbiologists.com/s/dna.txt"
req = curl_fetch_memory(URL)
seq = rawToChar(req$content)

The goal is to find all restriction sites with the sequence GCRW/TG, where / marks the cleavage site.

The IUPAC_CODE_MAP vector (package Biostrings) gives the mapping from the IUPAC nucleotide ambiguity codes to their meaning.

Find out what R and W stand for.
Use str_locate_all() and str_match_all() (package stringr) to identify all restriction sites.
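
A Python sketch of the same search is given here, assuming the standard IUPAC meanings R = A or G and W = A or T. The names IUPAC, site_to_regex(), find_sites() and toy_seq are hypothetical helpers introduced for illustration, and toy_seq is a made-up sequence standing in for the downloaded one. Positions are 0-based.

```python
import re

# Only the ambiguity codes needed for this site (assumed IUPAC meanings).
IUPAC = {"R": "[AG]", "W": "[AT]"}

def site_to_regex(site):
    """Turn a recognition site such as 'GCRW/TG' into a regex string."""
    return "".join(IUPAC.get(base, base) for base in site.replace("/", ""))

def find_sites(seq, site="GCRW/TG"):
    """Return (start, matched sequence, cut position) for every site in seq."""
    offset = site.index("/")  # number of bases before the cleavage point
    # The lookahead lets overlapping sites be reported as well.
    pattern = re.compile(f"(?=({site_to_regex(site)}))")
    return [(m.start(), m.group(1), m.start() + offset)
            for m in pattern.finditer(seq)]

toy_seq = "AAGCATTGCCCGCGATGTT"  # hypothetical test sequence
for start, hit, cut in find_sites(toy_seq):
    print(start, hit, cut)
```

The cut position is the index of the first base after the cleavage point, so seq[:cut] and seq[cut:] are the two sides of that cut.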

Repeated As

Find all subsequences in seq made of one or more consecutive adenines (A).

How many are there?
Plot a histogram with the number of As in each hit.
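
In Python, the runs of As can be collected with re.findall() and tallied with collections.Counter. This is only a sketch: a_run_lengths() and toy_seq are hypothetical names, the toy sequence replaces the downloaded one, and the histogram is printed as text rather than plotted.

```python
import re
from collections import Counter

def a_run_lengths(seq):
    """Return the length of every maximal run of one or more As in seq."""
    return [len(run) for run in re.findall(r"A+", seq)]

toy_seq = "AATGCAAAATTAGA"  # hypothetical test sequence
lengths = a_run_lengths(toy_seq)
print("number of hits:", len(lengths))

# Text histogram: run length, then one '#' per hit of that length.
counts = Counter(lengths)
for run_length in sorted(counts):
    print(f"{run_length:3d} {'#' * counts[run_length]}")
```

Because + is greedy, each hit is a maximal run, so a stretch such as AAAA counts once with length 4 rather than as several shorter hits.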

Tryptic cleavages in a protein

Create a random protein sequence of one thousand amino acid residues (assume that each of the 20 standard amino acids is equally likely at every position).
Trypsin cleaves the protein right after each lysine (K) or arginine (R), unless that residue is followed by a proline (P).
What is the distribution of the fragment lengths?
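
One possible Python sketch of this exercise follows. The cleavage rule is written as a zero-width pattern: a lookbehind for K or R and a negative lookahead for P. Splitting on an empty match requires Python 3.7 or later. The names AMINO_ACIDS, random_protein() and tryptic_digest() are hypothetical, and the seed is fixed only to make the run reproducible.

```python
import random
import re
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes

def random_protein(length=1000, seed=None):
    """Draw each residue independently and uniformly from AMINO_ACIDS."""
    rng = random.Random(seed)
    return "".join(rng.choices(AMINO_ACIDS, k=length))

def tryptic_digest(protein):
    """Cut after every K or R that is not followed by P."""
    # (?<=[KR]) : the position is preceded by K or R
    # (?!P)     : and the next residue is not P
    fragments = re.split(r"(?<=[KR])(?!P)", protein)
    # A protein ending in K or R yields an empty trailing piece; drop it.
    return [frag for frag in fragments if frag]

protein = random_protein(1000, seed=1)
fragments = tryptic_digest(protein)
length_counts = Counter(len(frag) for frag in fragments)

print("number of fragments:", len(fragments))
for frag_length in sorted(length_counts):
    print(f"{frag_length:3d} {'#' * length_counts[frag_length]}")
```

Joining the fragments back together should give the original sequence, which is a quick sanity check on the splitting pattern.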