Adapted from Martin Jones’s Python for Biologists. (Check out his three books: Python for Biologists, Advanced Python for Biologists, and Effective Python Development for Biologists.)

Simple regular expressions

Let’s consider the following strings of characters: ‘xkn59438’, ‘yhdck2’, ‘eihd39d9’, ‘chdsye847’, ‘hedle3455’, ‘xjhd53e’, ‘45da’ and ‘de37dp’.

Which of these identifiers:

  • contain the number 5
  • contain the letter d or e
  • contain the letters d and e in that order
  • contain the letters d and e in that order with a single letter between them
  • contain both the letters d and e in any order
  • start with x or y
  • start with x or y and end with e
  • contain three or more digits in a row
  • end with d followed by either a, r or p

Use grep() or the stringr library, if you’re using R.
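
For readers not using R, the questions above can be approached with Python's built-in re module. The following is a minimal sketch, not a reference solution: the labels in the patterns dictionary and the helper matching() are our own names, and "in that order" is read here as "directly adjacent".

```python
import re

# The eight identifiers from the exercise.
ids = ["xkn59438", "yhdck2", "eihd39d9", "chdsye847",
       "hedle3455", "xjhd53e", "45da", "de37dp"]

# One regular expression per question (labels are our own shorthand).
patterns = {
    "contains 5": r"5",
    "contains d or e": r"[de]",
    "d then e": r"de",                       # assumes "in that order" means adjacent
    "d, one letter, e": r"d[a-z]e",          # [a-z] rather than . so digits do not count
    "d and e, any order": r"d.*e|e.*d",
    "starts with x or y": r"^[xy]",
    "starts with x or y, ends with e": r"^[xy].*e$",
    "three or more digits": r"[0-9]{3,}",
    "ends with d then a, r or p": r"d[arp]$",
}

def matching(pattern, strings):
    """Return the strings in which the pattern occurs anywhere."""
    return [s for s in strings if re.search(pattern, s)]

hits = {label: matching(p, ids) for label, p in patterns.items()}
for label, found in hits.items():
    print(f"{label}: {found}")
```

The same patterns should carry over to grep() or str_detect() in R largely unchanged, since they only use character classes, anchors, alternation and a counted repetition.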

Identifying restriction sites

First, let’s download a DNA sequence from Martin Jones’s Python for Biologists. The curl package lets you open connections over the internet.

The goal is to find all restriction sites with the sequence GCRW/TG, where / is the cleavage site.

The IUPAC_CODE_MAP vector (package Biostrings) gives the mapping from the IUPAC nucleotide ambiguity codes to their meaning.

  • Find out what R and W stand for.
  • Use str_locate_all() and str_match_all() (package stringr) to identify all restriction sites.
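
As an illustration of the idea in Python, the recognition site can be translated into a regular expression and then searched for. This is a sketch under stated assumptions: the dna string is a toy sequence made up for the example (not the downloaded data), and iupac, site_to_regex and cut_offset are hypothetical names. In the IUPAC code, R stands for A or G and W stands for A or T.

```python
import re

# IUPAC ambiguity codes needed for this site: R = A or G, W = A or T.
iupac = {"R": "[AG]", "W": "[AT]"}

def site_to_regex(site):
    """Translate a recognition site with IUPAC codes into a regex, dropping '/'."""
    return "".join(iupac.get(base, base) for base in site.replace("/", ""))

site = "GCRW/TG"
pattern = site_to_regex(site)      # GC[AG][AT]TG
cut_offset = site.index("/")       # number of bases before the cleavage point

# Toy sequence for illustration only, not the course data.
dna = "ATGCATTGCCGCGTTGAAGCAGTGT"

# The lookahead (?=(...)) also reports sites that overlap one another.
sites = [(m.start(), m.group(1)) for m in re.finditer(f"(?=({pattern}))", dna)]
cuts = [start + cut_offset for start, _ in sites]

for (start, matched), cut in zip(sites, cuts):
    print(f"site {matched} at index {start}, cut before index {cut}")
```

Positions here are 0-based, as is usual in Python; the positions returned by str_locate_all() in R are 1-based, so they will differ by one.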

Repeated As

Find all subsequences in seq that consist of any number (\(>0\)) of consecutive adenosines (As).

  • How many are there?
  • Plot a histogram with the number of As in each hit.
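
A possible Python sketch of this exercise follows. It assumes a short made-up seq so that it runs on its own (replace it with the downloaded sequence), and it prints a plain-text histogram instead of drawing a plot so that only the standard library is needed.

```python
import re
from collections import Counter

# Toy sequence for illustration only; replace with the downloaded sequence.
seq = "GAAATCAAAAAGTACCAATTTAG"

# A+ is greedy, so each hit is a maximal run of one or more As.
runs = [m.group() for m in re.finditer(r"A+", seq)]
lengths = [len(run) for run in runs]
print(len(runs), "runs of A")

# Text histogram: run length on the left, one '#' per run of that length.
counts = Counter(lengths)
for length in sorted(counts):
    print(f"{length:2d} | {'#' * counts[length]}")
```

If matplotlib is available, passing lengths to matplotlib.pyplot.hist() would give a graphical version of the same histogram.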

Tryptic cleavages in a protein

  • Create a random protein sequence of a thousand amino acid residues (assume a uniform probability for each residue).
  • Trypsin will cleave the protein right after each lysine or arginine, unless it is followed by a proline.
  • What is the distribution of the fragment lengths?
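
The three steps above could be sketched in Python as follows. The names random_protein and tryptic_digest are hypothetical helpers, the seed is arbitrary, and the cleavage rule is encoded as "K or R not followed by P" using a negative lookahead.

```python
import random
import re
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues

def random_protein(n, rng):
    """Draw n residues independently and uniformly from the 20 amino acids."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(n))

def tryptic_digest(protein):
    """Cut after every K or R that is not followed by P."""
    # Mark each cleavage point with a space, then split on the spaces.
    marked = re.sub(r"([KR])(?!P)", r"\1 ", protein)
    return marked.split()

rng = random.Random(1)  # arbitrary seed, only for reproducibility
protein = random_protein(1000, rng)
fragments = tryptic_digest(protein)

# Distribution of fragment lengths: length -> number of fragments.
length_counts = Counter(len(f) for f in fragments)
for length in sorted(length_counts):
    print(length, length_counts[length])
```

Under the uniform assumption, a given position is a cleavage point with probability of about (2/20) × (19/20) = 0.095, so the fragment lengths should look roughly geometric with a mean near 10 to 11 residues.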