Sequence analysis

From BiGCaT Wiki 2G
Jump to: navigation, search


Sequence Analysis


As soon as you've identified enriched regions in a ChIP or DNA methylation array experiment, it is worth looking at interesting DNA motifs in the sequences of these regions. Think for instance of:

  • Searching for the motif associated to the ChIP'd protein (in silico validation).
  • Searching for motifs that discriminate a particular group of enriched regions. Useful when you have additional information on the targets, such as information about up or down regulation (from transcriptomics)
  • Search for CpG islands in a set of enriched regions or in a set of promoters of identified targets.

Slowly work your way through the tutorial. If you get stuck or have any other questions, just give a (polite) shout.

Information about the used dataset

For this tutorial, we will use the data generated by a ChIP-on-chip on Estrogen Receptor Alpha. In this experiment, T47D cells were stimulated with an estrogenic compound ("treated") or vehicle only ("untreated"). This estrogenic compound activates Estrogen Receptor Alpha, which then functions as a transcription factor and heads for its target genes.

Fifty minutes after stimulation, cells were processed using the ChIP-on-chip protocol, with a ChIP against Estrogen Receptor Alpha. Enriched regions identified in the data of the estrogenic compound treated samples will thus correspond to potential Estrogen Receptor Alpha binding places.

For this tutorial, we will use the processed data files ("Promoter reports") you get back from NimbleGen. This is data that is already completely analyzed for enriched regions and contains all the enriched regions for each array, including all annotation of associated genes.

There are 5 (estrogenic compound) treated sample data files and 4 (vehicle only) untreated sample data files:

  • Treated 1 (NG116469)
  • Treated 2 (NG108643)
  • Treated 3a (bad quality) and rehybridized treated 3b (still bad quality) (NG115172 and NG1327002)
  • Treated pool (pooled treated samples 1-3a) (NG115088)
  • Untreated 1 (NG108139)
  • Untreated 2 (NG114686)
  • Untreated 3 (bad quality) (NG113334)
  • Untreated pool (pooled untreated samples 1 - 3) (NG116399)



We will start with loading all required packages and reading in the data. Results will be automatically stored and can be found in Course -> Results -> Sequence analysis results and GO results folders in your home directory.

# Load some homebrew functions

# Define directories
DATA.DIR <- paste("/home/michiel/Course/Data Andrea/",
            "Processed_data_files/Report_All_Peaks", sep = "")
SEQRES.DIR <- paste("~/Course/Results/Sequence analysis results/", sep = "")
GORES.DIR <- paste("~/Course/Results/GO results/", sep = "")

# Read NimbleGen promoter report data
peakData <- nimbleReadPromoterReport()

Filter genes

NimbleGen uses within array analysis approaches, so we can only compare the results of replicate arrays after they've been completely analyzed. In this example, we want to know all genes that occur in both good quality "treated" arrays, T1 and T2, which are biological replicates. Although the overlap might be low, it will be very relevant.

t1Data <- peakData[["NG116469"]]
t2Data <- peakData[["NG108643"]]
erGenes <- t1Data$ncbi_gene_id[is.element(t1Data$ncbi_gene_id,
erGenes <- unique(erGenes[erGenes != "N/A"])

The result is a list of genes erGenes containing all the genes common between T1 and T2, as NCBI Entrez Gene IDs.

GO enrichment analysis

To give a biological interpretation to our group of targets, we perform a GO enrichment analysis. What this basically does is retrieve all associated GO terms for all the genes and then checks if particular terms are overrepresented compared to a large random set of genes.

We just need to specify a p-value cut-off and give the list of genes to the function GOenrichmentAnalysis(..) to do this.

# Do a GO enrichment analysis
p <- 0.05
resGO <- GOenrichmentAnalysis(erGenes, p)

As always, we will save the results to a tab-delimited text-file. You can find the file topGO_T1_T2_Combined.txt in the folder GO results.

# Export results
write.table(resGO, "topGO_T1_T2_Combined.txt", sep = "\t",
            row.names = FALSE, quote = FALSE)

Next, we select genes from a few interesting processes. Inspection the results (you can open them in Open Office Spread Sheet) and considering that estrogen receptor alpha is involved in cell signaling and cell cycle processes, we choose the genes involved in similar processes (transcription regulation, receptor clustering, protein targeting & localization, MAPK cascade).

These genes have Entrez Gene IDs "23192" (ATG4B), "2316" (FLNA) and "5609" (MAP2K7).

# Select genes for continued analysis
erGenes.subset <- c("23192", "2316", "5609")

Get peak-sequences from Ensembl

Sequences aren't given in the NimbleGen analyzed files, so we need to get those ourselves. We use the function nimbleGetPeakSequence(..). The only things you need to supply to this function is the peak numbers of the enriched regions you are interested in and from which dataset these peak numbers are. You can get them manually from the NimbleGen analyzed file, for instance from Open Office Spreadsheet.

We will retrieve the peaks sequences from the T1 treated array. The file is called 116469_ratio_peaks_mapToFeatures_All_Peaks.xls and located in //home/Michiel/Course/Data Andrea/Processed_data_files/Report_All_Peaks/.

Alternatively, we can get them directly from R:

t1Data$PEAK_ID[match(erGenes.subset, t1Data$ncbi_gene_id)]

In this case, these peak numbers are 1365 (Gene ID 23192), 1376 (Gene ID 2316) and 1255 (Gene ID 5609).

peakIdList <- c(1365, 1376, 1255)
enrichedSeqList <- nimbleGetPeakSequence(peakIdList, t1Data)

Export to FASTA format

R is a nice analysis environment (it is!), but the sequence analysis packages are quite scarce. That means you might want to perform your sequence analysis in a different program or web-application. The easiest way to get the sequences to these programs is using the FASTA exchange format. All we need to do is take the column of the sequence dataframe containing FASTA formatted sequences (named Seq_FASTA) and export it to a textfile using the write.fasta(..) function.


Conserved novel motif analysis

We can do some sequence motif analysis in R, using the cosmo package. Cosmo searches for conserved (or overrepresented) novel motifs in a set of sequences. In case of estrogen receptor alpha this is a very valid method, as the canonical palindrome is almost never found.

You can put various constraints on the motifs you want to search for. In most cases you can use the function constraintBuilder(..) to build such constraints using a GUI (point & click). For some reason, this GUI isn't working correctly at the moment, but it should be fixed in the next version of Bioconductor.

For now, we will manually define the constraint for the motif we want to search. Since the estrogen receptor alpha motif should be palindromic (dimeric protein -> palindromic site), we will force a search for a palindromic site. To do this we make a constraint-set. We specify that the sequence should consist of 3 segments: a fixed size segment of 3 base pairs ("B"), then a variable length segment ("V") and then again a fixed size segment of 3 base pairs ("B"). Next, we state that the first and third segments should be palindromes of each other.

# Create constraint
conSet <- makeConSet(numInt = 3, type = c("B", "V", "B"), length = c(3, NA, 3))

# Construct palindromic constraint: require intervals 1 and 3 to be palindromes
# within reasonable tolerance
palCon1 <- makePalCon(int1 = 1, int2 = 3, errBnd = 0.1)
constraint <- list(palCon1)
int <- list(NA)
conSet <- addCon(conSet = conSet, constraint = constraint, int = int)

Now we can do a cosmo analysis. Copying the code below will open up a window, asking for a FASTA formatted sequence file. Select the one you created with the code above (it is called something like enrichedSeqList$Seq_FASTA.fasta in the Sequence analysis results folder).

# Do a conserved motif analysis
cosmoRes <- cosmo(constraints = conSet, models = "ZOOPS", minW = 9, maxW = 18)

# Plot sequence logos of cosmo results
x11(title = "Enriched sequences motif", width = 7, height = 4)

You can also make a plot of the canonical estrogen receptor alpha motif. Copy the code below to do so. Is the motif you found in anyway similar to the canonical estrogen receptor alpha motif?

# Plot canonical ER-alpha target-site motif
m <- rbind(c(2, 2, 9, 2,  0,  0,  0, 11, 2, 3, 0,  0,  0, 8, 0,  0, 2, 1, 2),
           c(2, 4, 0, 0,  0,  0, 11,  0, 4, 2, 3,  1,  1, 0, 8, 11, 2, 2, 4),
           c(4, 1, 1, 9, 11,  1,  0,  0, 3, 4, 8,  0, 10, 2, 1,  0, 0, 6, 2),
           c(3, 4, 1, 0,  0, 10,  0,  0, 2, 2, 0, 10,  0, 1, 2,  0, 7, 2, 2))
pwm <- makePWM(m / t(array(rep(rowsum(m, rep(1, 4)), 4), dim = c(19, 4))))
x11(title = "Estrogen Receptor alpha binding site motif",
        width = 7, height = 4)

Get promoters from Ensembl

We can also retrieve the whole promoters from Ensembl. To do this, we first need to convert the Entrez Gene IDs we have to Ensembl Gene IDs. This is done with the function entrez2Synonym(..):

ensIdList <- entrez2Synonym(erGenes.subset)$ensemblGeneID

Then we can get the promoters. We only need to specify the size we want and how many base pairs we want downstream of the transcription start site. In this example, we take the promoter size 1500 bp, of which 200 bp is downstream of the TSS and 1300 upstream of the TSS.

promoterList <- getPromoterSequence(ensIdList,
                                    dTSS = 200,
                                    promoterLength = 1500)

Next, we can check the promoters sequences (which are in the second column of promoterList) for presence of CpG islands:

cgiCheckAll(promoterList[, 2])

Webtools for known TF binding site analysis


Although it is nice to search for new motifs, in many cases you are just interested in looking for known motifs, such as transcription factor binding sites. Many online tools exist for this type of analysis, of which a few are mentioned below. CORE_TF and JASPER are both open source and free for academic use.

NB: If you are a NuGO partner, you may have a login code and may use Genomatix. Consider however that although it is er very powerful set of tools, giving a Genomatix tutorial would take up a whole day itself. Hence we suggest only using it if you are already familiar with it.


The website of JASPER is:

JASPER is basically a database of transcription factor binding site motifs. Click on the vertebrates link on the first page. Have a look at the various motifs available. On first glance, do any of them look very similar to the motif(s) you found in R?

You can also have a more thorough look at this by actually supplying the motif we found in R to the database. This can be done on the first page. Go back to the first page and scroll down to the window Align to a custom matrix or IUPAC string.

To get the information required, typing this in R (after running the code above!):


will give the following result:

A 0 0.0000 0.8238 0.3333 0.3333 0 0.6667 0 0.3333  0  0 0.0000  1  0 0.0000 0.0875  0
C 0 0.3667 0.0000 0.3333 0.0000 1 0.0000 0 0.6667  0  0 0.6667  0  0 0.0762 0.5500  1
G 1 0.4500 0.1762 0.3333 0.6667 0 0.3333 0 0.0000  0  1 0.0000  0  1 0.0000 0.2749  0
T 0 0.1833 0.0000 0.0000 0.0000 0 0.0000 1 0.0000  1  0 0.3333  0  0 0.9238 0.0875  0
  • Copy the result above and paste it into the window Align to a custom matrix or IUPAC string on the first page of JASPER.
  • Click on align.
  • All profiles in the selected database will be compared to the input profile, using a modified Needleman-Wunsch algorithm.
  • Results are sorted by raw comparison score (for reference, the maximum score is 2 times the width of the smallest matrix in the compared pair). Both this raw comparison score (score) and fraction of potential maximal score (percent_score) is reported.

Is the motif we found in R similar to any of the motifs in the JASPER database? How similar is it to the estrogen receptor motif in JASPER (ESR2)? Try to find more information about the most similar motifs in Entrez or Pubmed.


With CORE_TF we can look at overrepresentation of particular transcription factor binding sites in one group of sequences compared to another (often random) group of sequences. We will use it to see which transcription factor binding sites are overrepresented in the up-regulated group of estrogen alpha targets compared to the down-regulated group of estrogen alpha targets.

You can use two sets of FASTA sequence files for this. One corresponds to Estrogen Receptor Alpha enriched regions of up-regulated genes, one to enriched regions of down-regulated genes. They were created in the same manner as in the R example above, using additional information about up and down regulation of targets from a public transcriptomics dataset. The FASTA files are called:

Right-click on the links above and save them to any place on the server (just remember where).

The website of CORE_TF is:

To perform the overrepresentation analysis, follow these steps:

  • To start check the box (specifying that you are a non-commercial user) on the first page and click on continue.
  • On the page that opens, mark enter sequence in both the Experimental set and Random set specifications.
  • Click on submit.
  • On the next page, in the first window (experimental set) paste the sequences from the ERA_UP.FASTA file. To do so, open the file by right-clicking on it and selecting from Open with the option Open with Text Editor. Select all the text and copy and paste it into the first window.
  • Do the same for the ERA_DOWN.FASTA file, but paste this in the second window (random set).
  • Click on submit.

You will now be presented with a table, showing all transcription factor motifs in the database. Each has a p-value, specifying if the motif was found significantly more often in the up-regulated target sequences than in the down-regulated target sequences. If you go down to the bottom of the page, you can export the table to tab-delimited text file. You can open this text file in Open Office Spreadsheet.

Have a look at all the transcription factors with p-values smaller than 0.05. You can find more information about them by looking up their names (just the bit before the underscore) in the PAZAR database ( or alternatively Entrez Gene ( or JASPER (

You can repeat the analysis for the down-regulated targets, by following the same steps, but pasting the contents of the ERA_DOWN.FASTA file into the first window (experimental set) and the contents of the ERA_UP.FASTA in the second window (random set).

Personal tools