Raw data processing

From BiGCaT Wiki 2G
Jump to: navigation, search


Raw data processing


Let's start with the elephant in the room: in most cases, you will not analyze the raw data you get back from NimbleGen yourself, but use the fully processed data which is also returned. We'll look into this data tomorrow.

This tutorial is meant to show the basic steps involved in analyzing raw data of a ChIP-on-chip experiment performed on the NimbleGen platform. These steps are:

  • Reading in the raw data
  • QC inspection of the raw data
  • Normalization of the raw data
  • QC inspection of the normalized data
  • Finding enriched regions in the normalized data
  • Annotating enriched regions
  • Exporting the results

We do not intend to teach you how to use R on an advanced level in the short amount of time we have, hence we have tried to make the tutorial as clear as possible by hiding difficult code.

The R code below will generate results in the folders Course/Results/Raw data results and Course/Results/Norm data results in your own home directory on the R server.

Slowly work your way through the tutorial. If you get stuck or have any other questions, just give a (polite) shout.

Information about the used dataset

For this tutorial, we will use the data generated by a ChIP-on-chip on Estrogen Receptor Alpha. In this experiment, T47D cells were stimulated with an estrogenic compound ("treated") or vehicle only ("untreated"). This estrogenic compound activates Estrogen Receptor Alpha, which then functions as a transcription factor and heads for its target genes.

Fifty minutes after stimulation, cells were processed using the ChIP-on-chip protocol, with a ChIP against Estrogen Receptor Alpha. Enriched regions identified in the data of the estrogenic compound treated samples will thus correspond to potential Estrogen Receptor Alpha binding places.

For this tutorial, we will use the raw data files you get back from NimbleGen.

There are 5 (estrogenic compound) treated sample data files and 4 (vehicle only) untreated sample data files:

  • Treated 1 (NG116469)
  • Treated 2 (NG108643)
  • Treated 3a and rehybridized treated 3b (NG115172 and NG1327002)
  • Treated pool (pooled treated samples 1-3a) (NG115088)
  • Untreated 1 (NG108139)
  • Untreated 2 (NG114686)
  • Untreated 3 (NG113334)
  • Untreated pool (pooled untreated samples 1 - 3) (NG116399)

For all arrays, the ChIP'd samples are in the red channels and the total DNA samples are in the green channels.



We start with loading all required packages and defining the directories where results will be stored. These directories are Course/Results/Raw data results/ and Course/Results/Norm data results/.

Copy the code below into an R terminal.

# Load some homebrew functions

# Define directories
MAIN.DIR <- paste("/home/michiel/Course/Data Andrea/")
RAWDATA.DIR <- paste(MAIN.DIR, "Raw_data_files/", sep = "")
DESIGN.DIR <- paste(MAIN.DIR, "Design_information/", sep = "")
PIC.DIR <- "~/Course/Results/Raw data results/"
NORMPIC.DIR <- "~/Course/Results/Norm data results/"

Raw data import

The code below will read in all the data required for the subsequent analysis steps. We will read in the raw data, annotation data and a list of known Estrogen Receptor Alpha targets, to compare with afterward. In addition, we will create a look-up table (list) to couple array IDs to treatments. This will be useful later.

Copy the code below into the R terminal.

# Read raw data
#rawData <- nimbleReadRaw() # Will find the original raw data

# Read POS-file and annotation
posData <- nimbleReadPos(dir()[grep(".+pos$", dir())], createEnv = FALSE)

# Read table of known ER alpha targets
knownErTargetList <- read.delim(paste(MAIN.DIR, "ERt_DAVID.txt", sep = ""))

# Create simple hash of slide numbers and corresponding sample description
arrayId <- list()
arrayId[c("U1",      "T2",      "U3",      "U2",      "TP",
               "T3_a",    "UP",      "T1",      "T3_b",
        "108139", "108643", "113334", "114686", "115088",
               "115172", "116399", "116469", "1327002")] =
        c("108139", "108643", "113334", "114686", "115088",
               "115172", "116399", "116469", "1327002",
		 "U1",      "T2",      "U3",      "U2",      "TP",
               "T3_a",    "UP",      "T1",      "T3_b")
rawData$targets$SlideNumber <- as.character(rawData$targets$SlideNumber)

Raw data inspection

NimbleGen scans the arrays in a specific way, which leads to an absence of background values and other criteria that can be used for a full probe-level QC as presented today. We can however make plots and images of the raw data to inspect global quality.

Copy the code below into the R terminal to run the raw data QC. This piece of code runs a script that can be found in /Home/Michiel/R, should you be interested in it's inner workings. The script will take about 15 minutes to run. Resulting images will appear (one by one) in the folder Course/Results/Raw data results in your own home directory on the R server.



As mentioned in one of the talks today, in ChIP-on-chip experiments, the red and green channel often contain very different samples. Hence, it is often a good idea to normalize them separately.

In this example, the red channels in the treated arrays contain very different samples from the red channels in the untreated arrays. Considering we will dichotomize the data after normalization for the enriched region finding, and hence will not have to compare intensities directly, we can normalize the treated and untreated data separately.

As you probably saw in the images generated by the QC script, the treated T3a has an almost saturated red channel, which is especially visible in the intensity density plot (peak farthest to the right). The rehybridized T3b does not suffer from this problem at first sight. When looking at the log-ratio distribution however and the corresponding log-ratio array image, you can clearly see that the red channel still dominates the log-ratio and has the same small spread (suggesting that the same sample was used again, but deluted before rehybridization). The U3 array shows the same effect as the T3b array. Considering the samples for U3 and T3 were prepared on the same day, there might actually be something wrong with the samples, not the arrays.

Although the problems with the U3, T3a and T3b arrays can be caused be a variety of factors for which we could try to correct, we will exclude them from further analysis for now.

We will use three different normalization approaches:

  • Quantile normalization: red and green channels normalized together
  • T-quantile normalization: red and green channels normalized separately
  • NimbleGen normalization: red and green channels scaled together

Copy the code below in the R terminal to normalize all raw data of the 'treated' and 'untreated' arrays. It will take some time to run.


Normalized data inspection

When the code below is run, a script is called to create some pictures of the normalized data. Results can be found in Course/Results/Norm data results folder in your own home directory on the R server.


Inspect the results after the script is done. Was normalization successful for all normalization approaches? Based on the plots, which normalization approach would you choose?

Finding enriched regions

All the pre-processing steps are done, so it's time to find those enriched regions. We will use the ACME algorithm for this.

First, probes are dichotomized based on their log-ratio values. The probes with values that are above a certain threshold (above the 95th percentile value in the example below) are designated as positive probes. Other probes are designated as negative probes. Next, we will look in an area of a particular number of base pairs (500 bp in the example below), centered around a probe, to see if there is a significantly larger amount of positive probes in that area than one would expect randomly. The result of this test is expressed as a p-value for each probe. A low p-value for a probe means that the area around (and including) that probe has a significant amount of positive probes and is hence indicative of enrichment. In short for this example: low p-value means there is a potential Estrogen Receptor Alpha binding site.

To run the ACME algorithm, we need to reformat the data a bit. Luckily, a function can do this for us. Just copy the code below (T-quantile normalized data is used, but you may choose a different one if you feel adventurous).

acmeData <- createACMEdata(maData = cbind(nMADataT.tq, nMADataU.tq),
                           posData = posData)

Next, we can run the ACME algorithm on the data.

windowSize <- 500
threshHold <- 0.95
myACMEresult <- myACME:::do.aGFF.calc(acmeData, window = windowSize,
                                      thresh = threshHold)

After we get the results, we can filter probes (actually probe windows!) based on their p-value. In this example, we want probes that have a low p-value in at least two treated arrays, but do not have a low p-value in the at least two untreated arrays.

Copy the code below into the R terminal.

pCutOff <- 0.05
hitsT <- which(
               myACMEresult@vals[,1] < pCutOff & myACMEresult@vals[,2] < pCutOff
               myACMEresult@vals[,2] < pCutOff & myACMEresult@vals[,3] < pCutOff
               myACMEresult@vals[,3] < pCutOff & myACMEresult@vals[,1] < pCutOff
hitsU <- which(
               myACMEresult@vals[,4] < pCutOff & myACMEresult@vals[,5] < pCutOff
               myACMEresult@vals[,5] < pCutOff & myACMEresult@vals[,6] < pCutOff
               myACMEresult@vals[,6] < pCutOff & myACMEresult@vals[,4] < pCutOff
hits <- hitsT[!is.element(hitsT, hitsU)] 

Each promoter is represented by several probes on the array. We need to summarize the results for each promoter. In this example, we simply check if the promoter contains at least one probe (window) with a low p-value. This is done with the code below. Copy it into the R terminal.

probeResults <- cbind(#promoter ID
                   #FDRs plus info
                   rowMeans(myACMEresult@vals[hits, 1:3]),
                   #genomic coordinates
                   #window size
                   rep(windowSize, length(hits)))

# Combine results for each promoter
probeResults.split <- split(probeResults, probeResults[, 1])
promoterResults <- lapply(probeResults.split, function(x) {
            .temp <- array(x, dim = c(round(length(x)/5), 5))
            return(.temp[which.min(.temp[, 2]), ])

Annotating enriched regions

We have a list of enriched promoters now. The problem is, we don't know of which genes these promoters are, as NimbleGen don't give the annotation for the promoters on their arrays. This is actually a major hurdle in the analysis of NimbleGen raw data.

What we do get however, is the genomic location of each probe. Using these coordinates, we can construct coordinates for each promoter. Once we know where a promoter is located in the genome, we can check which known genes are close to each promoter. If we find a known gene close to a particular promoter, we assume it is functionally associated with this promoter. Backwards-engineering at its best.

In this example, we have already created the annotation for you. The annotation is stored as a list, meaning that if you supply a promoter ID to this list, for example seqEntrezAnnData['P123456789'], you get back all the annotation associated with that promoter.

The code below does this for all enriched promoters identified earlier. Copy it into the R terminal.

ann <- seqEntrezAnnData[names(promoterResults)]

The result is a list named 'ann' which contains the annotation for each enriched promoter. If you want to use the data further in R, this resulting list is fine. In this case, we have two lists: ann and promoterResults. The 1st position in ann (retrieved with ann[1]) contains the annotation for the promoter with ACME data on the 1st position of promoterResults (retrieved with promoterResults[1]). The same goes for the 2nd, 3rd, 4th etc. positions. In most cases however, you will want to export the data to an Excel readable text-file, to use in other programs or send to other people. The ACME data contained in promoterResults and the annotation data contained in ann need to be combined for this. So we combine the 1st entry in promoterResults with the 1st entry in ann, the 2nd with the 2nd, the 3rd with the 3rd, and so on until all data is combined. After that, we reformat the resulting list to be able to export it in a readable format (you may ignore this advanced code).

Copy the code below into the R terminal to do this.

# Combine myACME results and annotation
promoterAnnotation <- NULL
for(i in seq(along = ann)) {

    l <- dim(ann[[i]])[1]
    promoterAnnotation <-
                        t(array(rep(cbind(promoterResults[[i]]), l),
                                dim = c(5, l)))

# Clean up: take multiple entries for the same entrez gene ID together
promoterAnnotation.split <- split(promoterAnnotation, promoterAnnotation[,1])
ERAenrichedRegions <- t(sapply(promoterAnnotation.split, function(x) {
            .temp <- array(x, dim = c(length(x)/10, 10))
            return(cbind(#entrez ID
                         .temp[1, 1],
                         .temp[1, 2],
                         .temp[1, 3],
                         paste(.temp[, 8], collapse = "|"),
                         #TSS location
                         paste(.temp[, 4], collapse = "|"),
                         paste(.temp[, 5], collapse = "|"),
                         paste(.temp[, 7], collapse = "|"),
                         paste(.temp[, 9], collapse = "|"),
                         paste(.temp[, 10], collapse = "|")))
        })) # 'sapply' gives (character) matrix
dimnames(ERAenrichedRegions) <- list(c(), c("entrez_gene_id", "gene_symbol",
                                           "gene_description", "chromosome",
                                           "tss_location", "strand",
                                           "window_location", "window_size"))

# Remove spurious "-1" and "0" entries:
ERAenrichedRegions <- ERAenrichedRegions[-c(1:2),]

Check if known targets are found

We have list of known targets from a previous Estrogen Receptor Alpha target study. There are about 30 validated targets. Let's see if any of these known targets are found.

Copy the code below into the R terminal to find out.

            is.element(knownErTargetList$ENTREZ_GENE_ID, ERAenrichedRegions[,1]),
      quote = FALSE)
print(paste("Number of found targets: ",
      ), quote = FALSE)

Are any known targets found? If you have time and energy left, try different normalization procedures and ACME settings to see if you can find more!

Export data to an Excel readable text file

To export the list of Estrogen Receptor Alpha enriched regions, copy the code below into the R terminal. The file named ERA_Enriched_regions.txt will be saved in the Course/Results/Norm data results folder.

write.table(ERAenrichedRegions, sep = "\t",
            file = "ERA_Enriched_regions.txt",
            row.names = FALSE, quote = FALSE)

Try opening and inspecting the resulting file in Open Office Spreadsheet.

Personal tools