Expression profiling


In the field of molecular biology, gene expression profiling measures the activity of thousands of genes at once, creating a global picture of cellular function. These profiles can, for example, distinguish cells that are actively dividing from those that are not, or show how cells react to a particular treatment. Many experiments of this sort measure an entire genome simultaneously, that is, every gene present in a particular cell. Microarray technology[1] or tag-based techniques, such as serial analysis of gene expression (SAGE), are often used for gene expression profiling.

Expression profiling represents a logical next step to sequencing a genome. The sequence tells us what the cell could possibly do, whereas the expression profile tells us what it is actually doing now. Genes contain the instructions for making messenger RNA (mRNA), but at any given moment, particular cells make mRNA from only a fraction of the genes they carry. One can think of genes as being "on" when the cell makes mRNA from them, and "off" when the cell does not. A host of factors, such as the time of day, whether or not the cell is actively dividing, its local environment, and chemical signals from other cells, determine whether a gene is on or off. Skin cells, liver cells and nerve cells turn on (express) somewhat different genes, and that is in large part what makes them different. Working backwards from an expression profile (which genes are on and off), one can therefore deduce a cell's type, its state, environment, and so forth.

Expression profiling experiments often take the form of looking at the relative amount of mRNA expressed in two or more experimental conditions. This is interesting because altered levels of a specific sequence of mRNA suggest a changed need for the protein coded for by the mRNA, perhaps indicating a homeostatic response or a pathological condition. For example, if one sees higher levels of mRNA coding for alcohol dehydrogenase, one might infer that the cells or tissues under study are responding to increased levels of ethanol in their environment. Similarly, if one notices that breast cancer cells express higher levels of mRNA associated with a particular transmembrane receptor than normal cells do, one might take this as evidence that this receptor plays a role in breast cancer. It could be that a drug that interferes with this receptor could prevent or treat breast cancer. While developing this drug, one may be interested in performing gene expression profiling experiments to help assess its toxicity, perhaps by looking for changing levels in the expression of cytochrome P450 genes, which may be a biomarker of drug metabolism.[2] Gene expression profiling may become an important diagnostic test.[3][4]
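The comparison of relative mRNA levels between two conditions is commonly summarized as a fold change, often on a log2 scale. The following is a minimal sketch of that arithmetic only. The abundance values and the names `adh_control` and `adh_ethanol` are invented for illustration and do not come from any real experiment.

```python
import math

def log2_fold_change(treated, control):
    """Log2 ratio of mean expression in treated vs. control replicates."""
    mean_treated = sum(treated) / len(treated)
    mean_control = sum(control) / len(control)
    return math.log2(mean_treated / mean_control)

# Hypothetical normalized mRNA abundances (arbitrary units), three replicates each.
adh_control = [100.0, 110.0, 90.0]
adh_ethanol = [390.0, 410.0, 400.0]

lfc = log2_fold_change(adh_ethanol, adh_control)
print(f"log2 fold change: {lfc:.2f}")  # means are 400 and 100, so log2(4) = 2.00
```

A log2 fold change of 2 corresponds to a four-fold increase. On this scale, increases and decreases of the same magnitude are symmetric around zero.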

Gene expression profiles compared to proteomics

The human genome contains on the order of 25,000 genes, which work in concert to produce on the order of 1,000,000 distinct proteins. This is because cells make important changes to proteins through posttranslational modification after they first construct them, so a given gene serves as the basis for many possible versions of a particular protein. In any case, a single mass spectrometry experiment can identify about 2,000 proteins,[5] or 0.2% of the total. While knowledge of the precise proteins a cell makes (proteomics) is more relevant than knowing how much messenger RNA is made from each gene, gene expression profiling provides the most global picture possible in a single experiment.

Expression profiles answer hypothesis testing, generation and other questions

In many cases, scientists already have an idea what is going on, a hypothesis, and they perform an expression profiling experiment with the idea of potentially disproving this hypothesis. In other words, the scientist is making a specific prediction about level of expression that could turn out to be false.

In other cases, expression profiling takes place before enough is known about how genes interact with experimental conditions for a testable hypothesis to exist. With no hypothesis, there is nothing to disprove, but expression profiling can nonetheless help to identify a candidate hypothesis for future experiments. Most early expression profiling experiments, and many current ones, take this form,[6] which is known as class discovery. In this approach, genes or samples are grouped together using k-means or hierarchical clustering.
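As a rough illustration of class discovery, the sketch below groups a handful of genes by their expression pattern using a plain k-means loop. It is a simplified sketch and not a production method: the four profiles are invented, the initial centroids are simply the first k profiles, and the number of iterations is fixed.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=20):
    """Plain k-means; initial centroids are the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical log-ratio profiles of four genes across four samples.
profiles = [
    [2.0, 2.1, -1.9, -2.0],   # gene_a
    [-2.0, -1.8, 2.2, 1.9],   # gene_b
    [1.8, 2.2, -2.1, -1.8],   # gene_c
    [-2.1, -2.0, 1.8, 2.1],   # gene_d
]

labels = k_means(profiles, k=2)
print(labels)  # [0, 1, 0, 1]: gene_a with gene_c, gene_b with gene_d
```

Real analyses typically use many random restarts or hierarchical clustering, because the result of k-means depends on the starting centroids and on the chosen k.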

Class prediction allows one to answer questions of direct clinical significance such as, given this profile, what is the probability that this patient will respond to this drug? This requires many examples of profiles that responded and did not respond, as well as cross-validation techniques to discriminate between them.
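The role of cross-validation in class prediction can be sketched as follows. This example uses a nearest-centroid rule with leave-one-out cross-validation, both chosen here purely for brevity; the source does not name a particular classifier. The profiles and the labels "responder" and "non-responder" are invented.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(profiles):
    """Mean profile of a group of samples."""
    return [sum(col) / len(profiles) for col in zip(*profiles)]

def predict(profile, training):
    """Return the label whose class centroid is nearest to the profile."""
    by_label = {}
    for prof, label in training:
        by_label.setdefault(label, []).append(prof)
    return min(
        by_label,
        key=lambda lab: squared_distance(profile, centroid(by_label[lab])),
    )

def leave_one_out_accuracy(samples):
    """Hold out each sample in turn, train on the rest, and score the prediction."""
    correct = 0
    for i, (profile, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]
        if predict(profile, training) == label:
            correct += 1
    return correct / len(samples)

# Hypothetical three-gene profiles for patients with known drug response.
samples = [
    ([2.0, 0.1, -1.0], "responder"),
    ([1.8, 0.0, -1.2], "responder"),
    ([2.2, -0.1, -0.9], "responder"),
    ([-1.0, 0.2, 1.5], "non-responder"),
    ([-1.2, 0.0, 1.8], "non-responder"),
    ([-0.9, 0.1, 1.6], "non-responder"),
]

accuracy = leave_one_out_accuracy(samples)
print(f"leave-one-out accuracy: {accuracy:.2f}")
```

Because each held-out sample is never used to build the centroids it is tested against, the accuracy estimates how the rule might perform on a new patient. With so few, well-separated invented samples the estimate is optimistic.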

Expression profiling leads to relatively short published gene lists

In general, expression profiling studies report those genes that showed statistically significant differences under changed experimental conditions. This is typically a small fraction of the genome for several reasons. First, different cells and tissues express a subset of genes as a direct consequence of cellular differentiation so many genes are turned off. Second, many of the genes code for proteins that are required for survival in very specific amounts so many genes do not change. Third, the cell has many mechanisms to regulate proteins in addition to the obvious one of altering the amount of mRNA, so these genes may stay consistently expressed even when protein concentrations are rising and falling. Fourth, financial constraints limit expression profiling experiments to a small number of observations of the same gene under identical conditions, reducing the statistical power of the experiment, making it impossible for the experiment to identify important but subtle changes. Finally, it takes a great amount of effort to discuss the biological significance of each regulated gene, so scientists often limit their discussion to a subset. Newer microarray analysis techniques automate certain aspects of attaching biological significance to expression profiling results, but this remains a very difficult problem.

The relatively short length of gene lists published from expression profiling experiments limits the extent to which experiments performed in different laboratories appear to agree. Placing expression profiling results in a publicly accessible microarray database makes it possible for researchers to assess expression patterns beyond the scope of published results, perhaps identifying similarity with their own work.

Measurement techniques used to validate high throughput measurements

Both DNA microarrays and QPCR exploit the preferential binding or "base pairing" of complementary nucleic acid sequences, and both are used in gene expression profiling, often in a serial fashion. While high throughput DNA microarrays lack the quantitative accuracy of QPCR, it takes about the same time to measure the gene expression of a few dozen genes via QPCR as it would to measure an entire genome using DNA microarrays. So it often makes sense to perform semiquantitative DNA microarray analysis experiments to identify candidate genes, then perform QPCR on some of the most interesting candidate genes to validate the microarray results. Tag-based approaches like SAGE produce very accurate gene expression profiles, which represent and accurately quantify almost all expressed genes in a single analysis. Other experiments, such as a western blot of some of the protein products of differentially expressed genes, make conclusions based on the expression profile more persuasive, since mRNA levels do not necessarily correlate with the amount of expressed protein.

Statistical challenges in assembling gene lists

Data analysis of microarrays has become an area of intense research.[7] Simply stating that a group of genes was regulated by at least two-fold, once a common practice, lacks a solid statistical footing. With five or fewer replicates in each group, typical for microarrays, a single outlier observation can create an apparent difference greater than two-fold. In addition, arbitrarily setting the bar at two-fold is not biologically sound, as it eliminates from consideration many genes with obvious biological significance.
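The effect of a single outlier on a fold-change cutoff can be seen with a few invented numbers. In this sketch the replicate values are hypothetical. Comparing against the median is included only to show how sensitive the mean is, and is not a method recommended by the source.

```python
from statistics import mean, median

# Hypothetical expression values (arbitrary units), three replicates per group.
control = [100.0, 105.0, 95.0]
treated = [102.0, 98.0, 400.0]  # the third replicate is an invented outlier

mean_fold = mean(treated) / mean(control)        # 200 / 100
median_fold = median(treated) / median(control)  # 102 / 100

print(f"fold change from means:   {mean_fold:.2f}")    # 2.00
print(f"fold change from medians: {median_fold:.2f}")  # 1.02
```

Two of the three treated replicates are essentially unchanged, yet the mean-based ratio reaches the two-fold bar because of one observation.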

Rather than using a fold change cutoff to identify genes of interest, one can select differentially expressed genes using a variety of statistical tests or omnibus tests such as ANOVA, all of which consider both fold change and variability in combination to create a p-value, an estimate of how frequently we would observe our data by chance alone. Applying these to microarrays is not as easy as it might seem, because of the large number of multiple comparisons (genes) involved. For example, a p-value of 0.05 is typically thought to indicate significance, since it estimates a 5% probability of observing the data by chance. With 10,000 genes on a microarray, however, using p < 0.05 as a measure of significance would identify about 500 genes as significant even if there were no difference between the experimental groups. One obvious solution is to consider significant only those genes meeting a much more stringent p-value criterion, e.g., by using a Bonferroni correction, or using a false discovery rate to adjust p-values in proportion to the number of parallel tests involved. Unfortunately, these approaches may reduce the number of significant genes to zero, even when genes are in fact differentially expressed. Current statistics such as Rank products aim to strike a balance between false discovery of genes due to chance variation and falsely calling genes insignificant when they are in fact differentially expressed. Commonly cited methods include SAM,[8] and a wide variety of methods are available from Bioconductor.
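The multiple-comparison arithmetic and the two corrections mentioned above can be sketched in a few lines. The Benjamini-Hochberg step-up procedure is used here as one common way to control the false discovery rate; the source does not say which procedure it has in mind. The ten p-values are invented.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests passing alpha when no true differences exist."""
    return n_tests * alpha

def bonferroni_threshold(n_tests, alpha):
    """Per-test p-value cutoff under a Bonferroni correction."""
    return alpha / n_tests

def benjamini_hochberg(p_values, fdr):
    """Indices of tests declared significant by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its rank-scaled threshold.
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])

print(expected_false_positives(10_000, 0.05))  # 500.0
print(bonferroni_threshold(10_000, 0.05))      # 5e-06

# Hypothetical p-values for ten genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bonferroni_hits = [i for i, p in enumerate(p_values)
                   if p <= bonferroni_threshold(len(p_values), 0.05)]
bh_hits = benjamini_hochberg(p_values, fdr=0.05)
print(bonferroni_hits)  # [0]
print(bh_hits)          # [0, 1]
```

In this toy list, five genes pass an uncorrected p < 0.05 cutoff, one survives Bonferroni, and two survive the false discovery rate procedure, which shows how much more conservative the corrected criteria are.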

Selecting a different test will generally identify a different list of significant genes,[9] since each test operates under a specific set of assumptions and places a different emphasis on certain features in the data. Many tests begin with the assumption of a normal distribution in the data, because that seems like a sensible starting point and often produces results that appear more significant. Some tests consider the joint distribution of all gene observations to estimate general variability in measurements,[10] while others look at each gene in isolation. Many modern microarray analysis techniques involve bootstrapping, machine learning or Monte Carlo methods.[11] As the number of replicate measurements in a microarray experiment increases, various statistical approaches yield increasingly similar results, but lack of concordance between different statistical methods makes array results appear less trustworthy. The MAQC Project[12] makes recommendations to guide researchers in selecting more standard methods so that experiments performed in different laboratories will agree better.

Gene annotation challenges

While the statistics may reliably identify which gene products change under experimental conditions, making biological sense of expression profiling rests on knowing which protein each gene product makes and what function this protein performs, a process called gene annotation. Some annotations are more reliable than others; some are absent. Gene annotation databases change regularly, and various databases refer to the same protein by different names, reflecting a changing understanding of protein function. Use of standardized gene nomenclature helps address the naming aspect of the problem, but exact matching of transcripts to genes[13][14] remains an important consideration.

Categorizing regulated genes

Having identified some set of regulated genes, the next step in expression profiling involves looking for patterns within the regulated set. Do the proteins made from these genes perform similar functions? Are they chemically similar? Do they reside in similar parts of the cell? Gene ontology analysis provides a standard way to define these relationships. Gene ontologies start with very broad categories, e.g., "metabolic process", and break them down into smaller categories, e.g., "carbohydrate metabolic process", and finally into quite restrictive categories like "inositol and derivative phosphorylation".

Genes have other attributes beside biological function, chemical properties and cellular location. One can compose sets of genes based on proximity to other genes, association with a disease, and relationships with drugs or toxins. The Molecular Signatures Database [15] and the Comparative Toxicogenomics Database [16] are examples of resources to categorize genes in numerous ways.

Finding patterns among regulated genes

Having categorized regulated genes in terms of what they are and what they do, important relationships between genes may emerge. Scanning the recent scientific literature, one may notice that a substantial number of regulated genes appear in the context of an important disease, or that a substantial number control each other in a known process. For example, we might see evidence that a certain gene creates a protein to make an enzyme that activates a protein to turn on a second gene on our list. This second gene may be a transcription factor that regulates yet another gene from our list. Observing these links, we may begin to suspect that they represent much more than chance associations in the results, and that they are all on our list because of an underlying biological process. On the other hand, it could be that if one selected genes at random, one might find many that seem to have something in common.

Significant gene set enrichment may suggest cause and effect relationships

Fairly straightforward statistics provide estimates of whether associations between genes on lists are greater than what one would expect by chance. These statistics are interesting, even if they represent a substantial oversimplification of what is really going on. Here is an example. Suppose there are 10,000 genes in an experiment, only fifty (0.5%) of which play a known role in making cholesterol. The experiment identifies two hundred regulated genes. Forty of the two hundred (20%) regulated genes turn out to be on a list of cholesterol genes as well. Based on the overall prevalence of the cholesterol genes (0.5%), one expects an average of one cholesterol gene for every two hundred regulated genes, that is, 0.005 times 200. The expectation of one out of two hundred is an average, so one expects to see more than one some of the time. The question becomes how often we would see forty instead of one due to pure chance.

According to the hypergeometric distribution, one would expect to try about 10^57 times (1 followed by 57 zeroes) before picking thirty-nine or more of the cholesterol genes from a pool of 10,000 by drawing two hundred genes at random. Whether or not one pays much attention to how infinitesimally small the probability of observing this by chance is, one would conclude that the regulated gene list is enriched[18] in genes with a known cholesterol association.
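The enrichment calculation in this example can be reproduced with exact integer arithmetic. The sketch below computes the hypergeometric upper tail, that is, the probability of drawing at least a given number of cholesterol genes when two hundred genes are picked at random from 10,000 of which fifty are cholesterol genes. The function name `hypergeom_tail` is an illustrative choice, and the threshold is a parameter so that either forty or thirty-nine can be tried.

```python
from fractions import Fraction
from math import comb

def hypergeom_tail(population, successes, draws, at_least):
    """P(X >= at_least) when drawing `draws` items without replacement."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(at_least, upper + 1)
    )
    return Fraction(favourable, total)

population, cholesterol_genes, regulated = 10_000, 50, 200

# Expected overlap by chance: 200 * 50 / 10,000 = 1 gene.
expected_overlap = Fraction(regulated * cholesterol_genes, population)
print(expected_overlap)

# Chance probabilities for the overlaps discussed in the text.
p_at_least_40 = hypergeom_tail(population, cholesterol_genes, regulated, 40)
p_at_least_39 = hypergeom_tail(population, cholesterol_genes, regulated, 39)
print(f"P(overlap >= 40) = {float(p_at_least_40):.3e}")
print(f"P(overlap >= 39) = {float(p_at_least_39):.3e}")
```

Using `Fraction` keeps the result exact until the final conversion to a float, which matters here because the probabilities are far too small for naive floating-point summation of binomial coefficients to be trusted.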

One might further hypothesize that the experimental treatment regulates cholesterol, because the treatment seems to selectively regulate genes associated with cholesterol. While this may be true, there are a number of reasons why making this a firm conclusion based on enrichment alone represents an unwarranted leap of faith. One previously mentioned issue has to do with the observation that gene regulation may have no direct impact on protein regulation: even if the proteins coded for by these genes do nothing other than make cholesterol, showing that their mRNA is altered does not directly tell us what is happening at the protein level. It is quite possible that the amount of these cholesterol-related proteins remains constant under the experimental conditions. Second, even if protein levels actually do change, perhaps there is always enough of them around to make cholesterol as fast as it can be possibly made, that is, another protein, not on our list, is the rate determining step in the process of making cholesterol. Finally, proteins typically play many roles, so these genes may be regulated not because of their shared association with making cholesterol but because of a shared role in a completely independent process.

Bearing the foregoing caveats in mind, while gene profiles do not in themselves prove causal relationships between treatments and biological effects, they do offer unique biological insights that would often be very difficult to arrive at in other ways.


Expression profiling provides exciting new information about what genes do under various conditions. Overall, microarray technology produces reliable expression profiles.[19] From this information one can generate new hypotheses about biology or test existing ones. However, the size and complexity of these experiments often result in a wide variety of possible interpretations. In many cases, analyzing expression profiling results takes far more effort than performing the initial experiments.

Most researchers use multiple statistical methods and exploratory data analysis before publishing their expression profiling results, coordinating their efforts with a biostatistician or other expert in microarray technology. Good experimental design, adequate biological replication and follow-up experiments play key roles in successful expression profiling experiments.


  1. ^ Microarrays Factsheet. Retrieved on 2007-12-28.
  2. ^ Suter L, Babiss LE, Wheeldon EB (2004). "Toxicogenomics in predictive toxicology in drug development". Chem. Biol. 11 (2): 161–71. doi:10.1016/j.chembiol.2004.02.003. PMID 15123278.
  3. ^ Magic Z, Radulovic S, Brankovic-Magic M (2007). "cDNA microarrays: identification of gene signatures and their application in clinical practice". J BUON 12 Suppl 1: S39–44. PMID 17935276.
  4. ^ Cheung AN (2007). "Molecular targets in gynaecological cancers". Pathology 39 (1): 26–45. doi:10.1080/00313020601153273. PMID 17365821.
  5. ^ Mirza SP, Olivier M (2007). "Methods and approaches for the comprehensive characterization and quantification of cellular proteomes using mass spectrometry". Physiol Genomics. doi:10.1152/physiolgenomics.00292.2007. PMID 18162499.
  6. ^ Chen JJ (2007). "Key aspects of analyzing microarray gene-expression data". Pharmacogenomics 8 (5): 473–82. doi:10.2217/14622416.8.5.473. PMID 17465711.
  7. ^ Vardhanabhuti S, Blakemore SJ, Clark SM, Ghosh S, Stephens RJ, Rajagopalan D (2006). "A comparison of statistical tests for detecting differential expression using Affymetrix oligonucleotide microarrays". OMICS 10 (4): 555–66. doi:10.1089/omi.2006.10.555. PMID 17233564.
  8. ^ Significance Analysis of Microarrays. Retrieved on 2007-12-27.
  9. ^ Yauk CL, Berndt ML (2007). "Review of the literature examining the correlation among DNA microarray technologies". Environ. Mol. Mutagen. 48 (5): 380–94. doi:10.1002/em.20290. PMID 17370338.
  10. ^ Breitling R (2006). "Biological microarray interpretation: the rules of engagement". Biochim. Biophys. Acta 1759 (7): 319–27. doi:10.1016/j.bbaexp.2006.06.003. PMID 16904203.
  11. ^ Draminski M, Rada-Iglesias A, Enroth S, Wadelius C, Koronacki J, Komorowski J (2008). "Monte Carlo feature selection for supervised classification". Bioinformatics 24 (1): 110–7. doi:10.1093/bioinformatics/btm486. PMID 18048398.
  12. ^ Dr. Leming Shi, National Center for Toxicological Research. MicroArray Quality Control (MAQC) Project. U.S. Food and Drug Administration. Retrieved on 2007-12-26.
  13. ^ Dai M, Wang P, Boyd AD, et al (2005). "Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data". Nucleic Acids Res. 33 (20): e175. doi:10.1093/nar/gni179. PMID 16284200.
  14. ^ Alberts R, Terpstra P, Hardonk M, et al (2007). "A verification protocol for the probe sequences of Affymetrix genome arrays reveals high probe accuracy for studies in mouse, human and rat". BMC Bioinformatics 8: 132. doi:10.1186/1471-2105-8-132. PMID 17448222.
  15. ^ GSEA. Retrieved on 2008-01-03.
  16. ^ CTD: The Comparative Toxicogenomics Database. Retrieved on 2008-01-03.
  17. ^ Ingenuity Systems. Retrieved on 2007-12-27.
  18. ^ Curtis RK, Oresic M, Vidal-Puig A (2005). "Pathways to the analysis of microarray data". Trends Biotechnol. 23 (8): 429–35. doi:10.1016/j.tibtech.2005.05.011. PMID 15950303.
  19. ^ Couzin J (2006). "Genomics. Microarray data reproduced, but some concerns remain". Science 313 (5793): 1559. doi:10.1126/science.313.5793.1559a. PMID 16973852.
This article is licensed under the GNU Free Documentation License. It uses material from the Wikipedia article "Expression_profiling". A list of authors is available in Wikipedia.