Pseudo-Replicates
Bioassay Basics: Designing Dose Response Curves and the Meaning of “N”
I began working with clients developing bioassays back in the 1990s. Most of my projects involve relative potency measurements made using in vitro procedures in which test samples and reference material are run together in a 96-well plate. By the time I am brought on board, the biologists have worked out conditions that yield a dose response curve. During research, they include many more dilutions than are practical for a quality control test, which provides sufficient information to select an appropriate dose response model. Deciding how many dilutions are needed to cover the dose response range, determining the optimal spacing of those dilutions, and choosing how many replicate curves must be run for both test and reference sample to calculate a single relative potency result from a procedure(1) can be daunting. Figuring out how many individual results need to be averaged to compute a reportable value is central to determining whether the bioassay method(1) can be considered fit for purpose.
The (mis)use of replicates is common, and understanding the purpose and value of replication in establishing confidence in results and conclusions is critical. Terminology can make this subject confusing. This blog is intended to clarify the basic concepts surrounding sample sizes and to shed light on the terms “technical replicate” and “pseudo-replicate” that many biologists use.
First, it is important to clarify that a “replicate” in the context of an assay procedure means a “sample” in statistical terms. The sample, whose size is designated “n”, is taken from a population that represents all the theoretical possibilities for whatever is being measured. The goal is to use the sample information to estimate the truth about the population. For example, a drug is tested in a clinical trial (the sample) and the data are used to license the product for sale to patients now and in the future (the population).
It is intuitive that the larger the sample size, the more likely we are to draw the correct conclusion about the safety and efficacy of the drug. We understand that there is variability in clinical responses related to differences among patients. When working with bioassays, there are many sources of variability that can affect the dose response curve and the calculation of relative potency. As explained in more detail below, the bottom line is that replication is most valuable where there is the most variability. For example, reading the same 96-well plate several times within a matter of minutes is unlikely to reveal much variability. Similarly, wells receiving test material from a single dilution preparation will exhibit less variability than wells receiving independently prepared dilutions of the test material, because only the latter manifest the variability of making the dilutions. Why does this matter?
There are two statistical properties of the sample size that are important to understand. The first is independence and the second is power.
Samples used in statistical analyses are assumed to be independent, or uncorrelated. That means the samples differ in the steps taken to create them: the analysts, the days, the vials, the preparation of the serial dilution, the wells, and so on. The more of these steps the samples share, the more correlated they are.
As samples become more and more correlated, the variability among them decreases, unless the statistical analysis includes terms for the correlations. Correlated samples contain less information about the population of interest (for example, a product lot being tested in QC) than uncorrelated samples. A bigger N and smaller variability mean more statistical power. More statistical power (see my blog on The Power of Statistical Power) means a greater chance of rejecting the null hypothesis and declaring the alternative hypothesis “proven.”
Pseudo-replicates at the level of the well are an example of highly correlated samples. Because they are correlated, they understate variability while inflating N, making it easier to arrive at the wrong conclusion. The way to avoid this is to always average pseudo-replicates prior to statistical analyses, including fitting the dose response curve to the model and assessing parallelism.
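As a minimal sketch of that averaging step, the snippet below uses invented well data and a simplified logit-linear form of the dose response curve (a stand-in for the full four-parameter logistic fit your software performs; the asymptotes of 0 and 1 are assumptions). The key point is that the triplicate pseudo-replicate wells are averaged before anything is fit:

```python
import numpy as np

# Hypothetical raw data: 8 dilutions, triplicate pseudo-replicate wells
# drawn from a SINGLE dilution preparation (values are invented).
doses = np.array([0.125, 0.25, 0.5, 1, 2, 4, 8, 16])  # relative dose
wells = np.array([
    [0.11, 0.10, 0.12], [0.18, 0.20, 0.19], [0.33, 0.35, 0.31],
    [0.52, 0.50, 0.55], [0.71, 0.69, 0.72], [0.85, 0.83, 0.86],
    [0.93, 0.95, 0.92], [0.97, 0.96, 0.98],
])

# Average the pseudo-replicate wells FIRST, so the curve fit sees
# one value per dilution rather than an artificially inflated N of 24.
means = wells.mean(axis=1)

# Logit-linear fit, assuming known lower/upper asymptotes of 0 and 1
# for simplicity (this linearizes the logistic dose response curve).
logit = np.log(means / (1.0 - means))
slope, intercept = np.polyfit(np.log(doses), logit, 1)
ec50 = np.exp(-intercept / slope)
print(f"slope = {slope:.2f}, EC50 = {ec50:.2f}")
```

Fitting the 24 individual wells instead would change the estimates very little here, but it would make every standard error and confidence interval from the fit misleadingly small.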
The reason such replicates are often included is to allow for outlier testing and removal of data points that would otherwise spoil a “good” curve fit. Without replication, the removal of a single point that looks “off” cannot be justified. So, returning to the example of wells, it makes more sense to replicate at the level of the dilution preparation. While all wells are correlated at the level of the plate, and all plates within a day are correlated at the level of the day, using more than one vial and/or preparing more than one dilution series allows unusual variability in the vial or in sample preparation to be detected with outlier testing. After outlier testing and removal of any aberrant values, the remaining values should still be averaged.
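To make the outlier step concrete, here is a sketch with invented numbers. It uses the modified z-score screen of Iglewicz and Hoaglin with the conventional 3.5 cutoff; this is one simple screen, not necessarily the formal outlier test your SOP or pharmacopeial guidance specifies:

```python
import numpy as np

def mad_outliers(values, cutoff=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds the cutoff.
    The 0.6745 constant scales MAD to the normal standard deviation."""
    values = np.asarray(values, dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    if mad == 0:
        return np.zeros(values.shape, dtype=bool)   # no spread: flag nothing
    z = 0.6745 * (values - med) / mad
    return np.abs(z) > cutoff

# Hypothetical responses at one dose from four INDEPENDENT dilution preps;
# the fourth prep looks aberrant (e.g., a pipetting error).
preps = [0.51, 0.49, 0.52, 0.78]
flag = mad_outliers(preps)

# After removing the flagged value, the remaining values are still averaged.
clean_mean = np.mean([v for v, f in zip(preps, flag) if not f])
print(f"flags = {flag.tolist()}, mean of retained preps = {clean_mean:.3f}")
```

Because the replication here is at the dilution-preparation level, the flagged value actually carries information about preparation variability; a discrepant pseudo-replicate well could never have told you that.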
To capture the variability introduced by running the procedure with different reagents, days, and analysts, it is common to combine the test results from independent runs and use the average as the reportable value. While not every run may be truly independent with respect to all factors, the run-to-run variability estimated from historical data and DOE studies should represent the population reasonably well (i.e., the true variability of all future “infinite” runs of the assay). The standard error of the mean (SEM) is then used to convey confidence in the final result. However, it is critical that the “N” in the denominator be paired with the “s”, the variability, in the numerator. Information on the within-run variability can be taken into account in generating the single result reported from each run.
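A small simulation illustrates why the N in the SEM must be paired with the level at which the variability arises. The variance components below are assumptions chosen for illustration: run-to-run variability dominates well-to-well noise, as is typical when dilution preparation, day, and analyst differ between runs.

```python
import numpy as np

rng = np.random.default_rng(42)

n_runs, n_wells = 6, 4        # independent runs; pseudo-replicate wells per run
run_sd, well_sd = 0.10, 0.02  # assumed variance components (log-potency scale)

# Each run shares a single run effect, so wells within a run are correlated.
run_effects = rng.normal(0.0, run_sd, size=n_runs)
wells = run_effects[:, None] + rng.normal(0.0, well_sd, size=(n_runs, n_wells))

# Mismatched: all 24 wells treated as independent, so N = 24 in the SEM.
sem_naive = wells.std(ddof=1) / np.sqrt(wells.size)

# Matched: average the pseudo-replicates within each run, then use
# N = number of independent runs with the run-to-run "s".
run_means = wells.mean(axis=1)
sem_run = run_means.std(ddof=1) / np.sqrt(n_runs)

print(f"naive SEM (N=24):     {sem_naive:.4f}")
print(f"run-level SEM (N=6):  {sem_run:.4f}")
```

The naive SEM comes out roughly half the run-level SEM, not because the result is really known that precisely, but because 24 correlated wells were counted as if they were 24 independent pieces of information.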
(1) USP clarifies the distinction between “procedure” and “method” in the glossary of chapter <1030>.