Journal of Public Health Advance Access originally published online on August 11, 2006
Journal of Public Health 2006 28(3):278-282; doi:10.1093/pubmed/fdl038
Comparing the part with the whole: should overlap be ignored in public health measures?
Lillian J. Hayes, Faculty of Nursing and Midwifery1
Geoffrey Berry, Emeritus Professor Biostatistics and Epidemiology2
1 Faculty of Nursing and Midwifery, University of Sydney, Sydney, New South Wales 2006, Australia
2 School of Public Health, University of Sydney, New South Wales 2006, Australia
Address correspondence to Lillian J. Hayes, E-mail: lhayes{at}nursing.usyd.edu.au
Background In public health, health outcomes such as cancer incidence or mortality of subgroups are often compared with health outcomes of the whole population. Our objective was to explore the effect of overlap that occurs in such comparisons and to develop a correction factor to adjust the test statistics and confidence intervals to allow for the effect in situations where the full data are not available.
Method The standard error of a difference between a statistic calculated for a subgroup and for the whole population was derived theoretically both ignoring and allowing for overlap. The ratio of these standard errors was defined as the correction factor. Cancer incidence and death data (1997–2001) for the Australian state of New South Wales (NSW) were examined to demonstrate the utility of the correction factor.
Results If the overlap is ignored, significance tests are conservative and confidence intervals too wide. In an example with an overlap of 12%, the correction factor was 1.13 and the significance level of 0.08 was corrected to 0.05 by taking the overlap into account.
Conclusions The overlap may not be of concern if the result is significant or if the subgroup is <10% of the whole population, but if the overlap is greater than 10% it should not be ignored. The easiest way of allowing for overlap is to use a correction factor, calculated from the amount of overlap, to adjust analyses that ignore overlap.
Keywords: confidence intervals, health status indicators, part compared with whole, statistical methods
| Introduction |
|---|
Many countries have population-based cancer registries which maintain cancer databases and release regular reports on cancer incidence and mortality. Such routinely collected statistical information is often used to draw inferences about the health of subgroups in comparison with the overall population. Consider the following example. The Australian state of New South Wales (NSW) is divided into a number of geographically defined health administration areas known as Area Health Services (AHS). The cancer incidence or death rate of an AHS is often compared with the cancer incidence or death rate of the state, NSW. In such a comparison, there is an overlap between the AHS and the state because the data used to calculate the incidence or death rate of the AHS are also used in calculating the state rate. Therefore, the comparison is not of two independent quantities, and the statistical method employed needs to allow for the overlap between the AHS and the state. If there is access to the full data, there is no problem because the subgroup can be compared to the remainder of the population. Often full data are not available, only the standardized rates and standard errors. In this article, the effect of the overlap is explored and a correction factor developed to adjust the test statistic to allow for the effect of the overlap in situations where full data are not available.
Suppose that for a variable
X = the mean value for the whole population of size N,
x = the mean value for a subgroup of the population of size n,
f = n/N, the proportionate size of the subgroup relative to the total population and
Y = the mean value for the population excluding the subgroup (of size N − n),
then
$$X = f x + (1 - f) Y$$
and
$$x - X = (1 - f)(x - Y). \qquad (1)$$
If it is required to test the difference between the mean for the subgroup, x, and the mean for the whole population, X, then the optimum test is to compare x with the mean for the whole population excluding the subgroup, Y, as x and Y are independent. The test statistic is
$$z_1 = \frac{x - Y}{\sqrt{\operatorname{var}(x) + \operatorname{var}(Y)}}.$$
An alternative is to compare x with X and, if the correlation between x and X is taken into account, this gives exactly the same test statistic. However, it is more usual to carry out this test ignoring the correlation to give the test statistic
$$z_2 = \frac{x - X}{\sqrt{\operatorname{var}(x) + \operatorname{var}(X)}}.$$
Taking the correlation into account, the alternative but equivalent form of z1 is
$$z_1 = \frac{x - X}{\sqrt{\operatorname{var}(x) + \operatorname{var}(X) - 2\operatorname{cov}(x, X)}}.$$
Our purpose is to establish the relationship between the approximate test statistic, z2, and the correct test statistic, z1. Under the null hypothesis that the subgroup has the same distribution as the whole population, the variances of x, X and Y are equal to the population variance $\sigma^2$ divided by n, N and N − n respectively, so that var(X) = f var(x), cov(x, X) = f var(x), and the correlation between x and X is $\sqrt{f}$.
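The covariance and correlation quoted here follow in two lines. The sketch below assumes only that the whole-population mean is the size-weighted combination X = f x + (1 − f) Y and that x and Y are independent, which is what the definitions above imply.

```latex
\begin{aligned}
\operatorname{cov}(x, X)
  &= \operatorname{cov}\bigl(x,\; f x + (1 - f) Y\bigr)
   = f \operatorname{var}(x) + (1 - f)\operatorname{cov}(x, Y)
   = f \operatorname{var}(x), \\
\operatorname{corr}(x, X)
  &= \frac{\operatorname{cov}(x, X)}{\sqrt{\operatorname{var}(x)\operatorname{var}(X)}}
   = \frac{f \operatorname{var}(x)}{\sqrt{\operatorname{var}(x) \cdot f \operatorname{var}(x)}}
   = \sqrt{f}.
\end{aligned}
```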
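Therefore,
$$z_1 = \frac{x - X}{\sqrt{(1 - f)\operatorname{var}(x)}}$$
and
$$z_2 = \frac{x - X}{\sqrt{(1 + f)\operatorname{var}(x)}}.$$
Hence
$$z_1 = C z_2$$
where C is a correction factor given by
$$C = \sqrt{\frac{1 + f}{1 - f}}. \qquad (2)$$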
The test statistic, z2, is too small for two reasons. First, the difference x − X is too small because of the overlap and, secondly, the denominator is too large because of failure to take account of the positive correlation between x and X.
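The equivalence of the test statistics can be checked numerically. The following is a minimal sketch, not the authors' code: the function name and the input values (means, sizes and σ) are hypothetical, and the null-hypothesis variances σ²/n, σ²/N and σ²/(N − n) are assumed.

```python
import math

def z_statistics(x, Y, n, N, sigma):
    """Return (z1 from the remainder, z1 from x - X with covariance, z2 ignoring overlap)."""
    f = n / N
    X = f * x + (1 - f) * Y            # whole-population mean as a weighted combination
    var_x = sigma**2 / n               # null-hypothesis variances
    var_X = sigma**2 / N
    var_Y = sigma**2 / (N - n)
    cov_xX = f * var_x                 # covariance between the part and the whole
    z1_remainder = (x - Y) / math.sqrt(var_x + var_Y)
    z1_overlap = (x - X) / math.sqrt(var_x + var_X - 2 * cov_xX)
    z2 = (x - X) / math.sqrt(var_x + var_X)
    return z1_remainder, z1_overlap, z2

# Hypothetical illustration: subgroup of 120 out of 1000, so f = 0.12.
z1_a, z1_b, z2 = z_statistics(x=10.5, Y=10.0, n=120, N=1000, sigma=3.0)
C = math.sqrt((1 + 0.12) / (1 - 0.12))
print(z1_a, z1_b, C * z2)              # the three values agree up to rounding error
```

The two forms of z1 coincide, and multiplying z2 by the factor √[(1 + f)/(1 − f)] recovers the same value.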
The relationship between C and f is shown in Fig. 1. For small values of f, the correction factor is approximately equal to 1 + f, for example if f = 0.05 then C = 1.051, and if f = 0.1, C = 1.106. For larger values of f, the relationship increases much faster than linearly and reaches 1.5 for f = 0.38 and 1.7 for f = 0.5.
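The values quoted above can be reproduced with a few lines of code. This is a small sketch assuming the correction factor has the form C = √[(1 + f)/(1 − f)]; the function name `correction_factor` is illustrative only.

```python
import math

def correction_factor(f):
    """Correction factor C for a subgroup forming a fraction f of the whole population."""
    if not 0 <= f < 1:
        raise ValueError("f must satisfy 0 <= f < 1")
    return math.sqrt((1 + f) / (1 - f))

# Compare C with the small-f approximation 1 + f.
for f in (0.05, 0.10, 0.12, 0.38, 0.50):
    print(f"f = {f:.2f}   C = {correction_factor(f):.3f}   1 + f = {1 + f:.2f}")
```

For small f the two columns are close; by f = 0.5 the factor is well above 1 + f.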
The correction factor may be written as
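$$C = \sqrt{\frac{\operatorname{var}(x) + \operatorname{var}(X)}{\operatorname{var}(x) + \operatorname{var}(X) - 2\rho\sqrt{\operatorname{var}(x)\operatorname{var}(X)}}} \qquad (3)$$
where $\rho$ is the correlation between x and X.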
This shows that the result in equation 2 is a general result that applies if var(X) = f var(x) and the correlation between x and X is $\sqrt{f}$. Therefore, the same result applies if x and X are the proportions with some characteristic in the sub-population and whole population respectively.
Proportions are often compared in terms of their ratio R = x/X. Using expressions for the variance of a ratio and for a general function1 with and without inclusion of the covariance between x and X it can be shown that
![]() | (4) |
and in the null case, R = 1, this reduces to equation 2.
This correction factor may be used to adjust analyses carried out ignoring the overlap between the sub-population and whole population. A z test statistic for a null hypothesis should be multiplied by C using equation 2, and a chi-squared statistic should be multiplied by $C^2$. Standard errors and the widths of confidence intervals should be divided by C. In some cases, the factor C depends on the size of the effect (see equation 4). If the effect is not too large, it would be satisfactory to use equation 2 for this purpose.
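These adjustment rules can be collected in one helper. The sketch below is illustrative only: the function name, its arguments and the input values are hypothetical, and it assumes the effect-independent form C = √[(1 + f)/(1 − f)] together with a normal-approximation confidence interval.

```python
import math

def adjust_for_overlap(diff, se, f, z_crit=1.96):
    """Adjust a difference and its standard error that were computed ignoring overlap."""
    C = math.sqrt((1 + f) / (1 - f))
    z_unadjusted = diff / se
    se_adjusted = se / C                         # standard errors are divided by C
    return {
        "C": C,
        "z": C * z_unadjusted,                   # z statistics are multiplied by C
        "chi2": C**2 * z_unadjusted**2,          # chi-squared statistics by C squared
        "se": se_adjusted,
        "ci": (diff - z_crit * se_adjusted,      # interval width shrinks by the factor C
               diff + z_crit * se_adjusted),
    }

# Hypothetical inputs: difference 2.0, unadjusted SE 1.1, subgroup fraction 0.2.
result = adjust_for_overlap(diff=2.0, se=1.1, f=0.2)
print(result)
```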
Often the comparison is of standardized rates. The directly standardized rate for the sub-population is
$$x = \frac{\sum_i x_i P_i}{\sum_i P_i}$$
where xi is the rate in age group i in the sub-population and Pi is the population size in that age group for the standard population. A similar expression applies to X, and x and X may be compared either in terms of their difference or their ratio. For indirectly standardized rates, comparison is usually in terms of the standardized mortality ratio (SMR) of the sub-population relative to the total population.
This gives
$$\mathrm{SMR} = \frac{\sum_i n_i x_i}{\sum_i n_i X_i}$$
where xi and Xi are the death rates in age group i in the sub-population and whole population respectively, and ni is the population size in age group i in the sub-population. In both cases, the ratios are of weighted sums of xi and Xi, where for each pair (xi, Xi), xi is part of Xi, and var(Xi) = fi var(xi) and the correlation of xi and Xi is $\sqrt{f_i}$. In the null case where the expected value of xi equals that of Xi, it can be shown that if the comparison is tested using an approximate normal test ignoring the correlations between xi and Xi, then the correction factor has the form
$$C = \sqrt{\frac{1 + \bar{f}}{1 - \bar{f}}} \qquad (5)$$
where $\bar{f}$ is the weighted mean of the fi,
$$\bar{f} = \frac{\sum_i w_i f_i}{\sum_i w_i} \qquad (6)$$
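with weights $w_i = P_i^2 \operatorname{var}(x_i)$ for directly standardized rates and $w_i = n_i^2 \operatorname{var}(x_i)$ for indirectly standardized rates. As var(xi) is approximately equal to $d_i/n_i^2$, where di is the number of events in age group i in the sub-population,2 then
$$w_i = \frac{P_i^2 d_i}{n_i^2} \quad \text{for directly standardized rates,}$$
$$w_i = d_i \quad \text{for indirectly standardized rates.}$$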
The same factor applies if the comparison is in terms of the difference of directly standardized rates.
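The weighted overlap can be computed directly from age-specific counts. The sketch below is a hedged illustration: all function names and the three-age-group data are hypothetical, and it assumes the weights are Pi² di/ni² for direct standardization and di for indirect standardization, with var(xi) approximated by di/ni².

```python
import math

def weighted_overlap(f_i, w_i):
    """Weighted mean of the age-specific overlap fractions f_i with weights w_i."""
    return sum(w * f for w, f in zip(w_i, f_i)) / sum(w_i)

def direct_weights(P_i, d_i, n_i):
    """Weights for directly standardized rates, using var(x_i) ~ d_i / n_i**2."""
    return [P**2 * d / n**2 for P, d, n in zip(P_i, d_i, n_i)]

def indirect_weights(d_i):
    """Weights for indirectly standardized rates reduce to the event counts d_i."""
    return list(d_i)

# Hypothetical data for three age groups (not taken from the article).
f_i = [0.10, 0.12, 0.14]               # subgroup share of the population in each age group
d_i = [20, 150, 300]                   # events in the subgroup
n_i = [100_000, 80_000, 40_000]        # subgroup population sizes
P_i = [300_000, 250_000, 120_000]      # standard population sizes

f_bar_direct = weighted_overlap(f_i, direct_weights(P_i, d_i, n_i))
f_bar_indirect = weighted_overlap(f_i, indirect_weights(d_i))
C_direct = math.sqrt((1 + f_bar_direct) / (1 - f_bar_direct))
print(f_bar_direct, f_bar_indirect, C_direct)
```

Because the weights are driven by the number of events, age groups with many events dominate the weighted overlap.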
| Example |
|---|
In NSW, cancer age standardized incidence and mortality rates by gender and AHS are available.3 An examination of the cancer incidence and death rates on this site suggests that, in NSW, overlap may be a problem for some cancer rates in the larger AHSs. AHSs vary greatly in population size from over 700 000 to <50 000. South Eastern Sydney AHS (SESAHS) has one of the largest AHS populations and represents 12% of the total NSW population. SESAHS had a higher age standardized incidence rate of breast cancer than NSW, but the difference was not reported to be statistically significant.
The NSW Cancer Registry made 1997–2001 cancer data available to the authors (Table 1) to determine whether overlap was a problem in this particular example. For the period 1997–2001, breast cancer direct age standardized incidence rates for SESAHS and NSW were calculated using the 1991 Australian population as standard. For SESAHS, the standardized rate was 104.46 per 100 000 with a standard error of 2.20. For NSW 1997–2001, the directly standardized rate was 100.40 with a standard error of 0.75. Ignoring overlap, the standard error of the difference between the incidence rates for SESAHS and NSW was 2.32. The incidence rate for SESAHS was higher than that for NSW as a whole by 4.1 per 100 000. The significance level of this difference was 0.08 (z2 = 1.751), and the 95% confidence interval was from −0.5 to +8.6.
From the data in Table 1, the population and incidence figures can be produced for the remainder of NSW (NSW − SESAHS). The directly standardized rate for the remainder was 99.84 per 100 000 with a standard error of 0.7936. The difference between SESAHS and the remainder of the state was 4.62 per 100 000 with a standard error of 2.34, giving a z statistic of 1.979.
If the complete data had not been available, the analysis in the previous paragraph could not be done and it would be necessary to apply a correction factor to the comparison of NSW with SESAHS. SESAHS makes up a fraction of 0.12 of the NSW population and this gives a C of 1.13, so that the standard error of the difference is reduced to 2.06. Thus, after allowing for overlap, the significance level of the difference was 0.05 (z1 = 1.972), and the 95% confidence interval was from +0.0 to +8.1. These corrected values are very similar to those of the analysis using the full data in the previous paragraph.
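The arithmetic of this example can be retraced from the published figures. The sketch below is not the authors' computation: it starts from the rounded rates and standard errors quoted in the text, so the last digits may differ slightly from the reported z values, and the variable names are illustrative.

```python
import math

def two_sided_p(z):
    """Two-sided significance level for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

# Published (rounded) directly standardized rates per 100 000 and standard errors.
rate_sub, se_sub = 104.46, 2.20        # SESAHS
rate_all, se_all = 100.40, 0.75        # NSW as a whole
rate_rest, se_rest = 99.84, 0.7936     # remainder of NSW
f = 0.12                               # SESAHS share of the NSW population

# Part versus whole, ignoring the overlap.
diff = rate_sub - rate_all
se_ignoring = math.sqrt(se_sub**2 + se_all**2)
z2 = diff / se_ignoring

# Same comparison after applying the correction factor.
C = math.sqrt((1 + f) / (1 - f))
se_corrected = se_ignoring / C
z1 = C * z2
ci = (diff - 1.96 * se_corrected, diff + 1.96 * se_corrected)

# Part versus remainder, which needs the full data.
z_rest = (rate_sub - rate_rest) / math.sqrt(se_sub**2 + se_rest**2)

print(f"ignoring overlap: z = {z2:.3f}, p = {two_sided_p(z2):.2f}")
print(f"corrected:        z = {z1:.3f}, p = {two_sided_p(z1):.2f}, "
      f"95% CI {ci[0]:+.1f} to {ci[1]:+.1f}")
print(f"versus remainder: z = {z_rest:.3f}")
```

The corrected statistic and the statistic from the comparison with the remainder agree to about two decimal places, which is the point of the example.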
The above correction is based on the overall value of f, whereas the weighted value $\bar{f}$ (equation 6) is preferable if available. The weighted value from equation 6 is 0.1205, leading to z = 1.974, which differs little from the value using the unweighted fraction. A non-weighted value of f, calculated from the total over all ages, may be used to calculate an approximate correction factor if f does not vary too much over age groups. In the example this fraction varied with age from 0.096 in the age group 0–19 years to 0.145 for 20–29 years, and using the non-weighted value of f gave a very good approximation to the z value using the weighted mean, 1.972 compared with 1.974.
It is clear from this example that a consequence of ignoring the overlap is to obtain a confidence interval that is too wide, and a conservative value for the significance level, 0.08 instead of 0.05.
| Discussion |
|---|
| What is known |
|---|
If the overlap of the subgroup and the overall population is ignored, the significance tests are conservative and correspondingly the confidence intervals are too wide. If the result is statistically significant ignoring overlap, it is even more significant after allowing for the overlap, whilst a non-significant result could become significant at a specified significance level.
| Main findings of the study |
|---|
The consequences of ignoring overlap depend on the size of f. From Fig. 1 and equation 2, for f = 0.1 the correction factor is 1.1, and ignoring it may not be a large error. However, in the example where f was 0.12, allowing for the overlap proved important. Certainly for an overlap >0.1, attention needs to be given to the problem and a correction factor applied. With standardized measures, the correction factor is in terms of a weighted average of the proportional overlap in the different age groups. The weighted mean may be higher or lower than the overall overlap but cannot be higher than the highest overlap in any age group. Consequently, if this highest overlap is <0.1, the correction factor can be ignored; otherwise, more attention is needed.
| What this study adds |
|---|
A correction factor to adjust the test statistics and confidence intervals to allow for the effect in situations where the full data are not available has been developed.
| Limitation of the study |
|---|
For standardized death rates, the adjusted value of z is valid when the null hypothesis applies to each age group. Where the contrasts between the subgroup and overall population vary markedly with age, then examples may be produced where the value of z is higher, rather than lower, when testing the subgroup with the overall population than when the comparison is between the subgroup and the remainder of the population, and use of the correction factor would be inappropriate. In such cases, the comparison of the standardized rates is a comparison made up of heterogeneous components and the meaning of the comparison is unclear.
Throughout this article, we have worked with confidence intervals and significance tests based on a normal approximation. This is satisfactory for the example with over 2000 deaths in total in the subcohort and over 50 deaths in 12 of the 14 age groups. For smaller numbers, the normal approximation may be unsatisfactory for the directly standardized rate as the confidence intervals based on Poisson limits are asymmetric. Alternative methods are available4,5, and in the context of this article, the overlap should be taken into account using the correction factor if they are used to compare the part with the whole.
| Acknowledgements |
|---|
We thank Ms Elizabeth Tracey (Manager of the NSW Central Cancer Registry) and Ms Wendy Chen (Medical coder and analysis statistician of the NSW Central Cancer Registry) for the provision of data and advice. We are also grateful to a reviewer who detected a flaw in the original draft.
| References |
|---|
- Armitage P, Berry G, Matthews JNS. Statistical Methods in Medical Research, 4th edn. Oxford: Blackwell Science, 2002, 159–62.
- Armitage P, Berry G, Matthews JNS. Statistical Methods in Medical Research, 4th edn. Oxford: Blackwell Science, 2002, 662.
- Statistical Reporting Module. The Cancer Council of New South Wales. http://www.statistics.cancercouncil.com.au (18 February 2004, date last accessed).
- Dobson AJ, Kuulasmaa K, Eberle E et al. Confidence intervals for weighted sums of Poisson parameters. Stat Med 1991;10:457–62.
- Fay MP, Feuer EJ. A semi-parametric estimate of extra-Poisson variation for vital rates. Stat Med 1997;16:2389–401.