Identifying depression in primary care: a comparison of different methods in a prospective cohort study
BMJ 2003; 326 doi: https://doi.org/10.1136/bmj.326.7382.200 (Published 25 January 2003) Cite this as: BMJ 2003;326:200
All rapid responses
Rapid responses are electronic comments to the editor. They enable our users to debate issues raised in articles published on bmj.com. A rapid response is first posted online. If you need the URL (web address) of an individual response, simply click on the response headline and copy the URL from the browser window. A proportion of responses will, after editing, be published online and in the print journal as letters, which are indexed in PubMed. Rapid responses are not indexed in PubMed and they are not journal articles. The BMJ reserves the right to remove responses which are being wilfully misrepresented as published articles or when it is brought to our attention that a response spreads misinformation.
EDITOR: We want to commend Henkel et al. on their elegant study comparing 3 brief
instruments for detecting depressive disorders in primary care,1 but
respectfully disagree with their conclusions. Because the WHO-5 has the highest
sensitivity, the authors conclude it is the preferred screening test for
depression. But as is commonly the case, higher sensitivity means a lower
specificity. Regarding depression, we disagree that sensitivity is the only
consideration. The primary care setting is exceptionally busy with the average
visit length often being 10 minutes or less and the general practitioner often
having many competing demands besides simply screening for one
disorder.2 Because the WHO-5 and GHQ-12 have a considerably lower
specificity than the PHQ, they result in many more false positive cases. For
every 100 patients screened in primary care, 17 patients would have a depressive
disorder according to the criterion standard, the CIDI. Given the operating
characteristics reported by Henkel et al., nearly twice as many patients would
screen positive with the WHO-5 or GHQ-12 compared to the PHQ (see Table). Almost
half of patients screened with the WHO-5 or GHQ-12 would require a full
diagnostic interview for depression, yet only 1 in 3 interviewed patients
would have depression. In contrast, only one-fourth of patients screened with
the PHQ would require a full diagnostic interview, and over half of these would
have depression. Thus, the number of cases missed by using the PHQ instead of
the WHO-5 is small (3 per 100 patients screened) but the efficiency is much
greater (i.e., far fewer false positives).
Measure of test efficiency            WHO-5   GHQ-12   PHQ   Unaided clinical diagnosis
For each 100 primary care patients:
  Number who screen positive          46      46       25    33
  Number of true positives            16      14       13    11
  Positive likelihood ratio           2.6     2.2      5.2   2.5
Since a trade-off between sensitivity and specificity is inevitable, the
likelihood ratio is a useful way of integrating both test characteristics into a
single number. The likelihood ratio for a positive test equals: Sensitivity / (1
- Specificity). Depressed patients are 2.2 to 2.6 times more likely than
nondepressed patients to screen positive with the WHO-5 or GHQ-12 (comparable to
unaided clinical diagnosis), while depressed patients are 5.2 times more likely
to screen positive with the PHQ.
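The figures in the Table follow directly from the reported operating characteristics. A short sketch (differences of a patient here or there reflect rounding of the published sensitivities and specificities):

```python
# Checking the screening arithmetic: expected screen positives per 100
# patients and the positive likelihood ratio LR+ = sensitivity / (1 - specificity).
# Sensitivity/specificity pairs are those reported by Henkel et al.;
# the 17% prevalence is the CIDI-based figure quoted above.

tests = {
    "WHO-5": (0.93, 0.64),
    "GHQ-12": (0.85, 0.62),
    "PHQ": (0.78, 0.85),
    "Clinical diagnosis": (0.65, 0.74),
}
prevalence, n = 0.17, 100

for name, (sens, spec) in tests.items():
    true_pos = prevalence * n * sens               # depressed, screen positive
    false_pos = (1 - prevalence) * n * (1 - spec)  # not depressed, screen positive
    lr_pos = sens / (1 - spec)                     # positive likelihood ratio
    print(f"{name}: screen positive ~{true_pos + false_pos:.0f} "
          f"(true positives ~{true_pos:.0f}), LR+ = {lr_pos:.1f}")
```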
Rather than using a single somewhat arbitrary cutpoint, receiver operating
characteristic (ROC) analysis evaluates sensitivity and specificity for all
possible cut-off points. A recent study of 501 patients using ROC analysis
showed the PHQ to be superior to both the WHO-5 and Hospital Anxiety and
Depression Scale (HADS).3
A study in 6000 patients validated different PHQ cutpoints representing mild,
moderate, and severe depression.4 Also, the PHQ contains the 9 core
depression symptoms required to make DSM-IV diagnoses. Finally, there is a
2-item PHQ depression screener which has a sensitivity and specificity similar
to the WHO-5.5
References
1. Henkel V, Mergl R, Kohnen R, Maier W, Möller HJ, Hegerl U. Identifying
depression in primary care: a comparison of different methods in a prospective
cohort study. BMJ 2003; 326:200-201.
2. Williams JW, Jr. Competing demands. Does care for depression fit in
primary care? J Gen Intern Med 1998;13:137-139.
3. Löwe B, Spitzer RL, Gräfe K, Kroenke K, Quenter A, Zipfel S, Bucholz C,
Witte S, Herzog W. Comparative validity of three screening questionnaires for
DSM-IV depressive disorders and physicians' diagnoses. J Affect Disord (in
press).
4. Kroenke K, Spitzer RL, Williams JB. The PHQ-9. Validity of a brief
depression severity measure. J Gen Intern Med 2001; 16:606-613.
5. Kroenke K, Spitzer RL. The PHQ-9: a new depression diagnostic and severity
measure. Psychiatric Annals 2002;32:509-521.
Competing interests:
None declared
Competing interests: No competing interests
EDITOR, In the paper by Henkel et al. there is an error of analysis
which is extremely widespread in medical research, especially in general
practice settings.
The authors have measured the performance of an unspecified number of
doctors in the detection of depression. The analysis has pooled all the
results from these doctors and has taken no account of the clustering
which is almost certain to appear in data like this. It is well known from
other studies that doctors can vary widely in their ability to detect
depression and other common mental disorders. It is possible that another
level of clustering has also been ignored, as the doctors were from
eighteen primary care facilities and rates of detection may also vary
between facilities. Doctors in one centre may be more similar to each
other than to doctors in other centres.
Ignoring clustering in the analysis, for binary data, will not affect
the overall mean value of a proportion (such as sensitivity), but will
give an estimate which is too precise, with confidence intervals which are
too narrow. If statistical tests are performed, the p-value will be too
small.
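The narrowing of confidence intervals when clustering is ignored can be illustrated with the standard design-effect formula, DEFF = 1 + (m - 1) x ICC, where m is the average cluster size and ICC the intra-class correlation. The cluster size and ICC below are illustrative assumptions, not figures from the study:

```python
import math

# Illustrative sketch: ignoring clustering understates the standard error
# of a proportion. DEFF = 1 + (m - 1) * icc, where m is the average number
# of patients per doctor and icc is the intra-class correlation.
# All numbers here are assumed for illustration, not taken from the study.

p = 0.65      # e.g. a sensitivity of 65%
n = 400       # total patients (assumed)
m = 20        # average patients per doctor (assumed)
icc = 0.05    # intra-class correlation (assumed)

se_naive = math.sqrt(p * (1 - p) / n)        # treats patients as independent
deff = 1 + (m - 1) * icc                     # design effect
se_clustered = se_naive * math.sqrt(deff)    # corrected standard error

print(f"design effect = {deff:.2f}")
print(f"naive 95% CI half-width:            {1.96 * se_naive:.3f}")
print(f"cluster-adjusted 95% CI half-width: {1.96 * se_clustered:.3f}")
```

Even a modest ICC of 0.05 nearly doubles the variance here, so the naive interval is markedly too narrow.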
It also misses an opportunity to comment on something which is of
scientific importance - the degree of variation between doctors in their
ability to detect depression. Including an estimate of the intra-class
correlation for the doctors’ sensitivity would be very useful to future
investigators who wish to carry out cluster randomised studies in this
area.
To analyse these results, it may be possible to use random effects
logistic regression if an estimate of between doctor variance is required.
If a simple correction for clustering is required, logistic regression
with robust standard errors using the Huber-White sandwich estimator is a
good approach, available in statistical packages such as Stata and SAS.
Alternatively, generalised estimating equations for logistic regression
can be used.
It may be difficult to compare the doctors’ performance against other
tests. Despite extensive research in the statistical literature, I was
unable to find a method for analysing matched pairs data where there are
random effects. It is, however, relatively easy to work out how to do this
from first principles. A pen and paper calculation can be done by creating
a ‘stratified’ McNemar’s test, with a stratum for each of N doctors. The
odds within each stratum can then be analysed using standard techniques
for 2 by N tables (as in Donner and Klar (ref 3) or Fleiss (ref 2)).
Alternatively a combination of conditional (to account for the pair
matching) and random effects logistic regression could be used. Another
possibility would be to use multi-level logistic regression with random
effects at the level of each pair and random effects at the level of the
doctor. This isn’t quite the same model as the conditional logistic
regression model, but should give similar results.
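The pen-and-paper calculation mentioned above can be sketched for a single stratum; the discordant-pair counts here are hypothetical, chosen only to show the mechanics:

```python
# Minimal single-stratum McNemar's test for matched binary data.
# Only the discordant pairs enter the statistic:
#   b = test A positive / test B negative
#   c = test A negative / test B positive
# The counts below are hypothetical, for illustration only.

def mcnemar_chi2(b, c):
    """Chi-squared statistic (1 df) with the usual continuity correction."""
    return (abs(b - c) - 1) ** 2 / (b + c)

b, c = 25, 10  # hypothetical discordant-pair counts
print(f"McNemar chi-squared = {mcnemar_chi2(b, c):.2f}")
print("compare with 3.84, the 5% critical value for chi-squared on 1 df")
```

A stratified version would compute one such table per doctor and combine the strata using the standard methods for 2 by N tables cited above.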
1. Henkel V, Mergl R, Kohnen R, Maier W, Moller H-J and Hegerl U.
Identifying depression in primary care: a comparison of different methods
in a prospective cohort study. BMJ 2003; 326: 200-201 (25 January)
2. Fleiss, J. L. Statistical Methods for Rates and Proportions. 2nd
Ed. Chichester: John Wiley and Sons, 1981.
3. Donner A and Klar N . Design and Analysis of Cluster Randomised
Trials in Healthcare Research. London: Arnold, 2000.
Competing interests:
None declared
Competing interests: No competing interests
We agree with Henkel et al. (1) that in screening for depression in
primary care the most relevant parameters are sensitivity and negative
predictive value of the screening instrument.
Simplicity is also
important, and Whooley et al. (2) showed that a two-question case-finding
instrument that asks about depressed mood and anhedonia achieved a
sensitivity of 96% (greater than that achieved by the WHO-5 in Henkel et
al) and an NPV of 98% (equal to that reported for WHO-5) when applied to
536 adult patients in a US primary care setting. A recent systematic
review (3) of screening for depression in primary care showed reasonable
performance even for a single-question screen.
1 Henkel V, Mergl R, Kohnen R, Maier W, Möller H-J, Hegerl U.
Identifying depression in primary care: a comparison of different methods
in a prospective cohort study. BMJ 2003;326:200-201
2 Whooley MA, Avins AL, Miranda J, Browner WS. Case-finding
instruments for depression. J Gen Intern Med 1997;12:439-445.
3 Williams JW Jr, Noël PH, Cordes JA, Ramirez G, Pignone M. Is
this patient clinically depressed? JAMA 2002;287:1160-1170.
John C Duffy, Head of Statistics, Department of Primary Care,
University of Birmingham, Edgbaston, Birmingham B29 5HB
David F Peck, Professor of Health Research, University of Stirling
(Highland
Campus), Old Perth Road, Inverness, IV2 3FG.
Competing interests:
None declared
Competing interests: No competing interests
Editor,
Henkel and colleagues recommend using the WHO-5 depression self-rating instrument as a screening tool in primary care, suggesting it is superior to unaided clinical diagnosis.1 Unfortunately, their analysis and conclusions are flawed.
The Table summarizes the most clinically meaningful way of presenting results of diagnostic or screening tests: as positive likelihood ratios (sensitivity/(1-specificity)) and the consequent post-test probability of detecting depression in primary care. The WHO-5 self-rated instrument is no different from unaided clinical diagnosis. The instrument that performs best is the Brief Patient Health Questionnaire (B-PHQ), but even this provides a post-test probability of depression of just over 50%.
The prior probability of depression in this cohort is 17%. The prior probability of depression is highly dependent on the age and sex of patients consulting, as well as other important socio-demographic variables.2 Uncritical application of an unselected prior of 17% is misleading and fails to reflect the clinical reality of primary care practice.
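The post-test probabilities in the Table can be reproduced from the odds form of Bayes' theorem (post-test odds = pre-test odds x LR+); small differences from the tabulated values reflect rounding of the likelihood ratios. A minimal sketch:

```python
# Post-test probability of depression from a prior probability and a
# positive likelihood ratio, via the odds form of Bayes' theorem.
def post_test_probability(prior, lr_positive):
    prior_odds = prior / (1 - prior)
    post_odds = prior_odds * lr_positive
    return post_odds / (1 + post_odds)

prior = 0.17  # unselected prevalence in the Henkel et al. cohort
for name, lr in [("WHO-5", 2.6), ("GHQ-12", 2.2),
                 ("B-PHQ", 5.2), ("Clinical diagnosis", 2.5)]:
    p = post_test_probability(prior, lr)
    print(f"{name}: post-test probability = {100 * p:.1f}%")
```

Rerunning the same function with an age- and sex-specific prior in place of 0.17 shows immediately how sensitive these post-test probabilities are to the assumed prevalence.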
Even if these self-rating instruments did perform well as effective screening tools (they clearly do not), Henkel et al fail to mention a recent systematic review published in the BMJ showing that the use of self-rating questionnaires for detection and treatment of anxiety and depression do not increase the recognition of these disorders and have no effect on patient outcome.3 General practitioners are in fact more likely to initiate treatment for patients whom they themselves have diagnosed as depressed.4
Contrary to Henkel et al’s conclusion,1 their study does not provide any evidence that use of these self-rating instruments for screening of depression in primary care is an effective or cost effective strategy.
Yours sincerely,
Tom Fahey, Steve MacGillivray, Frank Sullivan
References
1. Henkel V, Mergl R, Kohnen R, Maier W, Möller H-J, Hegerl U. Identifying depression in primary care: a comparison of different methods in a prospective cohort study BMJ 2003; 326: 200-201.
2. Okkes IM, Oskam SK, Lamberts H. The probability of specific diagnoses for patients presenting with common symptoms to Dutch family physicians. The Journal of Family Practice 2002; 51:31-36.
3. Gilbody S, House A, Sheldon T. Routinely administered questionnaires for depression and anxiety: systematic review. BMJ 2001;322:406-409.
4. Dowrick C. Does testing for depression influence diagnosis or management by general practitioners? Family Practice 1995;12:461-465.
Table: Test accuracy of screening questionnaires and family doctors' unaided clinical diagnosis. Adapted from Henkel et al.1

Instrument                           Sensitivity   Specificity   Positive likelihood ratio (95% CI)   Post-test probability of depression % (95% CI)
WHO-5 wellbeing index                93            64            2.6 (2.2 to 3.0)                     34.6 (31.2 to 45.4)
General Health Questionnaire-12      85            62            2.2 (1.9 to 2.6)                     31.4 (27.9 to 35.0)
Brief Patient Health Questionnaire   78            85            5.2 (3.9 to 6.8)                     51.5 (44.6 to 58.3)
Unaided clinical diagnosis           65            74            2.5 (2.0 to 3.2)                     34.1 (28.9 to 37.9)
Competing interests:
None declared
Competing interests: No competing interests
EDITOR: When testing screening questionnaires the best possible
available instrument should be used as the reference standard. In their
comparison of routine screening by brief questionnaire with a clinical
assessment of depression made by a family physician, Henkel and colleagues
(1) chose a self-report survey questionnaire (CIDI) as their reference
standard for diagnosing depression in primary care. However, they
overlooked a comparison in the community of their reference instrument
with the WHO semi-structured clinical interview (SCAN)(2) showing that the
sensitivity of the CIDI for clinician assessed current depression was only
0.5 (95% CI 0.12 to 0.88) and diagnostic agreement was also poor. The CIDI
was designed for use in large epidemiological surveys (3) where the many
thousands of assessments required would render clinical interviewing
impractical. But, in their own study, only 431 reference assessments were
required and psychologists were available to conduct these by telephone.
They cited work from which they could easily have chosen as the reference
standard for diagnosing depression another widely used structured clinical
evaluation (SCID)(4). The SCID was used to evaluate another short
screening questionnaire (PRIME-MD)(5) designed for use in primary care.
Their chosen design, testing the sensitivity of a family physician
clinical assessment by comparing it with a self-report tool intended for
survey prevalence estimation (CIDI), makes little clinical or scientific
sense.
Reference List
1. Henkel V, Mergl R, Kohnen R, Maier W, Moller HJ, Hegerl U.
Identifying depression in primary care: a comparison of different methods
in a prospective cohort study. BMJ 2003;326:200-1.
2. Brugha TS, Jenkins R, Taub NA, Meltzer H, Bebbington P. A general
population comparison of the Composite International Diagnostic Interview
(CIDI) and the Schedules for Clinical Assessment in Neuropsychiatry
(SCAN). Psychol.Med. 2001;31:1001-13.
3. Robins LN, Wing J, Wittchen HU, Helzer JE, Babor TF, Burke J et
al. The Composite International Diagnostic Interview. An epidemiologic
Instrument suitable for use in conjunction with different diagnostic
systems and in different cultures. Arch.Gen.Psychiatry 1988;45:1069-77.
4. Spitzer RL, Williams JB, Gibbon M, First MB. The Structured
Clinical Interview for DSM-III-R (SCID). I: History, rationale, and
description. Arch.Gen.Psychiatry 1992;49:624-9.
5. Spitzer RL, Kroenke K, Williams JB. Validation and utility of a
self-report version of PRIME-MD: the PHQ primary care study. Primary Care
Evaluation of Mental Disorders. Patient Health Questionnaire. J.A.M.A.
1999;282:1737-44.
Competing interests:
None declared
Competing interests: No competing interests
EDITOR, The paper on identifying depression in primary care by Henkel
et al. contains a number of scientific and statistical errors. The
authors are not, perhaps, entirely to blame for this. The errors are mostly
ones which are very common in the medical literature, but the BMJ’s peer
review process should have identified and eliminated these mistakes before
publication.
Firstly, the study is wrongly described as a cohort study. A cohort
study is one in which a group of subjects (classified into two or more
subgroups by exposure to various factors of interest) is followed up over
time and re-examined at a later date. The design of this study is a cross-
sectional study. A number of tests are applied to a group of subjects.
Each test is administered only once for each subject and the tests are as
close in time to each other as possible (so that the diagnosis does not
change between administration of tests).
It is inappropriate to use the GHQ as a screening test for depression
and it is not valid to compare the characteristics of the GHQ as a test
with a gold standard test that is specific for depression. The GHQ is a
screening test for the ‘common mental disorders’ (ref 2) which commonly
occur in primary care settings and which include mixed symptoms of
depression and anxiety. This is a much broader concept than clinical
depression. The GHQ is therefore likely to have a reasonably good
sensitivity for depression but a poor specificity. This is exactly what
was found in this study. In fact, when these considerations are taken into
account, the GHQ performs surprisingly well.
The authors choose sensitivity and negative predictive value as
measures of how well their tests perform. In general, it is preferable to
pair sensitivity with specificity and to pair negative predictive value
with positive predictive value. Of these two pairs, it is usual to use
sensitivity and specificity because these can be generalised more easily.
The positive and negative predictive values, on the other hand, change
according to the prevalence of the target condition in the population.
There is a particular problem with using a combination of sensitivity
and negative predictive value because they are not independent. When the
sensitivity is close to one, the negative predictive value must also be
close to one. (Both measures include the number of ‘false negative’
subjects in the denominator. When the number of false negatives is small,
both the sensitivity and the negative predictive value tend to one.)
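The dependence is easy to demonstrate numerically; the 2x2 counts below are invented purely for illustration:

```python
# Illustration: when false negatives are few, sensitivity and negative
# predictive value are both forced towards 1, so reporting the pair
# carries little extra information. The 2x2 counts are hypothetical.
def sens_npv(tp, fn, tn, fp):
    sensitivity = tp / (tp + fn)       # FN in the denominator
    npv = tn / (tn + fn)               # FN in the denominator again
    return sensitivity, npv

# few false negatives: both measures near 1, regardless of false positives
print(sens_npv(tp=16, fn=1, tn=53, fp=30))
# more false negatives: both fall together
print(sens_npv(tp=11, fn=6, tn=61, fp=22))
```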
Whilst the sensitivity and specificity of a test are important
indicators of how well it performs, it is usual to measure the overall
quality of a test by some form of ‘measure of agreement’. For binary data,
the standard measure is Cohen’s kappa (ref 3, p.217). This measures the
level of agreement of the test with the gold standard over and above the
agreement that would be expected by chance.
The authors are to be congratulated for appropriately using McNemar’s
test for differences in proportions of matched binary data. However, their
strategy for statistical analysis and the presentation of results is very
poor.
They do not make clear a primary hypothesis or explain what questions
they are seeking to examine in their analysis. With so many different
measures, which they could compare, it is essential to start with a plan
for the statistical analysis.
For example, as a primary analysis, it would be possible to look at
sensitivity as the primary outcome and to test
(1) is there any difference in sensitivity between the tests,
including unaided clinical diagnosis? The null hypothesis being that the
sensitivity of all the tests is the same. The appropriate test would be
Cochran’s Q (ref 3, chapter 8).
(2) If (1) shows a difference, is there a difference in sensitivity
between the three paper tests combined and unaided clinical diagnosis? (by
partitioning Cochran’s Q)
(3) Are there any differences between the three paper tests?
(4) If there are differences between the three tests, where does the
difference lie?
As a secondary analysis, the same procedure could be followed for the
specificity.
The authors use one-sided statistical tests. For the comparison of
two groups, it is recommended that two-sided tests should be used, but it
would occasionally be possible to use a one sided test if an intervention
is only likely to have an effect in one direction. For more than two
groups, it is not logically possible to use one-sided tests. (To do so,
implies that the effects of the interventions can only be ordered in one
way, in which case a test for trend would be the appropriate test).
It is incorrect to report the results of statistical tests as p <=
0.05. The exact p-value should be given: there is a difference in the
interpretation of a p-value of 0.049 and one of 0.001. (If a p-value
would appear as zero at the number of decimal places given, it should be
written as, for example, p < 0.001.) Even better would be to use
confidence intervals. The procedure for confidence intervals in matched
pairs is given by Fleiss (1981), chapter 8.
The authors have carried out multiple statistical tests without
making a correction. Using one-sided tests for differences between four
‘groups’ for four operating characteristics they will have performed at
least 24 tests. Each of these tests has a probability of 0.05 of showing a
‘significant’ result. The probability of getting one or more such results
by chance alone (a type I error) is 1- (0.95)^24. This is a probability of
0.71.
Even if the authors had carried out only two-sided tests for
sensitivity and specificity the probability of getting one or more
‘significant’ results by chance alone is 0.46.
To maintain the overall level of obtaining a type I error at 0.05, a
Bonferroni or other correction is usually made to the significance level
of the test.
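The family-wise error figures quoted above, and the corresponding Bonferroni-corrected threshold, can be verified directly:

```python
# Probability of at least one false-positive ("significant") result when
# k independent tests are each run at alpha = 0.05, and the Bonferroni
# threshold that keeps the family-wise type I error rate at 0.05.
alpha, k = 0.05, 24

family_wise_error = 1 - (1 - alpha) ** k   # P(one or more type I errors)
bonferroni_threshold = alpha / k           # corrected per-test level

print(f"P(at least one type I error in {k} tests) = {family_wise_error:.2f}")
print(f"Bonferroni-corrected significance level = {bonferroni_threshold:.5f}")
```

With k = 12 (two-sided tests for sensitivity and specificity only), the same formula gives the 0.46 quoted above.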
In the discussion, it is mentioned that WHO recommends screening of
patients in the waiting room with a pen and paper test, but this advice is
not evidence based. Henkel et al.'s paper was accepted in August 2002,
but in February 2001 (ref. 4) a systematic review of the use of routinely
administered questionnaires for depression and anxiety was published in
the BMJ. The result of this review was that “routine administration and
feedback of scores for all patients did not increase the overall rate of
recognition of mental disorders such as anxiety and depression”. Selective
feedback of patients with high scores did improve the rate of recognition
of depression in two studies, but this did not increase the rate of
intervention. There was no evidence from these studies of any effect on
patient outcome.
1. Henkel V, Mergl R, Kohnen R, Maier W, Moller H-J and Hegerl U.
Identifying depression in primary care: a comparison of different methods
in a prospective cohort study. BMJ 2003; 326: 200-201 (25 January)
2. Goldberg, D and Huxley, P. Common Mental Disorders: A Bio-Social
Model. London: Routledge, 1992
3. Fleiss, J. L. Statistical Methods for Rates and Proportions. 2nd
Ed. Chichester: John Wiley and Sons, 1981.
4. Gilbody SM, House AO, Sheldon TA. Routinely administered
questionnaires for depression and anxiety: systematic review. BMJ
2001;322:406-409 (17 February)
Competing interests:
None declared
Competing interests: No competing interests
Dear Sir
Surely we need a smorgasbord of assessment techniques, rather than a
set menu, to do justice to the human condition?
I think it is about both/and, and as with most things in life, if
people feel that the person who is going through the questionnaire with
them is skilled and cares and has the ability not to have their mind
framed by any one evaluation tool, then anything has got to be better than
the current situation where many practitioners do not know how to get a
handle on these issues.
I was at a meeting this week when we were told by the speaker that
her mother had clinical depression for 31 years before a diagnosis was
made. Woeful. It is not only about having tools for people who are
competent and good, but about having them for those who are not, who can,
by using them, find out, and be helped to find out, how incredibly
rewarding depression is to treat.
Further, we do well to introduce such interventions through
supportive educational programmes and not by endlessly criticising people
for "missing it", when they too may feel that the condition is
understandable, have no up-to-date clue as to how we are wired, are
depressed themselves or have nowhere to refer the person on for help,
because, as we all know, people with depression are only a danger to
themselves and it is not really a severe and enduring mental illness
anyway?
How many of those with 'treatment-resistant' depression are in that
state because of chronicity and time to diagnosis? We do not accept that
in any other field of clinical practice. Early sympathetic intervention by
a caring, confident and knowing professional (skilled, not merely by job
title) has everything going for it in every branch of health and social care.
It is to be hoped that depression, as the third commonest cause for
GP consultations in the UK, other non-psychotic conditions, somatisation
and comorbidities can be captured in the new GP contract. It is high time
to have it in there along with asthma, diabetes and CHD, with which it is
40% comorbidly associated in any event, either as cause, effect or both.
Yours Faithfully
Dr Chris Manning
Competing interests:
None declared
Competing interests: No competing interests
I do not agree with using standardised questionnaires for detecting
depression in primary care. The presentation varies across cultures and
also with age. The geriatric populace may present with hypochondriacal
symptoms. The common presentation in this part of the world is
predominantly somatic complaints, often multiple, rather than the
cognitive symptoms. Questionnaires suited to the local population may be
more effective. If administered by local health workers and family
physicians who are familiar with the family background, they are much more
effective than filling in questionnaires in the waiting rooms of a
consultant.
Competing interests:
None declared
Competing interests: No competing interests
I do not agree with using tests to identify depression in primary
care. I think this is a good way to put up a barrier to the patient's
problems. I cannot imagine patients filling in questionnaires in the
waiting room, like somebody shopping at the supermarket while giving
opinions about the goods. My view is that the physician must know how to
conduct an interview, because clinical semiology is sovereign, and
because the clinical stance is based on the relationship, not on a cold
test. The first tool of the family doctor's performance, not only in
detecting depression, is their own self. Depression, like anything else
in clinical practice, is understood along the road of intersubjectivity.
Questionnaires are far off.
Competing interests:
None declared
Competing interests: No competing interests
Reply to Rapid Responses
EDITOR: We have read with great interest the rapid responses to our
above-mentioned report. The comments give us the opportunity to reply to
the criticism raised by some of the comments and to explain several
aspects which had been omitted when the original full length text had been
abbreviated.
Effectiveness of screening for depression
Many responses to our short report have referred to the issue of
effectiveness of screening for depression. This is a key issue, but was
not the main focus of our study. We are well aware of the fact that
increased recognition of depression does not necessarily translate into
improved outcome of depression. Screening can only be considered one
part of a bundle of interventions. Such a combination can be expected to
reduce diagnostic and therapeutic deficits concerning depression. It is
necessary that primary care physicians know how to interpret screening
results and that they have available resources for effective intervention.
Obviously, effective early intervention could prevent the needless
suffering, impairment and social costs associated with a full-blown
episode of depression.
Methods: Statistical analysis
We want to point out that our comparative study was exploratory in
character. Before the study started, we did not find any other study
comparing these three psychometric instruments (WHO-5, GHQ-12, BPHQ) with
regard to their diagnostic accuracy in primary care. Thus, confirmatory
hypotheses were not specified in advance and no hierarchical plan for
statistical analysis was applied. For the same reason, no Bonferroni
correction was conducted. However, if two-sided tests
and Bonferroni correction for multiple testing (p<=.0020833) are
applied, the following significant findings with regard to sensitivity and
specificity do result:
SENSITIVITY: WHO-5 > BPHQ, clinical diagnosis (p<=.001);
SPECIFICITY: BPHQ > clinical diagnosis > WHO-5, GHQ-12 (p<=.001).
As already indicated in our report, the results have to be treated as
tentative until confirmed by subsequent studies, preferably in different
countries.
Methods: Choice of operating characteristics
As pointed out in one of the rapid responses (by J.C. Duffy and D.F.
Peck), a very important psychometric property of screening questionnaires
is sensitivity. In the first stage of a screening process no patient with
the disease should be missed. In the second stage, a more specific
assessment of individuals, who have screened positive, is necessary.
In our view, specificity, and therefore also the likelihood ratio, are
parameters more important for diagnostic purposes (the second stage) than
for screening purposes (the first stage). If the positive likelihood ratio is
chosen as the decisive operating characteristic, the BPHQ is indeed
superior to the WHO-5.
Another way to compare screening instruments is receiver operating
characteristic (ROC) analysis. We have conducted these analyses taking
several demographic variables into consideration (paper
submitted). The results have confirmed our conclusions favouring the
application of the WHO-5 as a very sensitive, brief screening instrument
for depression in primary care in a two step screening process.
Methods: Choice of screening tools
In one of the rapid responses, the use of the GHQ-12 in our study is
criticised. The GHQ-12 has been chosen as one of the three screening
instruments because this broad-based questionnaire was established to
detect non-psychotic psychopathology in primary care and it has been
described to perform best at identifying symptoms of depression (Newman et
al. 1988). The use of broad-based versus disease-specific screening
instruments in primary care is still under discussion. Therefore, we have
decided to compare one broad-based (GHQ-12) and one disease-specific
(BPHQ) screening instrument as well as one tool that is less restricted to
both issues (WHO-5).
Methods: The reference standard
Regarding our reference standard (CIDI), this standardised, fully
structured diagnostic interview was selected because reliability and
validity of this instrument have been described (Wittchen, 1994; Andrews
& Peters, 1998) and a computer-administered form (DIA-X) (Wittchen
& Pfister, 1997) is available. The use of DIA-X can be justified
because the equivalency of the CIDI delivered by human interviewers and
its computerised version has been confirmed (Peters et al. 1998). Of
course, as suggested by the rapid response of T. S. Brugha, the SCID
(First et al. 1995) would have been an option. It would be interesting to
replicate the study by use of this reference standard. However, the CIDI
provides ICD-10 as well as DSM-IV diagnoses, whereas the SCID focuses on
DSM-IV diagnoses. In Germany, primary care doctors exclusively use ICD-10
diagnoses.
Methods: Cohort study
We have tested the comparative validity of different screening
instruments for depression in primary care in the context of an ongoing
prospective cohort study. In this study, depressed primary care patients -
as screened by family physicians using the three screening instruments -
are included in a clinical study. The study objective is to compare
outcomes in primary care patients receiving different antidepressant
treatment strategies, e.g. sertraline, placebo or cognitive behavioural
treatment. One year later, there is a follow-up examination for all
patients. Unfortunately, this information had to be omitted when the
original manuscript was revised and shortened.
Conclusions
Overall, the issue of aided recognition of depression in primary care
is a complex one. A number of investigators have already
studied the effects and the potential pitfalls of introducing a screening
instrument for depression in primary care settings. Our study confirmed
previous results which have demonstrated that general practitioners tend
to outperform several screening questionnaires in terms of specificity
(Wilkinson & Barczak, 1988), but simultaneously tend to miss many
cases. Therefore, the best working compromise appears to be the final
diagnosis of the primary care doctor, taking a screening questionnaire
result into account. In this context, our study demonstrates
the potential value of the WHO-5.
References
Andrews G, Peters L. The psychometric properties of the Composite
International Diagnostic Interview. Soc Psychiatry Psychiatr Epidemiol
1998;33(2):80-88.
First MB, Spitzer RL, Williams JBW, Gibbon M. Structured Clinical
Interview for DSM-IV (SCID). Washington, DC: American Psychiatric
Association, 1995.
Henkel V, Mergl R, Kohnen R, Maier W, Möller H-J, Hegerl U.
Identifying depression in primary care: a comparison of different methods
in a prospective cohort study. BMJ 2003;326:200-201.
Newman SC, Bland RC, Orn H. A comparison of methods of scoring the
General Health Questionnaire. Compr Psychiatry 1988;29(4):402-408.
Peters L, Clark D, Carroll F. Are computerized interviews equivalent
to human interviewers? CIDI-Auto versus CIDI in anxiety and depressive
disorders. Psychol Med 1998;28(4):893-901.
Wilkinson MJB, Barczak P. Psychiatric screening in general practice:
comparison of the general health questionnaire and the hospital anxiety
depression scale. J R Coll Gen Pract 1988;38:311-313.
Wittchen HU. Reliability and validity studies of the WHO-Composite
International Diagnostic Interview (CIDI): a critical review. J Psychiatr
Res 1994;28(1):57-84.
Wittchen HU, Pfister H. Instruktionsmanual zur Durchführung von DIA-X
Interviews. Frankfurt am Main: Swets Test Services, 1997.
Competing interests:
None declared
Competing interests: No competing interests