Unit of Biostatistics and Epidemiology, Oslo University Hospital, Kirkeveien 116, 0450 Oslo, Norway

Centre for Child and Adolescent Mental Health, Eastern and Southern Norway, Oslo, Norway

Norwegian Centre for Violence and Traumatic Stress Studies, Oslo, Norway

Centre for Clinical Research, Haukeland University Hospital, Bergen, Norway

Department of Heart Disease, Haukeland University Hospital, Bergen, Norway

Faculty of Health and Social Sciences, Bergen University College, Bergen, Norway

Institute of Medicine, University of Bergen, Bergen, Norway

Abstract

Background

Questionnaires are used extensively in medical and health care research and depend on validity and reliability. However, participants may differ in interest and awareness throughout long questionnaires, which can affect reliability of their answers. A method is proposed for "screening" of systematic change in random error, which could assess changed reliability of answers.

Methods

A simulation study was conducted to explore whether systematic change in reliability, expressed as changed random error, could be assessed using unsupervised classification of subjects by cluster analysis (CA) and estimation of the intraclass correlation coefficient (ICC). The method was also applied to a clinical dataset from 753 cardiac patients using the Jalowiec Coping Scale.

Results

The simulation study showed a relationship between the systematic change in random error throughout a questionnaire and the slope between the estimated ICC for subjects classified by CA and successive items in a questionnaire. This slope was proposed as an awareness measure, to assess whether respondents provide only a random answer or one based on substantial cognitive effort. Scales from different factor structures of the Jalowiec Coping Scale had different effects on this awareness measure.

Conclusions

Even though assumptions in the simulation study might be limited compared to real datasets, the approach is promising for assessing systematic change in reliability throughout long questionnaires. Results from a clinical dataset indicated that the awareness measure differed between scales.

Background

Questionnaires are used extensively in medical and health care research

The motivation and interest of participants in clinical studies can differ, and with them their focus on answering questions accurately. If a questionnaire contains many items and requires a lot of time to complete, concentration and enthusiasm may change throughout the questionnaire. Krosnick

Reliability can be assessed in different ways; test-retest reliability for stability, inter-item reliability for internal consistency and interrater reliability or parallel scale for equivalence

Subjects in clinical studies can be assigned to subsets by cluster analysis. The similarity between subjects within a subset should be higher than between subsets. However, with increased random error, similarity within subsets can be reduced compared to data from different subsets. Several methods for cluster analysis are available and results from questionnaires are suitable for dividing participants into subsets.

The objective of this study was therefore to simulate the relationship between systematic changes in reliability, expressed as proportion of random error, throughout a questionnaire and how it can be detected. Our proposed method is to divide respondents into subsets using cluster analysis on questionnaire items. The ICC is then estimated for each item based on a mixed effects model. The slope between ICC and item number is proposed as an awareness measure. If this approach can assess systematic change in reliability throughout a questionnaire, it may have an applied potential in "screening" questionnaires on reliability properties. The approach will also be explored using a clinical dataset.

Methods

The awareness measure

It was assumed that a number of subjects (n) answered a given number of questions (q) measuring a single construct. These q questions usually comprise part of a questionnaire, often with other questions in between. The subjects were divided into two subsets using cluster analysis (CA). Next, a mixed effects model with no covariates (intercept as the only fixed effect) and a random between-group effect was fitted for each item t of the q questions, with subset (from CA) as grouping factor. From this mixed effects model an intraclass correlation coefficient ICC_{t} was computed for each item t, as
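For a random-intercept model of this kind, the intraclass correlation is conventionally the share of the total variance that is attributable to the grouping factor. Assuming that convention applies here, and writing \sigma^2_{B,t} and \sigma^2_{W,t} (our notation, introduced only for this illustration) for the between-subset and residual within-subset variance components of item t, the coefficient takes the form:

```latex
\mathrm{ICC}_t = \frac{\sigma^2_{B,t}}{\sigma^2_{B,t} + \sigma^2_{W,t}}
```

Under this form, ICC_{t} approaches 1 when the two subsets are well separated on item t, and approaches 0 when the answers within a subset vary as much as the answers between subsets.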

Finally, a linear regression of intraclass correlation on item number was fitted. If the cluster analysis or the mixed effects models did not converge for some items, the linear regression was based on the remaining items. The slope from this linear regression was the awareness measure throughout questionnaires: a negative slope indicates reduced reliability towards the end of the questionnaire due to increased random error, and a positive slope indicates increased reliability towards the end of the questionnaire due to reduced random error. An illustration of the procedure is shown in Figure

**Proposed method to detect change in reliability throughout questionnaires**. The flow chart shows the steps of the proposed method. The slope between ICC and item number is proposed as a measure to detect change in reliability. A negative slope indicates increased random error and poorer reliability and a positive slope indicates decreased random error and improved reliability. This slope is our awareness measure (see Table 1 and Figure 3)
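The three steps of the procedure (classify subjects, estimate an ICC per item, regress ICC on item number) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a plain 2-means split stands in for the clara cluster analysis, and a one-way ANOVA estimator of the ICC stands in for the mixed effects model fit. All function names (`two_means`, `icc_oneway`, `ols_slope`, `awareness_measure`) are hypothetical.

```python
from statistics import mean


def sq_dist(u, v):
    """Squared Euclidean distance between two score vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def two_means(rows, max_iter=100):
    """Assign each subject (a row of item scores) to one of two subsets."""
    # Deterministic start: the subjects with the lowest and highest total score.
    centers = [list(min(rows, key=sum)), list(max(rows, key=sum))]
    labels = None
    for _ in range(max_iter):
        new_labels = [int(sq_dist(r, centers[1]) < sq_dist(r, centers[0]))
                      for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
        for k in (0, 1):
            members = [r for r, g in zip(rows, labels) if g == k]
            if members:
                centers[k] = [mean(col) for col in zip(*members)]
    return labels


def icc_oneway(values, labels):
    """One-way ANOVA estimate of the ICC with subset as grouping factor.

    Returns None when the ICC cannot be estimated (assumption of this
    sketch: such items are simply dropped before the regression).
    """
    groups = [[v for v, g in zip(values, labels) if g == k] for k in (0, 1)]
    groups = [g for g in groups if g]
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        return None
    grand = mean(values)
    means = [mean(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    msb, msw = ssb / (k - 1), ssw / (n - k)
    # Effective group size for unbalanced groups.
    n0 = (n - sum(len(g) ** 2 for g in groups) / n) / (k - 1)
    var_between = max((msb - msw) / n0, 0.0)
    total = var_between + msw
    return var_between / total if total > 0 else None


def ols_slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def awareness_measure(rows, item_numbers):
    """Slope of per-item ICC on item number, using the items that could be estimated."""
    labels = two_means(rows)
    pairs = []
    for j, t in enumerate(item_numbers):
        icc = icc_oneway([r[j] for r in rows], labels)
        if icc is not None:
            pairs.append((t, icc))
    if len(pairs) < 2:
        raise ValueError("need at least two items with an estimable ICC")
    xs, ys = zip(*pairs)
    return ols_slope(xs, ys)
```

Here `rows` holds one list of item scores per subject and `item_numbers` gives the position of each item within the full questionnaire, so that the slope is expressed per question rather than per scale item.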

When the procedure was used with real datasets, confidence intervals of the slope were based on 10000 bootstrap replicates. The slope was considered significantly different from zero if zero was outside a 95% bootstrap BC_{a} confidence interval
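The resampling logic behind such an interval can be illustrated with a simpler percentile bootstrap. Note the substitution: the study reports bias-corrected and accelerated (BC_{a}) intervals, whereas the sketch below uses the plain percentile method, which omits the bias and acceleration corrections. It also assumes that subjects are the resampling unit and that the whole statistic is recomputed on every resample; the names `percentile_bootstrap_ci` and `excludes_zero` are hypothetical.

```python
import random


def percentile_bootstrap_ci(subjects, statistic, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for statistic(subjects).

    subjects  : list with one entry per subject (e.g. a row of item scores)
    statistic : function mapping such a list to a number
    """
    rng = random.Random(seed)
    n = len(subjects)
    replicates = []
    for _ in range(n_boot):
        # Resample subjects with replacement and recompute the statistic.
        sample = [subjects[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(sample))
    replicates.sort()
    alpha = (1.0 - level) / 2.0
    lower = replicates[int(alpha * (n_boot - 1))]
    upper = replicates[int(round((1.0 - alpha) * (n_boot - 1)))]
    return lower, upper


def excludes_zero(interval):
    """True when zero lies outside the interval."""
    lower, upper = interval
    return lower > 0 or upper < 0
```

In the setting described above, `statistic` would be a function returning the slope of ICC on item number for the resampled subjects, and `excludes_zero` corresponds to the significance criterion.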

Simulation study

The simulation study was conducted to investigate a systematic change in random error, and hence changed reliability throughout a questionnaire. A basic assumption was the existence of an unknown underlying factor (f_{i}) partially determining the questionnaire items considered. A unique source of variance (e_{it}) was included which was assumed to systematically increase or decrease throughout the scale, reflecting changed reliability. Specifically, we assumed that the answer (y_{it}), for each person i on each item t was given by

where

and all f_{i} and e_{it} are independent, from the first (t_{f}) to the last (t_{l}) item used within the scale, with fixed numbers a and b. Division by

**Assumptions in the simulation study**. The sources of variance are assumed due to an underlying factor (f_{i}) and random error (e_{it}). Increased random error represents poorer reliability. Total variance, Var(Y_{it}), is standardized to 1.0 before estimation of the awareness measure.

The awareness measure was computed as indicated above, and its distribution in 10000 simulations reported for different scenarios determined by chosen values of a, b and number of questions (q). For q we used 4 or 13 items, evenly distributed among 50 questions, and the number of persons was 600. 1) A negative value of the awareness measure was expected if a was considerably less than b, simulating the scenario with decreasing awareness throughout the questionnaire. 2) An awareness measure close to 0 was expected if a was approximately equal to b, simulating the scenario with constant awareness throughout the questionnaire. 3) A positive value of the awareness measure was expected if a was considerably larger than b, simulating the scenario with increased awareness throughout the questionnaire (Figure
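One data-generating model consistent with this description can be sketched as follows. The specifics are assumptions of the sketch rather than the study's exact formulas: f_{i} and e_{it} are taken as standard normal, the weight on the random error moves linearly from a at the first item to b at the last item, each answer is divided by the square root of its total variance so that Var(Y_{it}) = 1, and the q items are placed from question 1 to question 50. All function names are hypothetical.

```python
import math
import random
from statistics import pvariance


def item_positions(q, n_questions=50):
    """Spread q items evenly over question numbers 1..n_questions."""
    if q == 1:
        return [1]
    return [1 + round(j * (n_questions - 1) / (q - 1)) for j in range(q)]


def error_weight(t, t_first, t_last, a, b):
    """Weight on the random error at question t: a at the first item, b at the last."""
    return a + (b - a) * (t - t_first) / (t_last - t_first)


def simulate_answers(n, positions, a, b, rng):
    """Return n rows of standardized answers, one column per scale item."""
    t_first, t_last = positions[0], positions[-1]
    rows = []
    for _ in range(n):
        f = rng.gauss(0.0, 1.0)  # underlying factor, shared by all items
        row = []
        for t in positions:
            c = error_weight(t, t_first, t_last, a, b)
            e = rng.gauss(0.0, 1.0)  # item-specific random error
            # Var(f + c*e) = 1 + c^2, so this division gives unit variance.
            row.append((f + c * e) / math.sqrt(1.0 + c * c))
        rows.append(row)
    return rows


def item_variances(rows):
    """Variance of each item across subjects (should be close to 1)."""
    return [pvariance(col) for col in zip(*rows)]
```

With a considerably smaller than b, the share of variance due to the underlying factor, 1 / (1 + c^2) in this sketch, falls from the first to the last item, which is the scenario in which a negative awareness measure is expected.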

Patient sample

The clinical dataset was from a study conducted between August 2000 and February 2002. The source population included 1283 patients admitted to elective coronary angiography at the Department of Heart Disease, Haukeland University Hospital, Bergen, Norway. At least 214 of these patients were not invited to participate due to capacity reasons. Among the remaining 1069 eligible patients, 753 patients (70%) responded. However, due to missing items, 632 individuals with valid values constituted the study population

The Jalowiec Coping Scale

The revised 60-item Norwegian version of the JCS was used

Estimated awareness measure (AM) in a clinical dataset using the Jalowiec Coping Scale.

| Scale | # Items | Resp. | N | AM (95% BC_{a}) |
|---|---|---|---|---|
| Confrontive^{a} | 10 | 639 | 563 | 0.0030 (-0.0002, 0.0096) |
| Evasive^{a} | 13 | 642 | 541 | 0.0020 (0.0012, 0.0057) |
| Optimistic^{a} | 9 | 643 | 577 | 0.0077 (0.0059, 0.0115) |
| Self-reliant^{a} | 7 | 637 | 573 | 0.0001 (-0.0051, 0.0059) |
| Confrontive^{b} | 12 | 642 | 549 | 0.0018 (-0.0026, 0.0038) |
| Normalising optimistic^{b} | 10 | 640 | 582 | 0.0098 (0.0055, 0.0138) |
| Combined emotive^{b} | 9 | 646 | 590 | -0.0038 (-0.0069, -0.0004) |

# Items: Number of items in selected factor structure. Resp.: Number of the 753 respondents completing at least one item in the scale. N: Number of respondents completing all items in the scale; the analyses are based on these. AM: Awareness measure.

^{a} Jalowiec's original scale.

^{b} Wahl's alternative factor structure

Statistical software

All computations were conducted with the software R version 2.9.1-2.12 (The R Foundation for Statistical Computing, Vienna, Austria), with the R packages cluster with the clara function for cluster analysis

Results

Simulation study

Boxplots from 10000 simulations of the proposed awareness measure, exploring reliability for different scenarios, are presented in Figure

**Results from the simulation study**. Boxplots of the estimated awareness measure from simulated data, used to detect change in reliability expressed as random error. The number of simulations for each set of conditions is 10000. The awareness measure is estimated under scenarios with different values of a, b and q (see Figure 2). The number of subjects is 600.

Clinical dataset on JCS

The results are shown in Table 1. The highest slope is 0.0098 (95% BC_{a} 0.0055, 0.0138) for the Normalising optimistic scale, whilst the slope for the Optimistic scale is somewhat lower, 0.0077 (95% BC_{a} 0.0059, 0.0115). Thus, based on our proposed method, awareness and thereby reliability increased throughout the answering process for both optimistic scales. In contrast, the slope for the Combined emotive scale is negative, -0.0038 (95% BC_{a} -0.0069, -0.0004). Based on our proposed method, reliability decreased throughout the questionnaire for the Combined emotive scale.

Discussion

The simulations showed that our procedure is able to detect changes in awareness during completion of a questionnaire for a single-factor scale, provided that such changes result in changed random error. When the procedure is used on a real data set, the results depend on the scale. The highest slope (in absolute value) is 0.0098 per question for the Normalising optimistic scale, corresponding to an increase in ICC of 0.0098⋅60 = 0.59, with confidence limits 0.33 and 0.83, over the 60 JCS questions. The slope for the Optimistic scale is somewhat lower. In contrast, the negative slope of -0.0038 for the Combined emotive scale corresponds to -0.0038⋅60 = -0.23, with confidence limits -0.41 and -0.02, over 60 questions. According to our proposed method, these slopes are large enough to indicate relevant changes in reliability during the 60-item JCS questionnaire. For the other scales the changes were smaller. It is interesting that the two optimistic scales, with items for "positive" and constructive coping strategies, also show increased reliability throughout the scale according to the proposed awareness measure. Perhaps the participants want to emphasise these items and answer them very accurately, thus confirming "positive" coping strategies. The Combined emotive scale contained many "negative" coping strategies, e.g. "avoided being with people", "took your tensions on someone else" and "took medications to reduce tensions". One hypothesis is that participants may feel embarrassed or hopeless about using such strategies and thus do not want to answer accurately. Both these tendencies may increase as the participants become accustomed to answering coping questions.
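The conversion from a per-question slope to a change over the whole questionnaire is a simple rescaling, shown here for the two scales discussed above. The function name `change_over_questionnaire` is hypothetical; the slopes and confidence limits are those reported for the Normalising optimistic and Combined emotive scales.

```python
def change_over_questionnaire(slope, ci, n_questions=60):
    """Scale a per-question slope and its confidence limits to n_questions.

    Returns (change, lower, upper), each rounded to two decimals.
    """
    lower, upper = ci
    return tuple(round(x * n_questions, 2) for x in (slope, lower, upper))


# Normalising optimistic scale: 0.0098 (0.0055, 0.0138) per question
print(change_over_questionnaire(0.0098, (0.0055, 0.0138)))    # → (0.59, 0.33, 0.83)

# Combined emotive scale: -0.0038 (-0.0069, -0.0004) per question
print(change_over_questionnaire(-0.0038, (-0.0069, -0.0004)))  # → (-0.23, -0.41, -0.02)
```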

Reliability and reproducibility are crucial in all quantitative research. Diagnostic tests are conducted under supervised conditions to assure both validity and reliability of results. Laboratory analyses in areas such as genetics, chemistry or physics are often done in replicates to assess reliability. Likewise, the reliability of questionnaires and surveys should be assessed when they are used as measurement tools in research. Predictions from statistical models are limited by the reliability of measurements

The aim of our proposed method is to serve as a screening test to detect reliability differences throughout a scale. Questionnaire length may reduce the motivation of the participants. A meta-analysis on questionnaire length showed lower response rates for longer than for shorter questionnaires

Item Response Theory (IRT) (see e.g. textbooks by Lord

There are several assumptions in our simulation study. Scales are assumed to represent a uniform underlying factor for all participants. Our mixed results on scales in the real data set may be due to problems with the factor structure. However, the scales used have acceptable internal consistency in most studies, and we have used scales from two alternative factor structures, with similar results. The mixed results may also indicate that changes in awareness during a long questionnaire constitute a more complex process, depending on context and the individual items. Satisficing can also take other forms than randomly answering items and be a more systematic process

Conclusions

An awareness measure was proposed to explore changes in reliability throughout questionnaires. The simulation study showed that a systematic change in random error could be detected by estimating the ICC for subjects classified by unsupervised CA. In the real data set, however, different changes were observed for different scales.

Response burden always needs to be considered when planning a study. Consequently, when applying long questionnaires, reliability should be evaluated. We suggest using CA and estimation of ICC to assess potential systematic change in reliability.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

TWL carried out the simulations and drafted the manuscript. AHP developed the initial idea for the work, participated with simulations and drafted the manuscript. TMN, BU and ON participated in the preparation of data and selection of scales for investigation. All authors read and approved the final manuscript.
