Author: Roger White
Journal Title: PRS-LTSN Journal
Volume: 1
Number: 1
Start page: 52
End page: 60
In some subjects it ought to be relatively straightforward to devise an objective and fair method of assessment for deciding whether a student has attained the standard required for a degree of any given class. This is so where, as in law, there is an identifiable body of facts which students need to master to obtain a degree, or where, as in mathematics, it will be relatively beyond dispute whether a student has or has not succeeded in proving what is required to be proved. However, in other subjects, of which philosophy is perhaps the most obvious example, what constitutes excellence in the subject is far more a matter of judgment, and even controversy. Hence, there is a great need to scrutinise the method whereby we seek to ensure objectivity in examinations.
Traditionally the preferred method of assessment was ‘double marking’: a method where every script is marked by two internal examiners, who then meet to discuss the marks they have independently arrived at and settle on an agreed mark, which is then submitted to the scrutiny of an external examiner. Latterly, a different method of assessment has grown up, ‘monitoring’, in which a first examiner submits a set of marks to a second examiner, who samples a sufficient number of the first examiner’s marked scripts to form a judgment of how far the two examiners agree. The monitor does not attempt to agree marks with the first examiner but writes a brief report on the first examiner’s marking. This report is then discussed, and then, if necessary, the first examiner’s marks are systematically adjusted. For example, if the monitor forms the opinion that the examiner has been too harsh, and succeeds in persuading the examiner that this is so, then the mark of every student may be raised somewhat. The scripts are then submitted together with the monitor’s report to the scrutiny of the external examiner.
There is no doubt that the system of monitoring has grown up in large part under the pressure of the increased workload created by such factors as worsening staff/student ratios, and the increased number of heads under which students are assessed under modularisation. Since it has frequently been adopted thus for reasons of expediency, there is a widespread feeling that this system is inferior to double marking and has only been adopted out of necessity.
However, I believe that the widespread opinion that double marking is the superior way of examining is an irrational prejudice, based simply on the vague idea that two heads must be better than one, and that in most respects monitoring, properly done, is in fact more likely to yield an objectively just result. It is therefore worthwhile spelling out the reasons why this is so. I actually advocated that we switch away from double marking long before the pressure of work led our own department to do so. This was as a result of studies, conducted by my colleague Timothy Potts and myself, of what actually occurs when people double mark. Although these studies were conducted a long time ago, what we discovered then is still relevant to the current situation.
I initially became worried about the objectivity and rationality of our examination procedure shortly after I came to Leeds, as a result of a few cases where what had happened seemed difficult to reconcile with the idea that justice was done to groups of students taking those particular courses. (I will not identify the examiners involved: all are now retired.) At that time, double marking was of course a sacred cow, and the department was small enough to cope easily with the workload involved. (The externals also read every script, something which is now completely impractical, but that was the only feature of the system that could protect the examination from becoming a farce in the case that I shall mention. That safeguard has now long vanished.) One case was the most worrying, because it represents in an extreme form a situation that, even if only for a minority of batches of scripts, does recur with sufficient frequency to be a problem for a system of examination. Here the two examiners had produced marks that bore no discernible relation to one another at all: one examiner would give a 1st to a script that the other examiner saw as a low 2/2 (or even in one case a 3rd), and vice versa. As a result I did an informal study of the examination for all the courses for that year. The results were sufficiently disturbing for me to raise the issue of the objectivity of our examination procedure. This was followed up by Timothy Potts who, following my lead, did a complete statistical breakdown of the marks assigned in examinations for the previous three years, comparing such things as the arithmetical mean marks, standard deviations and rank orderings produced by each pair of examiners for each course. What follows are some of the conclusions I arrived at as a result of the studies we had undertaken between us.
(I am of course relying here on memory from a long time ago—there may be the odd mistake in what I say, but I am confident that for the most part my memory is accurate.)
Against this background, the questions arise, ‘How well does double marking do as a method for arriving at a just mark on scripts?’ and ‘Is there reason to suppose that monitoring fares better?’ I take monitoring to be the practice we have adopted at Leeds, where one examiner marks an entire batch of scripts, and then a second marker marks a significant sample, large enough to judge how well the first examiner has done their job. Departmental policy says that 10% of scripts plus 1sts and fails should be looked at. I have always interpreted this as a minimum: where one is monitoring a small batch of scripts (e.g. a module with 20 or fewer scripts), it would clearly be inadequate just to look at two, since what is required is to look at enough scripts to get a proper picture of what the first examiner has done. The two examiners then meet to discuss how, if at all, it would be appropriate to modify the first examiner’s marks. Departmental policy is that monitoring is a monitoring of the whole examination, and not the provision of second marks for individual scripts. That is, the result of monitoring should not be the adjustment of individual marks, but a suggested systematic modification of all the first examiner’s marks. The only individual marks that are adjusted are perhaps those at the very top or the very bottom, where it is a question of how a very good or very bad script is to be marked. Otherwise, adjusting individual marks is unfair either on those students whose scripts happen to have been selected for monitoring, or on those who have not. (The only exception I would, perhaps somewhat inconsistently, make to this rule is where the divergence between the examiner and the monitor is explained not by a difference in judgment between the two, but by a definite, indisputable oversight on the part of the examiner: for instance, where the examiner overlooks a gross error of fact on the part of the candidate.)
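The monitoring procedure just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a statement of departmental practice: the class boundaries (70 for a 1st, 40 for a pass), the random choice of the extra scripts, the function names and the marks themselves are all hypothetical, and a uniform shift is only the simplest form a systematic modification might take.

```python
import math
import random

FIRST_BOUNDARY = 70  # assumed lower boundary for a 1st
PASS_MARK = 40       # assumed pass mark; anything below counts as a fail


def monitoring_sample(marks, fraction=0.10, seed=0):
    """Return the indices of scripts for the monitor to read.

    All 1sts and fails are included, plus a further `fraction` of the
    batch (rounded up), chosen at random. Random choice is an
    assumption; the policy does not say how the 10% is picked.
    """
    must_read = {i for i, m in enumerate(marks)
                 if m >= FIRST_BOUNDARY or m < PASS_MARK}
    others = [i for i in range(len(marks)) if i not in must_read]
    n_extra = min(len(others), math.ceil(fraction * len(marks)))
    rng = random.Random(seed)
    return sorted(must_read | set(rng.sample(others, n_extra)))


def adjust_systematically(marks, shift):
    """Apply the same shift to every mark, kept within 0..100."""
    return [max(0, min(100, m + shift)) for m in marks]


# Hypothetical first examiner's marks for a batch of ten scripts.
marks = [72, 65, 58, 55, 52, 48, 61, 38, 66, 59]

# Scripts the monitor reads: the 1st, the fail, and one further script.
sample = monitoring_sample(marks)

# Suppose both examiners agree the marking was three marks too harsh.
adjusted = adjust_systematically(marks, 3)
```

Because every script receives the same adjustment, the rank order of the batch is unchanged, and no student gains or loses merely because their script happened to fall in the sample.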
So, how do monitoring and double marking fare for each of the three types of sets of marks I identified in 2. above?
The main conclusion I draw from the preceding is that the system of double marking, despite its reputation, is a deeply flawed system. The idea that it is the best system of examination is a myth, which is only sustained because it is not subjected to scrutiny—including the kind of empirical scrutiny which Timothy Potts and I subjected it to. The following defects emerge from the earlier discussion:
Surveying the ‘agreed’ marks actually given by two examiners suggests that, whatever we think we are doing, most of the time the upshot of the discussions between the two examiners is a mark which is the average of their two original marks. If the examiners in fact disagree, either in the qualities that they are looking for in a good script or in the way that they translate their opinion of scripts into numbers, it is hard to believe that such average marks have much real meaning. (The most that can be said is that, if either of the two original marks was right, the average ‘won’t be too far out’; I suspect it is that thought which makes averaging attractive. However, that thought may well be depriving a student of a 1st class mark, if one of the two examiners has seriously underestimated the script.)
The effect of such averaging is a large-scale regression to the mean. This is perhaps both the most obvious defect, and the most vicious aspect of the system of double marking. When, as now, we are assessing students under a large number of heads, and then arriving at a class by averaging, the threat of regression to the mean is already real enough—even now we have a system where it is remarkably easy to get a low 2/1, but difficult to get a 1st or a 3rd. If we were to engage in double marking with our present numbers of students and under a modular system, we would have a system of examining which would make it impossible to differentiate students, apart from the very few that swam against the stream by being exceptionally good or bad in everyone’s opinion.
The system of double marking is not designed to detect readily when differences between the marks awarded by two examiners for a particular script are the effect of systematic differences of marking practice between the two examiners rather than disagreements about that particular script. Such systematic differences should be dealt with systematically, and not somewhat erratically on a script-by-script basis. Systematic differences between the marking practices of two examiners, which will affect a whole batch of scripts and can have large effects on individual marks, are probably far more significant than particular disagreements in judgment, and yet are completely neglected by double marking.
The system of double marking does not have built into it a rational decision procedure for what should happen when there is no real meeting of minds between the two examiners. Looking at the results produced by Timothy’s studies suggested that examiners were typically prone in such cases to produce an average mark as the agreed mark, even though in these cases such average marks are almost completely meaningless.
The defects noted above would to some extent be compensated for (at the time that Timothy Potts and I made our studies) by the role of the external examiner. At that time we were a much smaller department, marking a much smaller number of courses, and the external examiners did read and mark every script, so that the vagaries of the internal examiners could be and frequently were overridden. However, the time when that was possible is long past, and the pressure of exam load has also increased in ways that would exacerbate the problems we detected. (There is now, for example, much less time for a full discussion between examiners, increasing the temptation simply to average marks.)
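The effect of averaging described among these defects can be illustrated with a small sketch. The marks below are invented for the purpose and are not taken from the studies mentioned; the class boundaries (70, 60, 50, 40) are likewise an assumption. The two hypothetical examiners disagree about which scripts are good, each awarding one 1st, one 2/1, one 2/2 and one 3rd, and the ‘agreed’ mark is taken to be the plain average.

```python
from statistics import pstdev

# Invented marks for four scripts; the examiners' rank orders conflict.
examiner_a = [72, 55, 48, 65]
examiner_b = [50, 70, 68, 45]

# The 'agreed' mark, assumed here to be the simple average of the two.
agreed = [(a + b) / 2 for a, b in zip(examiner_a, examiner_b)]


def degree_class(mark):
    """Map a mark to a class, using assumed boundaries."""
    if mark >= 70:
        return "1st"
    if mark >= 60:
        return "2/1"
    if mark >= 50:
        return "2/2"
    if mark >= 40:
        return "3rd"
    return "fail"


# Spread of each set of marks (population standard deviation).
spread_a = pstdev(examiner_a)
spread_b = pstdev(examiner_b)
spread_agreed = pstdev(agreed)

agreed_classes = [degree_class(m) for m in agreed]
# agreed is [61.0, 62.5, 58.0, 55.0]: two 2/1s and two 2/2s.
```

In this example the spread of the averaged marks is far smaller than that of either examiner, and every script ends up as a 2/1 or a 2/2, although each examiner separately awarded a 1st and a 3rd.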
The system of monitoring is designed in such a way that it avoids all of the defects that I have specified: examiners do not agree marks on each individual script and hence do not average marks; as a consequence the system has absolutely no tendency to produce a regression to the mean; the task of the monitor is precisely to detect systematic differences of opinion, which then allow one to adjust a whole set of marks systematically; and finally the fact that a monitor’s primary task is simply to make a report on the first examiner’s work means that a radical difference between the two can be brought into the open and then dealt with.
The only indisputable advantage of double marking is that there can occur cases where the first examiner makes an error of judgment on a particular script which is then picked up by the second, and the first examiner is persuaded of the error. However, looking at the extent to which practice is dominated by simply averaging marks suggests that this situation may occur less frequently than we think, and given that no examination system is ever going to be perfect, I believe there is an overwhelming case for saying that monitoring is on balance the vastly superior system, quite disregarding questions of the workload imposed on examiners by the two systems.
Created on: December 18th 2009
Updated on: August 19th 2010