Automated Assessment and Out-of-class Small Group Discussions: Taking WebCT forward in Philosophy

Julia Tanney

Department of Philosophy, University of Kent

Start date: July 2006

This proposal uses special features of WebCT for an in-depth study of two aspects related to e-learning and Philosophy. First, the project will assess the suitability of automated, or semi-automated, assessment, taking into account, for example, its capacity partly to replace assessments that are ripe for internet plagiarism and weighing this against the reluctance within the profession to move away from essay-style assignments. (Note that the project will also experiment with a WebCT-integrated "TurnitinUK" submission feature for other, essay-type assignments.) Second, the project will assess whether student learning is enhanced by mentored participation in small-group, out-of-class, on-line discussions. It also aims to address specific concerns and to flag up valuable features associated with WebCT. If successful, it will provide a viable model of good practice for implementation by other HE and non-HE institutions, both in the UK and internationally.

Final Report

Background

When I arrived at the University of Kent in 1995, I taught Philosophy of Mind to a group of fewer than 20 students in a one-hour lecture, two-hour seminar slot. Within two years, Philosophy decreased the seminars to one hour. Each year, the number of students has increased. Two years ago, I taught more than 50 students in one two-hour class. The assessments (one mid-term and one final essay) remained the same throughout the period of student expansion until, with over 100 students per term, the marking became untenable. As the government's widening-participation policy was implemented, the range of ability weakened considerably. I typically would receive mid-term essays which began, for example:

"In this paper, I shall discuss the philosophical doctrine of behaviourism [or functionalism, or type-physicalism, etc.] This doctrine holds that..."

and what the student would go on to say would be false. It was clear that I had a large number of students who had not even mastered the basic positions in philosophy of mind. Reading essays and trying to assign marks on some of the content, based on a misunderstanding of the basic positions, became extraordinarily difficult.

In the following year, I introduced a mid-term, in-class test that required students to write short paragraph answers. This was noted and praised by an external reviewer for our teaching audit, but he pointed out that because these answers still had to be hand-marked, it was not as efficient as it could be. It was he who suggested I consider designing automated assignments.

Description of project

1. The main purpose of this project was to assess the suitability of automated, or semi-automated, assessment in Philosophy. Another goal was to integrate the "TurnitinUK" submission feature for essay-type assignments. Yet another was to assess whether student learning is enhanced by mentored participation in small-group, out-of-class, on-line discussions.

2. Early in the project, colleagues were interviewed about their attitude to multiple-choice and other automated questions for assessment in Philosophy. They were almost unanimous in feeling that this style of assessment was inappropriate for Philosophy as a subject; one, however, suggested that it might be appropriate for the kind of courses I taught.

3. Over time the project focus became much more oriented toward on-line assessment. Although I continued with mentored, on-line group discussion, I treated it as an extra dimension and much more time and energy was put into developing, moderating, and defending the assessment. During an early Board meeting it was decided that all questions for automated assignments would be subject to scrutiny by another member of staff (acting as moderator). This means that the questions are checked by at least two members of staff.

4. In the first semester's student evaluations, one class was equally divided on whether the use of WebCT was a good thing. But there were other problems with the course that needed correcting before the work on WebCT could be properly assessed (e.g., it was decided that teaching 50 students in classroom style is inappropriate; Philosophy policy is now that this number not exceed 40). By the next term, the project was no longer an "experiment" and the feedback was generally positive, especially with respect to "prompt return of written work" and "written comments helpful", which received 4.6 out of 5 in one course. By the next semester, these marks had risen to 4.8 out of 5.

5. The project was early in its second year when the UK's National Student Survey revealed that philosophy students felt they would benefit from quicker turnaround of assessments, and more feedback. A University Assessment Strategy was set up to get people to reflect on how they can assess more efficiently and in tune with the learning outcomes of the modules. This project nicely anticipated the newly published assessment strategy insofar as 1) students have their work marked with full comments within a few days instead of the normal 3 weeks; 2) the WebCT software generates a full report for each multiple choice question, giving the students the chance to see exactly what part of the material they failed to understand: in other words, the feedback is comprehensive; 3) using semi-automated assessment makes it easier to use the full range of the marking scale; and 4) it is an extraordinarily efficient way of assessing mastery of basic concepts, basic philosophical positions, and philosophical arguments.

6. By the second year of the project, the whole School was asked to roll out their modules on WebCT (at least for the syllabus, reading material, etc.). Colleagues in other subjects began to work on some of the other features of the software as well. After recently giving a presentation to colleagues on the benefits of semi-automated assessment (during a Philosophy Board Meeting with student representatives from my courses in attendance), and obtaining extra money from the Unit for the Enhancement of Learning at Kent to offer as an 'incentive', some colleagues signed up for the project of compiling a bank of multiple-choice questions. Interestingly, most of them have decided, rather than make banks for their own modules, to help me build up a bank of questions for my courses. (This may reflect their judgement that the assessment is appropriate for my courses but not theirs.)

7. I was invited to apply for, and won, a monetary prize from the Humanities Faculty for the work I have been doing on WebCT; I was one of the invited speakers at the University Assessment Strategy Day, as well as at a School-level discussion on good and innovative practice. During the latter discussion, when other members of the School expressed their strong reservations about the appropriateness of multiple choice questions in the Humanities, two Philosophers spoke up and said that they too had had strong reservations but are now convinced that at least certain subjects in Philosophy (Philosophy of Mind, Wittgenstein) lend themselves well to this kind of assessment.

8. Future plans: I have been invited to contribute to a workshop specifically addressing multiple-choice questions. I have informed colleagues from other Universities informally about my work on automated assessment and have suggested that I be invited to talk to their departments about it.

9. The GradeMark feature of Turnitin makes collection, correction, and moderation of essays extremely practical and energy-efficient. The plagiarism feature of Turnitin is interesting and more helpful than not. But the main interest in my view is the efficiency (and relative safety) of handing essays in on-line.

10. Students are divided about the effectiveness of mentored, on-line discussion. Generally, the shy ones are in favour and participate well; the more confident students prefer more discussion in class. From my point of view as a teacher, mentored on-line discussion is a luxury that can only be afforded if there is a paid assistant (e.g., graduate student) willing and able to help facilitate the discussions outside of normal contact hours.

Description of tests

All on-line tests to date have included a mixture of multiple choice, true-or-false, mix-and-match, and fill-in questions worth 75% of the mark for that test; in addition there are short paragraph answers (which have to be hand-marked) worth 25%. The on-line assessments are conducted in a computer lab in which I, with the aid of a teaching assistant, invigilate. For each module, the on-line assessment is worth 40% of the final mark; the final short essay or report is worth 40%; and in-class/on-line participation and attendance are worth 20%. There has been no noticeable difference in how students do in the paragraph answers and the automated ones. In general, student performance on these quizzes is not inconsistent with their performance on the final, essay-style report.
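The weighting just described can be made concrete with a small calculation. What follows is only an illustrative sketch in Python: the function and constant names are hypothetical and are not part of WebCT. Only the percentages are taken from the description above, and every mark is assumed to be on a 0-100 scale.

```python
# Illustrative sketch only: names are hypothetical, not WebCT's.
# The weights are the percentages stated in the description of the tests.

AUTOMATED_SHARE = 0.75       # multiple choice, true/false, matching, fill-in
PARAGRAPH_SHARE = 0.25       # hand-marked short paragraph answers

ONLINE_TEST_WEIGHT = 0.40    # on-line assessment
ESSAY_WEIGHT = 0.40          # final short essay or report
PARTICIPATION_WEIGHT = 0.20  # in-class/on-line participation and attendance


def online_test_mark(automated, paragraph):
    """Combine the two parts of one on-line test (each marked out of 100)."""
    return AUTOMATED_SHARE * automated + PARAGRAPH_SHARE * paragraph


def module_mark(automated, paragraph, essay, participation):
    """Combine the three assessed components into the final module mark."""
    return (ONLINE_TEST_WEIGHT * online_test_mark(automated, paragraph)
            + ESSAY_WEIGHT * essay
            + PARTICIPATION_WEIGHT * participation)


def automated_share_of_module():
    """Fraction of the final module mark decided by automated questions."""
    return AUTOMATED_SHARE * ONLINE_TEST_WEIGHT
```

For a hypothetical student with 80 on the automated questions, 60 on the paragraph answers, 65 on the final essay and 70 for participation, these weights give a test mark of 75 and a module mark of 70. The automated questions alone account for 0.75 × 0.40 = 30% of the module mark.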

Discussion

Multiple choice questions are by far the most difficult to create and yet enable one to test understanding in greater depth, especially if two or three questions centre upon the same point or argument. (See Appendix.) The fact that students can be penalised for wrong answers and therefore are dissuaded from guessing is crucial and is one reason to favour multiple choice questions on WebCT. It takes on average 20 to 30 minutes to create a question, including pre-recording the feedback.
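Why a penalty dissuades guessing can be shown with a short expected-value calculation. The sketch below is an assumption-laden illustration, not a description of WebCT's scoring: it assumes a question with exactly one correct option worth 1 mark and a fixed deduction for a wrong choice. The size of the penalty actually used in these tests is not stated in this report, and the function names are hypothetical.

```python
from fractions import Fraction


def expected_guess_score(n_options, penalty):
    """Expected mark from one blind guess on a single-answer question.

    Assumption: a correct choice earns 1 mark and a wrong choice
    loses `penalty` marks; every option is equally likely to be picked.
    """
    p_correct = Fraction(1, n_options)
    p_wrong = 1 - p_correct
    return p_correct * 1 - p_wrong * Fraction(penalty)


def neutral_penalty(n_options):
    """Penalty at which blind guessing gains nothing on average."""
    return Fraction(1, n_options - 1)
```

With four options and no penalty, a blind guess is worth a quarter of a mark on average. With a penalty of one third of a mark the expected gain is zero, and with any larger penalty guessing loses marks on average. On questions where the number of correct options is not disclosed, the incentive to guess is weaker still, although that case is not modelled here.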

There is a remarkable prejudice against non-essay style questions in Humanities subjects. Although I have finally convinced colleagues that such assessments have a role in my course, most cannot conceive of how they could be appropriate for their own courses. A few colleagues appreciate that such assessments may be appropriate for their courses, but are (understandably) put off by the amount of effort required to compile a question bank.

These two points are related and worth reflecting upon. A surprising number of colleagues across the Humanities have a distorted view of multiple choice questions. Anecdote: during the initial debate on the project, a witty colleague in Philosophy suggested that a question on a Wittgenstein exam may include: "In his argument against sense data language, what does Wittgenstein imagine to be in the box? 1) A rabbit; 2) A snake; 3) A beetle; 4) An elephant." This was a joke but it uncovers what seems a widespread false belief that automation in the test implies some sort of corresponding automatic behaviour on the part of the student and thus does not foster creativity or deep understanding.

A different, more understandable worry is that questions posed in automated assessments lack the appropriate contextualisation. In other words, a student may interpret the question to be asking something other than what the designer of the test intended. On an essay style question, this difference in interpretation may be discovered in the course of a fuller answer and the student given full or partial credit for it.

The second problem is more serious than the first, but the response to both is that great care and thought are required in framing the questions. These must rule out, as much as possible, reasonable alternative interpretations. (Some mistakenly think that all alternative interpretations must be ruled out but this is, arguably, not possible for any sentence taken out of context.) It is imperative that the context be clear. The questions are not abstracted from a context: in my modules it is made clear from the beginning that the context is both the primary text and the development of that text in class-discussion. Indeed, the fact that students will be tested on research material introduced during class time is made explicit in the learning outcomes for the module. This, together with help from colleagues who are willing to take the test in advance and discuss possible reasonable interpretations, is part of what is required in designing it and what makes it so time-consuming.

Students themselves are very pleased with the tests: they appreciate their difficulty and their ability to test very precise understanding of arguments. (Some of the better students have left the exam muttering that the exam was 'wicked'. See sample questions from a test on Wittgenstein's Philosophical Investigations and from Philosophy of Mind in Appendix.) Because students are able to receive very high marks, the tests reward the students who work hard and have mastered the material. (In my opinion there would thus be a revolt if these results did not contribute to the final marks.) Students are also pleased with the software's ability to pre-record feedback on each question; e.g., why particular choices would be wrong. Some of the best students have remarked that because many of the questions are embedded with true claims about the subject, they are forced to think very hard about the complete question to decide whether it is correct or not.

One of the most useful features of this type of assessment is that it allows one to award the full range of marks. Results typically range from 18 to the high 80s. Unfortunately, this is still perceived as problematic by those who continue to use the traditional marking range of 50-70. Indeed, as long as the two methods of marking are used in the same department, the final results of students will seem skewed. This is a subject of much wider debate within the Faculty of Humanities and the University as a whole, which has now made explicit in their Learning and Teaching Policy that colleagues in all Faculties are to use the full marking range. This is another way in which the project anticipated what was to become University policy.

In sum, such tests can be very successful in testing for a broad range of abilities; in motivating students to gain a deep understanding of the material, including complex structures of arguments; in allowing fast return of marks and in-depth feedback; and in releasing many hours in mid-term during which the teacher would be marking essays. But the negative side is that the tests require an enormous amount of energy to set up. Clearly, unless one were to teach the same content in subsequent years, it would not be worth the effort required.

Future Plans

Until now, automated questions have formed the minority of the assessed material for my courses. In the future I would like to run a study to replace some of the short answer (paragraph) questions with automated ones to see how they compare.

I would like to continue addressing colleagues' concerns about the usefulness of multiple choice questions and would be happy to lead workshops discussing with colleagues (including those in other Humanities subjects) how they might convert some of their assessment into automated questions. I have arranged with the Unit for the Enhancement of Learning at the University of Kent to organise such a workshop in 2008-9 but would also be happy to visit other Universities.

Finally, I have recently been awarded a small sum of money for mounting some aspects of the Wittgenstein course in a virtual, "avatar" environment. In the first instance this will be for Canterbury students who attend (physical) classes as well as participate in some of the exercises in a virtual environment. But the idea is to look at its feasibility for a course in distance learning. If this proves feasible, then awarding marks for automated assessment becomes very attractive, especially if the course attracts large numbers.

Appendix

Sample Questions

[note: students have no way of knowing the number of correct answers in multiple choice questions; wrong answers receive negative points to discourage guessing]

Wittgenstein

1. In the builder/assistant language game the assistant is trained to bring a certain type of building stone when he hears the builder say the words 'pillar', 'slab', 'block', and 'beam'. The reason why it is correct to say he understands 'slab' in this language game is because (typically) when he hears this command:

  1. he calls up an image of a pillar before his mind
  2. he calls up an image of a slab before his mind
  3. he brings a slab
  4. he calls up an image of a slab, compares it to the various building stones, chooses a slab, and then brings it

2. In the builder/assistant language game the assistant is trained to bring a certain type of building stone when he hears the builder say the words 'pillar', 'slab', 'block', and 'beam'. It would be correct to say the assistant misunderstands 'slab' if he were to:

  1. call an image of a slab before his mind, and then bring a slab
  2. call an image of a slab before his mind and then point to a slab
  3. call an image of a pillar before his mind and bring a slab
  4. call an image of a slab before his mind and do nothing
  5. call no image before his mind but just bring a slab

3. Suppose we define "sepia" as the colour of the standard sepia which is kept in a hermetically sealed container (and which is used to arbitrate disputes about what things are coloured sepia). According to Wittgenstein, it makes no sense to say of this sample:

  1. either that it is sepia or that it is not.
  2. either that it is red or that it is not.
  3. either that it is blue or that it is not.
  4. either that it is the kind of colour you find on old film or that it is not.

4. "There is one thing of which one can say neither that it is one metre long nor that it is not one metre long, and that is the standard metre in Paris." This statement is true when:

  1. the standard metre in Paris is the only instrument used or is the final arbitrator in the determination of a metre-length.
  2. the standard metre in Paris is hermetically sealed and thus protected from losing particles that may one day make it smaller than a metre.
  3. the standard metre in Paris, though it may have been the original sample used in the determination of a metre-length, is no longer so used.

Philosophy of Mind I

3. True or False: Functionalism discards what is problematic about behaviourism insofar as it pays no attention to either the circumstances or the behaviour of someone who is in some mental state like pain or hunger.

4. True or False: A serious problem for both behaviourists and functionalists is that it is impossible to define desire without reference to belief and define belief without reference to desire.

5. Functionalism differs from logical behaviourism insofar as it:

  1. takes mental states to be real, internal states of an organism with causal powers
  2. takes a realist, as opposed to instrumentalist, approach to behavioural dispositions
  3. allows reference to other mental states in giving the characterization of a given mental state
  4. pays no attention to initial conditions
  5. pays no attention to behaviour