Peer Review 2.0

The history

Some years ago I proposed a model for assessing the quality of scientific papers. The main publication on this idea is the following (paper [IJ9] on my publications page; mail me for a copy):

 @Article{ej-jasist03,
   author  = {Mizzaro, S.},
   title   = {Quality Control in Scholarly Publishing: A New Proposal},
   journal = {Journal of the American Society for Information Science and Technology},
   volume  = {54},
   number  = {11},
   pages   = {989--1005},
   year    = {2003}
 }

The idea dates back to 1999, when it was presented at a birds-of-a-feather session at the ECDL'99 conference. Other minor -- but earlier -- publications are [RR5], [IC10], and [IJ6]; see my publications page.

The idea

The basic idea is the following; more details can be found in the above-mentioned publications.

Let us imagine a scholarly journal in which each paper is published immediately after submission, without a refereeing process. Each paper has some scores measuring its quality (accuracy, comprehensibility, novelty, and so on). For the sake of simplicity I use a single score measuring overall quality, but the generalization to multidimensional quality measures is straightforward. This score is initially zero, or some predetermined value, and it is later dynamically updated on the basis of the readers' judgments.

A subscriber to the journal is an author or a reader (or both). Each subscriber has a score too, initially set to zero (or some predetermined value) and later updated on the basis of the subscriber's activity (if the subscriber is both an author and a reader, she has two different scores, one as an author and one as a reader). The scores of subscribers are therefore dynamic, and change according to the subscribers' behavior: if an author with a low score publishes a very good paper, i.e., a paper judged very positively by the readers, her score increases; if a reader expresses an inadequate judgment on a paper, her score decreases accordingly. This last point highlights a difference from other similar "democratic" approaches: all readers are created equal, but some readers (those expressing correct judgments) are "more equal" (i.e., more influential) than others. The correctness of judgments is computed automatically by the system (see the publications for more details).
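Just to make the mechanism concrete, here is a minimal sketch of how such a weighted update could work. It is my own simplification, not the algorithm published in the JASIST paper: scores and judgments are assumed to lie in [0, 1], and the update weights are arbitrary.

  # Illustrative sketch only; not the published algorithm.
  # Scores and judgments are assumed to lie in [0, 1].

  def update_paper_score(paper_score, reader_score, judgment, weight=0.1):
      # A judgment counts more when it comes from a reader with a high score.
      influence = weight * reader_score
      return (1 - influence) * paper_score + influence * judgment

  def update_reader_score(reader_score, judgment, paper_score, weight=0.1):
      # A reader is rewarded when her judgment agrees with the paper's score.
      agreement = 1 - abs(judgment - paper_score)
      return (1 - weight) * reader_score + weight * agreement

In the actual proposal the "correct" value a judgment is compared against is computed by the system itself, as discussed in the papers.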

Every object with a score (author, reader, paper) also has a steadiness value, representing how steady the score is: for instance, old papers, i.e., papers that have been read and judged by many readers, will have a high steadiness; new readers and authors will have a low one. Steadiness affects the score update: a low (high) steadiness allows quicker (slower) changes of the corresponding score. Each time a score changes, the corresponding steadiness value increases.
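Again as an illustrative sketch, with the same caveat as above, steadiness can be thought of as a damping factor on the update rate, growing every time the score is updated:

  # Illustrative sketch only: high steadiness means slow score changes.
  def damped_update(score, target, steadiness):
      rate = 1.0 / (1.0 + steadiness)        # low steadiness -> fast change
      new_score = score + rate * (target - score)
      return new_score, steadiness + 1       # every update raises steadiness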

As time goes on, readers read the papers, judgments are expressed, and the corresponding scores and steadiness values vary accordingly. The score of a paper can be used to decide whether or not to read it; the scores of authors and readers are a measure of their research productivity, so they will try to do their best to keep their scores high, hopefully leading to a virtuous circle (publishing good papers and judging the papers they read correctly). A steadiness value is an estimate of how stable, and therefore how reliable, the corresponding score is.

The name

I have been thinking for a long while about how to name this mechanism, and today I finally found the inspiration. I name it Peer Review 2.0, after the "Web 2.0" term -- or hype. What exactly Web 2.0 is, is not clear, at least to me. Anyway, it is clear that Web 2.0 is a move towards a more "social" Web (everybody writes, everybody comments, everybody evaluates, etc.); see, e.g., these two links. Peer Review 2.0 is more social than classical peer review, since in Peer Review 2.0 every reader is a referee, although good referees are more influential than bad ones.

Peer Assessment 2.0

More recently, I realized that peer assessment (students assessing other students' answers to exercises) seems an ideal environment in which to apply this idea, the reason being that the "Truth" (the correct evaluation of an exercise) is available, as it can simply be provided by the teacher. Also, the system would be useful because it could make it possible to assign students exercises that would not otherwise be assigned, because their assessment and marking would take too long. This could be the case, for instance, of an Introduction to Programming class with 100 students at the first-year university level: if the teacher asks the students to write a 100-line program to do something, the assessment and marking would probably take at least half an hour per program, for a total of 50 hours, a week of full-time work. This is unfeasible; what is feasible is to have the students peer-assess each other's answers, with the teacher (and/or some tutors) assessing only some of the answers, sampled according to the most effective strategy.
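The arithmetic, in a hypothetical sketch (the numbers are the ones above; the random sampling is just one possible strategy, not necessarily the most effective one):

  # Back-of-the-envelope workload estimate.
  import random

  students = 100
  minutes_per_program = 30
  print(students * minutes_per_program / 60)      # 50.0 hours if the teacher grades everything

  # With peer assessment the teacher grades only a sample of the answers.
  teacher_share = 1 / 3
  sample = random.sample(range(students), int(teacher_share * students))
  print(len(sample) * minutes_per_program / 60)   # 16.5 hours for a one-third sample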

I'm working on this idea with a colleague of mine, Paolo Coppola, and a student of mine, Alessandra Girardi; we have developed a peer assessment system and are currently improving and evaluating it. Preliminary results hint that the teacher's workload is indeed reduced: if the teacher evaluates only about one third (30%) of the answers, a good evaluation of the students is obtained (with a 0.9 correlation with the final "correct" evaluation).
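For what it is worth, the 0.9 figure is just a correlation between the marks produced by the peer assessment mechanism and the teacher's own marks; a check of that kind, on toy data (not the actual data or code of our system), would look like this:

  # Hypothetical check on toy data; statistics.correlation needs Python 3.10+.
  from statistics import correlation

  peer_marks    = [7.5, 6.0, 9.0, 4.5, 8.0]
  teacher_marks = [8.0, 6.5, 9.0, 5.0, 7.5]
  print(round(correlation(peer_marks, teacher_marks), 2))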

