Plagiarism Detection - YAP
The Problem of Plagiarism
It is an unfortunate fact of life for University lecturers
that the pressures on students lead some of them to copy other
students' assignments or at least to obtain
more assistance from their friends than is appropriate.
Apart from discrediting the use of assignments for assessment,
the copying of assignments also vitiates the
assignments' educational aims.
The typical institutional response is to require that assignments
form only a small part of a student's assessment.
However, such a response is inappropriate
because it results either in trivial assignments or in assignments
which do not adequately repay students in marks
for the effort that they have invested.
Computer-based plagiarism detection can restore much of
the confidence in the usefulness of computer-based assignments
because the same computers that students use to do the
assignments can be used to automate the testing
of the assignments, and then the detection of similarities
among the submissions.
Plagiarism versus Cooperation
Setting aside the issue of group assignments (where a different
set of issues arises to do with the equitable division of labour),
students are encouraged to discuss their work with other
students, e.g. the merits of particular data-structure choices.
With this in mind, and given that students will be attempting
the same task in the same language, low-level similarities are bound
to arise.
However, in the same way that two people, given the same topic and
a lexicon, will nonetheless write very different essays,
the similarities due to discussion tend to dissipate remarkably quickly.
The situation is very different when students have seen others'
work. Even if they hand back the source texts (unlikely anyway), it is
almost inevitable that they will reproduce the original, because if they
had any strong ideas about how to tackle the assignment, they would
not need the crutch!
Plagiarism versus Accidental Similarity
Parker and Hamblen (1982) define software plagiarism
as: a program which has been produced from another
program with a small number of routine changes.
In practice, if one program can be transformed
into another simply through use of editor operations (such as global
substitutions) or by
exploiting synonymous expressions provided by the programming language,
then a prima facie case of plagiarism has been found and should be
examined further.
Note that neither ploy requires a knowledge of the problem being solved
by the source program.
Note also that this is the level at which optimizing compilers operate.
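The global-substitution case can be illustrated with a minimal Python sketch. It checks whether one source text becomes another under a consistent, one-to-one renaming of identifiers. This is only an illustration of the definition above, not the method used by YAP. The keyword list, the token pattern and the function names are all assumptions made for the example.

```python
import re

# Hypothetical keyword list for a small C-like subset (an assumption for this
# example); keywords must match exactly and cannot be renamed.
KEYWORDS = {"if", "else", "while", "for", "return"}

IDENT = re.compile(r"[A-Za-z_]\w*")
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def is_identifier(token):
    """True for a name that a global substitution could rename."""
    return IDENT.fullmatch(token) is not None and token not in KEYWORDS


def same_up_to_renaming(a, b):
    """True if text a turns into text b under a one-to-one identifier renaming."""
    ta, tb = TOKEN.findall(a), TOKEN.findall(b)
    if len(ta) != len(tb):
        return False
    forward, backward = {}, {}
    for x, y in zip(ta, tb):
        if is_identifier(x) and is_identifier(y):
            # Each name must always map to the same name, in both directions.
            if forward.setdefault(x, y) != y or backward.setdefault(y, x) != x:
                return False
        elif x != y:
            # Keywords, numbers and punctuation must be identical.
            return False
    return True


print(same_up_to_renaming("total = total + x;", "sum = sum + item;"))
print(same_up_to_renaming("a = a + b;", "p = q + p;"))
```

A positive result here is only a prima facie case in the sense described above. The check says nothing about synonymous expressions, which need knowledge of the programming language.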
YAP
YAP, which stands for Yet Another Plague, is a series
of systems which follow a common pattern.
YAP3 is the current version.
In the first stage, which is common to all three systems, source texts
are tokenized.
In particular:
- Comments and string-constants are removed.
- Upper-case letters are translated to lower-case.
- A range of synonyms are mapped to a common form.
- If possible, the functions/procedures are expanded in calling order.
- All tokens that are not in the lexicon for the language are removed.
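The tokenizing steps listed above can be sketched in Python roughly as follows. This is a minimal illustration, not the tokenizer shipped with YAP. The lexicon, the synonym table and the regular expressions are hypothetical and cover only a tiny C-like subset. The expansion of functions in calling order is omitted.

```python
import re

# Hypothetical lexicon and synonym table for a tiny C-like subset.
# Neither is taken from the YAP distribution.
LEXICON = {"if", "else", "while", "return", "=", "+", "-", "*", "/",
           "<", ">", "(", ")", "{", "}", ";"}
SYNONYMS = {"for": "while"}  # assumption: both loop keywords share one form

# Strings and comments are matched in one pass so that comment markers
# inside a string (or quotes inside a comment) are not misread.
STRING_OR_COMMENT = re.compile(r'"(?:\\.|[^"\\])*"|/\*.*?\*/|//[^\n]*',
                               re.DOTALL)
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Reduce a source text to a list of lexicon tokens."""
    text = STRING_OR_COMMENT.sub(" ", source)   # drop comments and strings
    text = text.lower()                         # fold upper case to lower case
    tokens = [SYNONYMS.get(t, t) for t in TOKEN.findall(text)]  # map synonyms
    return [t for t in tokens if t in LEXICON]  # drop tokens not in the lexicon


original = "for (i = 0; i < n; i = i + 1) { total = total + i; }"
disguised = "FOR (k = 0; k < max; k = k + 1) { Sum = Sum + k; } /* mine */"
print(tokenize(original) == tokenize(disguised))
```

Because identifiers, comments and letter case are discarded, the two fragments in the usage lines reduce to the same token sequence. The later comparison stage relies on exactly this property.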
The three systems differ primarily in their
respective second, O(n^2), stages.
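As a rough illustration of why such a stage is quadratic, the sketch below compares every pair of token streams, assuming that n is the number of submissions. The similarity measure is Python's difflib.SequenceMatcher, used purely as a stand-in. It is not the comparison algorithm of YAP1, YAP2 or YAP3, and the function name and sample data are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def rank_pairs(token_streams):
    """Score every pair of submissions and list the most similar pairs first.

    token_streams maps a submission name to its list of tokens.
    For n submissions this makes n * (n - 1) / 2 comparisons.
    """
    scores = []
    for (a, ta), (b, tb) in combinations(sorted(token_streams.items()), 2):
        # Stand-in similarity measure (assumption), in the range 0.0 to 1.0.
        ratio = SequenceMatcher(None, ta, tb, autojunk=False).ratio()
        scores.append((ratio, a, b))
    scores.sort(key=lambda item: -item[0])
    return scores


streams = {
    "alice": ["while", "(", "<", ")", "{", "=", "+", ";", "}"],
    "bob":   ["while", "(", "<", ")", "{", "=", "+", ";", "}"],
    "carol": ["if", "(", ">", ")", "return", ";"],
}
for ratio, a, b in rank_pairs(streams):
    print(f"{a} vs {b}: {ratio:.2f}")
```

The highest-scoring pairs are only candidates for closer inspection by a person.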
A paper which appeared in the First Australian Conference on
Computer Science Education, Sydney University, July 3-5, 1996,
compares the YAP approach with two others from the literature, both
of which involve attribute-counting, i.e. comparison of
various statistics from the source texts.
The paper is available at
http://www.pam1.bcs.uwa.edu.au/~michaelw/ftp/doc/yap_vs_acm.ps
Obtaining YAP
A tar file containing tokenizers for Pascal, C and LISP, plus YAP1,
YAP2 and YAP3, is available for ftp at
http://www.pam1.bcs.uwa.edu.au/~michaelw/ftp/src/YAP.distibution.tar.gz