DUplicate text DEtection, or DUDE
What is DUDE?
DUDE applies computer technology used
by Web search engines  to the task of detecting matching text in sets
of technical papers.
DUDE can help reviewers to identify papers
most relevant to the paper under review.
DUDE can also help program committees of research conferences
to check for the following
- A submitted paper should not overlap too much
with previously published work.
- A submitted paper should not overlap too much
with other papers still under consideration by conferences
(including accidental duplicate submissions
to the same conference and deliberately similar submissions
to multiple conferences).
- A final submission should fairly closely match
the original submission used for review.
DUDE can help enforce the new 30% policy for IEEE and ACM
Transactions, which requires at least 30% new material compared
to earlier conference publications.
DUDE does not make moral judgements about how much matching text
is "too much overlap" or a "fairly close match", but rather
sorts matching papers to highlight the most similar pairs.
It generates reports for conference committees,
pointing out and annotating any similarities that exist.
Conference committees, in accord with their conference policies,
make all decisions.
Background reading on text-matching technology:
- S. Brin, J. Davis, and H. Garcia-Molina,
"Copy Detection Mechanisms for Digital Documents,"
Proc. ACM SIGMOD '95, pp. 398-409.
- S. Schleimer, D. S. Wilkerson, and A. Aiken,
"Winnowing: Local Algorithms for Document Fingerprinting,"
Proc. ACM SIGMOD '03, pp. 76-85.
- L. Guterman,
"Copycat Articles Seem Rife in Science Journals, a Digital Sleuth Finds,"
Chronicle of Higher Education, January 24, 2008.
DUDE consists of two parts:
- The web server, currently supplied by Prof. Igor
Markov's group at the University of Michigan. This server stores files
containing hash codes computed from papers. Hash codes are distributed to
users, but confidential conference submissions never reach
the server under normal operation.
- The client, run on a conference program committee machine.
This program computes hash codes from submitted papers, then consults
the DUDE server to find other papers containing the same phrases.
DUDE goes to considerable lengths to preserve the confidentiality of
submitted but unpublished papers. These papers never leave the
conference machine (where they already exist in human readable form,
for reviewing). DUDE computes from each paper a "hash digest",
which is a re-ordered set of hash values computed from a portion of the
phrases in the document. From such a digest, it is not possible
to re-create the original exactly (since some of the original input
is not used) and very difficult to reconstruct even a portion
of the original document. See the section
"Preserving Confidentiality of Submitted Papers" below.
- DUDE maintains a large collection of papers published by conferences
and journals sponsored by ACM SIGDA and IEEE CaS (C-EDA).
Provisional: ACM and IEEE shall
provide DUDE with regular updates in batch mode (one batch per
conference or per year), preferably both in PDF and ASCII. For
papers not available in ASCII, DUDE may handle the conversion from PDF
to text (resorting to OCR in rare cases).
Published papers are not considered confidential and never expire in
DUDE's repository. DUDE prepares "hash digests" (explained below)
from those papers and distributes them to interested conference PCs to
help find conference submissions that reuse text from published
papers. In this case, hashing serves only efficiency, not
privacy, since the originals are public.
- The DUDE server does not collect or maintain any confidential files,
such as original conference submissions or camera copies.
Furthermore, the programs and protocols are designed so that
confidential papers are never sent over the internet. Only hash
digests are transferred, and these are designed so that exact reconstruction
is impossible and even approximate reconstruction is impractical.
- DUDE makes no decisions on the implications of detected
matches, for administrative and technical reasons.
Administratively, only the conference organizers know what matching is
OK - for example, a tutorial or invited talk may include large swaths
of previously published material. Technically, DUDE compares only
hash values, not the original text. Therefore false matches are
possible, though designed to be unlikely. For these reasons, the
duplicate detection checks shall always be performed by program
committees by running the DUDE client. The output of this program
should be used in an advisory capacity only, and the files in question
shall always be checked manually for similarities. Indeed, rare cases
of mistaken identity cannot be ruled out.
In the case of a match to a published paper, DUDE can (potentially - it
does not do this yet) compare the text of the submission to the
original, provided the program committee has access to the IEEE and ACM
on-line libraries. For still-confidential
submissions, comparison to the original text is not possible. In
this case DUDE can only report the conference name and submission
code. If suspicions arise, a program committee or a journal editor
will have to
directly contact the program committee to which the other potential
duplicate was submitted. Again, the possibility of duplication should
be evaluated by people based on the actual submissions. Therefore, the
DUDE authors and maintainers do not need to be involved in such
discussions, but would appreciate general statistics about duplicates detected.
DUDE maintains "hash digests" for confidential files, such as
original conference submissions or camera copies. Each hash digest
includes the conference name and
the submission code of the original paper, which allows the PC of the
respective conference to look up the paper if a duplicate submission is
suspected.
A hash digest consists of a sorted sequence of 32-bit hash
codes, computed from a fraction of the original text of the paper.
This allows DUDE to distribute
hash digests of confidential papers to participating program committees
and journal editors so as to aid in the detection of duplicate submissions
by these committees or journals.
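To make the digest format concrete, here is a minimal sketch of how such
a digest could be computed. The 6-word phrase length and 32-bit hash width
come from this document; the tokenizer, the choice of hash function
(Python's zlib.crc32), and the rule that keeps roughly one phrase in ten
are illustrative assumptions, not DUDE's actual algorithm.

    # Minimal sketch of hash-digest construction (illustrative, not DUDE's
    # actual algorithm): hash overlapping 6-word phrases to 32-bit codes,
    # keep only a fraction of them, and sort the result.
    import re
    import zlib

    PHRASE_LEN = 6   # six-word phrases, as described above
    KEEP_MOD = 10    # keep roughly one phrase in ten (assumed selection rule)

    def hash_digest(text):
        """Return a sorted list of 32-bit hash codes computed from a
        fraction of the overlapping 6-word phrases in `text`."""
        words = re.findall(r"[a-z0-9]+", text.lower())
        codes = set()
        for i in range(len(words) - PHRASE_LEN + 1):
            phrase = " ".join(words[i:i + PHRASE_LEN])
            h = zlib.crc32(phrase.encode())   # unsigned 32-bit hash code
            if h % KEEP_MOD == 0:             # select only a fraction
                codes.add(h)
        return sorted(codes)                  # sorting discards phrase order

Because only a fraction of the phrases contribute and the result is sorted,
the digest preserves neither the full text nor the order of the phrases it
covers, which is the basis of the confidentiality argument below.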
Hash digests of confidential papers maintained by DUDE expire when
the conference begins, or can be superseded by newer hash digests. For
example, an entire batch of hash digests created for submissions to a
particular conference will be superseded by a batch of hash digests
created for camera copies of papers accepted to this conference
(rejected papers are removed and can no longer show up as
matches). Further, hash digests for published papers supersede
hash digests for camera copies.
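The expiration and supersession rules above can be summarized in a small
sketch. The registry layout, stage names and method names below are
assumptions made for illustration; they do not describe DUDE's actual
storage format.

    # Sketch of digest bookkeeping (illustrative only): digests are keyed by
    # (conference, submission code); camera copies supersede submissions and
    # published papers supersede both; confidential digests expire when the
    # conference begins, while published ones never expire.
    from datetime import date

    STAGE_RANK = {"submission": 0, "camera": 1, "published": 2}

    class DigestRegistry:
        def __init__(self):
            self.entries = {}   # (conference, code) -> record

        def add(self, conference, code, stage, hashes, conference_start):
            old = self.entries.get((conference, code))
            # A newer stage supersedes an older one; never downgrade.
            if old is None or STAGE_RANK[stage] >= STAGE_RANK[old["stage"]]:
                self.entries[(conference, code)] = {"stage": stage,
                                                    "hashes": set(hashes),
                                                    "start": conference_start}

        def active(self, today):
            """Digests still usable for matching on a given day."""
            return {key: e for key, e in self.entries.items()
                    if e["stage"] == "published" or today < e["start"]}

    # Hypothetical usage: register a submission digest for a made-up event.
    registry = DigestRegistry()
    registry.add("SOMECONF", "paper-042", "submission", [123456789], date(2010, 6, 1))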
Hash digests for confidential files are computed by program
committees (for original submissions) or by publishers (for camera
copies) and sent to DUDE for further distribution.
The software and hash digests provided by DUDE to program committees
and journal editors allow them to
- find [erroneous] duplicate submissions to the same conference,
so as not to send almost-identical papers for review;
- find submissions that reuse a large amount of text from published papers;
- find submissions that are similar to other recent submissions or camera copies;
- find camera copies that are very different from the original
submission used for review (a sketch of such a digest comparison
appears after this list).
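As a sketch of the comparison step, assume digests are sets of 32-bit hash
codes as in the earlier sketch. The overlap measure below (the fraction of
a submission's hash codes that also appear in another digest) is an
illustrative choice, not DUDE's actual scoring rule; DUDE only sorts
candidate pairs, and people make the decisions.

    # Sketch of ranking candidate matches by digest overlap (the scoring
    # rule is an illustrative assumption, not DUDE's actual method).
    def overlap(submission, other):
        """Fraction of the submission's hash codes that also appear in `other`."""
        submission, other = set(submission), set(other)
        return len(submission & other) / len(submission) if submission else 0.0

    def rank_matches(submission, repository):
        """Given a mapping of paper IDs to digests, return (paper_id, score)
        pairs sorted with the most similar papers first."""
        scores = [(pid, overlap(submission, d)) for pid, d in repository.items()]
        return sorted(scores, key=lambda item: item[1], reverse=True)

The same overlap fraction, computed against the digest of an earlier
conference version, could also support a rough check of the 30%-new-material
rule for journal extensions, though the final judgment again rests with the
editors.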
An additional service would improve the quality of reviews by
automatically supplying reviewers with references to published papers
that look similar to a given submission. These can point to missing
references, help find qualified reviewers, and can be useful in
determining the novelty of the work under review.
ACM and IEEE Transactions using DUDE will be able to identify
conference submissions from which a given journal submission was
derived, so as to check the 30% rule.
The research community may benefit from the automatic literature
search provided by DUDE based on the hash digest technology.
Preserving Confidentiality of Submitted Papers
Only the hash files are transferred over the net.
If someone wanted to gain unauthorized access to confidential
files, they would need to intercept the hash files and
reconstruct a paper from the hash values.
The files are kept on a password protected server.
This is the first line of defense. In case the server is compromised
or the files are intercepted while being transmitted to/from the server,
the attacker can get the hash values. Can the paper be reconstructed from these
values? It is not possible to reproduce the paper exactly, since only
a fraction of it was used to compute the hash values.
Therefore many similar papers (any that differ only in the ignored parts)
will give exactly the same sets of hash values. There is no way to tell
which of these many possible papers was the original.
But can an attacker learn much useful information about the paper?
Even this is difficult.
- First, the attacker would need to invert the hash function to get
phrases from the article. This is hard for two reasons - practical
and theoretical. From a practical viewpoint,
it is hard (computationally difficult) to find a
phrase that hashes to a given value. From a theoretical viewpoint,
the hash and phrase size are chosen to make this inversion inherently
ambiguous. We hash 6-word phrases to 32-bit hashes. Even assuming
a limited vocabulary of only 1000 words, there are 10^18 possible
6-word phrases. These hash into only about 4x10^9 possible 32-bit values,
so, on average, more than 2x10^8 distinct phrases can hash into each value
(the arithmetic is checked in the sketch after this list).
Some may be ruled unlikely by grammar, but the inversion is still very ambiguous.
- Even if the attacker can somehow invert the hash function, the
digest is sorted by hash value, which is random with respect to the
order of the phrases in the original file. Once again, this means
many papers map to the same digest. For example, a very short paper
might have 1500 words. From this, 150 phrases are chosen, hashed,
and sorted. Therefore 150! (that's right, 150 factorial) arrangements
of these phrases all hash to the exact same digest. Once again
some of them can be ruled out, but many plausible ones remain.
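The back-of-the-envelope numbers in the two points above can be checked
directly; only the arithmetic is reproduced here.

    # Checking the arithmetic behind the two points above.
    from math import factorial

    phrases = 1000 ** 6      # 6-word phrases over a 1000-word vocabulary: 10**18
    hash_values = 2 ** 32    # distinct 32-bit hash codes: about 4.3 x 10^9
    print(phrases // hash_values)   # roughly 2.3 x 10^8 phrases per hash value

    orderings = factorial(150)      # arrangements of 150 selected phrases
    print(len(str(orderings)))      # 263 digits, i.e., roughly 10^262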
A possible extension - Persistent Reviews
Almost everyone who has reviewed for at least two conferences has
probably seen the same paper submitted to conference B after rejection
from conference A. Sometimes the objections from conference A have been
addressed, but in many other cases the paper is word for word identical,
and the authors are simply hoping the committee for B will have a different
view of the relative advantages and drawbacks.
DUDE technology could potentially be extended to allow program committees
and journal editors to automatically retrieve reviews
of rejected papers when a similar paper is submitted
to another conference.
Potential advantages of persistent reviews include
- Higher quality of PC decisions (won't overlook flaws noticed before)
- Authors will be discouraged from resubmitting identical papers
- Reduced burden on the review system through requesting fewer
new reviews (since many submissions are former rejects)
However, a potential negative effect is that overly pessimistic reviewers
will have a greater impact on the review process.
There could be several ways to mitigate this effect.
Program committees (e.g., track chairs)
or associate editors might submit only some reviews to DUDE.
Persistent reviews may be hidden
from fresh reviewers so as not to bias their opinions, but
revealed when the decision is made, e.g., at a program committee
meeting. Additionally, old reviews could be given a smaller weight compared
to fresh reviews.
Because persistent reviews are controversial, DUDE does not implement them at this time.
We recommend that the use of DUDE by a conference
be disclosed to the authors of papers in advance, e.g.,
in the call for papers and/or in the online submission form
that spells out conference policies on duplicate submissions.
However, disclosure is not a requirement.
The main advantage of disclosing the use of DUDE
is to encourage the authors to write more original papers.
Specific algorithms used by DUDE will not be made public,
so as not to encourage attempts at fooling duplicate detection.
To discourage such attempts, DUDE will have fail-safe features
that bring some submissions to the attention of program committees
when they cannot be reliably processed by DUDE.
All source code will be available for inspection
by participating program committees and journal editors.
Information about specific candidate duplications
is intended for program committees, journal editors
and publication managers. It should not
be made public. It should not be
sent to the authors' institutions (except
for cases when program committee members
are from the same institution).
The main goals of the DUDE project are to
- Enforce stated policies of SIGDA and IEEE conferences
and journals in a reasonable, reliable, consistent
and efficient way.
- Encourage more original work and fewer incremental papers.
- Decrease stress on the existing reviewing system
by partially reusing reviews of rejected papers.
The first goal is the main short-term motivator for adoption
by conferences because it would decrease the amount of work
required to find duplicate submissions and be more reliable
than semi-automated ad hoc approaches in use now.
On the other hand, we expect the other two goals
to have a more significant impact on the research community.
We are very interested in discussing possible and perceived
negative impact of DUDE (use our email addresses
at the top of the page).
Frequently Asked Questions
Please see a separate document.
The DUDE project is currently in its experimental stage.
Most of the software has been written and works. We are evaluating it
with several pilot conferences, and hope to have it approved for use
in all SIGDA and C-EDA sponsored conferences within a year.
A user manual is available.
If you have questions, comments or suggestions, please drop us
a note using email (see addresses at the top of the page).
Also, please read the
End-User License Agreement for DUDE.
The conferences and symposia that currently use DUDE include
DAC, ICCAD, DATE, ASPDAC, and ISPD.
Igor Markov and Lou Scheffer