DUplicate text DEtection, or DUDE

A joint project of ACM SIGDA and IEEE CEDA

Contents

  1. What is DUDE?
  2. DUDE's Use Model
  3. Preserving Confidentiality of Submitted Papers
  4. Persistent Reviews
  5. Public Disclosure
  6. Expected Impact
  7. Project Information

               

Contacts

  • IEEE CEDA: Dr. Lou Scheffer, HHMI
  • ACM SIGDA: Prof. Igor Markov, Univ. of Michigan
    imarkov vat umich caught edu
  • Univ. of Michigan: Stephen Hufnagel





What is DUDE?

DUDE applies computer technology used by Web search engines [1] to the task of detecting matching text in sets of technical papers.  DUDE can help reviewers identify the papers most relevant to the paper under review. DUDE can also help program committees of research conferences check submissions for duplicates and for text reused from published papers. In addition, DUDE can help enforce the new 30% policy for IEEE and ACM Transactions, which requires at least 30% new material compared to earlier conference publications. DUDE does not make moral judgements about how much matching text is "too much overlap" or "fairly closely match", but rather sorts matching papers to highlight the most similar pairs.  It generates reports for conference committees, pointing out and annotating any similarities that exist.  Conference committees, in accord with their conference policies, make all decisions.

[1] S. Brin, J. Davis and H. Garcia-Molina, "Copy Detection Mechanisms for Digital Documents," Proc. ACM SIGMOD '95, pp. 398-409.
[2] S. Schleimer, D. S. Wilkerson and A. Aiken, "Winnowing: Local Algorithms for Document Fingerprinting," Proc. ACM SIGMOD '03, pp. 76-85.
[3] L. Guterman, "Copycat Articles Seem Rife in Science Journals, a Digital Sleuth Finds," Chronicle of Higher Education, January 24, 2008.


DUDE's Use Model

DUDE consists of two parts:
  • The web server. This is currently supplied by Prof. Igor Markov's group at the University of Michigan.  This server stores files containing hash codes computed from papers. Hash codes are distributed to users, but confidential conference submissions never reach the server under normal operation.
  • The client, run on a conference program committee machine.  This program computes hash codes from submitted papers, then consults the DUDE server to find other papers containing the same phrases.

DUDE goes to considerable lengths to preserve the confidentiality of submitted but unpublished papers.  These papers never leave the conference machine (where they already exist in human-readable form, for reviewing).  DUDE computes from each paper a "hash digest", which is a re-ordered set of hash values computed from a portion of the phrases in the document.  From such a digest, it is not possible to re-create the original exactly (since some of the original input is not used), and it is very difficult to reconstruct even a portion of the original document.  See the section "Preserving Confidentiality of Submitted Papers" below.
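
The idea of a hash digest can be illustrated with a minimal sketch. This is not DUDE's actual algorithm, which is not public. The function name hash_digest, the five-word phrase length, the CRC-32 hash and the "keep one phrase in four" sampling rule are all assumptions chosen for illustration.

```python
import re
import zlib


def hash_digest(text, phrase_len=5, keep_one_in=4):
    """Return a sorted list of 32-bit hash codes for a sample of the phrases in text.

    Hypothetical sketch: phrase_len, keep_one_in and the use of CRC-32
    are illustrative assumptions, not DUDE's real parameters.
    """
    # Normalize: lowercase and keep only alphanumeric words (an assumption).
    words = re.findall(r"[a-z0-9]+", text.lower())
    codes = set()
    for i in range(len(words) - phrase_len + 1):
        phrase = " ".join(words[i:i + phrase_len])
        # zlib.crc32 returns an unsigned 32-bit integer in Python 3.
        code = zlib.crc32(phrase.encode("utf-8"))
        # Keep only a fraction of the phrases; the rest never enter the digest.
        if code % keep_one_in == 0:
            codes.add(code)
    # Sorting discards the order in which the phrases appeared in the paper.
    return sorted(codes)
```

Because phrases are selected by hash value rather than by position, two papers that share a passage tend to select the same phrases from it, so their digests share codes. The winnowing scheme of [2] is another way to choose which hashes to keep.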
  1. DUDE maintains a large collection of papers published by conferences and journals sponsored by ACM SIGDA and IEEE CaS (C-EDA).
    Provisional: ACM and IEEE shall provide DUDE with regular updates in batch mode (one batch per conference or per year), preferably both in PDF and ASCII.  For papers not available in ASCII, DUDE may handle the conversion from PDF to text (resorting to OCR in rare cases).

    Published papers are not considered confidential and never expire in DUDE's repository.  DUDE prepares "hash digests" (explained below) from those papers and distributes them to interested conference PCs to help find conference submissions that reuse text from published papers.  For published papers, digests are used only for efficiency, not privacy, since the originals are public.

  2. The DUDE server does not collect or maintain any confidential files, such as original conference submissions or camera copies.  Furthermore, the programs and protocols are designed so that confidential papers are never sent over the Internet.  Only hash digests are transferred, and these are designed so that exact reconstruction is impossible and even approximate reconstruction is impractically difficult.

  3. DUDE makes no decisions on the implications of detected matches, for administrative and technical reasons.  Administratively, only the conference organizers know what matching is acceptable: for example, a tutorial or invited talk may include large swaths of previously published material.  Technically, DUDE compares only hash values, not the original text.  Therefore false matches are possible, though the design makes them unlikely.  For these reasons, the duplicate detection checks shall always be performed by program committees by running the DUDE client. The output of this program should be used in an advisory capacity only, and the files in question shall always be checked manually for similarities. Indeed, rare cases of mistaken identity are possible.

    In the case of a match to a published paper, DUDE can (potentially - it does not do this yet) compare the text of the submission to the original, provided the program committee has access to the IEEE and ACM on-line libraries.  For matches to still-confidential submissions, comparison to the original text is not possible.  In this case DUDE can only report the conference name and submission code.  If suspicions arise, a program committee or a journal editor will have to directly contact the program committee where the other potential duplicate belongs.  Again, the possibility of duplication should be evaluated by people based on the actual submissions. Therefore, the DUDE authors and maintainers do not need to be involved in such discussions, but would appreciate general statistics about duplicate submissions detected.

  4. DUDE maintains "hash digests" for confidential files, such as original conference submissions or camera copies. Each hash digest includes the conference name and the submission code of the original paper, which allows the PC of the respective conference to look up the paper if a duplicate submission is suspected.

    A hash digest consists of a sorted sequence of 32-bit hash codes, computed from a fraction of the text reconstructed from the original paper. This allows DUDE to distribute hash digests of confidential papers to participating program committees and journal editors so as to aid in the detection of duplicate submissions by these committees or journals.

  5. Hash digests of confidential papers maintained by DUDE expire when the conference begins, or can be superseded by newer hash digests. For example, an entire batch of hash digests created for submissions to a particular conference will be superseded by a batch of hash digests created for camera copies of papers accepted to this conference (rejected papers are removed, and can no longer show up as matches).  Further, hash digests for published papers supersede hash digests for camera copies.

  6. Hash digests for confidential files are computed by program committees (for original submissions) or by publishers (for camera copies) and sent to DUDE for further distribution.

  7. The software and hash digests provided by DUDE to program committees and journal editors allow them to
    • find [erroneous] duplicate submissions to the same conference, so as not to send almost-identical papers for review
    • find submissions that reuse a large amount of text from published papers
    • find submissions that are similar to other recent submissions or camera copies
    • find camera copies that are very different from the original submissions

    An additional service would improve the quality of reviews by automatically supplying reviewers with references to published papers that look similar to a given submission. These can point to missing references, help find qualified reviewers, and can be useful in determining the novelty of the work under review.

  8. ACM and IEEE Transactions using DUDE will be able to identify conference submissions from which a given journal submission was derived, so as to check the 30% rule.

  9. The research community may benefit from the automatic literature search provided by DUDE based on the hash digest technology.
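
As a rough illustration of how a client might use the digests it receives, the sketch below counts the hash codes shared by two sorted digests and ranks repository entries so that the most similar come first. The function names shared_codes and rank_matches, the min_shared threshold and the layout of the repository dictionary are hypothetical; DUDE's real matching and reporting may differ.

```python
def shared_codes(digest_a, digest_b):
    """Count hash codes common to two sorted digests with a linear merge."""
    i = j = shared = 0
    while i < len(digest_a) and j < len(digest_b):
        if digest_a[i] == digest_b[j]:
            shared += 1
            i += 1
            j += 1
        elif digest_a[i] < digest_b[j]:
            i += 1
        else:
            j += 1
    return shared


def rank_matches(submission_digest, repository, min_shared=1):
    """Return (label, shared, fraction) tuples, most similar first.

    repository is assumed to map a label, such as a
    (conference name, submission code) pair, to a sorted digest
    without repeated codes.
    """
    results = []
    for label, digest in repository.items():
        shared = shared_codes(submission_digest, digest)
        if shared >= min_shared:
            # Fraction of the submission's sampled phrases found in the other paper.
            fraction = shared / max(len(submission_digest), 1)
            results.append((label, shared, fraction))
    results.sort(key=lambda item: item[1], reverse=True)
    return results
```

Keeping digests sorted is what makes the linear merge possible. The shared fraction is only a prompt for manual checking: hash values can collide, and only a sample of each paper is hashed, so people must still compare the actual papers.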

Preserving Confidentiality of Submitted Papers

Only the hash files are transferred over the net.  If someone wanted to gain unauthorized access to confidential files, they would need to intercept the hash files and reconstruct a paper from the hash values.

The files are kept on a password-protected server.  This is the first line of defense. If the server is compromised, or the files are intercepted while being transmitted to or from the server, the attacker can get the hash values. Can the paper be reconstructed from these values? It is not possible to reproduce the paper exactly, since only a fraction of it was used to compute the hash values. Therefore many similar papers (any that differ only in the ignored parts) will give exactly the same sets of hash values. There is no way to tell which of these many possible papers was the original.

But can an attacker learn much that is useful about the paper? Even this is difficult.
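
The claim that many papers share one digest can be tried out on a toy example. The sketch below assumes a hypothetical sampling scheme (three-word phrases, CRC-32, one phrase in four kept), not DUDE's own, and searches for a one-word change that leaves the digest unchanged. The names hash_digest and find_indistinguishable_variant and the sample text are invented for illustration.

```python
import re
import zlib


def hash_digest(text, phrase_len=3, keep_one_in=4):
    """Sorted 32-bit hash codes for a sampled subset of the word phrases in text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    codes = set()
    for i in range(len(words) - phrase_len + 1):
        code = zlib.crc32(" ".join(words[i:i + phrase_len]).encode("utf-8"))
        if code % keep_one_in == 0:  # phrases failing this test are ignored
            codes.add(code)
    return sorted(codes)


def find_indistinguishable_variant(text, substitutes):
    """Search for a one-word change that leaves the digest unchanged.

    Returns the altered text, or None if no such change is found
    among the given substitute words.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    original = hash_digest(" ".join(words))
    for position in range(len(words)):
        for substitute in substitutes:
            if substitute == words[position]:
                continue
            altered = words[:position] + [substitute] + words[position + 1:]
            candidate = " ".join(altered)
            if hash_digest(candidate) == original:
                return candidate
    return None


paper = ("we propose a fast placement algorithm for large mixed size circuits "
         "and compare it with recent academic placers on public benchmarks")
variant = find_indistinguishable_variant(paper, ["slow", "novel", "simple", "exact"])
print(variant)
```

Any variant found this way differs from the original text yet produces an identical digest, so someone holding only the digest cannot tell the two apart.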

A possible extension - Persistent Reviews

Almost everyone who has reviewed for at least two conferences has probably seen the same paper submitted to conference B after rejection from conference A. Sometimes the objections from conference A have been addressed, but in many other cases the paper is word for word identical, and the authors are simply hoping the committee for B will have a different view of the relative advantages and drawbacks.

DUDE technology could potentially be extended to allow program committees and journal editors to automatically retrieve reviews of rejected papers when a similar paper is submitted to another conference.

Potential advantages of persistent reviews include

However, a potential negative effect is that overly pessimistic reviewers will have a greater impact on the review process. There could be several ways to mitigate this effect. Program committees (e.g., track chairs) or associate editors might submit only some reviews to DUDE. Persistent reviews may be hidden from fresh reviewers so as not to bias their opinions, but revealed when the decision is made, e.g., at a program committee meeting. Additionally, old reviews could be given a smaller weight compared to fresh reviews.

Because persistent reviews are controversial, DUDE does not implement persistent reviews at this time.


Public Disclosure

We recommend that the use of DUDE by a conference be disclosed to the authors of papers in advance, e.g., in the call for papers and/or in the online submission form that spells out conference policies on duplicate submissions. However, disclosure is not a requirement. The main advantage of disclosing the use of DUDE is to encourage the authors to write more original papers.

Specific algorithms used by DUDE will not be made public, so as not to encourage attempts at fooling duplicate detection. To discourage such attempts, DUDE will have fail-safe features that bring some submissions to the attention of program committees when they cannot be reliably processed by DUDE.

All source code will be available for inspection by participating program committees and journal editors.

Information about specific candidate duplications is intended for program committees, journal editors and publication managers. It should not be made public. It should not be sent to the authors' institutions (except for cases when program committee members are from the same institution).


Expected Impact

The main goals of the DUDE project are to
  1. Enforce stated policies of SIGDA and IEEE conferences and journals in a reasonable, reliable, consistent and efficient way.
  2. Encourage more original work and fewer incremental papers.
  3. Decrease stress on the existing reviewing system by partially reusing reviews of rejected papers.
The first goal is the main short-term motivator for adoption by conferences because it would decrease the amount of work required to find duplicate submissions and be more reliable than the semi-automated ad hoc approaches in use now. On the other hand, we expect the other two goals to have a more significant impact on the research community. We are very interested in discussing possible and perceived negative impacts of DUDE (use our email addresses at the top of the page).

Frequently Asked Questions

Please see a separate document.

Project Information

The DUDE project is currently in its experimental stage. Most of the software has been written and works. We are evaluating it with several pilot conferences, and hope to have it approved for use in all SIGDA and C-EDA sponsored conferences within a year. A user manual is available. If you have questions, comments or suggestions, please drop us a note by email (see addresses at the top of the page). Also, please read the End-User License Agreement for DUDE. The conferences and symposia that currently use DUDE include DAC, ICCAD, DATE, ASPDAC and ISPD.


Igor Markov and Lou Scheffer