DUplicate text DEtection, or DUDE
What is DUDE?
DUDE applies computer technology used
by Web search engines  to the task of detecting matching text in sets
of technical papers.
DUDE can help reviewers to identify papers
most relevant to the paper under review.
DUDE can also help program committees of research conferences
to check for the following
- A submitted paper should not overlap too much
with previously published work.
- A submitted paper should not overlap too much
with other papers still under consideration by conferences
(including accidental duplicate submissions
to the same conference and deliberately similar submissions
to multiple conferences).
- A final submission should fairly closely match
the original submission used for review.
DUDE can help enforce the new 30% policy for IEEE and ACM
Transactions, which requires at least 30% new material compared
to earlier conference publications.
DUDE does not make moral judgements about how much matching text
is "too much overlap" or a "fairly close match", but rather
sorts matching papers to highlight the most similar pairs.
It generates reports for conference committees,
pointing out and annotating any similarities that exist.
Conference committees, in accord with their conference policies,
make all decisions.
Background reading on text-matching technology:
- S. Brin, J. Davis, and H. Garcia-Molina,
"Copy Detection Mechanisms for Digital Documents,"
Proc. ACM SIGMOD '95, pp. 398-409.
- S. Schleimer, D. S. Wilkerson, and A. Aiken,
"Winnowing: Local Algorithms for Document Fingerprinting,"
Proc. ACM SIGMOD '03, pp. 76-85.
- L. Guterman,
"Copycat Articles Seem Rife in Science Journals, a Digital Sleuth Finds,"
Chronicle of Higher Education, January 24, 2008.
DUDE consists of two parts:
- The web server, currently supplied by Prof. Igor
Markov's group at the University of Michigan. This server stores files
containing hash codes computed from papers. Hash codes are distributed to
users, but confidential conference submissions never reach
the server under normal operation.
- The client, run on a conference program committee machine.
This program computes hash codes from submitted papers, then consults
the DUDE server to find other papers containing the same phrases.
DUDE goes to considerable lengths to preserve the confidentiality of
submitted but unpublished papers. These papers never leave the
conference machine (where they already exist in human readable form,
for reviewing). DUDE computes from each paper a "hash digest",
which is a re-ordered set of hash values computed from a portion of the
phrases in the document. From such a digest, it is not possible
to re-create the original exactly (since some of the original input
is not used) and very difficult to reconstruct even a portion
of the original document. See the section
"Preserving Confidentiality of Submitted Papers" below.
- DUDE maintains a large collection of papers published by conferences
and journals sponsored by ACM SIGDA and IEEE CaS (C-EDA).
Provisional: ACM and IEEE shall
provide DUDE with regular updates in batch mode (one batch per
conference or per year), preferably both in PDF and ASCII. For
papers not available in ASCII, DUDE may handle the conversion from PDF
to text (resorting to OCR in rare cases).
Published papers are not considered confidential and never expire in
DUDE's repository. DUDE prepares "hash digests" (explained below)
from those papers and distributes them to interested conference PCs to
help find conference submissions that reuse text from published
papers. In this case, hashing serves only efficiency, not
privacy, since the originals are public.
- The DUDE server does not collect or maintain any confidential files,
such as original conference submissions or camera copies.
Furthermore, the programs and protocols are designed so that
confidential papers are never sent over the internet. Only hash
digests are transferred, and these are designed so that exact reconstruction
is impossible and even approximate reconstruction is impractical.
- DUDE makes no decisions on the implications of detected
matches, for administrative and technical reasons.
Administratively, only the conference organizers know what matching is
OK - for example, a tutorial or invited talk may include large swaths
of previously published material. Technically, DUDE compares only
hash values, not the original text. Therefore false matches are
possible, though designed to be unlikely. For these reasons, the
duplicate detection checks shall always be performed by program
committees by running the DUDE client. The output of this program
should be used in an advisory capacity only, and the files in question
shall always be checked manually for similarities. Indeed, rare cases
of mistaken identity cannot be ruled out.
In the case of a match to a published paper, DUDE can (potentially - it
does not do this yet) compare the text of the submission to the
original, provided the program committee has access to the IEEE and ACM
on-line libraries. For still-confidential
submissions, comparison to the original text is not possible. In
this case DUDE can only report the conference name and submission
code. If suspicions arise, a program committee or a journal editor
will have to
directly contact the program committee to which the other potential
duplicate was submitted. Again, the possibility of duplication should
be evaluated by people based on the actual submissions. Therefore, the
DUDE authors and maintainers do not need to be involved in such
discussions, but would appreciate general statistics about duplicates detected.
DUDE maintains "hash digests" for confidential files, such as
original conference submissions or camera copies. Each hash digest
includes the conference name and
the submission code of the original paper, which allows the PC of the
respective conference to look up the paper if a duplicate submission is
suspected.
A hash digest consists of a sorted sequence of 32-bit hash
codes, computed from a fraction of the original text of the paper.
This allows DUDE to distribute
hash digests of confidential papers to participating program committees
and journal editors so as to aid in the detection of duplicate submissions
by these committees or journals.
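To make the digest format concrete, here is a minimal sketch of how such
a digest could be computed. The 6-word phrase length and 32-bit hash width
come from this document; the tokenizer, the choice of hash function
(Python's zlib.crc32), and the rule that keeps roughly one phrase in ten
are illustrative assumptions, not DUDE's actual algorithm.

    # Minimal sketch of hash-digest construction (illustrative, not DUDE's
    # actual algorithm): hash overlapping 6-word phrases to 32-bit codes,
    # keep only a fraction of them, and sort the result.
    import re
    import zlib

    PHRASE_LEN = 6   # six-word phrases, as described above
    KEEP_MOD = 10    # keep roughly one phrase in ten (assumed selection rule)

    def hash_digest(text):
        """Return a sorted list of 32-bit hash codes computed from a
        fraction of the overlapping 6-word phrases in `text`."""
        words = re.findall(r"[a-z0-9]+", text.lower())
        codes = set()
        for i in range(len(words) - PHRASE_LEN + 1):
            phrase = " ".join(words[i:i + PHRASE_LEN])
            h = zlib.crc32(phrase.encode())   # unsigned 32-bit hash code
            if h % KEEP_MOD == 0:             # select only a fraction
                codes.add(h)
        return sorted(codes)                  # sorting discards phrase order

Because only a fraction of the phrases contribute and the result is sorted,
the digest preserves neither the full text nor the order of the phrases it
covers, which is the basis of the confidentiality argument below.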
Hash digests of confidential papers maintained by DUDE expire when
the conference begins, or can be superseded by newer hash digests. For
example, an entire batch of hash digests created for submissions to a
particular conference will be superseded by a batch of hash digests
created for camera copies of papers accepted to this conference
(rejected papers are removed and can no longer show up as
matches). Further, hash digests for published papers supersede
hash digests for camera copies.
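The expiration and supersession rules above can be summarized in a small
sketch. The registry layout, stage names and method names below are
assumptions made for illustration; they do not describe DUDE's actual
storage format.

    # Sketch of digest bookkeeping (illustrative only): digests are keyed by
    # (conference, submission code); camera copies supersede submissions and
    # published papers supersede both; confidential digests expire when the
    # conference begins, while published ones never expire.
    from datetime import date

    STAGE_RANK = {"submission": 0, "camera": 1, "published": 2}

    class DigestRegistry:
        def __init__(self):
            self.entries = {}   # (conference, code) -> record

        def add(self, conference, code, stage, hashes, conference_start):
            old = self.entries.get((conference, code))
            # A newer stage supersedes an older one; never downgrade.
            if old is None or STAGE_RANK[stage] >= STAGE_RANK[old["stage"]]:
                self.entries[(conference, code)] = {"stage": stage,
                                                    "hashes": set(hashes),
                                                    "start": conference_start}

        def active(self, today):
            """Digests still usable for matching on a given day."""
            return {key: e for key, e in self.entries.items()
                    if e["stage"] == "published" or today < e["start"]}

    # Hypothetical usage: register a submission digest for a made-up event.
    registry = DigestRegistry()
    registry.add("SOMECONF", "paper-042", "submission", [123456789], date(2010, 6, 1))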
Hash digests for confidential files are computed by program
committees (for original submissions) or by publishers (for camera
copies) and sent to DUDE for further distribution.
The software and hash digests provided by DUDE to program committees
and journal editors allow them to
- find [erroneous] duplicate submissions to the same conference,
so as not to send almost-identical papers for review;
- find submissions that reuse a large amount of text from published papers;
- find submissions that are similar to other recent submissions or camera copies;
- find camera copies that are very different from the original
submission used for review (a sketch of such a digest comparison
appears after this list).
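As a sketch of the comparison step, assume digests are sets of 32-bit hash
codes as in the earlier sketch. The overlap measure below (the fraction of
a submission's hash codes that also appear in another digest) is an
illustrative choice, not DUDE's actual scoring rule; DUDE only sorts
candidate pairs, and people make the decisions.

    # Sketch of ranking candidate matches by digest overlap (the scoring
    # rule is an illustrative assumption, not DUDE's actual method).
    def overlap(submission, other):
        """Fraction of the submission's hash codes that also appear in `other`."""
        submission, other = set(submission), set(other)
        return len(submission & other) / len(submission) if submission else 0.0

    def rank_matches(submission, repository):
        """Given a mapping of paper IDs to digests, return (paper_id, score)
        pairs sorted with the most similar papers first."""
        scores = [(pid, overlap(submission, d)) for pid, d in repository.items()]
        return sorted(scores, key=lambda item: item[1], reverse=True)

The same overlap fraction, computed against the digest of an earlier
conference version, could also support a rough check of the 30%-new-material
rule for journal extensions, though the final judgment again rests with the
editors.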
An additional service would improve the quality of reviews by
automatically supplying reviewers with references to published papers
that look similar to a given submission. These can point to missing
references, help find qualified reviewers, and can be useful in
determining the novelty of the work under review.
ACM and IEEE Transactions using DUDE will be able to identify
conference submissions from which a given journal submission was
derived, so as to check the 30% rule.
The research community may benefit from the automatic literature
search provided by DUDE based on the hash digest technology.
Preserving Confidentiality of Submitted Papers
Only the hash files are transferred over the net.
If someone wanted to gain unauthorized access to confidential
files, they would need to intercept the hash files and
reconstruct a paper from the hash values.
The files are kept on a password protected server.
This is the first line of defense. In case the server is compromised
or the files are intercepted while being transmitted to/from the server,
the attacker can get the hash values. Can the paper be reconstructed from these
values? It is not possible to reproduce the paper exactly, since only
a fraction of it was used to compute the hash values.
Therefore many similar papers (any that differ only in the ignored parts)
will give exactly the same sets of hash values. There is no way to tell
which of these many possible papers was the original.
But can an attacker learn much useful information about the paper?
Even this is difficult.
- First, the attacker would need to invert the hash function to get
phrases from the article. This is hard for two reasons - practical
and theoretical. From a practical viewpoint,
it is hard (computationally difficult) to find a
phrase that hashes to a given value. From a theoretical viewpoint,
the hash and phrase size are chosen to make this inversion inherently
ambiguous. We hash 6-word phrases to 32-bit hashes. Even assuming
a limited vocabulary of only 1000 words, there are 10^18 possible
6-word phrases. These hash into only about 4x10^9 possible 32-bit values,
so, on average, more than 2x10^8 distinct phrases can hash into each value
(the arithmetic is checked in the sketch after this list).
Some may be ruled unlikely by grammar, but the inversion is still very ambiguous.
- Even if the attacker can somehow invert the hash function, the
digest is sorted by hash value, which is random with respect to the
order of the phrases in the original file. Once again, this means
many papers map to the same digest. For example, a very short paper
might have 1500 words. From this, 150 phrases are chosen, hashed,
and sorted. Therefore 150! (that's right, 150 factorial) arrangements
of these phrases all hash to the exact same digest. Once again
some of them can be ruled out, but many plausible ones remain.
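The back-of-the-envelope numbers in the two points above can be checked
directly; only the arithmetic is reproduced here.

    # Checking the arithmetic behind the two points above.
    from math import factorial

    phrases = 1000 ** 6      # 6-word phrases over a 1000-word vocabulary: 10**18
    hash_values = 2 ** 32    # distinct 32-bit hash codes: about 4.3 x 10^9
    print(phrases // hash_values)   # roughly 2.3 x 10^8 phrases per hash value

    orderings = factorial(150)      # arrangements of 150 selected phrases
    print(len(str(orderings)))      # 263 digits, i.e., roughly 10^262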
A possible extension - Persistent Reviews
Almost everyone who has reviewed for at least two conferences has
probably seen the same paper submitted to conference B after rejection
from conference A. Sometimes the objections from conference A have been
addressed, but in many other cases the paper is word for word identical,
and the authors are simply hoping the committee for B will have a different
view of the relative advantages and drawbacks.
DUDE technology could potentially be extended to allow program committees
and journal editors to automatically retrieve reviews
of rejected papers when a similar paper is submitted
to another conference.
Potential advantages of persistent reviews include
- Higher quality of PC decisions (won't overlook flaws noticed before)
- Authors will be discouraged from resubmitting identical papers
- Reduced burden on the review system through requesting fewer
new reviews (since many submissions are former rejects)
However, a potential negative effect is that overly pessimistic reviewers
will have a greater impact on the review process.
There could be several ways to mitigate this effect.
Program committees (e.g., track chairs)
or associate editors might submit only some reviews to DUDE.
Persistent reviews may be hidden
from fresh reviewers so as not to bias their opinions, but
revealed when the decision is made, e.g., at a program committee
meeting. Additionally, old reviews could be given a smaller weight compared
to fresh reviews.
Because persistent reviews are controversial, DUDE does not implement them at this time.
We recommend that the use of DUDE by a conference
be disclosed to the authors of papers in advance, e.g.,
in the call for papers and/or in the online submission form
that spells out conference policies on duplicate submissions.
However, disclosure is not a requirement.
The main advantage of disclosing the use of DUDE
is to encourage the authors to write more original papers.
Specific algorithms used by DUDE will not be made public,
so as not to encourage attempts at fooling duplicate detection.
To discourage such attempts, DUDE will have fail-safe features
that bring some submissions to the attention of program committees
when they cannot be reliably processed by DUDE.
All source code will be available for inspection
by participating program committees and journal editors.
Information about specific candidate duplications
is intended for program committees, journal editors
and publication managers. It should not
be made public. It should not be
sent to the authors' institutions (except
for cases when program committee members
are from the same institution).
The main goals of the DUDE project are to
- Enforce stated policies of SIGDA and IEEE conferences
and journals in a reasonable, reliable, consistent
and efficient way.
- Encourage more original work and fewer incremental papers.
- Decrease stress on the existing reviewing system
by partially reusing reviews of rejected papers.
The first goal is the main short-term motivator for adoption
by conferences because it would decrease the amount of work
required to find duplicate submissions and be more reliable
than semi-automated ad hoc approaches in use now.
On the other hand, we expect the other two goals
to have a more significant impact on the research community.
We are very interested in discussing possible and perceived
negative impact of DUDE (use our email addresses
at the top of the page).
Frequently Asked Questions
Please see a separate document.
The DUDE project is currently in its experimental stage.
Most of the software has been written and works. We are evaluating it
with several pilot conferences, and hope to have it approved for use
in all SIGDA and C-EDA sponsored conferences within a year.
A user manual is available.
If you have questions, comments or suggestions, please drop us
a note using email (see addresses at the top of the page).
Also, please read the
End-User License Agreement for DUDE.
The conferences and symposia that currently use DUDE include
DAC, ICCAD, DATE, ASPDAC, and ISPD.
Igor Markov and Lou Scheffer