User Manual for
DUDE - DUplicate text DEtection
Hello, Ms./Mrs./Mr. Conference Chair!
Here is how to use the DUDE system to perform
duplication checks on submissions to your conference.
Get the software and the password
You will need:
- A Linux machine with the following utilities: a C++ compiler, pdftotext, perl, wget
- About 100 MB of free disk space
- Internet access
- A password for the server, obtained from Igor or Lou.
Please create a writeable directory for all DUDE work. In this directory, create
- A sub-directory 'Submissions' that contains copies of all papers submitted.
We assume your submissions are in .PDF form.
- A sub-directory 'CameraReady' for the final submissions when they come in
- The DUDE program itself. To get this, use the password to download DUDE.cpp from the DUDE server, then compile it with "g++ -O3 -o DUDE DUDE.cpp".
Now you are ready to begin!
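The setup above can be checked with a short script. This is a minimal sketch and not part of DUDE: the helper names prepare_workdir and missing_tools are hypothetical, and g++ is assumed to be the C++ compiler (as in the compile command above). Only the directory names 'Submissions' and 'CameraReady' and the list of utilities come from this manual.

```python
import shutil
from pathlib import Path

# Utilities listed in the requirements; g++ is assumed as the C++ compiler.
REQUIRED_TOOLS = ["g++", "pdftotext", "perl", "wget"]
SUBDIRS = ["Submissions", "CameraReady"]


def prepare_workdir(root):
    """Create the DUDE working directory and its two sub-directories."""
    root = Path(root)
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the utilities that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling prepare_workdir with a directory of your choice and then printing missing_tools() shows at a glance whether anything still needs to be installed.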
Creating .txt files
DUDE works with .txt files, so we need plain text files containing the contents of each paper. To create them, run the DUDE program and give it the command 'T' (for text). It will ask you for a directory name, attempt to convert each .PDF file in that directory to text, and store the result in a parallel .txt file.
If conversion fails for any files, you will get a message. You can then try to convert with other programs (perhaps the full Adobe Acrobat), or ask the author for a plaintext version, or just plain continue without that paper. In any case, please let Igor and Lou know of any conversion problems.
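If you need to redo the conversion by hand, the sketch below shows one way to do it. It is an assumption about the general approach, not DUDE's actual code: it calls pdftotext (the utility listed in the requirements) once per PDF and writes a .txt file next to it. The function names and the converter parameter are hypothetical.

```python
import subprocess
from pathlib import Path


def run_pdftotext(pdf, txt):
    """Convert one PDF with the pdftotext utility; True on success."""
    try:
        result = subprocess.run(["pdftotext", str(pdf), str(txt)],
                                capture_output=True)
    except OSError:  # pdftotext is not installed or not executable
        return False
    return result.returncode == 0 and Path(txt).exists()


def convert_directory(directory, converter=run_pdftotext):
    """Convert every PDF in `directory`; return the names of PDFs that failed."""
    failed = []
    for pdf in sorted(Path(directory).iterdir()):
        if pdf.suffix.lower() != ".pdf":
            continue
        txt = pdf.with_suffix(".txt")  # parallel file, e.g. 17.PDF -> 17.txt
        if not converter(pdf, txt):
            failed.append(pdf.name)
    return failed
```

The returned list gives the papers that need another converter or a plaintext version from the author.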
In this step, the DUDE program contacts the University of Michigan server and gets copies of hash codes from all papers known to the server. It retrieves this as a compressed tar file, then unpacks it. The unpacked files occupy about 65 megabytes, and will be stored in a directory "HashCodes".
You can also do this manually, using the command
wget --http-user=user --http-passwd=passwd http://sigda.eecs.umich.edu/DUDE/WellKept/HashCodes.tar.gz
and then unpacking the result with
tar xzf HashCodes.tar.gz
Please cut and paste the commands above, because two dashes sometimes cannot be distinguished from one dash.
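After a manual download, the following sketch unpacks the archive and reports the unpacked size, which should be roughly the 65 megabytes mentioned above. It is a hedged illustration and not part of DUDE; the function name unpack_and_measure is hypothetical, and nothing is assumed about the file names inside the archive.

```python
import tarfile
from pathlib import Path


def unpack_and_measure(archive, dest="."):
    """Unpack a .tar.gz archive into `dest`; return the total unpacked bytes."""
    dest = Path(dest).resolve()
    with tarfile.open(str(archive), "r:gz") as tar:
        members = tar.getmembers()
        for member in members:
            # Refuse entries that would land outside the destination.
            target = (dest / member.name).resolve()
            if dest != target and dest not in target.parents:
                raise ValueError("unsafe path in archive: " + member.name)
        # Use the stricter 'data' extraction filter where Python offers it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(str(dest), **kwargs)
        return sum(m.size for m in members if m.isfile())
```

A result far below 65 megabytes suggests an incomplete download.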
Run the DUDE program, if not already running, and select the 'R' option (for report).
This will generate a report listing, for each submission, the other papers that share the most phrases with it.
Also, for each submission that is above the matching threshold (currently 5%), it generates a web page that highlights the identical phrases.
These pages are stored in the 'Submissions' directory, in parallel to the original submissions.
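To make the 5% threshold concrete, here is a minimal sketch of phrase-based overlap scoring. It is not DUDE's algorithm: the phrase length (5 words), the hash (a truncated SHA-1), and the definition of the percentage (the share of a submission's distinct phrase hashes that also occur in the other paper) are all assumptions chosen for illustration.

```python
import hashlib
import re


def phrase_hashes(text, n=5, hash_bytes=3):
    """Hash every run of n consecutive words; return the set of hashes."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    hashes = set()
    for i in range(len(words) - n + 1):
        phrase = " ".join(words[i:i + n])
        digest = hashlib.sha1(phrase.encode("utf-8")).digest()
        hashes.add(digest[:hash_bytes])  # truncation makes the hash ambiguous
    return hashes


def overlap_percent(submission, other, n=5):
    """Percent of the submission's phrase hashes that also occur in `other`."""
    mine = phrase_hashes(submission, n)
    if not mine:
        return 0.0
    theirs = phrase_hashes(other, n)
    return 100.0 * len(mine & theirs) / len(mine)


def flagged(submission, other, threshold=5.0):
    """True if the overlap is above the matching threshold (in percent)."""
    return overlap_percent(submission, other) > threshold
```

Because only sets of truncated hashes are compared, two unrelated phrases can occasionally collide, and word order beyond the phrase length is ignored. A score above the threshold is therefore a prompt to read both papers, not a verdict.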
Reading the annotated web page: (See a sample comparison page)
- Matched text is shown as a link.
Hovering over the link shows the papers that contained the matching phrases.
The number in parentheses after the name shows how many consecutive phrases have matched at that point.
Clicking on the link does nothing for now, but may go to the matching text at a later time.
- When the size of the consecutive matches to the same paper exceeds 50 words (a common limit for 'fair use'), the section is highlighted.
- Since order of phrases is not preserved, and since the hash functions used are (deliberately) ambiguous, matches shown here could be bogus. You MUST check the original source to be sure.
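The 50-word highlighting rule can be sketched as follows. This is an illustration under assumptions, not DUDE's code: it assumes each word of the submission has already been labeled with the paper it matched (or None), and the function name long_runs is hypothetical.

```python
def long_runs(labels, limit=50):
    """Find runs of consecutive words matched to the same paper.

    `labels` holds, for each word of the submission, the name of the paper
    it matched, or None. Returns (paper, start, length) for every run
    longer than `limit` words.
    """
    runs = []
    start = 0
    for i in range(1, len(labels) + 1):
        # A run ends at the end of the list or where the label changes.
        if i == len(labels) or labels[i] != labels[start]:
            length = i - start
            if labels[start] is not None and length > limit:
                runs.append((labels[start], start, length))
            start = i
    return runs
```

Even a run reported by such a check is only a candidate: because the hashes are ambiguous, the original source still has to be consulted.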
If the reports flag too many files, you may want to increase the similarity
threshold to 15% or even 20%. This can be done using the command
setenv THRESHOLD 15.0
before running the DUDE executable in the same session.
This is csh/tcsh syntax; in bash, use "export THRESHOLD=15.0" instead.
If you open another window or start a new login session,
you must run this command again.
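For readers curious how a program picks up such a setting, the sketch below reads THRESHOLD from the environment and falls back to the current default of 5%. How DUDE itself parses the variable is not described in this manual, so the fallback and the error handling are assumptions, and read_threshold is a hypothetical name.

```python
import os


def read_threshold(env=None, default=5.0):
    """Return the similarity threshold in percent from THRESHOLD, if set."""
    env = os.environ if env is None else env
    raw = env.get("THRESHOLD")
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default  # assumption: ignore values that are not numbers
    # Assumption: a percentage outside 0..100 is treated as a mistake.
    return value if 0.0 <= value <= 100.0 else default
```

Since the value lives in the environment of one shell, a new window or login session starts without it, which is why the command has to be repeated there.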
Run the DUDE program, if not already running, and select the 'H' option (for Hash codes).
This will create a hash digest file for all submissions.
To send these manually, do this:
- Examine all the .hdi files, if desired
- Create a tar file of all .hdi files
- Email this tar file to Lou or Igor
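The manual steps above can be scripted. The sketch below collects every .hdi file in a directory into one compressed tar file, ready to attach to an email. This manual does not say where DUDE writes the .hdi files, so pass whichever directory holds them. The function name and the default archive name are hypothetical.

```python
import tarfile
from pathlib import Path


def bundle_digests(directory, archive="digests.tar.gz"):
    """Pack all .hdi files found in `directory` into a gzip-compressed tar."""
    files = sorted(Path(directory).glob("*.hdi"))
    if not files:
        raise FileNotFoundError("no .hdi files in " + str(directory))
    with tarfile.open(str(archive), "w:gz") as tar:
        for path in files:
            # Store bare file names so no local directory layout is revealed.
            tar.add(str(path), arcname=path.name)
    return [path.name for path in files]
```

The returned list of names is a convenient checklist for confirming that only hash digest files went into the archive.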
If you are more trusting, or have examined the source code to be sure we are sending only what we claim, then just run DUDE and type the 'S' option.
If a paper is withdrawn, the system needs to forget about it, so that others will not find it as a possible similar paper. Run DUDE with the 'W' option, then enter the ID of the paper.
It is very important to submit camera-ready copies
to the DUDE database, so that they replace the hash digests
of the original submissions. This removes information
about rejected papers from the database and thus reduces
spurious overlaps with later conferences. While the submission
process can be automated, we currently prefer that you post
an archive (tar.gz or zip) of the PDF files for us to download.
We discourage large email attachments, but are willing
to consider them if all else fails.
Conference organizers may want to compare
each camera-ready copy against the original submission.
Substantially different camera-ready copies may warrant
more careful checks (e.g., if the authors remove
key results and resubmit them to another conference,
which has happened previously).
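One rough way to spot a substantially changed camera-ready copy is to compare the word sequences of the two text files. The sketch below uses Python's difflib for this; it is not a DUDE feature, and the 0.5 cut-off and the function names are arbitrary assumptions.

```python
import difflib


def similarity(original_text, camera_text):
    """Return a 0..1 similarity ratio between two texts, compared by word."""
    a = original_text.split()
    b = camera_text.split()
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def needs_closer_look(original_text, camera_text, cutoff=0.5):
    """True if the camera-ready text differs substantially from the original."""
    return similarity(original_text, camera_text) < cutoff
```

A low ratio only says that the text changed a lot; whether key results were removed still requires reading both versions.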
At this point the accepted papers become public.
Run the DUDE program, if not already running, and select the 'F' option (for Finally holding the Conference).
Frequently Asked Questions
Please see a separate document.
Project Information
The DUDE project is currently in its experimental stage.
Most of the software has been written and works. We are evaluating it
with several pilot conferences, and hope to have it approved for use
in all SIGDA- and CEDA-sponsored conferences within a year.
If you have questions, comments or suggestions, please drop us
a note using email (see addresses at the top of the page).
Igor Markov and Lou Scheffer