User Manual for DUDE - DUplicate text DEtection

A joint project of ACM SIGDA and IEEE CEDA

Hello, Ms./Mrs./Mr. Conference Chair! Here is how to use the DUDE system to perform duplication checks on submissions to your conference.

Get the software and the password
Create plaintext versions of your PDF files
Get Hash Codes from DUDE server
Generate Reports for your reviewers
Create hash codes from your own submissions
A paper is withdrawn
Camera Ready papers come in
At long last, you hold the conference

Contacts

IEEE CEDA: Dr. Lou Scheffer, Cadence
lou bat cadence bought com
ACM SIGDA: Prof. Igor Markov, University of Michigan
imarkov vat umich caught edu
University of Michigan: Stephen Hufnagel
shuf cat umich got edu

Get the software and the password

You will need:

A Linux machine with the following utilities: C++ compiler, pdftotext, perl, wget
About 100 MB of free disk space
Internet access
A password for the server, obtained from Igor or Lou.

Please create a writeable directory for all DUDE work. In this directory, set up the following:

A sub-directory 'Submissions' that contains copies of all papers submitted. We assume your submissions are in .PDF form.
A sub-directory 'CameraReady' for the final submissions when they come in
The DUDE program itself. To get it, use the password to download DUDE.cpp from the DUDE server, then compile it with "g++ -O3 -o DUDE DUDE.cpp"
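
The setup above can be scripted roughly as follows. This is a sketch, not part of DUDE: the helper names check_tools and make_dude_dirs are made up for illustration, and it assumes a POSIX shell with g++ as the C++ compiler.

```shell
# Sketch only: helper names are hypothetical, not part of DUDE.

# Print each required utility that is missing from PATH; fail if any is.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Create the two sub-directories this manual asks for inside a work directory.
make_dude_dirs() {
    mkdir -p "$1/Submissions" "$1/CameraReady"
}
```

For example (the directory name dude-work is only a placeholder):

  check_tools g++ pdftotext perl wget && make_dude_dirs "$HOME/dude-work"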

Now you are ready to begin!

Creating .txt files

DUDE works with .txt files, so we need plain text files containing the contents of each paper. To create them, run the DUDE program and give it the command 'T' (for text). It will ask you for a directory name, then attempt to convert each .PDF file in that directory to text and store the result in a parallel .txt file.

If conversion fails for any file, you will get a message. You can then try to convert it with another program (perhaps the full Adobe Acrobat), ask the author for a plaintext version, or simply continue without that paper. In any case, please let Igor and Lou know of any conversion problems.
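
For a rough idea of what the 'T' step amounts to, or as a fallback if you need to redo the conversion by hand, the loop below calls pdftotext on every PDF in a directory and lists the files that failed. It is a sketch under assumptions: the function name convert_pdfs and the PDFTOTEXT variable are hypothetical, DUDE's own conversion may use different pdftotext options, and only files ending in .pdf or .PDF are considered.

```shell
# Sketch only: not DUDE's code. Converts DIR/*.pdf and DIR/*.PDF to a .txt
# file next to each PDF and prints the name of every file that failed.
# Set PDFTOTEXT to try a different converter that takes "input output".
convert_pdfs() {
    dir=$1
    failed=0
    for pdf in "$dir"/*.pdf "$dir"/*.PDF; do
        [ -e "$pdf" ] || continue          # skip glob patterns that matched nothing
        txt="${pdf%.*}.txt"
        if ! "${PDFTOTEXT:-pdftotext}" "$pdf" "$txt"; then
            echo "conversion failed: $pdf"
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]                    # non-zero status if anything failed
}
```

For example:

  convert_pdfs Submissions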

Getting the Hash Codes

In this step, the DUDE program contacts the University of Michigan server and gets copies of hash codes from all papers known to the server. It retrieves this as a compressed tar file, then unpacks it. The unpacked files occupy about 65 megabytes, and will be stored in a directory "HashCodes".

You can also do this manually, using the command "wget --http-user=user --http-passwd=passwd http://sigda.eecs.umich.edu/DUDE/WellKept/HashCodes.tar.gz", then unpacking it manually with "tar xzf HashCodes.tar.gz". Please cut and paste the commands above because sometimes two dashes cannot be distinguished from one dash.
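
If you fetch the archive manually, it can help to check it before unpacking, since a wrong password or an interrupted download typically leaves a file that is not a valid archive. The helper below is a sketch, not part of DUDE: the name unpack_hashcodes is hypothetical, and it assumes the archive unpacks into a directory called HashCodes, as described above. Recent wget releases document the password option as --http-password; if your wget rejects --http-passwd, try that spelling.

```shell
# Sketch only: hypothetical helper, not part of DUDE.
# Checks that ARCHIVE is a readable gzip-compressed tar file, unpacks it in
# the current directory, and reports the size of the HashCodes directory.
unpack_hashcodes() {
    archive=${1:-HashCodes.tar.gz}
    if ! tar tzf "$archive" >/dev/null 2>&1; then
        echo "not a valid tar.gz (wrong password or partial download?): $archive" >&2
        return 1
    fi
    tar xzf "$archive" || return 1
    if [ ! -d HashCodes ]; then
        echo "archive did not contain a HashCodes directory" >&2
        return 1
    fi
    du -sh HashCodes
}
```

For example, after the wget command above has finished:

  unpack_hashcodes HashCodes.tar.gz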

Generating Reports

Run the DUDE program, if not already running, and select the 'R' option (for report). This will generate a report that lists, for each submission, the other papers that share the most phrases with it. Also, for each submission that is above the matching threshold (currently 5%), it generates a web page that highlights the identical phrases. These pages are stored in the 'Submissions' directory, in parallel to the original submissions.

Reading the annotated web page: (See a sample comparison page)

Matched text is shown as a link. Hovering over the link shows the papers that contained the matching phrases. The number in parentheses after the name shows how many consecutive phrases have matched at that point. Clicking on the link does nothing for now, but may go to the matching text at a later time.
When the size of the consecutive matches to the same paper exceeds 50 words (a common limit for 'fair use'), the section is highlighted.
Since order of phrases is not preserved, and since the hash functions used are (deliberately) ambiguous, matches shown here could be bogus. You MUST check the original source to be sure.
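
To see why such matches can be bogus, here is a toy model of phrase hashing. It is an illustration only and NOT DUDE's actual algorithm or parameters: it assumes a "phrase" is three consecutive words and that each phrase is reduced to a 16-bit number, and the function names are made up. Because many different phrases map to the same small number, and because only the set of numbers is kept, two papers can share hashes without sharing text in the same order, or at all.

```shell
# Illustration only: NOT DUDE's actual algorithm or parameters.
# Assumptions: a "phrase" is 3 consecutive words, and each phrase is
# reduced to a 16-bit number, so different phrases can share a hash.

# Print the sorted set of phrase hashes for one text file.
phrase_hashes() {
    tr -cs '[:alnum:]' '\n' < "$1" | tr '[:upper:]' '[:lower:]' |
    awk 'NF { w[n++] = $0 }
         END { for (i = 0; i + 2 < n; i++) print w[i], w[i+1], w[i+2] }' |
    while read -r phrase; do
        printf '%s' "$phrase" | cksum | awk '{ print $1 % 65536 }'
    done | LC_ALL=C sort -u
}

# Print the percentage of FILE1's phrase hashes that also occur in FILE2.
overlap_percent() {
    a=$(mktemp)
    b=$(mktemp)
    phrase_hashes "$1" > "$a"
    phrase_hashes "$2" > "$b"
    total=$(wc -l < "$a")
    shared=$(LC_ALL=C comm -12 "$a" "$b" | wc -l)
    rm -f "$a" "$b"
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $((100 * shared / total))
    fi
}
```

For example, "overlap_percent paper1.txt paper2.txt" prints a whole-number percentage that could be compared against a threshold. A high number here only says that the hash sets overlap, which is why the original sources must still be checked.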

If the reports flag too many files, you may want to increase the similarity threshold to 15% or even 20%. This can be done using the command

  setenv THRESHOLD 15.0

before running the DUDE executable in the same session (if you open another window or start a new login session, you must run this command again).
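
Note that setenv is csh/tcsh syntax. If your shell is bash or another Bourne-style shell, the equivalent is shown below. THRESHOLD is the variable named above; the ./DUDE path in the comment assumes you start the program from your DUDE work directory.

```shell
# Bourne-style (sh, bash, zsh) equivalent of "setenv THRESHOLD 15.0".
export THRESHOLD=15.0
# Then start DUDE from this same shell, for example:
#   ./DUDE
```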

Creating Hash Codes and sending them to the server

Run the DUDE program, if not already running, and select the 'H' option (for Hash codes). This will create a hash digest file for all submissions. To send these manually, do this:

Examine all the .hdi files, if desired
Create a tar file of all .hdi files
Email this tar file to Lou or Igor
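
The manual steps above can be sketched as a small helper. This is not part of DUDE: the name pack_hdi is hypothetical, it assumes the .hdi files were written somewhere under the directory you pass in (for instance 'Submissions'), and it relies on the -T option of GNU tar.

```shell
# Sketch only: hypothetical helper, not part of DUDE.
# Collects every .hdi file under DIR into a gzip-compressed tar file and
# lists the archive contents so you can see exactly what would be sent.
pack_hdi() {
    dir=$1
    out=$2
    list=$(mktemp)
    ( cd "$dir" && find . -type f -name '*.hdi' | LC_ALL=C sort ) > "$list"
    if [ ! -s "$list" ]; then
        echo "no .hdi files found in $dir" >&2
        rm -f "$list"
        return 1
    fi
    rc=0
    tar czf "$out" -C "$dir" -T "$list" || rc=$?
    rm -f "$list"
    [ "$rc" -eq 0 ] || return "$rc"
    tar tzf "$out"
}
```

For example (the output file name is only a placeholder):

  pack_hdi Submissions "$PWD/hdi-files.tar.gz"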

If you are more trusting, or have examined the source code to be sure we are sending only what we claim, then just run DUDE and type the 'S' option.

Handling a withdrawn paper

If a paper is withdrawn, the system needs to forget about it, so that others will not find it as a possibly similar paper. Run DUDE with the 'W' option, then enter the ID of the paper.

Submitting camera copies to our database

It is very important to submit camera copies to the DUDE database, so that they replace the hash digests of the original submissions. This removes information about rejected papers from the database and thus reduces spurious overlaps with later conferences. While the submission process can be automated, we currently prefer that you post an archive (tar.gz or zip) of the PDF files for us to download. We discourage large email attachments, but are willing to consider them if all else fails.

Conference organizers may want to compare each camera copy against the original submission. Substantially different camera copies may warrant more careful checks (e.g., if the authors remove key results and resubmit to another conference, which has happened previously).

Hold the Conference

At this point the accepted papers become public.

Run the DUDE program, if not already running, and select the 'F' option (for Finally holding the Conference).

Frequently Asked Questions

Please see a separate document.

Project Information

The DUDE project is currently in its experimental stage. Most of the software has been written and works. We are evaluating it with several pilot conferences, and hope to have it approved for use in all SIGDA and CEDA sponsored conferences within a year. If you have questions, comments or suggestions, please drop us a note using email (see addresses at the top of the page).

Igor Markov and Lou Scheffer