{ "info": { "author": "Parker Moore", "author_email": "parkrmoore@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# Near-Duplicate Detection\n\nThis program identifies near-duplicates in a corpus using techniques described\nby Professor William Arms of Cornell University in his lectures to the students\nof INFO 4300, Information Retrieval, Fall 2012.\n\nThis program was written by Parker Moore (pjm336), Fall 2012.\n\n## Install\n\n```\npip install git://github.com/parkr/near-dup-detection.git#egg=NearDuplicatesDetection\n```\n\n## Usage\n\n python ndd.py\n\n## Explanation of Methodology\n\nAll of the logic for the program is built into the Detector class \n(`detector.py`). This class contains the methods and instance variables needed\nto detect near-duplicates, such as the `get_jaccard(file1, file2)` method, the\n`calculate_sketches()` method and the fundamental `create_3grams()` method.\n\nThis program implements the standard procedure for detecting near-duplicates:\n\n1. Generate n-grams (3-grams in this case) for each document. Assign these\n n-grams a unique ID based on a 64-bit hash.\n2. Create 25 sketches for document based on 50 randomly selected\n numbers and some stuff we generated earlier:\n - `p` is the closest prime number to the # of n-grams\n - `a_s` random, in the range [1, p-1]\n - `b_s` random, in the range [0, p-1]\n - `x` is the n-gram ID (the hash generated in step 1)\n - using the equation: `f_s(x) = (a_s*x + b_s) % p`\n - note: this equation is calculated 25 times per document (one time per\n random pair `a_s` and `b_s`), but only the minimum result of\n `f_s(x)` for each of the 25 pairs is saved. Thus, at the end of\n the calculation, each document has 25 `f_min`'s, one for each\n pair of random numbers.\n3. Go through each document and compare it to all the other documents using the\n Jaccard coefficient estimation equation : `J(d1, d2) = m/n`, where:\n - `m` = number of sketch values (must be at the same index in respective\n lists) which are the same between the two documents\n - `n` = number of samples (# of sketches)\n4. Having defined an arbitrary Jaccard coefficient threshold of `0.5`, the\n program prints out the names of the documents whose Jaccard coefficient\n is greater than the threshold previously defined, as well as the corresponding\n Jaccard coefficient.\n\nAs an addendum to the project, the three \"nearest neighbors\" to the first ten\ndocuments is calculated at the end using the same method (and the data from\nbefore).\n\n## License\n\nStandard MIT license applies.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/parkr/near-dup-detection", "keywords": null, "license": "LICENSE.txt", "maintainer": null, "maintainer_email": null, "name": "NearDuplicatesDetection", "package_url": "https://pypi.org/project/NearDuplicatesDetection/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/NearDuplicatesDetection/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/parkr/near-dup-detection" }, "release_url": "https://pypi.org/project/NearDuplicatesDetection/0.2.0/", "requires_dist": null, "requires_python": null, "summary": "Identifies near-duplicates in a corpus", "version": "0.2.0" }, "last_serial": 784786, "releases": { "0.2.0": [ { "comment_text": "", "digests": { "md5": "1d6a6a463cedce7b727a22570ab71b3d", "sha256": "5fffb7e4534a7d1278aab88f9853835ee1dbcdcdef2ee74b87c6792661491547" }, "downloads": -1, "filename": "NearDuplicatesDetection-0.2.0.tar.gz", "has_sig": false, "md5_digest": "1d6a6a463cedce7b727a22570ab71b3d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5935, "upload_time": "2013-03-21T14:38:57", "url": "https://files.pythonhosted.org/packages/03/9e/8d9362d5798e540190f155787cfe9001494bd9a2a38f12de77e7e3b10d61/NearDuplicatesDetection-0.2.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "1d6a6a463cedce7b727a22570ab71b3d", "sha256": "5fffb7e4534a7d1278aab88f9853835ee1dbcdcdef2ee74b87c6792661491547" }, "downloads": -1, "filename": "NearDuplicatesDetection-0.2.0.tar.gz", "has_sig": false, "md5_digest": "1d6a6a463cedce7b727a22570ab71b3d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5935, "upload_time": "2013-03-21T14:38:57", "url": "https://files.pythonhosted.org/packages/03/9e/8d9362d5798e540190f155787cfe9001494bd9a2a38f12de77e7e3b10d61/NearDuplicatesDetection-0.2.0.tar.gz" } ] }