{ "info": { "author": "Israel G. Lugo", "author_email": "israel.lugo@lugosys.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: End Users/Desktop", "License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.2", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Topic :: System :: Filesystems", "Topic :: Utilities" ], "description": "CapiDup CLI\r\n===========\r\n\r\n|license| |PyPi version| |PyPi pyversion| |Codacy Badge|\r\n\r\nQuickly find duplicate files in directories.\r\n\r\nCapiDup recursively crawls through all the files in a list of directories and\r\nidentifies duplicate files. Duplicate files are files with the exact same\r\ncontent, regardless of their name, location or timestamp.\r\n\r\nThis program is designed to be quite fast. It uses a smart algorithm to detect\r\nand group duplicate files using a single pass on each file (that is, CapiDup\r\ndoesn't need to compare each file to every other).\r\n\r\nCapiDup fully supports both Python 2 and Python 3.\r\n\r\nThe capidup-cli package is the command-line utility. It depends on the\r\ncapidup_ library package, which actually implements the functionality.\r\n\r\nUsage\r\n-----\r\n\r\nUsing CapiDup is quite simple:\r\n\r\n.. code:: bash\r\n\r\n $ capidup /media/sdcard/DCIM ~/photos\r\n /media/sdcard/DCIM/DSC_1137.JPG\r\n /home/user/photos/Lake001.jpg\r\n ------------------------------\r\n /media/sdcard/DCIM/DSC_1138.JPG\r\n /home/user/photos/Lake002.jpg\r\n ------------------------------\r\n /home/user/photos/Woman.jpg\r\n /home/user/photos/portraits/Janet.jpg\r\n ------------------------------\r\n\r\n\r\nHere we find out that /media/sdcard/DCIM/DSC\\_1137.JPG is a duplicate of\r\n~/photos/Lake001.jpg, DSC\\_1138.JPG is a duplicate of Lake002.jpg, and\r\n~/photos/Woman.jpg is a duplicate of photos/portraits/Janet.jpg.\r\n\r\nAlgorithm\r\n---------\r\n\r\nCapiDup crawls the directories and gathers the list of files. Then, it takes a\r\n3-step approach:\r\n\r\n1. Files are grouped by size (files of different size must obviously be\r\n different).\r\n\r\n2. Within files of the same size, they are further grouped by the MD5 of the\r\n first few KBytes. Naturally, if the first few KB are different, the files\r\n are different.\r\n\r\n3. Within files with the same initial MD5, they are finally grouped by the MD5\r\n of the entire file. Files with the same MD5 are considered duplicates.\r\n\r\nConsiderations\r\n--------------\r\n\r\nThere is a *very small* possibility of false positives. For any given file,\r\nthere is a 1 in 2\\ :sup:`64` (1:18,446,744,073,709,551,616) chance of some\r\nother random file being detected as its duplicate by mistake.\r\n\r\nThe reason for this is that two different files may have the same hash: this is\r\ncalled a *collision*. CapiDup uses MD5 (which generates 128 bit hashes) for\r\ndetecting whether the files are equal. It cannot distinguish between a case\r\nwhere both files are equal and a case where they just happen to generate the\r\nsame MD5 hash.\r\n\r\nThe odds of this happening by *accident* for two files of the same size, are,\r\nthen, extremely low. For normal home use, dealing with movies, music, source\r\ncode or other documents, this concern can be disregarded.\r\n\r\nSecurity\r\n--------\r\n\r\nThere is one case when care should be taken: when comparing files which might\r\nhave been intentionally manipulated by a malicious attacker.\r\n\r\nWhile the chance of two random files having the same MD5 hash are really very\r\nlow (as stated above), it *is* possible for a malicious attacker to purposely\r\nmanipulate a file to have the same MD5 as another. The MD5 algorithm is not\r\nsecure against intentional disception.\r\n\r\nThis may be of concern for example when comparing things such as program\r\ninstallers. A malicious attacker could infect an installer with malware, and\r\nmanipulate the rest of the file in such a way that it still has the same MD5 as\r\nthe original. Comparing the two files, CapiDup would show them as duplicates\r\nwhen they are not.\r\n\r\nFuture plans\r\n------------\r\n\r\nFuture plans for CapiDup include having a configurable option to use a\r\ndifferent hashing algorithm, such as SHA1 which has a larger hash size of 160\r\nbits, or SHA2 which allows hashes up to 512 bits and has no publicly known\r\ncollision attacks. SHA2 is currently used for most cryptographic purposes,\r\nwhere security is essential. False positives, random or maliciously provoked,\r\nwould be practically impossible. Duplicate detection will of course be slower,\r\ndepending on the chosen algorithm.\r\n\r\nFor the extremely paranoid case, there could be an additional setting which\r\nwould check files with two different hashing algorithms. The tradeoff in speed\r\nwould not be worthwhile for any normal use case, but the possibility could be\r\nthere.\r\n\r\n.. |license| image:: https://img.shields.io/badge/license-GPLv3+-blue.svg?maxAge=2592000\r\n :target: LICENSE\r\n.. |PyPi version| image:: https://img.shields.io/pypi/v/capidup-cli.svg\r\n :target: https://pypi.python.org/pypi/capidup-cli\r\n.. |PyPi pyversion| image:: https://img.shields.io/pypi/pyversions/capidup-cli.svg?maxAge=86400\r\n.. |Codacy Badge| image:: https://api.codacy.com/project/badge/Grade/7c0bc6264ca141f49fefe28609c6f6fe\r\n :target: https://www.codacy.com/app/israel-lugo/capidup-cli\r\n\r\n.. _capidup: https://github.com/israel-lugo/capidup", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/israel-lugo/capidup-cli", "keywords": "", "license": "GPLv3+", "maintainer": "", "maintainer_email": "", "name": "capidup-cli", "package_url": "https://pypi.org/project/capidup-cli/", "platform": "", "project_url": "https://pypi.org/project/capidup-cli/", "project_urls": { "Homepage": "https://github.com/israel-lugo/capidup-cli" }, "release_url": "https://pypi.org/project/capidup-cli/1.0.0/", "requires_dist": null, "requires_python": null, "summary": "Quickly find duplicate files in directories (CLI utility)", "version": "1.0.0" }, "last_serial": 2227141, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "fab381832c0a70a69c090d2c874b35c4", "sha256": "8ec05570244c2ab96a8e9d5c565193ce88d43ec95152f1df6cd99cae62cf054e" }, "downloads": -1, "filename": "capidup-cli-1.0.0.tar.gz", "has_sig": true, "md5_digest": "fab381832c0a70a69c090d2c874b35c4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5664, "upload_time": "2016-07-17T17:42:59", "url": "https://files.pythonhosted.org/packages/e4/19/1cf7b2dd974bd1f2d5557dd72af54e58b486e60af30599a2d3d40d606d74/capidup-cli-1.0.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "fab381832c0a70a69c090d2c874b35c4", "sha256": "8ec05570244c2ab96a8e9d5c565193ce88d43ec95152f1df6cd99cae62cf054e" }, "downloads": -1, "filename": "capidup-cli-1.0.0.tar.gz", "has_sig": true, "md5_digest": "fab381832c0a70a69c090d2c874b35c4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5664, "upload_time": "2016-07-17T17:42:59", "url": "https://files.pythonhosted.org/packages/e4/19/1cf7b2dd974bd1f2d5557dd72af54e58b486e60af30599a2d3d40d606d74/capidup-cli-1.0.0.tar.gz" } ] }