{ "info": { "author": "hp310780", "author_email": "", "bugtrack_url": null, "classifiers": [ "Programming Language :: Python :: 3" ], "description": "# FindDuplicateFiles\n\nFinds all duplicate files within a given directory on a file system.\n\nThis module walks the given directory tree, groups files by size \n(equal size indicates potential duplicate content), and then compares the hashes of the files.\nThe hash can be chunked by passing in a chunk arg: an initial hash is computed for a chunk of the file, \nand the full hash is computed only if that first hash matches, thus avoiding\nexpensive hashes on large files.\n\n### Prerequisites\n\n* Python 3.6.5\n\n### Installing\n\n```\n> pip install find-duplicate-files\n> find_duplicate_files --dir /path/to/dir --chunk 2\n```\nTo run as a Python module:\n```\nimport find_duplicate_files\n# required arg: dir, optional: chunk\nfind_duplicate_files.find_duplicate_files(\"/path/to/dir\", chunk=1)\n```\n\n## Running the tests\n\nTo run the tests, please use the following commands:\n\n```\n> cd \n> python -m tests.run\n```\n\n## Test Data\n\nThe test data provided takes the following form:\n* tests/test_data/TestFindDuplicateFilesByHash: 5 .txt files of equal size (29 bytes). 1.txt and 3.txt have the same content. 4.txt and 5.txt have the same content. 2.txt has different content (but the same size). Used to verify the find_duplicate_files.find_duplicate_files_by_hash function.\n* tests/test_data/TestGenerateHash/1.txt: 1 .txt file against which to compare the outcome of find_duplicate_files.generate_hash.\n\n## Performance\n\nAn optional performance script compares the performance of hashing the full file versus the chunked approach when finding duplicate files. 
It outputs performance metrics.\nTo run:\n```\n> cd \n> python performance.py\n```\nExample output:\n```\nMethod 1 - Generate full hash returns correct duplicates.Time 0.006515709001178038\nMethod 2 - Generate chunked hash returns correct duplicates.Time 0.006872908999866922\n```\n\n## Benchmarking\n| Attempt | #1 | #2 | #3 | #4 |\n| :---: | :---: | :---: | :---:| :---: |\n| Chunk Size | 1 | 1 | 8 | 8 |\n| Seconds | 5.4 | 4.16 | 3.25 | 3.27 |\n\nTest Data: 10.9 GB, 3653 files, 128 duplicates, largest file ~156 MB\n\n## Further Optimisations\n* Investigate optimal chunk size given common file type\n* Investigate threading for performance\n* Investigate different hashing algorithms\n* Investigate recursive chunking, i.e. eliminating files that differ\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/hp310780/FindDuplicateFiles", "keywords": "", "license": "MIT License", "maintainer": "", "maintainer_email": "", "name": "find-duplicate-files", "package_url": "https://pypi.org/project/find-duplicate-files/", "platform": "", "project_url": "https://pypi.org/project/find-duplicate-files/", "project_urls": { "Homepage": "https://github.com/hp310780/FindDuplicateFiles" }, "release_url": "https://pypi.org/project/find-duplicate-files/1.0.0/", "requires_dist": null, "requires_python": "", "summary": "Module to find duplicate files in a directory", "version": "1.0.0" }, "last_serial": 5169967, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "7765f675d74c0c770cf92a0a6f8b7d78", "sha256": "a676a0f045933b05605ca9f3c6bffa34b0a5f57c074bbac35598cde09ec24d51" }, "downloads": -1, "filename": "find_duplicate_files-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "7765f675d74c0c770cf92a0a6f8b7d78", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5182, "upload_time": 
"2019-04-21T14:30:27", "url": "https://files.pythonhosted.org/packages/ac/26/e9858b58abaef46c7ee7262c8a542e812e7def4e7af24c236046755ced80/find_duplicate_files-1.0.0-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "7765f675d74c0c770cf92a0a6f8b7d78", "sha256": "a676a0f045933b05605ca9f3c6bffa34b0a5f57c074bbac35598cde09ec24d51" }, "downloads": -1, "filename": "find_duplicate_files-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "7765f675d74c0c770cf92a0a6f8b7d78", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5182, "upload_time": "2019-04-21T14:30:27", "url": "https://files.pythonhosted.org/packages/ac/26/e9858b58abaef46c7ee7262c8a542e812e7def4e7af24c236046755ced80/find_duplicate_files-1.0.0-py3-none-any.whl" } ] }