{ "info": { "author": "Steve Schmerler", "author_email": "git@elcorto.com", "bugtrack_url": null, "classifiers": [], "description": "About\n=====\n\nFind duplicate files and directories.\n\nAs other tools we use file hashes but additionally, we report duplicate\ndirectories as well, using a Merkle tree for directory hash calculation.\n\nTo increase performance, we use parallel hash calculation and optional limits\non to-be-hashed data.\n\n\nInstall\n=======\n\nFrom pypi:\n\n.. code-block:: shell\n\n $ pip3 install findsame\n\nDev install of this repo:\n\n.. code-block:: shell\n\n $ git clone ...\n $ cd findsame\n $ pip3 install -e .\n\nThe core part (package ``findsame`` and the CLI ``bin/findsame``) has no\nexternal dependencies. If you want to run the benchmarks (see \"Benchmarks\"\nbelow), install:\n\n.. code-block:: shell\n\n $ pip3 install -r requirements_benchmark.txt\n\n\nUsage\n=====\n\n::\n\n usage: findsame [-h] [-b BLOCKSIZE] [-l LIMIT] [-L AUTO_LIMIT_MIN] [-p NPROCS]\n [-t NTHREADS] [-o OUTMODE] [-v]\n file/dir [file/dir ...]\n\n Find same files and dirs based on file hashes.\n\n positional arguments:\n file/dir files and/or dirs to compare\n\n optional arguments:\n -h, --help show this help message and exit\n -b BLOCKSIZE, --blocksize BLOCKSIZE\n blocksize in hash calculation, use units K,M,G as in\n 100M, 218K or just 1024 (bytes) [default: 256.0K]\n -l LIMIT, --limit LIMIT\n read limit (bytes or 'auto'), if bytes then same units\n as for BLOCKSIZE apply, calculate hash only over the\n first LIMIT bytes, makes things go faster for may\n large files, try 500K [default: None], use 'auto' to\n try to determine the smallest value necessary\n automatically\n -L AUTO_LIMIT_MIN, --auto-limit-min AUTO_LIMIT_MIN\n start value for auto LIMIT calculation when --limit\n auto is used [default: 8.0K]\n -p NPROCS, --nprocs NPROCS\n number of parallel processes [default: 1]\n -t NTHREADS, --nthreads NTHREADS\n threads per process [default: 4]\n -o OUTMODE, --outmode OUTMODE\n 1: list of dicts (values of dict from mode 2), one\n dict per hash, 2: dict of dicts (full result), keys\n are hashes, 3: compact, sort by type (file, dir)\n [default: 3]\n -v, --verbose enable verbose/debugging output\n\nThe output format is json, see ``-o/--outmode``, default is ``-o 3``. An\nexample using the test suite data:\n\n.. code-block:: shell\n\n $ cd findsame/tests\n $ findsame data | jq .\n {\n \"dir:empty\": [\n [\n \"data/dir2/empty_dir\",\n \"data/dir2/empty_dir_copy\",\n \"data/empty_dir\",\n \"data/empty_dir_copy\"\n ]\n ],\n \"dir\": [\n [\n \"data/dir1\",\n \"data/dir1_copy\"\n ]\n ],\n \"file:empty\": [\n [\n \"data/dir2/empty_dir/empty_file\",\n \"data/dir2/empty_dir_copy/empty_file\",\n \"data/empty_dir/empty_file\",\n \"data/empty_dir_copy/empty_file\",\n \"data/empty_file\",\n \"data/empty_file_copy\"\n ]\n ],\n \"file\": [\n [\n \"data/dir1/file2\",\n \"data/dir1/file2_copy\",\n \"data/dir1_copy/file2\",\n \"data/dir1_copy/file2_copy\",\n \"data/file2\"\n ],\n [\n \"data/lena.png\",\n \"data/lena_copy.png\"\n ],\n [\n \"data/file1\",\n \"data/file1_copy\"\n ]\n ]\n }\n\nThis returns a dict whose keys are the path type (file, dir). Values are nested\nlists. Each sub-list contains paths having the same hash. A special case is\n``file:empty`` and ``dir:empty`` which actually have the same hash (that of an\nempty string), which is not visible in this format. Use ``-o1`` or ``-o2`` in\nthat case. More examples below.\n\nUse `jq `_ for pretty-printing. Post-processing\nis only limited by your ability to process json (using ``jq``, Python, ...).\n\nNote that the order of key-value entries in the output from both ``findsame``\nand ``jq`` is random.\n\nNote that currently, we skip symlinks.\n\n\nPerformance\n===========\n\nParallel hash calculation\n-------------------------\nBy default, we use ``--nthreads`` equal to the number of cores. See\n\"Benchmarks\" below.\n\nLimit data to be hashed\n-----------------------\n\nStatic limit\n~~~~~~~~~~~~\nApart from parallelization, by far the most speed is gained by using\n``--limit``. Note that this may lead to false positives, if files are exactly\nequal in the first ``LIMIT`` bytes. Finding a good enough value can be done by\ntrial and error. Try 500K. This is still quite fast and seems to cover most\nreal-world data.\n\nAutomatic optimal limit\n~~~~~~~~~~~~~~~~~~~~~~~\nWe have an *experimental* feature where we iteratively increase ``LIMIT`` to\nfind the smallest possible value. In every iteration, we increase the last\nlimit by multiplying with ``config.cfg.auto_limit_increase_fac``, with that\nre-calculate only the hash of files that have the same hash as others within\nthe last ``LIMIT`` and check whether their new hashes are now different. This\nworks but hasn't been extensively benchmarked. The assumption is that a small\nnumber of iterations on a subset of all files (those reported equal so far)\nconverges quickly and is still faster than a non-optimal ``LIMIT`` or even no\nlimit at all when you have many big files (as in GiB).\n\nRelated options and defaults:\n\n* ``--limit auto``\n* ``--auto-limit-min 8K`` = ``config.cfg.auto_limit_min``\n* ``config.cfg.auto_limit_increase_fac=2`` (no cmd line so far)\n\nObservations so far:\n\nConvergence corner cases: When files are equal in a good chunk at file start\nand ``auto_limit_min`` is small, then the first few iterations show no change\nin files being equal (which we use to detect converged limit values). To\ncircumvent early converge here, we iterate until the number of equal files\nchanges. The worst case scenario is that ``auto_limit_min`` is already optimal.\nSince there is no way to determine that a priori, we will iterate until limit\nhits the biggest file size, which may be slow. That is why it is important to\nchoose ``auto_limit_min`` small enough.\n\n``auto_limit_min``: Don't use very small values such as 20 (that is 20 bytes).\nWe found that this can converge to a local optimum (converged but too many\nequal files reported), depending in the structure of the headers of the files\nyou compare. Stick with something like a small multiple of the blocksize of\nyour file system (we use 8K).\n\n\nTests\n=====\n\nRun ``nosetests3`` (maybe ``apt install python3-nose`` before (Debian)).\n\n\nBenchmarks\n==========\n\nYou may run the benchmark script to find the best blocksize and number threads\nand/or processes for hash calculations on your machine.\n\n.. code-block:: shell\n\n $ cd findsame/benchmark\n $ ./clean.sh\n $ ./benchmark.py\n $ ./plot.py\n\nThis writes test files of various size to ``benchmark/files`` and runs a couple\nof benchmarks (runtime ~10 min for all benchmarks). Make sure to avoid doing\nany other extensive IO tasks while the benchmarks run, of course.\n\n**The default value of \"maxsize\" in benchmark.py (in the __main__ part) is only\nsome MiB to allow quick testing. This needs to be changed to, say, 1 GiB in\norder to have meaningful benchmarks.**\n\nBottom line:\n\n* blocksizes below 512 KiB (``--blocksize 512K``) work best for all file sizes\n on most systems, even though the variation to worst timings is at most factor\n 1.25 (e.g. 1 vs. 1.25 seconds)\n* multithreading (``-t/--nthreads``): up to 2x speedup on dual-core box -- very\n efficient, use NTHREADS = number of cores for good baseline performance\n (problem is mostly IO-bound)\n* multiprocessing (``-p/--nprocs``): less efficient speedup, but on some\n systems NPROCS + NTHREADS is even a bit faster than NTHREADS alone, testing\n is mandatory\n* we have a linear increase of runtime with filesize, of course\n\nTested systems:\n\n* Lenovo E330, Samsung 840 Evo SSD, Core i3-3120M (2 cores, 2 threads / core)\n* Lenovo X230, Samsung 840 Evo SSD, Core i5-3210M (2 cores, 2 threads / core)\n\n * best blocksizes = 256K\n * speedups: NPROCS=2: 1.5, NTHREADS=2..3: 1.9,\n no gain when using NPROCS+NTHREADS\n\n* FreeNAS 11 (FreeBSD 11.0), ZFS mirror WD Red WD40EFRX, Intel Celeron J3160\n (4 cores, 1 thread / core)\n\n * best blocksizes = 80K\n * speedups: NPROCS=3..4: 2.1..2.2, NTHREADS=4..6: 2.6..2.7, NPROCS=3..4,NTHREADS=4: 3\n\n\nOutput modes\n============\n\nDefault (``-o3``)\n-----------------\n\nThe default output format is ``-o3`` (same as the initial example above).\n\n.. code-block:: shell\n\n $ findsame -o3 data | jq .\n {\n \"dir:empty\": [\n [\n \"data/dir2/empty_dir\",\n \"data/dir2/empty_dir_copy\",\n \"data/empty_dir\",\n \"data/empty_dir_copy\"\n ]\n ],\n \"dir\": [\n [\n \"data/dir1\",\n \"data/dir1_copy\"\n ]\n ],\n \"file:empty\": [\n [\n \"data/dir2/empty_dir/empty_file\",\n \"data/dir2/empty_dir_copy/empty_file\",\n \"data/empty_dir/empty_file\",\n \"data/empty_dir_copy/empty_file\",\n \"data/empty_file\",\n \"data/empty_file_copy\"\n ]\n ],\n \"file\": [\n [\n \"data/dir1/file2\",\n \"data/dir1/file2_copy\",\n \"data/dir1_copy/file2\",\n \"data/dir1_copy/file2_copy\",\n \"data/file2\"\n ],\n [\n \"data/lena.png\",\n \"data/lena_copy.png\"\n ],\n [\n \"data/file1\",\n \"data/file1_copy\"\n ]\n ]\n }\n\n\nOutput with hashes (``-o2``)\n-----------------------------\n\n.. code-block:: shell\n\n $ findsame -o2 data | jq .\n {\n \"da39a3ee5e6b4b0d3255bfef95601890afd80709\": {\n \"dir:empty\": [\n \"data/dir2/empty_dir\",\n \"data/dir2/empty_dir_copy\",\n \"data/empty_dir\",\n \"data/empty_dir_copy\"\n ],\n \"file:empty\": [\n \"data/dir2/empty_dir/empty_file\",\n \"data/dir2/empty_dir_copy/empty_file\",\n \"data/empty_dir/empty_file\",\n \"data/empty_dir_copy/empty_file\",\n \"data/empty_file\",\n \"data/empty_file_copy\"\n ]\n },\n \"55341fe74a3497b53438f9b724b3e8cdaf728edc\": {\n \"dir\": [\n \"data/dir1\",\n \"data/dir1_copy\"\n ]\n },\n \"9619a9b308cdebee40f6cef018fef0f4d0de2939\": {\n \"file\": [\n \"data/dir1/file2\",\n \"data/dir1/file2_copy\",\n \"data/dir1_copy/file2\",\n \"data/dir1_copy/file2_copy\",\n \"data/file2\"\n ]\n },\n \"0a96c2e755258bd46abdde729f8ee97d234dd04e\": {\n \"file\": [\n \"data/lena.png\",\n \"data/lena_copy.png\"\n ]\n },\n \"312382290f4f71e7fb7f00449fb529fce3b8ec95\": {\n \"file\": [\n \"data/file1\",\n \"data/file1_copy\"\n ]\n }\n }\n\nThe output is one dict (json object) where all same-hash files/dirs are found\nat the same key (hash).\n\nDict values (``-o1``)\n---------------------\nThe format ``-o1`` lists only the dict values from ``-o2``, i.e. a list of\ndicts.\n\n.. code-block:: shell\n\n $ findsame -o1 data | jq .\n [\n {\n \"dir:empty\": [\n \"data/dir2/empty_dir\",\n \"data/dir2/empty_dir_copy\",\n \"data/empty_dir\",\n \"data/empty_dir_copy\"\n ],\n \"file:empty\": [\n \"data/dir2/empty_dir/empty_file\",\n \"data/dir2/empty_dir_copy/empty_file\",\n \"data/empty_dir/empty_file\",\n \"data/empty_dir_copy/empty_file\",\n \"data/empty_file\",\n \"data/empty_file_copy\"\n ]\n },\n {\n \"dir\": [\n \"data/dir1\",\n \"data/dir1_copy\"\n ]\n },\n {\n \"file\": [\n \"data/file1\",\n \"data/file1_copy\"\n ]\n },\n {\n \"file\": [\n \"data/dir1/file2\",\n \"data/dir1/file2_copy\",\n \"data/dir1_copy/file2\",\n \"data/dir1_copy/file2_copy\",\n \"data/file2\"\n ]\n },\n {\n \"file\": [\n \"data/lena.png\",\n \"data/lena_copy.png\"\n ]\n }\n ]\n\n\nMore usage examples\n===================\n\nHere we show examples of common post-processing tasks using ``jq``. When the\n``jq`` command works for all three output modes, we don't specify the ``-o``\noption.\n\nCount the total number of all equals:\n\n.. code-block:: shell\n\n $ findsame data | jq '.[]|.[]|.[]' | wc -l\n\nFind only groups of equal dirs:\n\n.. code-block:: shell\n\n $ findsame -o1 data | jq '.[]|select(.dir)|.dir'\n $ findsame -o2 data | jq '.[]|select(.dir)|.dir'\n $ findsame -o3 data | jq '.dir|.[]'\n [\n \"data/dir1\",\n \"data/dir1_copy\"\n ]\n\nGroups of equal files:\n\n.. code-block:: shell\n\n $ findsame -o1 data | jq '.[]|select(.file)|.file'\n $ findsame -o2 data | jq '.[]|select(.file)|.file'\n $ findsame -o3 data | jq '.file|.[]'\n [\n \"data/dir1/file2\",\n \"data/dir1/file2_copy\",\n \"data/dir1_copy/file2\",\n \"data/dir1_copy/file2_copy\",\n \"data/file2\"\n ]\n [\n \"data/lena.png\",\n \"data/lena_copy.png\"\n ]\n [\n \"data/file1\",\n \"data/file1_copy\"\n ]\n\nFind the first element in a group of equal items (file or dir):\n\n.. code-block:: shell\n\n $ findsame data | jq '.[]|.[]|[.[0]]'\n [\n \"data/lena.png\"\n ]\n [\n \"data/dir2/empty_dir\"\n ]\n [\n \"data/dir2/empty_dir/empty_file\"\n ]\n [\n \"data/dir1/file2\"\n ]\n [\n \"data/file1\"\n ]\n [\n \"data/dir1\"\n ]\n\nor more compact w/o the length-1 list:\n\n.. code-block:: shell\n\n $ findsame data | jq '.[]|.[]|.[0]'\n \"data/dir2/empty_dir\"\n \"data/dir2/empty_dir/empty_file\"\n \"data/dir1/file2\"\n \"data/lena.png\"\n \"data/file1\"\n \"data/dir1\"\n\n\nFind *all but the first* element in a group of equal items (file or dir):\n\n.. code-block:: shell\n\n $ findsame data | jq '.[]|.[]|.[1:]'\n [\n \"data/dir1_copy\"\n ]\n [\n \"data/lena_copy.png\"\n ]\n [\n \"data/dir1/file2_copy\",\n \"data/dir1_copy/file2\",\n \"data/dir1_copy/file2_copy\",\n \"data/file2\"\n ]\n [\n \"data/dir2/empty_dir_copy/empty_file\",\n \"data/empty_dir/empty_file\",\n \"data/empty_dir_copy/empty_file\",\n \"data/empty_file\",\n \"data/empty_file_copy\"\n ]\n [\n \"data/dir2/empty_dir_copy\",\n \"data/empty_dir\",\n \"data/empty_dir_copy\"\n ]\n [\n \"data/file1_copy\"\n ]\n\nAnd more compact:\n\n.. code-block:: shell\n\n $ findsame data | jq '.[]|.[]|.[1:]|.[]'\n \"data/file1_copy\"\n \"data/dir1/file2_copy\"\n \"data/dir1_copy/file2\"\n \"data/dir1_copy/file2_copy\"\n \"data/file2\"\n \"data/lena_copy.png\"\n \"data/dir2/empty_dir_copy/empty_file\"\n \"data/empty_dir/empty_file\"\n \"data/empty_dir_copy/empty_file\"\n \"data/empty_file\"\n \"data/empty_file_copy\"\n \"data/dir2/empty_dir_copy\"\n \"data/empty_dir\"\n \"data/empty_dir_copy\"\n \"data/dir1_copy\"\n\nThe last one can be used to remove all but the first in a group of equal\nfiles/dirs:\n\n.. code-block:: shell\n\n $ findsame data | jq '.[]|.[]|.[1:]|.[]' | xargs cp -rvt duplicates/\n\n``jq`` trick: preserve color in ``less(1)``:\n\n.. code-block:: shell\n\n $ findsame data | jq . -C | less -R\n\n\nOther tools\n===========\n\n* ``fdupes``\n* ``findup`` from ``fslint``\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/elcorto/findsame", "keywords": "merkle-tree hash duplicates multithreading multiprocessing", "license": "BSD 3-Clause", "maintainer": "", "maintainer_email": "", "name": "findsame", "package_url": "https://pypi.org/project/findsame/", "platform": "", "project_url": "https://pypi.org/project/findsame/", "project_urls": { "Homepage": "https://github.com/elcorto/findsame" }, "release_url": "https://pypi.org/project/findsame/0.1.2/", "requires_dist": null, "requires_python": "", "summary": "Find duplicate files and directories using hashes and a Merkle tree", "version": "0.1.2" }, "last_serial": 5787510, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "bcacbe3c47922328a52a8e6048c4f0fe", "sha256": "e1d794be0d288b7712bdedd821ef621129666df760a08b4894b44f25c95c95db" }, "downloads": -1, "filename": "findsame-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "bcacbe3c47922328a52a8e6048c4f0fe", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 25433, "upload_time": "2019-08-31T09:28:41", "url": "https://files.pythonhosted.org/packages/9f/a3/c8a5244ac4b21055793478a9916dd28d274f22897d750e8b5357d88d8678/findsame-0.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5f41399a395e9335c1766ac5fcb8dea9", "sha256": "c6aa42353ca95d354eb80d2bece6944148c945acbc4f780af2623b39f6e0fe29" }, "downloads": -1, "filename": "findsame-0.1.0.tar.gz", "has_sig": false, "md5_digest": "5f41399a395e9335c1766ac5fcb8dea9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 25447, "upload_time": "2019-08-31T09:28:44", "url": "https://files.pythonhosted.org/packages/87/47/0eff725726d43312426bbc4de87f397cd4941e630f8caaf9d9eb3c5ff720/findsame-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "378d11fa426bfa055e342e3b7461528e", "sha256": "60c72024562f043012ddc96d496800295d2a6a00c138b5ed26153e5eba218a52" }, "downloads": -1, "filename": "findsame-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "378d11fa426bfa055e342e3b7461528e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 26525, "upload_time": "2019-09-02T17:18:59", "url": "https://files.pythonhosted.org/packages/2a/f7/a3b5f6efb9a614754fdbcfe3d5408c5dffbbcbd275dda4bf0434561c3c2c/findsame-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "78760e70d2b71d50829ac2de115cff67", "sha256": "ff0850dbd0af6e26ae0bb25634b74b97fb3fbc207a0c06bd87b7620ef54cbfe2" }, "downloads": -1, "filename": "findsame-0.1.1.tar.gz", "has_sig": false, "md5_digest": "78760e70d2b71d50829ac2de115cff67", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 27387, "upload_time": "2019-09-02T17:19:01", "url": "https://files.pythonhosted.org/packages/96/d1/8bb0494374018f5297ca5cc4ae5a12015a1a6b5c70af2f4480cd94619059/findsame-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "a718e23b2587fdbeed34122b0cb2899d", "sha256": "0b8e7858cf6d13eef665e042738db05838808bed36856363cb8c1d7c623cd237" }, "downloads": -1, "filename": "findsame-0.1.2-py3-none-any.whl", "has_sig": false, "md5_digest": "a718e23b2587fdbeed34122b0cb2899d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 26347, "upload_time": "2019-09-05T17:05:25", "url": "https://files.pythonhosted.org/packages/6a/ed/4d7b9c71740a3a08c17e112ef39a8a33c0fd3336db5b2f6356679b1f67eb/findsame-0.1.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f78be0fb035948dbf34a2e1a854fe993", "sha256": "9cd43bbef02227fe2cb3c029b31ed1ef2919d0de5d563875e790d63210625cfd" }, "downloads": -1, "filename": "findsame-0.1.2.tar.gz", "has_sig": false, "md5_digest": "f78be0fb035948dbf34a2e1a854fe993", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 27272, "upload_time": "2019-09-05T17:05:27", "url": "https://files.pythonhosted.org/packages/e5/f6/d2dc11e676e3aeb14fe39e59ee3ee588d715220e9af64b338007f0a1d388/findsame-0.1.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "a718e23b2587fdbeed34122b0cb2899d", "sha256": "0b8e7858cf6d13eef665e042738db05838808bed36856363cb8c1d7c623cd237" }, "downloads": -1, "filename": "findsame-0.1.2-py3-none-any.whl", "has_sig": false, "md5_digest": "a718e23b2587fdbeed34122b0cb2899d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 26347, "upload_time": "2019-09-05T17:05:25", "url": "https://files.pythonhosted.org/packages/6a/ed/4d7b9c71740a3a08c17e112ef39a8a33c0fd3336db5b2f6356679b1f67eb/findsame-0.1.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f78be0fb035948dbf34a2e1a854fe993", "sha256": "9cd43bbef02227fe2cb3c029b31ed1ef2919d0de5d563875e790d63210625cfd" }, "downloads": -1, "filename": "findsame-0.1.2.tar.gz", "has_sig": false, "md5_digest": "f78be0fb035948dbf34a2e1a854fe993", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 27272, "upload_time": "2019-09-05T17:05:27", "url": "https://files.pythonhosted.org/packages/e5/f6/d2dc11e676e3aeb14fe39e59ee3ee588d715220e9af64b338007f0a1d388/findsame-0.1.2.tar.gz" } ] }