{ "info": { "author": "Anders Huss", "author_email": "andhus@kth.se", "bugtrack_url": null, "classifiers": [], "description": "[![Build Status](https://travis-ci.com/andhus/dirhash.svg?branch=master)](https://travis-ci.com/andhus/dirhash)\n[![codecov](https://codecov.io/gh/andhus/dirhash/branch/master/graph/badge.svg)](https://codecov.io/gh/andhus/dirhash)\n\n# dirhash\nA lightweight python module and tool for computing the hash of any\ndirectory based on its files' structure and content.\n- Supports any hashing algorithm of Python's built-in `hashlib` module\n- `.gitignore` style \"wildmatch\" patterns for expressive filtering of files to \ninclude/exclude.\n- Multiprocessing for up to [6x speed-up](#performance)\n\n## Installation\n```commandline\ngit clone git@github.com:andhus/dirhash.git\npip install dirhash/\n```\n\n## Usage\nPython module:\n```python\nfrom dirhash import dirhash\n\ndirpath = 'path/to/directory'\ndir_md5 = dirhash(dirpath, 'md5')\nfiltered_sha1 = dirhash(dirpath, 'sha1', ignore=['.*', '.*/', '*.pyc'])\npyfiles_sha3_512 = dirhash(dirpath, 'sha3_512', match=['*.py'])\n```\nCLI:\n```commandline\ndirhash path/to/directory -a md5\ndirhash path/to/directory -a sha1 -i \".* .*/ *.pyc\"\ndirhash path/to/directory -a sha3_512 -m \"*.py\"\n```\n\n## Why?\nIf you (or your application) need to verify the integrity of a set of files as well\nas their name and location, you might find this useful. Use-cases range from \nverification of your image classification dataset (before spending GPU-$$$ on \ntraining your fancy Deep Learning model) to validation of generated files in\nregression-testing.\n\nThere isn't really a standard way of doing this. There are plenty of recipes out \nthere (see e.g. 
these SO-questions for [linux](https://stackoverflow.com/questions/545387/linux-compute-a-single-hash-for-a-given-folder-contents)\nand [python](https://stackoverflow.com/questions/24937495/how-can-i-calculate-a-hash-for-a-filesystem-directory-using-python))\nbut I couldn't find one that is properly tested (there are some gotchas to cover!) \nand documented with a compelling user interface. `dirhash` was created with this as \nthe goal.\n\n[checksumdir](https://github.com/cakepietoast/checksumdir) is another python \nmodule/tool with similar intent (that inspired this project) but it lacks much of the\nfunctionality offered here (most notably including file names/structure in the hash)\nand lacks tests.\n\n## Performance\nThe python `hashlib` implementations of common hashing algorithms are highly\noptimised. `dirhash` mainly parses the file tree, pipes data to `hashlib` and \ncombines the output. Reasonable measures have been taken to minimize the overhead \nand for common use-cases, the majority of time is spent reading data from disk \nand executing `hashlib` code.\n\nThe main effort to boost performance is support for multiprocessing, where the\nreading and hashing are parallelized over individual files.\n\nAs a reference, let's compare the performance of the `dirhash` [CLI](https://github.com/andhus/dirhash/blob/master/dirhash/cli.py) \nwith the shell command:\n\n`find path/to/folder -type f -print0 | sort -z | xargs -0 md5 | md5` \n\nwhich is the top answer for the SO-question: \n[Linux: compute a single hash for a given folder & contents?](https://stackoverflow.com/questions/545387/linux-compute-a-single-hash-for-a-given-folder-contents)\nResults for two test cases are shown below. 
Both have 1 GiB of random data: in \n\"flat_1k_1MB\", split into 1k files (1 MiB each) in a flat structure, and in \n\"nested_32k_32kB\", into 32k files (32 KiB each) spread over the 256 leaf directories \nin a binary tree of depth 8.\n\nImplementation | Test Case | Time (s) | Speed up\n------------------- | --------------- | -------: | -------:\nshell reference | flat_1k_1MB | 2.29 | -> 1.0\n`dirhash` | flat_1k_1MB | 1.67 | 1.36\n`dirhash`(8 workers)| flat_1k_1MB | 0.48 | **4.73**\nshell reference | nested_32k_32kB | 6.82 | -> 1.0\n`dirhash` | nested_32k_32kB | 3.43 | 2.00\n`dirhash`(8 workers)| nested_32k_32kB | 1.14 | **6.00**\n\nThe benchmark was run on a MacBook Pro (2018); further details and source code are available [here](https://github.com/andhus/dirhash/tree/master/benchmark).\n\n## Documentation\nPlease refer to `dirhash -h` and the python [source code](https://github.com/andhus/dirhash/blob/master/dirhash/__init__.py).", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/andhus/dirhash", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "dirhash", "package_url": "https://pypi.org/project/dirhash/", "platform": "", "project_url": "https://pypi.org/project/dirhash/", "project_urls": { "Homepage": "https://github.com/andhus/dirhash" }, "release_url": "https://pypi.org/project/dirhash/0.1.1/", "requires_dist": null, "requires_python": "", "summary": "Python module and CLI for hashing of file system directories.", "version": "0.1.1" }, "last_serial": 4823377, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "f29a18f60abe9676db50ee87ea7f6159", "sha256": "dc88718f06dd7f6c3bb4fdfd1567ae161af152aecb0a74dae28fbfe726166ec3" }, "downloads": -1, "filename": "dirhash-0.1.1.tar.gz", "has_sig": false, "md5_digest": "f29a18f60abe9676db50ee87ea7f6159", "packagetype": "sdist", "python_version": "source", 
"requires_python": null, "size": 13390, "upload_time": "2019-02-15T06:25:21", "url": "https://files.pythonhosted.org/packages/e3/7f/7b41eb6b6c9695569bdeaff2bdeab3fa70b6df03f6b6ae016ca8c8370ee5/dirhash-0.1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f29a18f60abe9676db50ee87ea7f6159", "sha256": "dc88718f06dd7f6c3bb4fdfd1567ae161af152aecb0a74dae28fbfe726166ec3" }, "downloads": -1, "filename": "dirhash-0.1.1.tar.gz", "has_sig": false, "md5_digest": "f29a18f60abe9676db50ee87ea7f6159", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13390, "upload_time": "2019-02-15T06:25:21", "url": "https://files.pythonhosted.org/packages/e3/7f/7b41eb6b6c9695569bdeaff2bdeab3fa70b6df03f6b6ae016ca8c8370ee5/dirhash-0.1.1.tar.gz" } ] }