{ "info": { "author": "Vincent Pelletier", "author_email": "plr.vincent@gmail.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: System Administrators", "License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)", "Operating System :: POSIX :: Linux", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: Implementation :: CPython", "Programming Language :: Python :: Implementation :: PyPy", "Topic :: System :: Archiving :: Backup" ], "description": ".. contents::\n\nRe-hardlink directories created using ``rsync --link-dest``.\n\n``rsync --link-dest``\n=====================\n\nThis is a nice way to make backups which are trivial to restore (cp -a) and use\nlittle disk space. Low-tech enough that I trust it when I often have new files\nrather than modified large files.\n\nThe problem\n===========\n\nSometimes, because of a user mistake or a bug, you get a new full backup\nhappening. As a result, a lot of disk space is suddenly allocated for a backup\nwhere little changed. This program is a shot at curing this situation after it\nhappened.\n\nThe algorithm\n=============\n\nLet's take the following simple backup content::\n\n /some/place/2017-01-01/home/foo/bar\n /some/place/2017-01-01/home/foo/boo\n /some/place/2017-01-02/home/foo/bar\n /some/place/2017-01-02/home/foo/baz\n /some/place/2017-01-03/home/foo/bar\n /some/place/2017-01-03/home/foo/baz\n /some/place/2017-01-03/home/foo/boo\n\nIt can be seen as a sparse 2d grid where cells in each row are strongly\nrelated (they are more likely to be the same file than accross rows):\n\n+----------------+-------+----------------+----------------+---------------+\n| Relative path\u2193 | Head\u2192 | ``2017-01-01`` | ``2017-01-02`` | ``2017-01-03``|\n+================+=======+================+================+===============+\n|``/home/foo/bar`` | inode 1 | inode 3 |\n+------------------------+----------------+----------------+---------------+\n|``/home/foo/baz`` | (n/a) | inode 2 |\n+------------------------+----------------+----------------+---------------+\n|``/home/foo/boo`` | inode 4 | (n/a) | inode 5 |\n+------------------------+----------------+----------------+---------------+\n\nThis tool takes the ``heads`` as arguments (ex: one folder per day).\nIt resurses through the first head, and, for each non-directory path found,\niterates over the heads on its right, building mapping between inodes and the\nheads each inode is in:\n\n.. code:: python\n\n # relative_path = \"/home/foo/bar\"\n {\n 1: [\"2017-01-01\", \"2017-01-02\"],\n 3: [\"2017-03-03\"],\n }\n\nThen, each inode is compared to the others (by picking an arbitrary file in the\nlist), and each inode identical to a given one gets unlinked and relinked to\nthat inode.\n\nOptionally (enabled by default), the code then remembers the (relative) path it\njust analysed, so that it does not analyse it again when iterating over next\nhead. This is very efficient, but makes memory usage proportional to the number\nof unique relative paths and their length. In my experience, with pypy, this is\nunder 200MB for over 300k unique relative paths.\n\nThen it goes to the next relative path in current head. Then it moves on to the\nnext head to process files which did not yet exist in current head.\n\nAs a result, execution time is dominated by the number of unique relative paths\ninstead of the total number of files. For the example above, it will read these\nfiles, in this order, this number of times::\n\n /some/place/2017-01-01/home/foo/bar\n /some/place/2017-01-03/home/foo/bar\n /some/place/2017-01-01/home/foo/boo\n /some/place/2017-01-03/home/foo/boo\n /some/place/2017-01-02/home/foo/baz\n\nAssuming inodes 1 and 2 are identical, and 4 and 5 are identical, optimal\n``rsync --link-dest`` (so without users & code issues) would have produced:\n\n+----------------+-------+----------------+----------------+---------------+\n| Relative path\u2193 | Head\u2192 | ``2017-01-01`` | ``2017-01-02`` | ``2017-01-03``|\n+================+=======+================+================+===============+\n|``/home/foo/bar`` | inode 1 |\n+------------------------+----------------+----------------+---------------+\n|``/home/foo/baz`` | (n/a) | inode 2 |\n+------------------------+----------------+----------------+---------------+\n|``/home/foo/boo`` | inode 4 | (n/a) | inode 5 |\n+------------------------+----------------+----------------+---------------+\n\nAnd this script will produce:\n\n+----------------+-------+----------------+----------------+---------------+\n| Relative path\u2193 | Head\u2192 | ``2017-01-01`` | ``2017-01-02`` | ``2017-01-03``|\n+================+=======+================+================+===============+\n|``/home/foo/bar`` | inode 1 |\n+------------------------+----------------+----------------+---------------+\n|``/home/foo/baz`` | (n/a) | inode 2 |\n+------------------------+----------------+----------------+---------------+\n|``/home/foo/boo`` | inode 4 | (n/a) | inode 4 |\n+------------------------+----------------+----------------+---------------+\n\nAn advantage of this method compared to other hard-linkers is that it freezes\nspace regularly while running by unlinking all homonym duplicates at once.\n\nA disadvantage of this method compared to other hard-linkers is that it will not\nlink together identical files if their name differs. But neither would\n``rsync --link-dest``, so it would not be our backup too of choice.", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/vpelletier/resynclinkdest", "keywords": "rsync --link-dest hardlink", "license": "GPL 3+", "maintainer": "", "maintainer_email": "", "name": "resynclinkdest", "package_url": "https://pypi.org/project/resynclinkdest/", "platform": "any", "project_url": "https://pypi.org/project/resynclinkdest/", "project_urls": { "Homepage": "http://github.com/vpelletier/resynclinkdest" }, "release_url": "https://pypi.org/project/resynclinkdest/1.0.1/", "requires_dist": null, "requires_python": "", "summary": "Re-hardlink directories created using ``rsync --link-dest``.", "version": "1.0.1" }, "last_serial": 3100710, "releases": { "1.0": [ { "comment_text": "", "digests": { "md5": "c132dcd846f75fcb8c6a1fb8bd053716", "sha256": "4ada7215f058bcba383b9631fb7abd9d5e3020041f5bc2426d5e6316f128a16b" }, "downloads": -1, "filename": "resynclinkdest-1.0.tar.gz", "has_sig": false, "md5_digest": "c132dcd846f75fcb8c6a1fb8bd053716", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3534, "upload_time": "2017-08-15T10:55:50", "url": "https://files.pythonhosted.org/packages/37/99/ff7113f387a7261d52b03cb489688b52dd01b6fb62d8c5dea984f08251e3/resynclinkdest-1.0.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "cb285df8186f643250e66ae73e3aa2e8", "sha256": "fb48b3ab55115b9c70fafbef0506b6f660084796962e98afd82d95519aa71a7f" }, "downloads": -1, "filename": "resynclinkdest-1.0.1.tar.gz", "has_sig": false, "md5_digest": "cb285df8186f643250e66ae73e3aa2e8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5506, "upload_time": "2017-08-16T13:27:27", "url": "https://files.pythonhosted.org/packages/31/7f/b2f1c2ada4e69cd8c80e50190c4966d01f7918d704d763f734017f138572/resynclinkdest-1.0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "cb285df8186f643250e66ae73e3aa2e8", "sha256": "fb48b3ab55115b9c70fafbef0506b6f660084796962e98afd82d95519aa71a7f" }, "downloads": -1, "filename": "resynclinkdest-1.0.1.tar.gz", "has_sig": false, "md5_digest": "cb285df8186f643250e66ae73e3aa2e8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5506, "upload_time": "2017-08-16T13:27:27", "url": "https://files.pythonhosted.org/packages/31/7f/b2f1c2ada4e69cd8c80e50190c4966d01f7918d704d763f734017f138572/resynclinkdest-1.0.1.tar.gz" } ] }