{ "info": { "author": "Yassine Azzouz", "author_email": "yassine.azzouz@agmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "Intended Audience :: System Administrators", "License :: OSI Approved :: MIT License", "Programming Language :: Python", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4" ], "description": "[![pypi](https://badge.fury.io/py/pydistcp.svg)](https://badge.fury.io/py/pydistcp)\n\npydistcp\n==================================\n\nA Python WebHDFS/HTTPFS-based tool for inter/intra-cluster data copying. It is well suited to cross-cluster copies of many small or mid-size files. Unlike the standard distcp, which adds overhead for submitting a MapReduce job and then waiting for YARN to schedule it, pydistcp uses WebHDFS to stream data from the source cluster's datanodes directly to the destination cluster's datanodes using multiple parallel threads. 
\n\nWhen transferring a few huge files, the standard distcp may be faster, but when transferring many small, mid-size, or relatively big files, pydistcp performs very well.\n\n```bash\n $ pydistcp -f -s staging -d prod /data/outgoing /data/incoming --threads=10 --part-size=131072\n 27.1% [ pending: 32 | transferring: 6 | complete: 4 ]\n```\n\n```json\nJob Status:\n{\n \"Size Failed\": 0,\n \"Size Copied\": 257721641,\n \"Source Path\": \"/data/t100\",\n \"Size Expected\": 257721641,\n \"Files Expected\": 42,\n \"Files Failed\": 0,\n \"Destination Path\": \"/data/t200\",\n \"Start Time\": \"2017-02-22 17:39:29\",\n \"Files Skipped\": 0,\n \"Size Deleted\": 0,\n \"End Time\": \"2017-02-22 17:39:50\",\n \"Files Copied\": 42,\n \"Files Deleted\": 0,\n \"Duration\": 20.756325006484985,\n \"Outcome\": \"Successful\",\n \"Size Skipped\": 0\n}\n```\n\nPydistcp uses [pywhdfs](https://github.com/yassineazzouz/pywhdfs) to establish connections with the WebHDFS/HTTPFS source and destination clusters.\n\nFeatures\n--------\n\n* Pydistcp relies on the pywhdfs project to establish WebHDFS and HTTPFS connections with the source and destination clusters,\n so all cluster configurations supported in pywhdfs are also supported in pydistcp:\n - Supports both secure (Kerberos, token) and insecure clusters.\n - Supports HA clusters and handles namenode failover.\n - Supports HDFS federation with multiple nameservices and mount points.\n* Supports data copy between secure and insecure clusters.\n* Supports data copy between clusters in different Kerberos realms using token authentication.\n* Supports data copy between encrypted and non-encrypted clusters.\n* JSON-format cluster configuration.\n* Performs concurrent multithreaded data copy.\n\n\nGetting started\n---------------\n\n```bash\n $ easy_install pydistcp\n```\n\n\nConfiguration\n---------------\n\nPydistcp shares the same JSON configuration file used by [pywhdfs](https://github.com/yassineazzouz/pywhdfs).\nPlease refer to the 
project readme file for details about the JSON configuration schema.\n\nUsage\n-------\n\nSeveral arguments change how the copy works or improve job performance, depending on the size of the machine running the copy.\nUse the help argument to display the full list of supported parameters:\n\n```bash\n $ pydistcp --help\n pydistcp: A python Web HDFS based tool for inter/intra-cluster data copying.\n\n Usage:\n pydistcp [-fp] [--no-checksum] [--silent] (-s CLUSTER -d CLUSTER) [-v...] [--part-size=PART_SIZE] [--threads=THREADS] SRC_PATH DEST_PATH\n pydistcp (--version | -h)\n\n Options:\n --version Show version and exit.\n -h --help Show help and exit.\n -s CLUSTER --src=CLUSTER Alias of source namenode to connect to (valid only with dist).\n -d CLUSTER --dest=CLUSTER Alias of destination namenode to connect to (valid only with dist).\n -v --verbose Enable log output. Can be specified multiple times to increase verbosity each time.\n --no-checksum Disable checksum check prior to file transfer. This will force overwrite.\n --silent Don't display progress status.\n -f --force Allow overwriting any existing files.\n -p --preserve Preserve file attributes.\n --threads=THREADS Number of threads to use for parallelization.\n zero limits the concurency to the maximim concurrent threads\n supported by the cluster. [default: 0]\n --part-size=PART_SIZE Interval in bytes by which the files will be copied\n needs to be a Powers of 2. [default: 65536]\n\n Examples:\n pydistcp -s prod -d preprod -v /tmp/src /tmp/dest\n```\n\nAll cluster connection parameters are read from the JSON configuration file. 
\n\n\nBenchmarks\n------------\n\nThe benchmarks below show the impact of data size on copy performance using pydistcp:\n\n\n| File Count | Data Size | Time |\n| ---------- | --------- | ------- |\n| 2379 | 11.4 G | 4m39.069s |\n| 242 | 25.9 G | 5m39.348s |\n| 869 | 116.9 G | 25m53.231s |\n| 42 | 545.8 M | 0m19.946s |\n| 1788 | 5.2 G | 2m25.649s |\n| 4428 | 35.7 G | 10m20.129s |\n| 2357 | 5.6 G | 3m2.598s |\n| 180 | 2.3 G | 0m33.133s |\n| 334 | 7.6 G | 1m26.260s |\n\nNote that all test cases were run with 10 concurrent threads on a machine with 6 cores supporting up to 12 threads, and no files\nwere skipped during the copy. Both the source and destination clusters are secured with Kerberos and use SSL to encrypt transferred data.\n\nPydistcp performance may be affected by many parameters, such as:\n- The size of the machine performing the copy.\n- The type of the source and destination clusters (secure clusters with Kerberos do not support many concurrent threads; from a performance perspective it is better to use token authentication).\n- SSL and the length of the encryption key used.\n- The type of data to be transferred: pydistcp delivers the best performance for multiple files of approximately uniform size. 
\n\nContributing\n------------\n\nFeedback and Pull requests are very welcome!", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/yassineazzouz/pydistcp", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "pydistcp", "package_url": "https://pypi.org/project/pydistcp/", "platform": "", "project_url": "https://pypi.org/project/pydistcp/", "project_urls": { "Homepage": "https://github.com/yassineazzouz/pydistcp" }, "release_url": "https://pypi.org/project/pydistcp/1.0.4/", "requires_dist": null, "requires_python": "", "summary": "pydistcp: python WebHDFS inter/intra-cluster data copy tool.", "version": "1.0.4" }, "last_serial": 2685620, "releases": { "1.0.0": [], "1.0.1": [], "1.0.2": [ { "comment_text": "", "digests": { "md5": "5e83722695001589686ac7d5e895ee4a", "sha256": "84cbcaf7e73d0a3486f1bc2c8817111d5214754c6e990341841c2895ec7d6a58" }, "downloads": -1, "filename": "pydistcp-1.0.2.tar.gz", "has_sig": false, "md5_digest": "5e83722695001589686ac7d5e895ee4a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11043, "upload_time": "2017-02-22T23:26:11", "url": "https://files.pythonhosted.org/packages/48/c6/d7ff8b71a93f9d58420959ae8721ba0562e808598b67b66ae70ec6f43034/pydistcp-1.0.2.tar.gz" } ], "1.0.3": [ { "comment_text": "", "digests": { "md5": "9a2366fcc7eaef25477970a785711eb4", "sha256": "f80f1f14ad3c4cfcc9633c8ccadaed94da2cc45e841df49ff4c8d59615d1c2bd" }, "downloads": -1, "filename": "pydistcp-1.0.3.tar.gz", "has_sig": false, "md5_digest": "9a2366fcc7eaef25477970a785711eb4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11035, "upload_time": "2017-02-22T23:45:50", "url": "https://files.pythonhosted.org/packages/28/f1/1b8579d45084b6bce289b36c4882d3db8e44dec6a360a69c641e3f808a03/pydistcp-1.0.3.tar.gz" } ], "1.0.4": [ { "comment_text": 
"", "digests": { "md5": "6c6b447e7b267f2e3dd721ac1162f1c1", "sha256": "26bb4fb9c587aec24a5af93b03ad74a8f550af58288a5032a481d2fc3b97753c" }, "downloads": -1, "filename": "pydistcp-1.0.4.tar.gz", "has_sig": false, "md5_digest": "6c6b447e7b267f2e3dd721ac1162f1c1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11598, "upload_time": "2017-02-27T08:25:29", "url": "https://files.pythonhosted.org/packages/8b/c3/c4cd221e2619f9fd9ac8fd5757d1faa198148fa0b6a606652c6966d04298/pydistcp-1.0.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "6c6b447e7b267f2e3dd721ac1162f1c1", "sha256": "26bb4fb9c587aec24a5af93b03ad74a8f550af58288a5032a481d2fc3b97753c" }, "downloads": -1, "filename": "pydistcp-1.0.4.tar.gz", "has_sig": false, "md5_digest": "6c6b447e7b267f2e3dd721ac1162f1c1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11598, "upload_time": "2017-02-27T08:25:29", "url": "https://files.pythonhosted.org/packages/8b/c3/c4cd221e2619f9fd9ac8fd5757d1faa198148fa0b6a606652c6966d04298/pydistcp-1.0.4.tar.gz" } ] }