{ "info": { "author": "Daniel McClary", "author_email": "dan.mcclary@northwestern.edu", "bugtrack_url": null, "classifiers": [], "description": "===========\nZiggy\n===========\n\nZiggy provides a collection of python methods for Hadoop Streaming. Ziggy is\nuseful for building complex MapReduce programs, using Hadoop for batch processing\nof many files, Monte Carlo processes, graph algorithms, and common utility tasks (e.g. sort, search).\nTypical usage\noften looks like this::\n\n #!/usr/bin/env python\n\n import ziggy.hdmc as hdmc\n\tfrom glob import glob\n\t\n\tfiles_to_process = glob(\"/some/path/*\")\n\tresults = hdmc.submit_checkpoint_inline(script_to_run, output_filename, files_to_process, argument_string)\n \nTo install run::\n\n\tpython setup.py hadoop\n\tpython setup.py install\n\nZiggy was authored by Dan McClary, Ph.D. and originates in the\n `Amaral Lab at Northwestern University `_.\n \n=====================\nInstallation Details\n=====================\n \nUnsurprisingly, Ziggy requires a Hadoop cluster. To make Ziggy work with\nyour cluster, you need to edit the setup.cfg file before running \n``python setup.py hadoop``. This ensures that Ziggy creates the correct configuration\nfiles for its modules.\n \nsetup.cfg currently contains 3 definitions that must be specified. These are:\n\n* hadoop-home\n\n The HADOOP_HOME for your system. For example, the default on our clusters at Northwestern is /usr/local/hadoop.\n \t\n* num-map-tasks\n\n The total number of map tasks your cluster for which your cluster is configured. The default is 20.\n \t\n* shared-tmp-space\n\n This is the path to a shared space (usually via NFS) available to all nodes on your Hadoop cluster. While this space is not necessary for building and executing custom Hadoop-streaming calls, the \"checkpointing\" calls in HDMC require a shared directory from which to coordinate task and data distribution.\n\n \t\nOnce these are set to your liking, run\n``python setup.py hadoop`` \nto create the hadoop_config.py modules. Then run\n``python setup.py hadoop``\nto install Ziggy\n\nZiggy's Features\n=====================\n\nHDMC\n----------\n\nHDMC provides **3** principle means of interacting with a Hadoop server.\nImport it using ``import ziggy.hdmc``. The interaction types are:\n\n* Call Assembly\n\n Building custom and executing Hadoop streaming calls. This is done using the ``build_generic_hadoop_call``, ``execute`` and ``execute_and_wait`` methods.\n\n* Monte Carlo Mapping\n\n Running Monte Carlo-type operations by providing only a mapping script and a number of iterations. This is done using the ``submit_inline`` and ``submit`` methods.\n\n* Data/Argument Distribution\n\n Processing several datafiles or a list of arguments in parallel across mappers. \n This is done using the ``submit_checkpoint_inline`` and the ``submit_checkpoint`` methods.\n\nIt should be noted that Monte Carlo Mapping and Data Distribution very much\nviolate the *spirit* of Hadoop. However, they do provide a very simple way\nto mimic traditional compute-cluster tasks without the need for cluster management\nalong the lines of SGE or Torque. Similarly, they don't require a *real* cluster,\njust a Hadoop installation.\n\nHDFS\n----------\n\nZiggy provides methods for interacting with the HDFS distributed filesystem\nfrom within Python. These methods can be accessed by importing, for example,\n``import ziggy.hdmc.hdfs`` Method calls mimic those found under ``hadoop dfs``. \n\nUtilities\n----------\n\nZiggy provides a number of simple utilities for manipulating very large datasets\nwith Hadoop. Utilities provided include: search, grep, numeric sort, and ascii sort.\nEach of these is accessed under ``ziggy.util``. **Note** ``ziggy.util.search``\nprovides file names and line numbers a number of files in an HDFS directory or file.\n``ziggy.util.grep`` provides the lines themselves.\n\nGraphReduce\n--------------\n\nWhile Hadoop's Map/Reduce paradigm is poorly suited to graph algorithms, the\nGraphReduce modules allow for certain graph analyses on a Hadoop Cluster. Currently\nanalyes are limited to: degree-based measures, shortest-path based measures, page-rank measures,\nand connected-component measures. Except for page-rank, all path-derived measures rely\non parallel breadth-first search. See the Epydoc documentation for more information.\nGraphReduce can be accessed by importing ``ziggy.GraphReduce``\n\nExamples\n----------\n*Building a custom Hadoop streaming call*::\n\n import ziggy.hdmc as hdmc\n import ziggy.hdmc.hdfs as hdfs\n #load data to hdfs\n hdfs.copyToHDFS(localfilename, hfds_input_filename)\n mapper = '/path/to/mapper.py'\n reducer = '/path/to/reducer.py'\n output_filename ='hdfs_relative/output_filename'\n supporting_files = [list,of,files,mappers,require]\n maps = 20\n hadoop_call = hdmc.build_generic_hadoop_call(mapper, reducer, hdfs_input_filename, output_filename, supporting_files, maps)\n hdmc.execute_and_wait(hadoop_call)\n\t\n*Building a Monte Carlo Job*::\n\n import ziggy.hdmc as hdmc\n mapper = '/path/to/job_with_needs_to_be_done_many_times.py'\n iterations = 1000\n output_file = 'output_filename'\n hdmc.submit_inline(mapper, output_file, iterations)\n\t\n*Building a Task Distribution Job*::\n\n import ziggy.hdmc as hdmc\n url_list = [a, list, of, url, strings]\n mapper = '/path/to/script/which/takes/a/url/as/sys.argv[1].py'\n output_filename = 'output_file_name'\n supporting_files = []\n hdmc.submit_checkpoint_inline(mapper, output_filename, url_list, supporting_files, files=False)\n\n*Building a Data Distribution Job*::\n\n import ziggy.hdmc as hdmc\n file_list = [a, list, of, filenames, usually, provided, by, glob]\n mapper = '/path/to/script/which/takes/a/filename/as/sys.argv[1].py'\n output_filename = 'output_file_name'\n supporting_files = [filenames, my, mapper,needs]\n hdmc.submit_checkpoint_inline(mapper, output_filename, file_list, supporting_files, files=True)", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://pypi.python.org/pypi/Ziggy/", "keywords": null, "license": "LICENSE.txt", "maintainer": null, "maintainer_email": null, "name": "Ziggy", "package_url": "https://pypi.org/project/Ziggy/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/Ziggy/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://pypi.python.org/pypi/Ziggy/" }, "release_url": "https://pypi.org/project/Ziggy/0.1.3.1/", "requires_dist": null, "requires_python": null, "summary": "Python modules for Hadoop Streaming", "version": "0.1.3.1" }, "last_serial": 786080, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "f1e2bf4cacceef65f9ee5d3dd821cb89", "sha256": "3fe0aa60ac4f1401683a4752bcab4c8408b8a88e585949c757d686871aad4d59" }, "downloads": -1, "filename": "Ziggy-0.1.0.tar.gz", "has_sig": false, "md5_digest": "f1e2bf4cacceef65f9ee5d3dd821cb89", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 124344, "upload_time": "2010-09-14T00:34:34", "url": "https://files.pythonhosted.org/packages/2e/e6/b188d7de35404773b00c0379be9e329dae4dfdd9c1d2d1d7bc5016db7c17/Ziggy-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "7e11586a7a3ff81ebf961e2733a2f0ab", "sha256": "4a85c3d5fd121401d9d3356c989ba05ac3b86bc448f17904ae0a2e1756e00a46" }, "downloads": -1, "filename": "Ziggy-0.1.1.tar.gz", "has_sig": false, "md5_digest": "7e11586a7a3ff81ebf961e2733a2f0ab", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 128831, "upload_time": "2010-09-14T18:35:35", "url": "https://files.pythonhosted.org/packages/27/a1/d0a44ba13e67c86bae62039a2d8f8d45acd98ebb8b8ea46a211a7e4c7474/Ziggy-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "7b4f61123550a0877e47c175747651d5", "sha256": "225ecaa0865f2863ac7e51395d0144bf1599a92e76f04195f639b11dcdfe2a2a" }, "downloads": -1, "filename": "Ziggy-0.1.2.tar.gz", "has_sig": false, "md5_digest": "7b4f61123550a0877e47c175747651d5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 128940, "upload_time": "2010-09-14T19:22:21", "url": "https://files.pythonhosted.org/packages/2f/6e/b11a0a124901a15d68434579e9f7d0bb830480d25f561d52e983a705b210/Ziggy-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "a3a0f1ebf0cb207f78cd9957e68d105e", "sha256": "6cbe1b5fb06769b69dbdf07987624ed5d68849b842242561ce8a9073139f46c2" }, "downloads": -1, "filename": "Ziggy-0.1.3.tar.gz", "has_sig": false, "md5_digest": "a3a0f1ebf0cb207f78cd9957e68d105e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2827520, "upload_time": "2010-12-22T22:45:47", "url": "https://files.pythonhosted.org/packages/39/d4/d06fd1474d7896ed514d843ccad7f2760fb503f9a8052c79c3cd49da03c0/Ziggy-0.1.3.tar.gz" } ], "0.1.3.1": [ { "comment_text": "", "digests": { "md5": "9f2eafdf81c17312048a9c72a27a9922", "sha256": "e5a8cdde010bfbf5a0dade4e00852b4e18b9499edf5fe11c74a6252d0b5d6732" }, "downloads": -1, "filename": "Ziggy-0.1.3.1.tar.gz", "has_sig": false, "md5_digest": "9f2eafdf81c17312048a9c72a27a9922", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2827476, "upload_time": "2010-12-22T23:10:28", "url": "https://files.pythonhosted.org/packages/62/7a/806cbbd1359dffd2870e15213593c9e6688e643ec52e9c26493c3dec0cf6/Ziggy-0.1.3.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "9f2eafdf81c17312048a9c72a27a9922", "sha256": "e5a8cdde010bfbf5a0dade4e00852b4e18b9499edf5fe11c74a6252d0b5d6732" }, "downloads": -1, "filename": "Ziggy-0.1.3.1.tar.gz", "has_sig": false, "md5_digest": "9f2eafdf81c17312048a9c72a27a9922", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2827476, "upload_time": "2010-12-22T23:10:28", "url": "https://files.pythonhosted.org/packages/62/7a/806cbbd1359dffd2870e15213593c9e6688e643ec52e9c26493c3dec0cf6/Ziggy-0.1.3.1.tar.gz" } ] }