{ "info": { "author": "Jay Marcyes", "author_email": "jay@marcyes.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3" ], "description": "Mister\n======\n\nFor all your medium data needs!\n\nMister attempts to make running a map/reduce job approachable.\n\nIt's for when you've got data that isn't really big, so you're not quite ready\nto distribute the data across a gazillion machines and stuff, but you would\nstill like an answer in a reasonable amount of time.\n\n5-minute getting started\n------------------------\n\nMister needs you to define three methods: ``prepare`` (get the data\nready to be run across multiple processes), ``map`` (actually do\nsomething with the chunks of data from ``prepare``), and ``reduce``\n(mash all the values returned from ``map`` together).\n\nThe ``prepare`` method\n~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n    prepare(self, count, *args, **kwargs)\n\nThe ``count`` is the number of processes the job will be run across, and\n``*args`` and ``**kwargs`` are whatever is passed into your child class's\n``__init__`` method. The ``prepare`` method returns **count** rows,\neach containing a tuple ``((), {})`` of the arguments that will be passed to\none ``map`` process.\n\nThe ``map`` method\n~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n    map(self, *args, **kwargs)\n\nThe ``*args`` and ``**kwargs`` are whatever was returned from\n``prepare``. The ``map`` method returns whatever you want ``reduce`` to\nuse to merge all the data together.\n\nThe ``reduce`` method\n~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: python\n\n    reduce(self, output, value)\n\nThe ``output`` is the global aggregation of all the ``value`` arguments\nthe ``reduce`` method has seen. Basically, whatever you return from one\n``reduce`` call will be passed back into the next ``reduce`` call as\n``output``. 
The ``value`` argument is whatever the recently finished\n``map`` call returned.\n\nBringing it all together\n~~~~~~~~~~~~~~~~~~~~~~~~\n\nSo let's bring it all together in our ``MrHelloWorld`` job. First, let's\nget the skeleton in place:\n\n.. code:: python\n\n    from mister import BaseMister\n\n\n    class MrHelloWorld(BaseMister):\n        def prepare(self, count, *args, **kwargs): pass\n        def map(self, *args, **kwargs): pass\n        def reduce(self, output, value): pass\n\nNow let's flesh out the ``prepare`` method:\n\n.. code:: python\n\n    def prepare(self, count, name):\n        # we're just going to return the number and the name we pass in\n        for x in range(count):\n            yield ([x, name], {})\n\nAnd our ``map`` method:\n\n.. code:: python\n\n    def map(self, x, name):\n        return \"Process {} says 'hello {}'\".format(x, name)\n\nFinally, our ``reduce`` method:\n\n.. code:: python\n\n    def reduce(self, output, value):\n        if output is None:\n            output = []\n        output.append(value)\n        return output\n\nRunning our job:\n\n.. code:: python\n\n    mr = MrHelloWorld(\"Alice\")\n    output = mr.run()\n    print(output)\n\nwill result in:\n\n::\n\n    [\n        \"Process 1 says 'hello Alice'\",\n        \"Process 0 says 'hello Alice'\",\n        \"Process 2 says 'hello Alice'\",\n        \"Process 3 says 'hello Alice'\",\n        \"Process 4 says 'hello Alice'\",\n        \"Process 5 says 'hello Alice'\",\n        \"Process 6 says 'hello Alice'\",\n        \"Process 7 says 'hello Alice'\",\n        \"Process 8 says 'hello Alice'\",\n        \"Process 9 says 'hello Alice'\",\n        \"Process 10 says 'hello Alice'\"\n    ]\n\nCongrats, you just ran a map/reduce job! You are now an AI and an ML\nengineer. Remember me when you're famous!\n\nAnother Example\n---------------\n\nI think word counting is the traditional map/reduce example? So here it\nis:\n\n.. 
code:: python\n\n    import os\n    import re\n    import math\n    from collections import Counter\n\n    from mister import BaseMister\n\n\n    class MrWordCount(BaseMister):\n        def prepare(self, count, path):\n            \"\"\"prepare segments the data for the map() method\"\"\"\n            size = os.path.getsize(path)\n            # float() keeps this true division on Python 2 as well\n            length = int(math.ceil(size / float(count)))\n            start = 0\n            for x in range(count):\n                kwargs = {}\n                kwargs[\"path\"] = path\n                kwargs[\"start\"] = start\n                kwargs[\"length\"] = length\n                start += length\n                yield (), kwargs\n\n        def map(self, path, start, length):\n            \"\"\"all the magic happens right here\"\"\"\n            output = Counter()\n            with open(path) as fp:\n                fp.seek(start, 0)\n                words = fp.read(length)\n\n            # I don't compensate for word boundaries because example\n            for word in re.split(r\"\\s+\", words):\n                output[word] += 1\n            return output\n\n        def reduce(self, output, count):\n            \"\"\"take all the return values from map() and aggregate them to the final value\"\"\"\n            if not output:\n                output = Counter()\n            output.update(count)\n            return output\n\n    # let's count the bible\n    path = \"./testdata/bible-kjv.txt\"\n    mr = MrWordCount(path)\n    wordcounts = mr.run()\n    print(wordcounts.most_common(10))\n\nOn my computer, the asynchronous code above runs about 3x faster than\nits synchronous equivalent below:\n\n.. 
code:: python\n\n    import re\n    from collections import Counter\n\n    path = \"./testdata/bible-kjv.txt\"\n\n    output = Counter()\n    with open(path) as fp:\n        words = fp.read()\n\n    for word in re.split(r\"\\s+\", words):\n        output[word] += 1\n\n    print(output.most_common(10))\n\nInstallation\n------------\n\nTo install, use pip:\n\n::\n\n    $ pip install mister\n\nOr, to grab the latest and greatest:\n\n::\n\n    $ pip install --upgrade git+https://github.com/Jaymon/mister#egg=mister\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/Jaymon/mister", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "mister", "package_url": "https://pypi.org/project/mister/", "platform": "", "project_url": "https://pypi.org/project/mister/", "project_urls": { "Homepage": "http://github.com/Jaymon/mister" }, "release_url": "https://pypi.org/project/mister/0.0.2/", "requires_dist": null, "requires_python": "", "summary": "Approachable map/reduce jobs", "version": "0.0.2" }, "last_serial": 4542151, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "94f97b2606fb50d1fb1e247255a09ee1", "sha256": "e97fdbac6bd8ea5e40ad1f4c752ad7898eed85c92d09a8756cf812cd9cfabf9f" }, "downloads": -1, "filename": "mister-0.0.1.tar.gz", "has_sig": false, "md5_digest": "94f97b2606fb50d1fb1e247255a09ee1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5194, "upload_time": "2018-11-29T08:11:32", "url": "https://files.pythonhosted.org/packages/a5/26/e7e4807b70581516d30046119940948792c1eb0dbe56fd129ac18ae4d36d/mister-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "fa3344ff1b357e75b20c1945162589fe", "sha256": "5b837b952920189e6d1e189f5a4a09d6d24596623a13c59519c142fe761f5afd" }, "downloads": -1, "filename": "mister-0.0.2.tar.gz", "has_sig": false, "md5_digest": 
"fa3344ff1b357e75b20c1945162589fe", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6513, "upload_time": "2018-11-29T08:48:01", "url": "https://files.pythonhosted.org/packages/38/73/887e8787648bb26862e4c042472cb9046af0155e235fdf5c32a14438d61b/mister-0.0.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "fa3344ff1b357e75b20c1945162589fe", "sha256": "5b837b952920189e6d1e189f5a4a09d6d24596623a13c59519c142fe761f5afd" }, "downloads": -1, "filename": "mister-0.0.2.tar.gz", "has_sig": false, "md5_digest": "fa3344ff1b357e75b20c1945162589fe", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6513, "upload_time": "2018-11-29T08:48:01", "url": "https://files.pythonhosted.org/packages/38/73/887e8787648bb26862e4c042472cb9046af0155e235fdf5c32a14438d61b/mister-0.0.2.tar.gz" } ] }