{ "info": { "author": "Kevin Wurster", "author_email": "wursterk@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: Implementation :: PyPy", "Topic :: Software Development :: Libraries", "Topic :: Text Processing", "Topic :: Utilities" ], "description": "======\ntinymr\n======\n\nExperimental in-memory Pythonic MapReduce inspired by `Spotify's luigi framework `_.\n\n.. image:: https://travis-ci.org/geowurster/tinymr.svg?branch=master\n :target: https://travis-ci.org/geowurster/tinymr?branch=master\n\n.. image:: https://coveralls.io/repos/geowurster/tinymr/badge.svg?branch=master\n :target: https://coveralls.io/r/geowurster/tinymr?branch=master\n\n\nThe Word Count Example\n======================\n\nCurrently the only MapReduce implementation is in-memory and serial, but an\nimplementation with parallelized map and reduce phases will be added.\n\n.. code-block:: python\n\n from collections import Counter\n import json\n import re\n import sys\n\n from tinymr.memory import MemMapReduce\n\n\n class WordCount(MemMapReduce):\n\n \"\"\"\n The go-to MapReduce example.\n\n Don't worry, a better example is on its way:\n https://github.com/geowurster/tinymr/issues/11\n \"\"\"\n\n # Counting word occurrence does not benefit from sorting post-map or\n # post-reduce and our `final_reducer()` doesn't care about key order\n # so we can disable sorting for a speed boost.\n sort_map = False\n sort_reduce = False\n sort_final_reduce = False\n\n def __init__(self):\n\n \"\"\"\n Stash a regex to strip off punctuation so we can grab it later.\n \"\"\"\n\n self.pattern = re.compile('[\\W_]+')\n\n def mapper(self, line):\n\n \"\"\"\n Take a line of text from the input file and figure out how many\n times each word appears.\n\n An alternative, simpler implementation would be:\n\n def mapper(self, item):\n for word in item.split():\n word = self.pattern.sub('', word)\n if word:\n yield word.lower(), 1\n\n This simpler example is more straightforward, but holds more data\n in-memory. The implementation below does more work per line but\n potentially has a smaller memory footprint. Like anything\n MapReduce the implementation benefits greatly from knowing a lot\n about the input data.\n \"\"\"\n\n # Normalize all words to lowercase\n line = line.lower().strip()\n\n # Strip off punctuation\n line = [self.pattern.sub('', word) for word in line]\n\n # Filter out empty strings\n line = filter(lambda x: x != '', line)\n\n # Return an iterable of `(word, count)` pairs\n return Counter(line).items()\n\n def reducer(self, key, values):\n\n \"\"\"\n Just sum the number of times each word appears in the entire\n dataset.\n\n At this phase `key` is a word and `values` is an iterable producing\n all of the values for that word from the map phase. Something like:\n\n key = 'the'\n values = (1, 1, 2, 2, 1)\n\n The word `the` appeared once on 3 lines and twice on two lines for\n a total of `7` occurrences, so we yield:\n\n ('the', 7)\n \"\"\"\n\n yield key, sum(values)\n\n def output(self, pairs):\n\n \"\"\"\n Normally this phase is where the final dataset is written to disk,\n but since we're operating in-memory we just want to re-structure as\n a dictionary.\n\n `pairs` is an iterator producing `(key, iter(values))` tuples from\n the reduce phase, and since we know that we only produced one key\n from each reduce we want to extract it for easier access later.\n \"\"\"\n\n return {k: tuple(v)[0] for k, v in pairs}\n\n\n wc = WordCount()\n with open('LICENSE.txt') as f:\n out = wc(f)\n print(json.dumps(out, indent=4, sort_keys=True))\n\nTruncated output:\n\n.. code-block:: json\n\n {\n \"a\": 1,\n \"above\": 2,\n \"advised\": 1,\n \"all\": 1,\n \"and\": 8,\n \"andor\": 1\n }\n\nWord Count Workflow\n-------------------\n\nInternally, the workflow looks like this:\n\n**Input data**:\n\n.. code-block:: console\n\n $ head -10 LICENSE.txt\n\n New BSD License\n\n Copyright (c) 2015, Kevin D. Wurster\n All rights reserved.\n\n Redistribution and use in source and binary forms, with or without\n modification, are permitted provided that the following conditions are met:\n\n * Redistributions of source code must retain the above copyright notice, this\n list of conditions and the following disclaimer.\n\n**Map**\n\nCount occurrences of each word in every line.\n\n.. code-block:: python\n\n # Input line\n line = 'Copyright (c) 2015, Kevin D. Wurster'\n\n # Sanitized words\n words = ['Copyright', 'c', '2015', 'Kevin', 'D', 'Wurster']\n\n # Return tuples with word as the first element and count as the second\n pairs = [('Copyright', 1), ('c', 1), ('2015', 1), ('Kevin', 1), ('D', 1), ('Wurster', 1)]\n\n**Partition**\n\nOrganize all of the ``(word, count)`` pairs by ``word``. The ``word`` keys are\nkept at this point in case the data is sorted. Sorting grabs the second to last\nkey, so the data could be partitioned on one key and sorted on another with\n``(word, sort, count)``. The second to last key is used for sorting so the keys\nthat appear below match the ``word`` only because a ``sort`` key was not given.\n\nWords that appear in the input text on multiple lines have multiple\n``(word, count)`` pairs. A ``count`` of ``2`` would indicate a word that\nappeared twice on a single line, but our input data does not have this\ncondition. Truncated output below. The dictionary values are lists containing\ntuples to allow for a sort key, which is explained elsewhere.\n\n.. code-block:: python\n\n {\n '2015': [(1,)]\n 'above': [(1,)]\n 'all': [(1,)]\n 'and': [(1,), (1,), (1,)]\n 'are': [(1,), (1,)]\n 'binary': [(1,)]\n 'bsd': [(1,)]\n 'c': [(1,)]\n 'code': [(1,)]\n }\n\n\n**Reduce**\n\nSum ``count`` for each ``word``.\n\n.. code-block:: python\n\n # The ``reducer()`` receives a key and an iterator of values\n key = 'the'\n values = (1, 1, 1)\n def reducer(key, values):\n yield key, sum(values)\n\n**Partition**\n\nThe reducer does not _have_ to produces the same key it was given, so the data\nis partitioned by key again, which is superfluous for this wordcount example.\nAgain the keys are kept in case the data is sorted and only match ``word``\nbecause an optional ``sort`` key was not given. Truncated output below.\n\n.. code-block:: python\n\n {\n '2015': [(1,)]\n 'above': [(1,)]\n 'all': [(1,)]\n 'and': [(3,)]\n 'are': [(2,)]\n 'binary': [(1,)]\n 'bsd': [(1,)]\n 'c': [(1,)]\n 'code': [(1,)]\n }\n\n**Output**\n\nThe default implementation is to return ``(key, iter(values))`` pairs from the\n``final_reducer()``, which would look something like:\n\n.. code-block:: python\n\n values = [\n ('the', (3,)),\n ('in', (1,),\n ]\n\nBut a dictionary is much more useful, and we know that we only got a single\nvalue for each ``word`` in the reduce phase, so we can extract that value\nand produce a dictionary.\n\n.. code-block:: python\n\n return {k: tuple(v)[0] for k, v in values}\n\nThe ``tuple()`` call is included because the data in the ``value`` key is\n_always_ an iterable object but _may_ be an iterator. Calling ``tuple()``\nexpands the iterable and lets us get the first element.\n\n\nDeveloping\n==========\n\n.. code-block:: console\n\n $ git clone https://github.com/geowurster/tinymr.git\n $ cd tinymr\n $ pip install -e .\\[dev\\]\n $ py.test tests --cov tinymr --cov-report term-missing\n\n\nLicense\n=======\n\nSee ``LICENSE.txt``\n\n\nChangelog\n=========\n\nSee ``CHANGES.md``", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/geowurster/tinysort", "keywords": "sorting tools memory", "license": "New BSD", "maintainer": null, "maintainer_email": null, "name": "tinysort", "package_url": "https://pypi.org/project/tinysort/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/tinysort/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/geowurster/tinysort" }, "release_url": "https://pypi.org/project/tinysort/0.1/", "requires_dist": null, "requires_python": null, "summary": "General purpose tools for sorting with limited memory.", "version": "0.1" }, "last_serial": 1973346, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "9b13db61889b39260c01ccbce0fd2003", "sha256": "fa7ccff649744344ea9703376b1ecc215493b76abb3b5e772426171c0276a420" }, "downloads": -1, "filename": "tinysort-0.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "9b13db61889b39260c01ccbce0fd2003", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 26100, "upload_time": "2016-02-24T05:42:56", "url": "https://files.pythonhosted.org/packages/1e/fe/dd5ea938c19f0363792ff6497b6dce77509ad6b73e952dc76c7604aa1223/tinysort-0.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "71009a3a21b75963c2ee7d797f94b682", "sha256": "ed0d82d6601b654baf1623cb45fa0f3be4b46377aff915b822bb0cb82ec8ccba" }, "downloads": -1, "filename": "tinysort-0.1.tar.gz", "has_sig": false, "md5_digest": "71009a3a21b75963c2ee7d797f94b682", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 29025, "upload_time": "2016-02-24T05:43:01", "url": "https://files.pythonhosted.org/packages/47/3f/32bd1de477d58a98c5276064fe1ddceded9d0fb299e9950fb368560620f9/tinysort-0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "9b13db61889b39260c01ccbce0fd2003", "sha256": "fa7ccff649744344ea9703376b1ecc215493b76abb3b5e772426171c0276a420" }, "downloads": -1, "filename": "tinysort-0.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "9b13db61889b39260c01ccbce0fd2003", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 26100, "upload_time": "2016-02-24T05:42:56", "url": "https://files.pythonhosted.org/packages/1e/fe/dd5ea938c19f0363792ff6497b6dce77509ad6b73e952dc76c7604aa1223/tinysort-0.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "71009a3a21b75963c2ee7d797f94b682", "sha256": "ed0d82d6601b654baf1623cb45fa0f3be4b46377aff915b822bb0cb82ec8ccba" }, "downloads": -1, "filename": "tinysort-0.1.tar.gz", "has_sig": false, "md5_digest": "71009a3a21b75963c2ee7d797f94b682", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 29025, "upload_time": "2016-02-24T05:43:01", "url": "https://files.pythonhosted.org/packages/47/3f/32bd1de477d58a98c5276064fe1ddceded9d0fb299e9950fb368560620f9/tinysort-0.1.tar.gz" } ] }