{ "info": { "author": "John Bjorn Nelson", "author_email": "jbn@abreka.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6" ], "description": ".. image:: https://img.shields.io/pypi/v/modpipe.svg\n :target: https://pypi.python.org/pypi/modpipe\n.. image:: https://img.shields.io/travis/jbn/modpipe.svg\n :target: https://travis-ci.org/jbn/modpipe\n.. image:: https://coveralls.io/repos/github/jbn/modpipe/badge.svg?branch=master\n :target: https://coveralls.io/github/jbn/modpipe?branch=master\n\n\n=============================\nmodpipe: modules as pipelines\n=============================\n\n--------------\nWhat is this?\n--------------\n\nAn opinionated package that loads a module as a callable pipeline.\n\n------------\nInstallation\n------------\n\n.. code-block:: bash\n \n pip install modpipe\n\n # For Python 2.7, also do,\n pip install git+git://github.com/jbn/funcsigs.git@214840c53529f019638229d72dcf2257fe154458\n\nPython 2.7 requires the `functools `_ \nbackport. But, the current master branch doesn't have support for pickling.\nUntil my `PR `_ gets\naccepted, the pypi package won't work in spark environments. the above pip\ncommand will install my branch.\n\n-------------\nWhy is this?\n-------------\n\nVery frequently, I interactively write data transformations in \n`Jupyter `_; then, once it works as (initially) expected,\nI move it to *some* module. Recognizing that pattern, this package makes using \nmodules as pipelines (esp. in the context of ETL) easier. \n\n------------------------\nCan I see an example?\n------------------------\n\nAssume you have a set of raw items (probably deserialized objects) that you\nwant to transform into clean ones. You accumulate your transformation functions\nin ``ingest_pipeline.py``. When the following code runs,\n\n.. code-block:: python\n \n import modpipe\n\n with modpipe.ModPipe.on('ingest_pipeline') as f:\n clean_items = [f(item) for item in raw_items]\n\nit automatically re/loads the ``ingest_pipeline`` module, sythesizing a \ncomposition ordered by source code line position. That's a mouth full. \nConcretely, say you have the following in ``ingest_pipeline.py``,\n\n.. code-block:: python\n \n def twice(x):\n return x * 2\n\n def weird_pair(x):\n return x, -x\n\nCalling ``f`` -- an instance of ``ModPipe`` -- works as if it were defined as,\n\n.. code-block:: python\n \n def f(x):\n return weird_pair(twice(x))\n\n----------\nSo What?\n----------\n\nIf you do a painful amount of ETL work, this might strike you as useful. \nIf you don't, you might be saying, \"so what?\" Unfortunately, I think this \ndoesn't communicate well without a developer-in-motion screencast (XXX: TODO).\nBut, for now, I'll lean on the concluding line in Tim Peter's wonderful \n`PEP20 `_,\n\n Namespaces are one honking great idea -- let's do more of those!\n\nBy moving transformational code into modules, you free up notebook \n(and cognitive space) for subsequent steps. If you spend a little time \nproperly naming the module, it's somewhat easy to navigate. If you \npartition your transformation code into different logical units, it's somewhat\nrobust. And, if you pepper your pipeline module code with assertions, it \ndocuments your expectations for the data. \n\nEveryone wants to write ETL code the same way that they would write other code: \ncarefully and with a good test suite. But, realistically, we don't do \nthat because of various constraints. Plus, it's *really* tedious code. \nThis package (and, the one I extracted it from, \n`vaquero `_) recognizes your competing \ndemands and the reality of the task, and works with you.\n\n\n----------------------\nHow is it opinionated?\n----------------------\n\n~~~~~~~~~~~~~~~~~~~\nAuxillary Functions\n~~~~~~~~~~~~~~~~~~~\n\nBy default, underscore-prefixed callables do not enter in the pipeline. \nBy assumption, these callables are **auxillary functions** used by the \nones you really want in the linearized pipeline.\n\n.. code-block:: python\n \n def _encode_as_binary(x): # Not in pipeline by default!\n return bin(x)[2:].rjust(32, '0')\n\n def convert_seeds(seeds):\n return [_encode_as_binary(seed) for seed in seeds]\n\n~~~~~~~~~~~~~~~~~~~~~~~~\nAssert Your Expectations\n~~~~~~~~~~~~~~~~~~~~~~~~\n\nAs mentioned above, unit-testing ETL code is a pain. And, since dirty data \nviolate expectations near-continously, it seriously impedes progress. \nRather than relying upon unit tests for the transformation functions, \nit's easier to write in-line (in the context of the module) assertions \nthat document your assumptions and guard against inadvertent regressions \nby failing loudly at re/import time. \n\nThe following (contrived) example ensures that your pipeline's sqrt function\nproperly computes the square root and returns the result in the same\nnumeric type as given.\n\n.. code-block:: python\n \n def sqrt(x):\n return type(x)(x ** 0.5)\n \n assert sqrt(1776) == 42, \"Uh oh!\" # Fails loudly if bad code!\n\n\n~~~~~~~~~~~~~~~~~~\nDAGS are Confusing\n~~~~~~~~~~~~~~~~~~\n\nDirected Acyclic Graphs (DAGs) are greate when computers construct \nthem for you. But, in lots of contexts, they make it hard to reason \nabout what your code is doing when it fails. For the most part, the \npipeline is a linearized composition of the functions in your module. \nThus, if the function on line 85 raises an exception when used, you \nknow that only the functions above have already executed. This is a \nsurprisingly useful cognitive device, especially when you step way \nfrom your code for six months and visit it again only when it \nbecomes a problem.\n\nBut, there are two exception to this simple linearization. Sometimes, \nit is necessary to either: 1) abort the pipeline early or 2) skip \nover some of the functions. This package provides a sentinel return \nvalue for both cases.\n\n\n.. code-block:: python\n \n from lxml.html import fromstring\n from modpipe import SkipTo, Done\n\n def extract_doc(raw_html):\n if raw_html.strip():\n return {'doc': fromstring(raw_html)}\n else:\n return Done(None) # Nothing can be done! Abort!\n\n def extract_title(d):\n for title in d['doc'].xpath(\"//title/text()\"):\n d['title'] = title\n\n if 'error' in d['title'].lower():\n return SkipTo(cleanup, d) # Skip to cleanup!\n else:\n return d\n\n def extract_headers(d):\n d['headers'] = d['doc'].xpath('//h1/text()')\n\n def cleanup(d):\n del d['doc']\n\n\n~~~~~~~~~~~~~~\nReturning None\n~~~~~~~~~~~~~~\n\nIn the prior code listing, ``extract_headers`` and ``cleanup`` did in-place \ntransformations on the passed dict. To cut down on LoC while communicating \nmutation, neither returned a value. There are pros on cons to this style. \nBut, in any case, ``modpipe`` handles it by assuming the given arguments to a \nfunction that returns ``None`` should be passed to the next function in the \npipe. Thus, cleanup receives ``d``.\n\nThis begs the question: if you want to return None, how do you do so? In \nthis case, you need to return a ``Result``. For example,\n\n.. code-block:: python\n \n from modpipe import Result\n \n def f(s):\n tok = s.upper().strip()\n return tok if tok else Result(None) # or Done(None)\n\n\n~~~~~~~~~~~~~~~~~~\nTuples are special\n~~~~~~~~~~~~~~~~~~\n\nIf you return a tuple from a function and that tuple's length matches \nthe arity of the next function in the pipeline, modpipe star-expands\nit when calling the next function, otherwise, it does f(res). \n\n.. code-block:: python\n \n def f(x):\n return x, -x # i.e. add(x, -x)\n\n def g(x, y):\n return x + y, x * y # i.e. h(x + y, x * y)\n\n def h(items):\n return sum(items)\n\nThis works for tuples and tuples alone. (That is, if you returned a list, it \nalways passes the whole list as an argument.) You'll note that the call \nstructure doesn't allow for keyword arguments. I've tried working around this \nbut I didn't find anything that wasn't intrusive. \n\n~~~~~~~~~~~~~~~~~~~~~~~\nIs there anything else?\n~~~~~~~~~~~~~~~~~~~~~~~\n\nYes. Pipelines in ``modpipe`` are very \n`pyspark `_ \nfriendly. Although the Spark team doesn't recommend using RDDs anymore, \nSpark is useful for writing ETL pipelines. But, python object serialization \nand deserialization adds a lot of expense to chains of transformations in \npyspark proper (i.e. ``map`` on RDDs). If you collect your \ntransformations into logical units serialized as modules, it amortizes the \npickling-related expenses. It won't be scala speed, but at least you can \ntake advantage of already existing infrastructure in a somewhat more \nperformant manner.\n\n\n~~~~\nMisc\n~~~~\n\nI think I got this idea from `PyMC3 `_. For the major \nversion bump, lots of examples started using modules and I thought it was \nannoying at first. Then I realized how nice it can be. Since modeling and \nETL tend to go hand-to-hand (albeit in a 1:99 ratio), I started writing my \nETL code in the same way. I'm sure I'm not the first to do so, but I hadn't \nseen it before. (It's probably just one of those things that lots of people \ndo naturally without writing up.)\n\nI also wanted to point out `bonobo `_. \nIt's a lot more mature and flexible. According to the docs,\n\n Bonobo is a young rewrite of an old python2.7 tool that ran millions of \n transformations per day for years on production. Although it may not \n yet be complete or fully stable (please, allow us to reach 1.0), \n the basics are there.\n\nStill, for 90% of projects, vaquero (which uses modpipe) suits me better.\n\n\n=======\nHistory\n=======\n\n0.0.1 (2018-04-30)\n------------------\n\n* First release on PyPI.\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/jbn/modpipe", "keywords": "modpipe", "license": "MIT license", "maintainer": "", "maintainer_email": "", "name": "modpipe", "package_url": "https://pypi.org/project/modpipe/", "platform": "", "project_url": "https://pypi.org/project/modpipe/", "project_urls": { "Homepage": "https://github.com/jbn/modpipe" }, "release_url": "https://pypi.org/project/modpipe/0.0.4/", "requires_dist": null, "requires_python": "", "summary": "A package that loads a module as a callable pipeline.", "version": "0.0.4" }, "last_serial": 4098370, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "c0bf92221837ca24cf5be0af0d0fe036", "sha256": "d2f45df39bea1867ab926320dcc168149993aa393b7706a170db59e8f3906e6b" }, "downloads": -1, "filename": "modpipe-0.0.1.tar.gz", "has_sig": false, "md5_digest": "c0bf92221837ca24cf5be0af0d0fe036", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11016, "upload_time": "2018-06-13T23:36:55", "url": "https://files.pythonhosted.org/packages/4b/58/327124f6319e06c4952a9123c60c20416af5ca5b1e84439651fe2118b87a/modpipe-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "9afc696fae167f0531caf8f1782d361a", "sha256": "36a5b9d8df6c0e41478676a2b5f6bd4679183686a9dd8f168913fc725447ecb5" }, "downloads": -1, "filename": "modpipe-0.0.2.tar.gz", "has_sig": false, "md5_digest": "9afc696fae167f0531caf8f1782d361a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16100, "upload_time": "2018-06-15T23:44:02", "url": "https://files.pythonhosted.org/packages/31/1c/91f4a80f764cb34a7169f0b0454875220acf2001a957a5dba73fc932d2d7/modpipe-0.0.2.tar.gz" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "348055fbc16e198fe8d7fffdad4e6114", "sha256": "adb20b4bc8567563efb0f949f6be9c19e1d8dc5504b9d4850cd471c17d17953a" }, "downloads": -1, "filename": "modpipe-0.0.3.tar.gz", "has_sig": false, "md5_digest": "348055fbc16e198fe8d7fffdad4e6114", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16536, "upload_time": "2018-07-17T23:24:23", "url": "https://files.pythonhosted.org/packages/e1/5c/c1cee9db3962fe4b8a3ad76ac5a3aaab6b83592684c61cb97b2fb127e4ce/modpipe-0.0.3.tar.gz" } ], "0.0.4": [ { "comment_text": "", "digests": { "md5": "eb75323092a7b5c10270994401ab0791", "sha256": "c256c4b2f0b75a50e916e037e75a1014122cabc0f931d8c8586bd5d112f1ec4e" }, "downloads": -1, "filename": "modpipe-0.0.4.tar.gz", "has_sig": false, "md5_digest": "eb75323092a7b5c10270994401ab0791", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16728, "upload_time": "2018-07-24T20:27:08", "url": "https://files.pythonhosted.org/packages/53/81/9f28f3aa03fb78be5512b815256d44ca326d1e2d4c3f204c05e8e0f632c1/modpipe-0.0.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "eb75323092a7b5c10270994401ab0791", "sha256": "c256c4b2f0b75a50e916e037e75a1014122cabc0f931d8c8586bd5d112f1ec4e" }, "downloads": -1, "filename": "modpipe-0.0.4.tar.gz", "has_sig": false, "md5_digest": "eb75323092a7b5c10270994401ab0791", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16728, "upload_time": "2018-07-24T20:27:08", "url": "https://files.pythonhosted.org/packages/53/81/9f28f3aa03fb78be5512b815256d44ca326d1e2d4c3f204c05e8e0f632c1/modpipe-0.0.4.tar.gz" } ] }