{ "info": { "author": "John Bjorn Nelson", "author_email": "jbn@abreka.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.2", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Topic :: Scientific/Engineering :: Information Analysis" ], "description": ".. image:: https://travis-ci.org/jbn/vaquero.svg?branch=master\n :target: https://travis-ci.org/jbn/vaquero\n.. image:: https://ci.appveyor.com/api/projects/status/bbs3p2osllgohxco?svg=true\n :target: https://ci.appveyor.com/project/jbn/vaquero/branch/master\n.. image:: https://coveralls.io/repos/github/jbn/vaquero/badge.svg?branch=master\n :target: https://coveralls.io/github/jbn/vaquero?branch=master \n.. image:: https://img.shields.io/pypi/v/vaquero.svg\n :target: https://pypi.python.org/pypi/vaquero\n.. image:: https://img.shields.io/badge/license-MIT-blue.svg\n :target: https://raw.githubusercontent.com/jbn/vaquero/master/LICENSE\n.. image:: https://img.shields.io/pypi/pyversions/vaquero.svg\n :target: https://pypi.python.org/pypi/vaquero\n\nWhat is Vaquero?\n================\n\n.. image:: logo.png\n :alt: vaquero logo\n :align: center\n\nTL;DR\n-----\n\nIt's a library for iterative and interactive data wrangling at\nlaptop-scale. If you spend a lot of time in a `Jupyter\nnotebook `__, trying to clean dirty, raw data, it's\nprobably useful.\n\nIt would be nice if it were possible to write data cleaning code\ncorrectly. But, the people who pay you to do data analysis don't do data\nanalysis and don't understand how dangerous dirty data are, so you\nrarely get the luxury of feeling secure with what you extract. Vaquero\ntries to find a balance between \"business\" demands and good hygiene.\nBorrowing from Larry Wall, it tries \"to make the easy things easy, and\nthe hard things possible.\" In this context, \"hard things\" refers to\nthose wonderfully fun situations where, you write some code that you\nknow will break in the future but you have no time to fix it; then,\nthree months later, it breaks and you have no idea what your code does.\n\nSee also: `On Disappearing\nCode `__\n\nAn Example\n----------\n\nIt's easier to get a sense of \"why\" by looking at a notebook.\n\n- `Notebook for the common usage\n pattern `__\n- `Notebook for inline\n pipelines `__\n\nExpecting Exceptions\n--------------------\n\nVaquero *expects* exceptions, making them pretty unexceptional. But,\nPython's exception handling is cheap, so that's fine (i.e. EAFP --\nEasier to ask for forgiveness than permission). Plus, with dirty data,\nyou know it will probably fail for some records. During development,\nrather than halting each time, vaquero continues on its merry way, up to\nsome failure limit. For each failure, the library logs the exception,\nincluding the name of the file *and* the arguments which resulted in a\nfailure.\n\nAfter you have processed all the documents, you can then inspect the\nerrors. This helps you scan for error patterns, rather than programming\nby the coincidence of the first error raised. Moreover, since you the\noffending function and its arguments, it is easy to update the new\nfunction, ensuring it passes with the prior bad example. Vaquero reloads\nthe pipeline for you. (Or, at least tries to, because reloading is\ntricky.)\n\nModules as Pipelines\n--------------------\n\n Namespaces are one honking great idea -- let's do more of those!\n\nProgrammers use namespaces everywhere to organize their code. Yet, when\nwriting data cleaning code, everything ends up in a big file with lots\nof poorly-named functions. Think: ``from hellishlib import *``. The\nperfectionist in me says, \"this is awful, and I should write it\nproperly, as a full library with lots of unit tests!\" But, for\n\"perfectionists with deadlines,\" that's not possible.\n\nFurthermore, the single-file-of-functions pattern emerges not only\nbecause of time constraints; it's a reflection of the problem! ELT code\nis **inherently** tightly-coupled. Code that extracts this variable\nprobably depends on that one which in turn also depends on some other\none. This leads to a tree of transformations, encapsulated by function\ncalls.\n\nRecognizing this, ``vaquero`` doesn't try to move you away from\ncollecting all your ELT code in a single file. It's going to happen\nanyway. Instead, it makes it safer with some conventions.\n\n1. A module represents a single encapsulated pipeline. It should process\n a well-defined document.\n2. The function definition order is meaningful. Functions at the top of\n the file execute before those above them. Again, it's a pipeline.\n3. As per pythonic convention, functions prefixed with ``_`` are\n private. Here, that means, the pipeline constructor ignores it when\n compiling the pipeline. This gives you nice helper functions.\n4. You're probably not going to use unit tests -- you don't have time.\n But, since it's a module, pepper it with assertions. And, using the\n ``_``-prefix, you can actually write namespaced tests (e.g.\n ``_my_test()``), and immediately call them in the module. (I actually\n write a lot of my code with ``unittest`` in the pipeline module and\n it gets called right before the module fully imports.) Then, when you\n break something, you can't even start pipeline processing. It fails\n fast. (You can deviate from this pattern -- but, in general, don't.)\n\nInstallation\n------------\n\n.. code:: sh\n\n pip install vaquero\n\nTips\n----\n\nThe ``f(src, dst)`` Pattern\n~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nFor most of my pipelines, I tend to write functions that look like,\n\n.. code:: python\n\n def f(src_d, dst_d):\n dst_d['age'] = int(src_d['AGE1'])\n\nComing from functional languages, I'd prefer immutable objects. But, in\nPython, that tends to be painfully slow. This pattern represents a\ncompromise that usually works well. On the one side (``dst_d``) you have\nalready processed elements; on the other, the raw data.\n\nHidden field pattern\n~~~~~~~~~~~~~~~~~~~~\n\n- Assume you are processing a pipeline with a dict destination\n document. Use '\\_key\\_name' fields for intermediary results in a\n document. You can delete them at the end of the pipeline (easily, via\n ``vaquero.transformations.remove_private_keys``), but in the interim,\n you'll see these fields on failure.\n\nDisclaimer\n----------\n\nI have this big monstrous library called vaquero on my computer. It's a\ncollection of lots of functions I've written over (entirely too) many\ndata munging projects. I use it often, and keep telling myself \"once I\nfind the time, I'll release it!\" And, that never happens. It's too big\nto clean up in a way that makes me comfortable. Instead, I'll be\nreleasing little bits of code in a ad-hoc, just-in-time fashion. When I\nabsolutely need some feature of the big library going forward, I'll\nextract it and put it here.\n\nThat makes me wildly uncomfortable, but...I'm starving for time.\n\nIn any case, library-user beware. Things will break.\n\n.. |Build Status| image:: https://travis-ci.org/jbn/vaquero.svg?branch=master\n :target: https://travis-ci.org/jbn/vaquero\n.. |Coverage Status| image:: https://coveralls.io/repos/github/jbn/vaquero/badge.svg?branch=master\n :target: https://coveralls.io/github/jbn/vaquero?branch=master\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/jbn/vaquero", "keywords": "data analysis", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "vaquero", "package_url": "https://pypi.org/project/vaquero/", "platform": "", "project_url": "https://pypi.org/project/vaquero/", "project_urls": { "Homepage": "https://github.com/jbn/vaquero" }, "release_url": "https://pypi.org/project/vaquero/0.0.5/", "requires_dist": null, "requires_python": "", "summary": "A library for iterative and interactive data wrangling", "version": "0.0.5" }, "last_serial": 3562140, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "1fd79f3ceca146b71ace0f0f7eed3650", "sha256": "34eb27d066a45a1c438cd18a61f78740d5d1219205814b6259dcc7e5eaab9eef" }, "downloads": -1, "filename": "vaquero-0.0.1.tar.gz", "has_sig": false, "md5_digest": "1fd79f3ceca146b71ace0f0f7eed3650", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12670, "upload_time": "2016-11-04T20:01:39", "url": "https://files.pythonhosted.org/packages/6b/49/3eeba62700b1cf417400a57f4765fc80d4292e63c0da7a7084eca96d2e79/vaquero-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "eff1a5946a0ac880b2fab778bc321be3", "sha256": "b12401ca57b50d5f0f5add344796072d3c0ed0b1ea0ab153f97263a66c07f90e" }, "downloads": -1, "filename": "vaquero-0.0.2.tar.gz", "has_sig": false, "md5_digest": "eff1a5946a0ac880b2fab778bc321be3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12829, "upload_time": "2016-11-10T22:57:05", "url": "https://files.pythonhosted.org/packages/c8/e3/38333a32da729c33bcd3a1d5521a3c03ee5b992d4556b590506dc9ba3245/vaquero-0.0.2.tar.gz" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "c3db0fe947a1547e7790e12c290ace37", "sha256": "e8af88f143f7e3216d90224b477f540cd57b20341c70b99061879fb475a2fdca" }, "downloads": -1, "filename": "vaquero-0.0.3.tar.gz", "has_sig": false, "md5_digest": "c3db0fe947a1547e7790e12c290ace37", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14568, "upload_time": "2017-10-02T19:01:07", "url": "https://files.pythonhosted.org/packages/e2/75/4c804161b3d96b8a46b91c075f285ec08dc8c2ca6c96182a0133eda821c3/vaquero-0.0.3.tar.gz" } ], "0.0.4": [ { "comment_text": "", "digests": { "md5": "34d274174c69875473158b7538da4cee", "sha256": "ddace07a900cbe15d3d71a6e7e8744efce64e3feece38888028396abd7424918" }, "downloads": -1, "filename": "vaquero-0.0.4.tar.gz", "has_sig": false, "md5_digest": "34d274174c69875473158b7538da4cee", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24936, "upload_time": "2018-02-07T23:15:06", "url": "https://files.pythonhosted.org/packages/40/9f/b4340ffa9dbeb302b2f8510761f859feb8553575eb230e5c37e64417f217/vaquero-0.0.4.tar.gz" } ], "0.0.5": [ { "comment_text": "", "digests": { "md5": "a7f15e98aa4a02b5cdda72724741488e", "sha256": "ca67560c2234ff7d1d84660430f06e15b4dc11a1c40a7aebacf74314d6943b56" }, "downloads": -1, "filename": "vaquero-0.0.5.tar.gz", "has_sig": false, "md5_digest": "a7f15e98aa4a02b5cdda72724741488e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24959, "upload_time": "2018-02-08T00:03:48", "url": "https://files.pythonhosted.org/packages/d1/9f/ea6a3d6f77506ef462d46690a4123b747b8400d333ffd36c9b774ce0995e/vaquero-0.0.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "a7f15e98aa4a02b5cdda72724741488e", "sha256": "ca67560c2234ff7d1d84660430f06e15b4dc11a1c40a7aebacf74314d6943b56" }, "downloads": -1, "filename": "vaquero-0.0.5.tar.gz", "has_sig": false, "md5_digest": "a7f15e98aa4a02b5cdda72724741488e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24959, "upload_time": "2018-02-08T00:03:48", "url": "https://files.pythonhosted.org/packages/d1/9f/ea6a3d6f77506ef462d46690a4123b747b8400d333ffd36c9b774ce0995e/vaquero-0.0.5.tar.gz" } ] }