{ "info": { "author": "Will Larson", "author_email": "lethain@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# organize\n\nReal world data is messy:\n\n1. the [IMF eLibrary Data](http://www.imf.org/external/data.htm) uses the ``.xls`` extension for TSV,\n2. files on [data.gov](https://www.data.gov/) sometimes include preambles which break straightforward parsing,\n3. if you've been using public data sources, then you have horror stories of your own.\n4. (If all your data comes from coworkers or colleagues then undoubtedly it's always perfectly formatted.)\n\n``organize`` aims to make it easy to eliminate the hand-scrub phase from working with real-world data files:\n\n1. Read CSV, TSV and Excel formats, even if they are poorly labeled or missing a filename.\n2. Skip over preambles lines which would otherwise require cleaning up by hand.\n3. Ignore lines with whitespace or where every column is empty.\n\nIn most cases it should be as simple as:\n\n from organize import organize\n\n with open('myfile.csv|myfile.tsv|myfile.xls|myfile', 'r') as fin:\n for row in organize(fin):\n for column_name, column_value in row:\n print column_name, column_value\n\nFor best performance, also supply the filename or mimetype to help\nprioritize the most likely parsers:\n\n for row in organize(fin, filename='myfile.csv', mimetype='text/csv'):\n # and so on\n\n``organize`` owes a spiritual debt to [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/),\nwhich similarly offered a sane abstraction over broken HTML, saving programmers great swaths of time\nand energy.\n\nDeveloped against Python 2.7, and hosted on [Github](https://github.com/lethain/organize).\n\n\n## Installation\n\nSimplest is to install via pip from [PyPi](https://pypi.python.org/pypi?name=organize):\n\n pip install organize\n\nAlternative, you can install it from Github if you want an unreleased branch or commit:\n\n pip install -e git+https://github.com/lethain/organize#egg=organize\n\nNote that installing from Github will also download the <5MB of test datasets\nwhich are used when running the test suite. These are not installed when you\ninstall the package via ``setup.py build install``.\n\nFor development:\n\n git clone git@github.com:lethain/organize.git\n cd organize\n make env\n\nAnd then you can run the tests and stylechecks via:\n\n make test style\n\n\n## Usage\n\nFor normal usage, ``organize`` presents a very simple interface:\n\n from organize import organize\n\n with open('myfile', 'r') as fin:\n for row in organize(fin):\n for name, val in row:\n print \"%s: %s\" % (name, val)\n\nIf possible, you should also pass the filename (or mimetype) to\nthe ``organize.organize`` function in order to prioritize the\nparsers. Supplying the filename or mimetype will improve performance\nin the normal case but will not impact correctness:\n\n from organize import organize\n\n filename = 'myfile.csv'\n mimetype = 'text/csv'\n with open('myfile', 'r') as fin:\n for row in organize(fin, filename=filename, mimetype=mimetype):\n for name, val in row:\n print \"%s: %s\" % (name, val)\n\nIn most cases there is no benefit to passing both filename and mimetype,\nbut for cases where you're reading a bunch of files, generally try to\nsupply as much information as possible.\n\nNote that both the row and columns are generators, so you cannot\niterate through the returned generator multiple times. If you\nwanted to process the same file twice, you would need to do\nsomething like this:\n\n from organize import organize\n\n with open('myfile', 'r') as fin:\n for row in organize(fin):\n print row\n fin.seek(0)\n for row in organize(fin):\n print row\n\n\n### Exposes duplicate columns, and unnamed columns\n\nBecause we don't want to accidentally drop data we do not\nuse a dictionary to store columns. This means by default you\nwill see all columns even if they are unnamed or have duplicate names.\n\nIf you do **not** want duplicates, you can convert the rows into dictionaries\nsimply by calling ``dict`` on each row:\n\n with open('myfile', 'r') as fin:\n for row in organize(fin):\n row_dict = dict(row)\n print row_dict.keys()\n\nIf you want uniqueness and ordering, then you can use [collections.OrderedDict](https://docs.python.org/2/library/collections.html#collections.OrderedDict):\n\n from collections import OrderedDict\n\n with open('myfile', 'r') as fin:\n for row in organize(fin):\n row_dict = OrderedDict(row)\n print row_dict.keys()\n\n\nIn this case you'll get the first value for each name.\n\n\n### Only read certain rows\n\nFor some reason you may only want a subset of rows.\nFor getting a subset of rows we recommend using [itertools.islice](https://docs.python.org/2/library/itertools.html#itertools.islice)\nfrom the Python ``itertools`` module:\n\n with open('myfile', 'r') as fin:\n for row in islice(organize(fin), 5, 25):\n print row\n\nNote that you won't get the literal rows 5 through 25 from the document,\nbut rather rows 5 through 25 from the dataset extracted from the document.\nI honestly have no idea why you'd actually want to do this. If you do have\na usecase, please let us know.\n\n\n## Examples\n\nThis section briefly describes the contents of the ``examples`` directory.\n\n\n### [transform_directory](organize/examples/transform_directory.py)\n\nWalks through slurping up all files in a directory, organizing them,\nand then writing the cleaned up versions into another directory.\n\n\n### [Using with Pandas](organize/example/with_pandas.py)\n\n[Pandas](http://pandas.pydata.org/) is one of the core Python\ndata science packages, and a commonly used tool. While it provides\nmany [tools for loading data](http://pandas.pydata.org/pandas-docs/stable/io.html),\nit doesn't place as much emphasis on handling poorly formatted data.\n\n from organize import organize\n import pandas as pd\n\n pd.DataFrame.from_records(organize(open('myfile', 'r')))\n\nThe example file is a bit better formatted, but expresses the same idea.\n\n\n## What it doesn't do\n\n``organize`` tries to parse tabular data files, and any tabular data file it does\nnot parse is a bug. However, there are many things it does not do.\nThese are not necessarily the final say, but represent our best current thinking.\n\n\n### Does not read directly from databases\n\nWe do not plan to support reading directly from databases (MySQL, PostgreSQL, SQLite, MongoDB, ...).\n\n\n### Does not generate or enforce schemas\n\nWe do not intend to create or enforce schemas on the organized data itself.\nData will always be what the underlying format parser returns. For TSV, CSV,\nExcel this means a string, for more helpful formats like JSON it will be a\nstring or integer or whatnot.\n\nWe like this functionality, but believe it would be best suited to\na different library built on top of ``organize`` as opposed to including\nit directly within the library itself.\n\n\n## Contribution and development\n\nThis section includes some commentary for those interested and willing to contribute\nadditions to this codebase. First, and most importantly: thank you!\n\nSuccessful pull requests will pass pep8, pylint and tests, as well as including new\ntests for any additional functionality (or fixed bugs). You can verify they are working\nvia:\n\n cd ~/path/to/organize\n make env test style\n\nIt's OK to disable certain pylint checks within the files you edit if it's not feasible\nto resolve the pylint complaint. (For example, when it believes a given attribute doesn't\nexist which does but it can't lint properly for whatever reason.)\n\n### Areas for future development\n\nWe're using the [Github issue tracker](https://github.com/lethain/organize/issues) to\ntrack specific projects for future development, but broadly there are two areas we'd\nlike to continue improving upon:\n\n1. File parsing should be as robust as possible for existing formats.\n2. File parsing should support as many formats as useful.\n\nAnything along those lines that doesn't introduce significant complexity or\nperformance degradation will probably be viewed as a very good thing.\n\n\n### Approach\n\nOur implementation approach is:\n\n1. The library should do the intended thing, even if it requires a hack\n or underwhelming heuristic.\n2. Deal with streams, return streams. Many files are huge and we don't want\n to be needlessly wasteful.\n3. Don't rely on filenames to identify data formats. Files are often mislabled or\n not labeled at all.\n\n### Implemntation notes\n\nThis section includes a variety of implementation notes which may become somewhat relevant\nto you in some odd edge-cases, but generally are not important for using the library.\n\n#### Looking ahead a bit\n\nWe need to look ahead a bit at the beginning of files in order to exclude preamble\nrows within a given dataset. This means that we will read some rows of the loaded\nfile before you explicitly load them.\n\nFrom a user's point of view, this *should* be transparent as we will replay the rows\nwe've read in advance as if we're reading them when you read them. However, I would not\nbe shocked if in some rare and bizarre scenarios this leaks.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://pypi.python.org/pypi/organize/", "keywords": null, "license": "LICENSE.txt", "maintainer": null, "maintainer_email": null, "name": "organize", "package_url": "https://pypi.org/project/organize/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/organize/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://pypi.python.org/pypi/organize/" }, "release_url": "https://pypi.org/project/organize/0.1.2/", "requires_dist": null, "requires_python": null, "summary": "Parse real-world tabular data in a wide variety of formats.", "version": "0.1.2" }, "last_serial": 1104800, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "4562bb0a1f13bcc39dad2298abe81444", "sha256": "7fc24c16843234ccb476e95cfc1eac5274f7f134117b2c91f26604c0676f0891" }, "downloads": -1, "filename": "organize-0.1.0.macosx-10.9-intel.exe", "has_sig": false, "md5_digest": "4562bb0a1f13bcc39dad2298abe81444", "packagetype": "bdist_wininst", "python_version": "any", "requires_python": null, "size": 84933, "upload_time": "2014-05-26T15:31:33", "url": "https://files.pythonhosted.org/packages/e4/40/0015875908842b87bdad63fbce3c5940b8a5ef77cc89dc6a7a136a571e96/organize-0.1.0.macosx-10.9-intel.exe" }, { "comment_text": "", "digests": { "md5": "71ff8ac37c01de46b692811e34b96eb5", "sha256": "94387be3ab5ccaad007fee8451bc4ed9edb886187ec433a2ecd1d8c5c42595fe" }, "downloads": -1, "filename": "organize-0.1.0.tar.gz", "has_sig": false, "md5_digest": "71ff8ac37c01de46b692811e34b96eb5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11159, "upload_time": "2014-05-26T15:31:31", "url": "https://files.pythonhosted.org/packages/d2/8e/1c7ab2bc9b27b0952d428934c155b3a39d2a0eafc4de0af4373a037f9027/organize-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "ffb725aa351f9f9aaf1538b7762546c6", "sha256": "f4f545b21c6726d2c623d4e238c638659a1bc651b5eb5f2a8ba2d1c34b3ecc5f" }, "downloads": -1, "filename": "organize-0.1.1.macosx-10.9-intel.exe", "has_sig": false, "md5_digest": "ffb725aa351f9f9aaf1538b7762546c6", "packagetype": "bdist_wininst", "python_version": "any", "requires_python": null, "size": 85809, "upload_time": "2014-05-26T15:46:44", "url": "https://files.pythonhosted.org/packages/2b/82/22453dcd8c447520107a3b0b8db2daad8879ab93c458c525221be587d265/organize-0.1.1.macosx-10.9-intel.exe" }, { "comment_text": "", "digests": { "md5": "506095ad4099eb589f882510ca1b879c", "sha256": "420fc6b450ac824ce510b88236b4fa1a2ea5eb9cccfe7a8bb6b11f00dee9a1db" }, "downloads": -1, "filename": "organize-0.1.1.tar.gz", "has_sig": false, "md5_digest": "506095ad4099eb589f882510ca1b879c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11437, "upload_time": "2014-05-26T15:46:40", "url": "https://files.pythonhosted.org/packages/6c/d1/04176655c4880cebe8b4533ad5a22d490ab8e3ae55c826fd8e1df4a50c96/organize-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "26a49632289809ea2473b68876acca2d", "sha256": "dd7a5cba8466025e077afd84dd842a4a3cc195dc9911d9a6da793509cdb86244" }, "downloads": -1, "filename": "organize-0.1.2.macosx-10.9-intel.exe", "has_sig": false, "md5_digest": "26a49632289809ea2473b68876acca2d", "packagetype": "bdist_wininst", "python_version": "any", "requires_python": null, "size": 86647, "upload_time": "2014-05-26T17:27:57", "url": "https://files.pythonhosted.org/packages/10/28/da051a9befa2ed51c66684c336780078f1a266dcde04f78bddb075d4416e/organize-0.1.2.macosx-10.9-intel.exe" }, { "comment_text": "", "digests": { "md5": "e3c756e2125c699211408b02ba37baa5", "sha256": "3c383f2be5244d52fd9c38ec5746d5939bbbb6ec00497d205d2bad82f1ad2a10" }, "downloads": -1, "filename": "organize-0.1.2.tar.gz", "has_sig": false, "md5_digest": "e3c756e2125c699211408b02ba37baa5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11709, "upload_time": "2014-05-26T17:27:54", "url": "https://files.pythonhosted.org/packages/67/31/7f38656c5fac0ae14dc529ad1b1a3fb790d6df5a052d488a20814c11cd79/organize-0.1.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "26a49632289809ea2473b68876acca2d", "sha256": "dd7a5cba8466025e077afd84dd842a4a3cc195dc9911d9a6da793509cdb86244" }, "downloads": -1, "filename": "organize-0.1.2.macosx-10.9-intel.exe", "has_sig": false, "md5_digest": "26a49632289809ea2473b68876acca2d", "packagetype": "bdist_wininst", "python_version": "any", "requires_python": null, "size": 86647, "upload_time": "2014-05-26T17:27:57", "url": "https://files.pythonhosted.org/packages/10/28/da051a9befa2ed51c66684c336780078f1a266dcde04f78bddb075d4416e/organize-0.1.2.macosx-10.9-intel.exe" }, { "comment_text": "", "digests": { "md5": "e3c756e2125c699211408b02ba37baa5", "sha256": "3c383f2be5244d52fd9c38ec5746d5939bbbb6ec00497d205d2bad82f1ad2a10" }, "downloads": -1, "filename": "organize-0.1.2.tar.gz", "has_sig": false, "md5_digest": "e3c756e2125c699211408b02ba37baa5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11709, "upload_time": "2014-05-26T17:27:54", "url": "https://files.pythonhosted.org/packages/67/31/7f38656c5fac0ae14dc529ad1b1a3fb790d6df5a052d488a20814c11cd79/organize-0.1.2.tar.gz" } ] }