{
    "info": {
        "author": "Loanzen",
        "author_email": "hello@loanzen.in",
        "bugtrack_url": null,
        "classifiers": [
            "Development Status :: 4 - Beta",
            "Intended Audience :: Developers",
            "License :: OSI Approved :: MIT License",
            "Programming Language :: Python :: 2",
            "Programming Language :: Python :: 2.6",
            "Programming Language :: Python :: 2.7",
            "Programming Language :: Python :: 3",
            "Programming Language :: Python :: 3.3",
            "Programming Language :: Python :: 3.4",
            "Programming Language :: Python :: 3.5",
            "Topic :: Software Development :: Libraries",
            "Topic :: Utilities"
        ],
        "description": "# Zen Document Parser\n\n## Intro\n\nzen_document_parser is a utility for extracting data from various official documents. It uses [PDFQuery](https://github.com/jcushman/pdfquery) behind the scenes.\n\nCurrently, there is out-of-the-box support for parsing **Indian Government ITR-V PDF documents.**\n\nThe library also supports parsing of arbitrary PDF documents by allowing you to specify a 'schema' for the document. The library allows for multiple 'variants' of a document. For example, the Indian ITR-V document has slightly different fields and layout depending on whether it was generated in 2013, 2014, 2015, etc.\n\nCheck out the examples below.\n\n\n## Installation\n\nInstall using [pip](https://pip.pypa.io/en/stable/installing/) like so:\n\n```bash\n\n$ pip install zen_document_parser\n```\n\n## Usage\n\n### ITR-V Docs\n\n```python\n\nfrom zen_document_parser.itr.itr import ITRVDocument\n\n# You can pass in a path or a file-like object during instantiation.\ndoc = ITRVDocument('/path/to/itrv.pdf')\n\n# Loads the file, auto-detects the variant, extracts all fields and\n# stores the results internally.\ndoc.extract()\n\n# Extracted fields are available in the `data` property.\nprint(doc.data.company_name)\nprint(doc.data.gross_total_income)\n\n```\n\n\n### Configuring for custom PDF documents\n\nFollow these steps:\n\n- Define one or more 'schemas', i.e. `DocVariant` subclasses, one for each variant of the doc.\n- In each of these variants, define a `check_for_match()` method that returns `True` if the file matches that variant.\n  - Make sure to define `test_fields` as an attribute on each class that is a list of all field names used inside `check_for_match()`. (This is required at present for optimization purposes, but will not be a requirement in an upcoming version.)\n- Define a `Document` subclass that represents your document. In the `variants` attribute, specify possible variants.\n\n\n```python\n\nfrom zen_document_parser.base import DocField, DocVariant, Document\n\n\nclass Variant1(DocVariant):\n\n    # The fields that are used inside `check_for_match()`. (for optimization)\n    test_fields = ['form_title']\n\n    form_title = DocField((30, 300, 500, 380))\n    name = DocField((100, 120, 400, 140.5))\n    address = DocField((150, 90, 650, 110))\n\n    def check_for_match(self):\n        if self.form_title == 'Application Form For 2014':\n            return True\n        return False\n\n\nclass Variant2(DocVariant):\n\n    test_fields = ['form_title']\n\n    form_title = DocField((30, 290, 500, 380))\n    name = DocField((70, 140, 350, 160))\n    address = DocField((150, 120, 650, 140))\n    pan_no = DocField((150, 80, 650, 100))\n\n    def check_for_match(self):\n        if self.form_title == 'Application Form For 2015-16':\n            return True\n        return False\n\n\nclass MyForm(Document):\n\n    variants = [Variant1, Variant2]\n\n\ndef main():\n    doc = MyForm('/path/to/form.pdf')\n    doc.extract()\n    print(doc.data.to_dict())\n```\n\n\n## TODO\n\n- Handle data-type specification.\n- Handle fields being mandatory/non-mandatory.\n- Right now the user has to explicitly specify `test_fields` for optimization purposes. Find a way where this isn't needed.\n  - Automatically load them the first time they're referred to? `extract()` can still be there as a way to bulk-load all fields in one go.",
        "description_content_type": null,
        "docs_url": null,
        "download_url": "UNKNOWN",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/loanzen/zen_document_parser",
        "keywords": "pdf parse itr documents",
        "license": "MIT",
        "maintainer": null,
        "maintainer_email": null,
        "name": "zen_document_parser",
        "package_url": "https://pypi.org/project/zen_document_parser/",
        "platform": "UNKNOWN",
        "project_url": "https://pypi.org/project/zen_document_parser/",
        "project_urls": {
            "Download": "UNKNOWN",
            "Homepage": "https://github.com/loanzen/zen_document_parser"
        },
        "release_url": "https://pypi.org/project/zen_document_parser/0.11/",
        "requires_dist": null,
        "requires_python": null,
        "summary": "A library for parsing various government documents as well as general PDFs",
        "version": "0.11"
    },
    "last_serial": 2044123,
    "releases": {
        "0.11": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "50f7b7ce5d1df2a068eff39f1410421b",
                    "sha256": "27b33f0844e90f88aecbf581011701626fc75e5fc6402d914f7b1efa4354569e"
                },
                "downloads": -1,
                "filename": "zen_document_parser-0.11.tar.gz",
                "has_sig": false,
                "md5_digest": "50f7b7ce5d1df2a068eff39f1410421b",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 7872,
                "upload_time": "2016-04-04T05:07:09",
                "url": "https://files.pythonhosted.org/packages/70/2f/90071786f35886e5d230fc072fd207a39220240619e58a8dc5f75d5e0c9d/zen_document_parser-0.11.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "50f7b7ce5d1df2a068eff39f1410421b",
                "sha256": "27b33f0844e90f88aecbf581011701626fc75e5fc6402d914f7b1efa4354569e"
            },
            "downloads": -1,
            "filename": "zen_document_parser-0.11.tar.gz",
            "has_sig": false,
            "md5_digest": "50f7b7ce5d1df2a068eff39f1410421b",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 7872,
            "upload_time": "2016-04-04T05:07:09",
            "url": "https://files.pythonhosted.org/packages/70/2f/90071786f35886e5d230fc072fd207a39220240619e58a8dc5f75d5e0c9d/zen_document_parser-0.11.tar.gz"
        }
    ]
}