{ "info": { "author": "Loanzen", "author_email": "hello@loanzen.in", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Topic :: Software Development :: Libraries", "Topic :: Utilities" ], "description": "# Zen Document Parser\n\n## Intro\n\nzen_document_parser is a utility for extracting data from various official documents. It uses [PDFQuery](https://github.com/jcushman/pdfquery) behind the scenes.\n\nCurrently, there is out-of-the-box support for parsing **Indian Government ITR-V PDF documents.**\n\nThe library also supports parsing of arbitrary PDF documents by allowing you to specify a 'schema' for the document. The library allows for multiple 'variants' of a document. For example, The Indian ITR-V document has slightly different fields and layout depending on whether it was generated in 2013, 2014, 2015 etc.\n\nCheck out the examples below.\n\n\n## Installation\n\nInstall using [pip](https://pip.pypa.io/en/stable/installing/) like so:\n\n```bash\n\n$ pip install zen_document_parser\n```\n\n## Usage\n\n### ITR-V Docs\n\n```python\n\nfrom zen_document_parser.itr.itr import ITRVDocument\n\n# You can pass in a path or a file-like object during instantiation.\ndoc = ITRVDocument('/path/to/itrv.pdf')\n\n# Will load the file, auto-detect the variant and perform extraction of all\n# fields and store results internally.\ndoc.extract()\n\n# Extracted fields are available in the `data` property.\nprint(doc.data.company_name)\nprint(doc.data.gross_total_income)\n\n```\n\n\n### Configuring for custom PDF documents\n\nYou basically follow these steps:\n\n- Define one or more 'schemas', ie. `DocVariant` subclasses, to go with each variant of the doc.\n- In each of these variants, define a `check_for_match()` method that returns `True` if a file was successfully parsed.\n - Make sure to define `test_fields` as an attribute on each class that is a list of all field names used inside `check_for_match()`. (This is required at present for optimization purposes, but will not be a requirement in an upcoming version.)\n- Define a `Doc` subclass that represents your document. In the `variants` attribute, specify possible variants.\n\n\n```python\n\nfrom zen_document_parser.base import DocField, DocVariant, Document\n\n\nclass Variant1(DocVariant):\n\n # The fields that are used inside `check_for_match()`. (for optimization)\n test_fields = ['form_title']\n\n form_title = DocField((30, 300, 500, 380))\n name = DocField((100, 120, 400, 140.5))\n address = DocField((150, 90, 650, 110))\n\n def check_for_match(self):\n if self.form_title == 'Application Form For 2014':\n return True\n return False\n\n\nclass Variant2(DocVariant):\n\n test_fields = ['form_title']\n\n form_title = DocField((30, 290, 500, 380))\n name = DocField((70, 140, 350, 160))\n address = DocField((150, 120, 650, 140))\n pan_no = DocField((150, 80, 650, 100))\n\n def check_for_match(self):\n if self.form_title == 'Application Form For 2015-16':\n return True\n return False\n\n\nclass MyForm(Document):\n\n variants = [Variant1, Variant2]\n\n\ndef main():\n doc = MyForm('/path/to/form.pdf')\n doc.extract()\n print(doc.data.to_dict())\n```\n\n\n# TODO\n\n- Hanle data-type specification\n- Handle fields being mandatory/non-mandatory.\n- Right now the user has to explicitly specify `test_fields` for optimization purposes. Find a way where this isn't needed.\n - Automatically load them the first time they're referred to? `extract()` can still be there as a way to bulk-load all fields in one go.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/loanzen/zen_document_parser", "keywords": "pdf parse itr documents", "license": "MIT", "maintainer": null, "maintainer_email": null, "name": "zen_document_parser", "package_url": "https://pypi.org/project/zen_document_parser/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/zen_document_parser/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/loanzen/zen_document_parser" }, "release_url": "https://pypi.org/project/zen_document_parser/0.11/", "requires_dist": null, "requires_python": null, "summary": "A library for parsing various government documents as well as general PDFs", "version": "0.11" }, "last_serial": 2044123, "releases": { "0.11": [ { "comment_text": "", "digests": { "md5": "50f7b7ce5d1df2a068eff39f1410421b", "sha256": "27b33f0844e90f88aecbf581011701626fc75e5fc6402d914f7b1efa4354569e" }, "downloads": -1, "filename": "zen_document_parser-0.11.tar.gz", "has_sig": false, "md5_digest": "50f7b7ce5d1df2a068eff39f1410421b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7872, "upload_time": "2016-04-04T05:07:09", "url": "https://files.pythonhosted.org/packages/70/2f/90071786f35886e5d230fc072fd207a39220240619e58a8dc5f75d5e0c9d/zen_document_parser-0.11.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "50f7b7ce5d1df2a068eff39f1410421b", "sha256": "27b33f0844e90f88aecbf581011701626fc75e5fc6402d914f7b1efa4354569e" }, "downloads": -1, "filename": "zen_document_parser-0.11.tar.gz", "has_sig": false, "md5_digest": "50f7b7ce5d1df2a068eff39f1410421b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7872, "upload_time": "2016-04-04T05:07:09", "url": "https://files.pythonhosted.org/packages/70/2f/90071786f35886e5d230fc072fd207a39220240619e58a8dc5f75d5e0c9d/zen_document_parser-0.11.tar.gz" } ] }