{
"info": {
"author": "Andreas Kloeckner",
"author_email": "inform@tiker.net",
"bugtrack_url": null,
"classifiers": [
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"Intended Audience :: Other Audience",
"Intended Audience :: Science/Research",
"License :: OSI Approved :: MIT License",
"Natural Language :: English",
"Programming Language :: Python :: 3",
"Topic :: Utilities"
],
"description": "pdf2data\n========\n\nHave some tabluar data locked away in PDF format? Like the financial information at my\nesteemed place of current employment, which looks roughly like this:\n\n.. image:: example/ui-financials.png\n\n`Tabula `__ not plausible for the volume of\ninformation you're needing to extract? (Thousands of pages in my case.) This\npackage may be what you're looking for. I should note that this is\na simple tool aimed at very structured data. Tabula can handle *far* messsier\nsituations than this package. Misaligned cell heights? Word-wrapped cells?\nSpanning cells? You're better off with Tabula. Computer-generated report\nPDFs that urgently want to be in a SQLite database? You've come to the right\nplace.\n\nReading UIUC Financials\n-----------------------\n\nIf you came here looking to read financial statements at UIUC, there's a page\n`just for you `__.\n\nPackage Overview\n----------------\n\nThis package builds on `pdfminer `__ to make it\neasy to absorb computer-generated tabular data in PDF form and produce JSON-like lists of\nrow dictionaries. The basic workflow is as follows:\n\n.. code:: python\n\n # identify top of table\n top_y0 = find_attr_group_matching(\n [\"Last Name\", \"First Name\"], \"y0\", page_it.lines)\n\n # extract text snippets making up table body\n table_lines = [l for l in page_it.lines if l.y0 < top_y0]\n\n # extract header text snippets\n headers = [l for l in page_it.lines if abs(l.y0 - top_y0) < 5]\n\n # extract table\n rows = find_row_table(headers, table_lines)\n rows = merge_overlapping_rows(rows, \"y0\", \"y1\")\n\nThis will leave ``rows`` to be a data structure roughly like the following:\n\n.. code:: js\n\n {'Amount ': TL(' 60.00 '), 'Last Name': TL('Lidstad'), 'Address': TL('62\\xa0Mississippi\\xa0River\\xa0Blvd\\xa0N'), 'First Name': TL('Dick\\xa0&\\xa0Peg'), 'City': TL('Saint\\xa0Paul'), 'State': TL('MN'), 'Zip': TL('55104'), 'Occupation': TL('retired'), 'Date': TL('10/12/2012')}\n {'Amount ': TL(' 60.00 '), 'Last Name': TL('Strom'), 'Address': TL('1229\\xa0Hague\\xa0Ave'), 'First Name': TL('Pam'), 'City': TL('St.\\xa0Paul'), 'State': TL('MN'), 'Zip': TL('55104'), 'Date': TL('9/12/2012')}\n {'Amount ': TL(' 60.00 '), 'Last Name': TL('Seeba'), 'Address': TL('1399\\xa0Sheldon\\xa0St'), 'First Name': TL('Louise\\xa0&\\xa0Paul'), 'City': TL('Saint\\xa0Paul'), 'State': TL('MN'), 'Zip': TL('55108'), 'Occupation': TL('BOE'), 'Employer': TL('City\\xa0of\\xa0Saint\\xa0Paul'), 'Date': TL('10/12/2012')}\n {'Amount ': TL(' 60.00 '), 'Last Name': TL('Schumacher\\xa0/\\xa0Bales'), 'First Name': TL('Douglas\\xa0L.\\xa0/\\xa0Patricia\\xa0948\\xa0County\\xa0Rd.\\xa0D\\xa0W'), 'City': TL('Saint\\xa0Paul'), 'State': TL('MN'), 'Zip': TL('55126'), 'Date': TL('10/13/2012')}\n {'Amount ': TL(' 75.00 '), 'Last Name': TL('Abrams'), 'Address': TL('238\\xa08th\\xa0St\\xa0east'), 'First Name': TL('Marjorie'), 'City': TL('St\\xa0Paul'), 'State': TL('MN'), 'Zip': TL('55101'), 'Occupation': TL('Retired'), 'Employer': TL('Retired'), 'Date': TL('8/8/2012')}\n\nSee `this demo `__ for a minimal, fully functional example.\nThere is some documentation in the `source code `__. In\naddition, there are some (sparsely documented) facilities for inserting the\nobtained data into a SQLite3 database and the full script I use to make my\nfinancial info digestible.\n\nThe package is Python 3-only. Install using::\n\n pip install pdf2data\n\nhttps://github.com/inducer/pdf2data\n\nCopyright 2019 Andreas Kloeckner\n\nReleased under the MIT License\n\nIn terms of support, if this doesn't do what you need, you're likely to be on\nyour own. I'm happy to take patches, but I'm unlikely to have to time to fix\nyour use case.",
"description_content_type": "",
"docs_url": null,
"download_url": "",
"downloads": {
"last_day": -1,
"last_month": -1,
"last_week": -1
},
"home_page": "https://github.com/inducer/pdf2data",
"keywords": "",
"license": "MIT",
"maintainer": "",
"maintainer_email": "",
"name": "pdf2data",
"package_url": "https://pypi.org/project/pdf2data/",
"platform": "",
"project_url": "https://pypi.org/project/pdf2data/",
"project_urls": {
"Homepage": "https://github.com/inducer/pdf2data"
},
"release_url": "https://pypi.org/project/pdf2data/2019.1/",
"requires_dist": null,
"requires_python": "",
"summary": "Tools for extracting tabular data from PDFs, using pdfminer",
"version": "2019.1"
},
"last_serial": 5663975,
"releases": {
"2019.1": [
{
"comment_text": "",
"digests": {
"md5": "a055e751be03629724f5f7456a71987c",
"sha256": "c3933138bd67b3791571ea781cbc34ccbc457eead3de9bb4d8f51ec3aa01f726"
},
"downloads": -1,
"filename": "pdf2data-2019.1.tar.gz",
"has_sig": false,
"md5_digest": "a055e751be03629724f5f7456a71987c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 82051,
"upload_time": "2019-08-12T04:01:12",
"url": "https://files.pythonhosted.org/packages/b8/72/4421e84576046b53e0c024b3538b456e7bc675e8a8829d02ff5f31665135/pdf2data-2019.1.tar.gz"
}
]
},
"urls": [
{
"comment_text": "",
"digests": {
"md5": "a055e751be03629724f5f7456a71987c",
"sha256": "c3933138bd67b3791571ea781cbc34ccbc457eead3de9bb4d8f51ec3aa01f726"
},
"downloads": -1,
"filename": "pdf2data-2019.1.tar.gz",
"has_sig": false,
"md5_digest": "a055e751be03629724f5f7456a71987c",
"packagetype": "sdist",
"python_version": "source",
"requires_python": null,
"size": 82051,
"upload_time": "2019-08-12T04:01:12",
"url": "https://files.pythonhosted.org/packages/b8/72/4421e84576046b53e0c024b3538b456e7bc675e8a8829d02ff5f31665135/pdf2data-2019.1.tar.gz"
}
]
}