{ "info": { "author": "Marlan Perumal", "author_email": "marlan.perumal@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Office/Business :: Financial" ], "description": "# PDF Statement Reader\n[![Build Status](https://travis-ci.com/marlanperumal/pdf_statement_reader.svg?branch=master)](https://travis-ci.com/marlanperumal/pdf_statement_reader)\n[![PyPI version](https://badge.fury.io/py/pdf-statement-reader.svg)](https://badge.fury.io/py/pdf-statement-reader)\n[![Coverage Status](https://coveralls.io/repos/github/marlanperumal/pdf_statement_reader/badge.svg)](https://coveralls.io/github/marlanperumal/pdf_statement_reader)\n\nPython library and command line tool for parsing pdf bank statements\n\nInspired by https://github.com/antonburger/pdf2csv\n\n## Objectives\n\nBanks generally send account statements in pdf format. These pdfs are often encrypted, the pdf format is difficult to extract tables from and when you finally get the table out it's in a non tidy format. This package aims to help by providing a library of functions and a set of command line tools for converting these statements into more useful formats such as csv files and pandas dataframes.\n\n## Installation\n\n```\npip install pdf-statement-reader\n```\n\n### Troubleshooting\n\nThis package uses [tabula-py](https://github.com/chezou/tabula-py) under the hood, which itself is a wrapper for [tabula-java](https://github.com/tabulapdf/tabula-java). You thus need to have java installed for it to work. 
If you see any errors complaining about Java, check out the `tabula-py` page for troubleshooting advice.\n\nIn the future, we hope to move to a pure python implementation.\n\n## Usage\n\nThe package provides a command line application `psr`\n\n```\nUsage: psr [OPTIONS] COMMAND [ARGS]...\n\n Utility for reading bank and other statements in pdf form\n\nOptions:\n --help Show this message and exit.\n\nCommands:\n bulk Bulk converts all files in a folder\n decrypt Decrypts a pdf file Uses pikepdf to open an encrypted pdf file...\n pdf2csv Converts a pdf statement to a csv file using a given format\n validate Validates the csv statement rolling balance\n```\n\n## Configuration\n\nPDF files are notoriously difficult to extract data from. (Here's a nice [blog post](https://www.propublica.org/nerds/heart-of-nerd-darkness-why-dollars-for-docs-was-so-difficult) on why). For a really good semi-manual GUI solution, check out [tabula](https://tabula.technology/). In fact this package uses tabula's pdf parsing library under the hood.\n\nSince bank statements are generally of the same (if inconvenient) format, we can set up a configuration to tell the tool how to grab the data.\n\nFor each type of bank statement, the exact format will be different. A config file holds the instructions for how to process the raw pdf. For now the only config supported is for Cheque account statements from Absa bank in South Africa. \n\nTo set up a different statement, you can simply add a new config file and then tell the `psr` tool to use it. These config files are stored in a folder structure as follows:\n\n config > [country code] > [bank] > [statement type].json\n\nSo for example the default config is stored in\n\n config > za > absa > cheque.json\n\nThe config spec is a code of the form\n\n [country code].[bank].[statement type]\n\nOnce again for the default this will be\n\n za.absa.cheque\n\nThe configuration file itself is in JSON format. 
Here's the Absa cheque account one with some commentary to explain what each field does.\n\n```json5\n{\n // Describes the page layout that should be scanned\n \"layout\": { \n // Default layout for all pages not otherwise defined\n \"default\": {\n // The page coordinates containing the table, in pts \n // [top, left, bottom, right]\n \"area\": [280, 27, 763, 576],\n // The right x coordinate of each column in the table\n \"columns\": [83, 264, 344, 425, 485, 570]\n },\n // Layout for the first page\n \"first\": {\n \"area\": [480, 27, 763, 576],\n \"columns\": [83, 264, 344, 425, 485, 570]\n }\n },\n\n // The column names to be used, exactly as they appear\n // in the statement\n \"columns\": {\n \"trans_date\": \"Date\",\n \"trans_type\": \"Transaction Description\",\n \"trans_detail\": \"Transaction Detail\",\n \"debit\": \"Debit Amount\",\n \"credit\": \"Credit Amount\",\n \"balance\": \"Balance\"\n },\n\n // The order of the columns to be output in the csv\n \"order\": [\n \"trans_date\",\n \"trans_type\",\n \"trans_detail\",\n \"debit\",\n \"credit\",\n \"balance\"\n ],\n\n // Specifies any cleaning operations required\n \"cleaning\": {\n // Convert these columns to numeric\n \"numeric\": [\"debit\", \"credit\", \"balance\"],\n // Convert these columns to date\n \"date\": [\"trans_date\"],\n // Use this date format to parse any date columns\n \"date_format\": \"%d/%m/%Y\",\n // For cases where the transaction detail is stored\n // in the next line below the transaction type\n \"trans_detail\": \"below\",\n // Only keep the rows where these columns are populated\n \"dropna\": [\"balance\"]\n }\n}\n```\n\nThese were the configuration options that were required for the default format. 
It is envisaged that as more formats are added, the list of options will grow.\n\n## CLI API\n\n### decrypt\n\n```\nUsage: psr decrypt [OPTIONS] INPUT_FILENAME [OUTPUT_FILENAME]\n\n Decrypts a pdf file\n\n Uses pikepdf to open an encrypted pdf file and then save the unencrypted\n version. If no output_filename is specified then overwrites the original\n file.\n\nOptions:\n -p, --password TEXT The pdf encryption password. If not supplied, it will\n be requested at the prompt\n --help Show this message and exit.\n```\n\n### pdf2csv\n\n```\nUsage: psr pdf2csv [OPTIONS] INPUT_FILENAME [OUTPUT_FILENAME]\n\n Converts a pdf statement to a csv file using a given format\n\nOptions:\n -c, --config TEXT The configuration code defining how the file should be\n parsed [default: za.absa.cheque]\n --help Show this message and exit.\n```\n\n### validate\n\n```\nUsage: psr validate [OPTIONS] INPUT_FILENAME\n\n Validates the csv statement rolling balance\n\nOptions:\n -c, --config TEXT The configuration code defining how the file should be\n parsed [default: za.absa.cheque]\n --help Show this message and exit.\n```\n\n### bulk\n\n```\nUsage: psr bulk [OPTIONS] FOLDER\n\n Bulk converts all files in a folder\n\nOptions:\n -c, --config TEXT The configuration code defining how the file\n should be parsed [default: za.absa.cheque]\n -p, --password TEXT The pdf encryption password. If not supplied, it\n will be requested at the prompt\n -d, --decrypt-suffix TEXT The suffix to append to the decrypted pdf file\n when created [default: _decrypted]\n -k, --keep-decrypted Keep the a copy of the decrypted file. 
It is\n removed by default\n -v, --verbose Print verbose output while running\n --help Show this message and exit.\n```\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/marlanperumal/pdf_statement_reader", "keywords": "bank statement pdf digitise", "license": "", "maintainer": "", "maintainer_email": "", "name": "pdf-statement-reader", "package_url": "https://pypi.org/project/pdf-statement-reader/", "platform": "", "project_url": "https://pypi.org/project/pdf-statement-reader/", "project_urls": { "Bug Reports": "https://github.com/marlanperumal/pdf_statement_reader/issues", "Homepage": "https://github.com/marlanperumal/pdf_statement_reader", "Source": "https://github.com/marlanperumal/pdf_statement_reader" }, "release_url": "https://pypi.org/project/pdf-statement-reader/0.1.3/", "requires_dist": [ "pikepdf", "tabula-py", "pandas", "numpy", "click", "check-manifest ; extra == 'dev'", "pytest ; extra == 'test'", "coverage ; extra == 'test'" ], "requires_python": ">=3.5", "summary": "PDF Statement Reader", "version": "0.1.3" }, "last_serial": 4822122, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "d43e5d1fce6cf81f1ea1c8a0d6c3d03f", "sha256": "6d0cdf27f94f370066fc3205a7c7048ecdfdeb1875728312a27052f4c977af29" }, "downloads": -1, "filename": "pdf_statement_reader-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "d43e5d1fce6cf81f1ea1c8a0d6c3d03f", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 6298, "upload_time": "2019-02-10T14:06:03", "url": "https://files.pythonhosted.org/packages/58/a4/968a9e40c439a8c9fd4c3068313f48e79844521567f605246e338148286a/pdf_statement_reader-0.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5324923232cad38c076a5095c7fc4feb", "sha256": "d0b3a4c3e7b2ddf1c93d9cc7667555332ebb7e2206b49f82ab2916eb48b060ca" }, 
"downloads": -1, "filename": "pdf_statement_reader-0.1.0.tar.gz", "has_sig": false, "md5_digest": "5324923232cad38c076a5095c7fc4feb", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 5188, "upload_time": "2019-02-10T14:06:05", "url": "https://files.pythonhosted.org/packages/81/2d/5dab7667b140f37feb24a27468a8e0680d295cfe7edfb57dd3de57039320/pdf_statement_reader-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "6a9b8739e2847766f89d37a2a92825ce", "sha256": "0aba3371661830af10f985130b21d808bfd2b78e64368ccab213792736df0f3f" }, "downloads": -1, "filename": "pdf_statement_reader-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "6a9b8739e2847766f89d37a2a92825ce", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 9658, "upload_time": "2019-02-10T20:38:33", "url": "https://files.pythonhosted.org/packages/f0/55/9ba23de3b7a23fb97cc52aeb28d56ecf7e09bfd6aacef6a6665b5e4d6a17/pdf_statement_reader-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9fea528634e7e615b3950290af369887", "sha256": "82c1675d8b5717f8c3147ba2089aacaf349b91154ca1e498417aba4f475f98f0" }, "downloads": -1, "filename": "pdf_statement_reader-0.1.1.tar.gz", "has_sig": false, "md5_digest": "9fea528634e7e615b3950290af369887", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 8294, "upload_time": "2019-02-10T20:38:34", "url": "https://files.pythonhosted.org/packages/0c/99/1ba50b7b1b9e75803a835c5486290fe19e1899e069d404c1af867c05b743/pdf_statement_reader-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "aa6cd82d36f5fbd19229d8f0441fc729", "sha256": "c2dff97d1c411c15c7340660de07deb944fe61fddf16c0500ca5ac91a26cc288" }, "downloads": -1, "filename": "pdf_statement_reader-0.1.2-py3-none-any.whl", "has_sig": false, "md5_digest": "aa6cd82d36f5fbd19229d8f0441fc729", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": 
">=3.5", "size": 9659, "upload_time": "2019-02-11T21:38:33", "url": "https://files.pythonhosted.org/packages/8e/6c/ae1bde9bd1068dead8981417754256f6498f97ce3d25a1505e9681cdce93/pdf_statement_reader-0.1.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c139928ae5331100acd39a0af096eacb", "sha256": "8895ec0e5d654bb2025b3311329efe7ece8a35b318455507c86084706dd9d131" }, "downloads": -1, "filename": "pdf_statement_reader-0.1.2.tar.gz", "has_sig": false, "md5_digest": "c139928ae5331100acd39a0af096eacb", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 8563, "upload_time": "2019-02-11T21:38:36", "url": "https://files.pythonhosted.org/packages/60/4e/c3420fa5f91b40abe43d2d5cf7d0c00ba70b58d735e9301747f2fb06d7ca/pdf_statement_reader-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "46afc794c2d6ca8766d3fb7dc5bd9201", "sha256": "f68704b8d6b5dc053eadcb200daabc39fc2f94c47df227afd12b2b1e2d6c1b5c" }, "downloads": -1, "filename": "pdf_statement_reader-0.1.3-py3-none-any.whl", "has_sig": false, "md5_digest": "46afc794c2d6ca8766d3fb7dc5bd9201", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 10242, "upload_time": "2019-02-14T21:00:32", "url": "https://files.pythonhosted.org/packages/a2/c2/f67ba95be9576d73bcb9a17bd04f24f4bcfdee6d40f015b967645c660bb0/pdf_statement_reader-0.1.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "574fd48e716622759719f3a39adedcfc", "sha256": "104763c63012cd044f951617f563c1092379e0846451c15a8a14b2490074ca18" }, "downloads": -1, "filename": "pdf_statement_reader-0.1.3.tar.gz", "has_sig": false, "md5_digest": "574fd48e716622759719f3a39adedcfc", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 8679, "upload_time": "2019-02-14T21:00:34", "url": "https://files.pythonhosted.org/packages/c5/6a/480cc968225a29636ed596ade44266190d16a7fb1709cdf3b6a0caf314c8/pdf_statement_reader-0.1.3.tar.gz" } ] }, 
"urls": [ { "comment_text": "", "digests": { "md5": "46afc794c2d6ca8766d3fb7dc5bd9201", "sha256": "f68704b8d6b5dc053eadcb200daabc39fc2f94c47df227afd12b2b1e2d6c1b5c" }, "downloads": -1, "filename": "pdf_statement_reader-0.1.3-py3-none-any.whl", "has_sig": false, "md5_digest": "46afc794c2d6ca8766d3fb7dc5bd9201", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5", "size": 10242, "upload_time": "2019-02-14T21:00:32", "url": "https://files.pythonhosted.org/packages/a2/c2/f67ba95be9576d73bcb9a17bd04f24f4bcfdee6d40f015b967645c660bb0/pdf_statement_reader-0.1.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "574fd48e716622759719f3a39adedcfc", "sha256": "104763c63012cd044f951617f563c1092379e0846451c15a8a14b2490074ca18" }, "downloads": -1, "filename": "pdf_statement_reader-0.1.3.tar.gz", "has_sig": false, "md5_digest": "574fd48e716622759719f3a39adedcfc", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 8679, "upload_time": "2019-02-14T21:00:34", "url": "https://files.pythonhosted.org/packages/c5/6a/480cc968225a29636ed596ade44266190d16a7fb1709cdf3b6a0caf314c8/pdf_statement_reader-0.1.3.tar.gz" } ] }