{ "info": { "author": "Yoshihiko Ueno", "author_email": "windows7home@hotmail.co.jp", "bugtrack_url": null, "classifiers": [], "description": "# About\nThis script converts PDF to txt using PDFMiner\n(http://www.unixuser.org/~euske/python/pdfminer/index.html).\n\nPDFMiner is a pdf parsing library written in Python by Yusuke Shinyama.\n\nIn addition to the pdf2txt.py and dumppdf.py command line tools, there\nis a way of analyzing the content tree of each page programmatically.\n\nThis is a more complete example of programming with\nPDFMiner, which continues where the default documentation\n(http://www.unixuser.org/~euske/python/pdfminer/programming.html#layout)\nstops.\n\n**This code is still a work-in-progress, with room for improvement.**\n\n# Install\nSince it's available on PyPI, it's super easy to instal.\n\n```\npip3 install pdf_layout_scanner\n```\n\n# Advantages over PDFMiner\nThis script will extract text from PDFs with *multiple columns*.\n\n# Usage\n## General Usage\n```python\nfrom pdf_layout_scanner import layout_scanner\n\n# get a list of the table of contents\nget_toc()\n\n# get the full text\nget_pages()\n```\n\n## Practical examples\n```python\nfrom pdf_layout_scanner import layout_scanner\ntoc=layout_scanner.get_toc('/path/to/your/pdf-file.pdf')\nprint(len(toc))\n# the number of elements in the pdf document's table of contents\n\nprint(toc[0])\n# a tuple containing the ordinal sequence and the title string,\n# for example:\n# (1, u'Introduction')\n\npages=layout_scanner.get_pages('/path/to/your/pdf-file.pdf')\nprint(len(pages))\n# should return the number of pages in the pdf document\nprint(pages[0])\n# a string of all the text on the first page\n```\n\n# Room for Improvement\n * Column Merging - while the fuzzy heuristic I described works well for\n the pdf files I've parsed so far, I can imagine more complex documents\n where it would break-down (perhaps this is where the analysis should be\n more sophisticated, and not ignore so many types of pdfminer.layout.LT\\* objects).\n\n * Image Extraction - I'd like to be able to be at least as good as\n pdftoimages, and save every file in ppm or pnm default format, but I'm\n not sure what I could be doing differently\n\n * Title and Heading Capitalization - this seems to be an issue with\n PDFMiner, since I get similar results in using the command line tools,\n but it is annoying to have to go back and fix all the mis-capitalizations\n manually, particularly for larger documents.\n\n * Title and Heading Fonts and Spacing - a related issue, though probably\n something in my own code, is that those same title and paragraph headings\n aren't distinguished from the rest of the text. In many cases, I have to\n go back and add vertical spacing and font attributes for those manually.\n\n * Page Number Removal - originally, I thought I could just use a regex\n for an all-numeric value on a single physical line, but each document\n does page numbering slightly differently, and it's very difficult to\n get rid of these without manually proofreading each page.\n\n * Footnotes - handling these where the note and the reference both appear\n on the same page is hard enough, but doing it when they span different\n (even consecutive) pages is worse.\n\n# Contribution\nIn this forked project, I made a bit changes into the original one. \n * Added support for texts in LTFigures\n * Optimized data manipulation and storage\n changed from simple dict to dataframe.\n This should make further contributions easier.\n * Added Progressbar\n\n# Github\nhttps://github.com/yoshihikoueno/pdfminer-layout-scanner\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/yoshihikoueno/pdfminer-layout-scanner", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "PDF-Layout-Scanner", "package_url": "https://pypi.org/project/PDF-Layout-Scanner/", "platform": "", "project_url": "https://pypi.org/project/PDF-Layout-Scanner/", "project_urls": { "Homepage": "https://github.com/yoshihikoueno/pdfminer-layout-scanner" }, "release_url": "https://pypi.org/project/PDF-Layout-Scanner/1.3.3/", "requires_dist": [ "pandas", "tqdm", "pdfminer.six" ], "requires_python": "", "summary": "", "version": "1.3.3" }, "last_serial": 5765728, "releases": { "1.0": [ { "comment_text": "", "digests": { "md5": "20224b65d48634782e91a21670b3905e", "sha256": "30c34d41954b93b8fa523f4061b69dac0ebb15014d8eb1f3b930f71ff7c4b7eb" }, "downloads": -1, "filename": "PDF_Layout_Scanner-1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "20224b65d48634782e91a21670b3905e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7513, "upload_time": "2019-07-24T11:51:49", "url": "https://files.pythonhosted.org/packages/f0/c0/536a12e68fa798b329cb35b4ceab2b098a08b8d41f4c9002783c201b216d/PDF_Layout_Scanner-1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "280061daae16a3e54511bb43b03afa5b", "sha256": "e765be794e67dc9fc2f8ac54a7d250e5aed58deaf5a4dcb231ee388661e5910e" }, "downloads": -1, "filename": "PDF Layout Scanner-1.0.tar.gz", "has_sig": false, "md5_digest": "280061daae16a3e54511bb43b03afa5b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6106, "upload_time": "2019-07-24T11:53:10", "url": "https://files.pythonhosted.org/packages/e0/56/679fc974d67fd96242397ad47ca3fb39368245b65499d4a1ad7a21deb20c/PDF%20Layout%20Scanner-1.0.tar.gz" } ], "1.1": [ { "comment_text": "", "digests": { "md5": "e3a67072a9305cff6c3b5b94ebb956df", "sha256": "0c8b160720a07dd2e8757a9814c2ef8044b42f78d83b50bf2cee4128636dd54c" }, "downloads": -1, "filename": "PDF_Layout_Scanner-1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "e3a67072a9305cff6c3b5b94ebb956df", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7559, "upload_time": "2019-07-24T12:26:43", "url": "https://files.pythonhosted.org/packages/c9/07/25121e337b49fc12fadd7764dc4da9b56ae114ca4eb2afd1b6ba943697b4/PDF_Layout_Scanner-1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "af08cd4eab1ab581e4fd6828b8144a42", "sha256": "1d98f5e46c611141ed021f3f0c81e42f701a5bf49c3146560cc2d20ee9744a00" }, "downloads": -1, "filename": "PDF Layout Scanner-1.1.tar.gz", "has_sig": false, "md5_digest": "af08cd4eab1ab581e4fd6828b8144a42", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6177, "upload_time": "2019-07-24T12:26:45", "url": "https://files.pythonhosted.org/packages/91/6e/6fa485f12042360aa05e1b8c04f34ec298645f8f2e8f1e8c71983cd1401d/PDF%20Layout%20Scanner-1.1.tar.gz" } ], "1.2": [ { "comment_text": "", "digests": { "md5": "e77e524500caa8ed8659b80c1e118a42", "sha256": "74388245b3a649bbdf551b78fe5f010e29feb8357aeb27e3fd7fd0e7fc2ac86a" }, "downloads": -1, "filename": "PDF_Layout_Scanner-1.2-py3-none-any.whl", "has_sig": false, "md5_digest": "e77e524500caa8ed8659b80c1e118a42", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7555, "upload_time": "2019-07-24T12:35:34", "url": "https://files.pythonhosted.org/packages/fb/61/30dfc5552cf397b0a2de9f23255cbd905597d75f264a33bafe2a32885d49/PDF_Layout_Scanner-1.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "64470f33e6ece380014a4e398a4c56c1", "sha256": "9ae6ff4033088dc90716a47005ab17057b2134e03a8568ee0c187288101024c1" }, "downloads": -1, "filename": "PDF Layout Scanner-1.2.tar.gz", "has_sig": false, "md5_digest": "64470f33e6ece380014a4e398a4c56c1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6182, "upload_time": "2019-07-24T12:35:35", "url": "https://files.pythonhosted.org/packages/9f/64/c8142607c195a637db5c1ef9a85509570625d71273c0a978f82b16cd68cb/PDF%20Layout%20Scanner-1.2.tar.gz" } ], "1.3": [ { "comment_text": "", "digests": { "md5": "475c86ae60a2889ea1fb242ea866dbe6", "sha256": "19078dcaa681259ac149e436e56376e343fa14bd5d3a4a1a994908ff10d9e0ec" }, "downloads": -1, "filename": "PDF_Layout_Scanner-1.3-py3-none-any.whl", "has_sig": false, "md5_digest": "475c86ae60a2889ea1fb242ea866dbe6", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7726, "upload_time": "2019-08-07T03:48:05", "url": "https://files.pythonhosted.org/packages/95/f0/382216ec66685517ed5802d3344831597270434d699bcc8ec7f394c01204/PDF_Layout_Scanner-1.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "7dcf53f0488b3f9c3b1f410824b051ab", "sha256": "72dfb6627c7f0bc4dd94dd45ee9fde28d9cb46d62eacdc74b7f96af4063b4139" }, "downloads": -1, "filename": "PDF Layout Scanner-1.3.tar.gz", "has_sig": false, "md5_digest": "7dcf53f0488b3f9c3b1f410824b051ab", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6361, "upload_time": "2019-08-07T03:48:07", "url": "https://files.pythonhosted.org/packages/84/79/10e0bf5bd25c7d429dbee59c3942ea8a6426faedb3e0ce529d5c2209ae4c/PDF%20Layout%20Scanner-1.3.tar.gz" } ], "1.3.1": [ { "comment_text": "", "digests": { "md5": "b8c24a0fe08976c54c20b378645aaf5a", "sha256": "ca09fd3adcfb010315f0200c542bf566427d11dbfabdb13c70ffd8ee104038a2" }, "downloads": -1, "filename": "PDF_Layout_Scanner-1.3.1-py3-none-any.whl", "has_sig": false, "md5_digest": "b8c24a0fe08976c54c20b378645aaf5a", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7789, "upload_time": "2019-08-07T06:13:54", "url": "https://files.pythonhosted.org/packages/67/b0/ccecb3bc3ab5374adf474ee2c2a16ef96330cdff2dcbb8cff8d2248ff244/PDF_Layout_Scanner-1.3.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cf150003c1fb35023ae34d2c4e1b1d26", "sha256": "cfb2b9435815e4522c18845f329c72ba80f374a80046be7d237ff041a134dbca" }, "downloads": -1, "filename": "PDF Layout Scanner-1.3.1.tar.gz", "has_sig": false, "md5_digest": "cf150003c1fb35023ae34d2c4e1b1d26", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6411, "upload_time": "2019-08-07T06:13:55", "url": "https://files.pythonhosted.org/packages/e5/49/911de6fa4cc6938248b8d8c1a119e4b119d1677e732f1fc3dfcc6aa6adc6/PDF%20Layout%20Scanner-1.3.1.tar.gz" } ], "1.3.2": [ { "comment_text": "", "digests": { "md5": "2935ddf045c171239f3331c896fb6db5", "sha256": "a163e7d9987e77ea0bad8dcc20aed4edf8a60e80101406a50cf43ec663c96e18" }, "downloads": -1, "filename": "PDF_Layout_Scanner-1.3.2-py3-none-any.whl", "has_sig": false, "md5_digest": "2935ddf045c171239f3331c896fb6db5", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8065, "upload_time": "2019-08-07T10:49:29", "url": "https://files.pythonhosted.org/packages/c0/ad/039a9de6c8848b34b34af91998d115587cecd506207a50d2c3f156aa50e9/PDF_Layout_Scanner-1.3.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cfb4857fa0d236b78afba1ebeaee2911", "sha256": "2b38f959fb434b9b5ba07109a925af1870cb2a264149a48d4de20a1ae89dde79" }, "downloads": -1, "filename": "PDF Layout Scanner-1.3.2.tar.gz", "has_sig": false, "md5_digest": "cfb4857fa0d236b78afba1ebeaee2911", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6662, "upload_time": "2019-08-07T10:49:31", "url": "https://files.pythonhosted.org/packages/ad/b4/49464a4f9f865448b4ff71a454c76ce3c766787d0374a5169a697aaacacc/PDF%20Layout%20Scanner-1.3.2.tar.gz" } ], "1.3.3": [ { "comment_text": "", "digests": { "md5": "a88cfe93b5007401a57ba731f7b2a79f", "sha256": "d12f68b5fc3c185194df0ff159a7ad7c9b666bd50f62c3148458bb9a023dfbb3" }, "downloads": -1, "filename": "PDF_Layout_Scanner-1.3.3-py3-none-any.whl", "has_sig": false, "md5_digest": "a88cfe93b5007401a57ba731f7b2a79f", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8149, "upload_time": "2019-09-01T01:12:40", "url": "https://files.pythonhosted.org/packages/a0/d8/7656355891284fe03d1a1b63285972c36103205d288887cf2cecd7633894/PDF_Layout_Scanner-1.3.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "0f28ab5482aeaef5449d9508ad210676", "sha256": "507596d9fe49eec4038869543f0696a29847e177b11fb5c4a77d80610509b945" }, "downloads": -1, "filename": "PDF Layout Scanner-1.3.3.tar.gz", "has_sig": false, "md5_digest": "0f28ab5482aeaef5449d9508ad210676", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6792, "upload_time": "2019-09-01T01:12:41", "url": "https://files.pythonhosted.org/packages/e5/32/ac5d57c31917685e9f4da5547c5c69da71bf06d84fb8c88837e9ccba710a/PDF%20Layout%20Scanner-1.3.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "a88cfe93b5007401a57ba731f7b2a79f", "sha256": "d12f68b5fc3c185194df0ff159a7ad7c9b666bd50f62c3148458bb9a023dfbb3" }, "downloads": -1, "filename": "PDF_Layout_Scanner-1.3.3-py3-none-any.whl", "has_sig": false, "md5_digest": "a88cfe93b5007401a57ba731f7b2a79f", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 8149, "upload_time": "2019-09-01T01:12:40", "url": "https://files.pythonhosted.org/packages/a0/d8/7656355891284fe03d1a1b63285972c36103205d288887cf2cecd7633894/PDF_Layout_Scanner-1.3.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "0f28ab5482aeaef5449d9508ad210676", "sha256": "507596d9fe49eec4038869543f0696a29847e177b11fb5c4a77d80610509b945" }, "downloads": -1, "filename": "PDF Layout Scanner-1.3.3.tar.gz", "has_sig": false, "md5_digest": "0f28ab5482aeaef5449d9508ad210676", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6792, "upload_time": "2019-09-01T01:12:41", "url": "https://files.pythonhosted.org/packages/e5/32/ac5d57c31917685e9f4da5547c5c69da71bf06d84fb8c88837e9ccba710a/PDF%20Layout%20Scanner-1.3.3.tar.gz" } ] }