{ "info": { "author": "Andrew Marks", "author_email": "ajmarks@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3 :: Only", "Topic :: Text Processing", "Topic :: Utilities" ], "description": "Gymnast: It's not Acrobat\r\n----------------------------\r\n\r\n|GitHub license| |Code Issues|\r\n\r\nPDF parser written in Python 3 (backport to 2.7 in the works). This was\r\ndesigned to provide a Pythonic interface to access (and, eventually,\r\nwrite) Adobe PDF files. Some attributes have non-Pythonic\r\ncapitalization, but that is to match the underlying structure of the PDF\r\ndocument (doing otherwise would get very confusing).\r\n\r\nUsage\r\n-----\r\n\r\n.. 
code:: python\r\n\r\n import io\r\n from gymnast import PdfDocument\r\n from gymnast.renderer import PdfBaseRenderer\r\n\r\n class PdfSimpleRenderer(PdfBaseRenderer):\r\n \"\"\"Simple renderer example that just extracts text with no processing\"\"\"\r\n def __init__(self, page):\r\n super().__init__(page)\r\n self._text = io.StringIO()\r\n def _render_text(self, text, new_state):\r\n self._text.write(self.active_font.decode_string(text))\r\n def _return(self):\r\n return self._text.getvalue()\r\n\r\n fname = '/path/to/file.pdf'\r\n pdf = PdfDocument(fname).parse()\r\n text = PdfSimpleRenderer(pdf.Pages[-3]).render()\r\n\r\nTODO (in no particular order)\r\n-----------------------------\r\n\r\n- **Features and functionality**\r\n- [x] Rewrite the parser and document class to lazy-load the document\r\n based on the xrefs table\r\n- [x] Complete the base page renderer\r\n- [ ] Page Rendering\r\n\r\n - [x] Getting the ``BaseRenderer`` class working\r\n - [x] Implement a proof of concept extractor that just dumps strings\r\n - [ ] Get a bit fancier, assigning textblocks to lines and such\r\n\r\n- [ ] Handle page numbering more fully\r\n\r\n - [ ] Add a method to ``PdfDocument`` to get a page by number\r\n - [ ] Add properties to ``PdfPage`` for the page number (both as an\r\n ``int`` and a formatted ``str`` according to\r\n ``PdfDocument.Root.PageLabels['Nums']``)\r\n\r\n- [ ] Backport to Python 2.7 (about 80% done or so)\r\n- [ ] Font stuff\r\n\r\n - [x] Carve the ``PdfFont`` class into an abstract ``PdfBaseFont``\r\n and a ``PdfType1Font`` implementation\r\n - [x] ``PdfFont.__new__`` will pick the correct subclass based on\r\n the font's Subtype element\r\n - [x] ``PdfBaseFont`` class will also have an abstract method for the\r\n glyph space to text space transformation\r\n - [ ] Add subclass for Type3 fonts\r\n - [x] Add subclass for TrueType fonts\r\n - [ ] Add subclass for composite fonts\r\n - [x] Add legacy support for the 14 standard fonts\r\n - [ ] Font-to-unicode 
CMAPs\r\n\r\n- [ ] Implement the remaining ``StreamFilter``\\ s (will probably have\r\n the image ones return a ``PIL.Image``)\r\n\r\n - [ ] ``RunLengthDecode``\r\n - [ ] ``CCITTFaxDecode``\r\n - [ ] ``JBIG2Decode``\r\n - [ ] ``DCTDecode``\r\n - [ ] ``JPXDecode``\r\n - [ ] ``Crypt``\r\n\r\n- [ ] Implement remaining object types\r\n\r\n - [ ] ``ObjStm``\r\n - [x] ``XRef``\r\n - [ ] ``Filespec``\r\n - [ ] ``EmbeddedFile``\r\n - [ ] ``CollectionItem`` / ``CollectionSubitem``\r\n - [ ] ``XObject``\r\n\r\n- [ ] Handle document encryption\r\n- [ ] Start on graphics stuff (maybe)\r\n- [ ] Interactive forms (AcroForms)\r\n- **Administrative**\r\n- [ ] Write tests for existing code\r\n- [x] Come up with a better name\r\n- [ ] Document everything much, much better internally\r\n- [ ] Package it up neatly and pypi it\r\n- [ ] Write some proper documentation\r\n\r\n.. |GitHub license| image:: https://img.shields.io/github/license/mashape/apistatus.svg\r\n :target: https://github.com/ajmarks/pdf_parser/blob/master/LICENSE\r\n.. 
|Code Issues| image:: https://www.quantifiedcode.com/api/v1/project/d0106c63f4f8467586aae7498f148e94/badge.svg\r\n :target: https://www.quantifiedcode.com/app/project/d0106c63f4f8467586aae7498f148e94", "description_content_type": null, "docs_url": null, "download_url": "https://github.com/ajmarks/gymnast/tarball/0.1a5", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/ajmarks/gymnast/", "keywords": "pdf,acrobat", "license": "MIT License", "maintainer": "", "maintainer_email": "", "name": "gymnast", "package_url": "https://pypi.org/project/gymnast/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/gymnast/", "project_urls": { "Download": "https://github.com/ajmarks/gymnast/tarball/0.1a5", "Homepage": "https://github.com/ajmarks/gymnast/" }, "release_url": "https://pypi.org/project/gymnast/0.1a5/", "requires_dist": null, "requires_python": null, "summary": "Gymnast: PDF document parser in Python 3", "version": "0.1a5" }, "last_serial": 1824960, "releases": { "0.1a5": [ { "comment_text": "", "digests": { "md5": "09ad7634c63246c7bc128d55ec9abae4", "sha256": "66eeb12762d7af83acacf2c3af69ab0af282cb7150167106ac6aef2e05de0b51" }, "downloads": -1, "filename": "gymnast-0.1a5.zip", "has_sig": false, "md5_digest": "09ad7634c63246c7bc128d55ec9abae4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 212024, "upload_time": "2015-11-18T23:22:10", "url": "https://files.pythonhosted.org/packages/0d/47/2bf455d7817f0fe22de578d3467892fbb9eca551684b3cc6ae963153f96c/gymnast-0.1a5.zip" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "09ad7634c63246c7bc128d55ec9abae4", "sha256": "66eeb12762d7af83acacf2c3af69ab0af282cb7150167106ac6aef2e05de0b51" }, "downloads": -1, "filename": "gymnast-0.1a5.zip", "has_sig": false, "md5_digest": "09ad7634c63246c7bc128d55ec9abae4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 212024, "upload_time": 
"2015-11-18T23:22:10", "url": "https://files.pythonhosted.org/packages/0d/47/2bf455d7817f0fe22de578d3467892fbb9eca551684b3cc6ae963153f96c/gymnast-0.1a5.zip" } ] }