{ "info": { "author": "Jon Ribbens", "author_email": "", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6" ], "description": "# pycpdf - Extract content and metadata from PDF files efficiently\n\npycpdf is a Python extension to enable read-only access to the contents of\nPDF files. It is similar to PyPDF/PyPDF2 except that it is read-only, is much\nquicker, and has some of the bugs fixed.\n\n## Usage\n\nThe main class is `pycpdf.PDF`, which is given the PDF data as a byte-string\nand, optionally, a decryption password. This object has the following\nproperties:\n\n - `pages`: a list of `Page` objects\n - `version`: the PDF file format version as a string (e.g. `\"1.6\"`)\n - `info`: the Document Information Dictionary, or `None`\n - `linearized`: the Linearization Parameter Dictionary, or `None`\n - `catalog`: the Document Catalog dictionary\n - `trailer`: the trailer dictionary\n\n`Page` objects are similar to dictionaries, but they have the following two\nextra properties:\n\n - `text`: the textual contents of the page\n - `contents`: a list of the PDF operators that comprise the page contents\n\n`StreamObject` objects are also similar to dictionaries, but have the\nfollowing two extra properties:\n\n - `data`: the binary data contained within the stream\n - `contents`: the binary data parsed as a list of PDF operators\n\n## Unicode string simplification\n\nThe module also contains `unicode_translations`, which is a dictionary\nsuitable for passing to `str.translate` to simplify Unicode strings\nsomewhat:\n\n - replace various spaces (00A0, 2000-200A, 3000) with an ASCII space\n - remove soft hyphens (00AD) and zero width space (200B)\n - replace various hyphens (2010-2014, 2212) with an ASCII hyphen\n - replace various sexed quotation marks (2018, 2019, 201C, 201D)\n with the equivalent unsexed ASCII quotation marks\n - replace the horizontal ellipsis character (2026) with '...'\n - replace various latin ligatures (ae, dz, ff, fi, fl, ffi, ffl, st)\n with their equivalent ASCII character strings\n\n## Example\n\n```python\npdf = pycpdf.PDF(open('file.pdf', 'rb').read())\nif pdf.info and pdf.info.get('Title'):\n print('Title:', pdf.info['Title'])\nfor pageno, page in enumerate(pdf.pages):\n print('Page', pageno + 1)\n print(page.text.translate(pycpdf.unicode_translations))\n```\n\n## Notes\n\nAll strings are always Unicode. Note that, unlike in PyPDF, name objects do\nnot start with a forward slash `/` - i.e. you should reference `obj['Type']`\nnot `obj['/Type']`. This is because the PDF Reference unequivocally states\nthat the slash is not part of the name.\n\nThe `page.text` property lists the textual contents of the page in the order\nthat the relevant operators occur. It may sometimes be out of order, and\nthere may be unexpected spacing.\n\nIf container objects (arrays and dictionaries) contain indirect objects then\nthese objects will not be extracted from the PDF file until they are\nreferenced for the first time. This is to enable opening large PDF files\nefficiently. Altering the contents of container objects is not actively\nprevented but should never be done and will lead to undefined behaviour.\n\nThe intention of the extension is to allow fast low-level access to all the\ninternal structures of PDF files. Higher-level interfaces could be added in\nPython to allow, for example, better text extraction or access to images.\n\nSome encryption and image encoding methods are not currently supported.\n\n## History\n\n### 1.0.4 (2018-04-18)\n\n - Bug fixes to do with in-line images\n\n### 1.0.3 (2017-11-26)\n\n - Bug-fixed StreamObject.contents\n - When decoding text, if ToUnicode fails, fall back to Encoding\n\n### 1.0.2 (2017-11-26)\n\n - Extended documentation slightly.\n - Added Travis auto-deployment to PyPI.\n - Added `pycpdf.__version__`\n\n### 1.0.1 (2017-11-23)\n\n - Initial release.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/jribbens/pycpdf", "keywords": "pdf pypdf", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "pycpdf", "package_url": "https://pypi.org/project/pycpdf/", "platform": "", "project_url": "https://pypi.org/project/pycpdf/", "project_urls": { "Homepage": "https://github.com/jribbens/pycpdf" }, "release_url": "https://pypi.org/project/pycpdf/1.0.4/", "requires_dist": null, "requires_python": "", "summary": "Extract content and metadata from PDF files efficiently", "version": "1.0.4" }, "last_serial": 3777137, "releases": { "1.0.1": [ { "comment_text": "", "digests": { "md5": "aad15ebd82796134b2597b605b36f8df", "sha256": "09ebef8a27c470d8fcb58981ed778ab5e40462cabb441ee6565cffb5526da92e" }, "downloads": -1, "filename": "pycpdf-1.0.1.tar.gz", "has_sig": true, "md5_digest": "aad15ebd82796134b2597b605b36f8df", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 58482, "upload_time": "2017-11-23T00:30:24", "url": "https://files.pythonhosted.org/packages/92/12/014f284b69394baa716e38f77daff35d36f3894f4dcca1844444e95fe6a3/pycpdf-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "c309a64026a87703f47c668c0a84eb22", "sha256": "18015f75cb17a64f0acc35f03bd321138e405bb401e3426e6dbf736000c1a41d" }, "downloads": -1, "filename": "pycpdf-1.0.2.tar.gz", "has_sig": true, "md5_digest": "c309a64026a87703f47c668c0a84eb22", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 59180, "upload_time": "2017-11-26T18:00:16", "url": "https://files.pythonhosted.org/packages/37/27/7a4e5bf5935f30411797ddcd14ed12ae33997e2485811611ca928d07677c/pycpdf-1.0.2.tar.gz" } ], "1.0.3": [ { "comment_text": "", "digests": { "md5": "d33a9c3819eac29f8001091647edbe72", "sha256": "fd3efbe9c63d9144e72fc9c96f5d5dfefd89d099123f5dcf5603887109cd6d57" }, "downloads": -1, "filename": "pycpdf-1.0.3.tar.gz", "has_sig": false, "md5_digest": "d33a9c3819eac29f8001091647edbe72", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 57782, "upload_time": "2017-11-26T22:03:02", "url": "https://files.pythonhosted.org/packages/c5/11/15a3c09e7e8d0ce486aaa2a31f3e2babd4d3453016bbe90a83b11f82bf16/pycpdf-1.0.3.tar.gz" } ], "1.0.4": [ { "comment_text": "", "digests": { "md5": "d725cc175ce5d3b73f5a3bcfa21a7ef1", "sha256": "dfc876d29bf48ee2a31735704d11620f50ba9b976a75dba8185c839c55735d34" }, "downloads": -1, "filename": "pycpdf-1.0.4.tar.gz", "has_sig": false, "md5_digest": "d725cc175ce5d3b73f5a3bcfa21a7ef1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 61356, "upload_time": "2018-04-18T13:27:48", "url": "https://files.pythonhosted.org/packages/73/46/8b9925390240cdbab38a4deec1089c8d3176e202dc8b1735000ee8890f2c/pycpdf-1.0.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "d725cc175ce5d3b73f5a3bcfa21a7ef1", "sha256": "dfc876d29bf48ee2a31735704d11620f50ba9b976a75dba8185c839c55735d34" }, "downloads": -1, "filename": "pycpdf-1.0.4.tar.gz", "has_sig": false, "md5_digest": "d725cc175ce5d3b73f5a3bcfa21a7ef1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 61356, "upload_time": "2018-04-18T13:27:48", "url": "https://files.pythonhosted.org/packages/73/46/8b9925390240cdbab38a4deec1089c8d3176e202dc8b1735000ee8890f2c/pycpdf-1.0.4.tar.gz" } ] }