{ "info": { "author": "Microsoft Research", "author_email": "jathorpe@microsoft.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "Operating System :: Microsoft :: Windows", "Operating System :: POSIX", "Operating System :: Unix", "Programming Language :: Python", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Programming Language :: Python :: Implementation :: CPython", "Programming Language :: Python :: Implementation :: PyPy", "Topic :: Utilities" ], "description": "# Overview\n\nDOCX files are complex, and their complexity makes scraping documents\nfor their content difficult. The aim of this package is to simplify\n`.docx` files to just the components which carry meaning thereby easing the\nprocess of document identification and scraping by converting a `.docx`\nfile into a predictable an *human readable* JSON file.\n\nSimplifying a complex document down to it's *meaningful* parts of course\nrequires taking a position on what does and does-not convey meaning in a\ndocument. Generally, this package takes the stance that the document\nstructure (body, paragraphs, tables, etc.) are meaningful as is the text\nitself, whereas text styling (font, font-weight, etc.) is ignored almost\nentirely, with the exception of paragraph indentation and numbering which\nis often used to create lists, block quotes, etc. Furthermore, the\nopinions expressed by this package are explained in the Options section\nbelow and can be changed to suite your needs.\n\n# Usage\n```python\nimport docx\nfrom simplify_docx import simplify\n\n# read in a document \nmy_doc = docx.Document(\"/path/to/my/favorite/file.docx\")\n\n# coerce to JSON using the standard options\nmy_doc_as_json = simplify(my_doc)\n\n# or with non-standard options\nmy_doc_as_json = simplify(my_doc,{\"remove-leading-white-space\":False})\n```\n\n# Installation\n\nThis project relies on the `python-docx` package which can be installed via\n`pip install python-docx`. **However**, as of this writing, if you wish to\nscrape documents which contain (A) form fields such as drop down lists,\ncheckboxes and text inputs or (B) nested documents (subdocs, altChunks,\netc.), you'll need to clone [this fork](https://github.com/jdthorpe/python-docx) of the python-docx package.\n\n# Options\n\n### General\n\n* **\"friendly-names\"**: (*Default = `True`*): Use user-friendly type names\n\tsuch as \"table-cell\", over standard element names like \"CT_Tc\"\n\n### Ignoring Invisible things\n\n* **\"ignore-empty-paragraphs\"**: (*Default = `True`*): Empty paragraphs are\n\toften used for styling purpose and rarely have significance in the\n\tmeaning of the document.\n* **\"ignore-empty-text\"**: (*Default = `True`*): Empty text runs can make an\n\totherwise empty paragraph appear to contain data.\n* **\"remove-leading-white-space\"**: (*Default = `True`*): Leading white-space\n\tat the start of a paragraph is ocassionaly used for styling purposes\n\tand rarely has significance in the interpretation of a document.\n* **\"remove-trailing-white-space\"**: (*Default = `True`*): Trailing white-space\n\tat the end of a paragraph rarely has significance in the interpretation\n\tof a document.\n* **\"flatten-inner-spaces\"**: (*Default = `False`*): Collapse multiple\n\tspace characters between words to a single space.\n* **\"ignore-joiners\"**: (*Default = `False`*): Zero width joiner and non-joiner \n\tcharacters are special characters used to create ligatures in displayed\n\ttext and don't typically convey meaning (at least in alphabet based\n\tlanguages).\n\n### Special symbols\n\n* **\"dumb-quotes\"**: (*Default = `True`*): Replace smart quotes with\n\tdumb quotes.\n* **\"dumb-hyphens\"**: (*Default = `True`*): Replace en-dash, em-dash,\n\tfigure-dash, horizontal bar, and non-breaking hyphens with ordinary hyphens.\n* **\"dumb-spaces\"**: (*Default = `True`*): Replace zero width spaces, hair \n\tspaces, thin spaces, punctuation spaces, figure spaces, six per em\n\tspaces, four per em spaces, three per em spaces, em spaces, en spaces,\n\tem quad spaces, and en quad spaces with ordinary spaces.\n* **\"special-characters-as-text\"**: (*Default = `True`*): Coerce special\n\tcharacters into text equivalents according to the following table:\n\n| Character | Text Equivalent | \n| --------- | --------------- | \n| CarriageReturn | `\\n` |\n| Break | `\\r` |\n| TabChar | `\\t` |\n| PositionalTab | `\\t` |\n| NoBreakHyphen | `-` |\n| SoftHyphen | `-` |\n\n* **\"symbol-as-text\"**: (*Default = `True`*): Special symbols often cary\n\tmeaning other than the underlying unicode character, especially when\n\tthe font is a special font such as `Wingdings`. If `True` these are\n\tincluded as ordinary text and their font information is omitted.\n* **\"empty-as-text\"**: (*Default = `False`*): There are a variety of \"Empty\"\n\ttags such as the `<\"w:yearLong\">` tag which cause the current year to\n\tbe inserted into the document text. If `True`, include these as text\n\tformatted as `\"[yearLong]\"`.\n* **\"ignore-left-to-right-mark\"**: (*Default = `False`*): Ignore the left-to-right\n\tmark, which is not writeable by pythons csv writer.\n* **\"ignore-right-to-left-mark\"**: (*Default = `False`*): Ignore the right-to-left\n\tmark which is not writeable by pythons csv writer.\n\n### Paragraph style:\n\nParagraph style markup are one exception to the styling vs. content\ndichotomy. For example, block quotes are often indicated by indenting whole\nparagraphs, and Ordered lists, Unordered lists and nesting of lists is\noften used to divide sections of a document into logical components. \n\n* **\"include-paragraph-indent\"**: (*Default = `True`*): Include the\n\tindentation markup on paragraph (`CT_P`) elements. Indentation is\n\tmeasured in twips\n* **\"include-paragraph-numbering\"**: (*Default = `True`*): Include the\n\tnumbering styles, which are included in the `CT_P.pPr.numPr` element.\n\tThe `ilvl` attribute indicates the level of nesting (zero based index)\n\tand the `numId` attribute refers to a specific numbering style\n\tincluded in the document's internal styles sheet. \n\n### Form Elements\n\n* **\"simplify-dropdown\"**: (*Default = `True`*): Include just the selected\n\tand default values, the available options, and the name and label attributes in the form element.\n* **\"simplify-textinput\"**: (*Default = `True`*): Include just the current\n\tand default values, and the name and label attributes in the form element.\n* **\"greedy-text-input\"**: (*Default = `True`*): Continue consuming run\n\telements when the text-input has not ended at the end of a paragraph,\n\tand the next block level element is also a paragraph. This typically\n\toccurs when the user preses the return key while editing a text input\n\tfield.\n* **\"simplify-checkbox\"**: (*Default = `True`*): Include just the current\n\tand default values, and the name and label attributes in the form element.\n* **\"use-checkbox-default\"**: (*Default = `True`*): If the checkbox has no\n\t`value` attribute (typically because the user has not interacted with\n\tit), report the default value as the checkbox value.\n* **\"checkbox-as-text\"**: (*Default = `False`*): Coerce the value of the\n\tcheckbox to text, represented as either `\"[CheckBox:True]\"` or `\"[CheckBox:False]\"`\n* **\"dropdown-as-text\"**: (*Default = `False`*): Coerce the value of the\n\tcheckbox to text, represented as `\"[DropDown:]\"`\n* **\"trim-dropdown-options\"**: (*Default = `True`*): Remove white-space on\n\tthe left and right of drop down option items.\n* **\"flatten-generic-field\"**: (*Default = `True`*): `generic-fields` are\n\t`CT_FldChar` runs which are not marked as a drop-down, text-input, or\n\tcheckbox. These may include special instructions which apply special\n\tformatting to a text run (e.g. a hyper link). If `True`, the contents\n\tof generic-fields are included in the normal flow of text\n\n### Special content\n\n* **\"merge-consecutive-text\"**: (*Default = `True`*): Sentences and even single\n\twords can be represented by multiple text elements. If `True`,\n\tconcatenate consecutive text elements into a single text element.\n* **\"flatten-hyperlink\"**: (*Default = `True`*): Flatten hyperlinks, including\n\ttheir contents in the flow of normal text.\n* **\"flatten-smartTag\"**: (*Default = `True`*): Flatten smartTag elements, \n\tincluding their contents in the flow of normal text.\n* **\"flatten-customXml\"**: (*Default = `True`*): Flatten customXml elements, \n\tincluding their contents in the flow of normal text.\n* **\"flatten-simpleField\"**: (*Default = `True`*): Flatten simpleField elements, \n\tincluding their contents in the flow of normal text.\n\n# Contributing\n\nThis project welcomes contributions and suggestions. Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. For details, visit https://cla.microsoft.com.\n\nWhen you submit a pull request, a CLA-bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://microsofteconomics.visualstudio.com/EconTools/_git/loadify?_a=history", "keywords": "sql,csv", "license": "", "maintainer": "", "maintainer_email": "", "name": "simplify-docx", "package_url": "https://pypi.org/project/simplify-docx/", "platform": "", "project_url": "https://pypi.org/project/simplify-docx/", "project_urls": { "Homepage": "https://microsofteconomics.visualstudio.com/EconTools/_git/loadify?_a=history" }, "release_url": "https://pypi.org/project/simplify-docx/0.1.0/", "requires_dist": [ "lxml (==4.3.3)", "more-itertools (==7.0.0)", "python-docx (==0.8.10)", "six (==1.12.0)", "wincertstore (==0.2)", "argparse ; python_version==\"2.6\"" ], "requires_python": "", "summary": "A utility for simplifying python-docx document objects", "version": "0.1.0" }, "last_serial": 5210453, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "2d25e5e90f36c326e74379916c07c85f", "sha256": "b4e70b49ca47451cd008707707020fb0f021983c63f10924f11ec7f61e623b4a" }, "downloads": -1, "filename": "simplify_docx-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "2d25e5e90f36c326e74379916c07c85f", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 27996, "upload_time": "2019-04-30T22:55:37", "url": "https://files.pythonhosted.org/packages/18/55/0950a725f5cec2a05af083ff102ab1832f09f86b857154b1939717b84f54/simplify_docx-0.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d8738f255f1e716e465c7692eb23183c", "sha256": "dd8a6e45e7d6977b5f67222c501bec0e4f4cbac22e48b22af16bd51289cccca6" }, "downloads": -1, "filename": "simplify-docx-0.1.0.tar.gz", "has_sig": false, "md5_digest": "d8738f255f1e716e465c7692eb23183c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23362, "upload_time": "2019-04-30T22:55:44", "url": "https://files.pythonhosted.org/packages/e8/1b/df0b52e87995cc8ab50549b0b6b7bcb8ee839883d5ef324be967fd801d43/simplify-docx-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "2d25e5e90f36c326e74379916c07c85f", "sha256": "b4e70b49ca47451cd008707707020fb0f021983c63f10924f11ec7f61e623b4a" }, "downloads": -1, "filename": "simplify_docx-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "2d25e5e90f36c326e74379916c07c85f", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 27996, "upload_time": "2019-04-30T22:55:37", "url": "https://files.pythonhosted.org/packages/18/55/0950a725f5cec2a05af083ff102ab1832f09f86b857154b1939717b84f54/simplify_docx-0.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d8738f255f1e716e465c7692eb23183c", "sha256": "dd8a6e45e7d6977b5f67222c501bec0e4f4cbac22e48b22af16bd51289cccca6" }, "downloads": -1, "filename": "simplify-docx-0.1.0.tar.gz", "has_sig": false, "md5_digest": "d8738f255f1e716e465c7692eb23183c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23362, "upload_time": "2019-04-30T22:55:44", "url": "https://files.pythonhosted.org/packages/e8/1b/df0b52e87995cc8ab50549b0b6b7bcb8ee839883d5ef324be967fd801d43/simplify-docx-0.1.0.tar.gz" } ] }