{ "info": { "author": "Danilo de Jesus da Silva Bellini", "author_email": "danilo.bellini@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Environment :: Other Environment", "Environment :: Web Environment", "Framework :: Flask", "Intended Audience :: Science/Research", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Programming Language :: Python :: 3 :: Only", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Text Processing", "Topic :: Text Processing :: Markup", "Topic :: Text Processing :: Markup :: XML" ], "description": "# Clea\n\nThis project is an XML front matter metadata reader for documents\nthat *almost* follows the [SciELO Publishing Schema],\nextracting and sanitizing the values regarding the affiliations.\n\n\n## Installation\n\nOne can install Clea with either:\n\n```\npip install scielo-clea # Minimal\npip install scielo-clea[cli] # Clea with CLI (recommended)\npip install scielo-clea[server] # Clea with the testing/example server\npip install scielo-clea[all] # Clea with both CLI and the server\n```\n\nActually all these commands installs everything,\nonly the dependencies aren't the same.\nThe first is an installation with minimal requirements,\nintended for use within Python, as an imported package.\n\n\n## Running the command line interface\n\nThe CLI is a way to use Clea as an article XML to JSONL converter\n(one JSON output line for each XML input):\n\n```\nclea -o output.jsonl article1.xml article2.xml article3.xml\n```\n\nThe same can be done with ``python -m clea`` instead of ``clea``.\nThe output is the standard output stream.\nSee ``clea --help`` for more information.\n\n\n## Running the testing server\n\nYou can run the development server using the flask CLI.\nFor example, for listening at 8080 from every host:\n\n```\nFLASK_APP=clea.server flask run -h 0.0.0.0 -p 8080\n```\n\nIn a production server with 4 worker processes for handling requests,\nyou can, for example:\n\n- Install gunicorn (it's not a dependency)\n- Run `gunicorn -b 0.0.0.0:8080 -w 4 clea.server:app`\n\n\n## Clea as a library\n\nA simple example to see all the extracted data is:\n\n```python\nfrom clea import Article\nfrom pprint import pprint\n\nart = Article(\"some_file.xml\")\npprint(art.data_full)\n```\n\nThat's a dictionary of lists with all the \"raw\" extracted data.\nThe keys of that dictionary can be directly accessed,\nso one can avoid extracting everything from the XML\nby getting just the specific items/attributes\n(e.g. `art[\"journal_meta\"][0].data_full`\n or `art.journal_meta[0].data_full`\n instead of `art.data_full[\"journal_meta\"][0]`).\nThese items/attributes are always lists, for example:\n\n* `art[\"aff\"]`: List of `clea.core.Branch` instances\n* `art[\"sub_article\"]`: List of `clea.core.SubArticle` instances\n* `art[\"contrib\"][0][\"contrib_name\"]`: List of strings\n\nWhere the `art[\"contrib\"][0]` is a `Branch` instance,\nand all such instances behave in the same way\n(there's no nested branches).\nThat can be seen as another way to navigate in the former dictionary,\nthe last example should return the same list one would get with\n`art.data_full[\"contrib\"][0][\"contrib_name\"]`,\nbut without extracting everything else\nthat appears in the `art.data_full` dictionary.\n\nMore simple stuff that can be done:\n\n```python\nlen(art.aff) # Number of entries\nlen(art.sub_article) # Number of \nart.contrib[0].data_full # Data from the first contributor as a dict\n\n# Something like {\"type\": [\"translation\"], \"lang\": [\"en\"]},\n# the content from attributes\nart[\"sub_article\"][0][\"article\"][0].data_full\n\n# A string with the article title, accessing just the desired content\nart[\"article_meta\"][0][\"article_title\"][0]\n```\n\nAll `SubArticle`, `Article` and `Branch` instances\nhave the `data_full` property and the `get` method,\nthe latter being internally used for item/attribute getting.\nTheir behavior is:\n\n* `Branch.get` always returns a list of strings\n* `Article.get(\"sub_article\")` returns a list of `SubArticle`\n* `Article.get(...)` returns a list of `Branch`\n* `SubArticle` behaves like `Article`\n\nThe extracted information is not exhaustive!\nIts result should not be seen as a replacement of the raw XML.\n\nOne of the goals of this library was\nto help on creating a tabular data from a given XML\nwith as many rows as required\nto have a pair of a matching `` and `` in each row.\nThese are the `Article` methods/properties that does that matching:\n\n* `art.aff_contrib_inner_gen()`\n* `art.aff_contrib_full_gen()`\n* `art.aff_contrib_inner`\n* `art.aff_contrib_full`\n* `art.aff_contrib_inner_indices`\n* `art.aff_contrib_full_indices`\n\nThe most useful ones are probably the last ones,\nwhich return a list of pairs (tuples) of indices (ints),\nso one can use a `(ai, ci)` result\nto access the `(art.aff[ai], art.contrib[ci])` pair,\nunless the index is `-1` (not found).\nThe ones with the `_gen` suffix are generator functions\nthat yields a tuple with two `Branch` entries (or `None`),\nthe ones without a suffix return a list of merged dictionaries\nin an almost tabular format (dictionary of lists of strings).\nEach list regarding these elements for these specific elements\nshould usually have at most one string,\nbut that's not always the case even for these specific elements,\nthen one should be careful when using the `data` property.\n\nThe `inner` and `full` in the names\nregards to `INNER JOIN` and `FULL OUTER JOIN` from SQL,\nmeaning the unmatched elements\n(all `` and `` unreferred nodes)\nare discarded in the former strategy,\nwhereas they're forcefully matched with `None` in the latter.\n\nTo print all the extracted data from a XML\nincluding the indices of matching `` and `` pairs\nperformed in the `FULL OUTER JOIN` sense,\nsimilar to the test server response:\n\n```python\npprint({\n **article.data_full,\n \"aff_contrib_pairs\": article.aff_contrib_full_indices,\n})\n```\n\n\n[SciELO Publishing Schema]: http://docs.scielo.org/projects/scielo-publishing-schema", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/scieloorg/clea", "keywords": "", "license": "2-clause BSD", "maintainer": "", "maintainer_email": "", "name": "scielo-clea", "package_url": "https://pypi.org/project/scielo-clea/", "platform": "", "project_url": "https://pypi.org/project/scielo-clea/", "project_urls": { "Homepage": "https://github.com/scieloorg/clea" }, "release_url": "https://pypi.org/project/scielo-clea/0.4.0/", "requires_dist": null, "requires_python": ">=3.6", "summary": "SciELO Publishing Schema XML document front matter metadata reader/sanitizer", "version": "0.4.0" }, "last_serial": 5393343, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "34e0360c2df6674683bac4f8de2d4c2f", "sha256": "b8fa46bb23e583923136a1aba60941ee4d07a0ba416d46ee7d9f85dda914a08e" }, "downloads": -1, "filename": "scielo-clea-0.1.0.tar.gz", "has_sig": false, "md5_digest": "34e0360c2df6674683bac4f8de2d4c2f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 6415, "upload_time": "2018-06-04T21:39:04", "url": "https://files.pythonhosted.org/packages/0b/e0/437840fe7fd642c5e7c4612da9a48a9935ebeceef4c776b1fef3935370bd/scielo-clea-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "60a6f1a9161b213d03084fc09b5aa199", "sha256": "35ba5e47fdc07ed45a3cda3cb1987e44af586d3775dc0a4b16237e0dcc5e87d8" }, "downloads": -1, "filename": "scielo-clea-0.1.1.tar.gz", "has_sig": false, "md5_digest": "60a6f1a9161b213d03084fc09b5aa199", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 6419, "upload_time": "2018-10-25T21:05:31", "url": "https://files.pythonhosted.org/packages/c4/17/21fb4b41b4998f42fcce8b28e0147139212d6577ef23ff742a76e83efe59/scielo-clea-0.1.1.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "2b8d99c657388a850a5bc905202d5f2f", "sha256": "7df35f98fa71a8e393c079a042480560bc3b32174b1bd89449d2c60bea6ca7d2" }, "downloads": -1, "filename": "scielo-clea-0.2.0.tar.gz", "has_sig": false, "md5_digest": "2b8d99c657388a850a5bc905202d5f2f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 10301, "upload_time": "2019-04-08T01:55:56", "url": "https://files.pythonhosted.org/packages/4b/ef/dc729fafa0c330bc8ee0c9415ef8d3f9e1a9a54a9ce60bebb4c23eaffb46/scielo-clea-0.2.0.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "62f7f30ad1dfccec98d90a10935505e1", "sha256": "e43b8c4140171fbdfd24464d04aa77289aa4650c9f6508c05f6ab6a297c11179" }, "downloads": -1, "filename": "scielo-clea-0.2.1.tar.gz", "has_sig": false, "md5_digest": "62f7f30ad1dfccec98d90a10935505e1", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 10293, "upload_time": "2019-04-10T16:44:57", "url": "https://files.pythonhosted.org/packages/fb/83/4281c2a0500faa48294a18140b7e386037887c741d75b213cd6a2fe51522/scielo-clea-0.2.1.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "66973121c4c609f64f5cec0bddb4ed68", "sha256": "699a0b73b8fab47c213ff364a96fd96f3eca49a20ff7223a38de8b17c00f40c9" }, "downloads": -1, "filename": "scielo-clea-0.2.2.tar.gz", "has_sig": false, "md5_digest": "66973121c4c609f64f5cec0bddb4ed68", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 10774, "upload_time": "2019-04-11T21:39:34", "url": "https://files.pythonhosted.org/packages/35/1c/815855ccce3e054e08dd8acd70eba3956d1fac5fc1f06d677f5708239879/scielo-clea-0.2.2.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "56e2bd1757cf673f8e3982701770e7a2", "sha256": "99da38d1c019e432718c018f7038b695fd33e000be1e4c07cafc705e07bd0bd2" }, "downloads": -1, "filename": "scielo-clea-0.3.0.tar.gz", "has_sig": false, "md5_digest": "56e2bd1757cf673f8e3982701770e7a2", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 11579, "upload_time": "2019-05-02T22:35:24", "url": "https://files.pythonhosted.org/packages/31/b7/3e191172c6c2bda676785906155a6f4ed6288b90dcd77e087dc014b2bfab/scielo-clea-0.3.0.tar.gz" } ], "0.3.1": [ { "comment_text": "", "digests": { "md5": "9191502aa31b5a1e38e3d8df6e6884ec", "sha256": "c63e45bfca4d8a698908aab9edeb4c69520921570ef6e185295b77cf26a7ff8f" }, "downloads": -1, "filename": "scielo-clea-0.3.1.tar.gz", "has_sig": false, "md5_digest": "9191502aa31b5a1e38e3d8df6e6884ec", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 12265, "upload_time": "2019-05-31T18:39:16", "url": "https://files.pythonhosted.org/packages/62/11/784599dec573937c827127c53218553eff4f422fa6eb3242f48a775db9b1/scielo-clea-0.3.1.tar.gz" } ], "0.4.0": [ { "comment_text": "", "digests": { "md5": "431e01f53b12d444e6d9af8bf55db274", "sha256": "8d9a11a9b77b1605e00b37c9e272ac79e34092d15da89723c5a8f354e7e82227" }, "downloads": -1, "filename": "scielo-clea-0.4.0.tar.gz", "has_sig": false, "md5_digest": "431e01f53b12d444e6d9af8bf55db274", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 14005, "upload_time": "2019-06-12T22:20:53", "url": "https://files.pythonhosted.org/packages/7d/da/c16b406a6edf2a4a12e9bcca643b9c5bc8754622bf7f2253fa685e0d5bfc/scielo-clea-0.4.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "431e01f53b12d444e6d9af8bf55db274", "sha256": "8d9a11a9b77b1605e00b37c9e272ac79e34092d15da89723c5a8f354e7e82227" }, "downloads": -1, "filename": "scielo-clea-0.4.0.tar.gz", "has_sig": false, "md5_digest": "431e01f53b12d444e6d9af8bf55db274", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 14005, "upload_time": "2019-06-12T22:20:53", "url": "https://files.pythonhosted.org/packages/7d/da/c16b406a6edf2a4a12e9bcca643b9c5bc8754622bf7f2253fa685e0d5bfc/scielo-clea-0.4.0.tar.gz" } ] }