{ "info": { "author": "Swen Vermeul \u2022 ID SIS \u2022 ETH Z\u00fcrich", "author_email": "swen@ethz.ch", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# TEI parser\n\nThis is a parser written in Python 3 that takes TEI-XML documents as input and writes them to a [Neo4j Graph Database](https://neo4j.com).\n\nIt makes use of the following existing libraries:\n* [Beautiful Soup 4](https://beautiful-soup-4.readthedocs.io/en/latest/)\n* [Spacy](https://spacy.io). Currently we use the \n* [Py2neo v4](https://py2neo.org/v4/), which is a library for working with the Neo4j database.\n\n\n## Synopsis\n\n```\nfrom py2neo import Graph\nfrom tei2neo import parse, GraphUtils\n\n# connect to a local Neo4j instance\ngraph = Graph(host=\"localhost\", user=\"neo4j\", password=\"password\")\n\n# parse a TEI-XML file and store the result in Neo4j\nfile = '20_MS_221_1.xml'  # path to the TEI-XML file\ndoc, status, soup = parse(\n    filename=file,\n    start_with_tag='TEI',\n    idno='20-MS-221'\n)\ntx = graph.begin()\ndoc.save(tx)\ntx.commit()\n\nut = GraphUtils(graph)\nparas = ut.paragraphs_for_filename('20_MS_221_1.xml')\n\n# create unhyphenated tokens\nfor para in paras:\n    tokens = ut.tokens_in_paragraph(para)\n    ut.create_unhyphenated(tokens)\n\n# show hyphenated text, marking each line break with ' | '\nfor token in ut.tokens_in_paragraph(paras[5], concatenated=0):\n    if 'lb' in token.labels:\n        print(' | ', end='')\n    print(token.get('string',''), end='')\n    print(token.get('whitespace', ''), end='')\n\n# show concatenated (non-hyphenated) version of the text\nfor token in ut.tokens_in_paragraph(paras[5], concatenated=1):\n    if 'lb' in token.labels:\n        print(' ', end='')\n\n    print(token.get('string',''), end='')\n    print(token.get('whitespace', ''), end='')\n```\n\n# How the parser works\n\nA TEI document can be constructed in various ways and there are many elements that work very similarly. 
Likewise, this parser expects certain elements and treats them in a specific manner.\n\n## Elements that affect all following elements\n\n### handShift\n\nA `handShift` element **affects all elements that follow it**, until another `handShift` element is encountered.\n\n**Example**\n\nFrom now on everything is written in \u00abLatein\u00bb and a pencil is being used (medium=Blei):\n\n```\n<handShift new=\"#hWH\" medium=\"Blei\" script=\"Latein\"/>\n```\n\nNow we switch to \u00abKurrent\u00bb script and use black ink (STinte):\n\n```\n\n```\n\n**Appearance in Neo4j**\n\nAs we have seen, a `handShift` element contains three attributes:\n\n* new=\"#hWH\"\n* medium=\"Blei\"\n* script=\"Latein\"\n\nThese attributes are passed on to all Token elements that follow the `handShift`. Previous attributes are not deleted, i.e. if only the medium changes from \u00abBlei\u00bb to \u00abSTinte\u00bb, all other attributes stay the same.\nThe `handShift` element will *not* appear as a node in Neo4j.\n\n\n### delSpan\n\nA `delSpan` element works much like a `handShift` element, as it alters the appearance of all the following text until it reaches its `spanTo` target:\n\n```\n\n... 
(a lot of XML code here)\n\n```\n\n**Appearance in Neo4j**\n\n* Both the `delSpan` and the `anchor` appear as additional nodes.\n* All elements between the `delSpan` and the `anchor` element receive an additional `delSpan` label.\n* A `delSpan` attribute is added to every such element; its value is equal to the `xml:id` attribute of the anchor.\n\n\n\n## Elements that affect all contained elements\n\n### del\n\n### add\n\n### rs", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://sissource.ethz.ch/sis/semper-tei", "keywords": "", "license": "BSD", "maintainer": "", "maintainer_email": "", "name": "tei2neo", "package_url": "https://pypi.org/project/tei2neo/", "platform": "", "project_url": "https://pypi.org/project/tei2neo/", "project_urls": { "Homepage": "https://sissource.ethz.ch/sis/semper-tei" }, "release_url": "https://pypi.org/project/tei2neo/0.1.0/", "requires_dist": null, "requires_python": ">=3.5", "summary": "TEI (Text Encoding Initiative) parser to extract information and store it in Neo4j database", "version": "0.1.0" }, "last_serial": 5924209, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "2a124437d6680464675df489d82b0180", "sha256": "4e5c4b207698bbe942aea9b45c470ec456799ded4a079d459be4375403829e38" }, "downloads": -1, "filename": "tei2neo-0.1.0.tar.gz", "has_sig": false, "md5_digest": "2a124437d6680464675df489d82b0180", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 17455, "upload_time": "2019-10-03T16:07:25", "url": "https://files.pythonhosted.org/packages/25/0a/52a12292c9f7d06ce204e52987def1e7a6aaf396501579fcd79e1feed606/tei2neo-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "2a124437d6680464675df489d82b0180", "sha256": "4e5c4b207698bbe942aea9b45c470ec456799ded4a079d459be4375403829e38" }, "downloads": -1, "filename": "tei2neo-0.1.0.tar.gz", 
"has_sig": false, "md5_digest": "2a124437d6680464675df489d82b0180", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 17455, "upload_time": "2019-10-03T16:07:25", "url": "https://files.pythonhosted.org/packages/25/0a/52a12292c9f7d06ce204e52987def1e7a6aaf396501579fcd79e1feed606/tei2neo-0.1.0.tar.gz" } ] }