{ "info": { "author": "Maarten van Gompel", "author_email": "", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "Operating System :: POSIX", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Internet :: WWW/HTTP :: WSGI :: Application", "Topic :: Text Processing :: Linguistic" ], "description": "Piereling\n===========\n\n.. image:: https://travis-ci.com/proycon/piereling.svg?branch=master\n :target: https://travis-ci.com/proycon/piereling\n\n.. image:: http://applejack.science.ru.nl/lamabadge.php/piereling\n :target: http://applejack.science.ru.nl/languagemachines/\n\n.. image:: https://www.repostatus.org/badges/latest/active.svg\n :alt: Project Status: Active \u2013 The project has reached a stable, usable state and is being actively developed.\n :target: https://www.repostatus.org/#active\n\n.. image:: https://img.shields.io/pypi/v/piereling\n :alt: Latest release in the Python Package Index\n :target: https://pypi.org/project/piereling/\n\n.. image:: https://raw.githubusercontent.com/proycon/piereling/master/logo.png\n :alt: Piereling Logo\n :align: center\n :scale: 40%\n\n**Piereling** is a webservice and web-application to convert between a\nvariety of document formats and to and from the `Format for Linguistic\nAnnotation `__ (FoLiA). It is intended\nto be used in Natural Language Processing pipelines. Piereling itself\ndoes not actually implement the convertors but relies on numerous other\nspecialised conversion tools in combination with notable third-party\ntools such as `pandoc `__ to accomplish its goals.\n\n*Piereling* is the word for earthworm in Limburgish dialect. 
Data\nconversion forms the *groundwork* for linguistic annotation, and these\nlittle worms, eating the input file on one end and secreting a\nconversion on the other end, perform that job.\n\nWe use `FoLiA `__ as our pivot format\nso you will mostly encounter conversions from or to FoLiA. FoLiA is a\nformat for Linguistic Annotation that also incorporates elaborate\ndocument structure and mark-up facilities. Another important\nintermediate format used in many of our conversions through pandoc is\n`ReStructuredText `__, a\nlightweight markup format. Although Pandoc supports a huge number of\nconversions between all its supported document formats, it is beyond the\nscope of this project to offer those in the webservice.\n\nAvailable Conversions\n---------------------\n\nConversions to FoLiA\n~~~~~~~~~~~~~~~~~~~~\n\nFrom Document and Markup Formats\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n- from **plain text**; uses ``txt2folia`` from\n `FoLiA-Tools `__.\n\n - In addition to an attempted extraction of text structure\n (paragraphs) by detecting blank lines, it also supports\n one-sentence-per-line and one-paragraph-per-line styles.\n - If you can deliver your input as ReStructuredText or Markdown then\n this is strongly preferred if you want to preserve structure\n and markup, as these formats resolve a lot of ambiguity inherent\n in unspecified plain text.\n - Information loss: None\n\n- from **ReStructuredText**; using ``rst2folia`` from\n `FoLiA-Tools `__.\n\n - Information loss: Minimal to None (almost all rst structures are\n supported)\n\n- from **Markdown**; via ReStructuredText using\n `pandoc `__ and then ``rst2folia`` from\n `FoLiA-Tools `__.\n\n - Information loss: Minimal to None (most markdown structures are\n supported; exceptions are mathematical formulas)\n\n- from **HTML**; via ReStructuredText using\n `pandoc `__ and then ``rst2folia`` from\n `FoLiA-Tools `__.\n\n - Information loss: Some; complex layout, complex tables, and\n imagery will generally get 
lost. Should only be used for\n semantically clean and simple HTML rather than complex\n presentational HTML from the web.\n\n- from **Word** (Office Open XML, docx); via ReStructuredText using\n `pandoc `__ and then ``rst2folia`` from\n `FoLiA-Tools `__.\n\n - Information loss: Some; complex layout, complex tables, and\n imagery will generally get lost.\n - Note that the legacy Word DOC format used up until 2007 is not\n supported, only the modern DOCX variant.\n\n- from **OpenDocument Text** (LibreOffice, odt); via ReStructuredText\n using `pandoc `__ and then ``rst2folia`` from\n `FoLiA-Tools `__.\n\n - Information loss: Some; complex layout, complex tables, and\n imagery will generally get lost.\n\n- from **EPUB**; via ReStructuredText using\n `pandoc `__ and then ``rst2folia`` from\n `FoLiA-Tools `__.\n\n - Information loss: Some; complex layout, complex tables, and\n imagery will generally get lost.\n\n- from **LaTeX**; via ReStructuredText using\n `pandoc `__ and then ``rst2folia`` from\n `FoLiA-Tools `__.\n\n - Information loss: Some to considerable; complex layout, complex\n tables, custom packages, math, and imagery will generally get\n lost.\n\n- from **MediaWiki** (as used by Wikipedia); via ReStructuredText using\n `pandoc `__ and then ``rst2folia`` from\n `FoLiA-Tools `__.\n\n - Information loss: Some; complex layout, complex tables, and\n Wikipedia-specific elements will generally get lost.\n\n- from **DocBook**; via ReStructuredText using\n `pandoc `__ and then ``rst2folia`` from\n `FoLiA-Tools `__.\n\n - Information loss: Unknown\n\n- from **TEI P5 XML** (Text Encoding Initiative); uses ``tei2folia``\n from `FoLiA-Tools `__.\n\n - TEI is a very extensive and flexible format with many different\n forms\n - Information loss: Our converter will only work for a certain\n subset of TEI, mostly equivalent to TEI Lite, and may fail on\n others. Though we support a lot of TEI elements, there is also\n still a lot that is not covered by the converter. 
There will be\n comments in the output for anything that could not be converted\n properly.\n\n- from **PDF**; uses ``pdftotext`` from\n `Poppler `__ and then ``txt2folia``\n from FoLiA-Tools.\n\n - Only works for PDFs with embedded text, not for imagery which\n would require OCR!\n - Information loss: **Considerable!** PDF conversion is notoriously\n difficult; the layout of the document will most probably get lost\n in the conversion (especially in case of multi-columned output).\n The markup will get lost too.\n - Structural conversion is very inaccurate (i.e.\u00a0paragraphs will not\n be nicely mapped) and produces ugly FoLiA.\n - Always avoid this conversion if you can!\n\n- from **hOCR**; uses ``FoLiA-hocr`` from\n `foliautils `__.\n\n - hOCR is a standard format outputted by OCR systems such as\n `Tesseract `__.\n - Information loss: Unknown\n\n- from **ALTO**; uses ``FoLiA-alto`` from\n `foliautils `__.\n\n - ALTO is a standard format for the description of text OCR and\n layout information of pages for digitized material.\n - Information loss: Unknown\n\nFrom other Linguistic Annotation Formats\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\n- from **NAF** (NLP Annotation Format) to FoLiA; uses ``naf2folia``\n from `NAFFoLiAPy `__.\n\n - This converter is still in an early and experimental stage.\n - Information loss: Not all annotation layers are implemented yet.\n Those that are should suffer minimal to no information loss. 
See\n the `website `__ for details.\n\n- from **CONLL-U**; uses ``conllu2folia`` from\n `FoLiA-Tools `__.\n\n - Information loss: None\n\n- from **Alpino XML**; uses ``alpino2folia`` from\n `FoLiA-Tools `__.\n\n - Information loss: Minimal to None\n\nConversions from FoLiA\n~~~~~~~~~~~~~~~~~~~~~~\n\n- to **plain text**; uses ``folia2txt`` from\n `FoLiA-Tools `__.\n\n - Information loss: Considerable, as only the text will be outputted\n and any annotations, most structure, and all markup will be lost.\n The text itself, however, will be very accurately converted, in\n either tokenised (if available) or untokenised form.\n\n- to **HTML**; this conversion is offered through the default viewer in\n the web-interface.\n\n - Information loss: Minimal, but information is represented purely\n for presentational purposes rather than focussing on semantics.\n\n- to **ReStructuredText**; uses ``folia2rst`` from\n `FoLiA-Tools `__.\n\n - Information loss: Structure and mark-up will be preserved, but\n annotations will be lost!\n\nValidation & Upgrade\n~~~~~~~~~~~~~~~~~~~~\n\n- FoLiA validation; using ``foliavalidator`` from\n `FoLiA-Tools `__.\n- FoLiA upgrade; upgrades an older FoLiA version to a newer one (mostly\n intended for FoLiA v1 to FoLiA v2); uses ``foliaupgrade`` from\n `FoLiA-Tools `__.\n\nInstallation\n------------\n\nInstall using pip (preferably in a Python virtual environment):\n\n``pip install piereling``\n\nPiereling is supplied as part of our\n`LaMachine `__ distribution, which\nincludes all dependencies out of the box. If you don\u2019t use this, you\nwill need to take care of installing certain dependencies yourself in\norder for all convertors to work. This includes:\n\n- `pandoc `__\n- `foliautils `__\n- `poppler-utils `__\n\nFor production use, we recommend using uwsgi in combination with a\nwebserver such as Apache (with mod_uwsgi_proxy), or nginx. 
A uwsgi\nconfiguration has been generated (``piereling.example.ini``); it is\nspecific to the host you deploy the webservice on. This in turn loads\nthe wsgi script (``piereling.wsgi``), which loads your webservice.\n\nSample configurations for nginx and Apache have been generated as a\nstarting point; add these to your server and then use the\n``./startserver_production.sh`` script to launch CLAM using uwsgi. If\nyou use LaMachine, all this has already been set up for you.\n\nUsage\n-----\n\nRun ``clamservice piereling.piereling`` to start the *development*\nserver and then navigate your browser to the address printed.\n\nWeb\n---\n\nPiereling is a RESTful webservice and also provides a web-interface for\nhuman end users (powered by `CLAM `__).\nIf you instead seek to do conversions locally on the command line then\nyou have no need for Piereling and should simply invoke the\naforementioned conversion tools directly.\n\nA public instance of this webservice is available at ``https://webservices-lst.science.ru.nl/piereling``; register for a\nfree account at ``https://webservices-lst.science.ru.nl`` first.\n\nRelated Tools\n-------------\n\nIf you want to convert to TEI, or use TEI as a pivot format for\nconversions, then you can look at\n`OxGarage `__\n(`source `__) and\n`OpenConvert `__.\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/proycon/piereling", "keywords": "webservice nlp computational_linguistics rest folia conversion", "license": "GPL", "maintainer": "", "maintainer_email": "", "name": "piereling", "package_url": "https://pypi.org/project/piereling/", "platform": "", "project_url": "https://pypi.org/project/piereling/", "project_urls": { "Homepage": "https://github.com/proycon/piereling" }, "release_url": "https://pypi.org/project/piereling/0.2.1/", "requires_dist": null, "requires_python": "", "summary": "Piereling is a 
webservice and web-application to convert between a variety of document formats, mostly from and to FoLiA XML. It is intended for NLP pipelines.", "version": "0.2.1", "yanked": false, "yanked_reason": null }, "last_serial": 8781295, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "96ee21fb82198e340a4fa9c1c9a8f2a4", "sha256": "55ee37b2dbde050edb064494b22e1dcbc0234c9e534d70e0b72ba2224387b5c4" }, "downloads": -1, "filename": "piereling-0.1.1.tar.gz", "has_sig": false, "md5_digest": "96ee21fb82198e340a4fa9c1c9a8f2a4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 18207, "upload_time": "2019-10-21T14:18:47", "upload_time_iso_8601": "2019-10-21T14:18:47.729149Z", "url": "https://files.pythonhosted.org/packages/59/93/d63a6ea5de04a4f196c7e16c61332218c83f553fa59da9f421e8e8768ea3/piereling-0.1.1.tar.gz", "yanked": false, "yanked_reason": null } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "4ca3b56e1df28c6c41846a0018c058c6", "sha256": "e7dca8fcc4bef3e0e250720d3dadd8d9e461d7965bf1df2117692052da06b566" }, "downloads": -1, "filename": "piereling-0.1.2.tar.gz", "has_sig": false, "md5_digest": "4ca3b56e1df28c6c41846a0018c058c6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 18398, "upload_time": "2019-10-21T18:13:25", "upload_time_iso_8601": "2019-10-21T18:13:25.377919Z", "url": "https://files.pythonhosted.org/packages/eb/c3/cf1d0eca4f9d1143e865a192079cad6944f7944926a63dbf77ba38098a91/piereling-0.1.2.tar.gz", "yanked": false, "yanked_reason": null } ], "0.2": [ { "comment_text": "", "digests": { "md5": "e555009c844f4f4bfc7ed8eeb34e6559", "sha256": "2f6bbdc57995ac8eae154114a478e706fe04537d84d1086ba161ecfad153db62" }, "downloads": -1, "filename": "piereling-0.2.tar.gz", "has_sig": false, "md5_digest": "e555009c844f4f4bfc7ed8eeb34e6559", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 18629, "upload_time": "2019-11-08T17:07:48", 
"upload_time_iso_8601": "2019-11-08T17:07:48.458297Z", "url": "https://files.pythonhosted.org/packages/94/18/5534d09356a37c2ae829649250dc2da1da1abc7504be2f68291fa99e389e/piereling-0.2.tar.gz", "yanked": false, "yanked_reason": null } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "cd9704b782a305900ff7ab074ed7db0d", "sha256": "2f66254c1a4bc25e599eb160f2ef519aeed11ca849bb7e47f3ed9db01b97c73d" }, "downloads": -1, "filename": "piereling-0.2.1.tar.gz", "has_sig": false, "md5_digest": "cd9704b782a305900ff7ab074ed7db0d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19076, "upload_time": "2020-11-30T17:23:48", "upload_time_iso_8601": "2020-11-30T17:23:48.927419Z", "url": "https://files.pythonhosted.org/packages/96/98/29370410b35fbb1a2b175ea38ee3a54cc809e14a1c07f788554675232d43/piereling-0.2.1.tar.gz", "yanked": false, "yanked_reason": null } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "cd9704b782a305900ff7ab074ed7db0d", "sha256": "2f66254c1a4bc25e599eb160f2ef519aeed11ca849bb7e47f3ed9db01b97c73d" }, "downloads": -1, "filename": "piereling-0.2.1.tar.gz", "has_sig": false, "md5_digest": "cd9704b782a305900ff7ab074ed7db0d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19076, "upload_time": "2020-11-30T17:23:48", "upload_time_iso_8601": "2020-11-30T17:23:48.927419Z", "url": "https://files.pythonhosted.org/packages/96/98/29370410b35fbb1a2b175ea38ee3a54cc809e14a1c07f788554675232d43/piereling-0.2.1.tar.gz", "yanked": false, "yanked_reason": null } ], "vulnerabilities": [] }