{ "info": { "author": "Adrien Barbaresi", "author_email": "barbaresi@bbaw.de", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "Intended Audience :: Information Technology", "Intended Audience :: Science/Research", "License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)", "Programming Language :: Python", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Programming Language :: Python :: 3.8", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Text Processing :: Linguistic", "Topic :: Text Processing :: Markup :: HTML" ], "description": "trafilatura: Scrapes the main text of web pages while preserving some structure\n===============================================================================\n\n.. image:: https://img.shields.io/pypi/v/trafilatura.svg\n :target: https://pypi.python.org/pypi/trafilatura\n :alt: Python package\n\n.. image:: https://img.shields.io/pypi/l/trafilatura.svg\n :target: https://pypi.python.org/pypi/trafilatura\n :alt: License\n\n.. image:: https://img.shields.io/pypi/pyversions/trafilatura.svg\n :target: https://pypi.python.org/pypi/trafilatura\n :alt: Python versions\n\n.. image:: https://img.shields.io/travis/adbar/trafilatura.svg\n :target: https://travis-ci.org/adbar/trafilatura\n :alt: Travis build status\n\n.. image:: https://img.shields.io/codecov/c/github/adbar/trafilatura.svg\n :target: https://codecov.io/gh/adbar/trafilatura\n :alt: Code Coverage\n\n\n:Code: https://github.com/adbar/trafilatura\n:Documentation: see README file\n:Issue tracker: https://github.com/adbar/trafilatura/issues\n\n\nRobust extraction of main text content and boilerplate removal based on a combination of DOM-based examination, XPath expressions and rules. 
Given an HTML document, this library parses it, retrieves the main body text and converts it to XML or plain text, while preserving part of the text formatting and page structure.\n\nIn a nutshell, with Python:\n\n.. code-block:: python\n\n >>> import requests, trafilatura\n >>> response = requests.get('https://www.iana.org/about')\n >>> trafilatura.process_record(response.text)\n >>> # outputs main content in plain text format ...\n\nOn the command-line:\n\n.. code-block:: bash\n\n $ trafilatura -u https://www.sueddeutsche.de/politik/usa-pompeo-maas-merkel-iran-nordstream-1.4434358\n $ # outputs main content in plain text format ...\n\n\n.. contents:: **Contents**\n :backlinks: none\n\n\nDescription\n-----------\n\nScrapes the main text of web pages while preserving some structure. Distinguishing between the whole page and the main text content can help alleviate many quality problems related to web texts.\n\nThe purpose is to find the relevant section of a web page, which is usually the part displayed centrally, without the left or right bars, the header or the footer, but including potential titles and comments. In addition, the extraction focuses on original text and can help with the noise consisting of recurring elements (headers and footers, ads, links/blogroll, etc.).\n\nThis task is also known as web scraping, boilerplate removal or boilerplate detection, DOM-based content extraction, main content identification, web page template detection, web page cleaning, web content extraction, or HTML text cleaning.\n\n\nFeatures\n--------\n\nBecause it relies on `lxml `_, trafilatura is comparatively fast. It is also robust, as the additional generic `jusText algorithm `_ is used as a backup solution.\n\nThe result of processing can be in plain text or XML format. In the latter case, basic formatting elements are preserved such as text formatting (bold, italic, etc.) 
and page structure (paragraphs, titles, lists), which can be used for further processing.\n\n*Work in progress*, currently experimental features:\n\n- Separate extraction of main text and comments\n- Duplicate detection at paragraph level using a least recently used (LRU) cache\n- Language detection on the extracted content\n- XML output compatible with the recommendations of the Text Encoding Initiative (XML TEI)\n\n\nInstallation\n------------\n\n*trafilatura* is a Python package (compatible with Python 3.5 upwards) that is tested on Linux and macOS, is available on `PyPI `_ and can be installed using ``pip``:\n\nInstall from package repository: ``pip install trafilatura``\n\n*(Or use ``pip3 install trafilatura`` on systems where Python 2 and 3 are both globally installed and pip refers to Python 2.)*\n\nFor all experimental functionality please use ``pip install trafilatura[all]``\nMost notably: language detection and faster processing of downloads. The ``cchardet`` package is currently not working on some macOS versions.\n\nDirect installation of the latest version (see `build status `_):\n\n``pip install git+https://github.com/adbar/trafilatura.git``\n\n(For dependency management see `this thread `_)\n\n\nWith Python\n-----------\n\nBasic use\n~~~~~~~~~\n\nThe simplest way to use trafilatura is as follows:\n\n.. code-block:: python\n\n >>> import requests, trafilatura\n >>> response = requests.get('https://www.iana.org/about')\n >>> result = trafilatura.process_record(response.text)\n >>> print(result) # newlines preserved, TXT output\n >>> result = trafilatura.process_record(response.text, xml_output=True)\n >>> print(result) # some formatting preserved in basic XML structure\n\nThe only required argument is the ``response`` element, the rest is optional. It is also possible to use a previously parsed tree (i.e. a lxml.html object) as input, which is then handled seamlessly.\n\n.. 
code-block:: python\n\n >>> from lxml import html\n >>> mytree = html.fromstring('

<html><body><article><p>Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p></article></body></html>

')\n >>> trafilatura.process_record(mytree)\n 'Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\\n'\n\nExperimental feature: the target language can also be set using 2-letter codes (`ISO 639-1 `_). There will be no output if the detected language of the result does not match, and no such filtering takes place if the identification component has not been installed (see above for installation instructions).\n\n.. code-block:: python\n\n >>> result = trafilatura.process_record(response.text, url, target_language='de')\n\nFor further configuration see the variables in ``settings.py``.\n\n\nOn the command-line\n-------------------\n\nA command-line interface is included; URLs can be used directly (``-u/--URL``):\n\n.. code-block:: bash\n\n $ trafilatura -u https://www.sueddeutsche.de/politik/usa-pompeo-maas-merkel-iran-nordstream-1.4434358\n $ # outputs main content in plain text format ...\n $ trafilatura --xml --URL \"https://de.creativecommons.org/index.php/was-ist-cc/\"\n $ # outputs main text with basic XML structure ...\n\nYou can also pipe an HTML document (or a response body) to trafilatura:\n\n.. 
code-block:: bash\n\n $ wget -qO- \"https://de.creativecommons.org/index.php/was-ist-cc/\" | trafilatura\n\nFor usage instructions see ``trafilatura -h``:\n\n``usage: trafilatura [-h] [-f] [--nocomments] [--notables] [--xml] [--xmltei] [-u URL] [-v]``\n\noptional arguments:\n -h, --help show this help message and exit\n -f, --fast Fast (without fallback detection)\n --nocomments Don't output any comments\n --notables Don't output any table elements\n --xml XML output\n --xmltei XML TEI output\n -u URL, --URL URL custom URL download\n -v, --verbose increase output verbosity\n\n\nAdditional information\n----------------------\n\nContext\n~~~~~~~\n\nThis module is part of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). A significant challenge resides in the ability to extract and pre-process web texts to meet scientific expectations. For more information:\n\n.. image:: https://zenodo.org/badge/DOI/10.5281/zenodo.3460969.svg\n :target: https://doi.org/10.5281/zenodo.3460969\n\n- Barbaresi, Adrien. \"`The Vast and the Focused: On the need for domain-focused web corpora `_\", Proceedings of the `7th Workshop on Challenges in the Management of Large Corpora (CMLC-7) `_, 2019.\n- Barbaresi, Adrien. 
\"`Efficient construction of metadata-enhanced web corpora `_\", Proceedings of the `10th Web as Corpus Workshop (WAC-X) `_, 2016.\n\nName\n~~~~\n\n*Trafilatura*: `Italian word `_ for `wire drawing `_.\n\nKudos to...\n~~~~~~~~~~~\n\n- `lxml `_\n- `jusText `_\n- `cchardet `_ & `ftfy `_\n\nAlternatives\n~~~~~~~~~~~~\n\nMost corresponding Python packages are not actively maintained, the following alternatives exist:\n\n- `dragnet `_ features combined and machine-learning approaches, but requires many dependencies as well as extensive tuning\n- `python-readability `_ cleans the page and preserves some markup but is mostly geared towards news texts\n- `goose `_ can extract information for embedded content but doesn't preserve markup and is not maintained\n- `html2text `_ converts HTML pages to Markup language and thus keeps the structure, though it doesn't focus on main text extraction\n\nContact\n~~~~~~~\n\nPull requests are welcome.\n\nSee my `contact page `_ for additional details.\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/adbar/trafilatura", "keywords": "entity-extraction,html-extraction,html-parsing,text-mining,webarchives,web-scraping", "license": "GPLv3+", "maintainer": "", "maintainer_email": "", "name": "trafilatura", "package_url": "https://pypi.org/project/trafilatura/", "platform": "", "project_url": "https://pypi.org/project/trafilatura/", "project_urls": { "Homepage": "http://github.com/adbar/trafilatura" }, "release_url": "https://pypi.org/project/trafilatura/0.1.1/", "requires_dist": [ "ftfy (>=5.6)", "justext (>=2.2.0)", "lru-dict (>=1.1.6)", "lxml (>=4.4.1)", "requests (>=2.22.0)", "cchardet (>=2.0.0); extra == 'all'", "langid (>=1.1.6); extra == 'all'" ], "requires_python": ">=3", "summary": "Scrapes the main text of web pages while preserving some structure.", "version": "0.1.1" }, "last_serial": 5945519, "releases": 
{ "0.0.1": [ { "comment_text": "", "digests": { "md5": "0b2a258f1ed552feeb2c8b730d7ab480", "sha256": "131c37004eddb869259765318c5d14ed80714b408d346b478c67ac4bcb1af15c" }, "downloads": -1, "filename": "trafilatura-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "0b2a258f1ed552feeb2c8b730d7ab480", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 19884, "upload_time": "2019-07-17T11:13:14", "url": "https://files.pythonhosted.org/packages/a9/d1/8742f9f32488b9999c9cb2a6bcf4d80a1ee3af14ce8687b19ef7c541cbfc/trafilatura-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f1ebf62c78cfb35b0bcb55e492461289", "sha256": "04497173116b2715c1129fa104dfcdb4ade2bfc5aacd2f0081299c3a27767d54" }, "downloads": -1, "filename": "trafilatura-0.0.1.tar.gz", "has_sig": false, "md5_digest": "f1ebf62c78cfb35b0bcb55e492461289", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 1131091, "upload_time": "2019-07-17T11:13:18", "url": "https://files.pythonhosted.org/packages/4a/df/75eabf693851408ffeacc5447c49ebf7a85d8319c6db00ddb2ff874c8b1b/trafilatura-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "7504d2bf0af37dd32239c0d4663bd3e0", "sha256": "b52bd76ab4cbc548b79ab294e1c9b6cab9dff395a223edd65b45535fb276c693" }, "downloads": -1, "filename": "trafilatura-0.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "7504d2bf0af37dd32239c0d4663bd3e0", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 21897, "upload_time": "2019-08-02T14:20:39", "url": "https://files.pythonhosted.org/packages/bf/81/220321dd12acd76c964de4c6044aca12d3b8b2406d90afd2514d7ca9bacb/trafilatura-0.0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "0016a8c50779e4c4ae839d2192b61e0a", "sha256": "b212da2c9aed2037493a94035c1e16c1cf1f4068d9376b0d445a1e33077f025b" }, "downloads": -1, "filename": "trafilatura-0.0.2.tar.gz", "has_sig": false, "md5_digest": 
"0016a8c50779e4c4ae839d2192b61e0a", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 1287516, "upload_time": "2019-08-02T14:20:43", "url": "https://files.pythonhosted.org/packages/aa/f6/aceb39cdc2c220076bcf4038949f74c98c6a37b3b45f94fd551f6515ad5c/trafilatura-0.0.2.tar.gz" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "9a29ba52375e74a850cdbd63fdd6f397", "sha256": "0b0267cf66bf8e9670425d5bdd73457da649632ba55f619a314e4e47429edd85" }, "downloads": -1, "filename": "trafilatura-0.0.3-py3-none-any.whl", "has_sig": false, "md5_digest": "9a29ba52375e74a850cdbd63fdd6f397", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 22583, "upload_time": "2019-08-09T12:33:50", "url": "https://files.pythonhosted.org/packages/eb/cb/355d692d7886d0135b3a7a3151ecf9808eee45e509615eda823185cad037/trafilatura-0.0.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "79165b20c6edef5911149d7fb38b4fa0", "sha256": "15aa7d34fcd4f1114f65d461b32d43e5486c971163939028cbd233c7700b5237" }, "downloads": -1, "filename": "trafilatura-0.0.3.tar.gz", "has_sig": false, "md5_digest": "79165b20c6edef5911149d7fb38b4fa0", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 1288193, "upload_time": "2019-08-09T12:33:53", "url": "https://files.pythonhosted.org/packages/a5/7e/9dfa7b2542ca624959603e4ee8cae18a7584d4b6f35a95f9a6ddabf24d10/trafilatura-0.0.3.tar.gz" } ], "0.0.4": [ { "comment_text": "", "digests": { "md5": "2e7989d66f1ab67f1050ae04afb413d8", "sha256": "31291ed98817472c403b508e50e52316286afefdda6ed298f627059740f809fc" }, "downloads": -1, "filename": "trafilatura-0.0.4-py3-none-any.whl", "has_sig": false, "md5_digest": "2e7989d66f1ab67f1050ae04afb413d8", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 22612, "upload_time": "2019-08-23T11:59:12", "url": 
"https://files.pythonhosted.org/packages/aa/17/0fb9e4b08c9d6cfec427bc714c6c6106335995fc71a828d1c067843d208f/trafilatura-0.0.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "933288ec68c3abde1a2ecc64e3eadd5b", "sha256": "419a47714bff2b8bb9821446b33835f24ba9f8c842ef5c19676c03b82a0f8f4f" }, "downloads": -1, "filename": "trafilatura-0.0.4.tar.gz", "has_sig": false, "md5_digest": "933288ec68c3abde1a2ecc64e3eadd5b", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 1470639, "upload_time": "2019-08-23T11:59:14", "url": "https://files.pythonhosted.org/packages/da/ef/96c4f99713530bd07ee99f73c1197930d722d3a17f26d3bca5cde90a762d/trafilatura-0.0.4.tar.gz" } ], "0.0.5": [ { "comment_text": "", "digests": { "md5": "90d965ca13d8a66f08e662790af56072", "sha256": "96082f9678e8fa15b67782c149e3063a60bf4dd39818fc31b602e90c837109c8" }, "downloads": -1, "filename": "trafilatura-0.0.5-py3-none-any.whl", "has_sig": false, "md5_digest": "90d965ca13d8a66f08e662790af56072", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 22997, "upload_time": "2019-09-16T16:15:12", "url": "https://files.pythonhosted.org/packages/13/3a/4851075947b08ee17b7dc08fb3a158b1e0edacc5c04e09b8675dbf516bd8/trafilatura-0.0.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a77e518abff743e0e96ee5b16dfdbb40", "sha256": "91ad9f7672f4b9291f8fee2e6c1c0aac33eeb4d787848d5c3463a1c06dcddb08" }, "downloads": -1, "filename": "trafilatura-0.0.5.tar.gz", "has_sig": false, "md5_digest": "a77e518abff743e0e96ee5b16dfdbb40", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 1471175, "upload_time": "2019-09-16T16:15:15", "url": "https://files.pythonhosted.org/packages/18/7d/010d2df6e0f03b440bf6a7018716323e17dce9e0a02df538325e992937d4/trafilatura-0.0.5.tar.gz" } ], "0.1.0": [ { "comment_text": "", "digests": { "md5": "4bea3e008eff5ca16f924bc22abd6b75", "sha256": 
"9acb4686892da163bf25a36ee9a8e5ad18cd6b1be1462bde998e74b55ce639f9" }, "downloads": -1, "filename": "trafilatura-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "4bea3e008eff5ca16f924bc22abd6b75", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 24385, "upload_time": "2019-09-25T17:54:08", "url": "https://files.pythonhosted.org/packages/7d/63/3f85e938df523bcaddcf55d509510da98f464513aa00a15f80eaa507a94a/trafilatura-0.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "ebcdd6bf82ee1b1b00d3d8f86c5c51d3", "sha256": "e61586a0fe84262977444b8a0c0f95d73b0ac4f4e94d0d8b01b2f187a55e9944" }, "downloads": -1, "filename": "trafilatura-0.1.0.tar.gz", "has_sig": false, "md5_digest": "ebcdd6bf82ee1b1b00d3d8f86c5c51d3", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 1530144, "upload_time": "2019-09-25T17:54:11", "url": "https://files.pythonhosted.org/packages/f4/1d/59442e22c0091f0ded7825b59a482756a8abad37612c3556bac133861780/trafilatura-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "3fe758cbf1cc58b10457d6a151c705c8", "sha256": "15185f43bbd0d10b20c00e97762f87ffe08fa2c7330718a20bbeffd6a4ab169a" }, "downloads": -1, "filename": "trafilatura-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "3fe758cbf1cc58b10457d6a151c705c8", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 24561, "upload_time": "2019-10-08T16:07:08", "url": "https://files.pythonhosted.org/packages/3c/67/177a490d9224ecc2fdeeea7bf96be0785c2cef7fa9c993b5d6bef7316d5f/trafilatura-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "aa06f1f175b1eafdcd9652fc91c4b63f", "sha256": "e0400e2bdbb23018987e9638a91bd0edc64798f5205be3a4f92f29f1557937cb" }, "downloads": -1, "filename": "trafilatura-0.1.1.tar.gz", "has_sig": false, "md5_digest": "aa06f1f175b1eafdcd9652fc91c4b63f", "packagetype": "sdist", "python_version": "source", 
"requires_python": ">=3", "size": 1530453, "upload_time": "2019-10-08T16:07:12", "url": "https://files.pythonhosted.org/packages/98/e7/04640914aacf912bd8c0a6ad3013ba1783e79f8b0376923a05981f6a564b/trafilatura-0.1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "3fe758cbf1cc58b10457d6a151c705c8", "sha256": "15185f43bbd0d10b20c00e97762f87ffe08fa2c7330718a20bbeffd6a4ab169a" }, "downloads": -1, "filename": "trafilatura-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "3fe758cbf1cc58b10457d6a151c705c8", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 24561, "upload_time": "2019-10-08T16:07:08", "url": "https://files.pythonhosted.org/packages/3c/67/177a490d9224ecc2fdeeea7bf96be0785c2cef7fa9c993b5d6bef7316d5f/trafilatura-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "aa06f1f175b1eafdcd9652fc91c4b63f", "sha256": "e0400e2bdbb23018987e9638a91bd0edc64798f5205be3a4f92f29f1557937cb" }, "downloads": -1, "filename": "trafilatura-0.1.1.tar.gz", "has_sig": false, "md5_digest": "aa06f1f175b1eafdcd9652fc91c4b63f", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 1530453, "upload_time": "2019-10-08T16:07:12", "url": "https://files.pythonhosted.org/packages/98/e7/04640914aacf912bd8c0a6ad3013ba1783e79f8b0376923a05981f6a564b/trafilatura-0.1.1.tar.gz" } ] }