{ "info": { "author": "John Riebold", "author_email": "jmriebold@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "License :: OSI Approved :: Apache Software License", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Utilities" ], "description": "# BoilerPy3\n\n\n## About\n\nBoilerPy3 is a native Python [port](https://github.com/natural/java2python) of Christian Kohlsch\u00fctter's [Boilerpipe](https://github.com/kohlschutter/boilerpipe) library, released under the Apache 2.0 Licence.\n\nThis package is based on [sammyer's](https://github.com/sammyer) [BoilerPy](https://github.com/sammyer/BoilerPy), specifically [mercuree's](https://github.com/mercuree) [Python3-compatible fork](https://github.com/mercuree/BoilerPy). This fork updates the codebase to be more Pythonic (proper attribute access, docstrings, type-hinting, snake case, etc.) and make use of Python 3.6 features (f-strings), in addition to switching testing frameworks from Unittest to PyTest.\n\n**Note**: This package is based on Boilerpipe 1.2 (at or before [this commit](https://github.com/kohlschutter/boilerpipe/tree/b0816590340f4317f500c64565b23beb4fb9a827)), as that's when the code was originally ported to Python. I experimented with updating the code to match Boilerpipe 1.3, but because it performed worse in my tests, I ultimately decided to leave it at 1.2-equivalent.\n\n\n## Installation\n\nTo install the latest version from PyPI, execute:\n\n```shell\npip install boilerpy3\n```\n\nIf you'd like to try out any unreleased features you can install directly from GitHub like so:\n\n```shell\npip install git+https://github.com/jmriebold/BoilerPy\n```\n\n\n## Usage\n\nThe top-level interfaces are the Extractors. 
Use the `get_content()` methods to extract the filtered text.\n\n```python\nfrom boilerpy3 import extractors\n\nextractor = extractors.ArticleExtractor()\n\n# From a URL\ncontent = extractor.get_content_from_url('http://www.example.com/')\n\n# From a file\ncontent = extractor.get_content_from_file('tests/test.html')\n\n# From raw HTML\ncontent = extractor.get_content('<html><body><h1>Example</h1></body></html>')\n```\n\nAlternatively, use `get_doc()` to return a Boilerpipe document from which you can get more detailed information.\n\n```python\nfrom boilerpy3 import extractors\n\nextractor = extractors.ArticleExtractor()\n\ndoc = extractor.get_doc_from_url('http://www.example.com/')\ncontent = doc.content\ntitle = doc.title\n```\n\n\n## Extractors\n\n\n### DefaultExtractor\n\nA quite generic full-text extractor. Usually worse than ArticleExtractor, but simpler, with no heuristics.\n\n\n### ArticleExtractor\n\nA full-text extractor which is tuned towards news articles. In this scenario it achieves higher accuracy than DefaultExtractor. Works very well for most types of article-like HTML.\n\n\n### ArticleSentencesExtractor\n\nA full-text extractor which is tuned towards extracting sentences from news articles.\n\n\n### LargestContentExtractor\n\nA full-text extractor which extracts the largest text component of a page. For news articles, it may perform better than the DefaultExtractor but usually worse than ArticleExtractor.\n\n\n### CanolaExtractor\n\nA full-text extractor trained on [krdwrd](http://krdwrd.org) [Canola](https://krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf). Works well with SimpleEstimator, too.\n\n\n### KeepEverythingExtractor\n\nDummy extractor which marks everything as content. Should return the input text. 
Use this to double-check that your problem is within a particular Extractor or somewhere else.\n\n\n### NumWordsRulesExtractor\n\nA quite generic full-text extractor solely based upon the number of words per block (the current, the previous and the next block).\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/jmriebold/BoilerPy3", "keywords": "boilerpipe,boilerpy,html text extraction,text extraction,full text extraction", "license": "Apache 2.0", "maintainer": "", "maintainer_email": "", "name": "boilerpy3", "package_url": "https://pypi.org/project/boilerpy3/", "platform": "", "project_url": "https://pypi.org/project/boilerpy3/", "project_urls": { "Homepage": "https://github.com/jmriebold/BoilerPy3" }, "release_url": "https://pypi.org/project/boilerpy3/1.0.1/", "requires_dist": null, "requires_python": ">=3.6.*", "summary": "Python port of Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages", "version": "1.0.1" }, "last_serial": 5658091, "releases": { "1.0.1": [ { "comment_text": "", "digests": { "md5": "f78bb84c2fbdc000e5dc3603eed8c968", "sha256": "ba33a6f65b6110c162b0a9b88ee7fc0469c9e2ac9f34968122ee243f22d69243" }, "downloads": -1, "filename": "boilerpy3-1.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "f78bb84c2fbdc000e5dc3603eed8c968", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6.*", "size": 21607, "upload_time": "2019-08-09T23:15:51", "url": "https://files.pythonhosted.org/packages/31/fb/0c027efd8db1ab6e1a4a1582292251cf86749a46dfd006ed2afe643ff9bd/boilerpy3-1.0.1-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f78bb84c2fbdc000e5dc3603eed8c968", "sha256": "ba33a6f65b6110c162b0a9b88ee7fc0469c9e2ac9f34968122ee243f22d69243" }, "downloads": -1, "filename": "boilerpy3-1.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": 
"f78bb84c2fbdc000e5dc3603eed8c968", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6.*", "size": 21607, "upload_time": "2019-08-09T23:15:51", "url": "https://files.pythonhosted.org/packages/31/fb/0c027efd8db1ab6e1a4a1582292251cf86749a46dfd006ed2afe643ff9bd/boilerpy3-1.0.1-py3-none-any.whl" } ] }