{ "info": { "author": "Michal Belica", "author_email": "miso.belica@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: Implementation :: CPython", "Topic :: Internet :: WWW/HTTP", "Topic :: Software Development :: Pre-processors", "Topic :: Text Processing :: Filters", "Topic :: Text Processing :: Markup :: HTML" ], "description": ".. _jusText: http://code.google.com/p/justext/\n.. _Python: http://www.python.org/\n.. _lxml: http://lxml.de/\n\njusText\n=======\n.. image:: https://api.travis-ci.org/miso-belica/jusText.png?branch=master\n :target: https://travis-ci.org/miso-belica/jusText\n\nProgram jusText is a tool for removing boilerplate content, such as navigation\nlinks, headers, and footers from HTML pages. It is\n`designed `_ to preserve\nmainly text containing full sentences and it is therefore well suited for\ncreating linguistic resources such as Web corpora. You can\n`try it online `_.\n\nThis is a fork of original (currently unmaintained) code of jusText_ hosted\non Google Code. 
Below are some alternatives that I found:\n\n- https://github.com/bookieio/breadability\n- http://code.google.com/p/boilerpipe/\n- http://sourceforge.net/projects/webascorpus/?source=navbar\n- https://github.com/jiminoc/goose\n- https://github.com/grangier/python-goose\n- https://github.com/dcramer/decruft\n- https://github.com/FeiSun/ContentExtraction\n\n- https://github.com/JalfResi/justext\n- https://github.com/andreypopp/extracty/tree/master/justext\n- https://github.com/dreamindustries/jaws/tree/master/justext\n- https://github.com/says/justext\n- https://github.com/chbrown/justext\n- https://github.com/says/justext-app\n\n\nInstallation\n------------\nMake sure you have Python_ 2.6+/3.3+ and `pip `_\n(`Windows `_,\n`Linux `_) installed.\nRun simply:\n\n.. code-block:: bash\n\n $ [sudo] pip install justext\n\n\nDependencies\n------------\n::\n\n lxml>=2.2.4\n\n\nUsage\n-----\n.. code-block:: bash\n\n $ python -m justext -s Czech -o text.txt http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/\n $ python -m justext -s English -o plain_text.txt english_page.html\n $ python -m justext --help # for more info\n\n\nPython API\n----------\n.. code-block:: python\n\n    import requests\n    import justext\n\n    response = requests.get(\"http://planet.python.org/\")\n    paragraphs = justext.justext(response.content, justext.get_stoplist(\"English\"))\n    for paragraph in paragraphs:\n        if not paragraph.is_boilerplate:\n            print(paragraph.text)\n\n\nTesting\n-------\nRun tests via\n\n.. code-block:: bash\n\n $ py.test-2.6 && py.test-3.3 && py.test-2.7 && py.test-3.4 && py.test-3.5\n\n\nAcknowledgements\n----------------\n.. _`Natural Language Processing Centre`: http://nlp.fi.muni.cz/en/nlpc\n.. _`Masaryk University in Brno`: http://nlp.fi.muni.cz/en\n.. _PRESEMT: http://presemt.eu/\n.. _`Lexical Computing Ltd.`: http://lexicalcomputing.com/\n.. 
_`PhD research`: http://is.muni.cz/th/45523/fi_d/phdthesis.pdf\n\nThis software has been developed at the `Natural Language Processing Centre`_ of\n`Masaryk University in Brno`_ with financial support from PRESEMT_ and\n`Lexical Computing Ltd.`_ It also relates to `PhD research`_ of Jan Pomik\u00e1lek.\n\n\n.. :changelog:\n\nChangelog for jusText\n=====================\n\n2.2.0 (2016-03-06)\n------------------\n- *INCOMPATIBLE CHANGE:* Stop words are case insensitive.\n- *INCOMPATIBLE CHANGE:* Dropped support for Python 3.2\n- *BUG FIX:* Preserve new lines from original text in paragraphs.\n\n2.1.1 (2014-05-27)\n------------------\n- *BUG FIX:* Function ``decode_html`` now respects parameter ``errors`` when falling back to ``default_encoding`` `#9 `_.\n\n2.1.0 (2014-01-25)\n------------------\n- *FEATURE:* Added XPath selector to the paragraphs. XPath selector is also available in detailed output as ``xpath`` attribute of ``<p>`` tag `#5 `_.\n\n2.0.0 (2013-08-26)\n------------------\n- *FEATURE:* Added pluggable DOM preprocessor.\n- *FEATURE:* Added support for Python 3.2+.\n- *INCOMPATIBLE CHANGE:* Paragraphs are instances of\n ``justext.paragraph.Paragraph``.\n- *INCOMPATIBLE CHANGE:* Script 'justext' removed in favour of\n command ``python -m justext``.\n- *FEATURE:* It's possible to enter a URI as input document in CLI.\n- *FEATURE:* It is possible to pass unicode string directly.\n\n1.2.0 (2011-08-08)\n------------------\n- *FEATURE:* Character counts used instead of word counts where possible in\n order to make the algorithm work well in the language independent\n mode (without a stoplist) for languages where counting words is\n not easy (Japanese, Chinese, Thai, etc).\n- *BUG FIX:* More robust parsing of meta tags containing the information about\n used charset.\n- *BUG FIX:* Corrected decoding of HTML entities &#128; to &#159;\n\n1.1.0 (2011-03-09)\n------------------\n- First public release.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/miso-belica/jusText", "keywords": null, "license": "The BSD 2-Clause License", "maintainer": null, "maintainer_email": null, "name": "jusText", "package_url": "https://pypi.org/project/jusText/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/jusText/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/miso-belica/jusText" }, "release_url": "https://pypi.org/project/jusText/2.2.0/", "requires_dist": null, "requires_python": null, "summary": "Heuristic based boilerplate removal tool", "version": "2.2.0" }, "last_serial": 1992431, "releases": { "1.2.0": [ { "comment_text": "", "digests": { "md5": "5707f2053ffb57f89e6f80a53ae453a7", "sha256": "6715127a631795bda60660b6b24510c1b23ba4c7855bb5fddfb3d61027be39f0" }, "downloads": -1, "filename": "jusText-1.2.0.zip", "has_sig": false, 
"md5_digest": "5707f2053ffb57f89e6f80a53ae453a7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 860846, "upload_time": "2013-08-26T18:57:30", "url": "https://files.pythonhosted.org/packages/dd/67/352453fc02d18e554503d42f772087d5ce42b3d89700cfd132ded6a71af0/jusText-1.2.0.zip" } ], "2.0.0": [ { "comment_text": "", "digests": { "md5": "b2e2e8287eb6bc0b2999e659618130b6", "sha256": "c132eea09cebfe8e0f74f2c8c5e7f74ec948b17774244613ee44d0d01033333f" }, "downloads": -1, "filename": "jusText-2.0.0.zip", "has_sig": false, "md5_digest": "b2e2e8287eb6bc0b2999e659618130b6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 864264, "upload_time": "2013-08-26T19:20:00", "url": "https://files.pythonhosted.org/packages/c8/e0/b2cb91cab1661de64fd8aa464615b0a24dd6b40e4ff7894c3bf69f0ef951/jusText-2.0.0.zip" } ], "2.1.0": [ { "comment_text": "", "digests": { "md5": "a31553384c164bcf266bdd8e7d1cffb4", "sha256": "c62fc00abc25f2c7648cf9e8f649e2a40070cae9f46fdb45731d41ea34f80e04" }, "downloads": -1, "filename": "jusText-2.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "a31553384c164bcf266bdd8e7d1cffb4", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 862397, "upload_time": "2014-01-25T19:37:06", "url": "https://files.pythonhosted.org/packages/bb/eb/84a0ee05ff96a510204317dbf85fec99766d862882cf4315d644531ccf1a/jusText-2.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d07c1e0b93aeb44181b9ebb785524b11", "sha256": "a0263e2db06b2d4839df0503e43579d6b7685600d87d2ab50c43a56a44623365" }, "downloads": -1, "filename": "jusText-2.1.0.win32.exe", "has_sig": false, "md5_digest": "d07c1e0b93aeb44181b9ebb785524b11", "packagetype": "bdist_wininst", "python_version": "any", "requires_python": null, "size": 1057226, "upload_time": "2014-01-26T10:45:06", "url": 
"https://files.pythonhosted.org/packages/77/65/c4b88dcdd58c0f0f7825f9a3e137cb87984429d3b994a17f69f9af5a905b/jusText-2.1.0.win32.exe" }, { "comment_text": "", "digests": { "md5": "a0d6ede25651e99f1b6ef52a9fad611f", "sha256": "c2d3cd19feff41d2ef9e4a9915decf4c14a47dc63de94ddd539739d9bfc349fd" }, "downloads": -1, "filename": "jusText-2.1.0.zip", "has_sig": false, "md5_digest": "a0d6ede25651e99f1b6ef52a9fad611f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 865613, "upload_time": "2014-01-25T19:37:00", "url": "https://files.pythonhosted.org/packages/fc/54/d6ade8b5dc5adb32c9826b7f2d28fdc2788b95ec752a9eb637173008eec4/jusText-2.1.0.zip" } ], "2.1.1": [ { "comment_text": "", "digests": { "md5": "b3f02f7fe9e4881ef0427cefe393f2d3", "sha256": "5361f7d40568d6bdf2432aeba0bc11d1107db565e712bdb6624eec1f98e594ee" }, "downloads": -1, "filename": "jusText-2.1.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "b3f02f7fe9e4881ef0427cefe393f2d3", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 861092, "upload_time": "2014-05-27T20:29:51", "url": "https://files.pythonhosted.org/packages/15/1e/a97c2cd1169e87e74e954e9b3946c6fc6b34e5178268d1cbabd6209ae289/jusText-2.1.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d6c9247acd16c7c37ef0a2da3e2a2dad", "sha256": "1dd07067f415a87f22323a54d3d2238c9b0fe8e7945551a0763496ac56d30a16" }, "downloads": -1, "filename": "jusText-2.1.1.zip", "has_sig": false, "md5_digest": "d6c9247acd16c7c37ef0a2da3e2a2dad", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 864531, "upload_time": "2014-05-27T20:29:36", "url": "https://files.pythonhosted.org/packages/04/6d/31e00efd44497ffa0e433d5f8eca61216ad3feba4ee7b0dcb76ba9d75df4/jusText-2.1.1.zip" } ], "2.2.0": [ { "comment_text": "", "digests": { "md5": "ac1b8f271cee0983001fef9351ac6e1f", "sha256": "4b8b7f0749e8725f0089ebe0239c1a45286d61bf507b3f05d136c2700dea4aa6" }, 
"downloads": -1, "filename": "jusText-2.2.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "ac1b8f271cee0983001fef9351ac6e1f", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 860786, "upload_time": "2016-03-06T18:05:05", "url": "https://files.pythonhosted.org/packages/6c/5f/c7b909b4b864ebcacfac23ce2f6f01a50c53628787cc14b3c06f79464cab/jusText-2.2.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d3411d66f19c58b24a2fdd2358e7093d", "sha256": "330035dfaaa960465276afa1836dfb6e63791011a8dfc6da2757142cc4d14d54" }, "downloads": -1, "filename": "jusText-2.2.0.zip", "has_sig": false, "md5_digest": "d3411d66f19c58b24a2fdd2358e7093d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 864108, "upload_time": "2016-03-06T18:05:39", "url": "https://files.pythonhosted.org/packages/4a/1a/d167b2d3a7062e48f00d6bf7aa554c9b8e77f12fcd88fdd8271853d93449/jusText-2.2.0.zip" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "ac1b8f271cee0983001fef9351ac6e1f", "sha256": "4b8b7f0749e8725f0089ebe0239c1a45286d61bf507b3f05d136c2700dea4aa6" }, "downloads": -1, "filename": "jusText-2.2.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "ac1b8f271cee0983001fef9351ac6e1f", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 860786, "upload_time": "2016-03-06T18:05:05", "url": "https://files.pythonhosted.org/packages/6c/5f/c7b909b4b864ebcacfac23ce2f6f01a50c53628787cc14b3c06f79464cab/jusText-2.2.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d3411d66f19c58b24a2fdd2358e7093d", "sha256": "330035dfaaa960465276afa1836dfb6e63791011a8dfc6da2757142cc4d14d54" }, "downloads": -1, "filename": "jusText-2.2.0.zip", "has_sig": false, "md5_digest": "d3411d66f19c58b24a2fdd2358e7093d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 864108, "upload_time": "2016-03-06T18:05:39", "url": 
"https://files.pythonhosted.org/packages/4a/1a/d167b2d3a7062e48f00d6bf7aa554c9b8e77f12fcd88fdd8271853d93449/jusText-2.2.0.zip" } ] }