{ "info": { "author": "Aleksandr Smechov", "author_email": "aleks@smechov.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "===============================================\nVellichor: a succinct article text extractor \n===============================================\n\n*Vellichor (n): the strange wistfulness of used bookstores*\n\n`Vellichor `_'s aims aren't ambitious. It does its duty relatively well, living a simple package's life, sustaining itself solely on URL or HTML strings. Provide it with these basic comforts and you shall receive a lean, healthy block of article text. \n\nQuickstart\n==========\n\nDependencies\n------------\n\nDespite its simple purpose, Vellichor has a few dependencies, as it uses a random forest model to classify a candidate HTML node as relevant or not. These will be installed automatically, if you don't already have them: **urlvalidator**, **requests**, **commonregex**, **lxml**, **beautifulsoup4**, **scipy**, **scikit-learn**, **numpy**. *The library was tested with Python 3.6 only*.\n\n\nInstallation\n------------\n\nOf course, `virtualenv`_ would be a nice idea, considering you may want a few of those important dependencies untouched::\n\n virtualenv test_env --python=python3.6\n\n.. _virtualenv: http://www.virtualenv.org\n\nYou can use ``pip`` to install Vellichor::\n\n pip install vellichor\n\n\nUsage\n-----\n\nVellichor extracts relevant text from an article URL or HTML string. To begin, import the Extract class::\n\n from vellichor.extract import Extract\n\nYou can then create an instance of Extract and feed a URL or HTML string to several methods::\n\n url = \"http://www.example.com/you-wont-believe-these-examples\"\n html = \"

Example

\n\n extract = Extract()\n\n # Main method\n article_text = extract.article_text_from(url)\n # OR extract.article_text_from(html=html)\n\n # Extract raw text directly from the retrieved HTML\n raw_text = extract.raw_text_from(url)\n\n # Extract the HTML only - URL parameter only\n html_only = extract.html_from(url)\n\n # Outputs a Beautiful Soup object from the retrieved HTML\n soup = extract.soup_from(url)\n\nTo extract text from a sea of article URLs, be sure to instantiate ``Extract`` for every new URL. \n\nNot satisfied with just a clean block of text? Vellichor comes with a few methods for extracting some basic details::\n\n extract.article_details()\n\n # outputs a list of author candidates: [\"Dr. Exampleton\"]\n extract.author \n\n # outputs the site name: \"Example\"\n extract.site_name \n\n # outputs the article title: \"You Won't Believe these Examples!\"\n extract.article_title\n\nA few things to note. Running the ``article_text_from()`` method on an instance of ``Extract`` automatically gives access to the following class attributes: ``html``, ``article_text``, ``soup``, and ``soup_blocks`` (a collection of candidate nodes, or

tags, that were used for deciding the final output text). \n\nSecond, there is a bit of hierarchy built in. Running the ``get_soup_blocks()`` method also gives access to the ``soup`` and ``html`` class attributes. Running ``get_soup()`` on your instance also gets you the ``html`` class attribute. \n\n``raw_text`` is only available when the ``raw_text_from()`` method is called on an instance of Extract (the URL or HTML parameter is required if this will be the first class method you call).\n\nThat's all folks.\n\n...\n\n*I have always imagined that Paradise will be a kind of library.* \n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/aleksandr-smechov/vellichor.git", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "vellichor", "package_url": "https://pypi.org/project/vellichor/", "platform": "", "project_url": "https://pypi.org/project/vellichor/", "project_urls": { "Homepage": "https://github.com/aleksandr-smechov/vellichor.git" }, "release_url": "https://pypi.org/project/vellichor/0.0.1/", "requires_dist": [ "urlvalidator", "requests", "commonregex", "beautifulsoup4", "scipy", "scikit-learn", "numpy", "lxml" ], "requires_python": "", "summary": "A succinct article text extractor.", "version": "0.0.1" }, "last_serial": 4545098, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "95d1f5673bda5525f525a40b4a3dc814", "sha256": "c45a65712f8c0a8bd37bf0b4534008b7692d525d144ec89660008501964587d3" }, "downloads": -1, "filename": "vellichor-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "95d1f5673bda5525f525a40b4a3dc814", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 245671, "upload_time": "2018-11-30T01:07:22", "url": 
"https://files.pythonhosted.org/packages/d2/ed/34f90288595e35c08c63c1c0f3790c870943d790ea1b2d01fcda1365b75d/vellichor-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d8315e71e252c5198b820cd518062d61", "sha256": "1d64aa6f945c68f15471a4a17b403ca7564c22c2eda29bc90e115b20c3cfee8a" }, "downloads": -1, "filename": "vellichor-0.0.1.tar.gz", "has_sig": false, "md5_digest": "d8315e71e252c5198b820cd518062d61", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 234802, "upload_time": "2018-11-30T01:07:26", "url": "https://files.pythonhosted.org/packages/fb/8a/0fad7aea7ef0e7cb9d32f43223cf73761e26a0f630ecbda378a4bb41cd1e/vellichor-0.0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "95d1f5673bda5525f525a40b4a3dc814", "sha256": "c45a65712f8c0a8bd37bf0b4534008b7692d525d144ec89660008501964587d3" }, "downloads": -1, "filename": "vellichor-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "95d1f5673bda5525f525a40b4a3dc814", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 245671, "upload_time": "2018-11-30T01:07:22", "url": "https://files.pythonhosted.org/packages/d2/ed/34f90288595e35c08c63c1c0f3790c870943d790ea1b2d01fcda1365b75d/vellichor-0.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d8315e71e252c5198b820cd518062d61", "sha256": "1d64aa6f945c68f15471a4a17b403ca7564c22c2eda29bc90e115b20c3cfee8a" }, "downloads": -1, "filename": "vellichor-0.0.1.tar.gz", "has_sig": false, "md5_digest": "d8315e71e252c5198b820cd518062d61", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 234802, "upload_time": "2018-11-30T01:07:26", "url": "https://files.pythonhosted.org/packages/fb/8a/0fad7aea7ef0e7cb9d32f43223cf73761e26a0f630ecbda378a4bb41cd1e/vellichor-0.0.1.tar.gz" } ] }