{
    "info": {
        "author": "Rodrigo Palacios, Eeo Jun",
        "author_email": "rodrigopala91@gmail.com, packwolf58@gmail.com",
        "bugtrack_url": null,
        "classifiers": [],
        "description": "Libextract: extract data from websites\r\n======================================\r\n\r\n.. image:: https://travis-ci.org/datalib/libextract.svg?branch=master\r\n    :target: https://travis-ci.org/datalib/libextract\r\n\r\n::\r\n\r\n        ___ __              __                  __\r\n       / (_) /_  ___  _  __/ /__________ ______/ /_\r\n      / / / __ \\/ _ \\| |/_/ __/ ___/ __ `/ ___/ __/\r\n     / / / /_/ /  __/>  </ /_/ /  / /_/ / /__/ /_\r\n    /_/_/_.___/\\___/_/|_|\\__/_/   \\__,_/\\___/\\__/\r\n\r\n\r\nLibextract is a `statistics-enabled <https://github.com/datalib/StatsCounter>`_\r\ndata extraction library that works on HTML and XML documents and written in \r\nPython. Originating from `eatiht <http://rodricios.github.io/eatiht/>`_, the\r\nextraction algorithm works by making one simple assumption: *data appear as \r\ncollections of repetitive elements*. You can read about the reasoning \r\n`here <http://rodricios.github.io/posts/solving_the_data_extraction_problem.html>`_. \r\n\r\n\r\nOverview\r\n--------\r\n\r\n`libextract.api.extract(document, encoding='utf-8', count=5)` \r\n    Given an html *document*, and optionally the *encoding*, return\r\n    a list of nodes likely containing data (5 by default).\r\n\r\n\r\nInstallation\r\n------------\r\n\r\n::\r\n\r\n    pip install libextract\r\n\r\nUsage\r\n-----\r\n\r\nDue to our simple definition of \"data\", we open up a single\r\ninterfaceable method. Post-processing is up to you. \r\n\r\n.. code-block:: python\r\n\r\n    from requests import get\r\n    from libextract.api import extract\r\n\r\n    r = get('http://en.wikipedia.org/wiki/Information_extraction')\r\n    textnodes = list(extract(r.content))\r\n\r\n\r\nUsing lxml's built-in methods for post-processing:\r\n\r\n.. code-block:: python\r\n\r\n    >> print(textnodes[0].text_content())\r\n    Information extraction (IE) is the task of automatically extracting structured information...\r\n\r\n\r\nThe extraction algo is agnostic to article text as it is with\r\ntabular data:\r\n\r\n.. code-block:: python\r\n\r\n    height_data = get(\"http://en.wikipedia.org/wiki/Human_height\")\r\n    tabs = list(extract(height_data.content))\r\n    \r\n\r\n.. code-block:: python\r\n\r\n    >> [elem.text_content() for elem in tabs[0].iter('th')]\r\n    ['Country/Region',\r\n     'Average male height',\r\n     'Average female height',\r\n     ...]\r\n\r\nDependencies\r\n~~~~~~~~~~~~\r\n\r\n::\r\n\r\n    lxml\r\n    statscounter\r\n\r\nDisclaimer\r\n~~~~~~~~~~\r\n\r\nThis project is still in its infancy; and advice and suggestions as\r\nto what this library could and should be would be greatly appreciated\r\n\r\n:) \r\n",
        "description_content_type": null,
        "docs_url": null,
        "download_url": "UNKNOWN",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/datalib/libextract",
        "keywords": "extract extraction main article text html data-extraction data              content-extraction content unsupervised classification machine              learning AI artificial intelligence ML",
        "license": "MIT",
        "maintainer": null,
        "maintainer_email": null,
        "name": "libextract",
        "package_url": "https://pypi.org/project/libextract/",
        "platform": "any",
        "project_url": "https://pypi.org/project/libextract/",
        "project_urls": {
            "Download": "UNKNOWN",
            "Homepage": "https://github.com/datalib/libextract"
        },
        "release_url": "https://pypi.org/project/libextract/0.0.12/",
        "requires_dist": null,
        "requires_python": null,
        "summary": "A HT/XML web scraping tool",
        "version": "0.0.12"
    },
    "last_serial": 1610628,
    "releases": {
        "0.0.0": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "e02f535506333e116ead701a939c73a0",
                    "sha256": "4a7b10c755d80975f108a5d2b4d30ba0ae095a9e7568e9be78f6812fe42f80fc"
                },
                "downloads": -1,
                "filename": "libextract-0.0.0.zip",
                "has_sig": false,
                "md5_digest": "e02f535506333e116ead701a939c73a0",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 12497,
                "upload_time": "2015-04-15T00:14:43",
                "url": "https://files.pythonhosted.org/packages/7c/1f/15b14a088599fd77f86762bdbbe3f9ea2dda39b29d32282ccf66ad111742/libextract-0.0.0.zip"
            }
        ],
        "0.0.1": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "1461e45585c4fac8c02d813772b48c14",
                    "sha256": "46b4e04f4ac6f7b230740cfb159e8136ee810e78600710682332bafbd29f03bf"
                },
                "downloads": -1,
                "filename": "libextract-0.0.1.zip",
                "has_sig": false,
                "md5_digest": "1461e45585c4fac8c02d813772b48c14",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 47025,
                "upload_time": "2015-04-16T03:01:17",
                "url": "https://files.pythonhosted.org/packages/42/68/697b92e622535a1ba4863a72771316a9aa674f374a1b2b6a22473b69f8bf/libextract-0.0.1.zip"
            }
        ],
        "0.0.11": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "0e84b3966913af89a695ce70ee1d4cd7",
                    "sha256": "380cc55ac25b3776b1761e4942c7e78ec791ef49b2e2b4c72252a4b6901facb4"
                },
                "downloads": -1,
                "filename": "libextract-0.0.11.zip",
                "has_sig": false,
                "md5_digest": "0e84b3966913af89a695ce70ee1d4cd7",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 46807,
                "upload_time": "2015-05-15T10:26:38",
                "url": "https://files.pythonhosted.org/packages/55/a4/9edfe15bb0ac126fd7569887d8b8318455db30b739372edb8fb18c14701f/libextract-0.0.11.zip"
            }
        ],
        "0.0.12": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "869acc9725a9d883a412c4e74ce400d3",
                    "sha256": "053e846b235fc5dc1d7c8a0fa806207ba676631ebf3f30fb52fb6c6c1e0849cc"
                },
                "downloads": -1,
                "filename": "libextract-0.0.12.zip",
                "has_sig": false,
                "md5_digest": "869acc9725a9d883a412c4e74ce400d3",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 49802,
                "upload_time": "2015-06-28T22:20:05",
                "url": "https://files.pythonhosted.org/packages/88/c8/434eff3237cd0ddc21c45a1ae52de3e94a33aad9c55468ce79da5f93f10c/libextract-0.0.12.zip"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "869acc9725a9d883a412c4e74ce400d3",
                "sha256": "053e846b235fc5dc1d7c8a0fa806207ba676631ebf3f30fb52fb6c6c1e0849cc"
            },
            "downloads": -1,
            "filename": "libextract-0.0.12.zip",
            "has_sig": false,
            "md5_digest": "869acc9725a9d883a412c4e74ce400d3",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 49802,
            "upload_time": "2015-06-28T22:20:05",
            "url": "https://files.pythonhosted.org/packages/88/c8/434eff3237cd0ddc21c45a1ae52de3e94a33aad9c55468ce79da5f93f10c/libextract-0.0.12.zip"
        }
    ]
}