{ "info": { "author": "Rodrigo Palacios, Eeo Jun", "author_email": "rodrigopala91@gmail.com, packwolf58@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "Libextract: extract data from websites\r\n======================================\r\n\r\n.. image:: https://travis-ci.org/datalib/libextract.svg?branch=master\r\n :target: https://travis-ci.org/datalib/libextract\r\n\r\n::\r\n\r\n ___ __ __ __\r\n / (_) /_ ___ _ __/ /__________ ______/ /_\r\n / / / __ \\/ _ \\| |/_/ __/ ___/ __ `/ ___/ __/\r\n / / / /_/ / __/> `_\r\ndata extraction library that works on HTML and XML documents and is written in \r\nPython. Originating from `eatiht `_, the\r\nextraction algorithm works by making one simple assumption: *data appear as \r\ncollections of repetitive elements*. You can read about the reasoning \r\n`here `_. \r\n\r\n\r\nOverview\r\n--------\r\n\r\n`libextract.api.extract(document, encoding='utf-8', count=5)` \r\n Given an HTML *document*, and optionally the *encoding*, return\r\n a list of nodes likely containing data (5 by default).\r\n\r\n\r\nInstallation\r\n------------\r\n\r\n::\r\n\r\n pip install libextract\r\n\r\nUsage\r\n-----\r\n\r\nDue to our simple definition of \"data\", we expose a single\r\nmethod. Post-processing is up to you. \r\n\r\n.. code-block:: python\r\n\r\n from requests import get\r\n from libextract.api import extract\r\n\r\n r = get('http://en.wikipedia.org/wiki/Information_extraction')\r\n textnodes = list(extract(r.content))\r\n\r\n\r\nUsing lxml's built-in methods for post-processing:\r\n\r\n.. code-block:: python\r\n\r\n >> print(textnodes[0].text_content())\r\n Information extraction (IE) is the task of automatically extracting structured information...\r\n\r\n\r\nThe extraction algorithm handles tabular data just as it handles\r\narticle text:\r\n\r\n.. code-block:: python\r\n\r\n height_data = get(\"http://en.wikipedia.org/wiki/Human_height\")\r\n tabs = list(extract(height_data.content))\r\n \r\n\r\n.. 
code-block:: python\r\n\r\n >> [elem.text_content() for elem in tabs[0].iter('th')]\r\n ['Country/Region',\r\n 'Average male height',\r\n 'Average female height',\r\n ...]\r\n\r\nDependencies\r\n~~~~~~~~~~~~\r\n\r\n::\r\n\r\n lxml\r\n statscounter\r\n\r\nDisclaimer\r\n~~~~~~~~~~\r\n\r\nThis project is still in its infancy; and advice and suggestions as\r\nto what this library could and should be would be greatly appreciated\r\n\r\n:) \r\n", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/datalib/libextract", "keywords": "extract extraction main article text html data-extraction data content-extraction content unsupervised classification machine learning AI artificial intelligence ML", "license": "MIT", "maintainer": null, "maintainer_email": null, "name": "libextract", "package_url": "https://pypi.org/project/libextract/", "platform": "any", "project_url": "https://pypi.org/project/libextract/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/datalib/libextract" }, "release_url": "https://pypi.org/project/libextract/0.0.12/", "requires_dist": null, "requires_python": null, "summary": "A HT/XML web scraping tool", "version": "0.0.12" }, "last_serial": 1610628, "releases": { "0.0.0": [ { "comment_text": "", "digests": { "md5": "e02f535506333e116ead701a939c73a0", "sha256": "4a7b10c755d80975f108a5d2b4d30ba0ae095a9e7568e9be78f6812fe42f80fc" }, "downloads": -1, "filename": "libextract-0.0.0.zip", "has_sig": false, "md5_digest": "e02f535506333e116ead701a939c73a0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12497, "upload_time": "2015-04-15T00:14:43", "url": "https://files.pythonhosted.org/packages/7c/1f/15b14a088599fd77f86762bdbbe3f9ea2dda39b29d32282ccf66ad111742/libextract-0.0.0.zip" } ], "0.0.1": [ { "comment_text": "", "digests": { "md5": "1461e45585c4fac8c02d813772b48c14", 
"sha256": "46b4e04f4ac6f7b230740cfb159e8136ee810e78600710682332bafbd29f03bf" }, "downloads": -1, "filename": "libextract-0.0.1.zip", "has_sig": false, "md5_digest": "1461e45585c4fac8c02d813772b48c14", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 47025, "upload_time": "2015-04-16T03:01:17", "url": "https://files.pythonhosted.org/packages/42/68/697b92e622535a1ba4863a72771316a9aa674f374a1b2b6a22473b69f8bf/libextract-0.0.1.zip" } ], "0.0.11": [ { "comment_text": "", "digests": { "md5": "0e84b3966913af89a695ce70ee1d4cd7", "sha256": "380cc55ac25b3776b1761e4942c7e78ec791ef49b2e2b4c72252a4b6901facb4" }, "downloads": -1, "filename": "libextract-0.0.11.zip", "has_sig": false, "md5_digest": "0e84b3966913af89a695ce70ee1d4cd7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 46807, "upload_time": "2015-05-15T10:26:38", "url": "https://files.pythonhosted.org/packages/55/a4/9edfe15bb0ac126fd7569887d8b8318455db30b739372edb8fb18c14701f/libextract-0.0.11.zip" } ], "0.0.12": [ { "comment_text": "", "digests": { "md5": "869acc9725a9d883a412c4e74ce400d3", "sha256": "053e846b235fc5dc1d7c8a0fa806207ba676631ebf3f30fb52fb6c6c1e0849cc" }, "downloads": -1, "filename": "libextract-0.0.12.zip", "has_sig": false, "md5_digest": "869acc9725a9d883a412c4e74ce400d3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 49802, "upload_time": "2015-06-28T22:20:05", "url": "https://files.pythonhosted.org/packages/88/c8/434eff3237cd0ddc21c45a1ae52de3e94a33aad9c55468ce79da5f93f10c/libextract-0.0.12.zip" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "869acc9725a9d883a412c4e74ce400d3", "sha256": "053e846b235fc5dc1d7c8a0fa806207ba676631ebf3f30fb52fb6c6c1e0849cc" }, "downloads": -1, "filename": "libextract-0.0.12.zip", "has_sig": false, "md5_digest": "869acc9725a9d883a412c4e74ce400d3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 49802, 
"upload_time": "2015-06-28T22:20:05", "url": "https://files.pythonhosted.org/packages/88/c8/434eff3237cd0ddc21c45a1ae52de3e94a33aad9c55468ce79da5f93f10c/libextract-0.0.12.zip" } ] }