{ "info": { "author": "Will Larson", "author_email": "lethain@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "==========\nExtraction\n==========\n\nExtraction is a Python package for extracting titles, descriptions,\nimages and canonical urls from web pages. You might want to use Extraction\nif you're building a link aggregator where users submit links and you\nwant to display them (like submitting a link to Facebook, Digg or Delicious).\n\nExtraction is not a web crawling or content retrieval mechanism, rather\nit is a tool to use on data which has always been retrieved or crawled\nby a different tool.\n\nUpdated to work with Python3. `See the last Python 2.x compatible version here `_.\n\nSee on `Github `_, or on\n`PyPi `_.\n\n\nHello World Usage\n=================\n\nAn extremely simple example of using `extraction` is::\n\n >>> import extraction\n >>> import requests\n >>> url = \"http://lethain.com/social-hierarchies-in-engineering-organizations/\"\n >>> html = requests.get(url).text\n >>> extracted = extraction.Extractor().extract(html, source_url=url)\n >>> extracted.title\n >>> \"Social Hierarchies in Engineering Organizations - Irrational Exuberance\"\n >>> print extracted.title, extracted.description, extracted.image, extracted.url\n >>> print extracted.titles, extracted.descriptions, extracted.images, extracted.urls\n\nNote that `source_url` is optional in extract, but is recommended\nas it makes it possible to rewrite relative urls and image urls\ninto absolute paths. `source_url` is not used for fetching data,\nbut can be used for targetting extraction techniques to the correct\ndomain.\n\nMore details usage examples, including how to add your own\nextraction mechanisms, are beneath the installation section.\n\n\nInstallation\n============\n\nThe simplest way to install Extraction is via PyPi::\n\n pip install extraction\n\nYou'll also have to install a parser for `BeautifulSoup4 `_,\nand while ``extraction`` already pulls down `html5lib `_\nthrough it's requirements, I really recommend installing `lxml `_ as well,\nbecause there are some extremely gnarly issues with `html5lib`\nfailing to parse XHTML pages (for example, PyPi fails to parse entirely\nwith html5lib::\n\n >>> bs4.BeautifulSoup(text, [\"html5lib\"]).find_all(\"title\")\n []\n >>> bs4.BeautifulSoup(text, [\"lxml\"]).find_all(\"title\")\n [extraction 0.1.3 : Python Package Index]\n\nYou should be able to install `lxml `_ via pip::\n\n pip install lxml\n\nIf you want to develop extraction, then after installing `lxml`,\nyou can install from GitHub::\n\n git clone\n cd extraction\n python3 -m venv env \n . ./env/bin/activate\n pip install -r requirements.txt\n pip install -e .\n\nThen you can run the tests::\n\n python tests/tests.py\n\nAll of which should pass in a sane installation.\n\n\nUsing Extraction\n================\n\nThis section covers various ways to use extraction, both using\nthe existing extraction techniques as well as add your own.\n\nFor more examples, please look in the `extraction/examples`\ndirectory.\n\n\nBasic Usage\n-----------\n\nThe simplest possible example is the \"Hello World\" example from above::\n\n >>> import extraction\n >>> import requests\n >>> url = \"http://lethain.com/social-hierarchies-in-engineering-organizations/\"\n >>> html = requests.get(url).text\n >>> extracted = extraction.Extractor().extract(html, source_url=url)\n >>> extracted.title\n >>> \"Social Hierarchies in Engineering Organizations - Irrational Exuberance\"\n >>> print extracted.title, extracted.description, extracted.image, extracted.url\n\nYou can get the best title, description and such out of an `Extracted`\ninstance (which are returned by `Extractor.extract`) by::\n\n >>> print extracted.title\n >>> print extracted.description\n >>> print extracted.url\n >>> print extracted.image\n >>> print extracted.feed\n\nYou can get the full list of extracted values using the plural versions::\n\n >>> print extracted.titles\n >>> print extracted.descriptions\n >>> print extracted.urls\n >>> print extracted.images\n >>> print extracted.feeds\n\nIf you're looking for data which is being extracted but doesn't fall into\none of those categories (perhaps using a custom technique), then\ntake a look at the `Extracted._unexpected_values` dictionary::\n\n >>> print extracted._unexpected_values\n\nAny type of metadata which isn't anticipated is stored there\n(look at `Subclassing Extracted to Extract New Types of Data`\nif this is something you're running into frequently).\n\n\nUsing Custom Techniques and Changing Technique Ordering\n-------------------------------------------------------\n\nThe order techniques are run in is significant, and the most accurate\ntechniques should always run first, and more general, lower quality\ntechniques later on.\n\nThis is because titles, descriptions, images and urls are stored\ninternally in a list, which is built up as techniques are run,\nand the `title`, `url`, `image` and `description` properties\nsimply return the first item from the corresponding list.\n\nTechniques are represented by a string with the full path to the\ntechnique, including its class. For example `\"extraction.technique.FacebookOpengraphTags\"`\nis a valid representation of a technique.\n\nThe default ordering of techniques is within the extraction.Extractor's\n`techniques` class variable, and is::\n\n extraction.techniques.FacebookOpengraphTags\n extraction.techniques.TwitterSummaryCardTags\n extraction.techniques.HTML5SemanticTags\n extraction.techniques.HeadTags\n extraction.techniques.SemanticTags\n\nYou can modify the order and inclusion of techniques in three ways.\nFirst, you can modify it by passing in a list of techniques to the\noptional `techniques` parameter when initializing an extraction.Extractor::\n\n >>> techniques = [\"my_module.MyTechnique\", \"extraction.techniques.FacebookOpengraphTags\"]\n >>> extractor = extraction.Extractor(techniques=techniques)\n\nThe second approach is to subclass Extractor with a different value of `techniques`::\n\n from extraction import Extractor\n\n class MyExtractor(Extractor):\n techniques = [\"my_module.MyTechnique\"]\n\nFinally, the third option is to directly modify the `techniques` class variable.\nThis is probably the most unpredictable technique, as it's possible for mutiple\npieces of code to perform this modification and to create havoc, if possible\nuse one of the previous two techniques to avoid future debugging::\n\n >>> import extraction\n >>> extraction.Extractor.techniques.insert(0, \"my_module.MyAwesomeTechnique\")\n >>> extraction.Extractor.techniques.append(\"my_module.MyLastReportTechnique\")\n\nAgain, please try the first two techniques instead if you value sanity.\n\n\nWriting New Technique\n---------------------\n\nIt may be that you're frequently parsing a given website and\naren't impressed with how the default extraction techniques are\nperforming. In that case, consider writng your own technique.\n\nLet's take for example a blog entry at `lethain.com `_,\nwhich uses the `H1` tag to represent the overall blogs title,\nand always uses the first `H2` tag in `DIV.page` for its actual\ntitle.\n\nA technique to properly extract this data would look like::\n\n from extraction.techniques import Technique\n from bs4 import BeautifulSoup\n class LethainComTechnique(Technique):\n def extract(self, html):\n \"Extract data from lethain.com.\"\n soup = BeautifulSoup(html)\n page_div = soup.find('div', class_='page')\n text_div = soup.find('div', class_='text')\n return { 'titles': [page_div.find('h2').string],\n 'dates': [page_div.find('span', class_='date').string],\n 'descriptions': [\" \".join(text_div.find('p').strings)],\n 'tags': [x.find('a').string for x in page_div.find_all('span', class_='tag')],\n 'images': [x.attrs['src'] for x in text_div.find_all('img')],\n }\n\nTo integrate your technique, take a look at the `Using Custom Techniques and Changing Technique Ordering`\nsection above.\n\nAdding new techniques incorporating microformats is an interesting\narea for some consideration. Most microformats have very limited\nusage, but where they are in use they tend to be high quality sources\nof information.\n\n\nSubclassing Extracted to Extract New Types of Data\n--------------------------------------------------\n\nYour techniques can return non-standard keys in the dictionary\nreturned by `extract`, which will be available in the `Extracted()._unexpected_values`\ndictionary. In this way you could fairly easily add support for extracting\naddresses or whatnot.\n\nFor a contrived example, we'll extract my address from `willarson.com `_,\nwhich is in no way a realistic example of extracting an address, and is\nonly meant as an example of how to add a new type of extracted data.\n\nAs such, to add support for extracting address should look like (a fuller,\ncommented version of this example is available in `extraction/examples/new_return_type.py`,\nI've written this as concisely as possible to fit into this doc more cleanly)::\n\n from extraction.techniques import Technique\n from extraction import Extractor, Extracted\n from bs4 import BeautifulSoup\n\n class AddressExtracted(Extracted):\n def __init__(self, addresses=None, *args, **kwargs):\n self.addresses = addresses or []\n super(AddressExtracted, self).__init__(*args, **kwargs)\n\n @property\n def address(self):\n return self.addresses[0] if self.addresses else None\n\n class AddressExtractor(Extractor):\n \"Extractor which supports addresses as first-class data.\"\n extracted_class = AddressExtracted\n text_types = [\"titles\", \"descriptions\", \"addresses\"]\n\n class AddressTechnique(Technique):\n def extract(self, html):\n \"Extract address data from willarson.com.\"\n soup = BeautifulSoup(html)\n return {'addresses': [\" \".join(soup.find('div', id='address').strings)]}\n\nUsage would then look like::\n\n >>> import requests\n >>> from extraction.examples.new_return_type import AddressExtractor\n >>> extractor = AddressExtractor()\n >>> extractor.techniques = [\"extraction.examples.new_return_type.AddressTechnique\"]\n >>> extracted = extractor.extract(requests.get(\"http://willarson.com/\"))\n >>> extracted.address\n \"Cole Valey San Francisco, CA USA\"\n\nThere you have it, extracted addresses as first class extracted data.\n\n\nPassing Parameters to Techniques\n--------------------------------\n\nThere isn't a mechanism for passing parameters to Techniques\nwhen they are initialized, but it is possible to customize\nthe behavior of Techniques in a couple of ways.\n\nFirst, you can simply subclass the Technique with the specific\nbehavior you want, perhaps pulling the data from Django settings\nor what not::\n\n class MyTechnique(Technique):\n def __init__(self, *args, **kwargs):\n if 'something' in kwargs:\n self.something = kwargs['something']\n\t del kwargs['something']\n else:\n self.something = \"something else\"\n return super(MyTechnique, self).__init__(*args, **kwargs)\n\n def extract(html, source_url=None):\n print self.something\n return super(MyTechnique, self).extract(html, source_url=source_url)\n\nSecond, all techniques are passed in the Extractor being used\nto process them, so you can bake the customization into an\nextraction.Extractor subclass::\n\n from extraction import Extractor\n from extraction.techniques import Technique\n\n class MyExtractor(Extractor):\n techniques = [\"my_module.MyTechnique\"]\n def __init__(self, something, *args, **kwargs):\n self.something = something\n super(MyExtractor, self).__init__(*args, **kwargs)\n\n class MyTechnique(Technique):\n class extract(self, html, source_url=None):\n print self.extractor.something\n return super(MyTechnique, self).extract(html, source_url=source_url)\n\nBetween these two techniques, it should be feasible to get the\ncustomization of behavior you need.\n\n\nExtraction Techniques\n=====================\n\nThis section lists the current techniques used by extraction.\nTo rerank the techniques, remove techniques or add new techniques\nof your own, look at the `Using Extraction` section below.\n\n\nextraction.techniques.HeadTags\n------------------------------\n\nEvery webpage's head tag contains has a title tag, and many also\ninclude additional data like descriptions, RSS feeds and such.\nThis technique parses data that looks like::\n\n \n \n \n \n Digg v4's Architecture and Development Processes - Irrational Exuberance\n \n\nWhile the head tag is authoritative source of canonical URLs and RSS,\nit's often very hit or miss for titles, descriptions and such.\nAt worst, it's better than nothing.\n\nextraction.techniques.FacebookOpengraphTags\n-------------------------------------------\n\nFor better or for worse, the highest quality source of page data is usually\nthe `Facebook Opengraph meta tags `_.\nThis technique uses Opengraph tags, which look like this::\n\n \n ...\n \n \n \n \n ...\n \n\nas their source of data.\n\n\nextraction.techniques.TwitterSummaryCardTags\n-------------------------------------------\n\nAnother, increasingly common set of meta tags is the `Twitter Card tags `_.\nThis technique parses those tags, which look like::\n\n \n ...\n \n \n \n \n \n \n ...\n \n\nFor sites with cards integration (which many high quality sites have, because it's necessary for\nrendering with images in the Twitter feed), this will be a very high quality source of data.\n\nOne oddity is that Twitter cards don't include a URL tag, so they don't help much with\ncanonicalizing articles.\n\n\nextraction.techniques.HTML5SemanticTags\n---------------------------------------\n\nThe HTML5 `article` tag, and also the `video` tag give us some useful\nhints for extracting page information for the sites which happen to\nutilize these tags.\n\nThis technique will extract information from pages formed like::\n\n \n \n

This is not a title to HTML5SemanticTags

\n
\n

This is a title

\n

This is a description.

\n

This is not a description.

\n
\n \n \n \n\nNote that `HTML5SemanticTags` is intentionally much more conservative than\n`SemanticTags`, as it provides high quality information in the small number\nof cases where it hits, and otherwise expects `SemanticTags` to run sweep\nbehind it for the lower quality, more abundant hits it discovers.\n\n\nextraction.techniques.SemanticTags\n----------------------------------\n\nThis technique relies on the basic tags themselves--for example,\nall `img` tags include images, most `h1` and `h2` tags include titles,\nand `p` tags often include text usable as descriptions::\n\n \n \n

This will be extracted as a title.

\n

So will this, but after all H1s.

\n \n

And this as a description.

\n

This as another possible description.

\n

This as a third possible description.

\n \n \n\nThere is a limit, defined within `SemanticTags` of how many\ntags of a given type will be consumed, and is usually 3-5,\nwith the exception of images, where it is 10 (as this is\nactually a valid way to detect images, unlike the others).\n\nThis is a true last resort technique.\n\n\nImplementation Details\n======================\n\nI've tried to comment the classes and modules themselves in a fairly\nindepth fashion, and would recommend reading them for the most details,\nthe recommended reading order is::\n\n extraction/tests.py\n extraction/__init__.py\n extraction/techniques.py\n\nHopefully all questions are answered therein.\n\n\nContributions, Questions, Concerns\n==================================\n\nPlease open a GitHub pull-request with any improvements,\npreferably with tests, and I'll be glad to merge it in.\n\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://pypi.python.org/pypi/extraction/", "keywords": "", "license": "LICENSE.txt", "maintainer": "", "maintainer_email": "", "name": "extraction", "package_url": "https://pypi.org/project/extraction/", "platform": "", "project_url": "https://pypi.org/project/extraction/", "project_urls": { "Homepage": "http://pypi.python.org/pypi/extraction/" }, "release_url": "https://pypi.org/project/extraction/0.3/", "requires_dist": [ "beautifulsoup4", "html5lib" ], "requires_python": "", "summary": "Extract basic info from HTML webpages.", "version": "0.3" }, "last_serial": 4498268, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "392014d64e26ad9dfe81181e2f6fca9e", "sha256": "ea17d45f0f7c2bc86770b53c0ee6a3fa07803b319d9d483dff65bf7954fb1a69" }, "downloads": -1, "filename": "extraction-0.1.0.tar.gz", "has_sig": false, "md5_digest": "392014d64e26ad9dfe81181e2f6fca9e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15081, "upload_time": "2012-11-24T02:08:05", "url": "https://files.pythonhosted.org/packages/29/55/95b378c27f64629cea9f534caa02088733ad77b36c998278a734d98031d7/extraction-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "7bc8acc31127d82fd25903af2a20c2f3", "sha256": "3688212a0c85d162ed066caedb2cc588c2222aa0ab49ef615fbbf4cd275fa98b" }, "downloads": -1, "filename": "extraction-0.1.1.tar.gz", "has_sig": false, "md5_digest": "7bc8acc31127d82fd25903af2a20c2f3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15181, "upload_time": "2012-11-24T02:41:14", "url": "https://files.pythonhosted.org/packages/15/e9/caa425764327518acc0b8dd82ef8276026f791107ac30ccdf1f61a462bcd/extraction-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "8368c561f6ffc61702f65f1946915483", "sha256": "3a8d89595c568791ef33ad98ced0f368d471b892ccc2b9dce4b387a3a1cd9314" }, "downloads": -1, "filename": "extraction-0.1.2.tar.gz", "has_sig": false, "md5_digest": "8368c561f6ffc61702f65f1946915483", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15188, "upload_time": "2012-11-24T02:44:18", "url": "https://files.pythonhosted.org/packages/51/53/ba8318b247295759fc858c375895d762da0293efbd78aee4fbfa924dcc78/extraction-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "bb7366d38e44efc4e329c96049536649", "sha256": "2574bd4bdcb2e6307b89c6100fc01bfa309bd38ba9b902e492e2fb6f07ed53fb" }, "downloads": -1, "filename": "extraction-0.1.3.tar.gz", "has_sig": false, "md5_digest": "bb7366d38e44efc4e329c96049536649", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15554, "upload_time": "2012-11-24T03:21:59", "url": "https://files.pythonhosted.org/packages/8d/2b/21cb7cd3b248eba4d23860c8e25b7bee1e213409946830377a63090713fb/extraction-0.1.3.tar.gz" } ], "0.2": [ { "comment_text": "", "digests": { "md5": "3921212ba48693074d086b432a599702", "sha256": "7130f9f7e1956313a8125c79ed5eed211430242b3c12d37f7307fc2f3ba25492" }, "downloads": -1, "filename": "extraction-0.2.macosx-10.9-intel.exe", "has_sig": false, "md5_digest": "3921212ba48693074d086b432a599702", "packagetype": "bdist_wininst", "python_version": "any", "requires_python": null, "size": 97964, "upload_time": "2014-06-06T15:37:24", "url": "https://files.pythonhosted.org/packages/88/ce/e4a6def8732bb9d61f35e960f1eaf6c5cb2342fa45746f83ef4f5543afe8/extraction-0.2.macosx-10.9-intel.exe" }, { "comment_text": "", "digests": { "md5": "a18ed0e35954def7b7d9ed5069297935", "sha256": "cf2ac4fa42a6175d98b9e90988f3b1d12ef9360b489309853ac1fb007187bfd0" }, "downloads": -1, "filename": "extraction-0.2.tar.gz", "has_sig": false, "md5_digest": "a18ed0e35954def7b7d9ed5069297935", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17377, "upload_time": "2014-06-06T15:37:20", "url": "https://files.pythonhosted.org/packages/a6/4e/d599117d97d7067cd8e8d1d8a843060108687729e6d4f12b21b1ad89c647/extraction-0.2.tar.gz" } ], "0.3": [ { "comment_text": "", "digests": { "md5": "e1c6e786a19b669751c2c31b4899661d", "sha256": "c39f204d1394055cd68cef1f61cfc9b231af0dc13aeb5d2d6bd85e48b69f8d1b" }, "downloads": -1, "filename": "extraction-0.3-py3-none-any.whl", "has_sig": false, "md5_digest": "e1c6e786a19b669751c2c31b4899661d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21240, "upload_time": "2018-11-17T22:15:05", "url": "https://files.pythonhosted.org/packages/8a/f3/17aaa649ca30a321ded2098927a9ba586a559bbbc94a897070426e0008cf/extraction-0.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "44dec4a187c18ad0594f393407ee5f30", "sha256": "8449cb3454380689fee892c2190a6f5a3acce7d64608bff504891eb54a332858" }, "downloads": -1, "filename": "extraction-0.3.tar.gz", "has_sig": false, "md5_digest": "44dec4a187c18ad0594f393407ee5f30", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 18534, "upload_time": "2018-11-17T22:15:07", "url": "https://files.pythonhosted.org/packages/41/34/954819612945cc79ad78f0a2c756cadf6907f38a28cf2d19a7a7a8919a60/extraction-0.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "e1c6e786a19b669751c2c31b4899661d", "sha256": "c39f204d1394055cd68cef1f61cfc9b231af0dc13aeb5d2d6bd85e48b69f8d1b" }, "downloads": -1, "filename": "extraction-0.3-py3-none-any.whl", "has_sig": false, "md5_digest": "e1c6e786a19b669751c2c31b4899661d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 21240, "upload_time": "2018-11-17T22:15:05", "url": "https://files.pythonhosted.org/packages/8a/f3/17aaa649ca30a321ded2098927a9ba586a559bbbc94a897070426e0008cf/extraction-0.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "44dec4a187c18ad0594f393407ee5f30", "sha256": "8449cb3454380689fee892c2190a6f5a3acce7d64608bff504891eb54a332858" }, "downloads": -1, "filename": "extraction-0.3.tar.gz", "has_sig": false, "md5_digest": "44dec4a187c18ad0594f393407ee5f30", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 18534, "upload_time": "2018-11-17T22:15:07", "url": "https://files.pythonhosted.org/packages/41/34/954819612945cc79ad78f0a2c756cadf6907f38a28cf2d19a7a7a8919a60/extraction-0.3.tar.gz" } ] }