{ "info": { "author": "Jonathan Vanasco", "author_email": "jonathan@findmeon.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 2", "Programming Language :: Python :: 3", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Text Processing :: Markup :: HTML" ], "description": "MetadataParser is a Python module for pulling metadata out of web documents.\n\nIt requires BeautifulSoup, and was largely based on Erik River's opengraph module ( https://github.com/erikriver/opengraph ).\n\nI needed something more aggressive than Erik's module, so I had to fork it.\n\n\nInstallation\n=============\n\npip install metadata_parser\n\n\nInstallation Recommendation\n===========================\n\nI strongly suggest you use the `requests` library version 2.4.3 or newer.\n\nThis is not required, but it is better. On earlier versions it is possible to have an uncaught DecodeError exception when there is an underlying redirect/404. Recent fixes to `requests` improve redirect handling, urllib3 and urllib3 errors.\n\n\nFeatures\n=============\n\n* it pulls as much metadata out of a document as possible\n* you can set a 'strategy' for finding metadata (i.e. only accept opengraph or page attributes)\n* lightweight but functional(!) url validation\n* logging is verbose, but nested under `__debug__` statements, so it is compiled away when PYTHONOPTIMIZE is set\n\nNotes\n=============\n1. This requires BeautifulSoup 4.\n2. For speed, it will instantiate a BeautifulSoup parser with lxml, and fall back to 'none' (the internal pure python) if it can't load lxml\n3. URL Validation is not RFC compliant, but tries to be \"Real World\" compliant\n\n* It is HIGHLY recommended that you install lxml for usage. It is considerably faster. Considerably faster. *\n\nYou should also use a very recent version of lxml. 
I've had problems with segfaults on some versions < 2.3.x; I would suggest using the most recent 3.x if possible.\n\nThe default 'strategy' is to look in this order:\n\n og,dc,meta,page\n og = OpenGraph\n dc = DublinCore\n meta = metadata\n page = page elements\n\nYou can specify a strategy as a comma-separated list of the above.\n\nThe only 2 page elements currently supported are:\n\n