{ "info": { "author": "Alexandru Stanciu", "author_email": "alexandru.stanciu@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "=======\r\nSummary\r\n=======\r\n\r\nCurrent release: v0.2 - see CHANGES.txt for details.\r\n\r\nSimple usage\r\n------------\r\n\r\nWorking with the ``summary`` package::\r\n\r\n >>> import summary\r\n >>> s = summary.Summary('https://github.com/svven/summary')\r\n >>> s.extract()\r\n >>> s.title\r\n u'svven/summary'\r\n >>> s.image\r\n https://avatars0.githubusercontent.com/u/7524085?s=400\r\n >>> s.description\r\n u'summary - Summary is a complete solution to extract the title, image and description from any URL.'\r\n\r\nBatch usage with HTML rendering\r\n-------------------------------\r\n\r\nIf you fork or clone the repo you can use summarize.py like this::\r\n\r\n >>> import summary\r\n >>> summary.GET_ALL_DATA = True # default is False\r\n >>> urls = [\r\n 'http://www.wired.com/',\r\n 'http://www.nytimes.com/', \r\n 'http://www.technologyreview.com/lists/technologies/2014/'\r\n ]\r\n >>> from summarize import summarize, render\r\n >>> summaries, result, speed = summarize(urls)\r\n -> http://www.wired.com/\r\n [BadImage] RatioImageException(398, 82): http://www.wired.com/wp-content/vendor/condenast/pangea/themes/wired/assets/images/wired_logo.gif\r\n -> http://www.nytimes.com/\r\n [BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/33/ad.373366/bar1-3panel-nyt.png\r\n [BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/33/ad.373366/bar1-3panel-nytcom.png\r\n [BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/33/ad.373366/bar1-4panel-opinion.png\r\n [BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/51/ad.375173/CRS-1572_nytpinion_EARS_L_184x90_CP2.gif\r\n [BadImage] AdblockURLFilter: http://graphics8.nytimes.com/adx/images/ADS/37/51/ad.375174/CRS-1572_nytpinion_EARS_R_184x90_ER1.gif\r\n [BadImage] 
RatioImageException(379, 64): http://i1.nyt.com/images/misc/nytlogo379x64.gif\r\n [BadImage] TinyImageException(16, 16): http://graphics8.nytimes.com/images/article/functions/facebook.gif\r\n [BadImage] TinyImageException(16, 16): http://graphics8.nytimes.com/images/article/functions/twitter.gif-> http://www.technologyreview.com/lists/technologies/2014/\r\n Success: 3.\r\n >>> html = render(template=\"news.html\",\r\n summaries=summaries, result=result, speed=speed)\r\n >>> with open('demo.html', 'w') as file:\r\n ... file.write(html)\r\n >>> \r\n\r\nIn a nutshell\r\n-------------\r\n\r\nSummary requests the page from the URL, then uses\r\n`extraction `__ to parse the\r\nHTML. \r\n\r\nNote that it downloads the head tag first, performs\r\nspecific extraction techniques, and goes on to the body only if the\r\nextracted data is incomplete, unless ``summary.GET_ALL_DATA = True``.\r\n\r\nThe resulting lists of titles, images, and descriptions are filtered on\r\nthe fly to rule out unwanted items like ads, tiny images (tracking\r\nimages or sharing buttons), and plain white images. See the whole list\r\nof filters below.\r\n\r\nMany thanks to Will Larson (`@lethain `__)\r\nfor adapting his `extraction `__\r\nlibrary to version 0.2 to accommodate summary.\r\n\r\nRendering\r\n---------\r\n\r\nThe purpose of the HTML rendering mechanism is just to visualize\r\nextracted data. \r\nThe included Jinja2 template (news.html) is built on top of Bootstrap and displays the summaries in a nice responsive grid layout.\r\n\r\nYou can completely disregard the rendering mechanism and just\r\nimport the summary module for data extraction and filtering. 
You probably\r\nhave your own means to render the data, so you only need the summary\r\nfolder.\r\n\r\n|image|\r\n\r\n![news.html\r\npreview](\\ https://dl.dropboxusercontent.com/u/134594/Svven/news.png)\r\n\r\nThis is the output with ``summary.GET_ALL_DATA = True``.\r\n\r\n**Clicking the summary title, image and description cycles through the\r\nmultiple extracted values.**\r\n\r\n\r\n\r\n\r\n\r\nAnd this one was produced much faster (see footer) with\r\n``summary.GET_ALL_DATA = False``. It contains only the first valid item\r\nof each kind - title, image, and description. This is the default\r\nbehaviour. \r\n\r\n\r\n\r\nInstallation\r\n------------\r\nPip it for simple usage::\r\n\r\n $ pip install summary-extraction\r\n\r\n\r\nOr clone the repo if you need rendering::\r\n\r\n $ virtualenv env \r\n $ source env/bin/activate\r\n $ git clone https://github.com/svven/summary.git \r\n $ pip install -r summary/requirements.txt \r\n\r\n $ cd summary\r\n $ python # see the usage instructions above\r\n\r\nRequirements\r\n------------\r\nBase required packages are ``extraction`` and ``requests``, but it doesn't do much without ``adblockparser`` and ``Pillow``::\r\n\r\n Jinja2==2.7.2 # only for rendering \r\n Pillow==2.4.0\r\n adblockparser==0.2\r\n extraction==0.2 \r\n lxml==3.3.5 \r\n re2==0.2.20 # good for adblockparser\r\n requests==2.2.1\r\n w3lib==1.6\r\n\r\nFilters\r\n-------\r\n\r\nFilters are *callable* classes that perform specific data checks.\r\n\r\nFor the moment there are only image filters. The image URL is passed as\r\nthe input parameter to the first filter. The check is performed and the URL\r\nis returned if it is valid, so it is passed to the second filter and so\r\non. 
When the check fails it returns ``None``.\r\n\r\nThis pattern makes it possible to write the filtering routine like this::\r\n\r\n def _filter_image(self, url):\r\n \"The param is the image URL, which is returned if it passes *all* the filters.\"\r\n return reduce(lambda f, g: f and g(f), \r\n [\r\n filters.AdblockURLFilter()(url),\r\n filters.NoImageFilter(),\r\n filters.SizeImageFilter(),\r\n filters.MonoImageFilter(),\r\n filters.FormatImageFilter(),\r\n ])\r\n\r\n images = filter(None, map(self._filter_image, image_urls))\r\n\r\n- **AdblockURLFilter**\r\n\r\n Uses `adblockparser `__\r\n and returns ``None`` if it ``should_block`` the URL. \r\n \r\n Hats off to Mikhail Korobov (`@kmike `__) for the\r\n awesome work. It gives a lot of value to this mashup repo.\r\n\r\n- **NoImageFilter**\r\n\r\n Retrieves the actual image file, and returns ``None`` if it fails. \r\n \r\n Otherwise it returns an instance of the ``filters.Image`` class\r\n containing the URL, together with the size and format of the actual\r\n image. Basically it hydrates this instance, which is passed to\r\n the following filters. \r\n The ``Image.__repr__`` override returns just\r\n the URL so we can write the beautiful filtering routine you can see\r\n above.\r\n\r\n Note again that it only gets the first few chunks of the\r\n image file until the PIL parser gets the size and format of the\r\n image.\r\n\r\n- **SizeImageFilter**\r\n\r\n Checks that the ``filters.Image`` instance has a proper size. \r\n \r\n This can raise the following exceptions based on defined limits:\r\n ``TinyImageException``, ``HugeImageException``, or\r\n ``RatioImageException``. If any of these happens it returns ``None``.\r\n\r\n- **MonoImageFilter**\r\n\r\n Checks whether the image is plain white and returns ``None`` if it is. \r\n \r\n This filter retrieves the whole image file so it has an extra regex\r\n check before. 
E.g.: rules out these URLs: \r\n \r\n - http://wordpress.com/i/blank.jpg?m=1383295312g \r\n - http://images.inc.com/leftnavmenu/inc-logo-white.png\r\n\r\n- **FormatImageFilter**\r\n\r\n Rules out animated gif images for the moment. \r\n This can be extended to exclude other image formats based on file contents.\r\n\r\n\r\nThat's it for now. You're very welcome to contribute. \r\n\r\nComments and suggestions are welcome as well. Cheers, `@ducu `__\r\n\r\n\r\n.. |image| image:: https://dl.dropboxusercontent.com/u/134594/Svven/news.png", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/svven/summary", "keywords": "", "license": "LICENSE.txt", "maintainer": "", "maintainer_email": "", "name": "summary-extraction", "package_url": "https://pypi.org/project/summary-extraction/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/summary-extraction/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/svven/summary" }, "release_url": "https://pypi.org/project/summary-extraction/0.2/", "requires_dist": null, "requires_python": null, "summary": "Extract the title, image and description from any URL.", "version": "0.2" }, "last_serial": 1216077, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "89065165947e801905613d76c8519843", "sha256": "5c26a1a7b7d8313d232a6c98ae18f23f9cfec1de1f42190c30c61b072a663529" }, "downloads": -1, "filename": "summary-extraction-0.1.tar.gz", "has_sig": false, "md5_digest": "89065165947e801905613d76c8519843", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10447, "upload_time": "2014-06-08T17:01:49", "url": "https://files.pythonhosted.org/packages/04/7e/96825cfcfc14b3db339fcfb65ed459d038cdee4416601c1bd795c4f1d98c/summary-extraction-0.1.tar.gz" } ], "0.2": [ { "comment_text": "", "digests": { "md5": "4270de39cd97426ce942048d71966752", 
"sha256": "1fd0c6f74a24e4dbb59818688ef773ad6531ffd27230f136737e35ca618a621a" }, "downloads": -1, "filename": "summary-extraction-0.2.tar.gz", "has_sig": false, "md5_digest": "4270de39cd97426ce942048d71966752", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16412, "upload_time": "2014-09-07T14:33:32", "url": "https://files.pythonhosted.org/packages/83/41/2ae0801ed23f4c4c3376a02b4a4e900abb6d7604610588b9b2a49f851ac1/summary-extraction-0.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4270de39cd97426ce942048d71966752", "sha256": "1fd0c6f74a24e4dbb59818688ef773ad6531ffd27230f136737e35ca618a621a" }, "downloads": -1, "filename": "summary-extraction-0.2.tar.gz", "has_sig": false, "md5_digest": "4270de39cd97426ce942048d71966752", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16412, "upload_time": "2014-09-07T14:33:32", "url": "https://files.pythonhosted.org/packages/83/41/2ae0801ed23f4c4c3376a02b4a4e900abb6d7604610588b9b2a49f851ac1/summary-extraction-0.2.tar.gz" } ] }