{ "info": { "author": "Mahmoud Lababidi", "author_email": "lababidi+py@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Other Environment", "Intended Audience :: Developers", "License :: OSI Approved :: Apache Software License", "Operating System :: MacOS :: MacOS X", "Operating System :: Microsoft :: Windows", "Operating System :: POSIX", "Programming Language :: Python", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Topic :: Internet", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Utilities" ], "description": "Goose3 - Article Extractor\n===============================================\n\n.. image:: https://travis-ci.org/goose3/goose3.svg?branch=master\n :target: https://travis-ci.org/goose3/goose3\n.. image:: https://badge.fury.io/py/goose3.svg\n :target: https://badge.fury.io/py/goose3\n\nIntro\n--------------------------------------------------------------------------------\n\nGoose was originally an article extractor written in Java that has most\nrecently (Aug 2011) been converted to a `scala project `_.\n\nThis is a complete rewrite in Python. The aim of the software is to\ntake any news article or article-type web page and extract not only\nthe main body of the article but also all metadata and the most\nprobable image candidate.\n\nGoose will try to extract the following information:\n\n- Main text of an article\n- Main image of article\n- Any YouTube/Vimeo movies embedded in article\n- Meta Description\n- Meta tags\n\nThe Python version was originally rewritten by:\n\n- Xavier Grangier\n\nLicensing\n--------------------------------------------------------------------------------\n\nIf you find Goose useful or have issues please drop me a line. 
I'd love\nto hear how you're using it or what features should be improved.\n\nGoose is licensed by Gravity.com under the Apache 2.0 license; see the\nLICENSE file for more details.\n\nOn-line Documentation\n--------------------------------------------------------------------------------\nOn-line documentation is available on\n`Read the Docs `_ which contains more in-depth\ndocumentation.\n\nSetup\n--------------------------------------------------------------------------------\n\nTo install using pip:\n\n.. code-block::\n\n pip install goose3\n\nTo install from source:\n\n.. code-block::\n\n mkvirtualenv --no-site-packages goose3\n git clone https://github.com/goose3/goose3.git\n cd goose3\n pip install -r ./requirements/python\n python setup.py install\n\nTake it for a spin\n--------------------------------------------------------------------------------\n\n.. code-block:: python\n\n >>> from goose3 import Goose\n >>> url = 'http://edition.cnn.com/2012/02/22/world/europe/uk-occupy-london/index.html?hpt=ieu_c2'\n >>> g = Goose()\n >>> article = g.extract(url=url)\n >>> article.title\n 'Occupy London loses eviction fight'\n >>> article.meta_description\n \"Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoid eviction Wednesday in a decision made by London's Court of Appeal.\"\n >>> article.cleaned_text[:150]\n \"(CNN) - Occupy London protesters who have been camped outside the landmark St. Paul's Cathedral for the past four months lost their court bid to avoi\"\n >>> article.top_image.src\n 'http://i2.cdn.turner.com/cnn/dam/assets/111017024308-occupy-london-st-paul-s-cathedral-story-top.jpg'\n\nConfiguration\n--------------------------------------------------------------------------------\n\nThere are two ways to pass configuration to Goose. The first one is to\npass Goose a Configuration() object. 
The second one is to pass a\nconfiguration dict.\n\nFor instance, to change the user agent used by Goose, pass:\n\n.. code-block:: python\n\n >>> g = Goose({'browser_user_agent': 'Mozilla'})\n\nSwitching parsers: Goose can use either the lxml HTML parser or the lxml\nsoup parser. The HTML parser is the default. To use the soup parser,\nset it in the configuration dict:\n\n.. code-block:: python\n\n >>> g = Goose({'browser_user_agent': 'Mozilla', 'parser_class':'soup'})\n\nGoose can also be made more lenient about network exceptions. To stop it\nfrom raising network exceptions, set the strict configuration setting to False:\n\n.. code-block:: python\n\n >>> g = Goose({'strict': False})\n\n\nTo turn on image fetching, set the enable_image_fetching\nconfiguration property to True:\n\n.. code-block:: python\n\n >>> g = Goose({'enable_image_fetching': True})\n\n\nGoose is now language aware\n--------------------------------------------------------------------------------\n\nFor example, scraping a Spanish content page with correct meta language\ntags:\n\n.. code-block:: python\n\n >>> from goose3 import Goose\n >>> url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'\n >>> g = Goose()\n >>> article = g.extract(url=url)\n >>> article.title\n 'Las listas de espera se agravan'\n >>> article.cleaned_text[:150]\n 'Los recortes pasan factura a los pacientes. De diciembre de 2010 a junio de 2012 las listas de espera para operarse aumentaron un 125%. Hay más ciudad'\n\nSome pages don't have correct meta language tags; you can force the\nlanguage through configuration:\n\n.. 
code-block:: python\n\n >>> from goose3 import Goose\n >>> url = 'http://www.elmundo.es/elmundo/2012/10/28/espana/1351388909.html'\n >>> g = Goose({'use_meta_language': False, 'target_language':'es'})\n >>> article = g.extract(url=url)\n >>> article.cleaned_text[:150]\n 'Importante golpe a la banda terrorista ETA en Francia. La Guardia Civil ha detenido en un hotel de Macon, a 70 kilómetros de Lyon, a Izaskun Lesaka y '\n\nPassing ``{'use_meta_language': False, 'target_language': 'es'}`` tells\nGoose to ignore the meta language tags and treat the page as Spanish.\n\n\nVideo extraction\n--------------------------------------------------------------------------------\n\n.. code-block:: python\n\n >>> import goose3\n >>> url = 'http://www.liberation.fr/politiques/2013/08/12/journee-de-jeux-pour-ayrault-dans-les-jardins-de-matignon_924350'\n >>> g = goose3.Goose({'target_language':'fr'})\n >>> article = g.extract(url=url)\n >>> article.movies\n []\n >>> article.movies[0].src\n 'http://sa.kewego.com/embed/vp/?language_code=fr&playerKey=1764a824c13c&configKey=dcc707ec373f&suffix=&sig=9bc77afb496s&autostart=false'\n >>> article.movies[0].embed_code\n '