{ "info": { "author": "Ivan Begtin", "author_email": "ivan@begtin.tech", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Natural Language :: English", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: Implementation :: CPython", "Programming Language :: Python :: Implementation :: PyPy" ], "description": "==========================================================\nnewsworker -- automatic news extractor using HTML scraping\n==========================================================\n\n.. image:: https://img.shields.io/pypi/v/newsworker.svg?style=flat-square\n :target: https://pypi.python.org/pypi/newsworker\n :alt: pypi version\n\n\n`newsworker` is a Python 3 library that extracts feeds from HTML pages. 
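Before falling back to HTML scraping, tools in this space usually check whether the page already advertises a real feed via classic RSS/Atom autodiscovery. A minimal stdlib sketch of that check (a generic illustration, not newsworker's actual code; the sample page and feed URL are taken from the examples below):

```python
from html.parser import HTMLParser

class FeedLinkParser(HTMLParser):
    """Collect hrefs of <link rel="alternate"> tags with an RSS/Atom MIME type."""

    FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if (tag == "link"
                and a.get("rel", "").lower() == "alternate"
                and a.get("type") in self.FEED_TYPES
                and "href" in a):
            self.feeds.append(a["href"])

parser = FeedLinkParser()
parser.feed('<html><head><link rel="alternate" type="application/rss+xml" '
            'href="https://www.dta.gov.au/feed.xml"></head></html>')
print(parser.feeds)  # ['https://www.dta.gov.au/feed.xml']
```

Only when no such `<link>` exists does date-pattern-based extraction from the page body become necessary.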
It's useful when you need to subscribe to news\nfrom a website that doesn't publish RSS/ATOM feeds and you don't want to rely on page change monitoring tools, which are not\nvery accurate.\n\n\nUsage examples\n---------------\n\nExtract news from the HTML pages of the Bulgarian government website and the EIB website\n\n >>> from newsworker.extractor import FeedExtractor\n >>> f = FeedExtractor(filtered_text_length=150)\n >>> feed, session = f.get_feed(url=\"http://government.bg/bg/prestsentar/novini\")\n >>> feed\n ...\n\n\n >>> feed, session = f.get_feed(url=\"http://www.eib.org/en/index.htm?lang=en\")\n >>> feed\n {'title': 'European Investment Bank (EIB)', 'language': 'en', 'link': 'http://www.eib.org/en/index.htm?lang=en', 'description': 'European Investment Bank (EIB)', 'items': [{'title': 'Blockchain Challenge: coders at the EIB', 'description': 'Blockchain Challenge: coders at the EIB', 'pubdate': datetime.datetime(2018, 6, 18, 0, 0), 'unique_id': 'f9d359f76118076c5331ffec3cdb82eb', 'raw_html': b'
18/06/2018 | 02:12
Blockchain Challenge: coders at the EIB
', 'extra': {'links': ['https://www.youtube.com/watch?v=YlKa2LZgxhE?autoplay=1'], 'images': ['http://www.eib.org/img/site/play.png']}, 'link': 'https://www.youtube.com/watch?v=YlKa2LZgxhE?autoplay=1'}, {'title': 'A brighter life for Kenyan women', 'description': 'Jujuy Verde \u2013 new horizons for women waste-pickers in Argentina', 'pubdate': datetime.datetime(2018, 6, 5, 0, 0), 'unique_id': '9caef61535352d2734d122c0e092b011', 'raw_html': b'
04/06/2018 | 01:32
A brighter life for Kenyan women
05/06/2018 | 03:12
Jujuy Verde \\xc3\\xa2\\xe2\\x82\\xac\\xe2\\x80\\x9c new horizons for women waste-pickers in Argentina
', 'extra': {'links': ['https://www.youtube.com/watch?v=T_7OmSDSXtc?autoplay=1', 'https://www.youtube.com/watch?v=d-btxsYT9hI?autoplay=1'], 'images': ['http://www.eib.org/img/site/play.png']}, 'link': 'https://www.youtube.com/watch?v=T_7OmSDSXtc?autoplay=1'}], 'cache': {'pats': ['dt:date:date_1']}}\n\nReuse cached patterns to speed up further news extraction. This can improve page parsing speed by up to 100x, since it cuts the number of date comparisons from about 350 patterns down to the 2-3 that matched before\n\n >>> pats = feed['cache']['pats']\n >>> feed, session = f.get_feed(url=\"http://www.eib.org/en/index.htm?lang=en\", cached_p=pats)\n\nChange the user agent if needed\n\n >>> USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11'\n >>> feed, session = f.get_feed(url=\"http://www.eib.org/en/index.htm?lang=en\", user_agent=USER_AGENT)\n\n\nInitialize the feed finder on a webpage\n\n >>> from newsworker.finder import FeedsFinder\n >>> f = FeedsFinder()\n\nTry to find feeds; this page publishes none\n\n >>> feeds = f.find_feeds('http://government.bg/bg/prestsentar/novini')\n {'url': 'http://government.bg/bg/prestsentar/novini', 'items': []}\n\nAdding the \"extractrss\" parameter launches FeedExtractor when no feed is found\n\n >>> feeds = f.find_feeds('http://government.bg/bg/prestsentar/novini', extractrss=True)\n >>> feeds\n {'url': 'http://government.bg/bg/prestsentar/novini', 'items': [{'feedtype': 'html', 'title': '\u041c\u0438\u043d\u0438\u0441\u0442\u0435\u0440\u0441\u043a\u0438 \u0441\u044a\u0432\u0435\u0442 :: \u041d\u043e\u0432\u0438\u043d\u0438', 'num_entries': 12, 'url': 'http://government.bg/bg/prestsentar/novini'}]}\n\nFind all feeds and get more information about each. 
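The cached-pattern speedup above can be illustrated with a toy sketch. This is hypothetical code, not newsworker's implementation: the real matching uses qddate's pyparsing grammars, with `strptime` formats standing in here, and the sample date format comes from the EIB output shown earlier:

```python
from datetime import datetime

# Hypothetical stand-ins for qddate's ~350 date patterns; the sample output
# above contains date lines like "18/06/2018 | 02:12".
ALL_PATTERNS = ["%Y-%m-%d", "%d-%m-%Y", "%d/%m/%Y | %H:%M"]

def parse_date(line, patterns):
    """Try each pattern in order; return (datetime, matched pattern) or (None, None)."""
    for pat in patterns:
        try:
            return datetime.strptime(line.strip(), pat), pat
        except ValueError:
            continue
    return None, None

# First page: scan the full pattern list, then remember what matched.
dt, matched = parse_date("18/06/2018 | 02:12", ALL_PATTERNS)

# Later pages of the same site: try the cached pattern first, so one
# comparison replaces a scan over the whole list.
dt2, _ = parse_date("05/06/2018 | 03:12", [matched])
```

The `cached_p=` argument in the example above plays the role of the one-element pattern list here.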
With \"noverify=False\" each feed parsed\n >>> feeds = f.find_feeds('https://www.dta.gov.au/news/', noverify=False)\n >>> feeds\n {'url': 'https://www.dta.gov.au/news/', 'items': [{'title': 'Digital Transformation Agency', 'url': 'https://www.dta.gov.au/feed.xml', 'feedtype': 'rss', 'num_entries': 10}]}\n\nAddind \"include_entries=True\" returns feeds and all parsed feed entries\n >>> feeds = f.find_feeds('https://www.dta.gov.au/news/', noverify=False, include_entries=True)\n >>> feeds\n\n\n\nDocumentation\n=============\n\nDocumentation is built automatically and can be found on\n`Read the Docs `_.\n\n\nFeatures\n========\n\n* Identifies news blocks on webpages using date patterns. More than 348 date patterns supported. Uses \n* Extremely fast, uses pyparsing\n* Includes function to find feeds on html page and if no feed found, than extract news\n\nLimitations\n========\n\n* Not all language-specific dates supported\n* Right aligned dates like \"Published - 27-01-2018\" not supported. It's not hard to add it but it greatly increases false acceptance rate.\n* Some news pages has no dates with urls or texts. These pages are not supported yet\n\nSpeed optimization\n========\n\n* qddate date parsing lib was created for this algorithm. Right now pattern marching is really fast\n* date patterns could be cached to speed up parsing speed for the same website\n* feed finder without verification of feeds works fast, but if verification enabled than it's slowed down\n\n\nTODO\n====\n* Support more date formats and improve qddate lib\n* Support news pages without dates\n\nUsage\n=====\n\nThe easiest way is to use the `newsworker.FeedExtractor <#newsworker.FeedExtractor>`_ class,\nand it's `get_feed` function.\n\n\n \n\n \n\n\nDependencies\n============\n\n`newsworker` relies on following libraries in some ways:\n\n * qddate_ is a module for data processing\n.. _qddate: https://pypi.python.org/pypi/qddate\n * pyparsing_ is a module for advanced text processing.\n.. 
_pyparsing: https://pypi.python.org/pypi/pyparsing\n * lxml_ is a module for XML parsing.\n.. _lxml: https://pypi.python.org/pypi/lxml\n\n\nSupported language-specific dates\n=================================\n* Bulgarian\n* Czech\n* English\n* French\n* German\n* Portuguese\n* Russian\n* Spanish\n\nThanks\n======\nI wrote this news extraction code back in 2008 and have only updated it a few times since, migrating from regular expressions\nto pyparsing. The initial project was split into the qddate date parsing lib and newsworker, which is intended for news identification\non HTML pages.\n\nFeel free to ask questions at ivan@begtin.tech\n\n.. image:: https://badges.gitter.im/newsworker/Lobby.svg\n :alt: Join the chat at https://gitter.im/newsworker/Lobby\n :target: https://gitter.im/newsworker/Lobby?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge\n\n\n.. :changelog:\n\nHistory\n=======\n\n\n1.0.1 (2018-07-21)\n------------------\n* First public release on PyPI and GitHub\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/ivbeg/newsworker", "keywords": "news parsing extraction feeds rss atom", "license": "BSD", "maintainer": "", "maintainer_email": "", "name": "newsworker", "package_url": "https://pypi.org/project/newsworker/", "platform": "", "project_url": "https://pypi.org/project/newsworker/", "project_urls": { "Homepage": "https://github.com/ivbeg/newsworker" }, "release_url": "https://pypi.org/project/newsworker/1.0.1/", "requires_dist": null, "requires_python": "", "summary": "Advanced news feed extractor and finder library. 
Helps to automatically extract news from websites without RSS/ATOM feeds", "version": "1.0.1" }, "last_serial": 4087941, "releases": { "1.0.1": [ { "comment_text": "", "digests": { "md5": "4dde4d9551faf39355b356a293efffbc", "sha256": "e37400a69005bd1d2e7cb5fbfb0d046ebacddc4f2a9f6413cf5cd97b274d6d1b" }, "downloads": -1, "filename": "newsworker-1.0.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "4dde4d9551faf39355b356a293efffbc", "packagetype": "bdist_wheel", "python_version": "3.6", "requires_python": null, "size": 19938, "upload_time": "2018-07-21T07:37:07", "url": "https://files.pythonhosted.org/packages/a4/7b/f4a1e706232d98d68988402ec259204cfdfcf08b29ccf4346733d58e1dcf/newsworker-1.0.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "20728aac6f9aeafbcc8f198808db1fc3", "sha256": "8266e5a86c3f6079801d6bfb9bde22b44e6a364dbb380531055061144ec66e58" }, "downloads": -1, "filename": "newsworker-1.0.1.tar.gz", "has_sig": false, "md5_digest": "20728aac6f9aeafbcc8f198808db1fc3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14717, "upload_time": "2018-07-21T07:37:17", "url": "https://files.pythonhosted.org/packages/e4/2b/66fc02f7c2caac0b35f57decc1f6c152ee1fe6041816d694cf93c0c18c27/newsworker-1.0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4dde4d9551faf39355b356a293efffbc", "sha256": "e37400a69005bd1d2e7cb5fbfb0d046ebacddc4f2a9f6413cf5cd97b274d6d1b" }, "downloads": -1, "filename": "newsworker-1.0.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "4dde4d9551faf39355b356a293efffbc", "packagetype": "bdist_wheel", "python_version": "3.6", "requires_python": null, "size": 19938, "upload_time": "2018-07-21T07:37:07", "url": "https://files.pythonhosted.org/packages/a4/7b/f4a1e706232d98d68988402ec259204cfdfcf08b29ccf4346733d58e1dcf/newsworker-1.0.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "20728aac6f9aeafbcc8f198808db1fc3", "sha256": 
"8266e5a86c3f6079801d6bfb9bde22b44e6a364dbb380531055061144ec66e58" }, "downloads": -1, "filename": "newsworker-1.0.1.tar.gz", "has_sig": false, "md5_digest": "20728aac6f9aeafbcc8f198808db1fc3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14717, "upload_time": "2018-07-21T07:37:17", "url": "https://files.pythonhosted.org/packages/e4/2b/66fc02f7c2caac0b35f57decc1f6c152ee1fe6041816d694cf93c0c18c27/newsworker-1.0.1.tar.gz" } ] }