{ "info": { "author": "Andrey Sechenov", "author_email": "dronych@mail.ru", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "MulTithreaded SCRAPER\n===================\n\nHello, welcome you here. This is the mt_scraper library documentation for python version 3.\n\nDescription\n--------\n\nThis is a project of a multithreaded site scraper. Multithreading operation speeds up data collection from Web several times (more than 10 on a normal old work laptop). To use it, you need to redefine the parse method for your needs and enjoy the benefits of multithreading (with all its implementation in Python)\n\nCollecting data in the JSON file, which stores a list of objects (dictionaries) with the collected data.\n\nApplication\n----------\n\n### Simple application\n\n#### Main Library Usage Scenario\n\n```\nimport mt_scraper\n\nscraper = mt_scraper.Scraper ()\nscraper.run ()\n```\n\nAs you can see there are only three lines of code\n\n#### What happens when this happens\n\nWith this application, you get a data scraper from the pages of the list:\n```\nurl_components_list = [\n 'http://example.com/',\n 'http://scraper.iamengineer.ru',\n 'http://scraper.iamengineer.ru/bad-file.php',\n 'http://badlink-for-scarper.ru',\n]\n```\nThe last two pages were added to demonstrate the two most common errors when retrieving data from the Internet, these are HTTP 404 - Not Found, and the URL Error: Name: or service not known.\n\nThe real URL is obtained by substituting this list into a template:\n```\nurl_template = '{}'\n```\nData is accumulated in the file:\n```\nout_filename = 'out.json'\n```\nThe work is conducted in 5 threads and a task queue of 5 units is created (this has a value, for example, when canceling an operation from the keyboard, the queue length indicates how many tasks were sent for execution):\n```\nthreads_num = 5\nqueue_len = 5\n```\nThe following is used as a parser function:\n```\ndef parse (self, num, url_component, html):\n '''You must override this method.\n Must return a dictionary or None if parsing the page\n impossible\n '''\n parser = MyDummyHTMLParser ()\n parser.feed (html)\n obj = parser.obj\n obj ['url_component'] = url_component\n return parser.obj\n```\nDummyParser is a simple version of HTML parser, it is remarkable only because it uses only one standard library and does not require any additional modules.\nFile dummy_parser.py:\n```\nfrom html.parser import HTMLParser\n\nclass MyDummyHTMLParser (HTMLParser):\n def __init __ (self):\n super () .__ init __ ()\n self.a_tag = False\n self.h1_tag = False\n self.p_tag = False\n self.obj = {}\n\n def handle_starttag (self, tag, attrs):\n if tag == 'h1':\n self.h1_tag = True\n elif tag == 'p':\n self.p_tag = True\n elif tag == 'a':\n self.a_tag = True\n for (attr, value,) in attrs:\n if attr == 'href':\n self.obj ['link'] = value\n\n def handle_endtag (self, tag):\n if tag == 'h1':\n self.h1_tag = False\n elif tag == 'p':\n self.p_tag = False\n elif tag == 'a':\n self.a_tag = False\n\n def handle_data (self, data):\n if self.h1_tag:\n self.obj ['header'] = data\n elif self.p_tag and not self.a_tag:\n self.obj ['article'] = data\n```\nThis approach is used only to demonstrate the capabilities of multithreading, in real projects it is recommended to use the lxml or BS libraries, a more advanced application will be shown below in the section \"Advanced Application\"\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://bitbucket.org/dronych/mt_scraper-project", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "mt-scraper", "package_url": "https://pypi.org/project/mt-scraper/", "platform": "", "project_url": "https://pypi.org/project/mt-scraper/", "project_urls": { "Homepage": "https://bitbucket.org/dronych/mt_scraper-project" }, "release_url": "https://pypi.org/project/mt-scraper/0.4.0/", "requires_dist": [ "PySocks (==1.7.0)" ], "requires_python": "", "summary": "Easy multythread web scraper", "version": "0.4.0" }, "last_serial": 5595040, "releases": { "0.2": [ { "comment_text": "", "digests": { "md5": "ee5d7fd5b4546037ab88a79b148bd818", "sha256": "2b216dacb14ab969111828d62555c0ca4e83138200e263869532e0cc614fdc90" }, "downloads": -1, "filename": "mt_scraper-0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "ee5d7fd5b4546037ab88a79b148bd818", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5762, "upload_time": "2019-06-29T11:44:12", "url": "https://files.pythonhosted.org/packages/2e/7d/40b619f683a9be8484e920c599d281310d1ffdc6fc7a6656a89444812118/mt_scraper-0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "3cc35270400958b4893e4112a01993b2", "sha256": "8c6cdb7bec73b5c359687f071f17d177f4b890d9807dedc4f2c88c2a2c9f1337" }, "downloads": -1, "filename": "mt_scraper-0.2.tar.gz", "has_sig": false, "md5_digest": "3cc35270400958b4893e4112a01993b2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4454, "upload_time": "2019-06-29T11:44:15", "url": "https://files.pythonhosted.org/packages/0a/7d/a0868c973451015da8b2a785ab6766a011f73777f581ea15c7967785c6ee/mt_scraper-0.2.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "791d1718ce4da0604e2a5308a9e59e13", "sha256": "f2da1e889da3d3565ec6d335cb638bf1d6fd9f078dcdf3b18698407a34e7f06c" }, "downloads": -1, "filename": "mt_scraper-0.2.1-py3-none-any.whl", "has_sig": false, "md5_digest": "791d1718ce4da0604e2a5308a9e59e13", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5784, "upload_time": "2019-06-29T11:50:03", "url": "https://files.pythonhosted.org/packages/cd/cd/53918c5040b84eca001ca141486a5660a1b428f3a042126d257cafdd3c34/mt_scraper-0.2.1-py3-none-any.whl" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "8dfc8c84b9b377be5134c688d0530636", "sha256": "aabe344a63e18e78bd15b91f334dcb2d899069cfc98a184a4db1d1252a02c096" }, "downloads": -1, "filename": "mt_scraper-0.2.2-py3-none-any.whl", "has_sig": false, "md5_digest": "8dfc8c84b9b377be5134c688d0530636", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5797, "upload_time": "2019-06-29T17:24:36", "url": "https://files.pythonhosted.org/packages/3f/d5/5bb43e0d85d1b985e8c1f880086f764d72878a8972a73721e3c57c1a4b5e/mt_scraper-0.2.2-py3-none-any.whl" } ], "0.2.3": [ { "comment_text": "", "digests": { "md5": "3bb2eba1befa3b5c538c8cc7a4e71332", "sha256": "7574ac90f25fba070cc49c6ff17d34e2778eb2f5f9a8e096aa733ad21d5b4885" }, "downloads": -1, "filename": "mt_scraper-0.2.3-py3-none-any.whl", "has_sig": false, "md5_digest": "3bb2eba1befa3b5c538c8cc7a4e71332", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6015, "upload_time": "2019-07-01T16:20:31", "url": "https://files.pythonhosted.org/packages/f6/96/df01f7badd155cf5ef4b5d71f65410973ceef74c6b3c68b73191935df1cd/mt_scraper-0.2.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a3dc665c15a60d0d8691dcb084946381", "sha256": "0c462ae0db09f780b6289c5de647be81bdfabbfcf841973482f8ef089e676b2f" }, "downloads": -1, "filename": "mt_scraper-0.2.3.tar.gz", "has_sig": false, "md5_digest": "a3dc665c15a60d0d8691dcb084946381", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4684, "upload_time": "2019-07-01T16:20:32", "url": "https://files.pythonhosted.org/packages/6e/b6/27b1f3c64922b4f81a21a9793ea67e2b6c1fc394c7cca9025061ed33316c/mt_scraper-0.2.3.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "324b2c897608485c6a62ceaca2ecf024", "sha256": "cc0e44399726f3d9d770a128461033a816e1a70b1ea066b86b997fd3cfacb56b" }, "downloads": -1, "filename": "mt_scraper-0.3.0-py3-none-any.whl", "has_sig": false, "md5_digest": "324b2c897608485c6a62ceaca2ecf024", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6169, "upload_time": "2019-07-03T17:21:09", "url": "https://files.pythonhosted.org/packages/a1/f8/27dfebaee262014d240ed0e2dfac2adea89d92f839e8df883b00ab74327a/mt_scraper-0.3.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6532930bc35764bea0f3418e56409d06", "sha256": "49b9f2e1bfed0462507bcc14e89c969bb2b57618b186471ece0efb3e160e5c0e" }, "downloads": -1, "filename": "mt_scraper-0.3.0.tar.gz", "has_sig": false, "md5_digest": "6532930bc35764bea0f3418e56409d06", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4815, "upload_time": "2019-07-03T17:21:12", "url": "https://files.pythonhosted.org/packages/af/be/a9a67c63be211f88fd07fe05d7ecb232c55f3157fdef9ac38ac8b685dda8/mt_scraper-0.3.0.tar.gz" } ], "0.3.2": [ { "comment_text": "", "digests": { "md5": "3a5022ea7f2c71b5abed9142251eb213", "sha256": "9fdd30759cd5361a1a161b550b6136d799648655c24e5a3d7ae75e96dcf37aae" }, "downloads": -1, "filename": "mt_scraper-0.3.2-py3-none-any.whl", "has_sig": false, "md5_digest": "3a5022ea7f2c71b5abed9142251eb213", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7632, "upload_time": "2019-07-04T22:02:14", "url": "https://files.pythonhosted.org/packages/e1/d8/fe2d75cc30d76d6989465fb0c7013ad27a7d1f11f3298ac6c1d09b91d4d8/mt_scraper-0.3.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c5f28125a608c361792e73c4e3297e12", "sha256": "cea6487f0ea37789bf4c8b8ebcbec09e2c29db30cd37e08f211429718470d049" }, "downloads": -1, "filename": "mt_scraper-0.3.2.tar.gz", "has_sig": false, "md5_digest": "c5f28125a608c361792e73c4e3297e12", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6321, "upload_time": "2019-07-04T22:02:19", "url": "https://files.pythonhosted.org/packages/5e/d3/2e2423754f3a4bbdf43d35a62b4acf3de6ed2912f860706e7eedc50c0539/mt_scraper-0.3.2.tar.gz" } ], "0.3.3": [ { "comment_text": "", "digests": { "md5": "9f90ec79569b0a30754da91351f027de", "sha256": "54745b4f19ea82a73e0d8d9cd959562e0c6be943d69ab560fe47b0243870eedf" }, "downloads": -1, "filename": "mt_scraper-0.3.3-py3-none-any.whl", "has_sig": false, "md5_digest": "9f90ec79569b0a30754da91351f027de", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7640, "upload_time": "2019-07-16T06:25:50", "url": "https://files.pythonhosted.org/packages/2a/83/133e8b26a8532e42676962e5220cc9631f675d4ee8fc7b02d9a1e9cef187/mt_scraper-0.3.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "42544622c3541305d7b41b27e5420745", "sha256": "0398efa9090a1ce8301852c21b223d5df603f2c06101e4e2beb0698afe34b962" }, "downloads": -1, "filename": "mt_scraper-0.3.3.tar.gz", "has_sig": false, "md5_digest": "42544622c3541305d7b41b27e5420745", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6334, "upload_time": "2019-07-16T06:25:53", "url": "https://files.pythonhosted.org/packages/8f/07/e7a80f746598f0c257be788a0f4be8d94f7b7a98fc063309e76cb327b0b0/mt_scraper-0.3.3.tar.gz" } ], "0.3.4": [ { "comment_text": "", "digests": { "md5": "21b5ddf8e68efede5585d5596243b7d6", "sha256": "1178b4297f2655f608e589e6877e54ce0d1a356bfa97b3e12e000faec9bb6ce6" }, "downloads": -1, "filename": "mt_scraper-0.3.4-py3-none-any.whl", "has_sig": false, "md5_digest": "21b5ddf8e68efede5585d5596243b7d6", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7687, "upload_time": "2019-07-18T09:50:31", "url": "https://files.pythonhosted.org/packages/02/ea/7f4e3dd28435e23e1c663fc320236034459d524bface2deb7cd483eb5a17/mt_scraper-0.3.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "7dd337ae8188e05d8e2211b75d2f4a7e", "sha256": "8a407cb82db6ce9b1ac7b075635be7155f41f97d0d6243d09fa655a440ef827c" }, "downloads": -1, "filename": "mt_scraper-0.3.4.tar.gz", "has_sig": false, "md5_digest": "7dd337ae8188e05d8e2211b75d2f4a7e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6381, "upload_time": "2019-07-18T09:50:36", "url": "https://files.pythonhosted.org/packages/f5/8e/52e8021b39b6563f310712aeaa5b58c2e697c64d175f40705f61623e1f25/mt_scraper-0.3.4.tar.gz" } ], "0.3.5": [ { "comment_text": "", "digests": { "md5": "fe44e226c1b4d1c6ee8f615f6b3edbaa", "sha256": "9b4079e2060d732feeb45ffb52ed7c8c283d77bf26e5c3ee7f2a7d7762a617f9" }, "downloads": -1, "filename": "mt_scraper-0.3.5-py3-none-any.whl", "has_sig": false, "md5_digest": "fe44e226c1b4d1c6ee8f615f6b3edbaa", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7884, "upload_time": "2019-07-18T12:14:41", "url": "https://files.pythonhosted.org/packages/6d/ea/bc0e3fd5ed7ca06f0b809531b9d970a358f77026b296605f9950c8357bbc/mt_scraper-0.3.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e469403cea4c102baf9c2a992c4f66dc", "sha256": "17df48ac11b34896a25d637a0289e3873e603043b5f8ca4d7d6cc47f92e292dd" }, "downloads": -1, "filename": "mt_scraper-0.3.5.tar.gz", "has_sig": false, "md5_digest": "e469403cea4c102baf9c2a992c4f66dc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6576, "upload_time": "2019-07-18T12:14:47", "url": "https://files.pythonhosted.org/packages/47/3e/22a8379f0facc76815e3cd62cc5ba0c297ddf182ddee18ab7c11a7fa98e9/mt_scraper-0.3.5.tar.gz" } ], "0.4.0": [ { "comment_text": "", "digests": { "md5": "978ce3c9b53a7b8a525f730405292dca", "sha256": "a475060849a58159dd22600dad08fc807e310576dc6fa15cf576614923865d30" }, "downloads": -1, "filename": "mt_scraper-0.4.0-py3-none-any.whl", "has_sig": false, "md5_digest": "978ce3c9b53a7b8a525f730405292dca", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7504, "upload_time": "2019-07-28T07:30:24", "url": "https://files.pythonhosted.org/packages/99/15/b7687dbf880e32fe4b5c0ec29541d029415c0824f7427b7dc4e3b07273c6/mt_scraper-0.4.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "eaf96aff9fa6af75477111a4492e3881", "sha256": "9d7f6754dbc164f597b6efd283118c432e100866f4db73eecab92387580ee77b" }, "downloads": -1, "filename": "mt_scraper-0.4.0.tar.gz", "has_sig": false, "md5_digest": "eaf96aff9fa6af75477111a4492e3881", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6902, "upload_time": "2019-07-28T07:30:25", "url": "https://files.pythonhosted.org/packages/5c/0e/e1d6361cd2eb00346137b424f8d9dc411c6dbc022f4bf162b50c298c49bb/mt_scraper-0.4.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "978ce3c9b53a7b8a525f730405292dca", "sha256": "a475060849a58159dd22600dad08fc807e310576dc6fa15cf576614923865d30" }, "downloads": -1, "filename": "mt_scraper-0.4.0-py3-none-any.whl", "has_sig": false, "md5_digest": "978ce3c9b53a7b8a525f730405292dca", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 7504, "upload_time": "2019-07-28T07:30:24", "url": "https://files.pythonhosted.org/packages/99/15/b7687dbf880e32fe4b5c0ec29541d029415c0824f7427b7dc4e3b07273c6/mt_scraper-0.4.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "eaf96aff9fa6af75477111a4492e3881", "sha256": "9d7f6754dbc164f597b6efd283118c432e100866f4db73eecab92387580ee77b" }, "downloads": -1, "filename": "mt_scraper-0.4.0.tar.gz", "has_sig": false, "md5_digest": "eaf96aff9fa6af75477111a4492e3881", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6902, "upload_time": "2019-07-28T07:30:25", "url": "https://files.pythonhosted.org/packages/5c/0e/e1d6361cd2eb00346137b424f8d9dc411c6dbc022f4bf162b50c298c49bb/mt_scraper-0.4.0.tar.gz" } ] }