{ "info": { "author": "qiulimao", "author_email": "qiulimao@getqiu.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Web Environment", "Intended Audience :: Developers", "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Topic :: Internet :: WWW/HTTP", "Topic :: Software Development :: Libraries :: Application Frameworks", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "weblocust\n========\n\nA powerful spider (web crawler) system in Python, based on **pyspider**.\n\n- Write scripts in Python\n- A more powerful WebUI than pyspider, with script editor, task monitor, project manager and result viewer\n- [MongoDB](https://www.mongodb.org/) as database backend\n- [RabbitMQ](http://www.rabbitmq.com/), [Beanstalk](http://kr.github.com/beanstalkd/), [Redis](http://redis.io/) and [Kombu](http://kombu.readthedocs.org/) as message queue\n- Task priority, retry, periodic crawling, recrawl by age, etc.\n- Distributed architecture, crawling of JavaScript pages, Python 2 & 3, etc.\n\n\nrelease 
Note\n-----------\nAlthough `pyspider` was written by a Chinese developer, he works in the UK and his English is excellent. Among `python` crawlers, `pyspider` is well known not only in China;\nit also has many users abroad, so the author never planned to write dedicated Chinese documentation. His blog has some introductions and usage notes for early versions. Although the project is now updated fairly quickly,\nthe way it is used has barely changed; the internal structure may have changed.\n\nThe author of `pyspider` goes by the nickname binux. He is clearly very capable and knowledgeable, and I would like to express my admiration here.\n`weblocust` is a set of improvements I made on top of `pyspider` to better fit our own needs. `pyspider` originally supports many kinds of `resultdb`;\nI optimized only the case where `mongodb` is the `resultdb`. If you use `mysql` for storage, the new `weblocust` features may not be available.\n\nMain improvements:\n\n* Improvements to the `webui`. binux had already done this part very well; to get a better control experience and display, I changed most of this module.\n* 
The `js`, `css` and similar files were originally hosted in the cloud; I moved them to local storage. The crawler cannot work without a network, but sometimes we still need to browse the results.\n* Changed the structure in which `mongodb` stores each `result`. I think the `schemaless` nature of `mongodb` neatly solves the problem of crawler fields varying widely, so it should be fully exploited, and there is no need to keep the layout unified with `mysql`.\n* Added an `xpath` method for extracting web page content.\n* `response` is partially compatible with `scrapy`, because I find the `linkextractor` in `scrapy` very handy. If you run `python` 2.7, you can use the `scrapy` `linkextractor`.\n* Added a data cleaning module, `cleaner`. Its implementation is inspired by `scrapy`.\n* Provides a one-stop solution for `OnePageHasManyItem` and `OneItemHasManySubItem`, which is especially suitable for pages such as blog comments and forum replies.\n* Provides flexible storage. Currently, once `pyspider` is running it can use only one `result_worker`, which makes storage quite inflexible. In `weblocust` you can define your own storage method for any result.\n* 
Added some small tools for developers, such as automatically writing the modification time into files, adding new authors, and automatically deploying the documentation.\n* Fixed some bugs.\n\nAbout the documentation:\nThe likely readers of this documentation are Chinese, so it is edited on top of binux's documentation. The Chinese parts are my additions, and the English parts have minor changes or additions. I also replaced `pyspider` with `weblocust` throughout the documentation and code. This is not meant to hide that `weblocust`\nis based on `pyspider`; it is only to keep the project naming consistent. In the source code the `author` field always keeps `binux`, and I added myself under `contributor`.\n\nSample Code\n-----------\n\n```python\nfrom weblocust.libs.base_handler import *\nfrom weblocust.libs.useragent import IphoneSafari,LinuxChrome\nfrom weblocust.libs.cleaners import TakeFirst,JoinCleaner,StripBlankMoreThan2\nfrom weblocust.libs.cleaners import reduceclean,mapclean,mapreduce\n\nclass Handler(BaseHandler):\n    crawl_config = {\n        'headers': {'User-Agent': LinuxChrome},\n        \"cookie\":\"a=123\",\n    }\n\n    @every(minutes=24 * 60)\n    def on_start(self):\n        self.crawl('http://scrapy.org/', callback=self.index_page)\n\n    @config(age=10 * 24 * 60 * 60)\n    def index_page(self, response):\n        for each in response.doc('a[href^=\"http\"]').items():\n            self.crawl(each.attr.href, callback=self.detail_page)\n\n    def detail_page(self, response):\n        return {\n            \"url\": response.url,\n            \"title\": response.doc('title').text(),\n        }\n\n    def on_result__detail_page(self,result):\n        \"\"\" you can save the results on your own demand \"\"\"\n        
pass\n```\nWebUI\n---------\n\n![Demo Img]\n\n\nInstallation\n------------\nYou can install weblocust in two ways:\n\n1. The most convenient way: `pip install weblocust`\n2. From source: `git clone https://github.com/qiulimao/weblocust.git`, then `python setup.py install`\n\nThen run `weblocust mkconfig` to generate a simple configuration file.\n\nFinally, run `weblocust -c generatedfilename` and visit [http://localhost:5000/](http://localhost:5000/)\n\nContribute\n----------\n\n\nTODO\n----\n\n### next version\n* keep in space\n\n\n\n\n### more\n\n- [x] edit script with vim via [WebDAV](http://en.wikipedia.org/wiki/WebDAV)\n\n\nLicense\n-------\nLicensed under the Apache License, Version 2.0\n\n\n[Demo Img]: imgs/demo.png\n[Issue]: https://github.com/qiulimao/webocust/issues\n\n\n[Demo Img]: docs/imgs/demo.png", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/qiulimao/weblocust", "keywords": "scrapy crawler spider webui pyspider weblocust", "license": "Apache License, Version 2.0", "maintainer": null, "maintainer_email": null, "name": "weblocust", "package_url": "https://pypi.org/project/weblocust/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/weblocust/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/qiulimao/weblocust" }, "release_url": "https://pypi.org/project/weblocust/1.0.3/", "requires_dist": null, "requires_python": null, "summary": "A more Powerful Spider System in Python based on pyspider", "version": "1.0.3" }, "last_serial": 2434982, "releases": { "1.0": [ { "comment_text": "", "digests": { "md5": "fe381dccc9b5ecc19c71a16879fdf545", "sha256": "891ec6f5a8abb37cfa65e8624486326ac60fca8f609e09217fa2386be5447620" }, "downloads": -1, "filename": "weblocust-1.0.tar.gz", "has_sig": false, "md5_digest": 
"fe381dccc9b5ecc19c71a16879fdf545", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 745287, "upload_time": "2016-10-27T12:17:01", "url": "https://files.pythonhosted.org/packages/48/e5/fe1c6aca0c6cd1ec449fba7b6ce66036115a384091ff7d189d2d15ccad0c/weblocust-1.0.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "351691b018c126b6a71af29d5846904e", "sha256": "4b28186b9fcad4d5f0d1c8409e2ae3d7b4d99e359342c91a4ea566092498797f" }, "downloads": -1, "filename": "weblocust-1.0.1.tar.gz", "has_sig": false, "md5_digest": "351691b018c126b6a71af29d5846904e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 747897, "upload_time": "2016-10-27T12:23:43", "url": "https://files.pythonhosted.org/packages/74/8a/790584a9064a39f7963d6e9874b2b794d4392eeae92366b3ef3cba38c52e/weblocust-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "e29c89ac44b9002ce8a33260e20b8826", "sha256": "8eb8f940bce6847b609c591e697db60c913af988f4e0361279218ae9e2e7d3f9" }, "downloads": -1, "filename": "weblocust-1.0.2-py2-none-any.whl", "has_sig": false, "md5_digest": "e29c89ac44b9002ce8a33260e20b8826", "packagetype": "bdist_wheel", "python_version": "2.7", "requires_python": null, "size": 2196931, "upload_time": "2016-10-29T02:07:23", "url": "https://files.pythonhosted.org/packages/82/43/682c5d901c751488ac900ad36419eaf2ccbd7a440c85c6f0b76baa54fa07/weblocust-1.0.2-py2-none-any.whl" }, { "comment_text": "", "digests": { "md5": "afe1009c437a0f8e9f9f537d698db72c", "sha256": "c743bb2345c4697e58adb37ee806c35197c61c5828ee45fce52ae38b848f0ee6" }, "downloads": -1, "filename": "weblocust-1.0.2.tar.gz", "has_sig": false, "md5_digest": "afe1009c437a0f8e9f9f537d698db72c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 748534, "upload_time": "2016-10-28T09:47:12", "url": 
"https://files.pythonhosted.org/packages/38/96/04e8788bb6487f88d8e1fd8c75956fbacf019f98163d36d3a3b5b50d8095/weblocust-1.0.2.tar.gz" } ], "1.0.3": [ { "comment_text": "", "digests": { "md5": "a3c48e7750d224cd126dcd496e7aaca2", "sha256": "ade05987b0c34e9daa87c066a84fff4513f9a7ea02a2a96088d6900000fd71dc" }, "downloads": -1, "filename": "weblocust-1.0.3.tar.gz", "has_sig": false, "md5_digest": "a3c48e7750d224cd126dcd496e7aaca2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 751397, "upload_time": "2016-11-01T12:16:36", "url": "https://files.pythonhosted.org/packages/09/17/25d79f7c186705032ec79ee8f2d7ecb000626a14891e73874234231da64a/weblocust-1.0.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "a3c48e7750d224cd126dcd496e7aaca2", "sha256": "ade05987b0c34e9daa87c066a84fff4513f9a7ea02a2a96088d6900000fd71dc" }, "downloads": -1, "filename": "weblocust-1.0.3.tar.gz", "has_sig": false, "md5_digest": "a3c48e7750d224cd126dcd496e7aaca2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 751397, "upload_time": "2016-11-01T12:16:36", "url": "https://files.pythonhosted.org/packages/09/17/25d79f7c186705032ec79ee8f2d7ecb000626a14891e73874234231da64a/weblocust-1.0.3.tar.gz" } ] }