{ "info": { "author": "Greg Lindahl and others", "author_email": "lindahl@pbm.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Environment :: Console", "Intended Audience :: Developers", "Intended Audience :: Information Technology", "License :: OSI Approved :: Apache Software License", "Natural Language :: English", "Programming Language :: Python", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3 :: Only" ], "description": "CoCrawler\n=========\n\n|Build Status| |Coverage Status| |Apache License 2.0|\n\nCoCrawler is a versatile web crawler built using modern tools and\nconcurrency.\n\nCrawling the web can be easy or hard, depending upon the details. Mature\ncrawlers like Nutch and Heritrix work great in many situations, and fall\nshort in others. Some of the most demanding crawl situations include\nopen-ended crawling of the whole web.\n\nThe object of this project is to create a modular crawler with pluggable\nmodules, capable of working well for a large variety of crawl tasks. The\ncore of the crawler is written in Python 3.5+ using coroutines.\n\nStatus\n------\n\nCoCrawler is pre-release, with major restructuring going on. It is\ncurrently able to crawl at around 170 megabits / 170 pages/sec on a 4\ncore machine.\n\nScreenshot: |Screenshot|\n\nInstalling\n----------\n\nWe recommend that you use pyenv, because (1) CoCrawler requires Python\n3.5+, and (2) requirements.txt specifies exact module versions.\n\n::\n\n git clone https://github.com/cocrawler/cocrawler.git\n cd cocrawler\n make init # will install requirements using pip\n make pytest\n make test_coverage\n\nPluggable Modules\n-----------------\n\nPluggable modules make policy decisions, and use utility routines to\nkeep policy modules short and sweet.\n\nAn additional set of pluggable modules provide support for a variety of\ndatabases. 
These databases are mostly used to orchestrate the\ncooperation of multiple crawl processes, enabling the horizontal\nscalability of the crawler over many cores and many nodes.\n\nCrawled web assets are intended to be stored as WARC files, although\nthis interface should also be pluggable.\n\nRanking\n-------\n\nEveryone knows that ranking is extremely important to search queries,\nbut it's also important to crawling. Crawling the most important stuff\nis one of the best ways to avoid crawling too much webspam, soft 404s,\nand crawler trap pages.\n\nSEO is a multi-billion-dollar industry created to game search engine\nranking, and any crawl of a wide swath of the web is going to run into\npoor-quality content attempting to appear to have high quality. There's\nlittle chance that CoCrawler's algorithms will beat the most\nsophisticated SEO techniques, but a little ranking goes a long way.\n\nCredits\n-------\n\nCoCrawler draws on ideas from the Python 3.4 code in \"500 Lines or\nLess\", which can be found at https://github.com/aosabook/500lines. It is\nalso heavily influenced by the experiences that Greg acquired while\nworking at blekko and the Internet Archive.\n\nLicense\n-------\n\nApache 2.0\n\n.. |Build Status| image:: https://travis-ci.org/cocrawler/cocrawler.svg?branch=master\n :target: https://travis-ci.org/cocrawler/cocrawler\n.. |Coverage Status| image:: https://coveralls.io/repos/github/cocrawler/cocrawler/badge.svg?branch=master\n :target: https://coveralls.io/github/cocrawler/cocrawler?branch=master\n.. |Apache License 2.0| image:: https://img.shields.io/github/license/cocrawler/cocrawler.svg\n :target: LICENSE\n.. 
|Screenshot| image:: https://cloud.githubusercontent.com/assets/2142266/19621581/92e83044-9849-11e6-825d-66b674cc59f0.png\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/cocrawler/cocrawler", "keywords": "", "license": "Apache 2.0", "maintainer": "", "maintainer_email": "", "name": "cocrawler", "package_url": "https://pypi.org/project/cocrawler/", "platform": "", "project_url": "https://pypi.org/project/cocrawler/", "project_urls": { "Homepage": "https://github.com/cocrawler/cocrawler" }, "release_url": "https://pypi.org/project/cocrawler/0.1.11/", "requires_dist": [ "uvloop", "aiohttp", "aiodns", "pyyaml", "cchardet", "surt", "reppy", "robotexclusionrulesparser", "cachetools", "filemagic", "tldextract", "sortedcontainers", "sortedcollections", "psutil", "hdrhistogram", "beautifulsoup4", "lxml", "extensions", "warcio", "geoip2", "objgraph", "brotlipy", "setuptools-scm" ], "requires_python": "", "summary": "A modern web crawler framework for Python", "version": "0.1.11" }, "last_serial": 5869921, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "4451e810b5a8de4291081bd7f3e18f57", "sha256": "140fe10a59c4569230e56ebd235d3fc42e4a77fdca1e3e1758532cb6d8252ff8" }, "downloads": -1, "filename": "cocrawler-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "4451e810b5a8de4291081bd7f3e18f57", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 51447, "upload_time": "2017-02-02T00:08:27", "url": "https://files.pythonhosted.org/packages/7d/e9/c80a947936a7334cc468454818bc84110ea3133b96edb000943ae816d9c6/cocrawler-0.1.1-py3-none-any.whl" } ], "0.1.11": [ { "comment_text": "", "digests": { "md5": "1c81d77840163768f1be0d942a1ea8b7", "sha256": "bfb0587bbc7aa5f3db1f32cf5e0dcd32cef92d120b01ce77559b23e8c3153167" }, "downloads": -1, "filename": "cocrawler-0.1.11-py3-none-any.whl", "has_sig": false, 
"md5_digest": "1c81d77840163768f1be0d942a1ea8b7", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 84926, "upload_time": "2019-09-22T18:11:43", "url": "https://files.pythonhosted.org/packages/25/d6/c0fd80f33f883688caaf3fccda3afcecbda578505306531a80becab74f9a/cocrawler-0.1.11-py3-none-any.whl" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "557858501b8d06c2932ea9bd9080d6e4", "sha256": "24feb0ff780bca796e6f2476424eb6865559072b00100097211ef340804fef92" }, "downloads": -1, "filename": "cocrawler-0.1.4-py3-none-any.whl", "has_sig": false, "md5_digest": "557858501b8d06c2932ea9bd9080d6e4", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 60373, "upload_time": "2017-05-10T02:56:56", "url": "https://files.pythonhosted.org/packages/61/c2/177defdb729799ee8d4ec0b0d4bfb60afc4a700d48aeb3d8340ebd7e6e48/cocrawler-0.1.4-py3-none-any.whl" } ], "0.1.6": [ { "comment_text": "", "digests": { "md5": "8c4d4d03715fffbc9ae961b9a050c008", "sha256": "b759d9a43abd1eeb855f80fc0eb96f8f5b5eda2277e7a385d223a99eb156c4da" }, "downloads": -1, "filename": "cocrawler-0.1.6-py3-none-any.whl", "has_sig": false, "md5_digest": "8c4d4d03715fffbc9ae961b9a050c008", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 75213, "upload_time": "2018-05-18T23:28:57", "url": "https://files.pythonhosted.org/packages/57/46/7f82ecca5b3476e0ac6b6373315c6c8f23113a7f1321b6222735c7c74749/cocrawler-0.1.6-py3-none-any.whl" } ], "0.1.7": [ { "comment_text": "", "digests": { "md5": "cd8801bf803986b218ccf5daaf9ea1a1", "sha256": "a9918c1f38f18a2c0e16764044d8916004dab696e8a856adc64ba78696830bc3" }, "downloads": -1, "filename": "cocrawler-0.1.7-py3-none-any.whl", "has_sig": false, "md5_digest": "cd8801bf803986b218ccf5daaf9ea1a1", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 78167, "upload_time": "2018-07-06T04:10:53", "url": 
"https://files.pythonhosted.org/packages/94/75/d94df72c8aeb6bc158c21969af69a0b9d475f4a19a17712cb0c8717ab957/cocrawler-0.1.7-py3-none-any.whl" } ], "0.1.8": [ { "comment_text": "", "digests": { "md5": "94632a1d3821529a58d152b5aadb5a45", "sha256": "99ff6f371506fcc77b98bfa94f497a9537d8a59e882aa4eaae5d460274845881" }, "downloads": -1, "filename": "cocrawler-0.1.8-py3-none-any.whl", "has_sig": false, "md5_digest": "94632a1d3821529a58d152b5aadb5a45", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 84474, "upload_time": "2019-09-19T02:03:27", "url": "https://files.pythonhosted.org/packages/06/bb/86db1e174dbf19b3272e7e23d54e10eddaf13bd20b0e5814f5e6279f313b/cocrawler-0.1.8-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "1c81d77840163768f1be0d942a1ea8b7", "sha256": "bfb0587bbc7aa5f3db1f32cf5e0dcd32cef92d120b01ce77559b23e8c3153167" }, "downloads": -1, "filename": "cocrawler-0.1.11-py3-none-any.whl", "has_sig": false, "md5_digest": "1c81d77840163768f1be0d942a1ea8b7", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 84926, "upload_time": "2019-09-22T18:11:43", "url": "https://files.pythonhosted.org/packages/25/d6/c0fd80f33f883688caaf3fccda3afcecbda578505306531a80becab74f9a/cocrawler-0.1.11-py3-none-any.whl" } ] }