{ "info": { "author": "Thykof", "author_email": "thykof@protonmail.ch", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "Operating System :: OS Independent", "Programming Language :: Python :: 3", "Topic :: Internet :: WWW/HTTP :: Indexing/Search" ], "description": "# Swiftea Crawler\n\n[![Build Status](https://travis-ci.org/Swiftea/Crawler.svg?branch=master)](https://travis-ci.org/Swiftea/Crawler)\n[![Coverage Status](https://coveralls.io/repos/github/Swiftea/Crawler/badge.svg?branch=master)](https://coveralls.io/github/Swiftea/Crawler?branch=master)\n[![Documentation Status](https://readthedocs.org/projects/crawler/badge/?version=master)](http://crawler.readthedocs.io/en/master/?badge=master)\n[![Code Health](https://landscape.io/github/Swiftea/Crawler/master/landscape.svg?style=flat)](https://landscape.io/github/Swiftea/Crawler/master)\n[![Requirements Status](https://requires.io/github/Swiftea/Crawler/requirements.svg?branch=master)](https://requires.io/github/Swiftea/Crawler/requirements/?branch=master)\n\n## Description\n\nSwiftea-Crawler is an open source web crawler for Swiftea search engine.\n\nCurrently, it can:\n - Visit websites\n - check robots.txt\n - search encoding\n - Parse them\n - extract data\n - title\n - description\n - ...\n - extract words\n - filter stopwords\n - Index them\n - in database\n - in inverted-index\n - Archive log files in a zip file\n\t- avoid duplicates (http and https)\n\nThe domain crawler focus on the links that belong to the given domain name.\nThe level option of the domain crawler defines how deep the crawl goes.\nFor example, the level 2 means the crawler will crawl all the links of the domain plus the links that all pages in this domain lead to.\n\nThe domain crawler can use a MongoDB database to store the inverted index.\n\n## Install and usage\n\n\n### Run crawler\n\nCreate `crawler-config.json` file and fill it:\n\n {\n \"DIR_DATA\": \"data\",\n\n \"DB_HOST\": \"\",\n \"DB_USER\": \"\",\n \"DB_PASSWORD\": \"\",\n \"DB_NAME\": \"\",\n \"TABLE_NAMES\": [\"website\", \"suggestion\"],\n \"DIR_INDEX\": \"ii/\",\n \"FTP_HOST\": \"\",\n \"FTP_USER\": \"\",\n \"FTP_PASSWORD\": \"\",\n \"FTP_PORT\": 21,\n \"FTP_DATA\": \"/www/data/\",\n \"FTP_INDEX\": \"/www/data/inverted_index\",\n\n \"HOST\": \"\",\n\n \"MONGODB_PASSWORD\": \"\",\n \"MONGODB_USER\": \"\",\n \"MONGODB_CON_STRING\": \"\"\n }\n\nThen:\n\n from crawler import main\n\n # infinite crawling:\n crawler = main(l1=50, l2=10, dir_data='data1')\n\n # domain crawling:\n crawler = main(url='http://example.example', level=0, target_level=1, dir_data='data1')\n crawler = main(url='http://some.thing', level=1, target_level=3, use_mongodb=True)\n\n crawler.start()\n\n### Setup\n\n virtualenv -p /usr/bin/python3 crawler-env\n source crawler-env/bin/activate\n pip install -r requirements.txt\n\n### Run tests\n\nUsing only pytest:\n\n python setup.py test\n\nWith coverage:\n\n coverage run setup.py test\n coverage report\n coverage html\n\n\n### Build documentation\n\nYou must install `python3-sphinx` package.\n\n cd docs\n make html\n\n### Run linter\n\nInstall `prospector`, then:\n\n prospector > prospector_output.json\n\n### Deploy\n\nCreate directories in ftp server:\n\n - /www/data/badwords\n - /www/data/stopwords\n - /www/data/inverted_index\n\nUpload the list of words: `/www/[type]/[lang].[type].txt`.\n\nCreate database with `sql/swiftea_mysql_db.sql`.\n\n\n## How it works?\n\nIf the files below don't exist, the crawler will download them from our server:\n\n- data/stopwords/fr.stopwords.txt\n- data/stopwords/en.stopwords.txt\n- data/badwords/fr.badwords.txt\n- data/badwords/en.badwords.txt\n\nIn `crawler-config.json`, if FTP_INDEX is `\"\"`, then the inverted index will be save in `DIR_INDEX` but not send on the FTP server.\n\n### Database:\nThe DatabaseSwiftea object can:\n - send documents\n - get the id of a document by the url\n - delete a document\n - select the suggestions\n - check if a doc exists\n - check for http and https duplicate\n\n## Limits\n\nWhen stoping the crawler (ctrl+V), it will not restart with the interupted url.\n\n## Version\n\nCurrent version is 1.1.3\n\n## Tech\n\nSwiftea's Crawler uses a number of open source projects to work properly:\n\n- [Python 3](https://www.python.org/)\n - [Reppy](https://github.com/seomoz/reppy)\n - [PyMySQL](https://github.com/PyMySQL/PyMySQL/)\n - [Requests](https://github.com/kennethreitz/requests)\n\n\n## Contributing\n\nWant to contribute? Great!\n\nFork the repository. Then, run:\n\n git clone git@github.com:/Crawler.git\n cd Crawler\n\nThen, do your work and commit your changes. Finally, make a pull request.\n\n### Commit conventions:\n\n#### General\n - Use the present tense\n - Use the imperative mood\n\n#### Examples\n - Add something: \"Add feature ...\"\n - Update: \"Update ...\"\n - Improve something: \"Improve ...\"\n - Change something: \"Change ...\"\n - Fix something: \"Fix ...\"\n - Fix an issue: \"Fix #123456\" or \"Close #123456\"\n\nLicense\n----\n\nGNU GENERAL PUBLIC LICENSE (v3)\n\n**Free Software, Hell Yeah!**\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Swiftea/Crawler", "keywords": "crawler swiftea", "license": "GNU GPL v3", "maintainer": "", "maintainer_email": "", "name": "swiftea-crawler", "package_url": "https://pypi.org/project/swiftea-crawler/", "platform": "", "project_url": "https://pypi.org/project/swiftea-crawler/", "project_urls": { "Homepage": "https://github.com/Swiftea/Crawler" }, "release_url": "https://pypi.org/project/swiftea-crawler/1.1.3/", "requires_dist": [ "PyMySQL (==0.9.3)", "coverage (==4.5.3)", "dnspython (==1.16.0)", "paramiko (==2.5.0)", "pymodm (==0.4.1)", "pymongo (==3.8.0)", "pytest (==5.0.1)", "reppy (==0.4.13)", "requests", "sphinx-rtd-theme (==0.4.3)", "coverage ; extra == 'testing'", "pytest ; extra == 'testing'" ], "requires_python": "", "summary": "Swiftea's Open Source Web Crawler", "version": "1.1.3" }, "last_serial": 5673059, "releases": { "1.1.2": [ { "comment_text": "", "digests": { "md5": "136c609d336ea16ea47296199dd4bdf0", "sha256": "b9e8f8a0b47b6c0c7833ec677725cc72635890d721b2f0f3fbefe4b06a5207d5" }, "downloads": -1, "filename": "swiftea_crawler-1.1.2-py3-none-any.whl", "has_sig": false, "md5_digest": "136c609d336ea16ea47296199dd4bdf0", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 60379, "upload_time": "2019-08-03T14:47:46", "url": "https://files.pythonhosted.org/packages/a8/6b/53bd78db1c17d678abcffaec6dfd7405b3b7d03d1c05593eb23e94c309bf/swiftea_crawler-1.1.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "4c54c2be7e5723f830f3f92199fba08b", "sha256": "76a2d020dd03538af8fc012177ee0ca7e3e2a641968f8bf494e685508678c42c" }, "downloads": -1, "filename": "swiftea-crawler-1.1.2.tar.gz", "has_sig": false, "md5_digest": "4c54c2be7e5723f830f3f92199fba08b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 37223, "upload_time": "2019-08-03T14:47:48", "url": "https://files.pythonhosted.org/packages/be/4b/d1bde1aa12e57f92dae7a5c82dedffec337d06433f0e213b6da7d10fcba5/swiftea-crawler-1.1.2.tar.gz" } ], "1.1.3": [ { "comment_text": "", "digests": { "md5": "625ece7f7bce62f30303633731dac8de", "sha256": "2b9d31685b562bab6d5c7480e99ed5e9bbc78dd8148909f78343e56f61066e78" }, "downloads": -1, "filename": "swiftea_crawler-1.1.3-py3-none-any.whl", "has_sig": false, "md5_digest": "625ece7f7bce62f30303633731dac8de", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 60815, "upload_time": "2019-08-13T18:28:37", "url": "https://files.pythonhosted.org/packages/3e/7f/627132b46cd676be0c2a4738c2047c8231cc18c80baec6256ef4a5843eb4/swiftea_crawler-1.1.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c2107dbbedfe2d43e91ba4956e82ac6c", "sha256": "f3335c2ee466292f61a76888be8d741b7750fdc5d9d3605906db46c9a229a8bb" }, "downloads": -1, "filename": "swiftea-crawler-1.1.3.tar.gz", "has_sig": false, "md5_digest": "c2107dbbedfe2d43e91ba4956e82ac6c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 37837, "upload_time": "2019-08-13T18:28:39", "url": "https://files.pythonhosted.org/packages/a4/e2/77b6d78a07a3b3a292f6989413eb17932e2a34922787217081a3878110cf/swiftea-crawler-1.1.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "625ece7f7bce62f30303633731dac8de", "sha256": "2b9d31685b562bab6d5c7480e99ed5e9bbc78dd8148909f78343e56f61066e78" }, "downloads": -1, "filename": "swiftea_crawler-1.1.3-py3-none-any.whl", "has_sig": false, "md5_digest": "625ece7f7bce62f30303633731dac8de", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 60815, "upload_time": "2019-08-13T18:28:37", "url": "https://files.pythonhosted.org/packages/3e/7f/627132b46cd676be0c2a4738c2047c8231cc18c80baec6256ef4a5843eb4/swiftea_crawler-1.1.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c2107dbbedfe2d43e91ba4956e82ac6c", "sha256": "f3335c2ee466292f61a76888be8d741b7750fdc5d9d3605906db46c9a229a8bb" }, "downloads": -1, "filename": "swiftea-crawler-1.1.3.tar.gz", "has_sig": false, "md5_digest": "c2107dbbedfe2d43e91ba4956e82ac6c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 37837, "upload_time": "2019-08-13T18:28:39", "url": "https://files.pythonhosted.org/packages/a4/e2/77b6d78a07a3b3a292f6989413eb17932e2a34922787217081a3878110cf/swiftea-crawler-1.1.3.tar.gz" } ] }