{ "info": { "author": "Constance de Quatrebarbes", "author_email": "4barbes@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Other Environment", "Intended Audience :: Developers", "License :: OSI Approved :: Apache Software License", "Operating System :: MacOS :: MacOS X", "Operating System :: POSIX", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Topic :: Internet", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Utilities" ], "description": "![http://www.cortext.net](http://www.cortext.net/IMG/siteon0.png)\n\n\nCrawtext\n===============================================\nCrawtext is a project of the Cortext Lab. It is independent of the **Cortext manager** platform but designed to interact with it.\nGet a free account and discover the tools you can use for your own research by registering at\n[Cortext](http://manager.cortext.net/).\n\n**Crawtext** is a tiny command-line crawler that lets you investigate and collect the web resources that match given keywords. It is useful for archiving the web around a specific theme, and the results can also be used with the Cortext manager to explore the relationships between websites on a given topic.\n\n\nAbout\n---------\nCrawtext is a tiny crawler that goes from page to page, collecting relevant articles given a few keywords.\n\nThe crawler needs:\n* a **query** to select pertinent pages, and\n* **starting urls** to collect data\n\nGiven a list of urls:\n1. The robot collects the article for each url.\n2. It searches for the keywords inside the text extracted from the article.\n=> If the keywords are present in the page, it stores the content of the page, and\n3. 
the links inside the page are added to the list of urls to process next.\n\n\n\n\nInstallation\n------------\n\n\nTo install crawtext, clone it from the repo:\n\n```\n$ git clone git@github.com:cortext/crawtext.git\n$ cd crawtext\n$ python setup.py install\n```\n\n\nThen you can automatically install all the dependencies using pip\n(all dependencies are available through pip):\n\n```\n$ pip install -r dependencies.txt\n```\n\n\n\nYou *must* have MongoDB installed.\n\nTo install it:\n* On Debian, add the repository line to /etc/apt/sources.list, then install the package:\n\n```\n# Line to add to /etc/apt/sources.list:\ndeb http://downloads-distro.mongodb.org/repo/debian-sysvinit dist 10gen\n# Then install the package:\n$ sudo apt-get install mongodb-10gen\n```\n\n\n* On OSX, install it with brew:\n\n```\n$ brew install mongodb\n```\n\n\n\n\nGetting help\n====\n\nCrawtext is a simple command-line module that crawls the web given a query.\nThis interface offers a full set of options to set up a project.\nIf you need help with the shell command, type the following to see all the options:\n\n```\npython crawtext.py --help\n```\n\n\nYou can also open a pull request at http://github.com/cortext/crawtextV2/;\nwe will be happy to help with any configuration problem or feature request.\n\n\nGetting started\n======\n\nCrawl job\n-----\n* Create a new project:\n\n```\npython crawtext.py pesticides\n```\n\n\n* Add a query:\n\n```\npython crawtext.py pesticides -q \"pesticides AND DDT\"\n```\n(The query supports the AND OR NOT * ? 
\" operators)\n\n* Add new seeds by using the search engine option:\n\n\n```\npython crawtext.py pesticides -k set \"YOUR API KEY\"\n```\n\nSee how to get your [BING API key](https://datamarket.azure.com/dataset/bing/search).\nMore options for adding urls are described in Advanced parameters for crawl.\n\n\n* Launch the crawl:\n\n```\npython crawtext.py pesticides start\n```\n\nThe crawl is limited to 20,000 results.\n\n* See how it's running:\n\n```\npython crawtext.py pesticides report\n```\n\n\n* Export results:\n\nTo a json file:\n\n```\npython crawtext.py pesticides export\n```\n\nIf you want a csv:\n\n```\npython crawtext.py pesticides export -f csv\n```\n\nResults and reports are stored in /pesticides/\n\n\nAdvanced usage\n====\nA project is defined by its name; the results are stored in a mongo database with the same name.\n\nA project is a set of jobs, for example:\n\n* Project \"pesticides\" is composed of a crawl, a report, and an export\n\n* Project \"www.lemonde.fr\" is composed of an archive and a report\n\n**There are 2 main job types:**\n\n- **Crawl**:\n\nCrawl the web with a given query and a set of seeds\n\n- **Archive**:\n\nCrawl an entire website given a url\n\n**And 3 optional jobs that help manage the main jobs:**\n\n- **Export**:\n\nExport the results, sources and logs of the project in json or csv format. Datasets are stored in result/name_of_your_project\n\n- **Report**:\n\nGive stats on the current process and on the results stored in the database. Reports are stored in report/name_of_your_project\n\n- **Delete**:\n\nDelete the entire project. 
An export is automatically done when the project is deleted.\n\n\nManage a project\n====\n\n* Consult a project:\n\n```\ncrawtext.py pesticides\n```\n\n\n* Consult an archive:\n\n```\ncrawtext.py http://www.lemonde.fr\n```\n\n\n* Consult your projects:\n\n```\ncrawtext.py vous@cortext.net\n```\n\n\n* Get a report:\n\n```\ncrawtext.py report pesticides\n```\n\n\n* Get an export:\n\n```\ncrawtext.py export pesticides\n```\n\n\n* Delete a project:\n\n```\ncrawtext.py delete pesticides\n```\n\n\n* Run a project:\n\n```\ncrawtext.py start pesticides\n```\n\n\n* Stop the current execution of a project:\n\n```\ncrawtext.py stop pesticides\n```\n\n\n* Repeat the project:\n\n```\ncrawtext.py pesticides -r (year|month|week|day)\n```\n\n\n* Define the user of the project:\n\n```\ncrawtext.py pesticides -u vous@cortext.net\n```\n\n\n\nAdvanced parameters for crawl\n====\n\nA crawl needs 2 parameters to be active:\n- a **query**\n- one or several **seeds** (urls to start the crawl)\n\nThere are several ways to add seeds:\n- manually (add),\n- by configuring a file or an API key for the next run (set),\n- by collecting them from a file or an API key and adding them to the sources immediately (append)\n\n\nQuery\n----\n\nTo define a query (the query supports the AND OR NOT * ? operators):\n\n```\ncrawtext.py pesticides -q \"pesticide? 
AND DDT\"\n```\n\n\n\nSources\n----\n\n* Define sources from a file:\n\n```\ncrawtext.py pesticides -s set sources.txt\n```\n\n\n\n* Add sources from a file:\n\n```\ncrawtext.py pesticides -s append sources.txt\n```\n\n\n\n* Add sources from a url:\n\n```\ncrawtext.py pesticides -s add http://www.latribune.fr\n```\n\n\n* Define sources from Bing search results:\n\n```\ncrawtext.py pesticides -k set 12237675647\n```\n\n\n\n* Add sources from Bing search results:\n\n```\ncrawtext.py pesticides -k append 12237675647\n```\n\n\n\n* Expand the set of sources with previous results:\n\n```\ncrawtext.py pesticides -s expand\n```\n\n\n\n* Delete a seed:\n\n```\ncrawtext.py pesticides -s delete http://www.latribune.fr\n```\n\n\n\n* Delete every seed of the job:\n\n```\ncrawtext.py pesticides -s delete\n```\n\n\n\nArchive parameters (Not implemented yet):\n----\n\nAn archive job needs a url; you can also specify the extraction format (optional).\n\n* Consult or create a new archive project:\n\n```\ncrawtext.py www.lemonde.fr\n```\n\n\n* Create an archive for a wiki:\n\n```\ncrawtext.py archive fr.wikipedia.org -f wiki\n```\n\n\nResults\n====\n\nThe results are stored in a mongo database named after your project.\nYou can export them using the export option:\n\n```\npython crawtext.py pesticides export\n```\n\n\nDatasets are stored as json and zip files, in 3 collections, in the dedicated directory ''results'':\n* results\n* sources\n* logs\n\nOptions are also available to choose the format and the collections.\n\nThe complete structure of the datasets can be found in:\n- sources_example.json\n- results_example.json\n- logs_example.json\n\n\nBug report\n-----\n* 1 outlinks empty [DONE]\n* 2 expand mode error [DONE]\n\nFeatures\n-----\n* Define recursion depth\n\nNext steps\n------\n* Run job in background\n* Send a mail after execution\n* Build a web interface\n* Activate 
Archive mode to crawl an entire website\n* YAML integration\n\nSources\n------\n\nYou can see the code [here](https://github.com/c24b/crawtextV2).\n\n- Special thanks to Xavier Grangier and his module [python-goose](https://github.com/grangier/python-goose), forked here for automatic article detection.\n\n\n\n\n\nCOMMON PROBLEMS\n----\n\n* Mongo Database:\n\nIf you force the program to shut down, you may get a database connection error such as:\n\n```\ncouldn't connect to server 127.0.0.1:27017 at src/mongo/shell/mongo.js:145\n```\n\n\n\nTo repair it, remove the mongod lock file and restart the service:\n\n```\nsudo rm /var/lib/mongodb/mongod.lock\n```\n\n\n```\nsudo service mongodb restart\n```\n\n\nIf that doesn't work, the index is corrupted and you have to repair it:\n\n```\nsudo mongod --repair\n```", "description_content_type": null, "docs_url": null, "download_url": null, "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/cortext/crawtext", "keywords": "web crawler,web scrapping", "license": "MIT", "maintainer": null, "maintainer_email": null, "name": "crawtext", "package_url": "https://pypi.org/project/crawtext/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/crawtext/", "project_urls": { "Homepage": "https://github.com/cortext/crawtext" }, "release_url": "https://pypi.org/project/crawtext/4.1.1/", "requires_dist": [ "Pillow", "lxml", "cssselect", "jieba", "beautifulsoup", "nltk", "six", "pymongo", "argparse", "docopt", "tld", "wsgiref", "PyYAML", "MarkupSafe", "JINJA2" ], "requires_python": null, "summary": "Tiny WebCrawler in CLI", "version": "4.1.1" }, "last_serial": 1314255, "releases": { "4.1.0": [ { "comment_text": "", "digests": { "md5": "9206e59beabf1b56b68de944a7e9e04a", "sha256": "67e0ddd2b42eaa1a8bcbc569635910710861af558a4c857c052a18f4a9103dea" }, "downloads": -1, "filename": "crawtext-4.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "9206e59beabf1b56b68de944a7e9e04a", 
"packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 7617493, "upload_time": "2014-11-19T11:38:34", "url": "https://files.pythonhosted.org/packages/83/d0/a069ef6573d383fa2b42bd80e8b98ae04a506a0e65615b940542c95b0750/crawtext-4.1.0-py2.py3-none-any.whl" } ], "4.1.1": [ { "comment_text": "", "digests": { "md5": "7479c5775a991f74b78fa1472f3b87ac", "sha256": "046884b597264ab999bcfd8dd35f3d05c556a16a45105b1ed2013f226e8d9a7a" }, "downloads": -1, "filename": "crawtext-4.1.1-py2-none-any.whl", "has_sig": false, "md5_digest": "7479c5775a991f74b78fa1472f3b87ac", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 19836585, "upload_time": "2014-11-20T11:13:28", "url": "https://files.pythonhosted.org/packages/e7/27/f201397c4aa15f068c75c2d7ae6d9a469fb8586e6b7cdcccec9c700c305c/crawtext-4.1.1-py2-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "7479c5775a991f74b78fa1472f3b87ac", "sha256": "046884b597264ab999bcfd8dd35f3d05c556a16a45105b1ed2013f226e8d9a7a" }, "downloads": -1, "filename": "crawtext-4.1.1-py2-none-any.whl", "has_sig": false, "md5_digest": "7479c5775a991f74b78fa1472f3b87ac", "packagetype": "bdist_wheel", "python_version": "py2", "requires_python": null, "size": 19836585, "upload_time": "2014-11-20T11:13:28", "url": "https://files.pythonhosted.org/packages/e7/27/f201397c4aa15f068c75c2d7ae6d9a469fb8586e6b7cdcccec9c700c305c/crawtext-4.1.1-py2-none-any.whl" } ] }