{
    "info": {
        "author": "Constance de Quatrebarbes",
        "author_email": "4barbes@gmail.com",
        "bugtrack_url": null,
        "classifiers": [
            "Development Status :: 4 - Beta",
            "Environment :: Other Environment",
            "Intended Audience :: Developers",
            "License :: OSI Approved :: Apache Software License",
            "Operating System :: MacOS :: MacOS X",
            "Operating System :: POSIX",
            "Programming Language :: Python",
            "Programming Language :: Python :: 2",
            "Programming Language :: Python :: 2.6",
            "Programming Language :: Python :: 2.7",
            "Topic :: Internet",
            "Topic :: Software Development :: Libraries :: Python Modules",
            "Topic :: Utilities"
        ],
        "description": "![http://www.cortext.net](http://www.cortext.net/IMG/siteon0.png)\n\n\nCrawtext\n===============================================\nCrawtext is a project of the Cortext Lab. It is independant from the **Cortext manager** plateform but deisgned to interact with it.\nGet a free account and discover the tools you can use for your own research by registering at\n![Cortext](http://manager.cortext.net/)\n\n**Crawtext** is a tiny crawler in command line that let you investigate and collect the ressources of the web that match the special keywords. Usefull for archiving the web around a special theme, results could also be used with the cortext manager to explore the relationships between websites on a special topic.\n\n\nAbout\n---------\nCrawtext  is a tiny crawler that goes form page to page colecting relevant article given a few keywords\n\nThe crawler needs:\n* a **query** to select pertinent pages \nand \n* **starting urls** to collect data \n\nGiven a list of url\n1. the robot will collect the article for each url\n2. It will search for the keywords inside the text extracted from the article. \n=> If the keywords are present in the page it stores the content of the page and\n3. 
The links inside the page are added to the list of pages to process next.\n\n\n\n\nInstallation\n------------\n\n\nTo install crawtext, clone it from the repo:\n\n```\n$ git clone git@github.com:cortext/crawtext.git\n$ cd crawtext\n$ python setup.py install\n```\n\n\nThen you can automatically install all the dependencies using pip\n(all dependencies are available through pip):\n\n```\n$ pip install -r dependencies.txt\n```\n\n\n\nYou *must* have MongoDB installed.\n\nTo install it:\n* On Debian, add the repository line below to /etc/apt/sources.list, then install the package:\n\n```\ndeb http://downloads-distro.mongodb.org/repo/debian-sysvinit dist 10gen\n$ sudo apt-get install mongodb-10gen\n```\n\n\n* On OSX, install it with brew:\n\n```\n$ brew install mongodb\n```\n\n\n\n\nGetting help\n====\n\nCrawtext is a simple command-line module to crawl the web given a query.\nThis interface offers a full set of options to set up a project.\nTo see all the options of the shell command, type:\n\n```\npython crawtext.py --help\n```\n\n\nYou can also open a pull request at http://github.com/cortext/crawtextV2/;\nwe will be happy to help with any configuration problem or feature request.\n\n\nGetting started\n======\n\nCrawl job\n-----\n* Create a new project:\n\n```\npython crawtext.py pesticides\n```\n\n\n* Add a query:\n\n```\npython crawtext.py pesticides -q \"pesticides AND DDT\"\n```\n(Queries support AND OR NOT * ? 
\" operators)\n\n* Add new seeds by using the search engine option:\n\n\n```\npython crawtext.py pesticides -k set \"YOUR API KEY\"\n```\n\nSee how to get your ![BING API key](https://datamarket.azure.com/dataset/bing/search)\nMore option are available to add urls see Advanced  parameters for crawl\n\n\n* Launch the crawl:\n\n``` \npython crawtext.py pesticides start\n```\n\nThe crawl is limited to 20.000 results\t\n* See how it's running:\n\n``` \npython crawtetx.py pesticides report\n```\n\n\n* Export results:\n\nin json file\n``` \npython crawtext.py pesticides export\n```\n\nIf you want a csv:\n\n```\npython crawtext.py pesticides export -f csv\n```\n\nResults and report are stored in /pesticides/\t\n\n\nAdvanced usage \n====\nA project is define by its name, the results are stored in a mongo database with this given name.\n\nA project is a set of jobs:\nfor example:\n\n* Project \"pesticides\" is composed of a crawl, a report, and an export\n\n* Project \"www.lemonde.fr\" is composed of an archive and a report\n\n**You have 2 main jobs type:**\n\n- **Crawl**:\n\nCrawl the web with a given query and a set of seeds\n\n- **Archive**:\n\nCrawl the entire website given an url\n\n**And 3 optionnal jobs, as facilities to manage the main jobs:**\n\n- **Export**:\n\nExport in json/csv format results, sources and logs of the project. Datasets are stored in result/name_of_your_project\n\n- **Report**:\n\nGive stats on the current process and results stored in the database. Reports are stored in report/name_of_your_project\n\n- **Delete**:\n\nDelete the entire project. 
An export is automatically done when the project is deleted.\n\n\nManage a project\n====\n\n* Consult a project:\n\n```\ncrawtext.py pesticides\n```\n\n\n* Consult an archive:\n\n```\ncrawtext.py http://www.lemonde.fr\n```\n\n\n* Consult your projects:\n\n```\ncrawtext.py vous@cortext.net\n```\n\n\n* Get a report:\n\n```\ncrawtext.py report pesticides\n```\n\n\n* Get an export:\n\n```\ncrawtext.py export pesticides\n```\n\n\n* Delete a project:\n\n```\ncrawtext.py delete pesticides\n```\n\n\n* Run a project:\n\n```\ncrawtext.py start pesticides\n```\n\n\n* Stop the current execution of a project:\n\n```\ncrawtext.py stop pesticides\n```\n\n\n* Repeat the project:\n\n```\ncrawtext.py pesticides -r (year|month|week|day)\n```\n\n\n* Define the user of the project:\n\n```\ncrawtext.py pesticides -u vous@cortext.net\n```\n\n\n\nAdvanced parameters for crawl\n====\n\nA crawl needs 2 parameters to be active:\n- a **query**\n- one or several **seeds** (urls to start the crawl)\n\nThere are several ways to add seeds:\n- manually (add),\n- by configuring a file or key for the next run (set),\n- by collecting them (from a file or key) and adding them immediately to the sources (append)\n\n\nQuery\n----\n\nTo define a query (queries support the AND OR NOT * ? operators):\n\n```\ncrawtext.py pesticides -q \"pesticide? 
AND DDT\"\n```\n\n\n\nSources\n----\n* define sources from a file:\n\n```\ncrawtext.py pesticides -s set sources.txt\n```\n\n\n\n* add sources from a file:\n\n```\ncrawtext.py pesticides -s append sources.txt\n```\n\n\n\n* add sources from a url:\n\n```\ncrawtext.py pesticides -s add http://www.latribune.fr\n```\n\n\n* define sources from Bing search results:\n\n```\ncrawtext.py pesticides -k set 12237675647\n```\n\n\n\n* add sources from Bing search results:\n\n```\ncrawtext.py pesticides -k append 12237675647\n```\n\n\n\n* expand the set of sources with previous results:\n\n```\ncrawtext.py pesticides -s expand\n```\n\n\n\n* delete a seed:\n\n```\ncrawtext.py pesticides -s delete http://www.latribune.fr\n```\n\n\n\n* delete every seed of the job:\n\n```\ncrawtext.py pesticides -s delete\n```\n\n\n\nArchive parameters (not implemented yet)\n----\n\nAn archive job needs a url; you can also specify the extraction format (optional).\n\n* consult or create a new archive project:\n\n```\ncrawtext.py www.lemonde.fr\n```\n\n\n* create an archive for a wiki:\n\n```\ncrawtext.py archive fr.wikipedia.org -f wiki\n```\n\n\nResults\n====\n\nThe results are stored in a mongo database named after your project.\nYou can export them using the export option:\n\n```\npython crawtext.py pesticides export\n```\n\n\nDatasets are stored as json and zip in 3 collections in the dedicated ''results'' directory:\n* results\n* sources\n* logs\n\nOptions are also available to select the format and the collections.\n\nThe complete structure of the datasets can be found in:\n- sources_example.json\n- results_example.json\n- logs_example.json\n\n\nBug report\n-----\n* 1 outlinks empty [DONE]\n* 2 expand mode error [DONE]\n\nFeatures\n-----\n* Define recursion depth\n\nNext steps\n------\n* Run job in background\n* Send a mail after execution\n* Build a web interface\n* Activate 
Archive mode to crawl an entire website\n* YAML integration\n\nSources\n------\n\nYou can see the code [here](https://github.com/c24b/crawtextV2)\n\n- Special thanks to Xavier Grangier and his module [python-goose](https://github.com/grangier/python-goose), forked for automatic article detection.\n\n\n\n\n\nCOMMON PROBLEMS\n----\n\n* Mongo Database:\n\nSometimes, if you force your program to shut down, you may get a database connection error such as:\n\n```\ncouldn't connect to server 127.0.0.1:27017 at src/mongo/shell/mongo.js:145\n```\n\n\n\nTo repair it, remove the mongod lock file and restart the service:\n\n```\nsudo rm /var/lib/mongodb/mongod.lock\n```\n\n\n```\nsudo service mongodb restart\n```\n\n\nIf it still doesn't work, the index is corrupted and you have to repair it:\n\n```\nsudo mongod --repair\n```",
        "description_content_type": null,
        "docs_url": null,
        "download_url": null,
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/cortext/crawtext",
        "keywords": "web crawler,web scrapping",
        "license": "MIT",
        "maintainer": null,
        "maintainer_email": null,
        "name": "crawtext",
        "package_url": "https://pypi.org/project/crawtext/",
        "platform": "UNKNOWN",
        "project_url": "https://pypi.org/project/crawtext/",
        "project_urls": {
            "Homepage": "https://github.com/cortext/crawtext"
        },
        "release_url": "https://pypi.org/project/crawtext/4.1.1/",
        "requires_dist": [
            "Pillow",
            "lxml",
            "cssselect",
            "jieba",
            "beautifulsoup",
            "nltk",
            "six",
            "pymongo",
            "argparse",
            "docopt",
            "tld",
            "wsgiref",
            "PyYAML",
            "MarkupSafe",
            "JINJA2"
        ],
        "requires_python": null,
        "summary": "Tiny WebCrawler in CLI",
        "version": "4.1.1"
    },
    "last_serial": 1314255,
    "releases": {
        "4.1.0": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "9206e59beabf1b56b68de944a7e9e04a",
                    "sha256": "67e0ddd2b42eaa1a8bcbc569635910710861af558a4c857c052a18f4a9103dea"
                },
                "downloads": -1,
                "filename": "crawtext-4.1.0-py2.py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "9206e59beabf1b56b68de944a7e9e04a",
                "packagetype": "bdist_wheel",
                "python_version": "py2.py3",
                "requires_python": null,
                "size": 7617493,
                "upload_time": "2014-11-19T11:38:34",
                "url": "https://files.pythonhosted.org/packages/83/d0/a069ef6573d383fa2b42bd80e8b98ae04a506a0e65615b940542c95b0750/crawtext-4.1.0-py2.py3-none-any.whl"
            }
        ],
        "4.1.1": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "7479c5775a991f74b78fa1472f3b87ac",
                    "sha256": "046884b597264ab999bcfd8dd35f3d05c556a16a45105b1ed2013f226e8d9a7a"
                },
                "downloads": -1,
                "filename": "crawtext-4.1.1-py2-none-any.whl",
                "has_sig": false,
                "md5_digest": "7479c5775a991f74b78fa1472f3b87ac",
                "packagetype": "bdist_wheel",
                "python_version": "py2",
                "requires_python": null,
                "size": 19836585,
                "upload_time": "2014-11-20T11:13:28",
                "url": "https://files.pythonhosted.org/packages/e7/27/f201397c4aa15f068c75c2d7ae6d9a469fb8586e6b7cdcccec9c700c305c/crawtext-4.1.1-py2-none-any.whl"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "7479c5775a991f74b78fa1472f3b87ac",
                "sha256": "046884b597264ab999bcfd8dd35f3d05c556a16a45105b1ed2013f226e8d9a7a"
            },
            "downloads": -1,
            "filename": "crawtext-4.1.1-py2-none-any.whl",
            "has_sig": false,
            "md5_digest": "7479c5775a991f74b78fa1472f3b87ac",
            "packagetype": "bdist_wheel",
            "python_version": "py2",
            "requires_python": null,
            "size": 19836585,
            "upload_time": "2014-11-20T11:13:28",
            "url": "https://files.pythonhosted.org/packages/e7/27/f201397c4aa15f068c75c2d7ae6d9a469fb8586e6b7cdcccec9c700c305c/crawtext-4.1.1-py2-none-any.whl"
        }
    ]
}