{ "info": { "author": "Niels Serup", "author_email": "ns@metanohi.org", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "Intended Audience :: End Users/Desktop", "License :: DFSG approved", "License :: OSI Approved :: GNU General Public License (GPL)", "Operating System :: OS Independent", "Programming Language :: Python", "Topic :: Internet :: WWW/HTTP :: Indexing/Search", "Topic :: Software Development :: Libraries", "Topic :: Utilities" ], "description": "========\nnoncrawl\n========\n\nnoncrawl is a crawler that saves only links. It crawls the web but\ndoes not attempt to do everything. Instead, its only purpose is to\nrecursively check sites for links to other sites, which are then also\nchecked for links to other sites, etc. So, if site Y links to site X,\nthat piece of information is saved, and if site X has not been checked\nyet, it will be crawled just like site Y was. For this to work, one\nmust specify one or more startpages. By default, noncrawl will attempt\nto crawl several sites simultaneously using threading, but this can be\ndisabled. It is also possible to set a limit to the number of threads.\n\nLicense\n=======\nnoncrawl is free software under the terms of the GNU General Public\nLicense version 3 (or any later version). The author of noncrawl is\nNiels Serup, contactable at ns@metanohi.org. This is version 0.1 of\nthe program.\n\nExternal libraries included with noncrawl are GPL-compatible.\n\nInstalling\n==========\nWay #1\n------\nJust run this (requires that you have python-setuptools installed)::\n\n $ sudo easy_install noncrawl\n\nWay #2\n------\nGet newest version of noncrawl at\nhttp://metanohi.org/projects/noncrawl/ or at\nhttp://pypi.python.org/pypi/noncrawl\n\nExtract the downloaded file and run this in a terminal::\n\n $ sudo python setup.py install\n\n\nDependencies\n============\nnoncrawl has no real dependencies not included in a default Python\ninstall. 
Python 2.5+ is probably required, though.\n\nOptional extras\n---------------\nIf present, noncrawl will use these Python modules:\n\n``htmlentitiesdecode``\n Web address: http://pypi.python.org/pypi/htmlentitiesdecode/\n $ sudo easy_install htmlentitiesdecode\n\n(A copy of this module is included in the noncrawl distribution, so\nyou'll be fine without it)\n\n``setproctitle``\n Web address: http://pypi.python.org/pypi/setproctitle/\n $ sudo easy_install setproctitle\n\n``termcolor`` (recommended)\n Web address: http://pypi.python.org/pypi/termcolor\n $ sudo easy_install termcolor\n\n\nRunning\n=======\nnoncrawl consists of two parts: the crawler and the parser. The\ncrawler must be accessed using a command-line utility called\n``noncrawler``. Extracting information from projects can be done\neither on the command-line using the ``noncrawlget`` script or by\nimporting the ``noncrawl.parser`` module in a Python program.\n\n``noncrawler``\n--------------\n``noncrawler`` can be run like this::\n\n $ noncrawler [options] startpages\n\nnoncrawler has several options. Run ``noncrawler --help`` to see a\nlist of them. When creating a new noncrawl project, noncrawler will\ncreate a directory in which all data will be saved. All projects can\nbe resumed if they have been saved properly (which should always\nhappen). White-and-black-listing is supported using line-separated\nregular expressions with keywords. The syntax of these expressions\nwill be described in a moment.\n\n``noncrawlget``\n---------------\n``noncrawlget`` can be run like this::\n\n $ noncrawlget [options] expression\n\nThe program then looks for entries that match the expression. 
The\nsyntax of these expressions is explained in the next subsection.\n\nExpressions\n-----------\nThe expressions used by noncrawl consist of operator-separated\ntwo-word-groups consisting of one keyword and one Python regular\nexpression or one string with UNIX-style wildcards prefixed with an\n'*', with everything optionally prefixed with a negating character.\n\nAn expression looks like this:\n[y|n] (filter regex|wildcards [operator])+\n\n\"y\" or \"n\" specifies whether to accept the result of a match or\nnot. If there is a match between the regex/wildcards and a string,\nusing \"n\" negates the return value. It is optional to set this\nkeyword, and it defaults to \"y\", meaning that results are not\nmodified.\n\nThe filter in each group determines which part of a string is tested\nfor a match. \"url\" means testing the url unchanged, \"domain\" means\nextracting the domain name from the url and testing that.\n\nRegular expressions can be studied in the Python documentation at\nhttp://docs.python.org/library/re.html\n\nStrings with wildcards should be parsable by the Python fnmatch\nmodule, documented at http://docs.python.org/library/fnmatch.html\n\nOperators can be either &&, meaning logical AND, or ||, meaning\nlogical OR.\n\nExpressions beginning with a '#' character are ignored completely.\n\nNote that black-and-white lists prioritize non-negating\nexpressions. 
That is, specifying an expression that blacklists all\nurls in existence doesn't overrule an expression that whitelists\nsomething.\n\nExamples\n~~~~~~~~\nThe following expressions exemplify what is possible::\n\n # Disallow everything using the wildcard '*' (prefixed by another\n # '*' because it's not a regular expression)\n n url **\n\n # Disallow search pages because of their dynamic nature\n n url .+?\\?.*?q=.*\n\n # Still disallow them, but only on one site\n n url .+?\\?.*?q=.* && domain example.com\n\n # Allow urls containing the string \"examples\" on example.com, or\n # something similar on Wikipedia.\n domain example.com && url **examples* || domain wikipedia.org && url **wiki*\n\n # Allow all example.* domains except for .org\n domain .*?example\\.(?!org)\n\nnoncrawl comes with a base inclusion-exclusion list that it uses by\ndefault. For more examples, see the list in the file named\n\"whiteblacklist.py\" of this distribution.\n\n\nDeveloping\n==========\nnoncrawl uses Git for branches. To get the latest branch, get it from\ngitorious.org like this::\n\n $ git clone git://gitorious.org/noncrawl/noncrawl.git\n\nnoncrawl is written in Python.\n\n\nThe logo\n========\nThe logo of noncrawl, found in the \"logo\" directory, is available\nunder the terms of the Creative Commons Attribution-ShareAlike 3.0 (or\nany later version) Unported license. A copy of this license is\navailable at http://creativecommons.org/licenses/by-sa/3.0/\n\n\nThis document\n=============\nCopyright (C) 2010 Niels Serup\n\nCopying and distribution of this file, with or without modification,\nare permitted in any medium without royalty provided the copyright\nnotice and this notice are preserved. 
This file is offered as-is,\nwithout any warranty.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://metanohi.org/projects/noncrawl/", "keywords": null, "license": "LICENSE.txt", "maintainer": null, "maintainer_email": null, "name": "noncrawl", "package_url": "https://pypi.org/project/noncrawl/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/noncrawl/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://metanohi.org/projects/noncrawl/" }, "release_url": "https://pypi.org/project/noncrawl/0.1/", "requires_dist": null, "requires_python": null, "summary": "A web crawler creating link soups", "version": "0.1" }, "last_serial": 795486, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "4ad58c66b2c3fd145ea90b884c992e17", "sha256": "61692e1819d0a0b2257e9625eb14de654544231e58a8b3fe1f5132a89dbcc62e" }, "downloads": -1, "filename": "noncrawl-0.1.tar.gz", "has_sig": false, "md5_digest": "4ad58c66b2c3fd145ea90b884c992e17", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 69982, "upload_time": "2010-08-03T16:37:03", "url": "https://files.pythonhosted.org/packages/bd/88/260985ece6032d80cc9a7f4d9cfe14cd316542b198c3f87ad3f839ebddba/noncrawl-0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4ad58c66b2c3fd145ea90b884c992e17", "sha256": "61692e1819d0a0b2257e9625eb14de654544231e58a8b3fe1f5132a89dbcc62e" }, "downloads": -1, "filename": "noncrawl-0.1.tar.gz", "has_sig": false, "md5_digest": "4ad58c66b2c3fd145ea90b884c992e17", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 69982, "upload_time": "2010-08-03T16:37:03", "url": "https://files.pythonhosted.org/packages/bd/88/260985ece6032d80cc9a7f4d9cfe14cd316542b198c3f87ad3f839ebddba/noncrawl-0.1.tar.gz" } ] }