{ "info": { "author": "4teamwork AG", "author_email": "mailto:info@4teamwork.ch", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "License :: OSI Approved :: GNU General Public License (GPL)", "Programming Language :: Python", "Topic :: Software Development" ], "description": "ftw.crawler\n===========\n\nInstallation\n------------\n\nTo install ``ftw.crawler``, the easiest way is to create a buildout that\ncontains the configuration, pulls in the egg using ``zc.recipe.egg`` and\ncreates a script in the ``bin/`` directory that directly launches the crawler\nwith the respective configuration as an argument:\n\n- First, create a configuration file for the crawler. You can base your\n configuration on `ftw/crawler/tests/assets/basic_config.py `_ by copying\n it to your buildout and adapting it as needed.\n\n Make sure to configure at least the ``tika`` and ``solr`` URLs to point to\n the correct locations of the respective services, and to adapt the ``sites``\n list to your needs.\n\n- Create a buildout config that installs ``ftw.crawler`` using ``zc.recipe.egg``:\n\n ``crawler.cfg``\n\n\t.. code:: ini\n\n\t\t[buildout]\n\t\tparts +=\n\t\t crawler\n\t\t crawl-foo-org\n\n\t\t[crawler]\n\t\trecipe = zc.recipe.egg\n\t\teggs = ftw.crawler\n\n\n- Further define a buildout section that creates a ``bin/crawl-foo-org``\n script, which will call ``bin/crawl foo_org_config.py`` using absolute paths\n (for easier use from cron jobs):\n\n\t.. code:: ini\n\n\t\t[crawl-foo-org]\n\t\trecipe = collective.recipe.scriptgen\n\t\tcmd = ${buildout:bin-directory}/crawl\n\t\targuments =\n\t\t ${buildout:directory}/foo_org_config.py\n\t\t --tika http://localhost:9998/\n\t\t --solr http://localhost:8983/solr\n\n (The ``--tika`` and ``--solr`` command line arguments are optional, they\n can also be set in the configuration file. 
If given, the command line\n arguments take precedence over any parameters in the config file.)\n\n\n- Add a buildout config that downloads and configures a Tika JAXRS server:\n\n ``tika-server.cfg``\n\n\t.. code:: ini\n\n\t\t[buildout]\n\t\tparts +=\n\t\t supervisor\n\t\t tika-server-download\n\t\t tika-server\n\n\t\t[supervisor]\n\t\trecipe = collective.recipe.supervisor\n\t\tplugins =\n\t\t superlance\n\t\tport = 8091\n\t\tuser = supervisor\n\t\tpassword = admin\n\t\tprograms =\n\t\t 10 tika-server (stopasgroup=true) ${buildout:bin-directory}/tika-server true your_os_user\n\n\t\t[tika-server-download]\n\t\trecipe = hexagonit.recipe.download\n\t\turl = http://repo1.maven.org/maven2/org/apache/tika/tika-server/1.5/tika-server-1.5.jar\n\t\tmd5sum = 0f70548f233ead7c299bf7bc73bfec26\n\t\tdownload-only = true\n\t\tfilename = tika-server.jar\n\n\t\t[tika-server]\n\t\tport = 9998\n\t\trecipe = collective.recipe.scriptgen\n\t\tcmd = java\n\t\targuments = -jar ${tika-server-download:destination}/${tika-server-download:filename} --port ${:port}\n\n Modify ``your_os_user`` and the supervisor and Tika ports as needed.\n\n\n- Finally, add a `bootstrap.py `_\n and create the ``buildout.cfg`` that pulls all of the above together:\n\n ``buildout.cfg``\n\n\t.. code:: ini\n\n\t\t[buildout]\n\t\textensions = mr.developer\n\n\t\textends =\n\t\t tika-server.cfg\n\t\t crawler.cfg\n\n\n- Bootstrap and run buildout:\n\n\t.. 
code:: bash\n\n\t\tpython bootstrap.py\n\t\tbin/buildout\n\n\nRunning the crawler\n-------------------\n\nIf you created the ``bin/crawl-foo-org`` script with the buildout described\nabove, that's all you need to run the crawler:\n\n- Make sure Tika and Solr are running\n- Run ``bin/crawl-foo-org`` *(with either a relative or absolute path, working\n directory doesn't matter, so it can easily be called from a cron job)*\n\n\nRunning ``bin/crawl`` directly\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nThe ``bin/crawl-foo-org`` script is just a tiny wrapper that calls the ``bin/crawl``\nscript, generated by ``ftw.crawler``'s setuptools ``console_script``\nentry point, with the absolute path to the configuration file as the only\nargument. Any other arguments to the ``bin/crawl-foo-org`` script will be\nforwarded to ``bin/crawl``.\n\nTherefore running ``bin/crawl-foo-org [args]`` is equivalent to\n``bin/crawl foo_org_config.py [args]``.\n\nProvide known sitemap urls in site configs\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nIf you know the sitemap URL, you can configure one or more sitemap URLs\nstatically:\n\n.. code:: python\n\n Site('http://example.org/foo/',\n sitemap_urls=['http://example.org/foo/the_sitemap.xml'])\n\n\nConfigure site ID for purging\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nFor purging to work smoothly, it is recommended to configure a\ncrawler site ID.\nMake sure that each site ID is unique per Solr core!\nCandidate documents for purging will be identified by this crawler site ID.\n\n.. code:: python\n\n Site('http://example.org/',\n crawler_site_id='example.org-news')\n\nBe aware that your Solr core must provide a string field ``crawler_site_id``.\n\n\nIndexing only a particular URL\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nIf you only want to index a particular URL, pass that URL as the first\nargument to ``bin/crawl-foo-org``. 
The crawler will then only fetch and index\nthat specific URL.\n\n\nSlack-Notifications\n-------------------\n\n``ftw.crawler`` supports Slack notifications. These notifications can be used\nto monitor the crawler for errors that occur while crawling.\nTo enable Slack notifications for your environment, you need to do the following things:\n\n- Install ``ftw.crawler`` with the ``slack`` extra.\n- Set the `SLACK_TOKEN` and the `SLACK_CHANNEL` params in your crawler config or\n- use the `--slacktoken` and the `--slackchannel` arguments in the command line when\n calling the `/crawl` script.\n\nTo generate a valid Slack token for your integration, you have to create a new bot in\nyour Slack team. After you have created the new bot, Slack will automatically generate a\nvalid token for this bot. This token can then be used for your integration.\nYou can also generate a test token to test your integration, but don't forget to create\na bot for this if your application goes to production!\n\n\nDevelopment\n-----------\n\nTo start hacking on ``ftw.crawler``, use the ``development.cfg`` buildout:\n\n\n.. code:: bash\n\n\tln -s development.cfg buildout.cfg\n\tpython bootstrap.py\n\tbin/buildout\n\nThis will build a Tika JAXRS server and a Solr instance for you. The Solr\nconfiguration is set up to be compatible with the testing / example\nconfiguration at `ftw/crawler/tests/assets/basic_config.py `_.\n\nTo run the crawler against the example configuration:\n\n.. 
code:: bash\n\n\tbin/tika-server\n\tbin/solr-instance fg\n\tbin/crawl ftw/crawler/tests/assets/basic_config.py\n\n\nLinks\n-----\n\n- Github: https://github.com/4teamwork/ftw.crawler\n- Issues: https://github.com/4teamwork/ftw.crawler/issues\n- Pypi: http://pypi.python.org/pypi/ftw.crawler\n- Continuous integration: https://jenkins.4teamwork.ch/search?q=ftw.crawler\n\n\nCopyright\n---------\n\nThis package is copyright by `4teamwork `_.\n\n``ftw.crawler`` is licensed under GNU General Public License, version 2.\n\nChangelog\n=========\n\n\n1.4.0 (2017-11-08)\n------------------\n\n- Add crawler_site_id option for improving purging. [jone]\n\n1.3.0 (2017-11-03)\n------------------\n\n- Fix purging problem.\n Warning: updating \"ftw.crawler\" to this version breaks your existing crawlers\n when you set the site url to a sitemap url. Please use the \"sitemap_urls\"\n attribute instead. You also need to purge the Solr index manually and reindex.\n [jone]\n\n1.2.1 (2017-10-30)\n------------------\n\n- Encode URL in UTF-8 before generating MD5-Hash.\n [raphael-s]\n\n\n1.2.0 (2017-06-22)\n------------------\n\n- Support Slack notifications.\n [raphael-s]\n\n\n1.1.0 (2016-10-04)\n------------------\n\n- Support configuration of absolute sitemap urls. [jone]\n\n- Slow down on too many requests. 
[jone]\n\n\n1.0 (2015-11-09)\n----------------\n\n- Initial implementation.\n [lgraf]", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/4teamwork/ftw.crawler", "keywords": "crawling extraction solr", "license": "GPL2", "maintainer": "", "maintainer_email": "", "name": "ftw.crawler", "package_url": "https://pypi.org/project/ftw.crawler/", "platform": "", "project_url": "https://pypi.org/project/ftw.crawler/", "project_urls": { "Homepage": "https://github.com/4teamwork/ftw.crawler" }, "release_url": "https://pypi.org/project/ftw.crawler/1.4.0/", "requires_dist": null, "requires_python": "", "summary": "Crawl sites, extract text and metadata, index it in Solr", "version": "1.4.0" }, "last_serial": 5823503, "releases": { "1.0": [ { "comment_text": "", "digests": { "md5": "36c5cca11abec269a71330d91a792ef2", "sha256": "b2fda524b0505796597b9c461b945c3a26828e371d8f2190312db70cce8a1ffa" }, "downloads": -1, "filename": "ftw.crawler-1.0.zip", "has_sig": false, "md5_digest": "36c5cca11abec269a71330d91a792ef2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 55867, "upload_time": "2015-11-09T16:15:40", "url": "https://files.pythonhosted.org/packages/45/95/266c7d607ab213e5757ca92d42cf58ec5b6017a98fcc8ad9638ab33bb3a9/ftw.crawler-1.0.zip" } ], "1.1.0": [ { "comment_text": "", "digests": { "md5": "299018612e57c884357c963396867a19", "sha256": "6ad9db01557424aee419d7355a9703a95dc5992b2a9a8825c749c40d6f598884" }, "downloads": -1, "filename": "ftw.crawler-1.1.0.tar.gz", "has_sig": false, "md5_digest": "299018612e57c884357c963396867a19", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 38158, "upload_time": "2016-10-04T08:55:23", "url": "https://files.pythonhosted.org/packages/cc/70/d773981f8d692c4c538d85d60686d5233fa3eff51f49936772ada6f5f7cf/ftw.crawler-1.1.0.tar.gz" } ], "1.2.0": [ { 
"comment_text": "", "digests": { "md5": "e55d09ba1bf7391e0eb395e0576bc102", "sha256": "1c8d15f5ffecf4961368f78ee9b3ad00f29b4f8bc28b5be5b0b80f08b133a71b" }, "downloads": -1, "filename": "ftw.crawler-1.2.0.tar.gz", "has_sig": false, "md5_digest": "e55d09ba1bf7391e0eb395e0576bc102", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 40013, "upload_time": "2017-06-22T15:04:11", "url": "https://files.pythonhosted.org/packages/0a/d2/8b6a4429801d9b3eb06225e6a94c995a3c33d9730bb2ae9efbe4e5c18954/ftw.crawler-1.2.0.tar.gz" } ], "1.2.1": [ { "comment_text": "", "digests": { "md5": "2f406ada9645041116effbc495b3b8dd", "sha256": "77e1887ff147a32fc6060bd5f22b3225e70329e8878ddc48d2388591d08ffc89" }, "downloads": -1, "filename": "ftw.crawler-1.2.1.tar.gz", "has_sig": false, "md5_digest": "2f406ada9645041116effbc495b3b8dd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 40082, "upload_time": "2017-10-30T17:01:13", "url": "https://files.pythonhosted.org/packages/e5/59/85956b6e30ab48488348e4dce2875b2792a1c1b9b0dc7ee0e16dd091af76/ftw.crawler-1.2.1.tar.gz" } ], "1.3.0": [ { "comment_text": "", "digests": { "md5": "4b91d716e0a403bc947138a641fe1afa", "sha256": "4b03da808c5c35b5c9419393475680deae9de7d83b94793cca8ab9e000d2c6cd" }, "downloads": -1, "filename": "ftw.crawler-1.3.0.tar.gz", "has_sig": false, "md5_digest": "4b91d716e0a403bc947138a641fe1afa", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 40538, "upload_time": "2017-11-03T09:10:10", "url": "https://files.pythonhosted.org/packages/e1/80/b77f5976d3ea9ca8497e3c4d4e4016c46db484057955e42e1ee0f7044388/ftw.crawler-1.3.0.tar.gz" } ], "1.4.0": [ { "comment_text": "", "digests": { "md5": "cfe8756cbb1c4c58ba10661d182a83dd", "sha256": "93d7c95a5666ee2987187e71b8da35aa857dfe1f11acb52a09ac711619c570e6" }, "downloads": -1, "filename": "ftw.crawler-1.4.0.tar.gz", "has_sig": false, "md5_digest": "cfe8756cbb1c4c58ba10661d182a83dd", 
"packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 40839, "upload_time": "2017-11-08T15:27:16", "url": "https://files.pythonhosted.org/packages/b0/c5/b1862cd643a191f2637fb8dc9d7b65000e6e32471cef96eda8518ef080e2/ftw.crawler-1.4.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "cfe8756cbb1c4c58ba10661d182a83dd", "sha256": "93d7c95a5666ee2987187e71b8da35aa857dfe1f11acb52a09ac711619c570e6" }, "downloads": -1, "filename": "ftw.crawler-1.4.0.tar.gz", "has_sig": false, "md5_digest": "cfe8756cbb1c4c58ba10661d182a83dd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 40839, "upload_time": "2017-11-08T15:27:16", "url": "https://files.pythonhosted.org/packages/b0/c5/b1862cd643a191f2637fb8dc9d7b65000e6e32471cef96eda8518ef080e2/ftw.crawler-1.4.0.tar.gz" } ] }