Metadata-Version: 1.1
Name: ftw.crawler
Version: 1.0
Summary: Crawl sites, extract text and metadata, index it in Solr
Home-page: https://github.com/4teamwork/ftw.crawler
Author: 4teamwork AG
Author-email: info@4teamwork.ch
License: GPL2
Description: ftw.crawler
        ===========
        
        Installation
        ------------
        
        The easiest way to install ``ftw.crawler`` is to create a buildout that
        contains the configuration, pulls in the egg using ``zc.recipe.egg``, and
        creates a script in the ``bin/`` directory that directly launches the
        crawler with the respective configuration as an argument:
        
        - First, create a configuration file for the crawler. You can base your
          configuration on `ftw/crawler/tests/assets/basic_config.py <https://github.com/4teamwork/ftw.crawler/blob/master/ftw/crawler/tests/assets/basic_config.py>`_ by copying
          it to your buildout and adapting it as needed.
        
          Make sure to configure at least the ``tika`` and ``solr`` URLs to point to
          the correct locations of the respective services, and to adapt the ``sites``
          list to your needs.
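
          As a rough sketch, a minimal configuration file has the shape shown
          below. The import paths, class names, and keyword arguments are
          assumptions based on the linked example config; verify them against
          the example file for your version rather than copying this snippet.

          .. code:: python

              # Sketch of a crawler configuration file (names are assumptions;
              # see the linked basic_config.py for a working example).
              from ftw.crawler.configuration import Config
              from ftw.crawler.configuration import Site

              CONFIG = Config(
                  sites=[
                      # The sites whose sitemaps should be crawled
                      Site('http://www.foo.org/'),
                  ],
                  tika='http://localhost:9998/',
                  solr='http://localhost:8983/solr',
                  # The example config additionally defines a list of fields
                  # that map extracted metadata to the Solr schema.
              )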
        
        - Create a buildout config that installs ``ftw.crawler`` using ``zc.recipe.egg``:
        
          ``crawler.cfg``
        
        	.. code:: ini
        
        		[buildout]
        		parts +=
        		    crawler
        		    crawl-foo-org
        
        		[crawler]
        		recipe = zc.recipe.egg
        		eggs = ftw.crawler
        
        
        - Next, define a buildout section that creates a ``bin/crawl-foo-org``
          script, which calls ``bin/crawl foo_org_config.py`` using absolute paths
          (for easier use from cron jobs):
        
        	.. code:: ini
        
        		[crawl-foo-org]
        		recipe = collective.recipe.scriptgen
        		cmd = ${buildout:bin-directory}/crawl
        		arguments =
        		    ${buildout:directory}/foo_org_config.py
        		    --tika http://localhost:9998/
        		    --solr http://localhost:8983/solr
        
          (The ``--tika`` and ``--solr`` command line arguments are optional;
          they can also be set in the configuration file. If given, the command
          line arguments take precedence over the corresponding parameters in
          the config file.)
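
          The precedence rule can be sketched as follows; this is an
          illustrative snippet, not ftw.crawler's actual code:

          .. code:: python

              def resolve(cli_value, config_value):
                  """A setting resolves to the command line value if one was
                  given, and falls back to the config file value otherwise."""
                  return cli_value if cli_value is not None else config_value

              # --tika passed on the command line overrides the config file:
              print(resolve('http://localhost:9998/', 'http://tika.internal:9998/'))
              # no --solr on the command line: the config file value is used
              print(resolve(None, 'http://localhost:8983/solr'))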
        
        
        - Add a buildout config that downloads and configures a Tika JAXRS server:
        
          ``tika-server.cfg``
        
        	.. code:: ini
        
        		[buildout]
        		parts +=
        		    supervisor
        		    tika-server-download
        		    tika-server
        
        		[supervisor]
        		recipe = collective.recipe.supervisor
        		plugins =
        		      superlance
        		port = 8091
        		user = supervisor
        		password = admin
        		programs =
        		    10 tika-server (stopasgroup=true) ${buildout:bin-directory}/tika-server true your_os_user
        
        		[tika-server-download]
        		recipe = hexagonit.recipe.download
        		url = http://repo1.maven.org/maven2/org/apache/tika/tika-server/1.5/tika-server-1.5.jar
        		md5sum = 0f70548f233ead7c299bf7bc73bfec26
        		download-only = true
        		filename = tika-server.jar
        
        		[tika-server]
        		port = 9998
        		recipe = collective.recipe.scriptgen
        		cmd = java
        		arguments = -jar ${tika-server-download:destination}/${tika-server-download:filename} --port ${:port}
        
          Replace ``your_os_user`` with the OS user that should run the Tika
          server, and adjust the supervisor and Tika ports as needed.
        
        
        - Finally, add a `bootstrap.py <http://downloads.buildout.org/2/bootstrap.py>`_
          and create the ``buildout.cfg`` that pulls all of the above together:
        
          ``buildout.cfg``
        
        	.. code:: ini
        
        		[buildout]
        		extensions = mr.developer
        
        		extends =
        		    tika-server.cfg
        		    crawler.cfg
        
        
        - Bootstrap and run buildout:
        
        	.. code:: bash
        
        		python bootstrap.py
        		bin/buildout
        
        
        Running the crawler
        -------------------
        
        If you created the ``bin/crawl-foo-org`` script with the buildout described
        above, that's all you need to run the crawler:
        
        - Make sure Tika and Solr are running
        - Run ``bin/crawl-foo-org`` *(either a relative or absolute path works;
          the working directory doesn't matter, so it can easily be called from
          a cron job)*
        
        
        Running ``bin/crawl`` directly
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        The ``bin/crawl-foo-org`` script is just a thin wrapper that calls the
        ``bin/crawl`` script, generated by ``ftw.crawler``'s setuptools
        ``console_scripts`` entry point, with the absolute path to the
        configuration file as the only argument. Any other arguments passed to
        the ``bin/crawl-foo-org`` script are forwarded to ``bin/crawl``.
        
        Therefore running ``bin/crawl-foo-org [args]`` is equivalent to
        ``bin/crawl foo_org_config.py [args]``.
        
        Indexing only a particular URL
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        
        If you only want to index a particular URL, pass that URL as the first
        argument to ``bin/crawl-foo-org``. The crawler will then only fetch and index
        that specific URL.
        
        
        Development
        -----------
        
        To start hacking on ``ftw.crawler``, use the ``development.cfg`` buildout:
        
        
        .. code:: bash
        
        	ln -s development.cfg buildout.cfg
        	python bootstrap.py
        	bin/buildout
        
        This will build a Tika JAXRS server and a Solr instance for you. The Solr
        configuration is set up to be compatible with the testing / example
        configuration at `ftw/crawler/tests/assets/basic_config.py <https://github.com/4teamwork/ftw.crawler/blob/master/ftw/crawler/tests/assets/basic_config.py>`_.
        
        To run the crawler against the example configuration:
        
        .. code:: bash
        
        	bin/tika-server
        	bin/solr-instance fg
        	bin/crawl ftw/crawler/tests/assets/basic_config.py
        
        
        Links
        -----
        
        - Github: https://github.com/4teamwork/ftw.crawler
        - Issues: https://github.com/4teamwork/ftw.crawler/issues
        - PyPI: http://pypi.python.org/pypi/ftw.crawler
        - Continuous integration: https://jenkins.4teamwork.ch/search?q=ftw.crawler
        
        
        Copyright
        ---------
        
        This package is copyright by `4teamwork <http://www.4teamwork.ch/>`_.
        
        ``ftw.crawler`` is licensed under GNU General Public License, version 2.
        
        Changelog
        =========
        
        
        1.0 (2015-11-09)
        ----------------
        
        - Initial implementation.
          [lgraf]
        
Keywords: crawling extraction solr
Platform: UNKNOWN
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License (GPL)
Classifier: Programming Language :: Python
Classifier: Topic :: Software Development
