{
    "info": {
        "author": "Simon Descarpentries",
        "author_email": "contact@acoeuro.com",
        "bugtrack_url": null,
        "classifiers": [
            "Development Status :: 5 - Production/Stable",
            "Intended Audience :: Developers",
            "Intended Audience :: End Users/Desktop",
            "License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)",
            "Operating System :: OS Independent",
            "Programming Language :: Python :: 3",
            "Topic :: Internet :: WWW/HTTP"
        ],
        "description": "doc_crawler - explore a website recursively and download all the wanted documents (PDF, ODT\u2026).\n\n== Synopsis\n\tdoc_crawler.py [--accept=jpe?g$] [--download] [--single-page] [--verbose] http://\u2026\n\tdoc_crawler.py [--wait=3] [--no-random-wait] --download-files url.lst\n\tdoc_crawler.py [--wait=0] --download-file http://\u2026\n\tor\n\tpython3 -m doc_crawler [\u2026] http://\u2026\n\n== Description\n_doc_crawler_ can explore a website recursively from a given URL and retrieve, in the\ndescendant pages, the encountered document files (by default: PDF, ODT, DOC, XLS, ZIP\u2026)\nbased on regular expression matching (typically against their extension). Documents can be\nlisted on the standard output or downloaded (with the _--download_ argument).\n\nTo address real life situations, activities can be logged (with _--verbose_). +\nAlso, the search can be limited to one page (with the _--single-page_ argument).\n\nDocuments can be downloaded from a given list of URL, that you may have previously\nproduced using default options of _doc_crawler_ and an output redirection such as: +\n`./doc_crawler.py http://\u2026 > url.lst`\n\nDocuments can also be downloaded one by one if necessary (to finish the work), using the\n_--download-file_ argument, which makes _doc_crawler_ a tool sufficient by itself to assist you\nat every steps.\n\nBy default, the program waits a randomly-pick amount of seconds, between 1 and 5, before each\ndownload to avoid being rude toward the webserver it interacts with (and so avoid being\nblack-listed). This behavior can be disabled (with a _--no-random-wait_ and/or a _--wait=0_\nargument).\n\n_doc_crawler.py_ works great with Tor : `torsocks doc_crawler.py http://\u2026`\n\n== Options\n*--accept*=_jpe?g$_::\n\tOptional regular expression (case insensitive) to keep matching document names.\n\tExample : _--accept=jpe?g$_ will keep all : .JPG, .JPEG, .jpg, .jpeg\n*--download*::\n\tDirectly downloads found documents if set, output their URL if not.\n*--single-page*::\n\tLimits the search for documents to download to the given URL.\n*--verbose*::\n\tCreates a log file to keep trace of what was done.\n*--wait*=x::\n\tChange the default waiting time before each download (page or document).\n\tExample : _--wait=3_ will wait between 1 and 3s before each download. Default is 5.\n*--no-random-wait*::\n\tStops the random pick up of waiting times. _--wait=_ or default is used.\n*--download-files* url.lst::\n\tDownloads each documents which URL are listed in the given file.\n\tExample : _--download-files url.lst_\n*--download-file* http://\u2026::\n\tDirectly save in the current folder the URL-pointed document.\n\n== Tests\nAround 30 _doctests_ are included in _doc_crawler.py_. You can run them with the following\ncommand in the cloned repository root: +\n`python3 -m doctest doc_crawler.py`\n\nTests can also be launched one by one using the _--test=XXX_ argument: +\n`python3 -m doc_crawler --test=download_file`\n\nTests are successfully passed if nothing is output.\n\n== Requirements\n- requests\n- yaml\n\nOne can install them under Debian using the following command : `apt install python3-requests python3-yaml`\n\n== Author\nSimon Descarpentries - https://s.d12s.fr\n\n== Ressources\nGithub repository : https://github.com/Siltaar/doc_crawler.py +\nPypi repository : https://pypi.python.org/pypi/doc_crawler\n\n== Support\nTo support this project, you may consider (even a symbolic) donation via : https://liberapay.com/Siltaar\n\n== Licence\nGNU General Public License v3.0. See LICENCE file for more information.\n",
        "description_content_type": null,
        "docs_url": null,
        "download_url": "",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/Siltaar/doc_crawler.py",
        "keywords": "crawler downloader recursive pdf-extractor web-crawler web-crawler-python file-download pdf zip doc odt",
        "license": "",
        "maintainer": "",
        "maintainer_email": "",
        "name": "doc_crawler",
        "package_url": "https://pypi.org/project/doc_crawler/",
        "platform": "",
        "project_url": "https://pypi.org/project/doc_crawler/",
        "project_urls": {
            "Homepage": "https://github.com/Siltaar/doc_crawler.py"
        },
        "release_url": "https://pypi.org/project/doc_crawler/1.2/",
        "requires_dist": null,
        "requires_python": "",
        "summary": "Explore a website recursively and download all the wanted documents (PDF, ODT\u2026)",
        "version": "1.2"
    },
    "last_serial": 3648620,
    "releases": {
        "1.0": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "3de98da7d87403126c1095f60120a450",
                    "sha256": "2ffe66eadb16480fc72ef28186b54157ed7144a11ccbbc8a0f90c58f050c3afe"
                },
                "downloads": -1,
                "filename": "doc_crawler-1.0.tar.gz",
                "has_sig": false,
                "md5_digest": "3de98da7d87403126c1095f60120a450",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 5347,
                "upload_time": "2017-08-29T16:36:54",
                "url": "https://files.pythonhosted.org/packages/b4/87/52d7f95d1cd72db0b1877dc4363f1f8f3227e4272d8f3edd89055fce27d4/doc_crawler-1.0.tar.gz"
            }
        ],
        "1.1": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "a5df0389fa1ae623eb5354383641097c",
                    "sha256": "5953248e21b3d9a6239ff946e935223918421776d85ff37847b693f585f29164"
                },
                "downloads": -1,
                "filename": "doc_crawler-1.1.tar.gz",
                "has_sig": false,
                "md5_digest": "a5df0389fa1ae623eb5354383641097c",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 5714,
                "upload_time": "2017-08-30T10:21:53",
                "url": "https://files.pythonhosted.org/packages/79/65/b3b70c3b39baa1f314a61065fdfd6ca871add9388f78efd98da0d9857349/doc_crawler-1.1.tar.gz"
            }
        ],
        "1.2": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "4a9ad71302fffd7a30901eefe1caa3a8",
                    "sha256": "148a2f660520a6334ebc6c19721776642dd458288fb091cd4e42554cb0d8453c"
                },
                "downloads": -1,
                "filename": "doc_crawler-1.2.tar.gz",
                "has_sig": false,
                "md5_digest": "4a9ad71302fffd7a30901eefe1caa3a8",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 6190,
                "upload_time": "2018-03-07T18:18:59",
                "url": "https://files.pythonhosted.org/packages/c6/15/99098901d30e2d055c138be7d594ab14794bc3475bd0713bcc8c0df305b3/doc_crawler-1.2.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "4a9ad71302fffd7a30901eefe1caa3a8",
                "sha256": "148a2f660520a6334ebc6c19721776642dd458288fb091cd4e42554cb0d8453c"
            },
            "downloads": -1,
            "filename": "doc_crawler-1.2.tar.gz",
            "has_sig": false,
            "md5_digest": "4a9ad71302fffd7a30901eefe1caa3a8",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 6190,
            "upload_time": "2018-03-07T18:18:59",
            "url": "https://files.pythonhosted.org/packages/c6/15/99098901d30e2d055c138be7d594ab14794bc3475bd0713bcc8c0df305b3/doc_crawler-1.2.tar.gz"
        }
    ]
}