{ "info": { "author": "Simon Descarpentries", "author_email": "contact@acoeuro.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "Intended Audience :: End Users/Desktop", "License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)", "Operating System :: OS Independent", "Programming Language :: Python :: 3", "Topic :: Internet :: WWW/HTTP" ], "description": "doc_crawler - explore a website recursively and download all the wanted documents (PDF, ODT\u2026).\n\n== Synopsis\n\tdoc_crawler.py [--accept=jpe?g$] [--download] [--single-page] [--verbose] http://\u2026\n\tdoc_crawler.py [--wait=3] [--no-random-wait] --download-files url.lst\n\tdoc_crawler.py [--wait=0] --download-file http://\u2026\n\tor\n\tpython3 -m doc_crawler [\u2026] http://\u2026\n\n== Description\n_doc_crawler_ can explore a website recursively from a given URL and retrieve, in the\ndescendant pages, the encountered document files (by default: PDF, ODT, DOC, XLS, ZIP\u2026)\nbased on regular expression matching (typically against their extension). Documents can be\nlisted on the standard output or downloaded (with the _--download_ argument).\n\nTo address real life situations, activities can be logged (with _--verbose_). 
+\nAlso, the search can be limited to one page (with the _--single-page_ argument).\n\nDocuments can be downloaded from a given list of URLs, which you may have previously\nproduced using default options of _doc_crawler_ and an output redirection such as: +\n`./doc_crawler.py http://\u2026 > url.lst`\n\nDocuments can also be downloaded one by one if necessary (to finish the work), using the\n_--download-file_ argument, which makes _doc_crawler_ a tool sufficient by itself to assist you\nat every step.\n\nBy default, the program waits a randomly picked number of seconds, between 1 and 5, before each\ndownload to avoid being rude toward the webserver it interacts with (and so avoid being\nblack-listed). This behavior can be disabled (with a _--no-random-wait_ and/or a _--wait=0_\nargument).\n\n_doc_crawler.py_ works great with Tor : `torsocks doc_crawler.py http://\u2026`\n\n== Options\n*--accept*=_jpe?g$_::\n\tOptional regular expression (case insensitive) to keep matching document names.\n\tExample : _--accept=jpe?g$_ will keep all : .JPG, .JPEG, .jpg, .jpeg\n*--download*::\n\tDownloads found documents directly if set; outputs their URLs if not.\n*--single-page*::\n\tLimits the search for documents to download to the given URL.\n*--verbose*::\n\tCreates a log file to keep track of what was done.\n*--wait*=x::\n\tChanges the default waiting time before each download (page or document).\n\tExample : _--wait=3_ will wait between 1 and 3s before each download. Default is 5.\n*--no-random-wait*::\n\tDisables the random picking of waiting times. The _--wait=_ value or the default is used.\n*--download-files* url.lst::\n\tDownloads each document whose URL is listed in the given file.\n\tExample : _--download-files url.lst_\n*--download-file* http://\u2026::\n\tSaves the document at the given URL directly into the current folder.\n\n== Tests\nAround 30 _doctests_ are included in _doc_crawler.py_. 
You can run them with the following\ncommand in the cloned repository root: +\n`python3 -m doctest doc_crawler.py`\n\nTests can also be launched one by one using the _--test=XXX_ argument: +\n`python3 -m doc_crawler --test=download_file`\n\nTests pass if nothing is output.\n\n== Requirements\n- requests\n- yaml\n\nOne can install them under Debian using the following command : `apt install python3-requests python3-yaml`\n\n== Author\nSimon Descarpentries - https://s.d12s.fr\n\n== Resources\nGitHub repository : https://github.com/Siltaar/doc_crawler.py +\nPyPI repository : https://pypi.python.org/pypi/doc_crawler\n\n== Support\nTo support this project, you may consider a donation (even a symbolic one) via : https://liberapay.com/Siltaar\n\n== Licence\nGNU General Public License v3.0. See LICENCE file for more information.\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Siltaar/doc_crawler.py", "keywords": "crawler downloader recursive pdf-extractor web-crawler web-crawler-python file-download pdf zip doc odt", "license": "", "maintainer": "", "maintainer_email": "", "name": "doc_crawler", "package_url": "https://pypi.org/project/doc_crawler/", "platform": "", "project_url": "https://pypi.org/project/doc_crawler/", "project_urls": { "Homepage": "https://github.com/Siltaar/doc_crawler.py" }, "release_url": "https://pypi.org/project/doc_crawler/1.2/", "requires_dist": null, "requires_python": "", "summary": "Explore a website recursively and download all the wanted documents (PDF, ODT\u2026)", "version": "1.2" }, "last_serial": 3648620, "releases": { "1.0": [ { "comment_text": "", "digests": { "md5": "3de98da7d87403126c1095f60120a450", "sha256": "2ffe66eadb16480fc72ef28186b54157ed7144a11ccbbc8a0f90c58f050c3afe" }, "downloads": -1, "filename": "doc_crawler-1.0.tar.gz", "has_sig": false, "md5_digest": 
"3de98da7d87403126c1095f60120a450", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5347, "upload_time": "2017-08-29T16:36:54", "url": "https://files.pythonhosted.org/packages/b4/87/52d7f95d1cd72db0b1877dc4363f1f8f3227e4272d8f3edd89055fce27d4/doc_crawler-1.0.tar.gz" } ], "1.1": [ { "comment_text": "", "digests": { "md5": "a5df0389fa1ae623eb5354383641097c", "sha256": "5953248e21b3d9a6239ff946e935223918421776d85ff37847b693f585f29164" }, "downloads": -1, "filename": "doc_crawler-1.1.tar.gz", "has_sig": false, "md5_digest": "a5df0389fa1ae623eb5354383641097c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5714, "upload_time": "2017-08-30T10:21:53", "url": "https://files.pythonhosted.org/packages/79/65/b3b70c3b39baa1f314a61065fdfd6ca871add9388f78efd98da0d9857349/doc_crawler-1.1.tar.gz" } ], "1.2": [ { "comment_text": "", "digests": { "md5": "4a9ad71302fffd7a30901eefe1caa3a8", "sha256": "148a2f660520a6334ebc6c19721776642dd458288fb091cd4e42554cb0d8453c" }, "downloads": -1, "filename": "doc_crawler-1.2.tar.gz", "has_sig": false, "md5_digest": "4a9ad71302fffd7a30901eefe1caa3a8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6190, "upload_time": "2018-03-07T18:18:59", "url": "https://files.pythonhosted.org/packages/c6/15/99098901d30e2d055c138be7d594ab14794bc3475bd0713bcc8c0df305b3/doc_crawler-1.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4a9ad71302fffd7a30901eefe1caa3a8", "sha256": "148a2f660520a6334ebc6c19721776642dd458288fb091cd4e42554cb0d8453c" }, "downloads": -1, "filename": "doc_crawler-1.2.tar.gz", "has_sig": false, "md5_digest": "4a9ad71302fffd7a30901eefe1caa3a8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6190, "upload_time": "2018-03-07T18:18:59", "url": "https://files.pythonhosted.org/packages/c6/15/99098901d30e2d055c138be7d594ab14794bc3475bd0713bcc8c0df305b3/doc_crawler-1.2.tar.gz" } ] 
}