{ "info": { "author": "Gorm Roedder", "author_email": "gormroedder@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "Web Trawler\n===========\n\nGiven the url of an html web page, this Python package asynchronously downloads all non-web \nfiles linked to from that web page, e.g. audio files, Excel documents, etc. Optionally, all \nweb pages linked to from the original web page can be trawled for files as well.\n\nInstallation\n------------\n\n`Python 3`_ must be installed and in your system PATH. That is, it must be a recognised command\nfor the command line interface. Enter :code:`python --version` in your command line to see whether\nyou have Python 3 installed. \n\nThe Python package manager :code:`pip` is also required. Check that you have it by running :code:`pip --version`. \nIt is automatically installed with recent versions of Python, but it can also be installed manually. \nSee `the official installation instructions`_\n\n.. _`Python 3`: https://www.python.org/downloads/\n.. _`the official installation instructions`: https://pip.pypa.io/en/stable/installing/\n\nInstalling the web_trawler package\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n\nRun the following code in your command line interface (excluding the $, which is just a prompt icon): \n\n.. code-block:: bash\n\n $ pip install web_trawler --upgrade\n\nThe package has no external dependencies. For testing, pytest_ is required.\n\n.. _pytest: https://docs.pytest.org/en/latest/contents.html\n\nThe source code for web_trawler is available on gitlab.com_.\n\n.. _gitlab.com: https://gitlab.com/dlab-indecol/web_trawler\n\nUsage\n-----\n\nCommand line\n^^^^^^^^^^^^\n\nOnce installed, web_trawler can be used like this:\n\n.. code-block:: bash\n\n $ web_trawler google.com\n\nRun this command to see how web_trawler finds links\nand inspects their http headers for more information. A bunch of logging events will be output to console. \nThere are ordinarily no files linked to from google.com,\nbut if there are, they will be downloaded to the directory :code:`download/` relative to where you ran the command.\n\nThe url argument is required. In addition, the following optional arguments are supported:\n\n --target TARGET Give a path for where you would like the files to be downloaded.\n The default path is \"download\".\n --add_links_from_linked_pages Set web_trawler to trawl pages linked to from the original web page as well\n (only goes one step,\n and only for links within the domain of the original web page)\n --interactive Short version is \"-i\". Asks user about whether or not to trawl each linked page (has no effect\n unless the --add_links_from_linked_pages flag is set to true.\n --interactive_download_prompt Short version is \"-I\". Asks user about whether or not to download each of the files found.\n --quiet Suppresses output information about which links are being processed\n and which files are being downloaded.\n --processes PROCESSES Manually set how many processes will be spawned. The default is to spawn\n one less than the number of processors detected (so as not to stall the\n system). For each process, up to 10 threads are spawned.\n --whitelist WHITELIST Space-separated file endings to whitelist. Allows use of wildcards, e.g.\n \"xls*\" to capture all the Excel file extension variants, like xlsx, xlsb,\n xlsm and xls. A given blacklist takes precedence over the whitelist.\n --blacklist BLACKLIST Space-separated file endings to blacklist. Works just whitelist, only it\n excludes files of the given file endings.\n --no_of_files_limit LIMIT Set a maximum number of files you are willing to download, in case\n web_trawler finds more than expected.\n --mb_per_file_limit LIMIT Set a maximum file size you are willing to download. Warnings are\n logged to console for each file excluded.\n\nEach argument has a shorthand consisting of their first letters, e.g. :code:`-t`, :code:`-a`, :code:`-q`, etc.\n\nA realistic example of use\n\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\n\nIf we'd like to download, say, all zip and Excel files up to 100 MB from\n`a web page on the World Input-Output Database site`_, into a local directory called \"data\",\nwe'd need to use the arguments :code:`-t` (for target), :code:`-w` (for whitelist) and :code:`-m` \n(for mb_per_file_limit):\n\n.. _a web page on the World Input-Output Database site: http://www.wiod.org/database/wiots16\n\n.. code-block:: bash\n\n $ web_trawler http://www.wiod.org/database/wiots16 -t \"data\" -w \"zip xls*\" -m 100\n\nNotice the use of a wildcard in the whitelist. The web page specified links to two different Excel associated\nfile endings. The wildcard ensures that both are captured.\n\nIf you test this command, downloads of a bunch of large files will start. Press :code:`ctrl-c` or :code:`ctrl-z` to\ninterrupt or force quit the process, respectively.\n\nMake sure to clean up any downloaded files you don't want. They should be in a folder relative to where you ran the\ncommand. If you didn't specify a target, they are downloaded to a directory called \"download\".\n\nIncluding links from linked pages\n\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\"\n\nTo see how the :code:`-a` argument works without starting a million downloads, run the following command, where\n:code:`-m 0` ensures that all files are skipped:\n\n.. code-block:: bash\n\n $ web_trawler http://www.wiod.org/database/wiots16 -a -m 0\n\nNote that this will still create the target directory if it doesn't exist already.\n\nTo get prompted about whether or not to add the files linked to from each of the linked pages, run this command,\nwhere the :code:`-a` and :code:`-i` commands have been concatenated into one, and the whitelist is set so as to\nnot have any downloads started:\n\n.. code-block:: bash\n\n $ web_trawler http://www.wiod.org/database/wiots16 -ai -w \"nosuchfileending\"\n\nUse within Python\n^^^^^^^^^^^^^^^^^\n\nThe following code does the exact same thing as the last example for the command line usage:\n\n.. code-block::\n\n import web_trawler\n\n web_trawler.trawl(\"http://www.wiod.org/database/wiots16\", \n add_links_from_linked_pages=True, mb_per_file_limit=0)\n\nThe function :code:`trawl` does the same thing as web_trawler as run from the command line, but with the arguments\npassed to it directly in Python.\n\nSeveral of the intermediary functions used in web_trawler can also be accessed through Python, i.e. to get a\nlist with information about all links on a webpage, or just the links to files, filtered with a blacklist\nor whitelist. Here's a brief description of each of them:\n\n :get_links: Takes only one argument, a url, and returns a list of Link namedtuples, described below.\n This list is unfiltered. All http links that return a http request are included.\n :get_file_links: Runs get_links and returns a filtered list of Link namedtuples for files only,\n with whitelist and/or blacklist applied if specified. Arguments have self-explanatory names.\n The whitelist and blacklist can be provided as a space-separated string or as a list.\n\nBoth :code:`get_links` and :code:`get_file_links` return lists of namedtuples with the following fields:\n\n :href: the link url\n :title: the content of the :code:`` tag containing the link\n :mb: calculated from the http header :code:`content-length`\n :type: the http header :code:`content-type`, unmodified\n\nUse in Matlab\n^^^^^^^^^^^^^\n\nIn Matlab, functions of pip installed Python packages can be called using the :code:`py` script, where optional\narguments are specified using the pyargs function:\n\n.. code-block:: matlab\n\n >> py.web_trawler.get_file_links('http://www.wiod.org/database/wiots16', pyargs('whitelist', 'xls* doc*'))\n\nStdout isn't displayed, that's why the :code:`get_file_links` function was chosen, as it returns something.\nTo use the full functionality of web_trawler, you could run the function :code:`trawl` instead. As long as\nthere are no errors, nothing will show up in the Command Window. Files will nevertheless be downloaded,\nrelative to your Current Folder in Matlab.\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://gitlab.com/dlab-indecol/web_trawler", "keywords": "", "license": "GPLv3", "maintainer": "", "maintainer_email": "", "name": "web-trawler", "package_url": "https://pypi.org/project/web-trawler/", "platform": "", "project_url": "https://pypi.org/project/web-trawler/", "project_urls": { "Homepage": "https://gitlab.com/dlab-indecol/web_trawler" }, "release_url": "https://pypi.org/project/web-trawler/0.2.0/", "requires_dist": null, "requires_python": "", "summary": "Trawl web pages for files to download", "version": "0.2.0" }, "last_serial": 3028105, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "537be6eb53de37bb424885826d74682a", "sha256": "431fb5db1661dce003e925228ebf57acf9a28c5958c018b35bef481846402829" }, "downloads": -1, "filename": "web_trawler-0.1.0.tar.gz", "has_sig": false, "md5_digest": "537be6eb53de37bb424885826d74682a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3346, "upload_time": "2017-06-21T14:33:03", "url": "https://files.pythonhosted.org/packages/4a/e0/5989f0124e94f8f8676c688f8042fbc3e874638321f4160267ca3a25666d/web_trawler-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "45a451094426da4aeb9f0f89b0d707d9", "sha256": "b633630498cc0c67a6bb011c87b1896ad59ec0c2fc283adcfd239b192303ddc0" }, "downloads": -1, "filename": "web_trawler-0.1.1.tar.gz", "has_sig": false, "md5_digest": "45a451094426da4aeb9f0f89b0d707d9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6991, "upload_time": "2017-06-21T14:50:57", "url": "https://files.pythonhosted.org/packages/0d/3e/28952863d799411e7591ef0efacf1bb0b69b50dec4079a585f33ded700cd/web_trawler-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "57fc929130f9a8f7f01b2b647607b3e2", "sha256": "16c6b4aef058cfaad2b31b980fe54193d5a3b8607a7b85080f45b6a72e7a02f1" }, "downloads": -1, "filename": "web_trawler-0.1.2.tar.gz", "has_sig": false, "md5_digest": "57fc929130f9a8f7f01b2b647607b3e2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21551, "upload_time": "2017-06-22T10:00:48", "url": "https://files.pythonhosted.org/packages/fc/48/87cfc195042d7d85e51fe63b77d7ec24ae831533db4b95dbd40e9bf2882e/web_trawler-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "4c3b8eb147b46db24c08cf53f929d7c2", "sha256": "6c353e3e85f480612b4661866377c1b2136999af1a69ff1d70ce384823c49451" }, "downloads": -1, "filename": "web_trawler-0.1.3.tar.gz", "has_sig": false, "md5_digest": "4c3b8eb147b46db24c08cf53f929d7c2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21937, "upload_time": "2017-06-28T07:01:49", "url": "https://files.pythonhosted.org/packages/de/2e/a2f39b2188bfba5cfb7de454b6b7e98b7bce496b8789589b9b9c6a02a4fb/web_trawler-0.1.3.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "463776f1cb692a20d4b2432803a9f69d", "sha256": "c285ab738370f10c1a6f9e1f0df8fae594f2b3bcf2fb5227ed7846e58aa2fdcb" }, "downloads": -1, "filename": "web_trawler-0.2.0.tar.gz", "has_sig": false, "md5_digest": "463776f1cb692a20d4b2432803a9f69d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23088, "upload_time": "2017-07-17T08:23:43", "url": "https://files.pythonhosted.org/packages/1e/f7/4032b27388066904bab82d2b3e334b4dd27be53439c4f6a2b95770f1f576/web_trawler-0.2.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "463776f1cb692a20d4b2432803a9f69d", "sha256": "c285ab738370f10c1a6f9e1f0df8fae594f2b3bcf2fb5227ed7846e58aa2fdcb" }, "downloads": -1, "filename": "web_trawler-0.2.0.tar.gz", "has_sig": false, "md5_digest": "463776f1cb692a20d4b2432803a9f69d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23088, "upload_time": "2017-07-17T08:23:43", "url": "https://files.pythonhosted.org/packages/1e/f7/4032b27388066904bab82d2b3e334b4dd27be53439c4f6a2b95770f1f576/web_trawler-0.2.0.tar.gz" } ] }