{ "info": { "author": "Felipe Aguirre Martinez", "author_email": "felipeam86@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "imgdl\n=====\n\nPython package for downloading a collection of images from a list of\nurls. It comes with the following features:\n\n- Downloads are multithreaded using ``concurrent.futures``.\n- Relies on a persistent cache. Already downloaded images are not\n downloaded again, unless you force ``imgdl`` to do so.\n- Can hide requests behind proxies.\n- Can be used as a command line utility or as a Python library.\n- Normalizes images to JPG format + RGB mode after download.\n- Generates thumbnails of varying sizes automatically.\n- Can space downloads with a random timeout drawn from a uniform\n distribution.\n\nInstallation\n------------\n\n.. code:: bash\n\n pip install imgdl\n\nOr, from the root project directory:\n\n.. code:: bash\n\n pip install .\n\nUsage\n-----\n\nHere is a simple example using the default configuration:\n\n.. code:: python\n\n from imgdl import download\n\n urls = [\n 'https://upload.wikimedia.org/wikipedia/commons/9/92/Moh_%283%29.jpg',\n 'https://upload.wikimedia.org/wikipedia/commons/8/8b/Moh_%284%29.jpg',\n 'https://upload.wikimedia.org/wikipedia/commons/c/cd/Rostige_T%C3%BCr_P4RM1492.jpg'\n ]\n\n paths = download(urls, store_path='~/.datasets/images', n_workers=50)\n\n``100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 3/3 [00:08<00:00, 2.68s/it]``\n\nImages will be downloaded to ``~/.datasets/images`` using 50 threads.\nThe function returns the list of paths to each image. Paths are\nconstructed as ``{store_path}/{SHA1-hash(url)}.jpg``. 
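The path scheme above can be sketched as follows. This is a hypothetical illustration, not ``imgdl``'s internal code; in particular, the ``image_path`` helper and the assumption that the SHA1 digest is taken over the UTF-8 encoded url are mine:

```python
import hashlib
from pathlib import Path

def image_path(url, store_path="~/.datasets/images"):
    # Hypothetical sketch: hash the url with SHA1 and use the
    # 40-character hex digest as the file name, with a .jpg suffix.
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".jpg"
    return Path(store_path).expanduser() / name
```

Because the file name depends only on the url, downloading the same url twice maps to the same path, which is what makes the persistent cache check possible.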
If for any reason a\ndownload fails, ``imgdl`` returns ``None`` as the path.\n\nNotice that if you invoke ``download`` again with the same urls, it\nwill not download them again, because it first checks that they have\nalready been downloaded.\n\n.. code:: python\n\n paths = download(urls, store_path='~/.datasets/images', n_workers=50)\n\n``100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 3/3 [00:00<00:00, 24576.00it/s]``\n\nThe download was instantaneous, and ``imgdl`` still returns\nthe image paths.\n\nHere is the complete list of parameters taken by ``download``:\n\n- ``iterator``: The only mandatory parameter. Usually a list of urls,\n but can be any kind of iterator.\n- ``store_path``: Root path where images should be stored\n- ``n_workers``: Number of simultaneous threads to use\n- ``timeout``: Timeout for each url request\n- ``thumbs``: If True, create thumbnails of sizes according to\n ``thumbs_size``\n- ``thumbs_size``: Dictionary of the kind {name: (width, height)}\n indicating the thumbnail sizes to be created.\n- ``min_wait``: Minimum wait time between image downloads\n- ``max_wait``: Maximum wait time between image downloads\n- ``proxies``: Proxy or list of proxies to use for the requests\n- ``headers``: Headers to be passed to ``requests``\n- ``user_agent``: User agent to be used for the requests\n- ``notebook``: If True, use the notebook version of the tqdm progress bar\n- ``debug``: If True, ``imgdl`` logs urls that could not be downloaded\n- ``force``: ``download`` checks first if the image already exists in\n ``store_path`` in order to avoid double downloads. If you want to\n force downloads, set this to True.\n\nMost of these parameters can also be set in a ``config.yaml`` file found\nin the directory where the Python process was launched. 
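For instance, such a file might look like the sketch below. The key names here are an assumption on my part, taken to mirror the ``download`` parameters listed above; the linked example file is the authoritative reference:

```yaml
# Hypothetical config.yaml sketch -- key names assumed to mirror
# the download() parameters documented above.
store_path: ~/.datasets/images
n_workers: 50
timeout: 5.0
min_wait: 0.0
max_wait: 0.0
```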
See\n`config.yaml.example`_\n\nCommand Line Interface\n----------------------\n\nIt can also be used as a command line utility:\n\n.. code:: bash\n\n $ imgdl --help\n usage: imgdl [-h] [-o STORE_PATH] [--thumbs THUMBS] [--n_workers N_WORKERS]\n [--timeout TIMEOUT] [--min_wait MIN_WAIT] [--max_wait MAX_WAIT]\n [--proxy PROXY] [-u USER_AGENT] [-f] [--notebook] [-d]\n urls\n\n Bulk image downloader from a list of urls\n\n positional arguments:\n urls Text file with the list of urls to be downloaded\n\n optional arguments:\n -h, --help show this help message and exit\n -o STORE_PATH, --store_path STORE_PATH\n Root path where images should be stored (default:\n ~/.datasets/imgdl)\n --thumbs THUMBS Thumbnail size to be created. Can be specified as many\n times as thumbs sizes you want (default: None)\n --n_workers N_WORKERS\n Number of simultaneous threads to use (default: 50)\n --timeout TIMEOUT Timeout to be given to the url request (default: 5.0)\n --min_wait MIN_WAIT Minimum wait time between image downloads (default:\n 0.0)\n --max_wait MAX_WAIT Maximum wait time between image downloads (default:\n 0.0)\n --proxy PROXY Proxy or list of proxies to use for the requests\n (default: None)\n -u USER_AGENT, --user_agent USER_AGENT\n User agent to be used for the requests (default:\n Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0)\n Gecko/20100101 Firefox/55.0)\n -f, --force Force the download even if the files already exists\n (default: False)\n --notebook Use the notebook version of tqdm (default: False)\n -d, --debug Activate debug mode (default: False)\n\n\nDownload images from google\n===========================\n\nThis is an example of how we can use ``imgdl`` to download images from a google image search.\nI currently use this to quickly build up image datasets. I took inspiration from `this`_ blog\npost by `pyimagesearch`_.\n\nRequirements\n------------\n\nInstall imgdl with the ``[google]`` extra requirements:\n\n.. 
code:: bash\n\n pip install imgdl[google]\n\n\nDownload the webdriver for Chrome `here`_ and make sure it\u2019s in your PATH, e.g., place it in /usr/bin or /usr/local/bin.\n\n.. code:: bash\n\n sudo cp chromedriver /usr/local/bin/\n\nClone this repository, or simply download the ``google.py`` script.\n\nUsage\n-----\n\n\nYou are ready to download images from a google images search. Here is a usage example:\n\n.. code:: bash\n\n $ python google.py \"paris by night\" -n 600 --interactive\n Querying google images for 'paris by night'\n Scrolling down five times\n 600 images found.\n Downloading to /Users/aguirre/Projets/imagedownloader/examples/images\n 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 600/600 [01:15<00:00, 7.91it/s]\n 2018-03-04 23:21:52,616 - imgdl.downloader - INFO - 0 images failed to download\n\nThe first argument is the query sent to google. With ``-n 600`` you are asking for at least 600 images.\nBy default, a google images results page only shows 100 images and requires you to scroll down if you want more.\nThe script uses `selenium`_ to simulate a browsing session and scroll down the google search page.\nWith the ``--interactive`` flag, Chrome will open and you will be able to see how it scrolls down in order to\nget more images. Here is the full list of command line options:\n\n.. 
code:: bash\n\n $ python google.py --help\n usage: google.py [-h] [-n N_IMAGES] [--interactive] [-o STORE_PATH]\n [--thumbs THUMBS] [--n_workers N_WORKERS] [--timeout TIMEOUT]\n [--min_wait MIN_WAIT] [--max_wait MAX_WAIT] [--proxy PROXY]\n [-u USER_AGENT] [-f] [--notebook] [-d]\n query\n\n Download images from a google images query\n\n positional arguments:\n query Query string to be executed on google images\n\n optional arguments:\n -h, --help show this help message and exit\n -n N_IMAGES, --n_images N_IMAGES\n Number of expected images to download (default: 100)\n --interactive Open up chrome interactively to see the search results\n and scrolling action. (default: False)\n -o STORE_PATH, --store_path STORE_PATH\n Root path where images should be stored (default:\n images)\n --thumbs THUMBS Thumbnail size to be created. Can be specified as many\n times as thumbs sizes you want (default: None)\n --n_workers N_WORKERS\n Number of simultaneous threads to use (default: 40)\n --timeout TIMEOUT Timeout to be given to the url request (default: 5.0)\n --min_wait MIN_WAIT Minimum wait time between image downloads (default:\n 0.0)\n --max_wait MAX_WAIT Maximum wait time between image downloads (default:\n 0.0)\n --proxy PROXY Proxy or list of proxies to use for the requests\n (default: None)\n -u USER_AGENT, --user_agent USER_AGENT\n User agent to be used for the requests (default:\n Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0)\n Gecko/20100101 Firefox/55.0)\n -f, --force Force the download even if the files already exists\n (default: False)\n --notebook Use the notebook version of tqdm (default: False)\n -d, --debug Activate debug mode (default: False)\n\n\nAcknowledgements\n----------------\n\nImages used for tests are from the `wikimedia commons`_\n\n.. _config.yaml.example: config.yaml.example\n.. _wikimedia commons: https://commons.wikimedia.org\n.. _here: https://sites.google.com/a/chromium.org/chromedriver/downloads\n.. 
_this: https://www.pyimagesearch.com/2017/12/04/how-to-create-a-deep-learning-dataset-using-google-images/\n.. _pyimagesearch: https://www.pyimagesearch.com/\n.. _selenium: http://selenium-python.readthedocs.io/\n\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/felipeam86/imagedownloader", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "imgdl", "package_url": "https://pypi.org/project/imgdl/", "platform": "", "project_url": "https://pypi.org/project/imgdl/", "project_urls": { "Homepage": "https://github.com/felipeam86/imagedownloader" }, "release_url": "https://pypi.org/project/imgdl/1.1.0/", "requires_dist": [ "Pillow (>=4.2.1)", "requests (>=2.14.2)", "tqdm (>=4.15.0)", "PyYAML", "attrs", "python-json-logger", "jupyter; extra == 'docs'", "ipython; extra == 'docs'", "pandas; extra == 'docs'", "invoke; extra == 'docs'", "selenium; extra == 'google'", "beautifulsoup4; extra == 'google'", "lxml; extra == 'google'", "pytest; extra == 'tests'", "pytest-pep8; extra == 'tests'", "pep8; extra == 'tests'", "autopep8; extra == 'tests'", "pytest-xdist; extra == 'tests'", "pytest-cov; extra == 'tests'" ], "requires_python": "~=3.6", "summary": "Bulk image downloader from a list of urls", "version": "1.1.0" }, "last_serial": 3777533, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "5094404930ef60e7b34b89ac78c42842", "sha256": "d1a9334c4c841b5bf2d2fd378a56c0f4b96c093e38660971d157bdecb3653ff2" }, "downloads": -1, "filename": "imgdl-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "5094404930ef60e7b34b89ac78c42842", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": "~=3.6", "size": 15203, "upload_time": "2018-04-10T23:06:48", "url": "https://files.pythonhosted.org/packages/ee/27/6964bb0240a8f33cae32918338b718cf356e4d104d63ee54a699fa61a57a/imgdl-1.0.0-py3-none-any.whl" 
} ], "1.1.0": [ { "comment_text": "", "digests": { "md5": "0504695be675073d9534b6562a902a68", "sha256": "b263304d89eac5900ff0913a004b8c0a344c275a758a5a6940a7ebc64ffe477e" }, "downloads": -1, "filename": "imgdl-1.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "0504695be675073d9534b6562a902a68", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": "~=3.6", "size": 14962, "upload_time": "2018-04-18T15:11:21", "url": "https://files.pythonhosted.org/packages/96/db/f513563d9c9578bfbb325456c57d055f0c327a18ef7804d33b8da9f91deb/imgdl-1.1.0-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "0504695be675073d9534b6562a902a68", "sha256": "b263304d89eac5900ff0913a004b8c0a344c275a758a5a6940a7ebc64ffe477e" }, "downloads": -1, "filename": "imgdl-1.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "0504695be675073d9534b6562a902a68", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": "~=3.6", "size": 14962, "upload_time": "2018-04-18T15:11:21", "url": "https://files.pythonhosted.org/packages/96/db/f513563d9c9578bfbb325456c57d055f0c327a18ef7804d33b8da9f91deb/imgdl-1.1.0-py3-none-any.whl" } ] }