{ "info": { "author": "Time Traveller", "author_email": "time.traveller.san@gmail.com", "bugtrack_url": null, "classifiers": [ "Operating System :: POSIX :: Linux", "Programming Language :: Python :: 3" ], "description": "# DatasetScraper\nTool to create image datasets for machine learning problems by scraping search engines like Google, Bing and Baidu.\n\n# Features:\n- **Search engine support**: Google, Bing, Baidu. (in-production): Yahoo, Yandex, Duckduckgo\n- **Image format support**: jpg, png, svg, gif, jpeg\n- Fast multiprocessing enabled scraper\n- Very fast multithreaded downloader\n- Data verification after download for assertion of image files\n\n# Installation\n- COMING SOON on pypi\n\n# Usage:\n- Import\n`from datasetscraper import Scraper`\n\n- Defaults\n```python\nobj = Scraper()\nurls = obj.fetch_urls('kiniro mosaic')\nobj.download(urls, directory='kiniro_mosaic/')\n```\n\n- Specify a search engine\n```python\nobj = Scraper()\nurls = obj.fetch_urls('kiniro mosaic', engine=['google'])\nobj.download(urls, directory='kiniro_mosaic/')\n```\n\n- Specify a list of search engines\n```python\nobj = Scraper()\nurls = obj.fetch_urls('kiniro mosaic', engine=['google', 'bing'])\nobj.download(urls, directory='kiniro_mosaic/')\n```\n\n- Specify max images (default was 200)\n```python\nobj = Scraper()\nurls = obj.fetch_urls('kiniro mosaic', engine=['google', 'bing'], maxlist=[500, 300])\nobj.download(urls, directory='kiniro_mosaic/')\n```\n\n# FAQs\n- Why aren't yandex, yahoo, duckduckgo and other search engines supported?\nThey are hard to scrape, I am working on them and will update as soon as I can.\n\n- I set maxlist=[500] why are only (x<500) images downloaded?\nThere can be several reasons for this:\n - Search ran out: This happens very often, google/bing might not have enough images for your query\n - Slow internet: Increase the timeout (default is 60 seconds) as follows: ```obj.download(urls, directory='kiniro_mosaic/', timeout=100)```\n\n- How to 
debug?\nYou can set the logging level when creating the scraper object: `obj = Scraper(logger.INFO)`\n\n# TODO:\n- More search engines\n- Better debugging\n- Write documentation\n- Text data? Audio data?\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/TimeTraveller-San/datasetscraper", "keywords": "dataset_scraper,machine learning,dataset,images,scrape,yandex,google,bing,baidu", "license": "GPLv2", "maintainer": "", "maintainer_email": "", "name": "datasetscraper", "package_url": "https://pypi.org/project/datasetscraper/", "platform": "", "project_url": "https://pypi.org/project/datasetscraper/", "project_urls": { "Homepage": "https://github.com/TimeTraveller-San/datasetscraper" }, "release_url": "https://pypi.org/project/datasetscraper/0.0.4/", "requires_dist": [ "pyppeteer (>=0.0.25)", "fastprogress (>=0.1.21)", "requests (>=2.19.1)" ], "requires_python": ">=3", "summary": "Tool to create image datasets for machine learning problems by scraping search engines like Google, Bing and Baidu.", "version": "0.0.4" }, "last_serial": 5166153, "releases": { "0.0.4": [ { "comment_text": "", "digests": { "md5": "6e1af8303f541d2b69e19dd7be9728e6", "sha256": "4819ae12d72c5f358d6bd753b5203cb2689ba4e796799dad535e482d36f42b61" }, "downloads": -1, "filename": "datasetscraper-0.0.4-py3-none-any.whl", "has_sig": false, "md5_digest": "6e1af8303f541d2b69e19dd7be9728e6", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 13869, "upload_time": "2019-04-19T21:15:49", "url": "https://files.pythonhosted.org/packages/76/32/5b4d10c3e5fdd37fb62e18ff000ec8f857dcff91e0efd4594e0e8e0275ce/datasetscraper-0.0.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5e2b2caeb2a770518776c0cc9683e686", "sha256": "2487404f8454cdef44d32309c62cc3a035b0d0d7a5fea4744881d66dbf060437" }, "downloads": -1, "filename": 
"datasetscraper-0.0.4.tar.gz", "has_sig": false, "md5_digest": "5e2b2caeb2a770518776c0cc9683e686", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 5792, "upload_time": "2019-04-19T21:15:51", "url": "https://files.pythonhosted.org/packages/aa/3f/ff3744248ae93b2724e7d210bb95fa0a391c6c81b50db831a968a7a6e009/datasetscraper-0.0.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "6e1af8303f541d2b69e19dd7be9728e6", "sha256": "4819ae12d72c5f358d6bd753b5203cb2689ba4e796799dad535e482d36f42b61" }, "downloads": -1, "filename": "datasetscraper-0.0.4-py3-none-any.whl", "has_sig": false, "md5_digest": "6e1af8303f541d2b69e19dd7be9728e6", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 13869, "upload_time": "2019-04-19T21:15:49", "url": "https://files.pythonhosted.org/packages/76/32/5b4d10c3e5fdd37fb62e18ff000ec8f857dcff91e0efd4594e0e8e0275ce/datasetscraper-0.0.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5e2b2caeb2a770518776c0cc9683e686", "sha256": "2487404f8454cdef44d32309c62cc3a035b0d0d7a5fea4744881d66dbf060437" }, "downloads": -1, "filename": "datasetscraper-0.0.4.tar.gz", "has_sig": false, "md5_digest": "5e2b2caeb2a770518776c0cc9683e686", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 5792, "upload_time": "2019-04-19T21:15:51", "url": "https://files.pythonhosted.org/packages/aa/3f/ff3744248ae93b2724e7d210bb95fa0a391c6c81b50db831a968a7a6e009/datasetscraper-0.0.4.tar.gz" } ] }