{ "info": { "author": "Justin Shenk", "author_email": "shenkjustin@gmail.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "Intended Audience :: Education", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6", "Topic :: Scientific/Engineering :: Mathematics", "Topic :: Software Development :: Libraries", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "# :monkey: simages:monkey:\n[![PyPI version](https://badge.fury.io/py/simages.svg)](https://badge.fury.io/py/simages) [![Build Status](https://travis-ci.com/justinshenk/simages.svg?branch=master)](https://travis-ci.com/justinshenk/simages) [![Documentation Status](https://readthedocs.org/projects/simages/badge/?version=latest)](https://simages.readthedocs.io/en/latest/?badge=latest) [![DOI](https://zenodo.org/badge/188052094.svg)](https://zenodo.org/badge/latestdoi/188052094) [![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/justinshenk/simages/master?filepath=demo.ipynb)\n\n\nFind similar images within a dataset. \n\nUseful for removing duplicate images from a dataset after scraping images with [google-images-download](https://github.com/hardikvasa/google-images-download).\n\nThe Python API returns `pairs, distances`, where `pairs` are the (ordered) closest pairs and `distances` are the \ncorresponding embedding distances.\n\n### Install\n\nSee the [installation docs](https://simages.readthedocs.io/en/latest/install.html) for all details. \n\n```bash\npip install simages\n```\n\nor install from source:\n\n```bash\ngit clone https://github.com/justinshenk/simages\ncd simages\npip install .\n```\n\nTo install the interactive interface, [install mongodb](https://docs.mongodb.com/manual/installation/) and use `pip install \"simages[all]\"` instead.\n\n### Demo\n\n1. 
Minimal command-line interface with ```simages-show```:\n\n![simages_demo](images/simages_demo.gif)\n\n2. Interactive image deletion with ```simages add/find```:\n![simages_web_demo](images/screenshot_server.png)\n\n### Usage\n\nTwo interfaces exist:\n\n1. matplotlib interface which plots the duplicates for visual inspection\n2. mongodb + flask interface which allows interactive deletion [optional]\n\n#### Minimal interface\n\nIn your console, enter the directory with images and use `simages-show`:\n\n```bash\n$ simages-show --data-dir .\n```\n\n```\nusage: simages-show [-h] [--data-dir DATA_DIR] [--show-train]\n [--epochs EPOCHS] [--num-channels NUM_CHANNELS]\n [--pairs PAIRS] [--zdim ZDIM] [-s]\n\n -h, --help show this help message and exit\n --data-dir DATA_DIR, -d DATA_DIR\n Folder containing image data\n --show-train, -t Show training of embedding extractor every epoch\n --epochs EPOCHS, -e EPOCHS\n Number of passes of dataset through model for\n training. More is better but takes more time.\n --num-channels NUM_CHANNELS, -c NUM_CHANNELS\n Number of channels for data (1 for grayscale, 3 for\n color)\n --pairs PAIRS, -p PAIRS\n Number of pairs of images to show\n --zdim ZDIM, -z ZDIM Compression bits (bigger generally performs better but\n takes more time)\n -s, --show Show closest pairs\n\n```\n\n#### Web Interface [Optional]\n\nNote: To install the web interface API, [install and run mongodb](https://docs.mongodb.com/manual/installation/) and use `pip install \"simages[all]\"` to install optional dependencies.\n\nAdd your pictures to the database (this will take some time depending on the number of pictures)\n\n```\nsimages add \n```\n\nA webpage will come up with all of the similar or duplicate pictures:\n```\nsimages find \n```\n\n```\nUsage:\n simages add ... [--db=] [--parallel=]\n simages remove ... 
[--db=]\n simages clear [--db=]\n simages show [--db=]\n simages find [--print] [--delete] [--match-time] [--trash=] [--db=] [--epochs=]\n simages -h | --help\nOptions:\n -h, --help Show this screen\n --db= The location of the database or a MongoDB URI. (default: ./db)\n --parallel= The number of parallel processes to run to hash the image\n files (default: number of CPUs).\n find:\n --print Only print duplicate files rather than displaying HTML file\n --delete Move all found duplicate pictures to the trash. This option takes priority over --print.\n --match-time Adds the extra constraint that duplicate images must have the\n same capture times in order to be considered.\n --trash= Where files will be put when they are deleted (default: ./Trash)\n --epochs= Epochs for training [default: 2]\n```\n\n\n### Python APIs\n\n#### Numpy array\n\n```python\nfrom simages import find_duplicates\nimport numpy as np\n\narray_data = np.random.random((100, 3, 48, 48))  # N x C x H x W\npairs, distances = find_duplicates(array_data)\n\n```\n\n#### Folder\n\n```python\nfrom simages import find_duplicates\n\ndata_dir = \"my_images_folder\"\npairs, distances = find_duplicates(data_dir)\n\n```\n\nDefault options for `find_duplicates` are:\n\n```python\ndef find_duplicates(\n input: Union[str, np.ndarray],\n n: int = 5,\n num_epochs: int = 2,\n num_channels: int = 3,\n show: bool = False,\n show_train: bool = False,\n **kwargs\n):\n \"\"\"Find duplicates in dataset. 
Either `array` or `data_dir` must be specified.\n\n Args:\n input (str or np.ndarray): folder directory or N x C x H x W array\n n (int): number of closest pairs to identify\n num_epochs (int): how long to train the autoencoder (more is generally better)\n show (bool): display the closest pairs\n show_train (bool): show output every epoch\n z_dim (int): size of compression (more is generally better, but slower)\n kwargs (dict): etc, passed to `EmbeddingExtractor`\n\n Returns:\n pairs (np.ndarray): indices for closest pairs of images, n x 2 array\n distances (np.ndarray): distances of each pair to each other\n \"\"\"\n```\n\n#### `Embeddings` API\n\n```python\nfrom simages import Embeddings\nimport numpy as np\n\nN = 1000\ndata = np.random.random((N, 28, 28))\nembeddings = Embeddings(data)\n\n# Access the array\narray = embeddings.array # N x z (compression size)\n\n# Get 5 closest pairs of images\npairs, distances = embeddings.duplicates(n=5)\n\n```\n\n```python\nIn [0]: pairs\nOut[0]: array([[912, 990], [716, 790], [907, 943], [483, 492], [806, 883]])\n\nIn [1]: distances\nOut[1]: array([0.00148035, 0.00150703, 0.00158789, 0.00168699, 0.00168721])\n```\n\n#### `EmbeddingExtractor` API\n\n```python\nfrom simages import EmbeddingExtractor\nimport numpy as np\n\nN = 1000\ndata = np.random.random((N, 28, 28))\nextractor = EmbeddingExtractor(data, num_channels=1) # grayscale\n\n# Show 10 closest pairs of images\npairs, distances = extractor.show_duplicates(n=10)\n\n```\n\nClass attributes and parameters:\n\n```python\nclass EmbeddingExtractor:\n \"\"\"Extract embeddings from data with models and allow visualization.\n\n Attributes:\n trainloader (torch loader)\n evalloader (torch loader)\n model (torch.nn.Module)\n embeddings (np.ndarray)\n\n \"\"\"\n def __init__(\n self,\n input:Union[str, np.ndarray],\n num_channels=None,\n num_epochs=2,\n batch_size=32,\n show_train=True,\n show=False,\n z_dim=8,\n **kwargs,\n ):\n \"\"\"Inits EmbeddingExtractor with input, either `str` or 
`np.ndarray`, performs training and validation.\n\n Args:\n input (np.ndarray or str): data\n num_channels (int): grayscale = 1, color = 3\n num_epochs (int): more is better (generally)\n batch_size (int): number of images per batch\n show_train (bool): show intermediate training results\n show (bool): show closest pairs\n z_dim (int): compression size\n kwargs (dict)\n\n \"\"\"\n\n```\n\nSpecify the number of pairs to identify with the parameter `n`.\n\n### How it works\n\n*simages* uses a convolutional autoencoder with PyTorch and compares the latent representations with [closely](https://github.com/justinshenk/closely) :triangular_ruler:.\n\n#### Dependencies\n\n*simages* depends on\nthe following packages:\n\n- [closely](https://github.com/justinshenk/closely)\n- [torch](https://pytorch.org)\n- [torchvision](https://pytorch.org)\n- scikit-learn\n- matplotlib\n\nOptional dependencies, installed with `pip install simages[all]`, include:\n\n- pymongo\n- fastcluster\n- flask\n- jinja2\n- dnspython\n- python-magic\n- termcolor\n\n### Cite\n\nIf you use simages, please cite it:\n```\n @misc{justin_shenk_2019_3237830,\n author = {Justin Shenk},\n title = {justinshenk/simages: v19.0.1},\n month = jun,\n year = 2019,\n doi = {10.5281/zenodo.3237830},\n url = {https://doi.org/10.5281/zenodo.3237830}\n }\n```\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/justinshenk/simages", "keywords": "images,photos,duplicates,preprocessing,similar data", "license": "MIT", "maintainer": "Justin Shenk", "maintainer_email": "shenkjustin@gmail.com", "name": "simages", "package_url": "https://pypi.org/project/simages/", "platform": "", "project_url": "https://pypi.org/project/simages/", "project_urls": { "Homepage": "https://github.com/justinshenk/simages" }, "release_url": "https://pypi.org/project/simages/19.0.2.post1/", "requires_dist": [ 
"numpy", "scipy", "torch (>=1.0)", "torchvision (>=0.3)", "Pillow", "closely", "pymongo", "flask", "jinja2 (>=2.10)", "more-itertools", "Flask-Cors", "dnspython", "Werkzeug", "python-magic", "termcolor", "coverage ; extra == 'dev'", "pytest ; extra == 'dev'", "sphinx ; extra == 'dev'", "wheel ; extra == 'dev'", "pre-commit ; extra == 'dev'", "sphinx ; extra == 'docs'", "coverage ; extra == 'tests'", "pytest ; extra == 'tests'" ], "requires_python": ">= 3.6", "summary": "Find similar images in a dataset", "version": "19.0.2.post1" }, "last_serial": 5415152, "releases": { "19.0.0.dev0": [ { "comment_text": "", "digests": { "md5": "b29c32a4dd58fd1993ca7042007ffc6d", "sha256": "8d373cc3c6c26d88a5656e3cdf38a9ed7ac92e89ec3f0a3e6f987b9cb32230f6" }, "downloads": -1, "filename": "simages-19.0.0.dev0-py3-none-any.whl", "has_sig": false, "md5_digest": "b29c32a4dd58fd1993ca7042007ffc6d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">= 3.6", "size": 10379, "upload_time": "2019-05-28T17:15:23", "url": "https://files.pythonhosted.org/packages/79/cc/91d0cb1c9dc2940dcc15806029b78051e6554dcba38ef30e844ccef61a8f/simages-19.0.0.dev0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d509d08c9deb404b0d0548231eea716e", "sha256": "19e8ce464bfb40bc96748603c3db87ff779b5b322dfba418cca2137cbb725670" }, "downloads": -1, "filename": "simages-19.0.0.dev0.tar.gz", "has_sig": false, "md5_digest": "d509d08c9deb404b0d0548231eea716e", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3.6", "size": 10296, "upload_time": "2019-05-28T17:15:25", "url": "https://files.pythonhosted.org/packages/a3/ce/cd6aa784a1939cbe384f947c88d2888190d0d9f0fd087e2ab710d79df2ee/simages-19.0.0.dev0.tar.gz" } ], "19.0.0.dev1": [ { "comment_text": "", "digests": { "md5": "f542cf3ae1dfc5ad8ec476096ea5f2fb", "sha256": "2f58ccf61ac276e09ac9499784e5fcac8952d3a9ed6eec32ff1824e6f1c867ae" }, "downloads": -1, "filename": "simages-19.0.0.dev1-py3-none-any.whl", 
"has_sig": false, "md5_digest": "f542cf3ae1dfc5ad8ec476096ea5f2fb", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">= 3.6", "size": 12664, "upload_time": "2019-05-28T19:51:43", "url": "https://files.pythonhosted.org/packages/9d/4a/f1405261dee60ac48a85ca9a07347e35c7b364d10aedcbfd5e998ffc606d/simages-19.0.0.dev1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "13dc5ae6db948cbb5df4a05369975aab", "sha256": "08f8e5e8a7052d47e2d6952595191e8a4bda9548a1c771b7c767626e2a0f626f" }, "downloads": -1, "filename": "simages-19.0.0.dev1.tar.gz", "has_sig": false, "md5_digest": "13dc5ae6db948cbb5df4a05369975aab", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3.6", "size": 12894, "upload_time": "2019-05-28T19:51:44", "url": "https://files.pythonhosted.org/packages/45/db/b71b848ada11d69066eb1aa3531648f31c0b3945f23a7e8600425a6e377c/simages-19.0.0.dev1.tar.gz" } ], "19.0.0.dev2": [ { "comment_text": "", "digests": { "md5": "dff8c6614a74b703e41eff8597fc6e97", "sha256": "31b6dd64a7ceba0448dd2f92237e54657e32ce07b6fadf11a1053bcd26748a2a" }, "downloads": -1, "filename": "simages-19.0.0.dev2-py3-none-any.whl", "has_sig": false, "md5_digest": "dff8c6614a74b703e41eff8597fc6e97", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">= 3.6", "size": 13305, "upload_time": "2019-05-29T11:06:43", "url": "https://files.pythonhosted.org/packages/80/e5/2b4579511385a6744820624b204e3db9a6200e5dee24715ee732ced8e89e/simages-19.0.0.dev2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d946e3081eb7efbd7f3105faad4747b3", "sha256": "7611ca1770418a0a46d61f4bf9599df6436372f4d4ba76931069e5425b6d5839" }, "downloads": -1, "filename": "simages-19.0.0.dev2.tar.gz", "has_sig": false, "md5_digest": "d946e3081eb7efbd7f3105faad4747b3", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3.6", "size": 14103, "upload_time": "2019-05-29T11:06:45", "url": 
"https://files.pythonhosted.org/packages/fe/91/7dbbbfd65cc60e9ff0508364bdaab023a2c8f5d728e45b83927499e22094/simages-19.0.0.dev2.tar.gz" } ], "19.0.0.dev3": [ { "comment_text": "", "digests": { "md5": "89a54a5d0ba7e3b663fdbbdda1d1e55b", "sha256": "9ea11d477f7567bcc2ca8f47b4377ec947621f7d3449da70d069e55808c1f696" }, "downloads": -1, "filename": "simages-19.0.0.dev3.tar.gz", "has_sig": false, "md5_digest": "89a54a5d0ba7e3b663fdbbdda1d1e55b", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3.6", "size": 3802592, "upload_time": "2019-05-30T17:22:40", "url": "https://files.pythonhosted.org/packages/ec/2c/0cf42e16829946392fa16e90af848946c3442b261d527062e4d343f2b415/simages-19.0.0.dev3.tar.gz" } ], "19.0.1": [ { "comment_text": "", "digests": { "md5": "d4437a5e2de2b9819209338bf2589d17", "sha256": "a499c80054b16c21d247ede3ff650a1fcb973f9851704e9d698ecff6c7e2cbe2" }, "downloads": -1, "filename": "simages-19.0.1.tar.gz", "has_sig": false, "md5_digest": "d4437a5e2de2b9819209338bf2589d17", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3.6", "size": 3813201, "upload_time": "2019-06-03T11:21:55", "url": "https://files.pythonhosted.org/packages/bf/fa/6b39094eaad7ab134ef097148fe94d8fc5b6a9b336fdc00821f28ac3f2c1/simages-19.0.1.tar.gz" } ], "19.0.2": [ { "comment_text": "", "digests": { "md5": "7762e4fec1ab165a061d390e759333d0", "sha256": "526d04b8d6495965f142f77bce80c11ba30c32289e2d3987efc0a1adf74065ed" }, "downloads": -1, "filename": "simages-19.0.2.tar.gz", "has_sig": false, "md5_digest": "7762e4fec1ab165a061d390e759333d0", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3.6", "size": 3812673, "upload_time": "2019-06-09T17:53:13", "url": "https://files.pythonhosted.org/packages/f3/92/c48244c64cb63e501d74b3c840b2fe484dd22afbc90b6817790aaf4c8c82/simages-19.0.2.tar.gz" } ], "19.0.2.post0": [ { "comment_text": "", "digests": { "md5": "b6078eeed2b3d67a14ce193215597065", "sha256": 
"9c54cde28f4544e7bb96e33cc0594938e4ceaf1ade957654169946889ab75846" }, "downloads": -1, "filename": "simages-19.0.2.post0-py3-none-any.whl", "has_sig": false, "md5_digest": "b6078eeed2b3d67a14ce193215597065", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">= 3.6", "size": 24048, "upload_time": "2019-06-17T14:55:03", "url": "https://files.pythonhosted.org/packages/8d/d7/335a476c6673efd875ee51827609f43f5f4b66780d7774c6c7aa57e17051/simages-19.0.2.post0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "be2c8ea4f5c40fec4a76b7cc5bd6b875", "sha256": "96af3b0b46726045ed34932b58b5b60ce3df2e4c4ea5422fd1623b5168cb63b0" }, "downloads": -1, "filename": "simages-19.0.2.post0.tar.gz", "has_sig": false, "md5_digest": "be2c8ea4f5c40fec4a76b7cc5bd6b875", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3.6", "size": 11689323, "upload_time": "2019-06-17T14:55:15", "url": "https://files.pythonhosted.org/packages/99/c1/859925aa5a3693ca79e9622f1ce7c373eea09c504545c828d5ca6226e5b5/simages-19.0.2.post0.tar.gz" } ], "19.0.2.post1": [ { "comment_text": "", "digests": { "md5": "97df955edc12c9b96001ea4e831afb5d", "sha256": "34da55f1528d02d3539e9084d0367f33e335068a9552cc89c92b49a19f13d73e" }, "downloads": -1, "filename": "simages-19.0.2.post1-py3-none-any.whl", "has_sig": false, "md5_digest": "97df955edc12c9b96001ea4e831afb5d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">= 3.6", "size": 24046, "upload_time": "2019-06-17T14:59:13", "url": "https://files.pythonhosted.org/packages/09/f2/f0da30ba689f46e6a6a6d3a7957818717fddd6b9f6a1cd88b7fdfbae92c1/simages-19.0.2.post1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e4463d3f6482291a0fdf8ca335d1f996", "sha256": "00ebca4b65e77b8b62507b166cba99c6dd51a38a5a0b6251d7fc8b7f2025239b" }, "downloads": -1, "filename": "simages-19.0.2.post1.tar.gz", "has_sig": false, "md5_digest": "e4463d3f6482291a0fdf8ca335d1f996", "packagetype": "sdist", 
"python_version": "source", "requires_python": ">= 3.6", "size": 11689317, "upload_time": "2019-06-17T14:59:27", "url": "https://files.pythonhosted.org/packages/c8/82/9b61c292ef5ff2f2bb53e64abdcdc456a54a291c0152fa8336f86ec88c9f/simages-19.0.2.post1.tar.gz" } ], "19.0.3.dev0": [ { "comment_text": "", "digests": { "md5": "7be1e95e4880be65217c4eff1b341f39", "sha256": "74bd0741c8131a5248dcb68275b750bdbc2b686dfc0e5ba0db0aefb372d9a49f" }, "downloads": -1, "filename": "simages-19.0.3.dev0-py3-none-any.whl", "has_sig": false, "md5_digest": "7be1e95e4880be65217c4eff1b341f39", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">= 3.6", "size": 24011, "upload_time": "2019-06-18T13:13:00", "url": "https://files.pythonhosted.org/packages/bc/91/67c1d79b8789dbe89d43c34797f24d34acf3e8c372663b05bdca7ef761e0/simages-19.0.3.dev0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "17ea0a8dccce56bffc4ba0a07a345e3b", "sha256": "066d44ad38eae844b402ccb83af2926f13ca01ef3ed8f6f905392e9491be6c52" }, "downloads": -1, "filename": "simages-19.0.3.dev0.tar.gz", "has_sig": false, "md5_digest": "17ea0a8dccce56bffc4ba0a07a345e3b", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3.6", "size": 11689127, "upload_time": "2019-06-18T13:13:05", "url": "https://files.pythonhosted.org/packages/b1/f9/e71c903d9fc75baba1f59dcdc98dfc502b98c0e3ed40c72a5cd955a3a682/simages-19.0.3.dev0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "97df955edc12c9b96001ea4e831afb5d", "sha256": "34da55f1528d02d3539e9084d0367f33e335068a9552cc89c92b49a19f13d73e" }, "downloads": -1, "filename": "simages-19.0.2.post1-py3-none-any.whl", "has_sig": false, "md5_digest": "97df955edc12c9b96001ea4e831afb5d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">= 3.6", "size": 24046, "upload_time": "2019-06-17T14:59:13", "url": 
"https://files.pythonhosted.org/packages/09/f2/f0da30ba689f46e6a6a6d3a7957818717fddd6b9f6a1cd88b7fdfbae92c1/simages-19.0.2.post1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e4463d3f6482291a0fdf8ca335d1f996", "sha256": "00ebca4b65e77b8b62507b166cba99c6dd51a38a5a0b6251d7fc8b7f2025239b" }, "downloads": -1, "filename": "simages-19.0.2.post1.tar.gz", "has_sig": false, "md5_digest": "e4463d3f6482291a0fdf8ca335d1f996", "packagetype": "sdist", "python_version": "source", "requires_python": ">= 3.6", "size": 11689317, "upload_time": "2019-06-17T14:59:27", "url": "https://files.pythonhosted.org/packages/c8/82/9b61c292ef5ff2f2bb53e64abdcdc456a54a291c0152fa8336f86ec88c9f/simages-19.0.2.post1.tar.gz" } ] }