{ "info": { "author": "Ole Krause-Sparmann", "author_email": "ole@pixelogik.de", "bugtrack_url": null, "classifiers": [], "description": "# NearPy\n\nNearPy is a Python framework for fast (approximated) nearest neighbour search in high dimensional vector spaces using different locality-sensitive hashing methods.\n\nIt allows to experiment and to evaluate new methods but is also production-ready. It comes with a redis storage adapter.\n\nTo install simply do *pip install NearPy*. It will also install the packages scipy, numpy and redis.\n\nBoth dense and sparse (scipy.sparse) vectors are supported right now.\n\nRead more here: http://nearpy.io\n\n## Principle\n\nTo find approximated nearest neighbours for a query vector, first the vectors to be stored are indexed. For each vector that should be indexed ('stored') a hash is generated,\nthat is a string value. This hash is used as the bucket key\nof the bucket the vector is then stored into. Buckets are in most cases just lists of vectors, it is the terminology used in these applications.\n\nThis would not be of any use for finding nearest neighbours to a query vector. So the 'secret' to this mechanism is to use so called locality sensitive hashes, LSHs.\nThese hashes take spatiality into account so that they tend to generate identical hash values (bucket keys) for close vectors. This makes it then super fast to get close\nvectors given a query vector. Because this is a very rough approach it is called approximated nearest neighbour search, because you might not get the real nearest\nneighbours. In many applications this is fine because you just wanna get 20 vectors that are 'equal enough'.\n\n## Engine\n\nWhen using NearPy you will mostly do this by configuring and using an Engine object. Engines are configurable\npipelines for approximated nearest neighbour search (ANNS) using locality sensitive hashes (LSHs).\n\n![alt text](http://nearpy.io/images/Pipeline.png \"Pipeline diagram\")\n\nEngines are configured using the constructor that accepts the different components along the pipeline:\n\n```python\ndef __init__(self, dim, lshashes=[RandomBinaryProjections('default', 10)],\n distance=CosineDistance(),\n vector_filters=[NearestFilter(10)],\n storage=MemoryStorage()):\n```\n\nThe ANNS pipeline is configured for a fixed dimensionality of the feature space, that is set using the dim parameter of the constructor. This must be an positive integer value.\n\nThe engine can use multiple LSHs and takes them from the lshashes parameter, that must be an array of\nLSHash objects.\n\nDepending on the kind of filters used during querying a distance measure can be specified. This is only\nneeded if you use filters that need a distance (like NearestFilter or DistanceThresholdFilter).\n\nFilters are used in a last step during querying nearest neighbours. Existing implementations are NearestFilter, DistanceThresholdFilter and UniqueFilter.\n\nThe is another parameter to the engine called fetch_vector_filters, which is performed after fetching candidate vectors from the buckets and before distances\nor vector filters are used. This is by default one UniqueFilter and it should always be. I keep this still as a parameter if people need to play around with that. \n\nThe engine supports different kinds of ways how the indexed vectors (and the buckets) are stored. Current\nstorage implementations are MemoryStorage and RedisStorage.\n\nThere are two main methods of the engine:\n\n```python\nstore_vector(self, v, data=None)\nneighbours(self, v)\n```\nstore_vector() hashes vector v with all configured LSHs and stores it in all matching buckets in the storage.\nThe optional data argument must be JSON-serializable. It is stored with the vector and will be returned in search results.\nIt is best practice to use only database ids as attached 'data' and store the actual data in a database.\n\nneighbours() hashes vector v with all configured LSHs, collects all candidate vectors from the matching\nbuckets in storage, applies the (optional) distance function and finally the (optional) filter function\nto construct the returned list of either (vector, data, distance) tuples or (vector, data) tuples.\n\nTo remove indexed vectors and their data from the engine these two methods can be used:\n\n```python\nclean_all_buckets(self)\nclean_buckets(self, hash_name)\n```\n## Hashes\n\nAll LSH implementatiosn in NearPy do subclass nearpy.hashes.LSHash, which has one main method, besides\nconstructor and reset methods.\n\n```python\n hash_vector(self, v)\n```\n\nhash_vector() hashes the specified vector and returns a list of bucket keys with one or more entries.\n\nThe LSH RandomBinaryProjections projects the specified vector on n random\nnormalized vectors in the feature space and returns a string made from zeros and ones. If v lies on\nthe positive side of the n-th normal vector the n-th character in the string is a '1', if v lies\non the negative side of it, the n-th character in the string is a '0'. This way this LSH projects\neach possible vector of the feature space ('input space') into one of many possible buckets.\n\nThe LSH RandomDiscretizedProjections is almost identical to RandomBinaryProjections. The only difference is,\nthat is divides the projection value by a bin width, and using the bin index in each random projection\nas part of the bucket key. Given the same count of random projection vectors as RandomBinaryProjections, this\nresults in more buckets given the same vector set. The density of buckets on the projections can be controlled\nby the bin width, which is part of the constructor.\n\nThe LSH PCABinaryProjections is trained with a training set of vectors specified with the constructor. It\nperforms PCA (principal component analysis) to find the directions of highest variance in the training set.\nIt then uses the first n principal components as projection vectors (or dimensions of the subspace that is\nprojected into). The idea was that this makes it more safe to get a good distribution of the vectors among\nthe buckets. I do not have any tests on this and don't know if this makes sense at all.\n\nThe LSH PCADiscretizedProjections is the pca version of RandomDiscretizedProjections, not using random vectors\nbut the first n principal components of the training set, like PCABinaryProjections does it.\n\n## Hash configurations\n\nTo save your index in Redis and re-use it after the indexing process is done you should persist\nyour hash configurations so that afterwards you can re-create the same engine for more indexing\nor querying.\n\nThis is done like this:\n\n```python\n# Create redis storage adapter\nredis_object = Redis(host='localhost', port=6379, db=0)\nredis_storage = RedisStorage(redis_object)\n\n# Get hash config from redis\nconfig = redis_storage.load_hash_configuration('MyHash')\n\nif config is None:\n # Config is not existing, create hash from scratch, with 10 projections\n lshash = RandomBinaryProjections('MyHash', 10)\nelse:\n # Config is existing, create hash with None parameters\n lshash = RandomBinaryProjections(None, None)\n # Apply configuration loaded from redis\n lshash.apply_config(config)\n\n# Create engine for feature space of 100 dimensions and use our hash.\n# This will set the dimension of the lshash only the first time, not when\n# using the configuration loaded from redis. Use redis storage to store\n# buckets.\nengine = Engine(100, lshashes=[lshash], storage=redis_storage)\n\n# Do some stuff like indexing or querying with the engine...\n\n# Finally store hash configuration in redis for later use\nredis_storage.store_hash_configuration(lshash)\n```\n===========\n\nExample usage:\n\n```python\nfrom nearpy import Engine\nfrom nearpy.hashes import RandomBinaryProjections\n\n# Dimension of our vector space\ndimension = 500\n\n# Create a random binary hash with 10 bits\nrbp = RandomBinaryProjections('rbp', 10)\n\n# Create engine with pipeline configuration\nengine = Engine(dimension, lshashes=[rbp])\n\n# Index 1000000 random vectors (set their data to a unique string)\nfor index in range(100000):\n v = numpy.random.randn(dimension)\n engine.store_vector(v, 'data_%d' % index)\n\n# Create random query vector\nquery = numpy.random.randn(dimension)\n\n# Get nearest neighbours\nN = engine.neighbours(query)\n```\n\n===========\n\n\n\n\n\n\n\n\n\n\n\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/pixelogik/NearPy", "keywords": "nearpy approximate nearest neighbour", "license": "LICENSE.txt", "maintainer": "", "maintainer_email": "", "name": "NearPy", "package_url": "https://pypi.org/project/NearPy/", "platform": "", "project_url": "https://pypi.org/project/NearPy/", "project_urls": { "Homepage": "https://github.com/pixelogik/NearPy" }, "release_url": "https://pypi.org/project/NearPy/1.0.0/", "requires_dist": [ "bitarray", "future", "numpy", "scipy" ], "requires_python": "", "summary": "Framework for fast approximated nearest neighbour search.", "version": "1.0.0" }, "last_serial": 2366274, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "02119c8d83844996150e93fd9b13c5ad", "sha256": "5f7b6aea1250648663d28e2b73e6bea10fdace96510c3ee07f324beb7fdd87cc" }, "downloads": -1, "filename": "NearPy-0.1.tar.gz", "has_sig": false, "md5_digest": "02119c8d83844996150e93fd9b13c5ad", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3108, "upload_time": "2013-05-02T13:11:57", "url": "https://files.pythonhosted.org/packages/d1/af/3d0dbd877d720b4a7dbc4b7a738637d100478aaa4106c54d5c46c285df46/NearPy-0.1.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "34430d27a8682d981f850fa1dff88be9", "sha256": "fa7a2202071de27816d9d2d3bc0b77770230d526a4ec8c81a79fe9f7018b05fe" }, "downloads": -1, "filename": "NearPy-0.1.1.tar.gz", "has_sig": false, "md5_digest": "34430d27a8682d981f850fa1dff88be9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3112, "upload_time": "2013-05-02T13:17:42", "url": "https://files.pythonhosted.org/packages/12/15/70bd012ba829a67a2eda899bb51b3eea3de22c9e62a4f9c392ef840a058e/NearPy-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "b4cc5ee01f4db474b2217badbfdecb4f", "sha256": "0182db12904a3463218b6d24fd8fe3aa25eaffb9278755486c94ded4cce4edae" }, "downloads": -1, "filename": "NearPy-0.1.2.tar.gz", "has_sig": false, "md5_digest": "b4cc5ee01f4db474b2217badbfdecb4f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10741, "upload_time": "2013-05-02T13:20:33", "url": "https://files.pythonhosted.org/packages/12/ba/e3b6862bf4792516e4f0d884d50207e81efb941ca42632b748e02a80df2e/NearPy-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "d0f12ba211467861c374295b44232e2e", "sha256": "ff82660513bda9f6c8d29b91966f9564d4cffa4e28722f4c6c08b92346463133" }, "downloads": -1, "filename": "NearPy-0.1.3.tar.gz", "has_sig": false, "md5_digest": "d0f12ba211467861c374295b44232e2e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21514, "upload_time": "2015-06-04T09:33:22", "url": "https://files.pythonhosted.org/packages/4f/96/99a8fdf0c9dc1452536080c5608c70750fbefe610f4a5cb86dabaeae2ba6/NearPy-0.1.3.tar.gz" } ], "0.2": [ { "comment_text": "", "digests": { "md5": "d4726603a61f2ccbf90872931adfde42", "sha256": "24373db80e0341b94ecef3d8d81099043596d35e7ad823d0ffe48c099894120d" }, "downloads": -1, "filename": "NearPy-0.2.tar.gz", "has_sig": false, "md5_digest": "d4726603a61f2ccbf90872931adfde42", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21638, "upload_time": "2015-02-15T14:27:14", "url": "https://files.pythonhosted.org/packages/d4/71/6a25b8f794f021c66ec681ab9ba0193b53739125b5f7f9e968942671612a/NearPy-0.2.tar.gz" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "948023a8970352617d85a4a9e5341eb6", "sha256": "bd9087f327bcc647498649fb600f559946eb381cba9013770ce35e807c705d4c" }, "downloads": -1, "filename": "NearPy-0.2.1.tar.gz", "has_sig": false, "md5_digest": "948023a8970352617d85a4a9e5341eb6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21509, "upload_time": "2015-06-04T09:34:25", "url": "https://files.pythonhosted.org/packages/98/c0/a99fbaac19edd46e23186fede337a785dd78464b6e61dd2d193932053a16/NearPy-0.2.1.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "6ed07af18e92de9368ddc161ff142d7e", "sha256": "9f2e877458c1e13f327b2d3177f863af43fc63332fe3fee3a88192792efc7947" }, "downloads": -1, "filename": "NearPy-0.2.2.tar.gz", "has_sig": false, "md5_digest": "6ed07af18e92de9368ddc161ff142d7e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21506, "upload_time": "2015-06-05T18:28:19", "url": "https://files.pythonhosted.org/packages/a9/c5/9745fea7ddbb051ed642d5d2e29e988abcccafdf04abfc014f90bcdb1628/NearPy-0.2.2.tar.gz" } ], "1.0.0": [ { "comment_text": "", "digests": { "md5": "d434993d961364e058990dc72e5ae361", "sha256": "0134b2e6569b8b141973592ef6e3d09f2d186531e0fe3a3d1a83df893df1b425" }, "downloads": -1, "filename": "NearPy-1.0.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "d434993d961364e058990dc72e5ae361", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 64411, "upload_time": "2016-09-27T13:03:12", "url": "https://files.pythonhosted.org/packages/cc/74/4680e44c04c23ca52ab458746ab725a6e3e7fa6f6c18c4a578847a725cbd/NearPy-1.0.0-py2.py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "d434993d961364e058990dc72e5ae361", "sha256": "0134b2e6569b8b141973592ef6e3d09f2d186531e0fe3a3d1a83df893df1b425" }, "downloads": -1, "filename": "NearPy-1.0.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "d434993d961364e058990dc72e5ae361", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 64411, "upload_time": "2016-09-27T13:03:12", "url": "https://files.pythonhosted.org/packages/cc/74/4680e44c04c23ca52ab458746ab725a6e3e7fa6f6c18c4a578847a725cbd/NearPy-1.0.0-py2.py3-none-any.whl" } ] }