{ "info": { "author": "Dom Hudson", "author_email": "dom.hudson@thoughtriver.com", "bugtrack_url": null, "classifiers": [], "description": "![tr_logo_cmyk_tr_logo_cmyk](https://user-images.githubusercontent.com/10864294/29792093-382146cc-8c37-11e7-9e70-6f71b3d0800b.png)\n\n# LMDB Embeddings\nQuery word vectors (embeddings) quickly, with very little query-time overhead and far less memory usage than gensim or other equivalent solutions. This is made possible by [Lightning Memory-Mapped Database](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database).\n\nInspired by [Delft](https://github.com/kermitt2/delft). As explained in their readme, this approach permits us to have the pre-trained embeddings immediately \"warm\" (no load time), to free memory and to use any number of embeddings simultaneously with a negligible impact on runtime when using an SSD.\n\nFor instance, in a traditional approach `glove-840B` takes around 2 minutes to load and 4GB in memory. Managed with LMDB, `glove-840B` can be accessed immediately and takes only a couple of MB in memory, for a negligible impact on runtime (around 1% slower).\n\n## Reading vectors\n\n```python\nfrom lmdb_embeddings.reader import LmdbEmbeddingsReader\nfrom lmdb_embeddings.exceptions import MissingWordError\n\nembeddings = LmdbEmbeddingsReader('/path/to/word/vectors/eg/GoogleNews-vectors-negative300')\n\ntry:\n    vector = embeddings.get_word_vector('google')\nexcept MissingWordError:\n    # 'google' is not in the database.\n    pass\n```\n\n## Writing vectors\nThe example below writes an LMDB vector file from a gensim model. 
Any iterator that yields word and vector pairs is supported, so if your vectors are in another format you only need to adapt the `iter_embeddings` function below.\n\nI will be writing a CLI to convert standard formats soon.\n\n```python\nfrom gensim.models.keyedvectors import KeyedVectors\nfrom lmdb_embeddings.writer import LmdbEmbeddingsWriter\n\n\nGOOGLE_NEWS_PATH = 'GoogleNews-vectors-negative300.bin.gz'\nOUTPUT_DATABASE_FOLDER = 'GoogleNews-vectors-negative300'\n\n\nprint('Loading gensim model...')\ngensim_model = KeyedVectors.load_word2vec_format(GOOGLE_NEWS_PATH, binary = True)\n\n\ndef iter_embeddings():\n    # Yield (word, vector) pairs for every word in the model's vocabulary.\n    # Note: gensim >= 4.0 replaced `vocab` with `key_to_index`.\n    for word in gensim_model.vocab.keys():\n        yield word, gensim_model[word]\n\nprint('Writing vectors to an LMDB database...')\n\nwriter = LmdbEmbeddingsWriter(\n    iter_embeddings()\n).write(OUTPUT_DATABASE_FOLDER)\n\n# These vectors can now be loaded with the LmdbEmbeddingsReader.\n```\n\n## Customisation\nBy default, LMDB Embeddings uses pickle to serialize the vectors to bytes (optimized and pickled with the highest available protocol). 
However, it is very easy to use an alternative approach: simply inject the serializer and unserializer as callables into the `LmdbEmbeddingsWriter` and `LmdbEmbeddingsReader`.\n\nA [msgpack](https://msgpack.org/index.html) serializer is included and can be used in the same way.\n\n```python\nfrom lmdb_embeddings.writer import LmdbEmbeddingsWriter\nfrom lmdb_embeddings.serializers import MsgpackSerializer\n\nwriter = LmdbEmbeddingsWriter(\n    iter_embeddings(),\n    serializer = MsgpackSerializer.serialize\n).write(OUTPUT_DATABASE_FOLDER)\n```\n\n```python\nfrom lmdb_embeddings.reader import LmdbEmbeddingsReader\nfrom lmdb_embeddings.serializers import MsgpackSerializer\n\nreader = LmdbEmbeddingsReader(\n    OUTPUT_DATABASE_FOLDER,\n    unserializer = MsgpackSerializer.unserialize\n)\n```\n\n## Running tests\n```\npytest\n```\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://www.thoughtriver.com", "keywords": "", "license": "GNU General Public License v3.0", "maintainer": "", "maintainer_email": "", "name": "lmdb-embeddings", "package_url": "https://pypi.org/project/lmdb-embeddings/", "platform": "", "project_url": "https://pypi.org/project/lmdb-embeddings/", "project_urls": { "Homepage": "https://www.thoughtriver.com" }, "release_url": "https://pypi.org/project/lmdb-embeddings/0.2.1/", "requires_dist": [ "lmdb", "msgpack", "msgpack-numpy", "numpy", "pytest", "pytest-cov" ], "requires_python": "", "summary": "Fast querying of word embeddings using the LMDB \"Lightning\" Database.", "version": "0.2.1" }, "last_serial": 4398096, "releases": { "0.2.1": [ { "comment_text": "", "digests": { "md5": "9c1e2374f19bceccf64126b1513cc03b", "sha256": "cc600ed3a65d392869402739e00380cc2dfd52c79d37da86fd45120f93b03519" }, "downloads": -1, "filename": "lmdb_embeddings-0.2.1-py3-none-any.whl", "has_sig": false, "md5_digest": "9c1e2374f19bceccf64126b1513cc03b", 
"packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 22289, "upload_time": "2018-10-20T22:00:33", "url": "https://files.pythonhosted.org/packages/6e/77/48747e7f68aa4b1bfb8b4027a2b7e83070090750b72075cd4ba80f28f4d0/lmdb_embeddings-0.2.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "eff2df4559d17e56a7bb28bf077626eb", "sha256": "de8283a6e61a9b5f18bd83112dac57edf40e913dd7bd9e011da94d223b8b002e" }, "downloads": -1, "filename": "lmdb_embeddings-0.2.1.tar.gz", "has_sig": false, "md5_digest": "eff2df4559d17e56a7bb28bf077626eb", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5150, "upload_time": "2018-10-20T22:00:34", "url": "https://files.pythonhosted.org/packages/fa/ce/2176cca225f7553818807bc67a8c61ee32d4721377ec42de23a9d64ec1cf/lmdb_embeddings-0.2.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "9c1e2374f19bceccf64126b1513cc03b", "sha256": "cc600ed3a65d392869402739e00380cc2dfd52c79d37da86fd45120f93b03519" }, "downloads": -1, "filename": "lmdb_embeddings-0.2.1-py3-none-any.whl", "has_sig": false, "md5_digest": "9c1e2374f19bceccf64126b1513cc03b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 22289, "upload_time": "2018-10-20T22:00:33", "url": "https://files.pythonhosted.org/packages/6e/77/48747e7f68aa4b1bfb8b4027a2b7e83070090750b72075cd4ba80f28f4d0/lmdb_embeddings-0.2.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "eff2df4559d17e56a7bb28bf077626eb", "sha256": "de8283a6e61a9b5f18bd83112dac57edf40e913dd7bd9e011da94d223b8b002e" }, "downloads": -1, "filename": "lmdb_embeddings-0.2.1.tar.gz", "has_sig": false, "md5_digest": "eff2df4559d17e56a7bb28bf077626eb", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5150, "upload_time": "2018-10-20T22:00:34", "url": 
"https://files.pythonhosted.org/packages/fa/ce/2176cca225f7553818807bc67a8c61ee32d4721377ec42de23a9d64ec1cf/lmdb_embeddings-0.2.1.tar.gz" } ] }