{
    "info": {
        "author": "Dom Hudson",
        "author_email": "dom.hudson@thoughtriver.com",
        "bugtrack_url": null,
        "classifiers": [],
        "description": "![tr_logo_cmyk_tr_logo_cmyk](https://user-images.githubusercontent.com/10864294/29792093-382146cc-8c37-11e7-9e70-6f71b3d0800b.png)\n\n# LMDB Embeddings\nQuery word vectors (embeddings) very quickly with very little querying time overhead and far less memory usage than gensim or other equivalent solutions. This is made possible by [Lightning Memory-Mapped Database](https://en.wikipedia.org/wiki/Lightning_Memory-Mapped_Database).\n\nInspired by [Delft](https://github.com/kermitt2/delft). As explained in their readme, this approach permits us to have the pre-trained embeddings immediately \"warm\" (no load time), to free memory and to use any number of embeddings similtaneously with a very negligible impact on runtime when using SSD.\n\nFor instance, in a traditional approach `glove-840B` takes around 2 minutes to load and 4GB in memory. Managed with LMDB, `glove-840B` can be accessed immediately and takes only a couple MB in memory, for a negligible impact on runtime (around 1% slower).\n\n## Reading vectors\n\n```python\nfrom lmdb_embeddings.reader import LmdbEmbeddingsReader\nfrom lmdb_embeddings.exceptions import MissingWordError\n\nembeddings = LmdbEmbeddingsReader('/path/to/word/vectors/eg/GoogleNews-vectors-negative300')\n\ntry:\n  vector = embeddings.get_word_vector('google')\nexcept MissingWordError:\n  # 'google' is not in the database.\n  pass\n```\n\n## Writing vectors\nAn example to write an LMDB vector file from a gensim model. 
Any iterator that yields word and vector pairs is supported, so if your vectors are in another format you only need to adapt the `iter_embeddings` function below.\n\nI will be writing a CLI to convert standard formats soon.\n\n```python\nfrom gensim.models.keyedvectors import KeyedVectors\nfrom lmdb_embeddings.writer import LmdbEmbeddingsWriter\n\n\nGOOGLE_NEWS_PATH = 'GoogleNews-vectors-negative300.bin.gz'\nOUTPUT_DATABASE_FOLDER = 'GoogleNews-vectors-negative300'\n\n\nprint('Loading gensim model...')\ngensim_model = KeyedVectors.load_word2vec_format(GOOGLE_NEWS_PATH, binary=True)\n\n\ndef iter_embeddings():\n    # Yield (word, vector) pairs for every word in the model's vocabulary.\n    # Note: `vocab` is the gensim < 4.0 attribute; gensim >= 4.0 uses `key_to_index`.\n    for word in gensim_model.vocab.keys():\n        yield word, gensim_model[word]\n\n\nprint('Writing vectors to an LMDB database...')\n\nwriter = LmdbEmbeddingsWriter(\n    iter_embeddings()\n).write(OUTPUT_DATABASE_FOLDER)\n\n# These vectors can now be loaded with the LmdbEmbeddingsReader.\n```\n\n## Customisation\nBy default, LMDB Embeddings uses pickle to serialize the vectors to bytes (optimized and pickled with the highest available protocol). However, it is easy to use an alternative approach: inject the serializer and unserializer as callables into the `LmdbEmbeddingsWriter` and `LmdbEmbeddingsReader`.\n\nA [msgpack](https://msgpack.org/index.html) serializer is included and can be used in the same way.\n\n```python\nfrom lmdb_embeddings.writer import LmdbEmbeddingsWriter\nfrom lmdb_embeddings.serializers import MsgpackSerializer\n\nwriter = LmdbEmbeddingsWriter(\n    iter_embeddings(),\n    serializer=MsgpackSerializer.serialize\n).write(OUTPUT_DATABASE_FOLDER)\n```\n\n```python\nfrom lmdb_embeddings.reader import LmdbEmbeddingsReader\nfrom lmdb_embeddings.serializers import MsgpackSerializer\n\nreader = LmdbEmbeddingsReader(\n    OUTPUT_DATABASE_FOLDER,\n    unserializer=MsgpackSerializer.unserialize\n)\n```\n\n## Running tests\n```\npytest\n```\n",
        "description_content_type": "text/markdown",
        "docs_url": null,
        "download_url": "",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://www.thoughtriver.com",
        "keywords": "",
        "license": "GNU General Public License v3.0",
        "maintainer": "",
        "maintainer_email": "",
        "name": "lmdb-embeddings",
        "package_url": "https://pypi.org/project/lmdb-embeddings/",
        "platform": "",
        "project_url": "https://pypi.org/project/lmdb-embeddings/",
        "project_urls": {
            "Homepage": "https://www.thoughtriver.com"
        },
        "release_url": "https://pypi.org/project/lmdb-embeddings/0.2.1/",
        "requires_dist": [
            "lmdb",
            "msgpack",
            "msgpack-numpy",
            "numpy",
            "pytest",
            "pytest-cov"
        ],
        "requires_python": "",
        "summary": "Fast querying of word embeddings using the LMDB \"Lightning\" Database.",
        "version": "0.2.1"
    },
    "last_serial": 4398096,
    "releases": {
        "0.2.1": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "9c1e2374f19bceccf64126b1513cc03b",
                    "sha256": "cc600ed3a65d392869402739e00380cc2dfd52c79d37da86fd45120f93b03519"
                },
                "downloads": -1,
                "filename": "lmdb_embeddings-0.2.1-py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "9c1e2374f19bceccf64126b1513cc03b",
                "packagetype": "bdist_wheel",
                "python_version": "py3",
                "requires_python": null,
                "size": 22289,
                "upload_time": "2018-10-20T22:00:33",
                "url": "https://files.pythonhosted.org/packages/6e/77/48747e7f68aa4b1bfb8b4027a2b7e83070090750b72075cd4ba80f28f4d0/lmdb_embeddings-0.2.1-py3-none-any.whl"
            },
            {
                "comment_text": "",
                "digests": {
                    "md5": "eff2df4559d17e56a7bb28bf077626eb",
                    "sha256": "de8283a6e61a9b5f18bd83112dac57edf40e913dd7bd9e011da94d223b8b002e"
                },
                "downloads": -1,
                "filename": "lmdb_embeddings-0.2.1.tar.gz",
                "has_sig": false,
                "md5_digest": "eff2df4559d17e56a7bb28bf077626eb",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 5150,
                "upload_time": "2018-10-20T22:00:34",
                "url": "https://files.pythonhosted.org/packages/fa/ce/2176cca225f7553818807bc67a8c61ee32d4721377ec42de23a9d64ec1cf/lmdb_embeddings-0.2.1.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "9c1e2374f19bceccf64126b1513cc03b",
                "sha256": "cc600ed3a65d392869402739e00380cc2dfd52c79d37da86fd45120f93b03519"
            },
            "downloads": -1,
            "filename": "lmdb_embeddings-0.2.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "9c1e2374f19bceccf64126b1513cc03b",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 22289,
            "upload_time": "2018-10-20T22:00:33",
            "url": "https://files.pythonhosted.org/packages/6e/77/48747e7f68aa4b1bfb8b4027a2b7e83070090750b72075cd4ba80f28f4d0/lmdb_embeddings-0.2.1-py3-none-any.whl"
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "eff2df4559d17e56a7bb28bf077626eb",
                "sha256": "de8283a6e61a9b5f18bd83112dac57edf40e913dd7bd9e011da94d223b8b002e"
            },
            "downloads": -1,
            "filename": "lmdb_embeddings-0.2.1.tar.gz",
            "has_sig": false,
            "md5_digest": "eff2df4559d17e56a7bb28bf077626eb",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 5150,
            "upload_time": "2018-10-20T22:00:34",
            "url": "https://files.pythonhosted.org/packages/fa/ce/2176cca225f7553818807bc67a8c61ee32d4721377ec42de23a9d64ec1cf/lmdb_embeddings-0.2.1.tar.gz"
        }
    ]
}