{ "info": { "author": "yannvgn", "author_email": "hi@yannvgn.io", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: BSD License", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "# LASER embeddings\n\n[![Travis (.org) branch](https://img.shields.io/travis/yannvgn/laserembeddings/master?style=flat-square)](https://travis-ci.org/yannvgn/laserembeddings)\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/laserembeddings?style=flat-square)\n[![PyPI](https://img.shields.io/pypi/v/laserembeddings.svg?style=flat-square)](https://pypi.org/project/laserembeddings/)\n[![PyPI - License](https://img.shields.io/pypi/l/laserembeddings.svg?style=flat-square)](https://github.com/yannvgn/laserembeddings/blob/master/LICENSE)\n\nlaserembeddings is a pip-packaged, production-ready port of Facebook Research's [LASER](https://github.com/facebookresearch/LASER) (Language-Agnostic SEntence Representations) to compute multilingual sentence embeddings.\n\n\ud83c\udf81 **Version 0.1.3 is out. What's new?**\n- A lot of languages that were only partially supported are now fully supported (br, bs, ceb, fr, gl, oc, ug, vi) \ud83c\udf0d\n\n## Context\n\n[LASER](https://github.com/facebookresearch/LASER) is a collection of scripts and models created by Facebook Research to compute **multilingual sentence embeddings** for zero-shot cross-lingual transfer. \n\nWhat does it mean? LASER is able to transform sentences into **language-independent vectors**. 
Similar sentences get mapped to close vectors (in terms of cosine distance), regardless of the input language.\n\nThat is great, especially if you don't have training sets for the language(s) you want to process: you can build a classifier on top of LASER embeddings, train it on whatever language(s) you have in your training data, and let it classify texts in any language.\n\n**The aim of the package is to make LASER as easy-to-use and easy-to-deploy as possible: zero-config, production-ready, etc., just a two-liner to install.**\n\n\ud83d\udc49 \ud83d\udc49 \ud83d\udc49 For detailed information, have a look at the amazing [LASER repository](https://github.com/facebookresearch/LASER), read its [presentation article](https://code.fb.com/ai-research/laser-multilingual-sentence-embeddings/) and its [research paper](https://arxiv.org/abs/1812.10464). \ud83d\udc48 \ud83d\udc48 \ud83d\udc48\n\n## Getting started\n\nYou'll need Python 3.6 or higher.\n\n### Installation\n\n```\npip install laserembeddings\n```\n\n### Downloading the pre-trained models\n\n```\npython -m laserembeddings download-models\n```\n\nThis will download the models to the default `data` directory next to the source code of the package. 
Use `python -m laserembeddings download-models path/to/model/directory` to download the models to a specific location.\n\n### Usage\n\n```python\nfrom laserembeddings import Laser\n\nlaser = Laser()\n\nembeddings = laser.embed_sentences(\n ['let your neural network be polyglot',\n 'use multilingual embeddings!'],\n lang='en') # lang is used for tokenization\n\n# embeddings is a N*1024 (N = number of sentences) NumPy array\n```\n\nIf you downloaded the models into a specific directory:\n\n```python\nfrom laserembeddings import Laser\n\npath_to_bpe_codes = ...\npath_to_bpe_vocab = ...\npath_to_encoder = ...\n\nlaser = Laser(path_to_bpe_codes, path_to_bpe_vocab, path_to_encoder)\n\n# you can also supply file objects instead of file paths\n```\n\nIf you want to pull the models from S3:\n\n```python\nfrom io import BytesIO, StringIO\nfrom laserembeddings import Laser\nimport boto3\n\ns3 = boto3.resource('s3')\nMODELS_BUCKET = ...\n\nf_bpe_codes = StringIO(s3.Object(MODELS_BUCKET, 'path_to_bpe_codes.fcodes').get()['Body'].read().decode('utf-8'))\nf_bpe_vocab = StringIO(s3.Object(MODELS_BUCKET, 'path_to_bpe_vocabulary.fvocab').get()['Body'].read().decode('utf-8'))\nf_encoder = BytesIO(s3.Object(MODELS_BUCKET, 'path_to_encoder.pt').get()['Body'].read())\n\nlaser = Laser(f_bpe_codes, f_bpe_vocab, f_encoder)\n```\n\n## What are the differences with the original implementation?\n\nSome dependencies of the original project have been replaced with pure-python dependencies, to make this package easy to install and deploy.\n\nHere's a summary of the differences:\n\n| Part of the pipeline | LASER dependency (original project) | laserembeddings dependency (this package) | Reason |\n|----------------------|-------------------------------------|----------------------------------------|--------|\n| Normalization / tokenization | [Moses](https://github.com/moses-smt/mosesdecoder) | [Sacremoses](https://github.com/alvations/sacremoses) | Moses is implemented in Perl |\n| BPE encoding | 
[fastBPE](https://github.com/glample/fastBPE) | [subword-nmt](https://github.com/rsennrich/subword-nmt) | fastBPE cannot be installed via pip and requires compiling C++ code |\n\nThe following features have not been implemented yet:\n- romanize, needed to process Greek (el)\n- Chinese text segmentation, needed to process Chinese (zh, cmn, wuu and yue)\n- Japanese text segmentation, needed to process Japanese (ja, jpn)\n\n## Will I get the exact same embeddings?\n\n**For most languages, in most cases, yes.**\n\nSome slight (and not so slight \ud83d\ude44) differences exist for some languages due to differences in the implementation of the tokenizer.\n\n**[An exhaustive comparison of the embeddings generated with LASER and laserembeddings](tests/report/comparison-with-LASER.md) is automatically generated and will be updated for each new release.**\n\n## FAQ\n\n**How can I train the encoder?**\n\nYou can't. LASER models are pre-trained and do not need to be fine-tuned. The embeddings are generic and perform well without fine-tuning. See https://github.com/facebookresearch/LASER/issues/3#issuecomment-404175463.\n\n## Credits\n\nThanks a lot to the creators of [LASER](https://github.com/facebookresearch/LASER) for open-sourcing the code of LASER and releasing the pre-trained models. All the kudos should go to them \ud83d\udc4f.\n\nA big thanks to the creators of [Sacremoses](https://github.com/alvations/sacremoses) and [Subword Neural Machine Translation](https://github.com/rsennrich/subword-nmt/) for their great packages.\n\n## Testing\n\nFirst, you'll need to check out this repository and install it (in a virtual environment if you want). 
Also make sure to have [Poetry](https://github.com/sdispater/poetry) installed.\n\n```\npoetry install\n```\n\nThen, to run the tests:\n\n```\npoetry run pytest\n```\n\n### Testing the similarity between the embeddings computed with LASER and laserembeddings\n\nFirst, download the test data.\n\n```\npython -m laserembeddings download-test-data\n```\n\n\ud83d\udc49 If you want to know more about the contents and the generation of the test data, check out the [laserembeddings-test-data](https://github.com/yannvgn/laserembeddings-test-data) repository.\n\nThen, run the test with `SIMILARITY_TEST` env. variable set to `1`.\n\n```\nSIMILARITY_TEST=1 poetry run pytest tests/test_laser.py\n```\n\nNow, have a coffee \u2615\ufe0f and wait for the test to finish.\n\nThe similarity report will be generated here: [tests/report/comparison-with-LASER.md](tests/report/comparison-with-LASER.md).\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/yannvgn/laserembeddings", "keywords": "", "license": "BSD-3-Clause", "maintainer": "yannvgn", "maintainer_email": "hi@yannvgn.io", "name": "laserembeddings", "package_url": "https://pypi.org/project/laserembeddings/", "platform": "", "project_url": "https://pypi.org/project/laserembeddings/", "project_urls": { "Homepage": "https://github.com/yannvgn/laserembeddings", "Repository": "https://github.com/yannvgn/laserembeddings" }, "release_url": "https://pypi.org/project/laserembeddings/0.1.3/", "requires_dist": [ "torch (>=1.0.1.post2,<2.0.0)", "subword-nmt (>=0.3.6,<0.4.0)", "numpy (>=1.15.4,<2.0.0)", "sacremoses (==0.0.35)" ], "requires_python": ">=3.6,<4.0", "summary": "Production-ready LASER multilingual embeddings", "version": "0.1.3" }, "last_serial": 5922111, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "194301dc16960611e7c537d7eee67632", "sha256": 
"1db711a56f59db7e992d16347ffe7c73f96dd3b175a6f851a856a0b342ff8a55" }, "downloads": -1, "filename": "laserembeddings-0.1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "194301dc16960611e7c537d7eee67632", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6,<4.0", "size": 10772, "upload_time": "2019-07-23T18:25:18", "url": "https://files.pythonhosted.org/packages/c5/f6/f8a6e47c0ade1e082c9bca87d2a66fd3afe3dbf73b1e40c7c6ef950f5b5d/laserembeddings-0.1.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9fe63132442a90119bc1f44376061a98", "sha256": "3ac60f3234c81acfe20e0ee657bb8377a053d0d1033ba1b27e9fd0f295228480" }, "downloads": -1, "filename": "laserembeddings-0.1.0.tar.gz", "has_sig": false, "md5_digest": "9fe63132442a90119bc1f44376061a98", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6,<4.0", "size": 8589, "upload_time": "2019-07-23T18:25:20", "url": "https://files.pythonhosted.org/packages/a8/1a/a21af873a34383f1425e3aad967d3864dc9c36516966e6a8ce31d345ebe9/laserembeddings-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "14fa1f155083c5d6194da4df2e6bd8c1", "sha256": "f8da60b17b99281b7799f0731506891b0f94cab3152d3f620da394b5f8638775" }, "downloads": -1, "filename": "laserembeddings-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "14fa1f155083c5d6194da4df2e6bd8c1", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6,<4.0", "size": 13153, "upload_time": "2019-07-23T20:01:09", "url": "https://files.pythonhosted.org/packages/27/dd/b15b821768fac193c3b319a646ffdfa755213981411901c2d352e5705817/laserembeddings-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "810971f207c52bf72d7829d1b7757f7b", "sha256": "99a8c0cc491d77eca04085aae12d87707594106c80678e137580b64db2514c3c" }, "downloads": -1, "filename": "laserembeddings-0.1.1.tar.gz", "has_sig": false, "md5_digest": "810971f207c52bf72d7829d1b7757f7b", "packagetype": "sdist", 
"python_version": "source", "requires_python": ">=3.6,<4.0", "size": 13540, "upload_time": "2019-07-23T20:01:11", "url": "https://files.pythonhosted.org/packages/ce/27/621df2bac5567c8db18fdef9ab418d0d0d741b71571c3f78ec0811671672/laserembeddings-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "bee7ebf8029edd9518a125ccaba3c209", "sha256": "135b32fe74a52a885907ec69b2705c86c551e23295ba416b148f9e1ab3f53f40" }, "downloads": -1, "filename": "laserembeddings-0.1.2-py3-none-any.whl", "has_sig": false, "md5_digest": "bee7ebf8029edd9518a125ccaba3c209", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6,<4.0", "size": 13875, "upload_time": "2019-08-24T10:56:31", "url": "https://files.pythonhosted.org/packages/71/fa/2038e0c037e0da5f6b785b9a59b0d0bae897acaad0515278b6109ea9ce46/laserembeddings-0.1.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "eb2cc770cede4ffdc61c1fe29e36c15a", "sha256": "8adf92b2e06ca8c2a5b42ddb7f1cc2313bba39fc31bce5bf74b0c58cefd1c0ab" }, "downloads": -1, "filename": "laserembeddings-0.1.2.tar.gz", "has_sig": false, "md5_digest": "eb2cc770cede4ffdc61c1fe29e36c15a", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6,<4.0", "size": 14533, "upload_time": "2019-08-24T10:56:33", "url": "https://files.pythonhosted.org/packages/c1/60/aa9a39ec95d4332ea7b62eadbae2c4e53141e5851c493b50f979ccf74f46/laserembeddings-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "3d08b7361e4195b2a2c1e0b4be249897", "sha256": "fecce40583d2591a0e4fd8afb7437c8facdd1cda68207cb96032f4f581135d08" }, "downloads": -1, "filename": "laserembeddings-0.1.3-py3-none-any.whl", "has_sig": false, "md5_digest": "3d08b7361e4195b2a2c1e0b4be249897", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6,<4.0", "size": 13846, "upload_time": "2019-10-03T07:19:13", "url": 
"https://files.pythonhosted.org/packages/54/d3/af9dbc6a29b4e48d9c53961ac328e96be726d0deeceb642847f235f5f0ba/laserembeddings-0.1.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c3bae9e02afaa7036d58f39963b0da1d", "sha256": "3de1868b3be52df8007dc9f67cfee398dc470d4f26446439649b85fe7d706a0d" }, "downloads": -1, "filename": "laserembeddings-0.1.3.tar.gz", "has_sig": false, "md5_digest": "c3bae9e02afaa7036d58f39963b0da1d", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6,<4.0", "size": 14501, "upload_time": "2019-10-03T07:19:15", "url": "https://files.pythonhosted.org/packages/48/ae/af5c8f4e03329f37a8638f7c65b5d1bdd4b0358d6a2f2baf469aba532ba7/laserembeddings-0.1.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "3d08b7361e4195b2a2c1e0b4be249897", "sha256": "fecce40583d2591a0e4fd8afb7437c8facdd1cda68207cb96032f4f581135d08" }, "downloads": -1, "filename": "laserembeddings-0.1.3-py3-none-any.whl", "has_sig": false, "md5_digest": "3d08b7361e4195b2a2c1e0b4be249897", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6,<4.0", "size": 13846, "upload_time": "2019-10-03T07:19:13", "url": "https://files.pythonhosted.org/packages/54/d3/af9dbc6a29b4e48d9c53961ac328e96be726d0deeceb642847f235f5f0ba/laserembeddings-0.1.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c3bae9e02afaa7036d58f39963b0da1d", "sha256": "3de1868b3be52df8007dc9f67cfee398dc470d4f26446439649b85fe7d706a0d" }, "downloads": -1, "filename": "laserembeddings-0.1.3.tar.gz", "has_sig": false, "md5_digest": "c3bae9e02afaa7036d58f39963b0da1d", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6,<4.0", "size": 14501, "upload_time": "2019-10-03T07:19:15", "url": "https://files.pythonhosted.org/packages/48/ae/af5c8f4e03329f37a8638f7c65b5d1bdd4b0358d6a2f2baf469aba532ba7/laserembeddings-0.1.3.tar.gz" } ] }