{ "info": { "author": "Alexandre Kabbach and Aur\u00e9lie Herbelot", "author_email": "akb@3azouz.net", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Environment :: Web Environment", "Intended Audience :: Developers", "Intended Audience :: Education", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Scientific/Engineering :: Artificial Intelligence", "Topic :: Text Processing :: Linguistic" ], "description": "[![GitHub release][release-image]][release-url]\n[![PyPI release][pypi-image]][pypi-url]\n[![Build][travis-image]][travis-url]\n[![MIT License][license-image]][license-url]\n\n# nonce2vec\nWelcome to Nonce2Vec!\n\nThe main branch of this repository now refers to the Kabbach et al. (2019) ACL SRW 2019 paper *Towards incremental learning of word embeddings using context informativeness*.\n\n**If you are looking for the Herbelot and Baroni (2017) repository, check out the [emnlp2017](https://github.com/minimalparts/nonce2vec/tree/release/emnlp2017) branch.**\n\nIf you use this code, please cite:\n```tex\n@inproceedings{kabbach-etal-2019-towards,\n title = \"Towards Incremental Learning of Word Embeddings Using Context Informativeness\",\n author = \"Kabbach, Alexandre and\n Gulordava, Kristina and\n Herbelot, Aur{\\'e}lie\",\n booktitle = \"Proceedings of the 57th Conference of the Association for Computational Linguistics: Student Research Workshop\",\n month = jul,\n year = \"2019\",\n address = \"Florence, Italy\",\n publisher = \"Association for Computational Linguistics\",\n url = \"https://www.aclweb.org/anthology/P19-2022\",\n pages = \"162--168\"\n}\n```\n\n**Abstract**\n\n*In this paper, we investigate the task of learning word embeddings from very sparse data in an incremental, cognitively-plausible way. We focus on the notion of informativeness, that is, the idea that some content is more valuable to the learning process than other. We further highlight the challenges of online learning and argue that previous systems fall short of implementing incrementality. Concretely, we incorporate informativeness in a previously proposed model of nonce learning, using it for context selection and learning rate modulation. We test our system on the task of learning new words from definitions, as well as on the task of learning new words from potentially uninformative contexts. We demonstrate that informativeness is crucial to obtaining state-of-the-art performance in a truly incremental setup.*\n\n## A note on the code\nWe have significantly refactored the original Nonce2Vec code in order to make replication easier and to make it work with gensim v3.x. You can use Nonce2Vec v2.x to replicate the results of the SRW paper. However, to replicate results of the original ENMLP paper, refer to Nonce2Vec v1.x found under the [emnlp2017 branch](https://github.com/minimalparts/nonce2vec/tree/release/emnlp2017) as we **cannot** guarantee fair replication between v1.x and v2.x.\n\n## Install\nYou can install Nonce2Vec via pip:\n```bash\npip3 install nonce2vec\n```\nor, after a git clone, via:\n```bash\npython3 setup.py install\n```\n\n## Pre-requisites\nTo run Nonce2Vec, you need two gensim Word2Vec models (a skipgram model and a cbow model to compute informativeness-metrics). You can download the skipgram model from:\n```bash\nwget http://129.194.21.122/~kabbach/gensim.w2v.skipgram.model.7z\n```\nand the cbow model from:\n```sh\nwget http://129.194.21.122/~kabbach/gensim.w2v.cbow.model.7z\n```\nor generate both yourself following the instructions below.\n\n### Generating a Word2Vec model from a Wikipedia dump\nYou can download our English Wikipedia dump of January 2019 here:\n```bash\nwget http://129.194.21.122/~kabbach/enwiki.20190120.7z\n```\nIf you want to generate a completely new (tokenized-one-sentence-per-line) dump\nof Wikipedia, for English or any other language, check out [WiToKit](https://github.com/akb89/witokit).\n\nOnce you have a Wikipedia txt dump, you can generate a gensim Word2Vec skipgram model via:\n```bash\nn2v train \\\n --data /absolute/path/to/wikipedia/tokenized/text/dump \\\n --outputdir /absolute/path/to/dir/where/to/store/w2v/model \\\n --alpha 0.025 \\\n --neg 5 \\\n --window 5 \\\n --sample 1e-3 \\\n --epochs 5 \\\n --min-count 50 \\\n --size 400 \\\n --num-threads number_of_cpu_threads_to_use \\\n --train-mode skipgram\n```\nand a gensim Word2Vec cbow model via:\n```bash\nn2v train \\\n --data /absolute/path/to/wikipedia/tokenized/text/dump \\\n --outputdir /absolute/path/to/dir/where/to/store/w2v/model \\\n --alpha 0.025 \\\n --neg 5 \\\n --window 5 \\\n --sample 1e-3 \\\n --epochs 5 \\\n --min-count 50 \\\n --size 400 \\\n --num-threads number_of_cpu_threads_to_use \\\n --train-mode cbow\n```\n\nTo check the correlation of your word2vec model(s) with the MEN dataset, run:\n```bash\nn2v check-men \\\n ---model /absolute/path/to/gensim/w2v/model\n```\n\n## Running the code\nRunning Nonce2Vec on the definitional of chimeras datasets is done via the `n2v test` command. You can pass in the `--reload` parameter to run in `one-shot` mode, without it the code runs in incremental model by default. You can further pass in the `--shuffle` parameter to shuffle the test set before running n2v.\n\nYou will find below a list of commands corresponding to the experiments reported in the SRW 2019 paper. For example, to test the SUM CWI model (a basic sum model with context-word-informativeness-based filtering), which provides a rather robust baseline on all datasets in incremental setup, run, for the definitional dataset:\n```bash\nn2v test \\\n --on def \\\n --model /absolute/path/to/gensim/w2v/skipgram/model \\\n --info-model /absolute/path/to/gensim/w2v/cbow/model \\\n --sum-only \\\n --sum-filter cwi \\\n --sum-threshold 0\n```\n\nTo run the N2V CWI alpha model on the chimera L4 test set, with shuffling and in\none-shot evaluation setup (which provides SOTA performance), do:\n```bash\nn2v test \\\n --on l4 \\\n --model /absolute/path/to/gensim/w2v/skipgram/model \\\n --info-model /absolute/path/to/gensim/w2v/cbow/model \\\n --sum-filter cwi \\\n --sum-threshold 0 \\\n --train-with cwi_alpha \\\n --alpha 1.0 \\\n --beta 1000 \\\n --kappa 1 \\\n --neg 3 \\\n --epochs 1 \\\n --reload\n```\n\nTo test N2V as-is (the original N2V code without background freezing), in incremental setup on the definitional dataset, do:\n```bash\nn2v test \\\n --on def \\\n --model /absolute/path/to/gensim/w2v/skipgram/model \\\n --sum-filter random \\\n --sample 10000 \\\n --alpha 1.0 \\\n --neg 3 \\\n --window 15 \\\n --epochs 1 \\\n --lambda 70 \\\n --sample-decay 1.9 \\\n --window-decay 5 \\\n --replication\n```\n\nTo test N2V CWI init (the original N2V with CWI-based sum initialization) on the definitional dataset in one-shot evaluation setup, do:\n```bash\nn2v test \\\n --on def \\\n --model /absolute/path/to/gensim/w2v/skipgram/model \\\n --info-model /absolute/path/to/gensim/w2v/cbow/model \\\n --sum-filter cwi \\\n --sum-threshold 0 \\\n --alpha 1.0 \\\n --neg 3 \\\n --window 15 \\\n --epochs 1 \\\n --lambda 70 \\\n --sample-decay 1.9 \\\n --window-decay 5 \\\n --replication \\\n --reload\n```\n\n\n[release-image]:https://img.shields.io/github/release/minimalparts/nonce2vec.svg?style=flat-square\n[release-url]:https://github.com/minimalparts/nonce2vec/releases/latest\n[pypi-image]:https://img.shields.io/pypi/v/nonce2vec.svg?style=flat-square\n[pypi-url]:https://pypi.org/project/nonce2vec/\n[travis-image]:https://img.shields.io/travis/akb89/nonce2vec.svg?style=flat-square\n[travis-url]:https://travis-ci.org/akb89/nonce2vec\n[license-image]:http://img.shields.io/badge/license-MIT-000000.svg?style=flat-square\n[license-url]:LICENSE.txt", "description_content_type": "text/markdown", "docs_url": null, "download_url": "https://github.com/minimalparts/nonce2vec/#files", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/minimalparts/nonce2vec", "keywords": "word2vec,word-embeddings,incremental-learning", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "nonce2vec", "package_url": "https://pypi.org/project/nonce2vec/", "platform": "any", "project_url": "https://pypi.org/project/nonce2vec/", "project_urls": { "Download": "https://github.com/minimalparts/nonce2vec/#files", "Homepage": "https://github.com/minimalparts/nonce2vec" }, "release_url": "https://pypi.org/project/nonce2vec/2.0.0/", "requires_dist": null, "requires_python": "", "summary": "A python module to generate word embeddings from tiny data", "version": "2.0.0" }, "last_serial": 5599102, "releases": { "2.0.0": [ { "comment_text": "", "digests": { "md5": "aee008b409b54f5e5757e4f387d97f14", "sha256": "d66f2293e519e8c540b99819cb21103e800b54f19ecc462918ce8b4f7e4c8390" }, "downloads": -1, "filename": "nonce2vec-2.0.0.tar.gz", "has_sig": false, "md5_digest": "aee008b409b54f5e5757e4f387d97f14", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21721, "upload_time": "2019-07-29T11:47:47", "url": "https://files.pythonhosted.org/packages/da/15/50c3e8fa0f20fdaa3bd468297eea55951a885994233384bedf9ecd3344a9/nonce2vec-2.0.0.tar.gz" } ], "2.0.0rc1": [ { "comment_text": "", "digests": { "md5": "71cd74c86d908e35c552132ba8d4c67a", "sha256": "e5431a75ad29e2403e51640550bf498911c29fbab06cbaff7d1dc3fae38fe08a" }, "downloads": -1, "filename": "nonce2vec-2.0.0rc1.tar.gz", "has_sig": false, "md5_digest": "71cd74c86d908e35c552132ba8d4c67a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16582, "upload_time": "2018-11-17T09:24:32", "url": "https://files.pythonhosted.org/packages/0c/06/b087d9457af55401bad9f2c20fece1c4783dd3cc810c29f34296e4c7a557/nonce2vec-2.0.0rc1.tar.gz" } ], "2.0.0rc2": [ { "comment_text": "", "digests": { "md5": "dde42d0a724b95863fdc210c1c5e53c2", "sha256": "c1426e12da6007b1a5527e085d1879335253d0bf86e1e96d51753e3504deabe9" }, "downloads": -1, "filename": "nonce2vec-2.0.0rc2.tar.gz", "has_sig": false, "md5_digest": "dde42d0a724b95863fdc210c1c5e53c2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16352, "upload_time": "2018-11-17T13:26:22", "url": "https://files.pythonhosted.org/packages/1b/43/76901a5094ca6b9b75f0ed14a098143ba84df90c7ca7fd4bea9da587a103/nonce2vec-2.0.0rc2.tar.gz" } ], "2.0.0rc3": [ { "comment_text": "", "digests": { "md5": "d8ff5a4cef68b55387c16da29af3dd97", "sha256": "7a943cb07a73a84c5ab3451263c91f9e0d1dcc54f44ca29022457568bb01af16" }, "downloads": -1, "filename": "nonce2vec-2.0.0rc3.tar.gz", "has_sig": false, "md5_digest": "d8ff5a4cef68b55387c16da29af3dd97", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16356, "upload_time": "2018-11-17T14:26:16", "url": "https://files.pythonhosted.org/packages/cc/ef/db7636f49acc7077a6165dd4d376dcd538721aa9b450a0a4ba88e89ce9c6/nonce2vec-2.0.0rc3.tar.gz" } ], "2.0.0rc4": [ { "comment_text": "", "digests": { "md5": "18d3b7ced59c534eef61a95dd8b7a9dc", "sha256": "d3b989af558dc0ec997f70a43c082d4b4f92f685157dc250a06f9baa5a550290" }, "downloads": -1, "filename": "nonce2vec-2.0.0rc4.tar.gz", "has_sig": false, "md5_digest": "18d3b7ced59c534eef61a95dd8b7a9dc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16428, "upload_time": "2018-11-19T09:48:19", "url": "https://files.pythonhosted.org/packages/a6/6f/a6f980dd9c67caff2c10d7d86d44520bc489058c41d6500f71315daaff4a/nonce2vec-2.0.0rc4.tar.gz" } ], "2.0.0rc5": [ { "comment_text": "", "digests": { "md5": "e50c0aa92f0793ba031d4258a1247484", "sha256": "6a27c608904bdd4bb17c7ce248a0c5f00582ed3f00a62e30bd6169444e310c58" }, "downloads": -1, "filename": "nonce2vec-2.0.0rc5.tar.gz", "has_sig": false, "md5_digest": "e50c0aa92f0793ba031d4258a1247484", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16440, "upload_time": "2019-01-15T15:28:16", "url": "https://files.pythonhosted.org/packages/11/f2/b3193739ce468028b8f087a4c5f1a596945e5b16aa6ff8de00e9136b514b/nonce2vec-2.0.0rc5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "aee008b409b54f5e5757e4f387d97f14", "sha256": "d66f2293e519e8c540b99819cb21103e800b54f19ecc462918ce8b4f7e4c8390" }, "downloads": -1, "filename": "nonce2vec-2.0.0.tar.gz", "has_sig": false, "md5_digest": "aee008b409b54f5e5757e4f387d97f14", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 21721, "upload_time": "2019-07-29T11:47:47", "url": "https://files.pythonhosted.org/packages/da/15/50c3e8fa0f20fdaa3bd468297eea55951a885994233384bedf9ecd3344a9/nonce2vec-2.0.0.tar.gz" } ] }