{ "info": { "author": "Rico Sennrich", "author_email": "", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 2", "Programming Language :: Python :: 3", "Topic :: Scientific/Engineering :: Artificial Intelligence", "Topic :: Text Processing" ], "description": "Subword Neural Machine Translation\n==================================\n\nThis repository contains preprocessing scripts to segment text into subword\nunits. The primary purpose is to facilitate the reproduction of our experiments\non Neural Machine Translation with subword units (see below for reference).\n\nINSTALLATION\n------------\n\ninstall via pip (from PyPI):\n\n pip install subword-nmt\n\ninstall via pip (from Github):\n\n pip install https://github.com/rsennrich/subword-nmt/archive/master.zip\n\nalternatively, clone this repository; the scripts are executable stand-alone.\n\n\nUSAGE INSTRUCTIONS\n------------------\n\nCheck the individual files for usage instructions.\n\nTo apply byte pair encoding to word segmentation, invoke these commands:\n\n subword-nmt learn-bpe -s {num_operations} < {train_file} > {codes_file}\n subword-nmt apply-bpe -c {codes_file} < {test_file} > {out_file}\n\nTo segment rare words into character n-grams, do the following:\n\n subword-nmt get-vocab --train_file {train_file} --vocab_file {vocab_file}\n subword-nmt segment-char-ngrams --vocab {vocab_file} -n {order} --shortlist {size} < {test_file} > {out_file}\n\nThe original segmentation can be restored with a simple replacement:\n\n sed -r 's/(@@ )|(@@ ?$)//g'\n\nIf you cloned the repository and did not install a package, you can also run the individual commands as scripts:\n\n ./subword_nmt/learn_bpe.py -s {num_operations} < {train_file} > {codes_file}\n\nBEST PRACTICE ADVICE FOR BYTE PAIR ENCODING IN NMT\n--------------------------------------------------\n\nWe found that for languages that share an alphabet, learning BPE 
on the\nconcatenation of the (two or more) involved languages increases the consistency\nof segmentation, and reduces the problem of inserting/deleting characters when\ncopying/transliterating names.\n\nHowever, this introduces undesirable edge cases in that a word may be segmented\nin a way that has only been observed in the other language, and is thus unknown\nat test time. To prevent this, `apply_bpe.py` accepts a `--vocabulary` and a\n`--vocabulary-threshold` option so that the script will only produce symbols\nwhich also appear in the vocabulary (with at least some frequency).\n\nTo use this functionality, we recommend the following recipe (assuming L1 and L2\nare the two languages):\n\nLearn byte pair encoding on the concatenation of the training text, and get the resulting vocabulary for each:\n\n cat {train_file}.L1 {train_file}.L2 | subword-nmt learn-bpe -s {num_operations} -o {codes_file}\n subword-nmt apply-bpe -c {codes_file} < {train_file}.L1 | subword-nmt get-vocab > {vocab_file}.L1\n subword-nmt apply-bpe -c {codes_file} < {train_file}.L2 | subword-nmt get-vocab > {vocab_file}.L2\n\nmore conveniently, you can do the same with this command:\n\n subword-nmt learn-joint-bpe-and-vocab --input {train_file}.L1 {train_file}.L2 -s {num_operations} -o {codes_file} --write-vocabulary {vocab_file}.L1 {vocab_file}.L2\n\nre-apply byte pair encoding with vocabulary filter:\n\n subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {train_file}.L1 > {train_file}.BPE.L1\n subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L2 --vocabulary-threshold 50 < {train_file}.L2 > {train_file}.BPE.L2\n\nas a last step, extract the vocabulary to be used by the neural network. 
Example with Nematus:\n\n nematus/data/build_dictionary.py {train_file}.BPE.L1 {train_file}.BPE.L2\n\n[you may want to take the union of all vocabularies to support multilingual systems]\n\nfor test/dev data, re-use the same options for consistency:\n\n subword-nmt apply-bpe -c {codes_file} --vocabulary {vocab_file}.L1 --vocabulary-threshold 50 < {test_file}.L1 > {test_file}.BPE.L1\n\nPUBLICATIONS\n------------\n\nThe segmentation methods are described in:\n\nRico Sennrich, Barry Haddow and Alexandra Birch (2016):\n Neural Machine Translation of Rare Words with Subword Units\n Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.\n\nACKNOWLEDGMENTS\n---------------\nThis project has received funding from Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland, and from the European Union\u2019s Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).\n\n\nCHANGELOG\n---------\n\nv0.3.6:\n - fix to subword-bpe command encoding\n\nv0.3.5:\n - fix to subword-bpe command under Python 2\n - wider support of --total-symbols argument\n\nv0.3.4:\n - segment_tokens method to improve library usability (https://github.com/rsennrich/subword-nmt/pull/52)\n - support regex glossaries (https://github.com/rsennrich/subword-nmt/pull/56)\n - allow unicode separators (https://github.com/rsennrich/subword-nmt/pull/57)\n - new option --total-symbols in learn-bpe (commit 61ad8)\n - fix documentation (best practices) (https://github.com/rsennrich/subword-nmt/pull/60)\n\nv0.3:\n - library is now installable via pip\n - fix occasional problems with UTF-8 whitespace and new lines in learn_bpe and apply_bpe.\n - do not silently convert UTF-8 newline characters into \"\\n\"\n - do not silently convert UTF-8 whitespace characters into \" \"\n - UTF-8 whitespace and newline characters are now considered part of a word, and segmented by BPE\n\nv0.2:\n - different, more consistent handling of 
end-of-word token (commit a749a7) (https://github.com/rsennrich/subword-nmt/issues/19)\n - allow passing of vocabulary and frequency threshold to apply_bpe.py, preventing the production of OOV (or rare) subword units (commit a00db)\n - made learn_bpe.py deterministic (commit 4c54e)\n - various changes to make handling of UTF more consistent between Python versions\n - new command line arguments for apply_bpe.py:\n - '--glossaries' to prevent given strings from being affected by BPE\n - '--merges' to apply a subset of learned BPE operations\n - new command line arguments for learn_bpe.py:\n - '--dict-input': rather than raw text file, interpret input as a frequency dictionary (as created by get_vocab.py).\n\n\nv0.1:\n - consistent cross-version unicode handling\n - all scripts are now deterministic\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/rsennrich/subword-nmt", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "subword-nmt", "package_url": "https://pypi.org/project/subword-nmt/", "platform": "", "project_url": "https://pypi.org/project/subword-nmt/", "project_urls": { "Homepage": "https://github.com/rsennrich/subword-nmt" }, "release_url": "https://pypi.org/project/subword-nmt/0.3.6/", "requires_dist": null, "requires_python": "", "summary": "Unsupervised Word Segmentation for Neural Machine Translation and Text Generation", "version": "0.3.6" }, "last_serial": 4585721, "releases": { "0.3.1": [ { "comment_text": "", "digests": { "md5": "9d15dc1741d9742c0dcf36eb0e037dff", "sha256": "827a3e6340af6a1aea79b366ef4d0766ee9422c4c14a2e3464172e1153ec295c" }, "downloads": -1, "filename": "subword_nmt-0.3.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "9d15dc1741d9742c0dcf36eb0e037dff", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 24848, 
"upload_time": "2018-05-17T12:09:05", "url": "https://files.pythonhosted.org/packages/46/8b/9de60efd9a5618b78ef4aa56b82a887dcae709f3d7fa9fb2d98829fcbdd1/subword_nmt-0.3.1-py2.py3-none-any.whl" } ], "0.3.2": [ { "comment_text": "", "digests": { "md5": "18eef52de7ef62677507e75d9dbefc09", "sha256": "3682755a749419a301ff665069f403f56c4e1642dc3aec68b6051c0edf60b5d7" }, "downloads": -1, "filename": "subword_nmt-0.3.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "18eef52de7ef62677507e75d9dbefc09", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 24973, "upload_time": "2018-05-17T12:28:54", "url": "https://files.pythonhosted.org/packages/1b/d2/289b9a179daa06a994fd8a496810b8f4f3fa8206670a9a3dd2404c89af27/subword_nmt-0.3.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "78489cdd068aa95a20df6e09b0ea3791", "sha256": "46af419472b40da33848d4fef4022eb7002734cd8fcf306f2b05edc11c105bde" }, "downloads": -1, "filename": "subword_nmt-0.3.2-py3.5.egg", "has_sig": false, "md5_digest": "78489cdd068aa95a20df6e09b0ea3791", "packagetype": "bdist_egg", "python_version": "3.5", "requires_python": null, "size": 48669, "upload_time": "2018-05-21T09:56:12", "url": "https://files.pythonhosted.org/packages/cd/e7/37aede538b235fa0db845129e0a99649ab246da42a1a800aa11fcc54fefa/subword_nmt-0.3.2-py3.5.egg" } ], "0.3.3": [ { "comment_text": "", "digests": { "md5": "dc4429b9980da2659391153bf754d00f", "sha256": "908a61581f22c6e695b50f82ca2d28a74cda23f94f7833c2b5505d0fd351c277" }, "downloads": -1, "filename": "subword_nmt-0.3.3-py2.7.egg", "has_sig": false, "md5_digest": "dc4429b9980da2659391153bf754d00f", "packagetype": "bdist_egg", "python_version": "2.7", "requires_python": null, "size": 53670, "upload_time": "2018-05-21T09:56:13", "url": "https://files.pythonhosted.org/packages/f8/11/911079d1f5295ad72c74d389cba425f12c9c8e55217b631fae81439b9696/subword_nmt-0.3.3-py2.7.egg" }, { "comment_text": "", "digests": { "md5": 
"a8bfc6e7c89705e44ef7fbc5da273c7f", "sha256": "7df6b6672114f72fc3b8d87d80fd21cc458871295733c444c72278cfd55c205a" }, "downloads": -1, "filename": "subword_nmt-0.3.3-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "a8bfc6e7c89705e44ef7fbc5da273c7f", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 25023, "upload_time": "2018-05-21T09:56:10", "url": "https://files.pythonhosted.org/packages/7f/bd/2bd51c30a05048d6c4e847d48386380c4554cf9c634d426c33c4425469d8/subword_nmt-0.3.3-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "c2dc776f31643280d02b658deb8abe98", "sha256": "ea295d949c7176579de325d63dfc691143fdf893765a0bf1cd0dbe4628f2ce91" }, "downloads": -1, "filename": "subword_nmt-0.3.3-py3.5.egg", "has_sig": false, "md5_digest": "c2dc776f31643280d02b658deb8abe98", "packagetype": "bdist_egg", "python_version": "3.5", "requires_python": null, "size": 48763, "upload_time": "2018-05-21T09:56:14", "url": "https://files.pythonhosted.org/packages/e1/75/9c860bd42403d9ce6e6a0bbe46e6feeb58834b55725682e849860ecf9cb9/subword_nmt-0.3.3-py3.5.egg" } ], "0.3.4": [ { "comment_text": "", "digests": { "md5": "ccede5e5b8e2531d19a647fa8f90bfb1", "sha256": "e504b1a60d0293807cdb8bef77e1e728c071a7a1ed23144eb5fbce43631ec8e0" }, "downloads": -1, "filename": "subword_nmt-0.3.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "ccede5e5b8e2531d19a647fa8f90bfb1", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 25780, "upload_time": "2018-08-17T12:52:25", "url": "https://files.pythonhosted.org/packages/57/35/ce910c9f920986db9fa86ba67f5ee35771d1e9d699161bd346958272f408/subword_nmt-0.3.4-py2.py3-none-any.whl" } ], "0.3.5": [ { "comment_text": "", "digests": { "md5": "9fa6a3082df6f5c651d7e809aff49882", "sha256": "c7d29d9b79720023f785c2c3aef1ebff1444a1b74418bfd497034f573a8a7ec7" }, "downloads": -1, "filename": "subword_nmt-0.3.5-py2.7.egg", "has_sig": false, "md5_digest": 
"9fa6a3082df6f5c651d7e809aff49882", "packagetype": "bdist_egg", "python_version": "2.7", "requires_python": null, "size": 55511, "upload_time": "2018-12-11T14:48:20", "url": "https://files.pythonhosted.org/packages/16/d9/465f417a00753fb320e29b52ecaee8cdf17caa8fa6661dfead4efbf4ebc3/subword_nmt-0.3.5-py2.7.egg" }, { "comment_text": "", "digests": { "md5": "9e82fe65cd026ace46ad1204bbb52fd8", "sha256": "3740a7b2ea74f9de20df7a08c80ad384b2f0c7e928cf6fe2eb08167d9151cb35" }, "downloads": -1, "filename": "subword_nmt-0.3.5-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "9e82fe65cd026ace46ad1204bbb52fd8", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 25921, "upload_time": "2018-09-17T12:21:42", "url": "https://files.pythonhosted.org/packages/e1/14/f870780204476815af1aa11a20bfde91fbe588712a1e900b32c079beb7ea/subword_nmt-0.3.5-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a40bcf802b3bf82e350fe202fb4d3d09", "sha256": "1a003ec19ca0c5ba051302cbac4254a31cc5810e846875bfaea5169a9cdb240d" }, "downloads": -1, "filename": "subword_nmt-0.3.5-py3.6.egg", "has_sig": false, "md5_digest": "a40bcf802b3bf82e350fe202fb4d3d09", "packagetype": "bdist_egg", "python_version": "3.6", "requires_python": null, "size": 50003, "upload_time": "2018-12-11T14:48:22", "url": "https://files.pythonhosted.org/packages/9c/74/215a925e38e81810c68eca33571c1e45b8b633fb9e5f8043456ccc636a52/subword_nmt-0.3.5-py3.6.egg" } ], "0.3.6": [ { "comment_text": "", "digests": { "md5": "dac4e56145e2d9337b351ff3cfd15138", "sha256": "5a8a2c8cfa758494eee8b2eb7ede9fe48a5f32d13203c393ee803671fe997603" }, "downloads": -1, "filename": "subword_nmt-0.3.6-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "dac4e56145e2d9337b351ff3cfd15138", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 25912, "upload_time": "2018-12-11T14:48:16", "url": 
"https://files.pythonhosted.org/packages/26/08/58267cb3ac00f5f895457777ed9e0d106dbb5e6388fa7923d8663b04b849/subword_nmt-0.3.6-py2.py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "dac4e56145e2d9337b351ff3cfd15138", "sha256": "5a8a2c8cfa758494eee8b2eb7ede9fe48a5f32d13203c393ee803671fe997603" }, "downloads": -1, "filename": "subword_nmt-0.3.6-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "dac4e56145e2d9337b351ff3cfd15138", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 25912, "upload_time": "2018-12-11T14:48:16", "url": "https://files.pythonhosted.org/packages/26/08/58267cb3ac00f5f895457777ed9e0d106dbb5e6388fa7923d8663b04b849/subword_nmt-0.3.6-py2.py3-none-any.whl" } ] }