{ "info": { "author": "Junseong Kim", "author_email": "codertimo@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# kor2vec [![CircleCI](https://circleci.com/gh/kor2vec/kor2vec.svg?style=svg)](https://circleci.com/gh/kor2vec/kor2vec)\n\nFast, accurate Char-CNN based Korean embedding with no OOV\n\n## Installation\n```shell\npip install kor2vec\n```\n> Requirements : `tqdm`, `numpy`, and `torch >= 0.4.0`\n\n## Introduction\n\nKorean is an agglutinative language. Stem + ending (predicates), noun + particle, and similar combinations can produce tens of thousands of\ndistinct word forms. This is very convenient for people who use Korean, but\nit has always been the biggest problem for NLP developers who need to embed Korean.\n\nFor this reason, the usual approach has been to split Korean text into suitable tokens with `konlpy` or `sentence piece` and then\ntrain `Word2vec` or a custom embedding on those tokens to work around agglutination.\n\nHowever, this approach has three major problems.\n\n1. 
Every inference and training step needs a tokenizer attached, which creates a bottleneck\n2. Meaning is often lost during tokenization (incorrect tokenization)\n3. Covering every word and sentence is impossible (OOV problems are inevitable)\n\n\n## Solution\n\nTo address these problems, we applied a CNN-based char-word embedding to Korean and\nbuilt `kor2vec`.\n\n- Embedding training method : Skip-gram based embedding training\n- Char-word encoder model architecture : [Yoon Kim's Character-Aware Neural Language Models](https://arxiv.org/abs/1508.06615)\n\n## Quick Start\n\n```shell\nkor2vec train -c corpus/path -o output/model.kor2vec\n```\n\n### inference\n```python\n\nfrom kor2vec import Kor2Vec\nkor2vec = Kor2Vec.load(\"../model/path\")\n\nkor2vec.embedding(\"\uc548\ub155 \uc544\uc774\uc624\uc544\uc774\uc57c \ub098\ub294 \ud074\ub85c\ubc14\uc5d0\uc11c \uc654\uc5b4\")\n# returns a torch tensor of shape (5, 128): one embedding vector per word\n\nkor2vec.embedding(\"\ub098\ub294 \ub3c4\ub77c\uc5d0\ubabd\uc774\ub77c\uace0 \ud574 \ubc18\uac00\uc6cc\", numpy=True)\n# returns a numpy array of shape (4, 128)\n\nseqs = kor2vec.to_seqs([\"\uc548\ub155 \ub098\ub294 \ubf40\ub85c\ub85c\ub77c\uace0 \ud574\", \"\ub9cc\ub098\uc11c \ubc18\uac00\uc6cc \ubf40\ub85c\ub85c\"], seq_len=4)\nkor2vec.forward(seqs)\n# returns a torch tensor of shape (2, 4, 128)\n```\n\n### training\n\n```python\nfrom kor2vec import Kor2Vec\n\nkor2vec = Kor2Vec(embed_size=128)\n\nkor2vec.train(\"../path/corpus\", batch_size=128) # takes some time\n\nkor2vec.save(\"../model/path\") # saving 
embedding\n```\n\n### with pytorch\n\n```python\n\nimport torch.nn as nn\nfrom kor2vec import Kor2Vec\n\nkor2vec = Kor2Vec.load(\"../model/path\")\n# or kor2vec = SejongVector()\n\nlstm = nn.LSTM(128, 64, batch_first=True)\ndense = nn.Linear(64, 1)\n\n# Make tensor input\nsentences = [\"\uc774 \uc601\ud654\ub294 \uc815\ub9d0 \ub300\ubc15\uc774\uc5d0\uc694\", \"\uc6b0\uc640 \uc9c4\uc9dc \uc7ac\ubbf8\uc788\uc5c8\uc5b4\uc694\"]\n\nx = kor2vec.to_seqs(sentences, seq_len=10)\n# >>> tensor(batch_size, seq_len, char_seq_len)\n\nx = kor2vec(x) # tensor(batch_size, seq_len, 128)\n_, (h, c) = lstm(x) # h: tensor(1, batch_size, 64), final hidden state per layer\nx = dense(h[-1]) # last layer's hidden state -> tensor(batch_size, 1)\n\n```\n\n\nCopyright 2018 Kor2vec Contributors and NAVER Corporation\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/codertimo/kor2vec", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "kor2vec", "package_url": "https://pypi.org/project/kor2vec/", "platform": "", "project_url": "https://pypi.org/project/kor2vec/", "project_urls": { "Homepage": "https://github.com/codertimo/kor2vec" }, "release_url": "https://pypi.org/project/kor2vec/0.0.1a0/", "requires_dist": [ "numpy", "torch (>=0.4.0)", "tqdm" ], "requires_python": "", "summary": "Char-CNN based Korean Word Embedding", "version": "0.0.1a0" }, "last_serial": 4368978, "releases": { "0.0.1a0": [ { "comment_text": "", "digests": { "md5": "dc3a945749c9ca2989fc08241d21812a", "sha256": "e76f92b8d93454dcf70c5d73797ef7ea71c5e42836583a7a8da4f71a1d132a5d" }, "downloads": -1, "filename": "kor2vec-0.0.1a0-py3-none-any.whl", "has_sig": false, "md5_digest": "dc3a945749c9ca2989fc08241d21812a", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 22175, "upload_time": "2018-10-12T14:53:42", "url": 
"https://files.pythonhosted.org/packages/68/77/a42939a91df0f148b9be405462e7c582599ce30deb5e613eca29b0c22d7b/kor2vec-0.0.1a0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "152ced14e456efc76ddcc39499b709ea", "sha256": "0af498d520fa03c36948394973a6bd59042e01dd8ff232b2e42042780a3c74b1" }, "downloads": -1, "filename": "kor2vec-0.0.1a0.tar.gz", "has_sig": false, "md5_digest": "152ced14e456efc76ddcc39499b709ea", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13403, "upload_time": "2018-10-12T14:53:43", "url": "https://files.pythonhosted.org/packages/52/f7/3de1a7bc8d4d29c7456bf22912059ecdf201008e1fa93abb8a49f63dbe41/kor2vec-0.0.1a0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "dc3a945749c9ca2989fc08241d21812a", "sha256": "e76f92b8d93454dcf70c5d73797ef7ea71c5e42836583a7a8da4f71a1d132a5d" }, "downloads": -1, "filename": "kor2vec-0.0.1a0-py3-none-any.whl", "has_sig": false, "md5_digest": "dc3a945749c9ca2989fc08241d21812a", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 22175, "upload_time": "2018-10-12T14:53:42", "url": "https://files.pythonhosted.org/packages/68/77/a42939a91df0f148b9be405462e7c582599ce30deb5e613eca29b0c22d7b/kor2vec-0.0.1a0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "152ced14e456efc76ddcc39499b709ea", "sha256": "0af498d520fa03c36948394973a6bd59042e01dd8ff232b2e42042780a3c74b1" }, "downloads": -1, "filename": "kor2vec-0.0.1a0.tar.gz", "has_sig": false, "md5_digest": "152ced14e456efc76ddcc39499b709ea", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13403, "upload_time": "2018-10-12T14:53:43", "url": "https://files.pythonhosted.org/packages/52/f7/3de1a7bc8d4d29c7456bf22912059ecdf201008e1fa93abb8a49f63dbe41/kor2vec-0.0.1a0.tar.gz" } ] }