{ "info": { "author": "Hideki INOUE", "author_email": "hideki@inoue-kobo.com", "bugtrack_url": null, "classifiers": [], "description": "# text-vectorian\n\n## Overview\n\ntext-vectorian is a Python module for vectorizing natural language in NLP (Natural Language Processing).\nIt lets you obtain vector representations from arbitrary text easily, without worrying about the details of the tokenizer or vectorizer.\n\nThe currently available combinations of Tokenizer and Vectorizer are as follows.\n\n### SentencePiece + Word2Vec\n\n* [SentencePiece](https://github.com/google/sentencepiece)\n* [Word2Vec](https://code.google.com/archive/p/word2vec/)\n\nPretrained models trained on the [Japanese Wikipedia](https://dumps.wikimedia.org/jawiki/) are bundled for both.\n\nVectors are obtained as follows:\n\n```python\nfrom text_vectorian import SentencePieceVectorian\n\nvectorian = SentencePieceVectorian()\n\ntext = '\u3053\u308c\u306f\u30c6\u30b9\u30c8\u3067\u3059\u3002'\nvectors = vectorian.fit(text).vectors\n```\n\n### 
Char2Vec\n\nTokenizes text character by character and vectorizes it with [Word2Vec](https://code.google.com/archive/p/word2vec/).\n\nA pretrained model trained on the [Japanese Wikipedia](https://dumps.wikimedia.org/jawiki/) is bundled.\n\nVectors are obtained as follows:\n\n```python\nfrom text_vectorian import Char2VecVectorian\n\nvectorian = Char2VecVectorian()\n\ntext = '\u3053\u308c\u306f\u30c6\u30b9\u30c8\u3067\u3059\u3002'\nvectors = vectorian.fit(text).vectors\n```\n\n### SentencePiece + BERT(Keras BERT)\n\n* [SentencePiece](https://github.com/google/sentencepiece)\n* [Keras BERT](https://github.com/CyberZHG/keras-bert)\n\nThe BERT model has to be prepared separately.\nA model pretrained on the [Japanese Wikipedia](https://dumps.wikimedia.org/jawiki/) is available from:\n\n* [BERT with SentencePiece for Japanese text](https://yoheikikuta.github.io/bert-japanese/)\n\nDownload the following files from the page above:\n\n* wiki-ja.vocab\n* wiki-ja.model\n* model.ckpt-1400000.data-00000-of-00001\n* model.ckpt-1400000.index\n* 
model.ckpt-1400000.meta\n\nRunning the following obtains BERT-based vectors:\n\n```python\nfrom text_vectorian import SpBertVectorian\n\ntokenizer_filename = '[directory where the model was downloaded]/model/wiki-ja.model'\nvectorizer_filename = '[directory where the model was downloaded]/model/model.ckpt-1400000'\nvectorian = SpBertVectorian(\n tokenizer_filename=tokenizer_filename,\n vectorizer_filename=vectorizer_filename,\n)\n\ntext = '\u3053\u308c\u306f\u30c6\u30b9\u30c8\u3067\u3059\u3002'\nvectors = vectorian.fit(text).vectors\n```\n\n## Usage\n\n```bash\npip install text-vectorian\n```\n\n## Examples\n\n### Getting vectors\n\n```python\nfrom text_vectorian import SentencePieceVectorian\n\nvectorian = SentencePieceVectorian()\ntext = '\u3053\u308c\u306f\u30c6\u30b9\u30c8\u3067\u3059\u3002'\nvectors = vectorian.fit(text).vectors\n\nprint(vectors)\n```\n\n```\n[ -4.9867806 13.593797 0.48158574 13.635306 17.737247\n 0.3811171 2.5912592 10.951708 2.45966 6.561281\n 4.335961 -2.328748 0.3230163 7.5206175 12.470385\n -5.782171 6.258509 1.4046584 -5.3632765 11.03699\n\n...\n\n -3.9090352 2.6152203 -2.696024 0.16026124 0.55380476\n -0.09982404 -3.8374352 2.1398337 0.8905425 -0.18653768\n -0.9730848 -0.41389456 0.54263806 -1.1963823 4.827375\n 1.3883296 -0.9925082 2.4345522 -1.2879591 2.6136968 ]]\n```\n\n### Using with Keras\n\nGet the indices for the Vectorizer's model and use them as the input of a Keras Embedding layer.\n\n```python\nfrom text_vectorian import SentencePieceVectorian\n\nvectorian = SentencePieceVectorian()\ntext = 
'\u3053\u308c\u306f\u30c6\u30b9\u30c8\u3067\u3059\u3002'\nindices = vectorian.fit(text).indices\n\nprint(indices)\n\nfrom keras import Input, Model\nfrom keras.layers import Dense, LSTM\n\ninput_tensor = Input((vectorian.max_tokens_len,))\ncommon_input = vectorian.get_keras_layer(trainable=True)(input_tensor)\nl1 = LSTM(32)(common_input)\noutput_tensor = Dense(3)(l1)\n\nmodel = Model(input_tensor, output_tensor)\nmodel.summary()\n```\n\n```\n[ 14 138 2645 2389 1]\n\n...\n\n_________________________________________________________________\nLayer (type) Output Shape Param #\n=================================================================\ninput_1 (InputLayer) (None, 5) 0\n_________________________________________________________________\nembedding_1 (Embedding) (None, 5, 50) 8555900\n_________________________________________________________________\nlstm_1 (LSTM) (None, 32) 10624\n_________________________________________________________________\ndense_1 (Dense) (None, 3) 99\n=================================================================\nTotal params: 8,566,623\nTrainable params: 8,566,623\nNon-trainable params: 0\n_________________________________________________________________\n```\n\n### Fine-tuning BERT\n\nGet the indices for the BERT model and fine-tune it with Keras.\nCurrently only a single sentence can be input at a time.\n\n```python\nfrom text_vectorian import SpBertVectorian\n\ntokenizer_filename = '../bert-japanese/model/wiki-ja.model'\nvectorizer_filename = '../bert-japanese/model/model.ckpt-1400000'\nvectorian = SpBertVectorian(\n tokenizer_filename=tokenizer_filename,\n vectorizer_filename=vectorizer_filename\n)\ntext = '\u3053\u308c\u306f\u30c6\u30b9\u30c8\u3067\u3059\u3002'\n\nlabels = [[0, 0, 0, 1]] # 
Label data\nindices = []\nindices.append(vectorian.fit(text, suppress_vectors=True).indices)\n# Get the segments, which indicate the split ranges of the sentences input to BERT.\nsegments = vectorian.get_segments()\n\nprint(indices)\n\nfrom keras import Model\nfrom keras.layers import Dense\n\nbatch_size = 32\nepochs = 1\nlayers = vectorian.get_keras_layer(trainable=True)\noptimizer = vectorian.get_optimizer(samples_len=len(indices), batch_size=batch_size, epochs=epochs)\n\noutput_tensor = Dense(4)(layers['last'])\nmodel = Model(layers['inputs'], output_tensor)\nmodel.compile(loss='categorical_crossentropy', optimizer=optimizer)\nmodel.summary()\n\nhistory = model.fit([indices, segments],\n labels,\n batch_size=batch_size,\n epochs=epochs)\n```\n\n## Development\n\n### Class\n\n![](docs/class.png)\n\n## License\n\n* [MIT](https://github.com/lhideki/text-vectorian/blob/master/LICENSE)\n\n## Authors\n\n* [Hideki INOUE](https://github.com/lhideki)\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/lhideki/text-vectorian", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "text-vectorian", "package_url": "https://pypi.org/project/text-vectorian/", "platform": "", "project_url": "https://pypi.org/project/text-vectorian/", "project_urls": { "Homepage": "https://github.com/lhideki/text-vectorian" }, "release_url": "https://pypi.org/project/text-vectorian/0.2.0/", "requires_dist": [ "gensim", "sentencepiece", "keras", "keras-bert" ], "requires_python": "", "summary": "For getting token embedded vectors for NLP.", "version": "0.2.0" }, "last_serial": 5928288, "releases": { "0.1.10": [ { "comment_text": "", "digests": { "md5": "a275986ec88901ee18237bac1845e474", "sha256": 
"6ebaa1018390abc26e98d49a1c3f4098d6e9fd7e36b8ffc7357eb45bc59723b2" }, "downloads": -1, "filename": "text_vectorian-0.1.10.tar.gz", "has_sig": false, "md5_digest": "a275986ec88901ee18237bac1845e474", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6204, "upload_time": "2019-03-23T14:18:53", "url": "https://files.pythonhosted.org/packages/ab/38/3c0916bb20f636c8bca7eaa3c2e713ea81ddd1b273ba53ec9db1ab5b1cee/text_vectorian-0.1.10.tar.gz" } ], "0.1.11": [ { "comment_text": "", "digests": { "md5": "1c578f6cdc493f810962b18ad4dc59a5", "sha256": "c5241de667aa363f3dd4e0a5f7785287d12592b795dcdca2e75449bc4eae95a9" }, "downloads": -1, "filename": "text_vectorian-0.1.11-py3-none-any.whl", "has_sig": false, "md5_digest": "1c578f6cdc493f810962b18ad4dc59a5", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9385, "upload_time": "2019-10-04T07:06:57", "url": "https://files.pythonhosted.org/packages/09/c0/13b2c6868c1e1ea16fca1d09f74022923c000577a0ece2c017860b6cc76d/text_vectorian-0.1.11-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "145800bcf1feecb8986ea0be7e19ea47", "sha256": "00576b5c081ef4cc4c792a26cbb3437e2ba35ae4d213fec7813d222d2b0341c1" }, "downloads": -1, "filename": "text_vectorian-0.1.11.tar.gz", "has_sig": false, "md5_digest": "145800bcf1feecb8986ea0be7e19ea47", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 105645, "upload_time": "2019-10-04T07:07:02", "url": "https://files.pythonhosted.org/packages/6b/85/feda837abc29887117feaa602daf891bb27894ee550bc33d56877048b163/text_vectorian-0.1.11.tar.gz" } ], "0.1.12": [ { "comment_text": "", "digests": { "md5": "568e0d7f63d6580de57b955985b7822c", "sha256": "8d7715bb82f85c2e5a84573f421273faf2975801df2c0b6c2820ce8f229501b6" }, "downloads": -1, "filename": "text_vectorian-0.1.12-py3-none-any.whl", "has_sig": false, "md5_digest": "568e0d7f63d6580de57b955985b7822c", "packagetype": "bdist_wheel", 
"python_version": "py3", "requires_python": null, "size": 9772, "upload_time": "2019-10-04T07:10:16", "url": "https://files.pythonhosted.org/packages/12/39/b2dba057c13f4dd4f4491cf93ba79b466c07d78e81dd0dd216ed642457c2/text_vectorian-0.1.12-py3-none-any.whl" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "f18db3c42a0ee9bfc03728f6fdba6417", "sha256": "c6f6f2a46991a6447dc88604142931b73bcefe3eb8a52ff880be9c7d2d5d1298" }, "downloads": -1, "filename": "text_vectorian-0.1.3.tar.gz", "has_sig": false, "md5_digest": "f18db3c42a0ee9bfc03728f6fdba6417", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4415, "upload_time": "2019-01-18T01:55:45", "url": "https://files.pythonhosted.org/packages/36/a2/fcc12bcccd32429606c74985f92a443665e44635a9f957398103f18a5d79/text_vectorian-0.1.3.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "17770806454d9521cea0cd8269686971", "sha256": "3c26f6d0ec49d1afaead56ae0255bfdae037cc9f8a75619541f5d93158218903" }, "downloads": -1, "filename": "text_vectorian-0.1.5.tar.gz", "has_sig": false, "md5_digest": "17770806454d9521cea0cd8269686971", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4610, "upload_time": "2019-01-22T05:33:25", "url": "https://files.pythonhosted.org/packages/db/de/98feadda5db279f53558b5732794b0f8e7946b3ac1e6abc693a9412759c4/text_vectorian-0.1.5.tar.gz" } ], "0.1.6": [ { "comment_text": "", "digests": { "md5": "cf42a3cd0d1b429e6f1d7b2ab8f1a3b6", "sha256": "457ad2e7a4b19c6f3e636387a8737102687141b62710380f643c2273787bcf43" }, "downloads": -1, "filename": "text_vectorian-0.1.6.tar.gz", "has_sig": false, "md5_digest": "cf42a3cd0d1b429e6f1d7b2ab8f1a3b6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4628, "upload_time": "2019-01-27T02:29:12", "url": "https://files.pythonhosted.org/packages/29/42/f77853af5915d71a52ff14c05fb44b2839a4b7150a3b2b7529781a609763/text_vectorian-0.1.6.tar.gz" } ], "0.1.7": [ { 
"comment_text": "", "digests": { "md5": "a92a6aeace02071a1ae685d22cc0e789", "sha256": "66e65402431623a156efeff5219e8e2a85e16bf03720f1874b99d8853ffc51ad" }, "downloads": -1, "filename": "text_vectorian-0.1.7.tar.gz", "has_sig": false, "md5_digest": "a92a6aeace02071a1ae685d22cc0e789", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4796, "upload_time": "2019-02-28T11:41:28", "url": "https://files.pythonhosted.org/packages/03/50/fef0582e2e1e97e41b84c195ec54a9dd27d464f608e9aeef941ea3bfdb6e/text_vectorian-0.1.7.tar.gz" } ], "0.1.8": [ { "comment_text": "", "digests": { "md5": "60114a3fc2c7e480bc90a229a1fb12d2", "sha256": "981eb535831402c9b5ca84ce964fd54b5101ab03d35b127fc590faacf530b574" }, "downloads": -1, "filename": "text_vectorian-0.1.8.tar.gz", "has_sig": false, "md5_digest": "60114a3fc2c7e480bc90a229a1fb12d2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4970, "upload_time": "2019-03-01T03:34:48", "url": "https://files.pythonhosted.org/packages/23/ec/daed2b17024ee121af52b3286350251672bc121d7d97b8a149eed5ee2d8d/text_vectorian-0.1.8.tar.gz" } ], "0.1.9": [ { "comment_text": "", "digests": { "md5": "1ab57a9cc236f5265ff169c9bc0bba55", "sha256": "067181d21a3946009307e508d338bdfc43b911b9412e27735efa6c87a73a79d1" }, "downloads": -1, "filename": "text_vectorian-0.1.9.tar.gz", "has_sig": false, "md5_digest": "1ab57a9cc236f5265ff169c9bc0bba55", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6177, "upload_time": "2019-03-19T12:10:27", "url": "https://files.pythonhosted.org/packages/7e/41/f0193667fa7e734e148fef30bacc13e12fdb711b121e642e1c0da3042974/text_vectorian-0.1.9.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "0428a045c29c7eb6353c8e6548c7d2bf", "sha256": "4bd25e87a385348afd553488d6e896384dbe4d9d25687ba0af6dddf01c71f40e" }, "downloads": -1, "filename": "text_vectorian-0.2.0-py3-none-any.whl", "has_sig": false, "md5_digest": 
"0428a045c29c7eb6353c8e6548c7d2bf", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9758, "upload_time": "2019-10-04T13:17:23", "url": "https://files.pythonhosted.org/packages/18/e1/b14e71c09630ff78e9238f5eb74f5b4c4d4d8824847f838a4fc69258f8e6/text_vectorian-0.2.0-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "0428a045c29c7eb6353c8e6548c7d2bf", "sha256": "4bd25e87a385348afd553488d6e896384dbe4d9d25687ba0af6dddf01c71f40e" }, "downloads": -1, "filename": "text_vectorian-0.2.0-py3-none-any.whl", "has_sig": false, "md5_digest": "0428a045c29c7eb6353c8e6548c7d2bf", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 9758, "upload_time": "2019-10-04T13:17:23", "url": "https://files.pythonhosted.org/packages/18/e1/b14e71c09630ff78e9238f5eb74f5b4c4d4d8824847f838a4fc69258f8e6/text_vectorian-0.2.0-py3-none-any.whl" } ] }