{ "info": { "author": "vietnlp", "author_email": "sonvx.coltech@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# Table of contents\n1. [Introduction](#introduction)\n2. [More about ETNLP](#moreaboutETNLP)\n3. [Installation and How to Use](#installation_and_howtouse)\n4. [Download Resources](#Download_Resources)\n\n\n# I. ETNLP: A Toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings \n## A glimpse of ETNLP:\n- Github: https://github.com/vietnlp/etnlp\n- Video: https://vimeo.com/317599106\n\n# II. More about ETNLP :\n## 1. Embedding Evaluator: \nTo compare quality of embedding models on the word analogy task.\n- Input: a pre-trained embedding vector file (word2vec format), and word analogy file.\n- Output: (1) evaluate quality of the embedding model based on the MAP/P@10 score, (2) Paired t-tests to show significant level between different word embeddings.\n\n### 1.1. Note: The word analogy list is created by:\n- Adopt from the English list by selecting suitable categories and translating to the target language (i.e., Vietnamese). \n- Removing inappropriate categories (i.e., category 6, 10, 11, 14) in the target language (i.e., Vietnamese).\n- Adding custom category that is suitable for the target language (e.g., cities and their zones in Vietnam for Vietnamese).\nSince most of this process is automatically done, it can be applied in other languages as well.\n\n### 1.2. Selected categories for Vietnamese: \n> 1. capital-common-countries\n> 2. capital-world\n> 3. currency: E.g., Algeria | dinar | Angola | kwanza\n> 4. city-in-zone (Vietnam's cities and its zone)\n> 5. family (boy|girl | brother | sister)\n> 6. gram1-adjective-to-adverb (NOT USED)\n> 7. gram2-opposite (e.g., acceptable | unacceptable | aware | unaware)\n> 8. gram3-comparative (e.g., bad | worse | big | bigger)\n> 9. gram4-superlative (e.g., bad | worst | big | biggest)\n> 10. gram5-present-participle (NOT USED)\n> 11. gram6-nationality-adjective-nguoi-tieng (e.g., Albania | Albanian | Argentina | Argentinean)\n> 12. gram7-past-tense (NOT USED)\n> 13. gram8-plural-cac-nhung (e.g., banana | bananas | bird | birds) (NOT USED)\n> 14. gram9-plural-verbs (NOT USED)\n\n### 1.3 Evaluation results (in details)\n\n* Analogy: Word Analogy Task\n\n* NER (w): NER task with hyper-parameters selected from the best F1 on validation set.\n\n* NER (w.o): NER task without selecting hyper-parameters from the validation set.\n\n| \ufeff Model | NER.w | NER.w.o \t| Analogy \t|\n|------------------------------\t|------------- | ------------------\t|------------------\t|\n| BiLC3 + w2v \t| 89.01 | 89.41 \t| 0.4796 |\n| BiLC3 + Bert_Base \t| 88.26 | 89.91 | 0.4609 |\n| BiLC3 + w2v_c2v \t| 89.46 | 89.46 \t| 0.4796 |\n| BiLC3 + fastText \t| 89.65 | 89.84 \t| 0.4970 |\n| BiLC3 + Elmo \t| 89.67 | 90.84 \t| **0.4999** |\n| BiLC3 + MULTI_WC_F_E_B | **91.09** | **91.75** \t| 0.4906|\n\n\n## 2. Embedding Extractor: To extract embedding vectors for other tasks.\n- Input: (1) list of input embeddings, (2) a vocabulary file.\n- Output: embedding vectors of the given vocab file in `.txt`, i.e., each line conains the embedding for a word. The file then be compressed in .gz format. This format is widely used in existing NLP Toolkits (e.g., Reimers et al. [1]).\n\n### Extra options:\n- `-input-c2v`: character embedding file\n- `solveoov:1`: to solve OOV words of the 1st embedding. Similarly for more than one embedding: e.g., `solveoov:1:2`.\n\n[1] Nils Reimers and Iryna Gurevych, Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging, 2017, http://arxiv.org/abs/1707.09861, arXiv.\n\n## 3. Visualizer: to explore the embedding space and compare between different embeddings.\n\n### Screenshot of viewing multiple-embeddings side-by-side\n![Alt text](images/etnlp_view_multi_embeddings.png \"Screenshot multiple-embeddings side-by-side\")\n\n### Screenshot of viewing each embedding interactively\n![Alt text](images/etnlp_view_embs.png \"Screenshot example of viewing each embedding interactively\")\n\n\n# III. Installation and How to use ETNLP \n## 1. Installation:\n\nFrom source codes:\n> 1. cd src/codes/\n> 2. python setup.py install\n\nFrom pip\n> 1. pip install etnlp \n\n## 2. Examples\n> 1. cd src/examples\n> 2. python test1_etnlp_preprocessing.py\n> 3. python test2_etnlp_extractor.py\n> 4. python test3_etnlp_evaluator.py\n> 5. python test4_etnlp_visualizer.py\n\n### 3. Visualization\nSide-by-side visualization:\n> 1. sh src/codes/04.run_etnlp_visualizer_sbs.sh\n\nInteractive visualization:\n> 1. sh src/codes/04.run_etnlp_visualizer_inter.sh\n\n\n# IV. Available Lexical Resources \n## 1. Word Analogy List for Vietnamese\n\n| \ufeff Word Analogy List | Download Link (NER Task)| Download Link (General)| \n|------------------------------|---------------|---------------|\n| Vietnamese (This work) | [Link1](https://drive.google.com/open?id=1cBYVGwU59slI6bTe2maOW0_xj_e7XugW)| [Link1]|\n| English (Mirkolov et al. [2]) | [Link2]| [Link2]|\n| Portuguese (Hartmann et al. [3]) | [Link3]| [Link3](https://github.com/nathanshartmann/portuguese_word_embeddings/blob/master/analogies/testset/LX-4WAnalogies.txt)|\n\n\n\n## 2. Multiple pre-trained embedding models for Vietnamese\n\n- Training data: Wiki in Vietnamese:\n\n| \ufeff # of sentences | # of tokenized words| \n|------------------------------|---------------|\n| \ufeff 6,685,621 | 114,997,587 |\n\n\n- Download Pre-trained Embeddings:
\n(Note: The MULTI_WC_F_E_B is the concatenation of four embeddings: W2V_C2V, fastText, ELMO, and Bert_Base.)\n\n| \ufeff Embedding Model | Download Link (NER Task) | Download Link (AIVIVN SentiTask) | Download Link (General) | \n|------------------------------|---------------|---------------|---------------|\n| w2v | [Link1](https://drive.google.com/open?id=1D3RvwFiRvvbP6GIcuff6LTUmm1i3yvdF) (dim=300)| [Link1] | [Link1] |\n| w2v_c2v | [Link2](https://drive.google.com/open?id=1YWDmVq6ku7OzY8-Rsm_MsWQ5lH-uUb4h) (dim=300)| [Link2] | [Link2] |\n| fastText | [Link3](https://drive.google.com/open?id=1EGsNoPX3acNCjYKAVD0T2r04A55lnEmH) (dim=300)| [Link3] | [Link3] |\n| Elmo | [Link4](https://drive.google.com/open?id=1U2eWRWwgbva9OMyRvJKopn8FsmTLqMiq) (dim=1024)| [Link4](https://drive.google.com/open?id=1DK-aPQNuJcvj8GTPJS8A2xRm5A12UuNi) (dim=1024)| [Link4] |\n| Bert_base | [Link5](https://drive.google.com/open?id=1ZEI6jHFYn-eptuf1nEwWQY8U4OXGeLx3) (dim=768)| [Link5] | [Link5] |\n| MULTI_WC_F_E_B | [Link6](https://drive.google.com/file/d/1p2EGSVrVl6TIbq6pZxhEuleH26qXxDts) (dim=2392)| [Link6] | [Link6] |", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/vietnlp/etnlp", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "ETNLP", "package_url": "https://pypi.org/project/ETNLP/", "platform": "", "project_url": "https://pypi.org/project/ETNLP/", "project_urls": { "Homepage": "https://github.com/vietnlp/etnlp" }, "release_url": "https://pypi.org/project/ETNLP/0.1.1/", "requires_dist": null, "requires_python": "", "summary": "ETNLP: Embedding Toolkit for NLP Tasks", "version": "0.1.1" }, "last_serial": 5152119, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "5b4669b6b1a67a3a9ee2691098cbe927", "sha256": "782dc2d5610b2fe26f97a40a1a1110e00dd34c7698ad22cd3f6d26d9a340ca37" }, "downloads": -1, "filename": "ETNLP-0.1.0-py3.6.egg", "has_sig": false, "md5_digest": "5b4669b6b1a67a3a9ee2691098cbe927", "packagetype": "bdist_egg", "python_version": "3.6", "requires_python": null, "size": 72106, "upload_time": "2019-04-16T17:47:32", "url": "https://files.pythonhosted.org/packages/50/64/3ac497937a179aa3c0fc1bce71469892258845196bb72bf0adabc83b7bfb/ETNLP-0.1.0-py3.6.egg" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "4c536ac4bdaba2f2ed0dd6dafe5d2901", "sha256": "bd1ec09d025d7a3a4e21a222a8a947fbc34fef4f433d1dccef65b32641973d75" }, "downloads": -1, "filename": "ETNLP-0.1.1-py3.6.egg", "has_sig": false, "md5_digest": "4c536ac4bdaba2f2ed0dd6dafe5d2901", "packagetype": "bdist_egg", "python_version": "3.6", "requires_python": null, "size": 72258, "upload_time": "2019-04-16T21:19:40", "url": "https://files.pythonhosted.org/packages/32/c8/f5bef9d60790c0c4ec767150c33fb4e3af57e4a5aa8acffed0881d1ee978/ETNLP-0.1.1-py3.6.egg" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4c536ac4bdaba2f2ed0dd6dafe5d2901", "sha256": "bd1ec09d025d7a3a4e21a222a8a947fbc34fef4f433d1dccef65b32641973d75" }, "downloads": -1, "filename": "ETNLP-0.1.1-py3.6.egg", "has_sig": false, "md5_digest": "4c536ac4bdaba2f2ed0dd6dafe5d2901", "packagetype": "bdist_egg", "python_version": "3.6", "requires_python": null, "size": 72258, "upload_time": "2019-04-16T21:19:40", "url": "https://files.pythonhosted.org/packages/32/c8/f5bef9d60790c0c4ec767150c33fb4e3af57e4a5aa8acffed0881d1ee978/ETNLP-0.1.1-py3.6.egg" } ] }