{ "info": { "author": "Kensuke Muraki", "author_email": "kensk8er1017@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.5", "Topic :: Scientific/Engineering :: Artificial Intelligence", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Text Processing :: Linguistic" ], "description": "langdist - Character-level Multilingual Language Modeling Toolkit\n=================================================================\n\n`langdist` is a Python project for experimenting with *Character-level Multilingual Language Modeling*, that is, investigating how learning a character-level language model in one language helps learning another character-level language model in a different language. The project is still **under development** and offers limited functionality.\n\n\nFeatures\n--------\n- Download and preprocess multilingual parallel corpora ([Multilingual Bible Parallel Corpus](http://christos-c.com/bible/))\n- Train a *monolingual language model*\n - This is a language model trained in one language\n- Train a *bilingual language model*\n - This is a language model that is trained on top of another language model (the parameters are initialized using another language model's parameters)\n- Generate texts using a trained language model\n\n\nInstallation\n------------\n- This repository can run on Ubuntu 14.04 LTS & Mac OSX 10.x (not tested on other OSs)\n- Tested only on Python 3.5\n\n`langdist` depends on [NumPy and SciPy](https://www.scipy.org/install.html), Python packages for scientific computing. 
You might need to install them before installing `langdist`.\n\nYou can install `langdist` with:\n\n```pip install langdist```\n\nThis installs the `langdist` package into your Python environment, along with the `langdist` command, which is added to your `PATH`.\n\n`langdist` also depends on the `tensorflow` package. By default, it tries to install the CPU-only version of `tensorflow`. If you want to use a GPU, you need to install `tensorflow` with GPU support yourself. (Cf. [Installing Tensorflow](https://www.tensorflow.org/install/))\n\n\nUsage\n-----\n\nAfter installing, `langdist --help` prints help on how to use the `langdist` command.\n\n### 1. Download and preprocess a corpus\n\n`langdist` implements a command to download and preprocess a corpus from the [Multilingual Bible Parallel Corpus](http://christos-c.com/bible/). The following command will download an English corpus and save it to `./en_corpus.pkl`.\n\n```langdist download-bible en en_corpus.pkl```\n\nNote that `en` here is the language code of English. Specifying an invalid language code will produce an error message that shows the valid language codes.\n\n### 2. Fit an encoder on the characters used in corpora\n\nYou need to fit an encoder to the characters used in the corpora before you train a language model on them. Note that the same encoder will be used when you train a new language model on top of another language model (*multilingual language model*). Therefore, you need to fit an encoder to all the corpora you will train multilingual language models on.\n\nThe following command will fit an encoder to English, French, and Japanese corpora and save it to `./en_fr_ja_encoder.pkl`:\n\n```langdist fit-encoder en_fr_ja_encoder.pkl en_corpus.pkl fr_corpus.pkl ja_corpus.pkl```\n\nNote that `xx_corpus.pkl` is a pickle file of a corpus, which can be generated by the `langdist download-bible` command. You can also create a list of texts yourself and save it to a pickle file. 
(Each element of the list would correspond to a segment such as a sentence, paragraph, article, etc., depending on your purpose.)\n\n### 3. Train a language model from scratch (*monolingual language model*)\n\nThe following command will train a French language model and save it to the `./fr_model` directory:\n\n```langdist train fr_corpus.pkl en_fr_ja_encoder.pkl fr_model --patience=819200 --logpath=fr.log```\n\nNote that using an encoder that was not fit to the corpus will throw an exception. The `--patience` option specifies how many iterations you want to keep training and the `--logpath` option specifies the path to the log file that records the progress of the training (no log file will be created if you don't specify the option).\n\nDuring the training, various stats are dumped to the `path_to_model_dir/tensorboard.log` directory. You can visualize them with `tensorboard` by running `tensorboard --logdir=path_to_model_dir/tensorboard.log`. The model is saved every time validation perplexity is computed, so it is available to use before the training finishes.\n\nCheck the output of `langdist --help` to see what other options are available for training a language model.\n\n### 4. Train a new language model on top of another language model (*multilingual language model*)\n\nThe following command will train an English language model on top of the French language model we have trained and save it to the `fr2en_model` directory:\n\n```langdist retrain fr_model en_corpus.pkl fr2en_model --patience=819200 --logpath=langdist.log```\n\nNote that you don't have to specify the path to an encoder because the model in `fr_model` includes it. If the encoder that was used when training `fr_model` was not fit to the characters in `en_corpus.pkl`, the command will throw an exception.\n\nDuring the training, various stats are dumped to the `path_to_model_dir/tensorboard.log` directory. You can visualize them with `tensorboard` by running `tensorboard --logdir=path_to_model_dir/tensorboard.log`. 
The model is saved every time validation perplexity is computed, so it is available to use before the training finishes.\n\nCheck the output of `langdist --help` to see what other options are available for training a language model.\n\n### 5. Generate texts using a trained language model\n\nOnce you have trained a language model, the following command will generate texts using the trained language model:\n\n```langdist generate fr2en_model --sample-num=50```\n\nThe `--sample-num` option sets the number of texts to generate. Note that each text is independently generated (sampled) by the language model.\n\nCheck the output of `langdist --help` to see what other options are available for generating texts.\n\n\n### Use `langdist` from Python\n\n`langdist` can be used as a normal Python package by importing the `langdist` package, which is installed into your Python environment by `pip install langdist`. Reading `langdist/cli.py` is a good way to figure out how to use the package.\n\nTODO: Add a link to the blog post *Bilingual Character-level Neural Language Modeling*", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/kensk8er/langdist", "keywords": "language-model natural-language-processing natural-language-generation machine-learning tensorflow deep-learning recurrent-neural-networks lstm multilingual nlp python neural-network character-embeddings data-science", "license": "", "maintainer": "", "maintainer_email": "", "name": "langdist", "package_url": "https://pypi.org/project/langdist/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/langdist/", "project_urls": { "Homepage": "https://github.com/kensk8er/langdist" }, "release_url": "https://pypi.org/project/langdist/0.4.1/", "requires_dist": [ "docopt (>=0.6.2)", "jieba (>=0.38)", "numpy (>=1.12.0)", "pinyin (>=0.4.0)", "regex (>=2017.2.8)", "scikit-learn (>=0.18.1)", 
"scipy (>=0.18.1)", "tensorflow (>=1.0.1)" ], "requires_python": "", "summary": "Multilingual Language Modeling Toolkit", "version": "0.4.1" }, "last_serial": 2891981, "releases": { "0.3.0": [ { "comment_text": "", "digests": { "md5": "7ac9e2b677261f34f89add598e5b20f6", "sha256": "c2d9f9f77000ed74877c762d378812ac4d80bf484407902e8583020c39f0480a" }, "downloads": -1, "filename": "langdist-0.3.0-py3-none-any.whl", "has_sig": false, "md5_digest": "7ac9e2b677261f34f89add598e5b20f6", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 19184, "upload_time": "2017-05-18T22:39:05", "url": "https://files.pythonhosted.org/packages/7d/0f/7d399af83d4459f27f75e13856f6cf44188f175fd575dfb0b0332fbb4f8e/langdist-0.3.0-py3-none-any.whl" } ], "0.4.1": [ { "comment_text": "", "digests": { "md5": "f89d3e3f31b4c0e6001a79ea376bba8b", "sha256": "81af8cbeecc422a9eae24da4a0c988542090f05e469d7cd3f3ef265b30c22047" }, "downloads": -1, "filename": "langdist-0.4.1-py3-none-any.whl", "has_sig": false, "md5_digest": "f89d3e3f31b4c0e6001a79ea376bba8b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 23651, "upload_time": "2017-05-22T23:41:04", "url": "https://files.pythonhosted.org/packages/c3/07/9407cdf0459d23e2f368017774b3cf3f85545c7622fa8b48bc63295b56c5/langdist-0.4.1-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f89d3e3f31b4c0e6001a79ea376bba8b", "sha256": "81af8cbeecc422a9eae24da4a0c988542090f05e469d7cd3f3ef265b30c22047" }, "downloads": -1, "filename": "langdist-0.4.1-py3-none-any.whl", "has_sig": false, "md5_digest": "f89d3e3f31b4c0e6001a79ea376bba8b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 23651, "upload_time": "2017-05-22T23:41:04", "url": "https://files.pythonhosted.org/packages/c3/07/9407cdf0459d23e2f368017774b3cf3f85545c7622fa8b48bc63295b56c5/langdist-0.4.1-py3-none-any.whl" } ] }