{ "info": { "author": "icoxfog417", "author_email": "icoxfog417@yahoo.co.jp", "bugtrack_url": null, "classifiers": [], "description": "# chariot\n\n[![PyPI version](https://badge.fury.io/py/chariot.svg)](https://badge.fury.io/py/chariot)\n[![Build Status](https://travis-ci.org/chakki-works/chariot.svg?branch=master)](https://travis-ci.org/chakki-works/chariot)\n[![codecov](https://codecov.io/gh/chakki-works/chariot/branch/master/graph/badge.svg)](https://codecov.io/gh/chakki-works/chariot)\n\n**Deliver the ready-to-train data to your NLP model.**\n\n\n* Prepare Dataset\n * You can prepare typical NLP datasets through the [chazutsu](https://github.com/chakki-works/chazutsu).\n* Build & Run Preprocess\n * You can build the preprocess pipeline like [scikit-learn Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).\n * Preprocesses for each dataset column are executed in parallel by [Joblib](https://pythonhosted.org/joblib/index.html).\n * Multi-language text tokenization is supported by [spaCy](https://spacy.io/).\n* Format Batch\n * Sampling a batch from preprocessed dataset and format it to train the model (padding etc).\n * You can use pre-trained word vectors through the [chakin](https://github.com/chakki-works/chakin).\n\n**chariot** enables you to concentrate on training your model!\n\n![chariot flow](./docs/images/chariot_feature.gif)\n\n## Install\n\n```\npip install chariot\n```\n\n## Prepare dataset\n\nYou can download various dataset by using [chazutsu](https://github.com/chakki-works/chazutsu). \n\n```py\nimport chazutsu\nfrom chariot.storage import Storage\n\n\nstorage = Storage(\"your/data/root\")\nr = chazutsu.datasets.MovieReview.polarity().download(storage.data_path(\"raw\"))\n\ndf = storage.chazutsu(r.root).data()\ndf.head(5)\n```\n\nThen\n\n```\n\tpolarity\treview\n0\t0\tsynopsis : an aging master art thief , his sup...\n1\t0\tplot : a separated , glamorous , hollywood cou...\n2\t0\ta friend invites you to a movie . this film wo...\n```\n\n`Storage` class manage the directory structure that follows [cookie-cutter datascience](https://drivendata.github.io/cookiecutter-data-science/).\n\n```\nProject root\n \u2514\u2500\u2500 data\n \u251c\u2500\u2500 external <- Data from third party sources (ex. word vectors).\n \u251c\u2500\u2500 interim <- Intermediate data that has been transformed.\n \u251c\u2500\u2500 processed <- The final, canonical datasets for modeling.\n \u2514\u2500\u2500 raw <- The original, immutable data dump.\n```\n\n## Build & Run Preprocess\n\n### Build a preprocess pipeline\n\nAll preprocessors are defined at `chariot.transformer`. \nTransformers are implemented by extending [scikit-learn `Transformer`](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html). \nBecause of this, the API of Transformer is familiar to you. And you can mix [scikit-learn's preprocessors](https://scikit-learn.org/stable/modules/preprocessing.html).\n\n```py\nimport chariot.transformer as ct\nfrom chariot.preprocessor import Preprocessor\n\n\npreprocessor = Preprocessor()\npreprocessor\\\n .stack(ct.text.UnicodeNormalizer())\\\n .stack(ct.Tokenizer(\"en\"))\\\n .stack(ct.token.StopwordFilter(\"en\"))\\\n .stack(ct.Vocabulary(min_df=5, max_df=0.5))\\\n .fit(train_data)\n\npreprocessor.save(\"my_preprocessor.pkl\")\n\nloaded = Preprocessor.load(\"my_preprocessor.pkl\")\n```\n\nThere is 6 type of transformers are prepared in chariot.\n\n* TextPreprocessor\n * Preprocess the text before tokenization.\n * `TextNormalizer`: Normalize text (replace some character etc).\n * `TextFilter`: Filter the text (delete some span in text stc).\n* Tokenizer\n * Tokenize the texts.\n * It powered by [spaCy](https://spacy.io/) and you can choose [MeCab](https://github.com/taku910/mecab) or [Janome](https://github.com/mocobeta/janome) for Japanese.\n* TokenPreprocessor\n * Normalize/Filter the tokens after tokenization.\n * `TokenNormalizer`: Normalize tokens (to lower, to original form etc).\n * `TokenFilter`: Filter tokens (extract only noun etc).\n* Vocabulary\n * Make vocabulary and convert tokens to indices.\n* Formatter\n * Format (preprocessed) data for training your model.\n* Generator\n * Genrate target data to train your (language) model.\n\n### Build a preprocess for dataset\n\nWhen you want to make preprocess to each of your dataset column, you can use `DatasetPreprocessor`.\n\n```py\nfrom chariot.dataset_preprocessor import DatasetPreprocessor\nfrom chariot.transformer.formatter import Padding\n\n\ndp = DatasetPreprocessor()\ndp.process(\"review\")\\\n .by(ct.text.UnicodeNormalizer())\\\n .by(ct.Tokenizer(\"en\"))\\\n .by(ct.token.StopwordFilter(\"en\"))\\\n .by(ct.Vocabulary(min_df=5, max_df=0.5))\\\n .by(Padding(length=pad_length))\\\n .fit(train_data[\"review\"])\ndp.process(\"polarity\")\\\n .by(ct.formatter.CategoricalLabel(num_class=3))\n\n\npreprocessed = dp.preprocess(data)\n\n# DatasetPreprocessor has multiple preprocessor.\n# Because of this, save file format is `tar.gz`.\ndp.save(\"my_dataset_preprocessor.tar.gz\")\n\nloaded = DatasetPreprocessor.load(\"my_dataset_preprocessor.tar.gz\")\n```\n\n## Train your model with chariot\n\n`chariot` has feature to traing your model.\n\n```py\nformatted = dp(train_data).preprocess().format().processed\n\nmodel.fit(formatted[\"review\"], formatted[\"polarity\"], batch_size=32,\n validation_split=0.2, epochs=15, verbose=2)\n\n```\n\n```py\nfor batch in dp(train_data.preprocess().iterate(batch_size=32, epoch=10):\n model.train_on_batch(batch[\"review\"], batch[\"polarity\"])\n\n```\n\nYou can use pre-trained word vectors by [chakin](https://github.com/chakki-works/chakin). \n\n\n```py\nfrom chariot.storage import Storage\nfrom chariot.transformer.vocabulary import Vocabulary\n\n# Download word vector\nstorage = Storage(\"your/data/root\")\nstorage.chakin(name=\"GloVe.6B.50d\")\n\n# Make embedding matrix\nvocab = Vocabulary()\nvocab.set([\"you\", \"loaded\", \"word\", \"vector\", \"now\"])\nembed = vocab.make_embedding(storage.data_path(\"external/glove.6B.50d.txt\"))\nprint(embed.shape) # (len(vocab.count), 50)\n```", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/chakki-works/chariot", "keywords": "machine learning,nlp,natural language processing", "license": "Apache License 2.0", "maintainer": "", "maintainer_email": "", "name": "chariot", "package_url": "https://pypi.org/project/chariot/", "platform": "", "project_url": "https://pypi.org/project/chariot/", "project_urls": { "Homepage": "https://github.com/chakki-works/chariot" }, "release_url": "https://pypi.org/project/chariot/0.5.5/", "requires_dist": null, "requires_python": "", "summary": "Deliver the ready-to-train data to your NLP model.", "version": "0.5.5" }, "last_serial": 5994007, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "5a0d5fcfdb7e39e65a080e979cb9d2c3", "sha256": "ebeaaaf3f53eafa5404bbc7d867391b8bd7be1107be9f562c7d4b025bbbff371" }, "downloads": -1, "filename": "chariot-0.0.1.tar.gz", "has_sig": false, "md5_digest": "5a0d5fcfdb7e39e65a080e979cb9d2c3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7153, "upload_time": "2018-06-11T11:30:13", "url": "https://files.pythonhosted.org/packages/de/9c/8b7e34e88519b8c6702752f5f7030d8892ffd228e2faae05395b66e7ec52/chariot-0.0.1.tar.gz" } ], "0.1.0": [ { "comment_text": "", "digests": { "md5": "9cf42387c1843101202a79d14bd09893", "sha256": "3c42112a8cea2739ef762c8372a3f9dd7b8d18f07dd9dbadd4ffe492bfa9eb6a" }, "downloads": -1, "filename": "chariot-0.1.0.tar.gz", "has_sig": false, "md5_digest": "9cf42387c1843101202a79d14bd09893", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11914, "upload_time": "2018-06-14T08:04:58", "url": "https://files.pythonhosted.org/packages/b7/3e/fa60e02447606af15e0b057b0c63f980f730243049ee42ca714eede36256/chariot-0.1.0.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "14e913a7e516d002cbaff074b8c0b48d", "sha256": "0c6d1b3c23e4709f5e64a0413282aaa5f6a55ed142307beea824435c0e45f193" }, "downloads": -1, "filename": "chariot-0.2.0.tar.gz", "has_sig": false, "md5_digest": "14e913a7e516d002cbaff074b8c0b48d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12651, "upload_time": "2018-06-18T02:43:19", "url": "https://files.pythonhosted.org/packages/a4/23/34236f38e2c761f8bcbf5525529b5f0fc0cbe49ac9a0a56e906f115e4bdd/chariot-0.2.0.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "ab8d24a261d6e543fa06d2c164525d31", "sha256": "8bd8218aecbf14302c38159ebc86350f3535b605ec39bf605ad98dc49f9d5e49" }, "downloads": -1, "filename": "chariot-0.3.0.tar.gz", "has_sig": false, "md5_digest": "ab8d24a261d6e543fa06d2c164525d31", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13689, "upload_time": "2018-06-18T06:06:43", "url": "https://files.pythonhosted.org/packages/24/a4/a573f2ea5d9723d897de9e1ae084799eedd9d8e3f49b52ced098f8433f71/chariot-0.3.0.tar.gz" } ], "0.4.0": [ { "comment_text": "", "digests": { "md5": "5760fe2511f9953f85d93fcbc90ca8bf", "sha256": "cd7bf5be109e03c4c45b4ad78fc76d9815af708369777daa56910657d00516ac" }, "downloads": -1, "filename": "chariot-0.4.0.tar.gz", "has_sig": false, "md5_digest": "5760fe2511f9953f85d93fcbc90ca8bf", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13606, "upload_time": "2018-08-06T06:04:42", "url": "https://files.pythonhosted.org/packages/8d/de/ac013a7a862b361397a0d4ce064da58c3add0e664dbbd1cb153749273536/chariot-0.4.0.tar.gz" } ], "0.4.1": [ { "comment_text": "", "digests": { "md5": "752ffbfc575307a5c8d2878f676abde7", "sha256": "b7db433a49e4606272e46f887d5f5d5cda5f66cf77721ff152baa534d76a7b7d" }, "downloads": -1, "filename": "chariot-0.4.1.tar.gz", "has_sig": false, "md5_digest": "752ffbfc575307a5c8d2878f676abde7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 13672, "upload_time": "2018-08-08T05:18:49", "url": "https://files.pythonhosted.org/packages/51/37/8565ca3737693f7fcaf4fccc56014639fc4766d4a9be7cb43f0f99148c03/chariot-0.4.1.tar.gz" } ], "0.4.2": [ { "comment_text": "", "digests": { "md5": "3f80f06dc4d905a90fcfef265ebc9082", "sha256": "463c1d5ff303b1637116a5eb4a7f30044b727bd14a0ff7da8fb824309754d144" }, "downloads": -1, "filename": "chariot-0.4.2.tar.gz", "has_sig": false, "md5_digest": "3f80f06dc4d905a90fcfef265ebc9082", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14357, "upload_time": "2018-09-27T07:55:28", "url": "https://files.pythonhosted.org/packages/e7/e1/c0192eaa67ee4dce45e4a8a1a699b906c7376c7379fbab71e6cc92d886d7/chariot-0.4.2.tar.gz" } ], "0.4.3": [ { "comment_text": "", "digests": { "md5": "67e75fb339a380d23446b95a61632d1f", "sha256": "910c8cc0647ca31bbfd085c4fb97b08b882ca8c44b7aa60996318731b151a3f3" }, "downloads": -1, "filename": "chariot-0.4.3.tar.gz", "has_sig": false, "md5_digest": "67e75fb339a380d23446b95a61632d1f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14470, "upload_time": "2018-09-27T08:32:29", "url": "https://files.pythonhosted.org/packages/d6/19/b780ae27ee3707a5edaf734884bb201ee1dc9739bb06a4e25843eb5e1442/chariot-0.4.3.tar.gz" } ], "0.4.4": [ { "comment_text": "", "digests": { "md5": "ffcb8da429fa011cd025ef63625d7e90", "sha256": "fef8057c84831543cba3af87ebfea7c03603d6740a5a30e8aa13f472c6ab2235" }, "downloads": -1, "filename": "chariot-0.4.4.tar.gz", "has_sig": false, "md5_digest": "ffcb8da429fa011cd025ef63625d7e90", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15124, "upload_time": "2018-10-01T09:45:40", "url": "https://files.pythonhosted.org/packages/9d/b1/e509e98621bd264805e4b1a9b962ed69fcf4ec6d1e49ae192fbb0b21d1af/chariot-0.4.4.tar.gz" } ], "0.4.5": [ { "comment_text": "", "digests": { "md5": "0765835e1484060a80588aae0af08f90", "sha256": "ce42630f4f7f294359dffde342a54a9c191a800d3621d341b5f90368ffbdd5ae" }, "downloads": -1, "filename": "chariot-0.4.5.tar.gz", "has_sig": false, "md5_digest": "0765835e1484060a80588aae0af08f90", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15626, "upload_time": "2018-10-03T08:10:28", "url": "https://files.pythonhosted.org/packages/e8/c9/be032662d3e79e95ca8ede99d0790dedd52a9ebe10899b130980be36eb6e/chariot-0.4.5.tar.gz" } ], "0.4.6": [ { "comment_text": "", "digests": { "md5": "402c1b32b3931700106119e7c10681d3", "sha256": "26e91ec1de2cb794681a5835fce8e3e51154c0905be2765e6587c3ced249e9e5" }, "downloads": -1, "filename": "chariot-0.4.6.tar.gz", "has_sig": false, "md5_digest": "402c1b32b3931700106119e7c10681d3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15782, "upload_time": "2018-10-11T09:38:16", "url": "https://files.pythonhosted.org/packages/51/ea/8a5b292f697dc83fd696930e5dad35e5c4633ee0a827f49c9c65ff0311eb/chariot-0.4.6.tar.gz" } ], "0.4.7": [ { "comment_text": "", "digests": { "md5": "a1bd4c51df1b8f54e800dd1f7e7e726f", "sha256": "d8c633d30211b4b1f87128447ed4f569fd5a45306b1d00eebba7318b53b0245e" }, "downloads": -1, "filename": "chariot-0.4.7.tar.gz", "has_sig": false, "md5_digest": "a1bd4c51df1b8f54e800dd1f7e7e726f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15823, "upload_time": "2018-10-23T05:06:35", "url": "https://files.pythonhosted.org/packages/d4/29/b842e5f75bbba3b82f63d0844fe70576a6607c3cd86a070c975db392bce9/chariot-0.4.7.tar.gz" } ], "0.4.8": [ { "comment_text": "", "digests": { "md5": "af8c73eee7fe97bc4f4bc61f7b37890a", "sha256": "5f68528d6b2029aed63adc1fc1dfaa22082874718616ca0307381a42db1ad7d9" }, "downloads": -1, "filename": "chariot-0.4.8.tar.gz", "has_sig": false, "md5_digest": "af8c73eee7fe97bc4f4bc61f7b37890a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15759, "upload_time": "2018-11-29T06:29:50", "url": "https://files.pythonhosted.org/packages/9e/df/d0f73dcbb38a49c76a518d08ef03e2e4eac55a24de039ced4beb6281f24f/chariot-0.4.8.tar.gz" } ], "0.4.9": [ { "comment_text": "", "digests": { "md5": "8490c509ecfd1f9f0f421ffbcde1636f", "sha256": "8208a69345a3a07bde7a7b9ec466d2e656c94a5e1d8ad3567c264b99f5459218" }, "downloads": -1, "filename": "chariot-0.4.9.tar.gz", "has_sig": false, "md5_digest": "8490c509ecfd1f9f0f421ffbcde1636f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15665, "upload_time": "2018-12-25T07:54:55", "url": "https://files.pythonhosted.org/packages/b3/30/0c06fc90de76f16a66508b73e73e389e4f5b44f6ad5ddcb5f27465870df1/chariot-0.4.9.tar.gz" } ], "0.5.0": [ { "comment_text": "", "digests": { "md5": "a4d94af708656bf9a91715fb1cc0b6c2", "sha256": "15660c5a5600de02bdca55835e12ecf2c3653195807207fcf1dedd9f823921e7" }, "downloads": -1, "filename": "chariot-0.5.0.tar.gz", "has_sig": false, "md5_digest": "a4d94af708656bf9a91715fb1cc0b6c2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16817, "upload_time": "2019-02-21T07:14:28", "url": "https://files.pythonhosted.org/packages/b9/83/59f776ff92a51750b24d43c504daeca557a9fca3b90e71593f230e7068dd/chariot-0.5.0.tar.gz" } ], "0.5.1": [ { "comment_text": "", "digests": { "md5": "5aa29df2b94030735ecbcdbbf054bf84", "sha256": "4c8d6b2308f8ed0c64ae4cb6f39fe7f0eef9e09c22b17d9f153ffa4625d88d17" }, "downloads": -1, "filename": "chariot-0.5.1.tar.gz", "has_sig": false, "md5_digest": "5aa29df2b94030735ecbcdbbf054bf84", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16804, "upload_time": "2019-02-27T06:07:40", "url": "https://files.pythonhosted.org/packages/c9/12/d8aaf56820c97037c108a4f9621df768a130037238a0b03212295edd00c7/chariot-0.5.1.tar.gz" } ], "0.5.2": [ { "comment_text": "", "digests": { "md5": "ec3d38d3ff1b740693da7009bb27f7bd", "sha256": "3faa97ba568a1702083b2808159dc8e86036578686ab5ede38fb1abc3379a305" }, "downloads": -1, "filename": "chariot-0.5.2.tar.gz", "has_sig": false, "md5_digest": "ec3d38d3ff1b740693da7009bb27f7bd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 20146, "upload_time": "2019-04-25T07:37:43", "url": "https://files.pythonhosted.org/packages/e4/97/ca814e3862b7859df8b78d2383edafc98e848e61845d27958c95752dfc42/chariot-0.5.2.tar.gz" } ], "0.5.5": [ { "comment_text": "", "digests": { "md5": "7de0cbafc36d834111409199b16cd4ab", "sha256": "a2d8e6e5b6e8c1df5f89cb8d713fffcb36c1c4ee9cda2df0cdd7e68f10fdd96c" }, "downloads": -1, "filename": "chariot-0.5.5.tar.gz", "has_sig": false, "md5_digest": "7de0cbafc36d834111409199b16cd4ab", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3581056, "upload_time": "2019-10-18T06:25:19", "url": "https://files.pythonhosted.org/packages/30/15/81efc8955beaba63353fd0d3f8e121286b1035650b4637fba58d6b8ed12c/chariot-0.5.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "7de0cbafc36d834111409199b16cd4ab", "sha256": "a2d8e6e5b6e8c1df5f89cb8d713fffcb36c1c4ee9cda2df0cdd7e68f10fdd96c" }, "downloads": -1, "filename": "chariot-0.5.5.tar.gz", "has_sig": false, "md5_digest": "7de0cbafc36d834111409199b16cd4ab", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3581056, "upload_time": "2019-10-18T06:25:19", "url": "https://files.pythonhosted.org/packages/30/15/81efc8955beaba63353fd0d3f8e121286b1035650b4637fba58d6b8ed12c/chariot-0.5.5.tar.gz" } ] }