{ "info": { "author": "Qingyang Wu", "author_email": "wilwu@ucdavis.edu", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# Chinese-GPT \u4e2d\u6587GPT\u9884\u8bad\u7ec3\u6a21\u578b\n\nChinese Generative Pre-Training(GPT) Language Model\n\nThis project is unidirectional transformer GPT model (117M) trained on a large corpus dataset following the approach [OpenAI GPT-2](https://openai.com/blog/better-language-models/). Due to limited computational resources, we did not train our model from scratch. Instead, we take the advantage of [BERT](https://arxiv.org/abs/1810.04805) and use its weights as initialization to train our Chinese GPT. This makes the training possible on 4 x 1080Ti.\n\nHowever, please notice that currently the performance still cannot match the original English GPT-2 model for various reasons. This can be that OpenAI has done better text filtering and has a dataset with better quality. Also, they have trained their model for about 300 GPU days at least. But the model here can be a good starting point if you want to apply it for substream tasks. \n\n## Features\n\nThis repository contains a rewritten cached Transformed based on BERT, which is the same technique used in GPT-2 implementation. It can cache the intermediate results, and therefore save the compuation time and memory during the decoding stage. \n\nAlso, a CUDA kernel version of GELU activation function is provided. You have to insatll [Cupy](https://github.com/cupy/cupy) to use it. You can check [cuda_gelu](https://github.com/qywu/Chinese-GPT/blob/master/chinese_gpt/cuda_gelu.py) for the implementation. It is 2x faster than the original implementation!\n\n## Installation \nBefore using it, you might want to install the requirements first.\n\n ```bash\n pip install -r requirements.txt\n ```\n\nYou can also install it via `pip`.\n\n ```bash\n pip install chinese-gpt\n ```\n \n## Usage\n\nCheck [tutorials](https://github.com/qywu/Chinese-GPT/tree/master/tutorials) for details.\n\nI have also included a colab for demo: https://colab.research.google.com/drive/1cvBSt2uF7hYL1feDGt0dkCxIeaVXQs5x\n\nEncoder Weights: https://drive.google.com/open?id=1Mr2-x_qT2hgyo0RalPjc09NmyNi6a_gs\n\nDecoder Weights: https://drive.google.com/open?id=1W6n7Kv6kvHthUX18DhdGSzBYkyzDvxYh\n\n## Data Preparation\nTo train GPT, it requires a dataset from a wide range of sources.\n\nWe collected data from [NLP Chinese Corpus](https://github.com/brightmart/nlp_chinese_corpus)\n\nIn details, we used:\n\n- \u793e\u533a\u95ee\u7b54json\u7248(webtext2019zh) \uff1a\u5927\u89c4\u6a21\u9ad8\u8d28\u91cf\u6570\u636e\u96c6\n- \u767e\u79d1\u7c7b\u95ee\u7b54json\u7248(baike2018qa)\n- \u65b0\u95fb\u8bed\u6599json\u7248(news2016zh)\n\n### Text Filtering\n\nOne thing to take care of is that text filtering. Since Bert Chinese tokenizer doesn't include some punctuations. You might want to use the following code to clean your data first:\n\n```python\nimport regex as re\n\ndef filterPunctuation(x):\n x = re.sub(r'[\u2018\u2019]', \"'\", x)\n x = re.sub(r'[\u201c\u201d]', '\"', x)\n x = re.sub(r'[\u2026]', '...', x)\n x = re.sub(r'[\u2014]', '-', x)\n x = re.sub(r\" \", \"\", x)\n return x\n```\n\nYou may also want to convert traditional Chinese to simplified Chinese and apply some other filtering techniques based on your data. \n\n## Reference\n [OpenAI GPT-2](https://openai.com/blog/better-language-models/)\n \n [BERT](https://arxiv.org/abs/1810.04805)\n \n [AllenNLP](https://github.com/allenai/allennlp/)\n \n [Pytorch BERT](https://github.com/huggingface/pytorch-pretrained-BERT)\n \n [NLP Chinese Corpus](https://github.com/brightmart/nlp_chinese_corpus)", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/qywu/Chinese-GPT", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "chinese-gpt", "package_url": "https://pypi.org/project/chinese-gpt/", "platform": "", "project_url": "https://pypi.org/project/chinese-gpt/", "project_urls": { "Homepage": "https://github.com/qywu/Chinese-GPT" }, "release_url": "https://pypi.org/project/chinese-gpt/0.1.3/", "requires_dist": null, "requires_python": "", "summary": "Chinese Generative Pre-Training Transformer", "version": "0.1.3" }, "last_serial": 5225234, "releases": { "0.1.2": [ { "comment_text": "", "digests": { "md5": "02f464fdf57025fcb60bbb2b802fe794", "sha256": "5f4deace1a4cf5bb6065274f0c86f49e0b0a07cbaafd42ef51162a6b85832e74" }, "downloads": -1, "filename": "chinese_gpt-0.1.2.tar.gz", "has_sig": false, "md5_digest": "02f464fdf57025fcb60bbb2b802fe794", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6009, "upload_time": "2019-05-03T09:51:28", "url": "https://files.pythonhosted.org/packages/db/57/4094c22b80b9768d8b324b15e95d0cb3083a86c1f36de180c95206a14413/chinese_gpt-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "c11d70c6b70796121003397c970171ef", "sha256": "4207a4983965184710c7b11bd7a42d185e7f562b5830b2becf2ddb39ab53ebb3" }, "downloads": -1, "filename": "chinese_gpt-0.1.3.tar.gz", "has_sig": false, "md5_digest": "c11d70c6b70796121003397c970171ef", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6499, "upload_time": "2019-05-04T10:13:06", "url": "https://files.pythonhosted.org/packages/fc/07/545022bcb92f355ae0005c0fb8391ad37679622e18bd7e816ede5727b61e/chinese_gpt-0.1.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "c11d70c6b70796121003397c970171ef", "sha256": "4207a4983965184710c7b11bd7a42d185e7f562b5830b2becf2ddb39ab53ebb3" }, "downloads": -1, "filename": "chinese_gpt-0.1.3.tar.gz", "has_sig": false, "md5_digest": "c11d70c6b70796121003397c970171ef", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6499, "upload_time": "2019-05-04T10:13:06", "url": "https://files.pythonhosted.org/packages/fc/07/545022bcb92f355ae0005c0fb8391ad37679622e18bd7e816ede5727b61e/chinese_gpt-0.1.3.tar.gz" } ] }