{ "info": { "author": "Justin Yang", "author_email": "", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# DeepToxic\n\nThis is part of 27th solution for the [toxic comment classification \nchallenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/). \nFor easy understanding, I only uploaded what I used in the final stage, \nand did not attach any experimental or deprecated codes.\n\n## Dataset and External pretrained embeddings\n\nYou can fetch the dataset \n[here](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data). \nI used 3 kind of word embeddings:\n\n* [FastText crawl 300d \n2M](https://www.kaggle.com/yekenot/fasttext-crawl-300d-2m)\n* [glove.840B.300d](https://nlp.stanford.edu/projects/glove/) \n* [glove.twitter.27B](https://nlp.stanford.edu/projects/glove/)\n\n## Overview\n\n### Preprocessing\n\nWe trained our models on 3 datasets with different preprocessing:\n\n* original dataset with spellings correction: by comparing the \nLevenshtein distance and a lot of regular expressions.\n* original dataset with pos taggings: We generate the part of speech \n(POS) tagging for every comment by TextBlob and concatenate the word \nembedding and POS embedding as a single one. Since TextBlob drops some \ntokens and punctuations when generating the POS sequences, that gives \nour models another view. \n* Riad's dataset: with very heavily data-cleaning, spelling correction \nand translation\n\n### Models\n\nIn our case, the simpler, the better. I tried some complicated \nstructures (RHN, DPCNN, HAN). Most of them had performed very well \nlocally but got lower AUC on the leaderboard. 
The models I kept using \nduring the final stage are the following two:\n\nPooled RNN (public: 0.9862, private: 0.9858)\n![pooledRNN](https://i.imgur.com/AQkbPn7.png)\n\nKmax text CNN (public: 0.9856, private: 0.9849)\n![kmaxCNN](https://i.imgur.com/WfbXVh3.png)\n\nAs many competitors pointed out, dropout and batch normalization are the \nkeys to preventing overfitting. Applying dropout directly to the word \nembeddings and after the pooling layers provides strong regularization on \nboth the train and test sets. Although a model with many dropout layers \ntakes about 5 more epochs to converge, it boosts our scores significantly. For \ninstance, my RNN improved from 0.9853 (private: 0.9850) to 0.9862 \n(private: 0.9858) after adding dropout layers.\n\nTo maximize the utility of these datasets, besides training on the \noriginal labels, we also add a meta-label, \"bad_comment\". If a comment has \nany label, it is considered a bad comment. The hypotheses of these two \nlabel sets are slightly different but give almost the same LB score, \nwhich leaves us room for ensembling.\n\nTo increase diversity and to deal with some toxic typos, we \ntrained the models on both the char level and the word level. The \nchar-level results are a bit worse (charRNN: 0.983 on LB, 0.982 on PB; \ncharCNN: 0.9808 on LB, 0.9801 on PB), but they have a pretty low \ncorrelation with the word-level models. Simply bagging my char-level and \nword-level results was enough to push me above 0.9869 on the \nprivate test set. By the way, the hyperparameters hugely influence \nperformance in the char-based models. A large batch size (256) and a \nvery long sequence length (1000) would ordinarily give a considerable \nresult, even though it takes much more time for K-fold validation. 
(My \nchar-based models usually converge after 60~70 epochs, which is about 5 \ntimes more than my word-based models.)\n\n## Performance of Single models\n\nScored by AUC on the private test set.\n\n### Word level\n\n|Model|Fasttext|Glove|Twitter|\n|-----|--------|-----|-------|\n|AVRNN|0.9858|0.9855|0.9843|\n|Meta-AVRNN|0.9850|0.9849|No data|\n|Pos-AVRNN|0.9850|No data|0.9841|\n|AVCNN|0.9846|0.9845|0.9841|\n|Meta-AVCNN|0.9844|0.9844|No data|\n|Pos-AVCNN|0.9850|No data|No data|\n|KmaxTextCNN|0.9849|0.9845|0.9835|\n|TextCNN|0.9837|No data|No data|\n|RCNN|0.9847|0.9842|0.9832|\n|RHN|0.9842|No data|No data|\n\n### Char level\n\n|Model|AUC|\n|-----|------|\n|AVRNN|0.9821|\n|KmaxCNN|0.9801|\n|AVCNN|0.9797|\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/zake7749/DeepToxic/tree/master/sotoxic", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "sotoxic", "package_url": "https://pypi.org/project/sotoxic/", "platform": "", "project_url": "https://pypi.org/project/sotoxic/", "project_urls": { "Homepage": "https://github.com/zake7749/DeepToxic/tree/master/sotoxic" }, "release_url": "https://pypi.org/project/sotoxic/1.0/", "requires_dist": null, "requires_python": "", "summary": "top 1% solution to toxic comment classification challenge on Kaggle", "version": "1.0" }, "last_serial": 4605294, "releases": { "1.0": [ { "comment_text": "", "digests": { "md5": "55f8bd2821e9d315e8f2c0dfd4b0d351", "sha256": "0c78926f7c25c6becb2b511cfadc54f157858264c21932ae2fb7b1ecd6675002" }, "downloads": -1, "filename": "sotoxic-1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "55f8bd2821e9d315e8f2c0dfd4b0d351", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 20636, "upload_time": "2018-12-16T19:05:40", "url": 
"https://files.pythonhosted.org/packages/c7/81/e250833325c8e05bb31c4a48ccf5a074f27ba487907d9a279563392ad672/sotoxic-1.0-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "55f8bd2821e9d315e8f2c0dfd4b0d351", "sha256": "0c78926f7c25c6becb2b511cfadc54f157858264c21932ae2fb7b1ecd6675002" }, "downloads": -1, "filename": "sotoxic-1.0-py3-none-any.whl", "has_sig": false, "md5_digest": "55f8bd2821e9d315e8f2c0dfd4b0d351", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 20636, "upload_time": "2018-12-16T19:05:40", "url": "https://files.pythonhosted.org/packages/c7/81/e250833325c8e05bb31c4a48ccf5a074f27ba487907d9a279563392ad672/sotoxic-1.0-py3-none-any.whl" } ] }