{
    "info": {
        "author": "Tushin Kirill",
        "author_email": "kirya.tushin1@yandex.ru",
        "bugtrack_url": null,
        "classifiers": [
            "License :: OSI Approved :: MIT License",
            "Operating System :: OS Independent",
            "Programming Language :: Python :: 3"
        ],
        "description": "This package lets you apply target mean encoding to categorical features.\n\nCategorical features can be encoded in several ways. The simplest is to map each category to an integer from 0 to n-1, where n is the number of unique values. This is called LabelEncoding.\n\n![first](img/1.png)\n\nHere we encoded\n\"Moscow\": 0,\n\"New York\": 1,\n\"Rome\": 2\n\nAnother method is called OneHotEncoding. Instead of a single feature we create n features, where n is the number of unique values. For each object, every new feature is 0 except the k-th one, which is 1, where k is the index of the object's category.\n\n![second](img/2.png)\n\nThis package uses another method of encoding categorical features: encoding each category by the mean value of the target.\n\n![third](img/3.png)\n\nMean encoding is better than LabelEncoding, because a histogram of predictions made with label and mean encoding shows that mean encoding tends to group the classes together, whereas the grouping under LabelEncoding is random.\n![fourth](img/4.png)\n\n___\n\nConsider the next example, a table with information about the categories in the data. Some categories occur very rarely or do not occur in the dataset at all. Such categories can hurt the model, because it can overfit to them. For example, Rome appears only once and its target is 0, so whenever we encode Rome we replace it with 0. That is the problem: our algorithm will overfit. 
To avoid this, we use smoothing.\n\n![fifth](img/5.png)\n\n![sixth](img/6.png)\n\nAs you can see, smoothing solves the problem with small classes: their encodings are smoothed and shifted toward the mean value.\n\n___\n\nNext we encode the Train and Test datasets.\n\nTo avoid overfitting, we split Train into folds when encoding it. If we also use validation, then to avoid overfitting on the validation data we do another split into folds inside each fold.\nFor the Test dataset, we use all the data from the Train dataset for encoding.\n\n![seventh](img/7.png)\n\n___\n\nOnce we have the mean-encoded features, there are 3 ways to use them.\n1. Train the model on the new features only.\n2. Train the model on the new and the original features.\n3. Take the average of the new features and use it as the prediction.\n\nThe \"experiments\" folder contains the results of comparing these methods.\n\n___\n\nExample of usage\n```python\nfrom target_encoding import TargetEncoderClassifier\nfrom target_encoding import TargetEncoder\n\nfrom sklearn.datasets import load_breast_cancer\nfrom sklearn.model_selection import train_test_split\nfrom sklearn.ensemble import RandomForestClassifier\nfrom sklearn.metrics import roc_auc_score\n\n\nX, y = load_breast_cancer(return_X_y=True)\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)\n\nenc = TargetEncoder()\nnew_X_train = enc.transform_train(X=X_train, y=y_train)\nnew_X_test = enc.transform_test(X_test)\n\nrf = RandomForestClassifier(n_estimators=100, random_state=42)\nrf.fit(X_train, y_train)\npred = rf.predict_proba(X_test)[:,1]\nprint('without target encoding', roc_auc_score(y_test, pred))\n\nrf.fit(new_X_train, y_train)\npred = rf.predict_proba(new_X_test)[:,1]\nprint('with target encoding', roc_auc_score(y_test, pred))\n\nenc = TargetEncoderClassifier()\nenc.fit(X_train, y_train)\npred = enc.predict_proba(X_test)[:,1]\nprint('target encoding classifier', 
roc_auc_score(y_test, pred))\n```\n```\nwithout target encoding 0.9952505732066819\nwith target encoding 0.996560759908287\ntarget encoding classifier 0.9973796265967901\n\n```\n\n___\nYou can install the package with pip:\n```\npip install target_encoding\n```\n\n___\n```\nRequirements:\n    numpy==1.16.2\n    scikit-learn==0.20.3\n```",
        "description_content_type": "",
        "docs_url": null,
        "download_url": "",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/KirillTushin/target_encoding",
        "keywords": "",
        "license": "",
        "maintainer": "",
        "maintainer_email": "",
        "name": "target_encoding",
        "package_url": "https://pypi.org/project/target_encoding/",
        "platform": "",
        "project_url": "https://pypi.org/project/target_encoding/",
        "project_urls": {
            "Homepage": "https://github.com/KirillTushin/target_encoding"
        },
        "release_url": "https://pypi.org/project/target_encoding/0.5.0/",
        "requires_dist": null,
        "requires_python": "",
        "summary": "",
        "version": "0.5.0"
    },
    "last_serial": 5022565,
    "releases": {
        "0.0.1": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "97fadb3200eb6e011f2d33952745240d",
                    "sha256": "cf99cea2b5ef54d0569d3a209b1f57147d4d10bc12b1b968781bf0d0e6527004"
                },
                "downloads": -1,
                "filename": "target_encoding-0.0.1.tar.gz",
                "has_sig": false,
                "md5_digest": "97fadb3200eb6e011f2d33952745240d",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 5839,
                "upload_time": "2019-04-01T14:21:35",
                "url": "https://files.pythonhosted.org/packages/b3/87/13f96e75b035569d994e040c9d7e7ae1139c17be460ba81cefec8f13eb39/target_encoding-0.0.1.tar.gz"
            }
        ],
        "0.5.0": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "3e5e11f40862cf4eed3f0934ac6f7c1f",
                    "sha256": "53fef6f46a253e04fa1c0ff4ef26360c56e331c135145eba92955ba2aad20b0c"
                },
                "downloads": -1,
                "filename": "target_encoding-0.5.0.tar.gz",
                "has_sig": false,
                "md5_digest": "3e5e11f40862cf4eed3f0934ac6f7c1f",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 5851,
                "upload_time": "2019-04-01T14:49:13",
                "url": "https://files.pythonhosted.org/packages/09/a2/6fbd86b463b1aff46316efb9ac3d86ce7efdc39d65608d808fada5dae07f/target_encoding-0.5.0.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "3e5e11f40862cf4eed3f0934ac6f7c1f",
                "sha256": "53fef6f46a253e04fa1c0ff4ef26360c56e331c135145eba92955ba2aad20b0c"
            },
            "downloads": -1,
            "filename": "target_encoding-0.5.0.tar.gz",
            "has_sig": false,
            "md5_digest": "3e5e11f40862cf4eed3f0934ac6f7c1f",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 5851,
            "upload_time": "2019-04-01T14:49:13",
            "url": "https://files.pythonhosted.org/packages/09/a2/6fbd86b463b1aff46316efb9ac3d86ce7efdc39d65608d808fada5dae07f/target_encoding-0.5.0.tar.gz"
        }
    ]
}