{
"info": {
"author": "Yves Greatti",
"author_email": "yvgrotti@gmail.com",
"bugtrack_url": null,
"classifiers": [],
"description": "\n# RIVALGAN \n\n\n[Background](#background) \n[The Dataset](#the-dataset) \n[Implementation Overview](#implementation-overview)
\n[Usage](#usage)
\n[Visualizing the Data Augmentation Process](#visualizing-the-data-augmentation-process)
\n[GitHub Folder Structure](#github-folder-structure)
\n[Setup script](#setup-script)
\n[Requirements](#requirements)\n\n\n------------\n\n## Background\nImbalanced data sets can make Machine Learning classification problems difficult to solve.\nFor example, suppose we have two classes in our data set where the majority class makes up more than 90% of the dataset \nand the minority class less than 1%, but we are more interested in identifying instances of the minority class.\nMost ML classifiers will reach an accuracy of 90% or more, but this is not useful for our intended use case.\nA more properly calibrated method may achieve a lower accuracy, but would have a higher true positive rate (or recall).\n\n\nMany critical real-world data sets are imbalanced by nature, as in credit card fraud detection or in the \nhealth care industry, due to the scarcity of data. Traditional ways to fix imbalanced datasets are either \noversampling instances of the minority class (e.g., SMOTE) or undersampling instances of the majority class.\n\nTo tackle these issues instead, GANs are used to learn from the real data and generate samples to augment \nthe training dataset. A GAN is composed of a generator and a discriminator (both deep neural networks). \nThe generator tries its best to produce realistic samples by learning from the dataset, and the discriminator learns \nto predict whether a given sample was generated by the generator or comes from the original training dataset. \nThis feedback is in turn used by the generator to produce data as close as possible to the real data, so as to fool\nthe discriminator.\n\n------------\n\n## The Dataset\nThe dataset contains 284,807 transactions that occurred over two days, of which 492 are frauds. \nThe dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.\nEach of these observations corresponds to an individual transaction.\nA binary response variable 'Class' indicates whether or not the transaction is a fraud.\nThe dataset contains 30 features: \n* V1, V2, ...
V28, which are anonymized features, uncorrelated after PCA transformations.\n* 'Time' and 'Amount', which have not been transformed.\n\n------------\n\n## Process Overview and Tech Stack\n\n\n\n------------\n\n## Implementation Overview\n\nThis code trains a generative adversarial network to generate new synthetic data from the credit card fraud CSV file. \nIt can also read data from any other CSV file, but the file will need to be transformed so that the class variable to predict is \nclearly identified. The code provides an API to visualize the synthetic data and compare the data distributions \nof the real and the augmented data. It also allows training different classifiers (LogisticRegression, SVM, \nRandomForest, XGBoost) and comparing their performance on the real and augmented datasets. The synthetic data can\nbe generated using either SMOTE or GANs. Different GAN architectures are provided (Vanilla GAN, \nWasserstein GAN, Improved Wasserstein GAN, Least Squares GAN). Finally, a random n-class dataset for classification\n problems is provided and the decision boundaries are plotted on the real and augmented datasets.\n\n\n ------------\n\n\n ## Usage\n\n ``` bash\n\n\n $ python pipeline.py -h\n\n usage: pipeline.py [-h]\n [--CLASSIFIER {Logit,LinearSVC,RandomForest,SGDClassifier,SVC}]\n [--SAMPLER {SMOTE,SMOTETomek}]\n [--AUGMENTED_DATA_SIZE AUGMENTED_DATA_SIZE]\n [--TOTAL_TRAINING_STEPS TOTAL_TRAINING_STEPS]\n [--GEN_FILENAME GEN_FILENAME]\n [--train_classifier TRAIN_CLASSIFIER]\n [--classifier_scores CLASSIFIER_SCORES]\n [--generate_data GENERATE_DATA]\n [--compute_learning_curves COMPUTE_LEARNING_CURVES]\n [--aug_model_scores AUG_MODEL_SCORES]\n [--plot_augmented_learning_curves PLOT_AUGMENTED_LEARNING_CURVES]\n [--generate_distribution_plots GENERATE_DISTRIBUTION_PLOTS]\n [--compare_scores COMPARE_SCORES]\n [--random_dataset RANDOM_DATASET]\n [--retrieve_real_data_generated_data RETRIEVE_REAL_DATA_GENERATED_DATA]\n\n```\n## API\n\nExamples\n\n ```
python\n\nfrom pipeline import *\n\npipeline = Pipeline()\n\ndata = pipeline.read_process_data()\n\npipeline.run_train_classifier()\n\npipeline.run_classifier_scores_report()\n\ndargs = {\n 'AUGMENTED_DATA_SIZE':5000, \n 'TOTAL_TRAINING_STEPS': 1000,\n 'GAN_NAME':'VGAN'}\npipeline.set_configuration(dargs)\npipeline.run_train_gan()\n\npipeline.compare_classifier_gan_scores()\n\npipeline.generate_distribution_plots()\n\npipeline.plot_augmented_learning_curves()\n\n```\n\n------------\n\n#### Output\n\n```text\n\n------------- Reading data --------------\n\nLoading data from /home/ubuntu/insight/data/creditcard.engineered.pkl\nShape of the data=(284807, 31)\nHead: \n Time V1 V2 V3 V4 V5 V6 \\\n0 -2.495776 -0.760474 -0.059825 1.778510 0.998741 -0.282036 0.366454 \n1 -2.495776 0.645665 0.177226 0.108889 0.326641 0.047566 -0.064642 \n2 -2.495729 -0.759673 -0.946238 1.240864 0.277228 -0.418463 1.425391 \n\n V7 V8 V9 ... V21 V22 V23 \\\n0 0.234118 0.091669 0.343867 ... -0.027953 0.392914 -0.259567 \n1 -0.078505 0.077453 -0.237661 ... -0.405091 -0.908272 0.228784 \n2 0.775964 0.247431 -1.420257 ... 
0.456138 1.094031 2.092428 \n\n V24 V25 V26 V27 V28 Amount Class \n0 0.111992 0.253257 -0.396610 0.399584 -0.090140 1.130025 0 \n1 -0.569582 0.329670 0.267951 -0.031113 0.069997 -1.138642 0 \n2 -1.155079 -0.649083 -0.291089 -0.171222 -0.263354 1.695499 0 \n\n[3 rows x 31 columns]\nNumber of frauds in training data: 379 out of 213605 cases (0.1774303036% fraud)\nNumber of frauds in test data: 113 out of 71202 cases (0.1587034072% fraud)\nNumber of features=30\n\n------------- Training classifier --------------\n\n\nTraining 30 features with classifier SGDClassifier\nTime elapsed to train: 0:00:00.34\nSaving SGDClassifier in /home/ubuntu/insight/cache/SGDClassifier_Fraud.pkl\nNo sampler to train\n\n\n------------- Baseline scores --------------\n\nBaseline classifier SGDClassifier\nLoading classifier SGDClassifier from file /home/ubuntu/insight/cache/SGDClassifier_Fraud.pkl\nPredicting 30 features\nClassification Report: \n pre rec spe f1 geo iba sup\n\n 0 1.00 0.96 0.91 0.98 0.93 0.88 71089\n 1 0.03 0.91 0.96 0.06 0.93 0.87 113\n\navg / total 1.00 0.96 0.91 0.98 0.93 0.88 71202\n\nAccuracy score: 0.9578523075194517\nPrecision score: 0.911504424778761\nRecall score: 0.03329023917259211\nF1 score: 0.06423448705955721\nConfusion Matrix: \n [[68098 2991]\n [ 10 103]] \n\n------------- Training GAN and generating synthetic data -------------- \nTraining WGAN total_steps=1000, #generatedData=5000\nStep: 0\nGenerator loss: 0.6404243111610413 | discriminator loss: 1.3558526039123535 \n\nStep: 100\nGenerator loss: 0.3018853962421417 | discriminator loss: 1.5490034818649292 \n\n```\n\n------------\n\n## Visualizing the Data Augmentation Process\n\n
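The distribution plots produced by `pipeline.generate_distribution_plots()` overlay the real and GAN-generated values of each feature. A minimal sketch of that kind of comparison is shown below; the `real` and `augmented` arrays here are random placeholders standing in for one feature column (e.g. V1) of the real and synthetic datasets, and the histogram-overlap score is an illustrative metric, not part of the pipeline API:

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 10_000)       # stand-in for a real feature column
augmented = rng.normal(0.1, 1.1, 10_000)  # stand-in for GAN-generated samples

# Use shared bin edges so the two histograms are directly comparable.
edges = np.histogram_bin_edges(np.concatenate([real, augmented]), bins=50)
p, _ = np.histogram(real, bins=edges, density=True)
q, _ = np.histogram(augmented, bins=edges, density=True)

# Histogram overlap: 1.0 means identical binned distributions, 0.0 disjoint.
widths = np.diff(edges)
overlap = np.sum(np.minimum(p, q) * widths)

print(f"real mean/std:      {real.mean():+.3f} / {real.std():.3f}")
print(f"augmented mean/std: {augmented.mean():+.3f} / {augmented.std():.3f}")
print(f"histogram overlap:  {overlap:.3f}")
```

The closer the overlap is to 1.0 (and the closer the per-feature means and standard deviations are), the better the generator has captured that feature's distribution.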