{ "info": { "author": "Alexander Whedon", "author_email": "alexander.whedon@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# GapLearn\n\nGapLearn bridges gaps between other machine learning and deep learning tools. All models can be passed to the functions below regardless of the framework they were built upon (scikit-learn, tensorflow, xgboost, or even good ole numpy). \n\nMy first package objective is to bring transparency to the often black-boxy model training process by making by making robust post-mortem analysis of the hyperparameter and feature selection processes possible. The functions below also further automate some processes while still giving the user full control of the results.\n\nMany features are on their way. Unit tests and full documentation will be added shortly.\n\nThe source code is available on [GitHub](https://www.github.com/awhedon/gaplearn)\n\n## Installation\n\n```bash\npip install gaplearn\n```\n\nSee the latest version on [PyPI](https://pypi.org/project/gaplearn/)\n\n## Submodules\n\n### cv\n\nThe `cv` submodule will have these classes (sfs has been released):\n\n#### SFS\n**Description:**\n- This is a sequential feature selector that enables you to perform backwards elimination with any model (not just a linear regression).\n- At each step, the feature that has the lowest permutation importance is selected for removal. The permutation importance is measured as the decrease in accuracy by default, but the user can pass any custom scoring function.\n\n**Improvements on their way:**\n- Add forward selection and all subsets testing\n- Add more built-in scoring functions to for assessing feature permutation importance\n- Create custom permutation scoring method to remove `eli5` as a dependency\n\n##### Methods\n**backwards_elimination(X, y, model, params = {}, fit_function = None, predict_function = None, score_function = None, score_name = 'score', cols = [], verbose = 0)**\n- Run the backwards elimination\n- **Params:**\n\t- X: (DataFrame or matrix) with independent variables\n\t- y: (iterable) dependent corresponding variable values\n\t- params: (dict) parameter set for model\n\t- model: a model architecture to be trained and evaluated\n\t- fit_function: (function) the function that will be used to train the model; the function must accept the parameters `model`, `X`, and `y`; if this value is not set, `backwards_elimination` will attempt to use your model's `fit` method\n\t- predict_function: (function) the function that will be used to make predictions with the model; the function must accept the parameters `model`, and `X`; if this value is not set, `backwards_elimination` will attempt to use your model's `predict` method\n\t- score_function: (function) the function that will be used to score the model and determine the feature permutation importance; the function must accept the parameters `y` and `preds`; if this value is not set, the accuracy will be used\n\t- score_name: (str) name of the score calculated by the `score_function`; 'score' by default\n\t- cols: (list) the names of the columns in the matrix or DataFrame\n\t- verbose: (0, 1, or 2) determines the amount of printing\n\n**get_summary_be()**\n- Get a summary of the results by step; this can be used to more robustly and holistically determine which feature set is ideal for your problem\n\n**get_results_be()**\n- Get the results for each observation within each step; this can be used to more robustly and holistically determine which feature set is ideal for your problem\n\n**get_features_be()**\n- Get all the features used in the sequential feature selection\n\n**get_set_by_score_be(min_score = None, num_steps = 'all')**\n- Get the n feature sets (n is determined by `num_steps`) that achieve a score greather than or equal to `min_score`\n- **Params:**\n \t- min_score: (float or int) the minumum score a feature set must achieve to be returned\n \t- num_steps: (str or int) the number of feature sets to return; 'all' is the only valid str option\n\n**get_set_by_features_be(num_features, max_features = None)**\n- Get the all feature sets that have `num_features` features; if `max_features` is set, feature sets with between `num_features` and `max_features` features will be returned\n- **Params:**\n \t- num_features: (int) the number of features the returned feature set should have; functions as a minimum if `max_features` is set\n \t- max_features: (int) the maximum number of features the returned feature sets should have\n\n\n**Example 1:**\n```python\n#### Perform a backwards elimination with sci-kit learn's random forest model ####\n\nimport pandas as pd\nfrom gaplearn.cv import SFS\n\nX = pd.read_csv('X_classification.csv')\ny = pd.read_csv('y_classification.csv')\n\nfs = SFS()\n\nprint('The backwards elimination has been run: {}'.format(fs.be_complete)) # prints False\n\nfrom sklearn.ensemble import RandomForestClassifier\nrfc = RandomForestClassifier()\n\n# Run the backwards elimination\nfs.backwards_elimination(X, y, model = rfc, params = {'n_jobs': -1})\n\n# Get the step-by-step summary\nsummary = fs.get_summary_be() # Alternatively, `summary = fs.summary_be`\n\n# Get the predictions and true values for each observation\nresults = fs.get_results_be() # Alternatively, `results = fs.results_be\n\n# Get the features used in the analysis\nfeatures = fs.features_be # Alternatives, `sorted(list(results['feature to remove']))\n\n# Identify which feature set can achieve at least 85% accuracy with the smallest number of features\nat_least_85 = fs.get_set_by_score_be(min_score = .85, num_steps = 1)\n\n# Identify the best model with only 4 features\nfeatures_4 = fs.get_set_by_features_be(num_features = 4)\n```\n\n**Example 2:**\n```python\n#### Perform a more complex backwards elimination with sci-kit learn's naive bayes model ####\n\nimport pandas as pd\nfrom gaplearn.cv import SFS\n\nfrom sklearn.linear_model import SGDRegressor\n\nmodel_sgd = SGDRegressor(loss = 'modified_huber', penalty = 'elasticnet')\n\nX = pd.read_csv('X_regression.csv')\ny = pd.read_csv('y_regression.csv')\n\nfs = SFS()\n\n# Define a score_function\ndef mse(y, preds):\n\tscore = sum([(preds[i] - y[i]) ** 2 for i in range(y.shape[0])]) / y.shape[0]\n\treturn score\n\n# Define a predict_function\ndef arbitrary_prediction(model, X):\n\tpreds = model.predict(X) + 1 # arbitrarily deciding to add 1 to the prediction (realistically, this would be a wrapper for model that don't have a `fit` method)\n\treturn preds\n\n# Define a predict_function\ndef predict_w_proba(model, X):\n\tpreds = [1 if x[1] > 0.6 else 0 for x in model.predict_proba(X)]\n\treturn preds\n\n# Run the backwards elimination\nfs.backwards_elimination(X, y, model = model_sgd, predict_function = predict_w_proba, score_function = mse)\n\n# Get the step-by-step summary\nsummary = fs.get_summary_be() # Alternatively, `summary = fs.summary_be`\n\n# Get the predictions and true values for each observation\nresults = fs.get_results_be() # Alternatively, `results = fs.results_be`\n\n# Get the features used in the analysis\nfeatures = fs.features_be # Alternatively, `sorted(list(results['feature to remove']))`\n\n# Identify which two feature sets can achieve at least 85% accuracy with the smallest number of features\nat_least_85 = fs.get_set_by_score_be(min_score = .85, num_steps = 2)\n\n# Identify the best models with 3-5 features\nfeatures_3_5 = fs.get_set_by_features_be(num_features = 3, max_features = 5)\n```\n\n#### SearchCluster\n**Description:**\n- This is a hyperparameter grid/random search for clustering algorithms\n- Unlike other grid/random search algorithms, this one enables you to get the observation-by-observation results from each parameter set so that you can do deep post-mortem analysis of the grid/random search.\n\n**Documentation:**\nComing soon. See the documentation for details, in the meantime: https://github.com/awhedon/gaplearn\n##### Methods\n**search_cluster(model, params, X, cols = None, fit_function = None, label_function = None, centroid_function = None, score_function = None, metric_name = 'score', random = None, centroid = False, verbose = 0)**\n- Run the grid/random hyperparameter search\n- **Params:**\n\t- model: a model architecture to be trained and evaluated\n\t- params: (dict) the hypterparameter grid (ex.: {'C': [0.1, 0.5, 0.9], ...})\n\t- X: (DataFrame or matrix) features to be used for clustering\n\t- cols: (list) the names of the columns in the matrix or DataFrame\n\t- fit_function: (function) the function that will be used to train the model; the function must accept the parameters `model`, `params`, and `X`; if this value is not set, `search_cluster` will attempt to use your model's `fit` method\n\t- label_function: (function) the function that will use the trained model to attach labels to the unseen data; the function must accept the parameter `model`; if this value is not set, `search_cluster` will attempt to extract your model's `labels_` attribute value\n\t- centroid_function: (function) the function that will extract the cluster centroids from the trained model; the function must accept the parameter `model`; if this value is not set, `search_cluster` will attempt to extract your model's `cluster_centers_` attribute value\n\t- score_function: (function) the function that will score the results from the trained model; the function must accept the parameters `X` and `labels`; if this value is not set, `search_cluster` will calculate the silhouette score\n\t- metric_name: (str) the name of the evaluation metric; 'score' is the default\n\t- random: (int or float) if int, the maximum number of parameter sets to use; if float, the percentage of parameter sets to use (.7 means that 70% of the available parameter sets will be used for testing); default is None, which means that all parameter sets will be used\n\t- centroid: (bool) if the model is centroid-based, you can set this to True and define a `centroid_function` (or leave it as None if the model has a `cluster_centers_` attribute) so that the cluster centers will be extracted\n\t- verbose: (0 or 1) determines the amount of printing\n\n**get_best_model(return_scores = False)**\n\n**get_best_models(n = 1, return_scores= False)**\n\n**get_best_params()**\n\n**get_labels()**\n\n**get_param_results()**\n\n\n**Example 1:**\n```python\nfrom gaplearn.cv import SearchCluster\nimport pandas as pd\n\nfrom sklearn import KMeans\n\nsc = SearchCluster()\n\nX = pd.read_csv('X_clustering.csv')\n\nparams = {'': , '': [], '': []}\n\n# Run the hyperparameter grid search\nsc.search_cluster(model = KMeans, params = params, X = X, centroid = True)\n\n# Get the best model\nbest_model = sc.get_best_model() # Alternatively, `best_model = sc.best_model`\n\n# Get the five best models and their performance\nbest_models_5, best_models_5_scores = sc.get_best_models(n = 5, return_scores = True)\n\n# Get the model params that performed best\nbest_params = sc.get_best_params() # Alternatively, `best_params = sc.best_params`\n\n# Get a summary of the results for each parameter set\nparam_results = sc.get_param_results() # Alternatively, `param_results = sc.param_results`\n\n# Get the labels for each observation for each parameter set's k-fold validation to robustly analyze the differences in performance\nlabels = sc.get_labels() # Alternatively, `labels = sc.labels_df`\n```\n\n**Example 2:**\n```python\nfrom gaplearn.cv import SearchCluster\nimport pandas as pd\nimport random\n\nfrom my_module import ModelClass\n\nsc = SearchCluster()\n\nX = pd.read_csv('X_clustering.csv')\n\nparams = {'param1': ['value1', 'value2', 'value3'], 'param2': ['value1', 'value2', 'value3'], 'param3': ['value1', 'value2', 'value3']}\n\ndef fit_function(model, params, X):\n\tmodel(**params).fit_my_model(X)\n\treturn model\n\ndef score_function(X, labels):\n\tscore = random.random()\n\treturn score\n\n# Run the hyperparameter random search, testing 50% of the parameter sets\nsc.search_cluster(model = KMeans, params = params, X = X, fit_function = fit_function, score_function = score_function, random = 0.5)\n\n# Get the best model and results\nbest_model, best_model_scores = sc.get_best_model(return_scores = True)\n\n# Get the five best models and their performance\nbest_models_5, best_models_5_scores = sc.get_best_models(n = 5, return_scores = True)\n\n# Get the model params that performed best\nbest_params = sc.get_best_params() # Alternatively, `best_params = sc.best_params`\n\n# Get a summary of the results for each parameter set\nparam_results = sc.get_param_results() # Alternatively, `param_results = sc.param_results`\n\n# Get the labels for each observation for each parameter set's k-fold validation to robustly analyze the differences in performance\nlabels = sc.get_labels() # Alternatively, `labels = sc.labels_df`\n```\n\n#### Search (in development)\n**Description:**\n- This is a hyperparameter grid/random search for both regression algorithms and classification algorithms\n- Unlike other grid/random search algorithms, this one enables you to get the observation-by-observation results from each parameter set so that you can do deep post-mortem analysis of the grid/random search.\n\n### data_eng\n\nThe `data_eng` submodule will have these classes:\n\n#### DistributedSQL (in development)\n**Description:**\n- This enables users to chunk multi-parameter sql queries and process them on multiple threads.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/awhedon/gaplearn", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "gaplearn", "package_url": "https://pypi.org/project/gaplearn/", "platform": "", "project_url": "https://pypi.org/project/gaplearn/", "project_urls": { "Homepage": "http://github.com/awhedon/gaplearn" }, "release_url": "https://pypi.org/project/gaplearn/0.11/", "requires_dist": null, "requires_python": "", "summary": "Bridging gaps between other machine learning and deep learning tools and making robust, post-mortem analysis possible.", "version": "0.11" }, "last_serial": 5759277, "releases": { "0.10": [ { "comment_text": "", "digests": { "md5": "5eb6b5ce330a8f382463e73d4d8bca4e", "sha256": "c6f1fcc6b174153b5a533dc72a5f6552fcdf6f7c6d271677ada651331a204952" }, "downloads": -1, "filename": "gaplearn-0.10.tar.gz", "has_sig": false, "md5_digest": "5eb6b5ce330a8f382463e73d4d8bca4e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11586, "upload_time": "2019-08-30T08:43:23", "url": "https://files.pythonhosted.org/packages/9a/9f/ab094212aaf99f48e8ddd45fe9c33041316f82e250cc3be765588805b150/gaplearn-0.10.tar.gz" } ], "0.11": [ { "comment_text": "", "digests": { "md5": "29effc7798b3aa77856fbd8a1c21cada", "sha256": "cf5f8af6d0685b21ef60aab1245f5dd2a8eab5525b9a416d2e6359cde0ae40b5" }, "downloads": -1, "filename": "gaplearn-0.11.tar.gz", "has_sig": false, "md5_digest": "29effc7798b3aa77856fbd8a1c21cada", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11649, "upload_time": "2019-08-30T09:15:12", "url": "https://files.pythonhosted.org/packages/65/2a/b1526424e6da60ebe8ea06af8180d391041af7ce48b28527016888b2bb72/gaplearn-0.11.tar.gz" } ], "0.7": [ { "comment_text": "", "digests": { "md5": "d8505e4a18b266c7351e9961459746b7", "sha256": "9e7708e1658ef9210d0429cd5c605a8362d65bee8671824ef12e8dc7254ceb9a" }, "downloads": -1, "filename": "gaplearn-0.7.tar.gz", "has_sig": false, "md5_digest": "d8505e4a18b266c7351e9961459746b7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 4672, "upload_time": "2019-08-04T22:49:16", "url": "https://files.pythonhosted.org/packages/da/b8/21ce17a5f7f68f92d6212f8454104ac795920973dfff7f50242e21afb630/gaplearn-0.7.tar.gz" } ], "0.8": [ { "comment_text": "", "digests": { "md5": "8c7a9a0df82811427ee8ae1cd1e62801", "sha256": "7011986d11f14e54f586b17b6d510b6f5bf1600fcfa164a14352ac86be701d37" }, "downloads": -1, "filename": "gaplearn-0.8.tar.gz", "has_sig": false, "md5_digest": "8c7a9a0df82811427ee8ae1cd1e62801", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6101, "upload_time": "2019-08-08T15:40:27", "url": "https://files.pythonhosted.org/packages/c3/4f/32af44b3f30a5f5a36a35800375ad0fd0fe906a74ed0efcd3dbc635fc6fa/gaplearn-0.8.tar.gz" } ], "0.9": [ { "comment_text": "", "digests": { "md5": "3daa4ae5317a4d22795d446ccc6b9508", "sha256": "33a2edc445eb892fa45203b21098f856cc2983b79c62a0e7a28de05532f9bfb1" }, "downloads": -1, "filename": "gaplearn-0.9.tar.gz", "has_sig": false, "md5_digest": "3daa4ae5317a4d22795d446ccc6b9508", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6427, "upload_time": "2019-08-08T16:03:53", "url": "https://files.pythonhosted.org/packages/56/87/54838db0d845be4a93109af1ea53115a360ea968e7870296ed492237cb90/gaplearn-0.9.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "29effc7798b3aa77856fbd8a1c21cada", "sha256": "cf5f8af6d0685b21ef60aab1245f5dd2a8eab5525b9a416d2e6359cde0ae40b5" }, "downloads": -1, "filename": "gaplearn-0.11.tar.gz", "has_sig": false, "md5_digest": "29effc7798b3aa77856fbd8a1c21cada", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11649, "upload_time": "2019-08-30T09:15:12", "url": "https://files.pythonhosted.org/packages/65/2a/b1526424e6da60ebe8ea06af8180d391041af7ce48b28527016888b2bb72/gaplearn-0.11.tar.gz" } ] }