{ "info": { "author": "ahmed-mohamed-sn", "author_email": "hanoush87@gmail.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Customer Service", "Intended Audience :: Developers", "Intended Audience :: Financial and Insurance Industry", "Intended Audience :: Healthcare Industry", "Intended Audience :: Legal Industry", "Intended Audience :: Other Audience", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Scientific/Engineering :: Artificial Intelligence" ], "description": "# ATgfe (Automated Transparent Genetic Feature Engineering)\n\n
\n\n# What is ATgfe?\nATgfe stands for Automated Transparent Genetic Feature Engineering. ATgfe is powered by a genetic algorithm to engineer new features. The idea is to compose new interpretable features based on interactions between the existing features. The predictive power of the newly constructed features is measured using a pre-defined evaluation metric, which can be custom-designed.\n\nATgfe applies the following techniques to generate candidate features:\n- Simple feature interactions by using the basic operators (+, -, *, /).\n``` \n (petalwidth * petallength) \n```\n- Scientific feature interactions by applying transformation operators (e.g. log, cosine, cube, etc. as well as custom operators which can be easily implemented using user-defined functions).\n```\n squared(sepalwidth)*(log_10(sepalwidth)/squared(petalwidth))-cube(sepalwidth)\n```\n- Weighted feature interactions by adding weights to the simple and/or scientific feature interactions.\n```\n (0.09*exp(petallength)+0.7*sepallength/0.12*exp(petalwidth))+0.9*squared(sepalwidth)\n```\n- Complex feature interactions by applying groupBy on the categorical features.\n```\n (0.56*groupByYear0TakeMeanOfFeelslike*0.51*feelslike)+(0.45*temp)\n```\n\n# Why ATgfe?\nATgfe allows you to deal with **non-linear** problems by generating new **interpretable** features from existing features. The generated features can then be used with a linear model, which is inherently explainable. The idea is to explore potential predictive information that can be represented using interactions between existing features.\n\nWhen compared with non-linear models (e.g. 
gradient boosting machines, random forests, etc.), ATgfe can achieve comparable results and in some cases outperform them.\nThis is demonstrated in the following examples: [BMI](https://github.com/ahmed-mohamed-sn/ATgfe/blob/master/examples/generated/generated_1.ipynb), [Rational difference](https://github.com/ahmed-mohamed-sn/ATgfe/blob/master/examples/generated/generated_2.ipynb) and [IRIS](https://github.com/ahmed-mohamed-sn/ATgfe/blob/master/examples/toy-examples/iris_multi_classification.ipynb).\n\n# Results\n## Generated\n| Expression | Linear Regression | LightGBM Regressor | Linear Regression + ATgfe |\n|----------------------------------|-----------------------------------------------------------------|----------------------------------------------------------------|---------------------------------------------------------|\n| BMI = weight/height^2 | | | |\n| Y = (X1 - X2) / (X3 - X4) | | | |\n| Y = (Log10(X1) + Log10(X2)) / X5 | | | |\n| Y = 0.4*X2^2 + 2*X4 + 2 | | | |\n\n## Classification\n| Dataset | Logistic Regression | LightGBM Classifier | Logistic Regression + ATgfe |\n|----------------------------------|-------------------------------------------------------------------------|------------------------------------------------------------------------|--------------------------------------------------------------------|\n| IRIS (4 features) | | | |\n\n## Regression\n| Dataset | Linear Regression | LightGBM Regressor | Linear Regression + ATgfe |\n|----------------------------------|-------------------------------------------------------------------------|------------------------------------------------------------------------|--------------------------------------------------------------------|\n| Concrete (8 features) | | | |\n| Boston (13 features) | | | |\n\n# Get started\n\n## Requirements\n- Python ^3.6\n- DEAP ^1.3\n- Pandas ^0.25.2\n- Scipy ^1.3\n- Numpy ^1.17\n- Sympy ^1.4\n\n## Install ATgfe\n```bash\npip install atgfe\n```\n## Upgrade 
ATgfe\n```bash\npip install -U atgfe\n```\n# Usage\n\n## Examples\nThe [Examples](https://github.com/ahmed-mohamed-sn/ATgfe/tree/master/examples/) are grouped under the following two sections:\n- [Generated](https://github.com/ahmed-mohamed-sn/ATgfe/tree/master/examples/generated) examples test ATgfe against hand-crafted non-linear problems where we know there is information that can be captured using feature interactions. \n\n- [Toy Examples](https://github.com/ahmed-mohamed-sn/ATgfe/tree/master/examples/toy-examples) show how to use ATgfe in solving a mix of regression and classification problems from publicly available benchmark datasets.\n\n## Pre-processing for column names\n### ATgfe requires column names that are free from special characters and spaces (e.g. @, $, %, #, etc.)\n```python\n# example\ndef prepare_column_names(columns):\n return [col.replace(' ', '').replace('(cm)', '_cm') for col in columns]\n\ncolumns = prepare_column_names(df.columns.tolist())\ndf.columns = columns\n```\n\n## Configuring the parameters of GeneticFeatureEngineer\n```python\nGeneticFeatureEngineer(\n model,\n x_train: pandas.core.frame.DataFrame,\n y_train: pandas.core.frame.DataFrame,\n numerical_features: List[str],\n number_of_candidate_features: int,\n number_of_interacting_features: int,\n evaluation_metric: Callable[..., Any],\n minimize_metric: bool = True,\n categorical_features: List[str] = None,\n enable_grouping: bool = False,\n sampling_size: int = None,\n cv: int = 10,\n fit_wo_original_columns: bool = False,\n enable_feature_transformation_operations: bool = False,\n enable_weights: bool = False,\n enable_bias: bool = False,\n max_bias: float = 100.0,\n weights_number_of_decimal_places: int = 2,\n shuffle_training_data_every_generation: bool = False,\n cross_validation_in_objective_func: bool = False,\n objective_func_cv: int = 3,\n n_jobs: int = 1,\n verbose: bool = True\n)\n```\n\n### model\nATgfe works with any model or pipeline that follows scikit-learn API 
(i.e. the model should implement the ```fit()``` and ```predict()``` methods).\n\n### x_train\nTraining features in a pandas DataFrame.\n\n### y_train\nTraining labels in a pandas DataFrame, which also allows multiple-target problems to be handled.\n\n### numerical_features\nThe list of column names that represent the numerical features.\n\n### number_of_candidate_features\nThe maximum number of features to be generated.\n\n### number_of_interacting_features\nThe maximum number of existing features that can be used in constructing new features. \nThese features are selected from those passed in the ```numerical_features``` argument.\n\n### evaluation_metric\nAny of the [scikit-learn metrics](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) or a custom-made evaluation metric to be used by the genetic algorithm to evaluate the predictive power of the newly generated features. \n```python\nimport numpy as np\nfrom sklearn.metrics import mean_squared_error\n\ndef rmse(y_true, y_pred):\n return np.sqrt(mean_squared_error(y_true, y_pred))\n```\n### minimize_metric\nA boolean flag, which should be set to ```True``` if the evaluation metric is to be minimized; otherwise set to ```False``` if the evaluation metric is to be maximized.\n \n### categorical_features\nThe list of column names that represent the categorical features. The parameter ```enable_grouping``` should be set to ```True``` in order for the ```categorical_features``` to be utilized in grouping.\n\n### enable_grouping\nA boolean flag, which should be set to ```True``` to construct complex feature interactions that use ```pandas.DataFrame.groupby```.\n\n### sampling_size\nThe exact size of the sampled training dataset. Use this parameter to run the optimization using the specified number of observations in the training data. If the ```sampling_size``` is greater than the number of observations, then ATgfe will create a sample with replacement.\n \n### cv\nThe number of folds for cross validation. 
Every generation of the genetic algorithm, ATgfe evaluates the current best solution using k-fold cross validation. The default number of folds is 10.\n\n### fit_wo_original_columns\nA boolean flag, which should be set to ```True``` to fit the model without the original features specified in ```numerical_features```. In this case, ATgfe will only use the newly generated features together with any remaining original features in ```x_train```.\n\n### enable_feature_transformation_operations\nA boolean flag, which should be set to ```True``` to enable scientific feature interactions on the ```numerical_features```.\nThe pre-defined transformation operators are listed as follows:\n```\nnp_log(), np_log_10(), np_exp(), squared(), cube()\n```\nYou can easily remove from or add to the existing list of transformation operators. Check out the next section for examples.\n\n### enable_weights\nA boolean flag, which should be set to ```True``` to enable weighted feature interactions.\n\n### weights_number_of_decimal_places\nThe number of decimal places (i.e. precision) to be applied to the weight values.\n\n### enable_bias\nA boolean flag, which enables the genetic algorithm to add a bias to the expressions generated. For example:\n```\n0.43*log(cement) + 806.8557595548646\n```\n\n### max_bias\nThe value of the bias will be between ```-max_bias``` and ```max_bias```.\nIf the ```max_bias``` is 100 then the bias value will be between -100 and 100.\n\n### shuffle_training_data_every_generation\nA boolean flag, if enabled the ```train_test_split``` method in the objective function uses the generation number as its random seed. This can prevent over-fitting.
\nThis option is only available if ```cross_validation_in_objective_func``` is set to ```False```. \n\n### cross_validation_in_objective_func\nA boolean flag. If enabled, the objective function does not use ```train_test_split```; instead, the genetic algorithm uses cross validation to evaluate the generated features.\n
The default number of folds is **3**. The number of folds can be modified using the ```objective_func_cv``` parameter.\n\n### objective_func_cv\nThe number of folds to be used when ```cross_validation_in_objective_func``` is enabled.\n\n### verbose\nA boolean flag, which should be set to ```True``` to enable the logging functionality.\n\n### n_jobs\nTo enable parallel processing, set ```n_jobs``` to the number of CPUs that you would like to utilise. If ```n_jobs``` is set to **-1**, all the machine's CPUs will be utilised.\n\n## Configuring the parameters of fit()\n```python\ngfe.fit(\n number_of_generations: int = 100,\n mu: int = 10,\n lambda_: int = 100,\n crossover_probability: float = 0.5,\n mutation_probability: float = 0.2,\n early_stopping_patience: int = 5,\n random_state: int = 77\n)\n```\n\n### number_of_generations\nThe maximum number of generations to be explored by the genetic algorithm.\n\n### mu\nThe number of solutions to select for the next generation.\n\n### lambda_\nThe number of children to produce at each generation.\n\n### crossover_probability\nThe crossover probability.\n\n### mutation_probability\nThe mutation probability.\n\n### early_stopping_patience\nThe maximum number of generations to be explored without an improvement in the validation score before the early stopping criterion is satisfied.\n\n\n## Configuring the parameters of transform()\n```python\nX = gfe.transform(X)\n```\n\nWhere X is the pandas DataFrame that you would like to append the generated features to.\n\n## Transformation operations\n\n### Get current transformation operations\n```python\ngfe.get_enabled_transformation_operations()\n```\n\nThe enabled transformation operations will be returned.\n\n```\n['None', 'np_log', 'np_log_10', 'np_exp', 'squared', 'cube']\n```\n### Remove existing transformation operations\n```gfe.remove_transformation_operation``` accepts a string or a list of 
strings\n```python\ngfe.remove_transformation_operation('squared')\n```\n\n```python\ngfe.remove_transformation_operation(['np_log_10', 'np_exp'])\n```\n### Add new transformation operations \n```python\nnp_sqrt = np.sqrt\n\ndef some_func(x):\n return (x * 2)/3\n\ngfe.add_transformation_operation('sqrt', np_sqrt)\ngfe.add_transformation_operation('some_func', some_func)\n```", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/ahmed-mohamed-sn/ATgfe", "keywords": "Python,Machine Learning,Feature Engineering,Genetic Algorithms,Explainable", "license": "MIT", "maintainer": "ahmed-mohamed-sn", "maintainer_email": "hanoush87@gmail.com", "name": "atgfe", "package_url": "https://pypi.org/project/atgfe/", "platform": "", "project_url": "https://pypi.org/project/atgfe/", "project_urls": { "Homepage": "https://github.com/ahmed-mohamed-sn/ATgfe", "Repository": "https://github.com/ahmed-mohamed-sn/ATgfe" }, "release_url": "https://pypi.org/project/atgfe/0.2.60/", "requires_dist": [ "deap (>=1.3,<2.0)", "pandas (>=0.25.2,<0.26.0)", "scipy (>=1.3,<2.0)", "numpy (>=1.17,<2.0)", "sympy (>=1.4,<2.0)" ], "requires_python": ">=3.6,<4.0", "summary": "Automated Transparent Genetic Feature Engineering or ATgfe", "version": "0.2.60", "yanked": false, "yanked_reason": null }, "last_serial": 7370272, "releases": { "0.2.58": [ { "comment_text": "", "digests": { "md5": "85ec061b01f9fd13a1f0e27ca763e312", "sha256": "26cc87e73f9cd519d7dee8159db80a20fa4985bb8c0442c3d809c982cbbbe045" }, "downloads": -1, "filename": "ATgfe-0.2.58-py3-none-any.whl", "has_sig": false, "md5_digest": "85ec061b01f9fd13a1f0e27ca763e312", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6,<4.0", "size": 14076, "upload_time": "2019-11-28T16:41:39", "upload_time_iso_8601": "2019-11-28T16:41:39.794667Z", "url": 
"https://files.pythonhosted.org/packages/61/46/dde57bb28c80368aa699c03ddd43794dbcf0151ff327f63482fa207607ad/ATgfe-0.2.58-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "75478fc37278998225d5de0ce485fa29", "sha256": "5d1a89f0196a04f9a21659fd8c1dcb47e8d713990b2dca2b27d526d1ea936e3a" }, "downloads": -1, "filename": "ATgfe-0.2.58.tar.gz", "has_sig": false, "md5_digest": "75478fc37278998225d5de0ce485fa29", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6,<4.0", "size": 18401, "upload_time": "2019-11-28T16:41:41", "upload_time_iso_8601": "2019-11-28T16:41:41.200216Z", "url": "https://files.pythonhosted.org/packages/7f/a6/0453ecf083f3f524df2d88bab3c23a77112ca386b6afd5931668bfb27b80/ATgfe-0.2.58.tar.gz", "yanked": false, "yanked_reason": null } ], "0.2.59": [ { "comment_text": "", "digests": { "md5": "015861e5d8d76f9f4ec5470707a05e78", "sha256": "eb0aab8d43cb8dcc09bb9f65e110ee8d6c74a175e073fe0a19ce005bb715109b" }, "downloads": -1, "filename": "ATgfe-0.2.59-py3-none-any.whl", "has_sig": false, "md5_digest": "015861e5d8d76f9f4ec5470707a05e78", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6,<4.0", "size": 20813, "upload_time": "2020-06-01T13:11:49", "upload_time_iso_8601": "2020-06-01T13:11:49.511033Z", "url": "https://files.pythonhosted.org/packages/9f/8f/bc48c5f8f5832c4453731bd4c0bbb1c81d1ae3d84a349753e16432699abc/ATgfe-0.2.59-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "b2d989d41a9e05710f32a2fd8ba1174e", "sha256": "af101b101c322ba8f0c1bb3d53055f90357af280b9f474a0527185f107255c90" }, "downloads": -1, "filename": "ATgfe-0.2.59.tar.gz", "has_sig": false, "md5_digest": "b2d989d41a9e05710f32a2fd8ba1174e", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6,<4.0", "size": 24541, "upload_time": "2020-06-01T13:11:51", "upload_time_iso_8601": "2020-06-01T13:11:51.066039Z", "url": 
"https://files.pythonhosted.org/packages/52/98/bcfba6948f66496a73bd90f1727ada5e66e4816b049eb0120f599814323f/ATgfe-0.2.59.tar.gz", "yanked": false, "yanked_reason": null } ], "0.2.60": [ { "comment_text": "", "digests": { "md5": "df49a4c8720d0af0a3b0fa20e4e0b59c", "sha256": "5a7cfbe0950be45e5cdd1ca4f0938fa557b8d99b2171432f1f24b0fbfc6dc444" }, "downloads": -1, "filename": "ATgfe-0.2.60-py3-none-any.whl", "has_sig": false, "md5_digest": "df49a4c8720d0af0a3b0fa20e4e0b59c", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6,<4.0", "size": 20824, "upload_time": "2020-06-01T13:37:48", "upload_time_iso_8601": "2020-06-01T13:37:48.768443Z", "url": "https://files.pythonhosted.org/packages/ed/9d/f14d1d7e84c7966fcd3ab5d73e845c4a19e488b83661a79df3ab3a0e7c90/ATgfe-0.2.60-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "257285ec101d7ac18ab23a6d4ca51fd8", "sha256": "5563f1d1f2920013d8e3b6e7826b818931c5c148d6a6c0f9db1a37d323a54e46" }, "downloads": -1, "filename": "ATgfe-0.2.60.tar.gz", "has_sig": false, "md5_digest": "257285ec101d7ac18ab23a6d4ca51fd8", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6,<4.0", "size": 24568, "upload_time": "2020-06-01T13:37:49", "upload_time_iso_8601": "2020-06-01T13:37:49.987900Z", "url": "https://files.pythonhosted.org/packages/d9/69/06dc7d51414c7a0342abc23b586c7557deaf7fbd5612532bd68885dded99/ATgfe-0.2.60.tar.gz", "yanked": false, "yanked_reason": null } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "df49a4c8720d0af0a3b0fa20e4e0b59c", "sha256": "5a7cfbe0950be45e5cdd1ca4f0938fa557b8d99b2171432f1f24b0fbfc6dc444" }, "downloads": -1, "filename": "ATgfe-0.2.60-py3-none-any.whl", "has_sig": false, "md5_digest": "df49a4c8720d0af0a3b0fa20e4e0b59c", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6,<4.0", "size": 20824, "upload_time": "2020-06-01T13:37:48", "upload_time_iso_8601": 
"2020-06-01T13:37:48.768443Z", "url": "https://files.pythonhosted.org/packages/ed/9d/f14d1d7e84c7966fcd3ab5d73e845c4a19e488b83661a79df3ab3a0e7c90/ATgfe-0.2.60-py3-none-any.whl", "yanked": false, "yanked_reason": null }, { "comment_text": "", "digests": { "md5": "257285ec101d7ac18ab23a6d4ca51fd8", "sha256": "5563f1d1f2920013d8e3b6e7826b818931c5c148d6a6c0f9db1a37d323a54e46" }, "downloads": -1, "filename": "ATgfe-0.2.60.tar.gz", "has_sig": false, "md5_digest": "257285ec101d7ac18ab23a6d4ca51fd8", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6,<4.0", "size": 24568, "upload_time": "2020-06-01T13:37:49", "upload_time_iso_8601": "2020-06-01T13:37:49.987900Z", "url": "https://files.pythonhosted.org/packages/d9/69/06dc7d51414c7a0342abc23b586c7557deaf7fbd5612532bd68885dded99/ATgfe-0.2.60.tar.gz", "yanked": false, "yanked_reason": null } ], "vulnerabilities": [] }