{ "info": { "author": "Christopher H. Todd", "author_email": "Christopher.Hayden.Todd@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "Natural Language :: English", "Programming Language :: Python", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Programming Language :: Python :: 3.8" ], "description": "# Christopher H. Todd's PROJECT_STRING_NAME\n\nThe PROJECT_GIT_NAME project is responsible for ...\n\nThe library ...\n\n## Table of Contents\n\n- [Dependencies](#dependencies)\n- [Libraries](#libraries)\n- [Example Scripts](#example-scripts)\n- [Notes](#notes)\n- [TODO](#todo)\n\n## Dependencies\n\n### Python Packages\n\n- great-expectations>=0.4.5\n- pandas>=0.24.2\n- tensorflow>=1.13.1\n\n## Libraries\n\n### [data_engineering_helpers.py](https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science/blob/master/data_science_helpers/data_engineering_helpers.py)\n\nLibrary for Dealing with redundant Data Engineering Tasks. This will include functions for tranforming dictionaries and PANDAS Dataframes\n\nFunctions:\n\n```\ndef remove_overly_null_columns(df, percentage_null=.25):\n \"\"\"\n Purpose:\n Remove columns with the count of null values\n exceeds the passed in percentage. This defaults\n to 25%.\n Args:\n df (Pandas DataFrame): DataFrame to remove columns\n from\n percentage_null (float): Percentage of null values\n that will be the threshold for removing or\n keeping columns. Defaults to .25 (25%)\n Return\n df (Pandas DataFrame): DataFrame with columns removed\n based on thresholds\n \"\"\"\n```\n\n```\ndef remove_high_cardinality_numerical_columns(df, percentage_unique=1):\n \"\"\"\n Purpose:\n Remove columns with the count of unique values\n matches the count of rows. These are usually\n unique identifiers (primary keys in a database)\n that are not useful for modeling and can result\n in poor model performance. percentage_unique\n defaults to 100%, but this can be passed in\n Args:\n df (Pandas DataFrame): DataFrame to remove columns\n from\n percentage_unique (float): Percentage of null values\n that will be the threshold for removing or\n keeping columns. Defaults to 1 (100%)\n Return\n df (Pandas DataFrame): DataFrame with columns removed\n based on thresholds\n \"\"\"\n```\n\n```\ndef remove_high_cardinality_categorical_columns(df, max_unique_values=20):\n \"\"\"\n Purpose:\n Remove columns with the count of unique values\n for categorical columns are over a specified threshold.\n These values are difficult to transform into dummies,\n and would not work for logistic/linear regression.\n Args:\n df (Pandas DataFrame): DataFrame to remove columns\n from\n max_unique_values (int): Integer of unique values\n that is the threshold to remove column\n Return\n df (Pandas DataFrame): DataFrame with columns removed\n based on thresholds\n \"\"\"\n```\n\n```\ndef remove_single_value_columns(df):\n \"\"\"\n Purpose:\n Remove columns with a single value\n Args:\n df (Pandas DataFrame): DataFrame to remove columns\n from\n Return\n df (Pandas DataFrame): DataFrame with columns removed\n \"\"\"\n```\n\n```\ndef remove_quantile_equality_columns(df, low_quantile=.05, high_quantile=.95):\n \"\"\"\n Purpose:\n Remove columns where the low quantile matches the\n high quantile (data is heavily influenced by outliers)\n and data is not well spread out\n Args:\n df (Pandas DataFrame): DataFrame to remove columns\n from\n low_quantile (float): Percentage quantile to compare\n high_quantile (float): Percentage quantile to compare\n Return\n df (Pandas DataFrame): DataFrame with columns removed\n \"\"\"\n```\n\n```\ndef mask_outliers_numerical_columns(df, low_quantile=.05, high_quantile=.95):\n \"\"\"\n Purpose:\n Update outliers to be equal to the low_quantile and\n high_quantile values specified.\n Args:\n df (Pandas DataFrame): DataFrame to update data\n low_quantile (float): Percentage quantile to set values\n high_quantile (float): Percentage quantile to set values\n Return\n df (Pandas DataFrame): DataFrame with columns updated\n \"\"\"\n```\n\n```\ndef convert_categorical_columns_to_dummies(df, drop_first=True):\n \"\"\"\n Purpose:\n Convert Categorical Values into Dummies. Will also\n remove the initial column being converted. If\n remove first is true, will remove one of the\n dummy variables to remove prevent multicollinearity\n Args:\n df (Pandas DataFrame): DataFrame to convert columns\n drop_first (bool): to remove or not remove a column\n from dummies generated\n Return\n df (Pandas DataFrame): DataFrame with columns converted\n \"\"\"\n```\n\n```\ndef ensure_categorical_columns_all_string(df):\n \"\"\"\n Purpose:\n Ensure all values for Categorical Values are strings\n and converts any non-string value into strings\n Args:\n df (Pandas DataFrame): DataFrame to convert columns\n Return\n df (Pandas DataFrame): DataFrame with columns converted\n \"\"\"\n```\n\n```\ndef encode_categorical_columns_as_integer(df):\n \"\"\"\n Purpose:\n Convert Categorical Values into single value\n using sklearn LabelEncoder\n Args:\n df (Pandas DataFrame): DataFrame to convert columns\n Return\n df (Pandas DataFrame): DataFrame with columns converted\n \"\"\"\n```\n\n```\ndef replace_null_values_numeric_columns(df, replace_operation='median'):\n \"\"\"\n Purpose:\n Replace all null values in a dataframe with other\n values. Options include 0, mean, and median; the\n default operation converts numeric columns to\n median\n Args:\n df (Pandas DataFrame): DataFrame to remove columns\n from\n replace_operation (string/enum): operation to perform\n in replacing null values in the dataframe\n Return\n df (Pandas DataFrame): DataFrame with nulls replaced\n \"\"\"\n```\n\n```\ndef replace_null_values_categorical_columns(df):\n \"\"\"\n Purpose:\n Replace all null values in a dataframe with \"Unknown\"\n Args:\n df (Pandas DataFrame): DataFrame to remove columns\n from\n replace_operation (string/enum): operation to perform\n in replacing null values in the dataframe\n Return\n df (Pandas DataFrame): DataFrame with nulls replaced\n \"\"\"\n```\n\n```\ndef get_categorical_columns(df):\n \"\"\"\n Purpose:\n Returns the categorical columns in a\n DataFrame\n Args:\n df (Pandas DataFrame): DataFrame to describe\n Return\n categorical_columns (list): List of string\n names of categorical columns\n \"\"\"\n```\n\n\n```\ndef get_numeric_columns(df):\n \"\"\"\n Purpose:\n Returns the numeric columns in a\n DataFrame\n Args:\n df (Pandas DataFrame): DataFrame to describe\n Return\n numeric_columns (list): List of string\n names of numeric columns\n \"\"\"\n```\n\n\n```\ndef get_columns_with_null_values(df):\n \"\"\"\n Purpose:\n Get Columns with Null Values\n Args:\n df (Pandas DataFrame): DataFrame to describe\n Return\n columns_with_nulls (dict): Dictionary where\n keys are columns with nulls and the value\n is the number of nulls in the column\n \"\"\"\n```\n\n### [data_exploration_helpers.py](https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science/blob/master/data_science_helpers/data_exploration_helpers.py)\n\nLibrary for aiding the understanding and investigation into the data provided for modeling. These helpers will help explain, graph, and explore the data\n\nFunctions:\n\n```\ndef get_numerical_column_statistics(df):\n \"\"\"\n Purpose:\n Describe the numerical columns in a dataframe.\n This will include, total_count, count_null, count_0,\n mean, median, mode, sum, 5% quantile, and 95% quantile.\n Args:\n df (Pandas DataFrame): DataFrame to describe\n Return\n num_statistics (dictionary): Dictionary with key being\n the column and the data being statistics for the\n column\n \"\"\"\n```\n\n\n```\ndef get_column_correlation(df):\n \"\"\"\n Purpose:\n Determine the true correlation between\n all column pairs in a passed in DataFrame.\n This is the pure correlation; this is useful\n if you are looking for the detailed correlation\n and the direction of the correlation\n Args:\n df (Pandas DataFrame): DataFrame to determine correlation\n Return\n unique_value_correlation (Pandas DataFrame): DataFrame\n of correlations for each column set in the DataFrame\n \"\"\"\n```\n\n\n```\ndef get_column_absolute_correlation(df):\n \"\"\"\n Purpose:\n Determine the absolute correlation between\n all column pairs in a passed in DataFrame.\n Absolute converts all correlations to a\n positive value; this is useful if you are\n only looking for the existance of a coorelation\n and not the direction.\n Args:\n df (Pandas DataFrame): DataFrame to determine correlation\n Return\n unique_value_abs_correlation (Pandas DataFrame): DataFrame\n of correlations for each column set in the DataFrame\n \"\"\"\n```\n\n\n```\ndef get_column_pairs_significant_correlation(df, pos_corr=.20, neg_corr=.20):\n \"\"\"\n Purpose:\n Determine Columns with highly positive or highly\n negative correlation. Defaults for positive and\n negative correlations are 20% and can be passed\n in as parameters\n Args:\n df (Pandas DataFrame): DataFrame to determine correlation\n pos_corr (float): Float percentage to consider a positive\n correlation as significant. Default 20%\n neg_corr (float): Float percentage to consider a negative\n correlation as significant. Default 20%\n Return\n high_positive_correlation_pairs (List of Sets): List of column\n pairs with a high positive correlation\n high_negative_correlation_pairs (List of Sets): List of column\n pairs with a high negative correlation\n \"\"\"\n```\n\n\n```\ndef get_unique_column_paris(df):\n \"\"\"\n Purpose:\n Get unique pairs of columns from a DataFrame. This\n assumes there is no direction (A, B) and returns\n a Set of column pairs that can be used for identifying\n correlation, mapping columns, and other functions\n Args:\n df (Pandas DataFrame): DataFrame to determine column pairs\n Return\n unique_pairs (Set): Set of unique column pairs\n \"\"\"\n```\n\n### [model_persistence_helpers.py](https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science/blob/master/data_science_helpers/model_persistence_helpers.py)\n\nLibrary for helping store/load/persist data science models using Python libraries\n\nFunctions:\n\n```\ndef store_model_as_pickle(filename, config={}, metadata={}):\n \"\"\"\n Purpose:\n Store a model in memory to a .pkl file for later\n usage. ALso store a .config file and .metadata\n file with information about the model\n Args:\n filename (String): Filename of a pickled model (.pkl)\n config (Dict): Configuration data for the model\n metadata (Dict): Metadata related to the model/training/etc\n Return:\n N/A\n \"\"\"\n```\n\n\n```\ndef load_pickled_model(filename):\n \"\"\"\n Purpose:\n Load a model that has been pickled and stored to\n persistance storage into memory\n Args:\n filename (String): Filename of a pickled model (.pkl)\n Return:\n model (Pickeled Object): Pickled model loaded from .pkl\n \"\"\"\n```\n\n### [model_training_helpers.py](https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science/blob/master/data_science_helpers/model_training_helpers.py)\n\nLibrary for helping train data science models using Python libraries\n\nFunctions:\n\n```\ndef split_dataframe_for_model_training(\n df, dependent_variable, independent_variables=None, train_size=.70):\n \"\"\"\n Purpose:\n Takes in DataFrame and creates 4 DataFrames.\n 2 DataFrames holding X varib DataFrames and 2 Model Y DataFrames.\n Train size is defaulted at 70% and the split defaults to using\n all passed in columns.\n Args:\n df (Pandas DataFrame): DataFrame to split\n dependent_variable (string): dependent variable being\n that the model is being created to predict\n independent_variables (List of strings): independent variables that\n will be used to predict the dependent varilable. If no columns\n are passed, use all columns in the dataframe except the\n dependent variable.\n train_size (float): Percentage of rows in DataFrame\n to use testing model. Inverse precentage will/can\n be used to test the model's effectiveness\n Return\n train_x (Pandas DataFrame): DataFrame with all independent variables\n for training the model. Size is equal to a percentage of the\n base dataset multiplied by the train size\n test_x (Pandas DataFrame): DataFrame with all independent variables\n for testing the trained model. Size is equal to a percentage\n of the base dataset subtracted by the train size\n train_y_observed (Pandas DataFrame): DataFrame with all dependant\n variables for training the model. Size is equal to a percentage\n of the base dataset multiplied by the train size\n test_y_observed (Pandas DataFrame): DataFrame with all dependant\n variables testing the trained model. Size is equal to a\n percentage of the base dataset multiplied by the train size\n \"\"\"\n```\n\n```\ndef split_dataframe_by_column(df, column):\n \"\"\"\n Purpose:\n Split dataframe into multipel dataframes based on uniqueness\n of columns passed in. The dataframe is then split into smaller\n dataframes, one for each value of the variable.\n Args:\n df (Pandas DataFrame): DataFrame to split\n column (string): string of the column name to split on\n Return\n split_df (Dict of Pandas DataFrames): Dictionary with the\n split dataframes and the value that the column maps to\n e.g false/true/0/1\n \"\"\"\n```\n\n## Example Scripts\n\nExample executable Python scripts/modules for testing and interacting with the library. These show example use-cases for the libraries and can be used as templates for developing with the libraries or to use as one-off development efforts.\n\n### N/A\n\n## Notes\n\n - Relies on f-string notation, which is limited to Python3.6. A refactor to remove these could allow for development with Python3.0.x through 3.5.x\n\n## TODO\n\n - Unittest framework in place, but lacking tests\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science", "keywords": "python,libraries,numpy,pandas,data science", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "ctodd-python-lib-data-science", "package_url": "https://pypi.org/project/ctodd-python-lib-data-science/", "platform": "", "project_url": "https://pypi.org/project/ctodd-python-lib-data-science/", "project_urls": { "Homepage": "https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science" }, "release_url": "https://pypi.org/project/ctodd-python-lib-data-science/1.0.0/", "requires_dist": [ "pandas (>=0.24.2)", "great-expectations (>=0.4.5)", "tensorflow (>=1.13.1)" ], "requires_python": ">3.6", "summary": "Python utilities used for practicing data science and engineering", "version": "1.0.0" }, "last_serial": 5166161, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "2fe87aaa345c10593bf12edd1e64d828", "sha256": "6cb39b6b91121e460dd6b92bc6abdeb8559fc210d28e096f2d63b38d1a4c7e92" }, "downloads": -1, "filename": "ctodd_python_lib_data_science-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "2fe87aaa345c10593bf12edd1e64d828", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">3.6", "size": 15178, "upload_time": "2019-04-19T21:18:50", "url": "https://files.pythonhosted.org/packages/c0/72/1c4c6b78e4fc86e1c5fef639963f43d13b7c1c09ee28a81bb3052f07e829/ctodd_python_lib_data_science-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "4b5d9afaf898f7275513fcea97b572b8", "sha256": "343259561f9ad7603f206be6c954a32e245d3e872577c4b4c151445ad15f6aab" }, "downloads": -1, "filename": "ctodd-python-lib-data-science-1.0.0.tar.gz", "has_sig": false, "md5_digest": "4b5d9afaf898f7275513fcea97b572b8", "packagetype": "sdist", "python_version": "source", "requires_python": ">3.6", "size": 13935, "upload_time": "2019-04-19T21:18:52", "url": "https://files.pythonhosted.org/packages/0d/01/fde627137bb5911c552b19bfd2bcc8fa88cb675f9fca1b626d881c2e165f/ctodd-python-lib-data-science-1.0.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "2fe87aaa345c10593bf12edd1e64d828", "sha256": "6cb39b6b91121e460dd6b92bc6abdeb8559fc210d28e096f2d63b38d1a4c7e92" }, "downloads": -1, "filename": "ctodd_python_lib_data_science-1.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "2fe87aaa345c10593bf12edd1e64d828", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">3.6", "size": 15178, "upload_time": "2019-04-19T21:18:50", "url": "https://files.pythonhosted.org/packages/c0/72/1c4c6b78e4fc86e1c5fef639963f43d13b7c1c09ee28a81bb3052f07e829/ctodd_python_lib_data_science-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "4b5d9afaf898f7275513fcea97b572b8", "sha256": "343259561f9ad7603f206be6c954a32e245d3e872577c4b4c151445ad15f6aab" }, "downloads": -1, "filename": "ctodd-python-lib-data-science-1.0.0.tar.gz", "has_sig": false, "md5_digest": "4b5d9afaf898f7275513fcea97b572b8", "packagetype": "sdist", "python_version": "source", "requires_python": ">3.6", "size": 13935, "upload_time": "2019-04-19T21:18:52", "url": "https://files.pythonhosted.org/packages/0d/01/fde627137bb5911c552b19bfd2bcc8fa88cb675f9fca1b626d881c2e165f/ctodd-python-lib-data-science-1.0.0.tar.gz" } ] }