{
    "info": {
        "author": "Christopher H. Todd",
        "author_email": "Christopher.Hayden.Todd@gmail.com",
        "bugtrack_url": null,
        "classifiers": [
            "Development Status :: 5 - Production/Stable",
            "Intended Audience :: Developers",
            "Natural Language :: English",
            "Programming Language :: Python",
            "Programming Language :: Python :: 3",
            "Programming Language :: Python :: 3.6",
            "Programming Language :: Python :: 3.7",
            "Programming Language :: Python :: 3.8"
        ],
        "description": "# Christopher H. Todd's PROJECT_STRING_NAME\n\nThe PROJECT_GIT_NAME project is responsible for ...\n\nThe library ...\n\n## Table of Contents\n\n- [Dependencies](#dependencies)\n- [Libraries](#libraries)\n- [Example Scripts](#example-scripts)\n- [Notes](#notes)\n- [TODO](#todo)\n\n## Dependencies\n\n### Python Packages\n\n- great-expectations>=0.4.5\n- pandas>=0.24.2\n- tensorflow>=1.13.1\n\n## Libraries\n\n### [data_engineering_helpers.py](https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science/blob/master/data_science_helpers/data_engineering_helpers.py)\n\nLibrary for dealing with repetitive data engineering tasks. It includes functions for transforming dictionaries and pandas DataFrames.\n\nFunctions:\n\n```\ndef remove_overly_null_columns(df, percentage_null=.25):\n    \"\"\"\n        Purpose:\n            Remove columns where the share of null values\n            exceeds the passed in percentage. This defaults\n            to 25%.\n        Args:\n            df (Pandas DataFrame): DataFrame to remove columns\n                from\n            percentage_null (float): Percentage of null values\n                that will be the threshold for removing or\n                keeping columns. Defaults to .25 (25%)\n        Return\n            df (Pandas DataFrame): DataFrame with columns removed\n                based on thresholds\n    \"\"\"\n```\n\n```\ndef remove_high_cardinality_numerical_columns(df, percentage_unique=1):\n    \"\"\"\n        Purpose:\n            Remove columns where the count of unique values\n            matches the count of rows. These are usually\n            unique identifiers (primary keys in a database)\n            that are not useful for modeling and can result\n            in poor model performance. 
percentage_unique\n            defaults to 100%, but this can be passed in\n        Args:\n            df (Pandas DataFrame): DataFrame to remove columns\n                from\n            percentage_unique (float): Percentage of unique values\n                that will be the threshold for removing or\n                keeping columns. Defaults to 1 (100%)\n        Return\n            df (Pandas DataFrame): DataFrame with columns removed\n                based on thresholds\n    \"\"\"\n```\n\n```\ndef remove_high_cardinality_categorical_columns(df, max_unique_values=20):\n    \"\"\"\n        Purpose:\n            Remove categorical columns whose count of unique\n            values is over a specified threshold.\n            These values are difficult to transform into dummies,\n            and would not work for logistic/linear regression.\n        Args:\n            df (Pandas DataFrame): DataFrame to remove columns\n                from\n            max_unique_values (int): Integer of unique values\n                that is the threshold to remove column\n        Return\n            df (Pandas DataFrame): DataFrame with columns removed\n                based on thresholds\n    \"\"\"\n```\n\n```\ndef remove_single_value_columns(df):\n    \"\"\"\n        Purpose:\n            Remove columns with a single value\n        Args:\n            df (Pandas DataFrame): DataFrame to remove columns\n                from\n        Return\n            df (Pandas DataFrame): DataFrame with columns removed\n    \"\"\"\n```\n\n```\ndef remove_quantile_equality_columns(df, low_quantile=.05, high_quantile=.95):\n    \"\"\"\n        Purpose:\n            Remove columns where the low quantile matches the\n            high quantile (data is heavily influenced by outliers)\n            and data is not well spread out\n        Args:\n            df (Pandas DataFrame): DataFrame to remove columns\n                from\n            low_quantile (float): Percentage quantile to 
compare\n            high_quantile (float): Percentage quantile to compare\n        Return\n            df (Pandas DataFrame): DataFrame with columns removed\n    \"\"\"\n```\n\n```\ndef mask_outliers_numerical_columns(df, low_quantile=.05, high_quantile=.95):\n    \"\"\"\n        Purpose:\n            Update outliers to be equal to the low_quantile and\n            high_quantile values specified.\n        Args:\n            df (Pandas DataFrame): DataFrame to update data\n            low_quantile (float): Percentage quantile to set values\n            high_quantile (float): Percentage quantile to set values\n        Return\n            df (Pandas DataFrame): DataFrame with columns updated\n    \"\"\"\n```\n\n```\ndef convert_categorical_columns_to_dummies(df, drop_first=True):\n    \"\"\"\n        Purpose:\n            Convert Categorical Values into Dummies. Will also\n            remove the initial column being converted. If\n            drop_first is true, will remove one of the\n            dummy variables to prevent multicollinearity\n        Args:\n            df (Pandas DataFrame): DataFrame to convert columns\n            drop_first (bool): to remove or not remove a column\n                from dummies generated\n        Return\n            df (Pandas DataFrame): DataFrame with columns converted\n    \"\"\"\n```\n\n```\ndef ensure_categorical_columns_all_string(df):\n    \"\"\"\n        Purpose:\n            Ensure all values in categorical columns are strings\n            and convert any non-string value into a string\n        Args:\n            df (Pandas DataFrame): DataFrame to convert columns\n        Return\n            df (Pandas DataFrame): DataFrame with columns converted\n    \"\"\"\n```\n\n```\ndef encode_categorical_columns_as_integer(df):\n    \"\"\"\n        Purpose:\n            Convert Categorical Values into single integer values\n            using sklearn LabelEncoder\n        Args:\n            df (Pandas DataFrame): DataFrame to convert 
columns\n        Return\n            df (Pandas DataFrame): DataFrame with columns converted\n    \"\"\"\n```\n\n```\ndef replace_null_values_numeric_columns(df, replace_operation='median'):\n    \"\"\"\n        Purpose:\n            Replace all null values in the numeric columns of a\n            dataframe with other values. Options include 0, mean,\n            and median; the default operation replaces nulls\n            with the median\n        Args:\n            df (Pandas DataFrame): DataFrame to replace nulls\n                in\n            replace_operation (string/enum): operation to perform\n                in replacing null values in the dataframe\n        Return\n            df (Pandas DataFrame): DataFrame with nulls replaced\n    \"\"\"\n```\n\n```\ndef replace_null_values_categorical_columns(df):\n    \"\"\"\n        Purpose:\n            Replace all null values in the categorical columns\n            of a dataframe with \"Unknown\"\n        Args:\n            df (Pandas DataFrame): DataFrame to replace nulls\n                in\n        Return\n            df (Pandas DataFrame): DataFrame with nulls replaced\n    \"\"\"\n```\n\n```\ndef get_categorical_columns(df):\n    \"\"\"\n        Purpose:\n            Returns the categorical columns in a\n            DataFrame\n        Args:\n            df (Pandas DataFrame): DataFrame to describe\n        Return\n            categorical_columns (list): List of string\n                names of categorical columns\n    \"\"\"\n```\n\n\n```\ndef get_numeric_columns(df):\n    \"\"\"\n        Purpose:\n            Returns the numeric columns in a\n            DataFrame\n        Args:\n            df (Pandas DataFrame): DataFrame to describe\n        Return\n            numeric_columns (list): List of string\n                names of numeric columns\n    \"\"\"\n```\n\n\n```\ndef get_columns_with_null_values(df):\n    \"\"\"\n        
Purpose:\n            Get Columns with Null Values\n        Args:\n            df (Pandas DataFrame): DataFrame to describe\n        Return\n            columns_with_nulls (dict): Dictionary where\n                keys are columns with nulls and the value\n                is the number of nulls in the column\n    \"\"\"\n```\n\n### [data_exploration_helpers.py](https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science/blob/master/data_science_helpers/data_exploration_helpers.py)\n\nLibrary for aiding the understanding and investigation into the data provided for modeling. These helpers will help explain, graph, and explore the data\n\nFunctions:\n\n```\ndef get_numerical_column_statistics(df):\n    \"\"\"\n        Purpose:\n            Describe the numerical columns in a dataframe.\n            This will include, total_count, count_null, count_0,\n            mean, median, mode, sum, 5% quantile, and 95% quantile.\n        Args:\n            df (Pandas DataFrame): DataFrame to describe\n        Return\n            num_statistics (dictionary): Dictionary with key being\n            the column and the data being statistics for the\n            column\n    \"\"\"\n```\n\n\n```\ndef get_column_correlation(df):\n    \"\"\"\n        Purpose:\n            Determine the true correlation between\n            all column pairs in a passed in DataFrame.\n            This is the pure correlation; this is useful\n            if you are looking for the detailed correlation\n            and the direction of the correlation\n        Args:\n            df (Pandas DataFrame): DataFrame to determine correlation\n        Return\n            unique_value_correlation (Pandas DataFrame): DataFrame\n            of correlations for each column set in the DataFrame\n    \"\"\"\n```\n\n\n```\ndef get_column_absolute_correlation(df):\n    \"\"\"\n        Purpose:\n            Determine the absolute correlation between\n            all column pairs in a passed in DataFrame.\n       
     Absolute converts all correlations to a\n            positive value; this is useful if you are\n            only looking for the existence of a correlation\n            and not the direction.\n        Args:\n            df (Pandas DataFrame): DataFrame to determine correlation\n        Return\n            unique_value_abs_correlation (Pandas DataFrame): DataFrame\n            of correlations for each column set in the DataFrame\n    \"\"\"\n```\n\n\n```\ndef get_column_pairs_significant_correlation(df, pos_corr=.20, neg_corr=.20):\n    \"\"\"\n        Purpose:\n            Determine Columns with highly positive or highly\n            negative correlation. Defaults for positive and\n            negative correlations are 20% and can be passed\n            in as parameters\n        Args:\n            df (Pandas DataFrame): DataFrame to determine correlation\n            pos_corr (float): Float percentage to consider a positive\n            correlation as significant. Default 20%\n            neg_corr (float): Float percentage to consider a negative\n            correlation as significant. Default 20%\n        Return\n            high_positive_correlation_pairs (List of Sets): List of column\n            pairs with a high positive correlation\n            high_negative_correlation_pairs (List of Sets): List of column\n            pairs with a high negative correlation\n    \"\"\"\n```\n\n\n```\ndef get_unique_column_paris(df):\n    \"\"\"\n        Purpose:\n            Get unique pairs of columns from a DataFrame. 
This\n            assumes there is no direction (A, B) and returns\n            a Set of column pairs that can be used for identifying\n            correlation, mapping columns, and other functions\n        Args:\n            df (Pandas DataFrame): DataFrame to determine column pairs\n        Return\n            unique_pairs (Set): Set of unique column pairs\n    \"\"\"\n```\n\n### [model_persistence_helpers.py](https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science/blob/master/data_science_helpers/model_persistence_helpers.py)\n\nLibrary for helping store/load/persist data science models using Python libraries\n\nFunctions:\n\n```\ndef store_model_as_pickle(filename, config={}, metadata={}):\n    \"\"\"\n    Purpose:\n        Store a model in memory to a .pkl file for later\n        usage. Also store a .config file and .metadata\n        file with information about the model\n    Args:\n        filename (String): Filename of a pickled model (.pkl)\n        config (Dict): Configuration data for the model\n        metadata (Dict): Metadata related to the model/training/etc\n    Return:\n        N/A\n    \"\"\"\n```\n\n\n```\ndef load_pickled_model(filename):\n    \"\"\"\n    Purpose:\n        Load a model that has been pickled and stored to\n        persistent storage into memory\n    Args:\n        filename (String): Filename of a pickled model (.pkl)\n    Return:\n        model (Pickled Object): Pickled model loaded from .pkl\n    \"\"\"\n```\n\n### [model_training_helpers.py](https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science/blob/master/data_science_helpers/model_training_helpers.py)\n\nLibrary for helping train data science models using Python libraries\n\nFunctions:\n\n```\ndef split_dataframe_for_model_training(\n    df, dependent_variable, independent_variables=None, train_size=.70):\n    \"\"\"\n        Purpose:\n            Takes in DataFrame and creates 4 DataFrames.\n            2 DataFrames holding the X (independent) variables and\n            2 DataFrames holding the Y (dependent) variable.\n            Train size is defaulted at 70% and the split defaults to using\n            all passed in columns.\n        Args:\n            df (Pandas DataFrame): DataFrame to split\n            dependent_variable (string): dependent variable\n                that the model is being created to predict\n            independent_variables (List of strings): independent variables that\n                will be used to predict the dependent variable. If no columns\n                are passed, use all columns in the dataframe except the\n                dependent variable.\n            train_size (float): Percentage of rows in DataFrame\n                to use training the model. Inverse percentage will/can\n                be used to test the model's effectiveness\n        Return\n            train_x (Pandas DataFrame): DataFrame with all independent variables\n                for training the model. Size is equal to the row count of the\n                base dataset multiplied by the train size\n            test_x (Pandas DataFrame): DataFrame with all independent variables\n                for testing the trained model. Size is equal to the row count\n                of the base dataset multiplied by (1 - train size)\n            train_y_observed (Pandas DataFrame): DataFrame with all dependent\n                variables for training the model. Size is equal to the row count\n                of the base dataset multiplied by the train size\n            test_y_observed (Pandas DataFrame): DataFrame with all dependent\n                variables for testing the trained model. Size is equal to the\n                row count of the base dataset multiplied by (1 - train size)\n    \"\"\"\n```\n\n```\ndef split_dataframe_by_column(df, column):\n    \"\"\"\n        Purpose:\n            Split dataframe into multiple dataframes based on the unique values\n            of the column passed in. 
The dataframe is then split into smaller\n            dataframes, one for each value of the variable.\n        Args:\n            df (Pandas DataFrame): DataFrame to split\n            column (string): string of the column name to split on\n        Return\n            split_df (Dict of Pandas DataFrames): Dictionary with the\n                split dataframes and the value that the column maps to\n                e.g. false/true/0/1\n    \"\"\"\n```\n\n## Example Scripts\n\nExample executable Python scripts/modules for testing and interacting with the library. These show example use-cases for the libraries and can be used as templates for developing with the libraries or for one-off development efforts.\n\n### N/A\n\n## Notes\n\n - Relies on f-string notation, which requires Python 3.6 or later. A refactor to remove these could allow for development with Python 3.0.x through 3.5.x\n\n## TODO\n\n - Unittest framework in place, but lacking tests\n\n\n",
        "description_content_type": "text/markdown",
        "docs_url": null,
        "download_url": "",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science",
        "keywords": "python,libraries,numpy,pandas,data science",
        "license": "MIT",
        "maintainer": "",
        "maintainer_email": "",
        "name": "ctodd-python-lib-data-science",
        "package_url": "https://pypi.org/project/ctodd-python-lib-data-science/",
        "platform": "",
        "project_url": "https://pypi.org/project/ctodd-python-lib-data-science/",
        "project_urls": {
            "Homepage": "https://github.com/ChristopherHaydenTodd/ctodd-python-lib-data-science"
        },
        "release_url": "https://pypi.org/project/ctodd-python-lib-data-science/1.0.0/",
        "requires_dist": [
            "pandas (>=0.24.2)",
            "great-expectations (>=0.4.5)",
            "tensorflow (>=1.13.1)"
        ],
        "requires_python": ">3.6",
        "summary": "Python utilities used for practicing data science and engineering",
        "version": "1.0.0"
    },
    "last_serial": 5166161,
    "releases": {
        "1.0.0": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "2fe87aaa345c10593bf12edd1e64d828",
                    "sha256": "6cb39b6b91121e460dd6b92bc6abdeb8559fc210d28e096f2d63b38d1a4c7e92"
                },
                "downloads": -1,
                "filename": "ctodd_python_lib_data_science-1.0.0-py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "2fe87aaa345c10593bf12edd1e64d828",
                "packagetype": "bdist_wheel",
                "python_version": "py3",
                "requires_python": ">3.6",
                "size": 15178,
                "upload_time": "2019-04-19T21:18:50",
                "url": "https://files.pythonhosted.org/packages/c0/72/1c4c6b78e4fc86e1c5fef639963f43d13b7c1c09ee28a81bb3052f07e829/ctodd_python_lib_data_science-1.0.0-py3-none-any.whl"
            },
            {
                "comment_text": "",
                "digests": {
                    "md5": "4b5d9afaf898f7275513fcea97b572b8",
                    "sha256": "343259561f9ad7603f206be6c954a32e245d3e872577c4b4c151445ad15f6aab"
                },
                "downloads": -1,
                "filename": "ctodd-python-lib-data-science-1.0.0.tar.gz",
                "has_sig": false,
                "md5_digest": "4b5d9afaf898f7275513fcea97b572b8",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": ">3.6",
                "size": 13935,
                "upload_time": "2019-04-19T21:18:52",
                "url": "https://files.pythonhosted.org/packages/0d/01/fde627137bb5911c552b19bfd2bcc8fa88cb675f9fca1b626d881c2e165f/ctodd-python-lib-data-science-1.0.0.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "2fe87aaa345c10593bf12edd1e64d828",
                "sha256": "6cb39b6b91121e460dd6b92bc6abdeb8559fc210d28e096f2d63b38d1a4c7e92"
            },
            "downloads": -1,
            "filename": "ctodd_python_lib_data_science-1.0.0-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "2fe87aaa345c10593bf12edd1e64d828",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": ">3.6",
            "size": 15178,
            "upload_time": "2019-04-19T21:18:50",
            "url": "https://files.pythonhosted.org/packages/c0/72/1c4c6b78e4fc86e1c5fef639963f43d13b7c1c09ee28a81bb3052f07e829/ctodd_python_lib_data_science-1.0.0-py3-none-any.whl"
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "4b5d9afaf898f7275513fcea97b572b8",
                "sha256": "343259561f9ad7603f206be6c954a32e245d3e872577c4b4c151445ad15f6aab"
            },
            "downloads": -1,
            "filename": "ctodd-python-lib-data-science-1.0.0.tar.gz",
            "has_sig": false,
            "md5_digest": "4b5d9afaf898f7275513fcea97b572b8",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": ">3.6",
            "size": 13935,
            "upload_time": "2019-04-19T21:18:52",
            "url": "https://files.pythonhosted.org/packages/0d/01/fde627137bb5911c552b19bfd2bcc8fa88cb675f9fca1b626d881c2e165f/ctodd-python-lib-data-science-1.0.0.tar.gz"
        }
    ]
}