{ "info": { "author": "MIT Data To AI Lab", "author_email": "dailabmit@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "
\n
\n
\nA project from Data to AI Lab at MIT.\n
\n\n\n ------------------------------------\n
\n\n\n
\n
\nGreenGuard is a machine learning library built for data generated by wind turbines and solar panel installations.\n
\n\n[](https://pypi.python.org/pypi/greenguard)\n[](https://travis-ci.org/D3-AI/GreenGuard)\n\n# GreenGuard\n\n- Documentation: https://D3-AI.github.io/GreenGuard\n- Homepage: https://github.com/D3-AI/GreenGuard\n\n# Overview\n\nThe GreenGuard project is a collection of end-to-end solutions for machine learning tasks commonly\nfound in monitoring wind energy production systems. Most tasks utilize sensor data\ncoming from monitoring systems. We build on the foundational innovations developed for\nthe automation of machine learning at the Data to AI Lab at MIT.\n\nThe salient aspects of this customized project are:\n* A set of ready-to-use, well-tested pipelines for different machine learning tasks. These are\n vetted through testing across multiple publicly available datasets for the same task.\n* An easy interface to specify the task and pipeline, generate results, and summarize them.\n* A production-ready, deployable pipeline.\n* An easy interface to ``tune`` pipelines using the Bayesian Tuning and Bandits library.\n* A community-oriented infrastructure to incorporate new pipelines.\n* A robust continuous integration and testing infrastructure.\n* A ``learning database`` recording all past outcomes: tasks, pipelines, and results.\n\n## Data Format\n\n**GreenGuard Pipelines** work on time series formatted as follows:\n\n* A **Turbines** table that contains:\n * `turbine_id`: column with the unique id of each turbine.\n * A number of additional columns with information about each turbine.\n* A **Signals** table that contains:\n * `signal_id`: column with the unique id of each signal.\n * A number of additional columns with information about each signal.\n* A **Readings** table that contains:\n * `reading_id`: Unique identifier of this reading.\n * `turbine_id`: Unique identifier of the turbine which this reading comes from.\n * `signal_id`: Unique identifier of the signal which this reading comes from.\n * `timestamp`: Time at which the reading took place, as an ISO-formatted 
datetime.\n * `value`: Numeric value of this reading.\n* A **Targets** table that contains:\n * `target_id`: Unique identifier of this target.\n * `turbine_id`: Unique identifier of the turbine which this label corresponds to.\n * `timestamp`: Time associated with this target.\n * `target - optional`: The value that we want to predict. This can either be a numerical\n value or a categorical label. This column can also be skipped when preparing data that\n will be used only to make predictions and not to fit any pipeline.\n\n### Demo Dataset\n\nFor development and demonstration purposes, we include a dataset with data from several telemetry\nsignals associated with one wind energy production turbine.\n\nThis data, which has already been formatted as expected by the GreenGuard Pipelines, can be\nbrowsed and downloaded directly from the\n[d3-ai-greenguard AWS S3 Bucket](https://d3-ai-greenguard.s3.amazonaws.com/index.html).\n\nThis dataset is adapted from the one used in the project by Cohen, Elliot J.,\n"Wind Analysis." Joint Initiative of the ECOWAS Centre for Renewable Energy and Energy Efficiency (ECREEE), The United Nations Industrial Development Organization (UNIDO) and the Sustainable Engineering Lab (SEL). Columbia University, 22 Aug. 
2014.\n[Available online here](https://github.com/Ecohen4/ECREEE)\n\nThe complete list of manipulations performed on the original dataset to convert it into the\ndemo version used here is shown and explained in the\n[Green Guard Demo Data notebook](notebooks/Green%20Guard%20Demo%20Data.ipynb).\n\n## Concepts\n\nBefore diving into the software usage, we briefly explain some concepts and terminology.\n\n### Primitive\n\nWe call the smallest computational blocks used in a machine learning process\n**primitives**, which:\n\n* Can be either classes or functions.\n* Have some initialization arguments, which MLBlocks calls `init_params`.\n* Have some tunable hyperparameters, which have types and a list or range of valid values.\n\n### Template\n\nPrimitives can be combined to form what we call **Templates**, which:\n\n* Have a list of primitives.\n* Have some initialization arguments, which correspond to the initialization arguments\n of their primitives.\n* Have some tunable hyperparameters, which correspond to the tunable hyperparameters\n of their primitives.\n\n### Pipeline\n\nTemplates can be used to build **Pipelines** by taking and fixing a set of valid\nhyperparameters for a Template. Hence, Pipelines:\n\n* Have a list of primitives, which corresponds to the list of primitives of their template.\n* Have some initialization arguments, which correspond to the initialization arguments\n of their template.\n* Have some hyperparameter values, which fall within the ranges of valid tunable\n hyperparameters of their template.\n\nA pipeline can be fitted and evaluated using the MLPipeline API in MLBlocks.\n\n\n## Current tasks and pipelines\n\nIn our current phase, we are addressing two tasks: time series classification and time series\nregression. 
To provide solutions for these two tasks, we have two components.\n\n### GreenGuardPipeline\n\nThis class is the one in charge of learning from the data and making predictions by building\n[MLBlocks](https://hdi-project.github.io/MLBlocks) pipelines and later on tuning them using\n[BTB](https://hdi-project.github.io/BTB/).\n\n### GreenGuardLoader\n\nA class responsible for loading the time series data from CSV files and returning it in a\nformat ready to be used by the **GreenGuardPipeline**.\n\n### Tuning\n\nWe call tuning the process of, given a dataset and a template, finding the pipeline derived from the\ngiven template that achieves the best possible score on the given dataset.\n\nThis process usually involves fitting and evaluating multiple pipelines with different hyperparameter\nvalues on the same data while using optimization algorithms to deduce which hyperparameters are more\nlikely to get the best results in the next iterations.\n\nWe call each one of these tries a **tuning iteration**.\n\n\n# Getting Started\n\n## Requirements\n\n### Python\n\n**GreenGuard** has been developed and runs on Python 3.5, 3.6 and 3.7.\n\nAlso, although it is not strictly required, the usage of a [virtualenv](https://virtualenv.pypa.io/en/latest/)\nis highly recommended in order to avoid interfering with other software installed in the system\nwhere you are trying to run **GreenGuard**.\n\n## Installation\n\nThe simplest and recommended way to install **GreenGuard** is using pip:\n\n```bash\npip install greenguard\n```\n\nFor development, you can also clone the repository and install it from source:\n\n```bash\ngit clone git@github.com:D3-AI/GreenGuard.git\ncd GreenGuard\nmake install-develop\n```\n\n## Quickstart\n\nIn this example we will load some demo data using the **GreenGuardLoader** and feed it to the\n**GreenGuardPipeline** so that it can find the best possible pipeline, fit it using the given data\nand then make predictions with it.\n\n### 1. 
Load and explore the data\n\nThe first step is to load the demo data.\n\nFor this, we will import and call the `greenguard.loader.load_demo` function without any arguments:\n\n```python\nfrom greenguard.loader import load_demo\n\nX, y, tables = load_demo()\n```\n\nThe returned objects are:\n\n`X`: A `pandas.DataFrame` with the `targets` table data without the `target` column.\n\n```\n   target_id  turbine_id   timestamp\n0          1           1  2013-01-01\n1          2           1  2013-01-02\n2          3           1  2013-01-03\n3          4           1  2013-01-04\n4          5           1  2013-01-05\n```\n\n`y`: A `pandas.Series` with the `target` column from the `targets` table.\n\n```\n0    0.0\n1    0.0\n2    0.0\n3    0.0\n4    0.0\nName: target, dtype: float64\n```\n\n`tables`: A dictionary containing three tables in the format explained above:\n\nthe `turbines` table:\n\n```\n   turbine_id       name\n0           1  Turbine 1\n```\n\nthe `signals` table:\n\n```\n   signal_id                                          name\n0          1  WTG01_Grid Production PossiblePower Avg. (1)\n1          2  WTG02_Grid Production PossiblePower Avg. (2)\n2          3  WTG03_Grid Production PossiblePower Avg. (3)\n3          4  WTG04_Grid Production PossiblePower Avg. (4)\n4          5  WTG05_Grid Production PossiblePower Avg. (5)\n```\n\nand the `readings` table:\n\n```\n   reading_id  turbine_id  signal_id   timestamp  value\n0           1           1          1  2013-01-01  817.0\n1           2           1          2  2013-01-01  805.0\n2           3           1          3  2013-01-01  786.0\n3           4           1          4  2013-01-01  809.0\n4           5           1          5  2013-01-01  755.0\n```\n\n### 2. Split the data\n\nIf we want to split the data into train and test subsets, we can do so by splitting the\n`X` and `y` variables with any suitable tool.\n\nIn this case, we will do it using the [train_test_split function from scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).\n\n```python\nfrom sklearn.model_selection import train_test_split\n\nX_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)\n```\n\n### 3. 
Finding the best Pipeline\n\nOnce we have loaded the data, we create a **GreenGuardPipeline** instance by passing:\n\n* `template (string)`: The name of a template or the path to a template json file.\n* `metric (string or function)`: The name of the metric to use, or a metric function.\n* `cost (bool)`: Whether the metric is a cost function to be minimized or a score to be maximized.\n\nOptionally, we can also pass details about the cross validation configuration:\n\n* `stratify`\n* `cv_splits`\n* `shuffle`\n* `random_state`\n\nIn this case, we will be using the `greenguard_classification` template, the\n`accuracy` metric, and only 2 cross validation splits:\n\n```python\nfrom greenguard.pipeline import GreenGuardPipeline\n\npipeline = GreenGuardPipeline(template='greenguard_classification', metric='accuracy', cv_splits=2)\n```\n\nOnce we have created the pipeline, we can call its `tune` method to find the best possible\nhyperparameters for our data, passing the training data and the `tables` variable returned by the loader,\nas well as the number of tuning iterations that we want to perform.\n\n```python\npipeline.tune(X_train, y_train, tables, iterations=10)\n```\n\nAfter the tuning process has finished, the best hyperparameters found have already been set in the pipeline.\n\nWe can inspect these hyperparameters by calling the `get_hyperparameters` method,\n\n```python\npipeline.get_hyperparameters()\n```\n\nwhich will return a dictionary with the best hyperparameters found so far:\n\n```\n{\n    \"pandas.DataFrame.resample#1\": {\n        \"rule\": \"1D\",\n        \"time_index\": \"timestamp\",\n        \"groupby\": [\n            \"turbine_id\",\n            \"signal_id\"\n        ],\n        \"aggregation\": \"mean\"\n    },\n    \"pandas.DataFrame.unstack#1\": {\n        \"level\": \"signal_id\",\n        \"reset_index\": true\n    },\n    ...\n```\n\nas well as the obtained cross validation score by looking at the `score` attribute of the\n`pipeline` object:\n\n```python\npipeline.score # -> 
0.6447509660798626\n```\n\n**NOTE**: If the score is not good enough, we can call the `tune` method again as many times\nas needed and the pipeline will continue its tuning process every time based on the previous\nresults!\n\n### 4. Fitting the pipeline\n\nOnce we are satisfied with the obtained cross validation score, we can proceed to call\nthe `fit` method passing the same data elements again.\n\nThis will fit the pipeline with all the training data available using the best hyperparameters\nfound during the tuning process:\n\n```python\npipeline.fit(X_train, y_train, tables)\n```\n\n### 5. Use the fitted pipeline\n\nAfter fitting the pipeline, we are ready to make predictions on new data:\n\n```python\npredictions = pipeline.predict(X_test, tables)\n```\n\nAnd evaluate its prediction performance:\n\n```python\nfrom sklearn.metrics import accuracy_score\n\naccuracy_score(y_test, predictions) # -> 0.6413043478260869\n```\n\n### 6. Save and load the pipeline\n\nSince the tuning and fitting process takes time to execute and requires a lot of data, you\nwill probably want to save a fitted instance and load it later to analyze new signals\ninstead of fitting pipelines over and over again.\n\nThis can be done by using the `save` and `load` methods of the `GreenGuardPipeline`.\n\nIn order to save an instance, call its `save` method passing it the path and filename\nwhere the model should be saved.\n\n```python\npath = 'my_pipeline.pkl'\n\npipeline.save(path)\n```\n\nOnce the pipeline is saved, it can be loaded back as a new `GreenGuardPipeline` by using the\n`GreenGuardPipeline.load` method:\n\n```python\nnew_pipeline = GreenGuardPipeline.load(path)\n```\n\nOnce loaded, it can be directly used to make predictions on new data.\n\n```python\nnew_pipeline.predict(X_test, tables)\n```\n\n\n## Use your own Dataset\n\nOnce you are familiar with the **GreenGuardPipeline** usage, you will probably want to run it\non your own dataset.\n\nHere are the necessary steps:\n\n### 1. 
Prepare the data\n\nFirst of all, you will need to prepare your data as 4 CSV files like the ones described in the\n[data format](#data-format) section above.\n\n### 2. Create a GreenGuardLoader\n\nOnce you have the CSV files ready, you will need to import the `greenguard.loader.GreenGuardLoader`\nclass and create an instance passing:\n\n* `path - str`: The path to the folder where the 4 CSV files are.\n* `target - str, optional`: The name of the target table. Defaults to `targets`.\n* `target_column - str, optional`: The name of the target column. Defaults to `target`.\n* `readings - str, optional`: The name of the readings table. Defaults to `readings`.\n* `turbines - str, optional`: The name of the turbines table. Defaults to `turbines`.\n* `signals - str, optional`: The name of the signals table. Defaults to `signals`.\n* `gzip - bool, optional`: Set to True if the CSV files are gzipped. Defaults to False.\n\nFor example, here we will be loading a custom dataset which is stored in gzip format\ninside the `my_dataset` folder, and for which the target table has a different name:\n\n```python\nfrom greenguard.loader import GreenGuardLoader\n\nloader = GreenGuardLoader(path='my_dataset', target='labels', gzip=True)\n```\n\n### 3. 
Call the `loader.load` method\n\nOnce the `loader` instance has been created, we can call its `load` method:\n\n```python\nX, y, tables = loader.load()\n```\n\nOptionally, if the dataset contains only data to make predictions and the `target` column\ndoes not exist, we can pass it the argument `target=False` to skip it:\n\n```python\nX, tables = loader.load(target=False)\n```\n\n## Docker Usage\n\n**GreenGuard** comes configured and ready to be distributed and run as a docker image which starts\na jupyter notebook already configured to use greenguard, with all the required dependencies already\ninstalled.\n\n### Requirements\n\nThe only requirement in order to run the GreenGuard Docker image is to have Docker installed and\nenough permissions to run it.\n\nInstallation instructions for all compatible systems can be found [here](https://docs.docker.com/install/).\n\nAdditionally, the system that builds the GreenGuard Docker image will also need to have a working\ninternet connection that allows downloading the base image and the additional Python dependencies.\n\n### Building the GreenGuard Docker Image\n\nAfter having cloned the **GreenGuard** repository, all you have to do in order to build the GreenGuard Docker\nImage is to run this command:\n\n```bash\nmake docker-jupyter-build\n```\n\nAfter a few minutes, the new image, called `greenguard-jupyter`, will have been built on the system\nand will be ready to be used or distributed.\n\n### Distributing the GreenGuard Docker Image\n\nOnce the `greenguard-jupyter` image is built, it can be distributed in several ways.\n\n#### Distributing using a Docker registry\n\nThe simplest way to distribute the recently created image is [using a registry](https://docs.docker.com/registry/).\n\nIn order to do so, we will need to have write access to a public or private registry (remember to\n[login](https://docs.docker.com/engine/reference/commandline/login/)!) 
and execute these commands:\n\n```bash\ndocker tag greenguard-jupyter:latest your-registry-name:some-tag\ndocker push your-registry-name:some-tag\n```\n\nAfterwards, on the receiving machine:\n\n```bash\ndocker pull your-registry-name:some-tag\ndocker tag your-registry-name:some-tag greenguard-jupyter:latest\n```\n\n#### Distributing as a file\n\nIf the distribution of the image has to be done offline for any reason, it can be achieved\nusing the following commands.\n\nOn the system that already has the image:\n\n```bash\ndocker save --output greenguard-jupyter.tar greenguard-jupyter\n```\n\nThen copy the file `greenguard-jupyter.tar` over to the new system and there, run:\n\n```bash\ndocker load --input greenguard-jupyter.tar\n```\n\nAfter these commands, the `greenguard-jupyter` image should be available and ready to be used on the\nnew system.\n\n\n### Running the greenguard-jupyter image\n\nOnce the `greenguard-jupyter` image has been built, pulled or loaded, it is ready to be run.\n\nThis can be done in two ways:\n\n#### Running greenguard-jupyter with the code\n\nIf the GreenGuard source code is available in the system, running the image is as simple as running\nthis command from within the root of the project:\n\n```bash\nmake docker-jupyter-run\n```\n\nThis will start a jupyter notebook using the docker image, which you can access by pointing your\nbrowser at http://127.0.0.1:8888\n\nIn this case, the local version of the project will also be mounted within the Docker container,\nwhich means that any changes that you make to your local code will immediately be available\nwithin your notebooks, and that any notebook that you create within jupyter will also show\nup in your `notebooks` folder!\n\n#### Running greenguard-jupyter without the greenguard code\n\nIf the GreenGuard source code is not available in the system and only the Docker Image is, you can\nstill run the image by using this command:\n\n```bash\ndocker run -ti -p 8888:8888 
greenguard-jupyter\n```\n\nIn this case, the code changes and the notebooks that you create within jupyter will stay\ninside the container and you will only be able to access and download them through the\njupyter interface.\n\n\n# History\n\n## 0.1.0\n\n* First release on PyPI\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/D3-AI/GreenGuard", "keywords": "wind machine learning green guard", "license": "MIT license", "maintainer": "", "maintainer_email": "", "name": "greenguard", "package_url": "https://pypi.org/project/greenguard/", "platform": "", "project_url": "https://pypi.org/project/greenguard/", "project_urls": { "Homepage": "https://github.com/D3-AI/GreenGuard" }, "release_url": "https://pypi.org/project/greenguard/0.1.0/", "requires_dist": [ "baytune (<0.3,>=0.2.3)", "mlblocks (<0.4,>=0.3.0)", "mlprimitives (<0.2,>=0.1.8)", "numpy (<1.17,>=1.15.4)", "pymongo (<4,>=3.7.2)", "scikit-learn (<0.21,>=0.20.1)", "bumpversion (>=0.5.3) ; extra == 'dev'", "pip (>=10.0.0) ; extra == 'dev'", "watchdog (>=0.8.3) ; extra == 'dev'", "m2r (>=0.2.0) ; extra == 'dev'", "Sphinx (>=1.7.1) ; extra == 'dev'", "sphinx-rtd-theme (>=0.2.4) ; extra == 'dev'", "recommonmark (>=0.4.0) ; extra == 'dev'", "flake8 (>=3.5.0) ; extra == 'dev'", "isort (>=4.3.4) ; extra == 'dev'", "autoflake (>=1.1) ; extra == 'dev'", "autopep8 (>=1.3.5) ; extra == 'dev'", "twine (>=1.10.0) ; extra == 'dev'", "wheel (>=0.30.0) ; extra == 'dev'", "jupyter (>=1.0.0) ; extra == 'dev'", "coverage (>=4.5.1) ; extra == 'dev'", "pytest (>=3.4.2) ; extra == 'dev'", "tox (>=2.9.1) ; extra == 'dev'", "coverage (>=4.5.1) ; extra == 'test'", "pytest (>=3.4.2) ; extra == 'test'", "tox (>=2.9.1) ; extra == 'test'" ], "requires_python": ">=3.5", "summary": "The GreenGuard project is a collection of end-to-end solutions for machine learning tasks commonly found in monitoring 
wind energy production systems.", "version": "0.1.0" }, "last_serial": 5133037, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "eec5c940cc930b9337d5cc0dabc71d14", "sha256": "b4d7e3a29b2dcfafe1683f1b76b5f57d5e8da1b635202765f2c6aeec641b4aa1" }, "downloads": -1, "filename": "greenguard-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "eec5c940cc930b9337d5cc0dabc71d14", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.5", "size": 15207, "upload_time": "2019-04-12T08:57:24", "url": "https://files.pythonhosted.org/packages/f8/cf/4c2949e4c264cf00e7cd2818f2db11dd450ccaa7ac843fac69fcaf2cb622/greenguard-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9821d5a7b0700488710f463e967fa236", "sha256": "2adf28804a2ee36b0aa07e887e8ed69d12698bf8c074414eea49f0f4074a39a3" }, "downloads": -1, "filename": "greenguard-0.1.0.tar.gz", "has_sig": false, "md5_digest": "9821d5a7b0700488710f463e967fa236", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 33734, "upload_time": "2019-04-12T08:57:27", "url": "https://files.pythonhosted.org/packages/e1/5f/56cd7e3ac0c898c8e7fe36dcc1eec0167e00e9177eaaf9eea5f48feda1a7/greenguard-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "eec5c940cc930b9337d5cc0dabc71d14", "sha256": "b4d7e3a29b2dcfafe1683f1b76b5f57d5e8da1b635202765f2c6aeec641b4aa1" }, "downloads": -1, "filename": "greenguard-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "eec5c940cc930b9337d5cc0dabc71d14", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.5", "size": 15207, "upload_time": "2019-04-12T08:57:24", "url": "https://files.pythonhosted.org/packages/f8/cf/4c2949e4c264cf00e7cd2818f2db11dd450ccaa7ac843fac69fcaf2cb622/greenguard-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "9821d5a7b0700488710f463e967fa236", "sha256": 
"2adf28804a2ee36b0aa07e887e8ed69d12698bf8c074414eea49f0f4074a39a3" }, "downloads": -1, "filename": "greenguard-0.1.0.tar.gz", "has_sig": false, "md5_digest": "9821d5a7b0700488710f463e967fa236", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 33734, "upload_time": "2019-04-12T08:57:27", "url": "https://files.pythonhosted.org/packages/e1/5f/56cd7e3ac0c898c8e7fe36dcc1eec0167e00e9177eaaf9eea5f48feda1a7/greenguard-0.1.0.tar.gz" } ] }