{ "info": { "author": "MIT Data To AI Lab", "author_email": "dailabmit@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Topic :: Scientific/Engineering :: Artificial Intelligence" ], "description": "[![CircleCI][circleci-img]][circleci-url]\n[![Coverage status][codecov-img]][codecov-url]\n[![Documentation][rtd-img]][rtd-url]\n\n[circleci-img]: https://circleci.com/gh/HDI-Project/ATM.svg?style=shield\n[circleci-url]: https://circleci.com/gh/HDI-Project/ATM\n[travis-img]: https://travis-ci.org/HDI-Project/ATM.svg?branch=master\n[travis-url]: https://travis-ci.org/HDI-Project/ATM\n[pypi-img]: https://img.shields.io/pypi/v/atm.svg\n[pypi-url]: https://pypi.python.org/pypi/atm\n[codecov-img]: https://codecov.io/gh/HDI-project/ATM/branch/master/graph/badge.svg\n[codecov-url]: https://codecov.io/gh/HDI-project/ATM\n[rtd-img]: https://readthedocs.org/projects/atm/badge/?version=latest\n[rtd-url]: http://atm.readthedocs.io/en/latest/\n\n\n# ATM - Auto Tune Models\n\n- Free software: MIT license\n- Documentation: http://atm.readthedocs.io/en/latest/\n\n\nATM is an open source software library under the\n[*Human Data Interaction* project](https://hdi-dai.lids.mit.edu/)\n(HDI) at MIT. It is a distributed, scalable AutoML system designed with ease of use in mind.\n\n## Summary\n\nFor a given classification problem, ATM's goal is to find\n\n1. a classification *method*, such as *decision tree*, *support vector machine*,\nor *random forest*, and\n2. 
a set of *hyperparameters* for that method\n\nwhich generate the best classifier possible.\n\nATM takes in a dataset with pre-extracted feature vectors and labels as a CSV file.\nIt then begins training and testing classifiers (machine learning models) in parallel.\nAs time goes on, ATM will use the results of previous classifiers to intelligently select\nwhich methods and hyperparameters to try next.\nAlong the way, ATM saves data about each classifier it trains, including the hyperparameters\nused to train it, extensive performance metrics, and a serialized version of the model itself.\n\nATM has the following features:\n\n* It allows users to run the system for multiple datasets and multiple\nproblem configurations in parallel.\n* It can be run locally, on AWS\\*, or on a custom compute cluster\\*\n* It can be configured to use a variety of AutoML approaches for hyperparameter tuning and\nselection, available in the accompanying library [btb](https://github.com/hdi-project/btb)\n* It stores models, metrics and cross validated accuracy information about each\nclassifier it has trained.\n\n\\**work in progress! See issue [#40](https://github.com/HDI-Project/ATM/issues/40)*\n\n## Current status\n\nATM and the accompanying library BTB are under active development.\nWe have made the transition and improvements.\n\n## Setup/Installation\n\nThis section describes the quickest way to get started with ATM on a machine running Ubuntu Linux.\nWe hope to have more in-depth guides in the future, but for now, you should be able to substitute\ncommands for the package manager of your choice to get ATM up and running on most modern\nUnix-based systems.\n\nATM is compatible with and has been tested on Python 2.7, 3.5, and 3.6.\n\n1. **Clone the project**\n\n ```\n git clone https://github.com/hdi-project/ATM.git /path/to/atm\n cd /path/to/atm\n ```\n\n2. 
**Install a database**\n\n You will need to install the libmysqlclient-dev package (for sqlalchemy)\n\n ```\n sudo apt install libmysqlclient-dev\n ```\n\n and at least one of the following databases.\n\n - for SQLite (simpler):\n\n ```\n sudo apt install sqlite3\n ```\n\n - for MySQL:\n\n ```\n sudo apt install mysql-server mysql-client\n ```\n\n3. **Install python dependencies**.\n\n This will also install [btb](https://github.com/hdi-project/btb), the core AutoML library in\n development under the HDI project, as an egg which will track changes to the git repository.\n\n Here, usage of `virtualenv` is shown, but you can substitute `conda` or your preferred\n environment manager as well.\n\n ```\n virtualenv venv\n . venv/bin/activate\n python setup.py install\n ```\n\n## Quick Usage\n\nBelow we will give a quick tutorial of how to run ATM on your desktop.\nWe will use a featurized dataset, already saved in `data/test/pollution_1.csv`.\nThis is one of the datasets available on [openml.org](https://www.openml.org).\nMore details can be found [here](https://www.openml.org/d/542).\nIn this problem the goal is to predict `mortality` using the metrics associated with the air\npollution. 
Below we show a snapshot of the `csv` file.\nThe data has 15 features and the last column is the `class` label.\n\n\n|PREC |JANT |JULT |OVR65 |POPN |EDUC |HOUS |DENS |NONW |WWDRK |POOR |HC |NOX |SO@ |HUMID |class |\n|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|\n|35 |23 |72 |11.1 |3.14 |11 |78.8 |4281 |3.5 |50.7 |14.4 |8 |10 |39 |57 |1 |\n|44 |29 |74 |10.4 |3.21 |9.8 |81.6 |4260 |0.8 |39.4 |12.4 |6 |6 |33 |54 |1 |\n|47 |45 |79 |6.5 |3.41 |11.1 |77.5 |3125 |27.1 |50.2 |20.6 |18 |8 |24 |56 |1 |\n|43 |35 |77 |7.6 |3.44 |9.6 |84.6 |6441 |24.4 |43.7 |14.3 |43 |38 |206 |55 |1 |\n|53 |45 |80 |7.7 |3.45 |10.2 |66.8 |3325 |38.5 |43.1 |25.5 |30 |32 |72 |54 |1 |\n|43 |30 |74 |10.9 |3.23 |12.1 |83.9 |4679 |3.5 |49.2 |11.3 |21 |32 |62 |56 |0 |\n|45 |30 |73 |9.3 |3.29 |10.6 |86 |2140 |5.3 |40.4 |10.5 |6 |4 |4 |56 |0 |\n|.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |\n|.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |\n|.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |.. |\n|37 |31 |75 |8 |3.26 |11.9 |78.4 |4259 |13.1 |49.6 |13.9 |23 |9 |15 |58 |1 |\n|35 |46 |85 |7.1 |3.22 |11.8 |79.9 |1441 |14.8 |51.2 |16.1 |1 |1 |1 |54 |0 |\n\n\n1. **Create a datarun**\n\n ```\n python scripts/enter_data.py\n ```\n\n This command will create a `datarun`. In ATM, a \"datarun\" is a single logical machine learning\n task. If you run the above command without any arguments, it will use the default settings\n found in `atm/config.py` to create a new SQLite3 database at `./atm.db`, create a new\n `dataset` instance which refers to the data above, and create a `datarun` instance which\n points to that dataset. 
More about what is stored in this database and what it is used for\n can be found [here](https://cyphe.rs/static/atm.pdf).\n\n The command should produce a lot of output, the end of which looks something like this:\n\n ```\n ========== Summary ==========\n Training data: data/test/pollution_1.csv\n Test data: \n Dataset ID: 1\n Frozen set selection strategy: uniform\n Parameter tuning strategy: gp_ei\n Budget: 100 (classifier)\n Datarun ID: 1\n ```\n\n The most important piece of information is the datarun ID.\n\n\n2. **Start a worker**\n\n ```\n python scripts/worker.py\n ```\n\n This will start a process that builds classifiers, tests them, and saves them to the\n `./models/` directory. The output should show which hyperparameters are being tested\n and the performance of each classifier (the \"judgment metric\"), plus the best overall\n performance so far.\n\n ```\n Classifier type: classify_logreg\n Params chosen:\n C = 8904.06127554\n _scale = True\n fit_intercept = False\n penalty = l2\n tol = 4.60893080631\n dual = True\n class_weight = auto\n\n Judgment metric (f1): 0.536 +- 0.067\n Best so far (classifier 21): 0.716 +- 0.035\n ```\n\n Occasionally, a worker will encounter an error in the process of building and testing a\n classifier. When this happens, the worker will print error data to the terminal, log the\n error in the database, and move on to the next classifier.\n\nAnd that's it! You can break out of the worker with Ctrl+c and restart\nit with the same command; it will pick up right where it left off. You can also run the\ncommand simultaneously in different terminals to parallelize the work -- all workers will\nrefer to the same ModelHub database. When all 100 classifiers in your budget have been built,\nall workers will exit gracefully.\n\n## Customizing ATM's configuration and using your own data\n\nATM's default configuration is fully controlled by `atm/config.py`. 
Our documentation will\ncover the configuration in more detail, but this section provides a brief overview of how\nto specify the most important values.\n\n### Running ATM on your own data\n\nIf you want to use the system for your own dataset, convert your data to a CSV file similar\nto the example shown above. The format is:\n\n * Each column is a feature (or the label)\n * Each row is a training example\n * The first row is the header row, which contains names for each column of data\n * A single column (the *target* or *label*) is named `class`\n\nNext, you'll need to use `enter_data.py` to create a `dataset` and `datarun` for your task.\n\nThe script will look for values for each configuration variable in the following places, in order:\n\n1. Command line arguments\n2. Configuration files\n3. Defaults specified in `atm/config.py`\n\nThat means there are two ways to pass configuration to the command.\n\n1. **Using YAML configuration files**\n\n Saving configuration as YAML files is an easy way to save complicated setups or share\n them with team members.\n\n You should start with the templates provided in `atm/config/templates` and modify them\n to suit your own needs.\n\n ```\n mkdir config\n cp atm/config/templates/*.yaml config/\n vim config/*.yaml\n ```\n\n `run.yaml` contains all the settings for a single dataset and datarun. Specify the `train_path`\n to point to your own dataset.\n\n `sql.yaml` contains the settings for the ModelHub SQL database. The default configuration will\n connect to (and create if necessary) a SQLite database at `./atm.db` relative to the directory\n from which `enter_data.py` is run. If you are using a MySQL database, you will need to change\n the file to something like this:\n\n ```\n dialect: mysql\n database: atm\n username: username\n password: password\n host: localhost\n port: 3306\n query:\n ```\n\n `aws.yaml` should contain the settings for running ATM in the cloud. 
This is not necessary\n for local operation.\n\n Once your YAML files have been updated, run the datarun creation script and pass it the paths\n to your new config files:\n\n ```\n python scripts/enter_data.py --sql-config config/sql.yaml \\\n --aws-config config/aws.yaml \\\n --run-config config/run.yaml\n ```\n\n2. **Using command line arguments**\n\n You can also specify each argument individually on the command line. The names of the\n variables are the same as those in the YAML files. SQL configuration variables must be\n prepended by `sql-`, and AWS config variables must be prepended by `aws-`.\n\n Using command line arguments is convenient for quick experiments, or for cases where you\n need to change just a couple of values from the default configuration. For example:\n\n ```\n python scripts/enter_data.py --train-path ./data/my-custom-data.csv --selector bestkvel\n ```\n\n You can also use a mixture of config files and command line arguments; any command line\n arguments you specify will override the values found in config files.\n\nOnce you've created your custom datarun, start a worker, specifying your config files and the\ndatarun(s) you'd like to compute on.\n\n```\npython scripts/worker.py --sql-config config/sql.yaml \\\n --aws-config config/aws.yaml \\\n --dataruns 1\n```\n\nIt's important that the SQL configuration used by the worker matches the configuration you\npassed to `enter_data.py` -- otherwise, the worker will be looking in the wrong ModelHub\ndatabase for its datarun!\n\n## Citing ATM\n\nIf you use ATM, please consider citing the following paper:\n\nThomas Swearingen, Will Drevo, Bennett Cyphers, Alfredo Cuesta-Infante, Arun Ross, Kalyan Veeramachaneni. 
[ATM: A distributed, collaborative, scalable system for automated machine learning.](https://cyphe.rs/static/atm.pdf) *IEEE BigData 2017*, 151-162\n\nBibTeX entry:\n\n```bibtex\n@inproceedings{DBLP:conf/bigdataconf/SwearingenDCCRV17,\n author = {Thomas Swearingen and\n Will Drevo and\n Bennett Cyphers and\n Alfredo Cuesta{-}Infante and\n Arun Ross and\n Kalyan Veeramachaneni},\n title = {{ATM:} {A} distributed, collaborative, scalable system for automated\n machine learning},\n booktitle = {2017 {IEEE} International Conference on Big Data, BigData 2017, Boston,\n MA, USA, December 11-14, 2017},\n pages = {151--162},\n year = {2017},\n crossref = {DBLP:conf/bigdataconf/2017},\n url = {https://doi.org/10.1109/BigData.2017.8257923},\n doi = {10.1109/BigData.2017.8257923},\n timestamp = {Tue, 23 Jan 2018 12:40:42 +0100},\n biburl = {https://dblp.org/rec/bib/conf/bigdataconf/SwearingenDCCRV17},\n bibsource = {dblp computer science bibliography, https://dblp.org}\n}\n```\n\n## Related Projects\n\n### BTB\n\n[BTB](https://github.com/hdi-project/btb), for Bayesian Tuning and Bandits, is the core AutoML\nlibrary in development under the HDI project. BTB exposes several methods for hyperparameter\nselection and tuning through a common API. It allows domain experts to extend existing methods\nand add new ones easily. BTB is a central part of ATM, and the two projects were developed in\ntandem, but it is designed to be implementation-agnostic and should be useful for a wide range\nof hyperparameter selection tasks.\n\n### Featuretools\n\n[Featuretools](https://github.com/featuretools/featuretools) is a python library for automated\nfeature engineering. 
It can be used to prepare raw transactional and relational datasets for ATM.\nIt is created and maintained by [Feature Labs](https://www.featurelabs.com) and is also a part\nof the [Human Data Interaction Project](https://hdi-dai.lids.mit.edu/).\n\n\n# History\n\n## 0.1.0 (2018-05-04)\n\n* First release on PyPI.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/HDI-project/ATM", "keywords": "machine learning hyperparameters tuning classification", "license": "MIT license", "maintainer": "", "maintainer_email": "", "name": "atm-ml", "package_url": "https://pypi.org/project/atm-ml/", "platform": "", "project_url": "https://pypi.org/project/atm-ml/", "project_urls": { "Homepage": "https://github.com/HDI-project/ATM" }, "release_url": "https://pypi.org/project/atm-ml/0.1.0/", "requires_dist": [ "baytune (==0.1.2)", "boto (>=2.48.0)", "future (>=0.16.0)", "joblib (>=0.11)", "mysqlclient (>=1.2)", "numpy (>=1.13.1)", "pandas (>=0.22.0)", "pyyaml (>=3.12)", "scikit-learn (>=0.18.2)", "scipy (>=0.19.1)", "sklearn-pandas (>=1.5.0)", "sqlalchemy (>=1.1.14)" ], "requires_python": ">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*", "summary": "Auto Tune Models", "version": "0.1.0" }, "last_serial": 4297132, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "4c0ab648f4d0bab676ead5c83ae81a1b", "sha256": "62ee3e372fe4092014f908d154c93f2e97185521b0591ff0e22b37f6579e04af" }, "downloads": -1, "filename": "atm_ml-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "4c0ab648f4d0bab676ead5c83ae81a1b", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*", "size": 97594, "upload_time": "2018-09-21T17:19:48", "url": 
"https://files.pythonhosted.org/packages/cb/f9/297114b138767785bc6fd5451690ed9b11795dcbe015b025c662cd087e20/atm_ml-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f530b2c8e0f421519aeba47d02c10ad9", "sha256": "7c1cb018b67d9bdbb5857dcb57e382a6460630c9654b3def8acbc38b6bba09b9" }, "downloads": -1, "filename": "atm-ml-0.1.0.tar.gz", "has_sig": false, "md5_digest": "f530b2c8e0f421519aeba47d02c10ad9", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*", "size": 108767, "upload_time": "2018-09-21T17:19:50", "url": "https://files.pythonhosted.org/packages/e8/a8/0713f620ae22ec71cb72a4099e5c324509db72db5b53784f76bbcb6e1ce9/atm-ml-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4c0ab648f4d0bab676ead5c83ae81a1b", "sha256": "62ee3e372fe4092014f908d154c93f2e97185521b0591ff0e22b37f6579e04af" }, "downloads": -1, "filename": "atm_ml-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "4c0ab648f4d0bab676ead5c83ae81a1b", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*", "size": 97594, "upload_time": "2018-09-21T17:19:48", "url": "https://files.pythonhosted.org/packages/cb/f9/297114b138767785bc6fd5451690ed9b11795dcbe015b025c662cd087e20/atm_ml-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f530b2c8e0f421519aeba47d02c10ad9", "sha256": "7c1cb018b67d9bdbb5857dcb57e382a6460630c9654b3def8acbc38b6bba09b9" }, "downloads": -1, "filename": "atm-ml-0.1.0.tar.gz", "has_sig": false, "md5_digest": "f530b2c8e0f421519aeba47d02c10ad9", "packagetype": "sdist", "python_version": "source", "requires_python": ">=2.7,!=3.0.*,!=3.1.*,!=3.2.*,!=3.3.*,!=3.4.*", "size": 108767, "upload_time": "2018-09-21T17:19:50", "url": "https://files.pythonhosted.org/packages/e8/a8/0713f620ae22ec71cb72a4099e5c324509db72db5b53784f76bbcb6e1ce9/atm-ml-0.1.0.tar.gz" } ] }