{ "info": { "author": "Reiji Hatsugai", "author_email": "reiji.hatsugai@deepx.co.jp", "bugtrack_url": null, "classifiers": [], "description": "
\n\n
\n
\n\n[![Build Status](https://travis-ci.com/DeepX-inc/machina.svg?token=xZEqXwSaqc7xZ2saWZa2&branch=master)](https://travis-ci.com/DeepX-inc/machina)\n[![Python Version](https://img.shields.io/pypi/pyversions/Django.svg)](https://github.com/DeepX-inc/machina)\n[![License](https://img.shields.io/badge/license-MIT-blue.svg)](https://github.com/DeepX-inc/machina/blob/master/LICENSE)\n\n# machina\n\nmachina is a library for real-world Deep Reinforcement Learning which is built on top of PyTorch. \nmachina is officially pronounced \"m\u03ack\u026an\u0259\".\n\n## Features\n**High Composability** \nComposability is an important property in computer programming, allowing to dynamically switch between program components during execution. machina was built and designed with this principle in mind, allowing for high flexibility on system and program development. \nSpecifically, the RL-policy interacts with the environment via generated trajectories, making the exchange of either components simple. For example, using machina, it is possible to switch between a simulated and a real-world environment during the training phase.\n\n### Base Merits\nThere are merits for all users including beginners of Deep Reinforcement Learning.\n1. Readability\n2. Intuitive understanding of algorithmic differences\n3. Customizability\n\n### Advanced Merits\nUsing the principle of composability, we can easily implement following configurations which are otherwise difficult in other RL libraries.\n1. Easy implementation of mixed environment (e.g. simulated environment and real world environment, some meta learning settings).\n2. Convenient for combining multiple algorithms (e.g. Q-Prop is combination of TRPO and DDPG).\n3. Possibility of changing hyperparameters dynamically (e.g. Meta Learning for hyperparameters).\n\n#### 1 Meta Reinforcement Learning example\nWe usually define meta learning as a fast adaptation method for tasks which are sampled from a task-space.\nIn meta RL, a task is defined as a MDP.\nRL agents have to adapt to a new MDP as fast as possible.\nWe have to sample episodes from different environments to train our meta agent.\nWe can easily implement this like below with machina.\n\n```python:run_mixed_env.py\nenv1 = GymEnv('HumanoidBulletEnv-v0')\n\nenv2 = GymEnv('HumanoidFlagrunBulletEnv-v0')\n\nepis1 = sampler1.sample(pol, max_epis=args.max_epis_per_iter)\nepis2 = sampler2.sample(pol, max_epis=args.max_epis_per_iter)\ntraj1 = Traj()\ntraj2 = Traj()\n\ntraj1.add_epis(epis1)\ntraj1 = ef.compute_vs(traj1, vf)\ntraj1 = ef.compute_rets(traj1, args.gamma)\ntraj1 = ef.compute_advs(traj1, args.gamma, args.lam)\ntraj1 = ef.centerize_advs(traj1)\ntraj1 = ef.compute_h_masks(traj1)\ntraj1.register_epis()\n\ntraj2.add_epis(epis2)\ntraj2 = ef.compute_vs(traj2, vf)\ntraj2 = ef.compute_rets(traj2, args.gamma)\ntraj2 = ef.compute_advs(traj2, args.gamma, args.lam)\ntraj2 = ef.centerize_advs(traj2)\ntraj2 = ef.compute_h_masks(traj2)\ntraj2.register_epis()\n\ntraj1.add_traj(traj2)\n\nresult_dict = ppo_clip.train(traj=traj1, pol=pol, vf=vf, clip_param=args.clip_param,\n optim_pol=optim_pol, optim_vf=optim_vf, epoch=args.epoch_per_iter, batch_size=args.batch_size, max_grad_norm=args.max_grad_norm)\n```\n\nYou can see the full example code [here](https://github.com/DeepX-inc/machina/blob/master/example/run_mixed_env.py).\n\n#### 2 Combination of Off-policy and On-policy algorithms\nDeepRL algorithms can be roughly divided into 2 types.\nOn-policy and Off-policy algorithms.\nOn-policy algorithms use only current episodes for updating policy or some value functions.\nOn the other hand, Off-policy algorithms use whole episodes for updating policy or some value functions.\nOn-policy algorithms are more stable but need many episodeos.\nOff-policy algorithms are sample efficient but unstable.\nSome algorithms like [Q-Prop](https://arxiv.org/abs/1611.02247) are a combination of On-policy and Off-policy algorithms.\nThis is an example of the combination using [ppo](https://arxiv.org/abs/1707.06347) and [sac](https://arxiv.org/abs/1801.01290).\n\n```\nepis = sampler.sample(pol, max_steps=args.max_steps_per_iter)\n\non_traj = Traj()\non_traj.add_epis(epis)\n\non_traj = ef.add_next_obs(on_traj)\non_traj = ef.compute_vs(on_traj, vf)\non_traj = ef.compute_rets(on_traj, args.gamma)\non_traj = ef.compute_advs(on_traj, args.gamma, args.lam)\non_traj = ef.centerize_advs(on_traj)\non_traj = ef.compute_h_masks(on_traj)\non_traj.register_epis()\n\nresult_dict1 = ppo_clip.train(traj=on_traj, pol=pol, vf=vf, clip_param=args.clip_param,\n optim_pol=optim_pol, optim_vf=optim_vf, epoch=args.epoch_per_iter, batch_size=args.batch_size, max_grad_norm=args.max_grad_norm)\n\ntotal_epi += on_traj.num_epi\nstep = on_traj.num_step\ntotal_step += step\n\noff_traj.add_traj(on_traj)\n\nresult_dict2 = sac.train(\n off_traj,\n pol, qf, targ_qf, log_alpha,\n optim_pol, optim_qf, optim_alpha,\n 100, args.batch_size,\n args.tau, args.gamma, args.sampling,\n)\n```\n\nYou can see the full example code [here](https://github.com/DeepX-inc/machina/blob/master/example/run_ppo_sac.py).\n\nTo obtain this composability, machina's sampling method is deliberatly restricted to be episode-based because episode-based sampling is suitable for real-world environments. Moreover, some algorithms which update networks step by step (e.g. DQN, DDPG) are not reproduced in machina.\n\n## Implemented Algorithms\nThe algorithms classes described below are useful for real-world Deep Reinforcement Learning.\n\n\n \n \n \n \n\n\n \n \n \n \n\n\n \n \n\n\n \n \n \n \n\n\n \n \n\n\n \n \n\n\n \n \n\n\n \n \n \n \n\n\n \n \n \n \n\n\n \n \n\n\n \n \n\n\n \n \n \n \n\n
CLASS MERIT ALGORITHM SUPPORT
Model-Free On-Policy RL stable policy learningProximal Policy OptimizationRNN
Trust Region Policy OptimizationRNN
Model-Free Off-Policy RL high generalizationSoft Actor CriticR2D2
QT-Opt
Deep Deterministic Policy Gradient
Stochastic Value Gradient
Model-Based RL high sample efficiencyModel Predictive ControlRNN
Imitation Learningremoval of the need for reward designingBehavior Cloning
Generative Adversarial Imitation LearningRNN
Adversatial Inverse Reinforcement Learning
Policy Distillationreduction of necessary computation resources during deployment of policyTeacher Distillation
\n* R2D2 like burn in and saving hidden states methods\n\n## Installation\n\nmachina supports Ubuntu, Python3.5, 3.6, 3.7 and PyTorch1.0.0+.\n\nmachina can be directly installed using PyPI.\n```\npip install machina-rl\n```\n\nOr you can install machina from source code.\n```\ngit clone https://github.com/DeepX-inc/machina\ncd machina\npython setup.py install\n```\n\n## Quick Start\nYou can start machina by checking out this [quickstart](https://github.com/DeepX-inc/machina/tree/master/example/quickstart).\n\nMoreover, you can also check already implemented algorithms in [examples](https://github.com/DeepX-inc/machina/tree/master/example).\n\n## Documentation\nYou can check the [documentation](https://docs.machina-rl.org).\n\n## Web Page\nYou can check machina's [web page](https://machina-rl.org/).\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/DeepX-inc/machina", "keywords": "", "license": "MIT License", "maintainer": "", "maintainer_email": "", "name": "machina-rl", "package_url": "https://pypi.org/project/machina-rl/", "platform": "", "project_url": "https://pypi.org/project/machina-rl/", "project_urls": { "Homepage": "https://github.com/DeepX-inc/machina" }, "release_url": "https://pypi.org/project/machina-rl/0.2.1/", "requires_dist": [ "cached-property", "torch (>=1.0.1)", "joblib (>=0.11)", "cloudpickle", "redis", "gym (==0.10.5)", "numpy (>=1.13.3)", "terminaltables", "pandas" ], "requires_python": "", "summary": "machina is a library for a deep reinforcement learning.", "version": "0.2.1" }, "last_serial": 5157916, "releases": { "0.2.0": [ { "comment_text": "", "digests": { "md5": "b08c4c230cfd07a26b3785e41171d14c", "sha256": "9c6d9e856d8b2e1e77de55ccc03035b915bbc2161230d3a3b240c32fcc03eb61" }, "downloads": -1, "filename": "machina_rl-0.2.0-py3-none-any.whl", "has_sig": false, "md5_digest": "b08c4c230cfd07a26b3785e41171d14c", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 83915, "upload_time": "2019-03-17T12:54:00", "url": "https://files.pythonhosted.org/packages/24/ac/607fa242a914c72074a92318651be26d03aa16047934707bb538e4933b8a/machina_rl-0.2.0-py3-none-any.whl" } ], "0.2.1": [ { "comment_text": "", "digests": { "md5": "254e511b6e2f2f4f4e4df7ff997f6914", "sha256": "29090ff37c8a3950a949feb44aff1518872bb891db21578940942260b014496c" }, "downloads": -1, "filename": "machina_rl-0.2.1-py3-none-any.whl", "has_sig": false, "md5_digest": "254e511b6e2f2f4f4e4df7ff997f6914", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 91336, "upload_time": "2019-04-18T02:07:15", "url": "https://files.pythonhosted.org/packages/f3/43/1ec89b42a5983475cd3d74fa5882847764c586972f62ce7b396b117e3125/machina_rl-0.2.1-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "254e511b6e2f2f4f4e4df7ff997f6914", "sha256": "29090ff37c8a3950a949feb44aff1518872bb891db21578940942260b014496c" }, "downloads": -1, "filename": "machina_rl-0.2.1-py3-none-any.whl", "has_sig": false, "md5_digest": "254e511b6e2f2f4f4e4df7ff997f6914", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 91336, "upload_time": "2019-04-18T02:07:15", "url": "https://files.pythonhosted.org/packages/f3/43/1ec89b42a5983475cd3d74fa5882847764c586972f62ce7b396b117e3125/machina_rl-0.2.1-py3-none-any.whl" } ] }