{ "info": { "author": "Jonathan Landy", "author_email": "jslandy@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# `linselect`\n\nA fast, flexible, and performant feature selection package for python.\n\n\n## Package in a nutshell\n### It's built on stepwise linear regression\nWhen passed data, the underlying algorithm seeks minimal variable subsets that\nproduce good linear fits to the targets. This approach to feature selection\nstrikes a competitive balance between performance, speed, and memory\nefficiency.\n\n### It has a simple API\nA simple API makes it easy to quickly rank a data set's features in terms of\ntheir added value to a given fit. This is demoed below, where we learn that we\ncan drop column `1` of `X` and still obtain a fit to `y` that captures 97.37%\nof its variance.\n\n\n```python\nfrom linselect import FwdSelect\nimport numpy as np\n\nX = np.array([[1,2,4], [1,1,2], [3,2,1], [10,2,2]])\ny = np.array([[1], [-1], [-1], [1]])\n\nselector = FwdSelect()\nselector.fit(X, y)\n\nprint selector.ordered_features\nprint selector.ordered_cods\n# [2, 0, 1]\n# [0.47368422, 0.97368419, 1.0]\n\nX_compressed = X[:, selector.ordered_features[:2]]\n```\n\n### It's fast\nA full sweep on a 1000 feature count data set runs in 10s on my laptop -- about\n**one million times faster** (seriously) than standard stepwise algorithms,\nwhich are effectively too slow to run at this scale. A 100 count feature set\nruns in 0.07s.\n\n```python\nfrom linselect import FwdSelect\nimport numpy as np\nimport time\n\nX = np.random.randn(5000, 1000)\ny = np.random.randn(5000, 1)\n\nselector = FwdSelect()\n\nt1 = time.time()\nselector.fit(X, y)\nt2 = time.time()\nprint t2 - t1\n# 9.87492\n```\n\n### Its scores reveal your effective feature count\nBy plotting fitted CODs against ranked feature count, one often learns that\nseemingly high-dimensional problems can actually be understood using only a\nminority of the available features. The plot below demonstrates this: A fit to one year of AAPL's stock fluctuations -- using just 3\nselected stocks as predictors -- nearly matches the performance of a 49-feature fit.\nThe 3-feature fit arguably provides more insight and is certainly easier to\nreason about (cf. tutorials for details).\n\n\n\n![apple stock plot](./docs/apple.png)\n\n\n### It's flexible\n`linselect` exposes multiple applications of the underlying algorithm. These\nallow for:\n* Forward, reverse, and general forward-reverse stepwise regression strategies.\n* Supervised applications aimed at a single target variable or simultaneous\n prediction of multiple target variables.\n* Unsupervised applications. The algorithm can be applied to identify minimal,\n representative subsets of an available column set. This provides a feature\n selection analog of PCA -- importantly, one that retains interpretability.\n\n\n\n## Under the hood\nFeature selection algorithms are used to seek minimal column / feature subsets\nthat capture the majority of the useful information contained within a data\nset. Removal of a selected subset's complement -- the relatively uninformative\nor redundant features -- can often result in a significant data compression and\nimproved interpretability.\n\nStepwise selection algorithms work by iteratively updating a model feature set,\none at a time [1]. For example, in a given step of a forward process, one\nconsiders all of the features that have not yet been added to the model, and\nthen identifies that which would improve the model the most. This is added,\nand the process is then repeated until all features have been selected. The\nfeatures that are added first in this way tend to be those that are predictive\nand also not redundant with those already included in the predictor set.\nRetaining only these first selected features therefore provides a convenient\nmethod for identifying minimal, informative feature subsets.\n\nIn general, identifying the optimal feature to add to a model in a given step\nrequires building and scoring each possible updated model variant. This\nresults in a slow process: If there are `n` features, `O(n^2)` models must be\nbuilt to carry out a full ranking. However, the process can be dramatically\nsped up in the case of linear regression -- thanks to\nsome linear algebra identities that allow one to efficiently update these\nmodels as features are either added or removed from their predictor sets [2,3].\nUsing these update rules, a full feature ranking can be carried out in roughly\nthe same amount of time that is needed to fit only a single model. For\n`n=1000`, this means we get an `O(n^2) = O(10^6)` speed up! `linselect` makes\nuse of these update rules -- first identified in [2] -- allowing for fast\nfeature selection sweeps.\n\n[1] Introduction to Statistical Learning by G. James, et al -- cf. chapter 6.\n\n[2] M. Efroymson. Multiple regression analysis. *Mathematical methods for\ndigital computers*, 1:191\u2013203, 1960.\n\n[3] J. Landy. Stepwise regression for unsupervised learning, 2017.\n[arxiv.1706.03265](https://arxiv.org/abs/1706.03265).\n\n\n## Classes, documentation, tests, license\n`linselect` contains three classes: `FwdSelect`, `RevSelect`, and `GenSelect`.\nAs the names imply, these support efficient forward, reverse, and general\nforward-reverse search protocols, respectively. Each can be used for both\nsupervised and unsupervised analyses.\n\nDocstrings and basic call examples are illustrated for each class in the\n[./docs](docs/) folder.\n\nAn FAQ and a running list of tutorials are available at\n[efavdb.com/linselect](http://www.efavdb.com/linselect).\n\nTests: From the root directory,\n\n```\npython setup.py test\n```\n\nThis project is licensed under the terms of the MIT license.\n\n## Installation\nThe package can be installed using pip, from pypi\n\n```\npip install linselect\n```\n\nor from github\n\n```\npip install git+git://github.com/efavdb/linselect.git\n```\n\n\n## Author\n\n**Jonathan Landy** - [EFavDB](http://www.efavdb.com)\n\nAcknowledgments: Special thanks to P. Callier, P. Spanoudes, and R. Zhou for\nproviding helpful feedback.\n", "description_content_type": "", "docs_url": null, "download_url": "https://github.com/efavdb/linselect/archive/0.1.1.tar.gz", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/EFavDB/linselect", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "linselect", "package_url": "https://pypi.org/project/linselect/", "platform": "", "project_url": "https://pypi.org/project/linselect/", "project_urls": { "Download": "https://github.com/efavdb/linselect/archive/0.1.1.tar.gz", "Homepage": "https://github.com/EFavDB/linselect" }, "release_url": "https://pypi.org/project/linselect/0.1.1/", "requires_dist": null, "requires_python": "", "summary": "A fast, flexible, and performant feature selection package for python.", "version": "0.1.1" }, "last_serial": 3925604, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "da1ad563234657265b93059aaf8735cf", "sha256": "79a49c68a590b7a9334555da32b22cb1c387546ce73d42ad11b670e617638ee0" }, "downloads": -1, "filename": "linselect-0.1.1.tar.gz", "has_sig": false, "md5_digest": "da1ad563234657265b93059aaf8735cf", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10734, "upload_time": "2018-06-03T15:32:37", "url": "https://files.pythonhosted.org/packages/aa/ae/54c8bb66d73ecefa4f947936235ea692ee201fb5319820e38d64e3a976c3/linselect-0.1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "da1ad563234657265b93059aaf8735cf", "sha256": "79a49c68a590b7a9334555da32b22cb1c387546ce73d42ad11b670e617638ee0" }, "downloads": -1, "filename": "linselect-0.1.1.tar.gz", "has_sig": false, "md5_digest": "da1ad563234657265b93059aaf8735cf", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10734, "upload_time": "2018-06-03T15:32:37", "url": "https://files.pythonhosted.org/packages/aa/ae/54c8bb66d73ecefa4f947936235ea692ee201fb5319820e38d64e3a976c3/linselect-0.1.1.tar.gz" } ] }