{ "info": { "author": "Sven Serneels", "author_email": "svenserneels@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "Projection Pursuit Dimension Reduction\n======================================\n\nA scikit-learn compatible Python 3 package for Projection Pursuit Dimension Reduction. \nThis class implements a very general framweork for projection pursuit, giving access to \nmethods ranging from PP-PCA to CAPI generalized betas. \nThe package uses the grid algorithm\\[1\\], the most numerically stable and accurate PP algorithm. \n\nDescription\n-----------\n\nProjection pursuit (PP) provides a very general framework for dimension reduction and regression. The\n`ppdire` package provides a framework to calculate PP estimates based on a wide variety of projection \nindices. \n\nWhile the package will also work with user-defined projection indices, a set of projection indices are \nincluded into the package as two ancillary classes: \n- `dicomo` for (co-)moment statistics \n- `capi` specifically for analyzing financial market returns based on a linear combination of co-moments \\[2\\] \n\nWhen using the `dicomo` class as a plugin, several well-known multivariate dimension reduction techniques \nare accessible, as well as robust alternatives thereto. For more details, see the Example below. \n\nRemarks: \n- all the methods contained in this package have been designed for continuous data. They do not work correctly for categorical or textual data.\n- this package focuses on projection pursuit dimension reduction. Regression methods that involve a dimension reduction step can be accessed through it \n (e.g. PCR, PLS, RCR, ...), yet the package does not provide an implementation for projection pursuit regression (PPR). 
To access PPR, we refer to \n the `projection-pursuit` package, also distributed through PyPI. \n\nThe code follows the scikit-learn API, such that modules such as `GridSearchCV` can be applied to it seamlessly. \n\nThe repository contains:\n- The estimator (`ppdire.py`) \n- A class to estimate co-moments (`dicomo.py`)\n- A class for the co-moment analysis projection index (`capi.py`)\n- Ancillary functions for co-moment estimation (`_dicomo_utils.py`)\n\nHow to install\n--------------\nThe package is distributed through PyPI, so install through: \n\n pip install ppdire\n\n\nThe ppdire class\n================\n\nDependencies\n------------\n- From `sklearn.base`: BaseEstimator, TransformerMixin, RegressorMixin\n- From `sklearn.utils.metaestimators`: _BaseComposition\n- copy\n- From `scipy.linalg`: pinv2\n- numpy \n- From `statsmodels.regression.quantile_regression`: QuantReg\n- From `sklearn.utils.extmath`: svd_flip\n- From `sprm`: rm, robcent\n- From : MyException\n- warnings\n\n\nParameters\n----------\n- `projection_index`, function or class. `dicomo` and `capi` supplied in this\n package can both be used, but user-defined projection indices can \n be processed as well. \n- `pi_arguments`, dict. Dict of arguments to be passed on to `projection_index`. \n- `n_components`, int. Number of components to be estimated. \n- `trimming`, float. Trimming percentage for the projection index, to be entered as pct/100. \n- `alpha`, float. Continuum coefficient. Only relevant if `ppdire` is used to \n estimate (classical or robust) continuum regression. \n- `ndir`, int. Number of directions to calculate per iteration.\n- `maxiter`, int. Maximal number of iterations.\n- `regopt`, str. Regression option for the regression step y~T. Can be set\n to `'OLS'` (default), `'robust'` (will run `sprm.rm`) or `'quantile'` \n (`statsmodels.regression.quantreg`). \n- `center`, str. How to center the data. Options accepted are the options from\n `sprm.robcent`. \n- `center_data`, bool. \n- `scale_data`, bool. Note: if set to `False`, convergence to the correct optimum \n is not guaranteed. Will throw a warning. \n- `whiten_data`, bool. 
Typically used for ICA (kurtosis as PI).\n- `square_pi`, bool. Whether to square the projection index upon evaluation. \n- `compression`, bool. If `True`, an internal SVD compression step is used for \n flat data tables (p > n). Speeds up the calculations. \n- `copy`, bool. Whether to make a deep copy of the input data or not. \n- `verbose`, bool. If set to `True`, prints the iteration number. \n- `return_scaling_object`, bool.\nNote: several interesting parameters can also be passed to the `fit` method. \n\nAttributes\n----------\nAttributes always provided: \n- `x_weights_`: X block PPDIRE weighting vectors (usually denoted W)\n- `x_loadings_`: X block PPDIRE loading vectors (usually denoted P)\n- `x_scores_`: X block PPDIRE score vectors (usually denoted T)\n- `x_ev_`: X block explained variance per component\n- `x_Rweights_`: X block SIMPLS style weighting vectors (usually denoted R)\n- `x_loc_`: X block location estimate \n- `x_sca_`: X block scale estimate\n- `crit_values_`: vector of evaluated values for the optimization objective \n- `Maxobjf_`: vector containing the optimized objective per component. 
\n\nAttributes created when more than one block of data is provided: \n- `C_`: vector of inner relationship between response and latent variables block\n- `coef_`: vector of regression coefficients, if second data block provided \n- `intercept_`: intercept\n- `coef_scaled_`: vector of scaled regression coefficients (when scaling option used)\n- `intercept_scaled_`: scaled intercept\n- `residuals_`: vector of regression residuals\n- `y_ev_`: y block explained variance \n- `fitted_`: fitted response\n- `y_loc_`: y location estimate\n- `y_sca_`: y scale estimate\n\nAttributes created only when corresponding input flags are `True`:\n- `whitening_`: whitened data matrix (usually denoted K)\n- `mixing_`: mixing matrix estimate\n- `scaling_object_`: scaling object from `sprm.robcent`\n\n\nMethods\n--------\n- `fit(X, *args, **kwargs)`: fit model \n- `predict(X)`: make predictions based on fit \n- `transform(X)`: project X onto latent space \n- `getattr()`: get list of attributes\n- `setattr(*kwargs)`: set individual attribute of the ppdire object \n\nThe `fit` function takes several optional input arguments. These are flags that \ntypically would not need to be cross-validated. They are: \n- `y`, numpy vector or 1D matrix, either as `arg` directly or as `kwarg`\n- `h`, int. Overrides `n_components` for an individual call to `fit`. Use with caution. \n- `dmetric`, str. Distance metric used internally. Defaults to `'euclidean'`. \n- `mixing`, bool. Whether to return the mixing matrix. \n- Further parameters to the regression methods can be passed on here \n as additional `kwargs`. 
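To illustrate the principle behind the projection pursuit machinery these methods wrap, here is a minimal brute-force sketch of PP-PCA in plain NumPy: it scans random candidate directions for the one maximizing a projection index (variance). This is only a toy stand-in for the far more accurate grid algorithm \[1\] used by `ppdire`; all names below are illustrative and not part of the `ppdire` API.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data with one dominant direction of variance.
X = rng.standard_normal((200, 3)) * np.array([3.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)

# Brute-force projection pursuit: evaluate the projection index
# (here the variance, i.e. PP-PCA) over many random unit directions
# and keep the maximizer.
cands = rng.standard_normal((20000, 3))
cands /= np.linalg.norm(cands, axis=1, keepdims=True)
w_pp = cands[np.argmax(np.var(Xc @ cands.T, axis=0))]

# Reference answer: first PCA loading, from the SVD.
w_pca = np.linalg.svd(Xc, full_matrices=False)[2][0]

# The two directions agree up to sign (latent variable methods are
# sign indeterminate), so compare absolute values.
alignment = abs(float(w_pp @ w_pca))
print(alignment)  # close to 1
```

The same scan with a covariance or kurtosis index in place of the variance sketches how `ppdire` reaches PLS- or ICA-like estimates from one framework.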
\n\n\nAncillary functions \n-------------------\n- `dicomo` (class): (co-)moments \n- `capi` (class): co-moment analysis projection index \n\nExamples\n========\n\nLoad and Prepare Data\n---------------------\nTo run a toy example: \n- Source packages and data: \n\n # Load data\n import pandas as ps\n import numpy as np\n data = ps.read_csv(\"./data/Returns_shares.csv\")\n columns = data.columns[2:8]\n (n,p) = data.shape\n datav = np.matrix(data.values[:,2:8].astype('float64'))\n y = datav[:,0]\n X = datav[:,1:5]\n\n # Scale data\n from sprm import robcent\n centring = robcent()\n Xs = centring.fit(X)\n\nComparison of PP estimates to Scikit-Learn \n------------------------------------------\nLet us first run `ppdire` to produce slow, approximate PP estimates of \nPCA and PLS. This makes it easy to verify that the algorithm is correct. \n\n- Projection pursuit as a slow, approximate way to compute PCA. Compare:\n\n # PCA ex Scikit-Learn \n import sklearn.decomposition as skd\n skpca = skd.PCA(n_components=4)\n skpca.fit(Xs)\n skpca.components_.T # sklearn outputs loadings as rows! \n\n # PP-PCA \n from ppdire import dicomo, ppdire\n pppca = ppdire(projection_index = dicomo, pi_arguments = {'mode' : 'var'}, n_components=4)\n pppca.fit(X)\n pppca.x_loadings_\n\n- Likewise, projection pursuit as a slow, approximate way to compute PLS. Compare: \n\n # PLS ex Scikit-Learn \n import sklearn.cross_decomposition as skc\n skpls = skc.PLSRegression(n_components=4)\n skpls.fit(Xs,(y-np.mean(y))/np.std(y))\n skpls.x_scores_\n skpls.coef_ \n Xs*skpls.coef_*np.std(y) + np.mean(y) \n\n # PP-PLS \n pppls = ppdire(projection_index = dicomo, pi_arguments = {'mode' : 'cov'}, n_components=4, square_pi=True)\n pppls.fit(X,y)\n pppls.x_scores_\n pppls.coef_scaled_ # Column 4 should agree with skpls.coef_\n pppls.fitted_ \n\nRemark: dimension reduction techniques based on projection onto latent variables, \nsuch as PCA, PLS and ICA, are sign indeterminate with respect to the components. 
\nTherefore, signs of estimates by different algorithms can be opposite, yet the \nabsolute values should be identical up to algorithm precision. Here, this implies\nthat `sklearn` and `ppdire`'s `x_scores_` and `x_loadings_` can have opposite signs,\nyet the coefficients and fitted responses should be identical. \n\n\nRobust projection pursuit estimators\n------------------------------------\n\n- Robust PCA based on the Median Absolute Deviation (MAD) \[3\]: \n\n lcpca = ppdire(projection_index = dicomo, pi_arguments = {'mode' : 'var', 'center': 'median'}, n_components=4)\n lcpca.fit(X)\n lcpca.x_loadings_\n # To extend to Robust PCR, just add y \n lcpca.fit(X,y,ndir=1000,regopt='robust')\n\n- Robust Continuum Regression \[4\] based on trimmed (co)variance: \n\n rcr = ppdire(projection_index = dicomo, pi_arguments = {'mode' : 'continuum'}, n_components=4, trimming=.1, alpha=.5)\n rcr.fit(X,y=y,ndir=1000,regopt='robust')\n rcr.x_loadings_\n rcr.x_scores_\n rcr.coef_scaled_\n rcr.predict(X)\n\nRemark: for RCR, the continuum parameter `alpha` tunes the result from multiple \nregression (`alpha` -> 0) via PLS (`alpha` = 1) to PCR (`alpha` -> Inf). Of course, \nthe robust PLS option can also be accessed through `pi_arguments = {'mode' : 'cov'}, trimming=.1`. \n\n\nProjection pursuit generalized betas\n------------------------------------\n\nGeneralized betas are obtained as the projection pursuit weights using the \nco-moment analysis projection index (CAPI) \[2\]. \n\n from ppdire import capi \n est = ppdire(projection_index = capi, pi_arguments = {'max_degree' : 3,'projection_index': dicomo, 'scaling': False}, n_components=1, trimming=0,center_data=True,scale_data=True)\n est.fit(X,y=y,ndir=200)\n est.x_weights_\n # These data aren't the greatest illustration. Evaluating CAPI \n # projections makes more sense if y is a market index, e.g. 
SPX \n\n\nCross-validating through `scikit-learn` \n---------------------------------------\n\n from sklearn.model_selection import GridSearchCV\n rcr_cv = GridSearchCV(ppdire(projection_index=dicomo, \n pi_arguments = {'mode' : 'continuum'}), \n cv=10, \n param_grid={\"n_components\": [1, 2, 3], \n \"alpha\": np.arange(.1,3,.3).tolist(),\n \"trimming\": [0, .15]\n }\n )\n rcr_cv.fit(X[:2666],y[:2666]) \n rcr_cv.best_params_\n rcr_cv.predict(X[2666:])\n\n\nData compression\n----------------\nWhile `ppdire` is very flexible and can project according to a very wide variety \nof projection indices, it can be computationally demanding. For flat data tables,\na workaround has been built in. \n\n # Load flat data \n datan = ps.read_csv(\"./ppdire/data/Glass_df.csv\")\n X = datan.values[:,100:300]\n y = datan.values[:,2]\n\n # Now compare\n rcr = ppdire(projection_index = dicomo, \n pi_arguments = {'mode' : 'continuum'}, \n n_components=4, \n trimming=.1, \n alpha=.5, \n compression = False)\n rcr.fit(X,y=y,ndir=1000,regopt='robust')\n rcr.coef_\n\n rcr = ppdire(projection_index = dicomo, \n pi_arguments = {'mode' : 'continuum'}, \n n_components=4, \n trimming=.1, \n alpha=.5, \n compression = True)\n rcr.fit(X,y=y,ndir=1000,regopt='robust')\n rcr.coef_\n\nHowever, compression will not work properly if the data contain several low-scale \nvariables. In this example, it will not work for `X = datan.values[:,8:751]`. This \nwill throw a warning, and `ppdire` will continue without compression. \n\n\nCalling the projection indices independently \n--------------------------------------------\nBoth `dicomo` and `capi` can be useful as a consistent framework to call moments\nthemselves, or linear combinations of them. 
Let's extract univariate columns from\nthe data: \n\n # Prepare univariate data\n x = datav[:,1]\n y = datav[:,2]\n\nNow calculate some moments and compare them to `numpy`: \n\n # Variance \n covest = dicomo() \n # division by n\n covest.fit(x,biascorr=False)\n np.var(x)\n # division by n-1 \n covest.fit(x,biascorr=True)\n np.var(x)*n/(n-1)\n # But we can also trim the variance: \n covest.fit(x,biascorr=False,trimming=.1)\n\n # MAD \n import statsmodels.robust as srs\n covest.set_params(center='median')\n srs.mad(x)\n\n # 4th moment \n import scipy.stats as sps\n # if center is still median, reset it\n covest.set_params(center='mean')\n covest.fit(x,order=4)\n sps.moment(x,4)\n # Again, we can trim: \n covest.fit(x,order=4,trimming=.2)\n\n # Kurtosis \n covest.set_params(mode='kurt')\n sps.kurtosis(x,fisher=False,bias=False) \n # Note that in scipy, bias=False corrects for bias\n covest.fit(x,biascorr=True,Fisher=False)\n\n\nLikewise, co-moments: \n\n # Covariance \n covest.set_params(mode='com')\n data.iloc[:,2:8].cov() # pandas calculates with n-1 division\n covest.fit(x,y=y,biascorr=True)\n\n # M4 (4th co-moment)\n covest.set_params(mode='com')\n covest.fit(x,y=y,biascorr=True,order=4,option=1)\n\n # Co-kurtosis\n covest.set_params(mode='cok')\n covest.fit(x,y=y,biascorr=True,option=1)\n\n\nThese are just some of the options that can be explored in `dicomo`. \n\n\nReferences\n----------\n1. [Robust Multivariate Methods: The Projection Pursuit Approach](https://link.springer.com/chapter/10.1007/3-540-31314-1_32), Peter Filzmoser, Sven Serneels, Christophe Croux and Pierre J. Van Espen, in: From Data and Information Analysis to Knowledge Engineering,\n Spiliopoulou, M., Kruse, R., Borgelt, C., Nuernberger, A. and Gaul, W., eds., \n Springer Verlag, Berlin, Germany,\n 2006, pages 270--277.\n2. 
[Projection pursuit based generalized betas accounting for higher order co-moment effects in financial market analysis](https://arxiv.org/pdf/1908.00141.pdf), Sven Serneels, arXiv preprint 1908.00141, 2019. \n3. Robust principal components and dispersion matrices via projection pursuit, Chen, Z. and Li, G., Research Report, Department of Statistics, Harvard University, 1981.\n4. [Robust Continuum Regression](https://www.sciencedirect.com/science/article/abs/pii/S0169743904002667), Sven Serneels, Peter Filzmoser, Christophe Croux, Pierre J. Van Espen, Chemometrics and Intelligent Laboratory Systems, 76 (2005), 197-204.\n\n\n\nWork to do\n----------\n- optimize alignment to `sklearn`\n- align to some of `sprm` plotting functions\n- optimize for speed \n- extend to multivariate responses (open research topic !)\n- suggestions always welcome \n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/SvenSerneels/ppdire", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "ppdire", "package_url": "https://pypi.org/project/ppdire/", "platform": "", "project_url": "https://pypi.org/project/ppdire/", "project_urls": { "Homepage": "https://github.com/SvenSerneels/ppdire" }, "release_url": "https://pypi.org/project/ppdire/0.0.2/", "requires_dist": [ "numpy (>=1.5.0)", "scipy (>=0.8.0)", "matplotlib (>=2.2.0)", "scikit-learn (>=0.18.0)", "pandas (>=0.19.0)", "statsmodels (>=0.8.0)", "sprm (>=0.3.0)" ], "requires_python": "", "summary": "Projection Pursuit Dimension Reduction", "version": "0.0.2" }, "last_serial": 5636585, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "2452da6f23654d8cf813ea8725cff79e", "sha256": "eb327b6429b11d93533252531a92fe7ce2d1056d41ce9cfc6b9fcf353f708f8c" }, "downloads": -1, "filename": "ppdire-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": 
"2452da6f23654d8cf813ea8725cff79e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 34302, "upload_time": "2019-08-05T21:33:10", "url": "https://files.pythonhosted.org/packages/c0/1f/d0b5e3c8244ee0b31f27ee0ccedeea977671dc1607f390a9f5d38d32671d/ppdire-0.0.1-py3-none-any.whl" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "b6aec05367a35edb87cfd4dd21b8cb34", "sha256": "e64e450f66f3103d7a1adb7c3f8368cabe56f82696647a46f1b4c4f4204b52bb" }, "downloads": -1, "filename": "ppdire-0.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "b6aec05367a35edb87cfd4dd21b8cb34", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 34303, "upload_time": "2019-08-05T21:47:10", "url": "https://files.pythonhosted.org/packages/a8/94/173944e6a66067907ecdd9cd473a19b975e13768a198184fd0c4b8f2f69b/ppdire-0.0.2-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "b6aec05367a35edb87cfd4dd21b8cb34", "sha256": "e64e450f66f3103d7a1adb7c3f8368cabe56f82696647a46f1b4c4f4204b52bb" }, "downloads": -1, "filename": "ppdire-0.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "b6aec05367a35edb87cfd4dd21b8cb34", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 34303, "upload_time": "2019-08-05T21:47:10", "url": "https://files.pythonhosted.org/packages/a8/94/173944e6a66067907ecdd9cd473a19b975e13768a198184fd0c4b8f2f69b/ppdire-0.0.2-py3-none-any.whl" } ] }