{ "info": { "author": "telescopes", "author_email": "luyaoli88@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Science/Research", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: Python :: 3", "Topic :: Scientific/Engineering :: Mathematics", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "\n# Truncated_FAMD\n\n`Truncated_FAMD` is a library for prcessing [factor analysis of mixed data](https://www.wikiwand.com/en/Factor_analysis). This includes a variety of methods including [principal component analysis (PCA)](https://www.wikiwand.com/en/Principal_component_analysis) and [multiply correspondence analysis (MCA)](https://www.researchgate.net/publication/239542271_Multiple_Correspondence_Analysis). The goal is to provide an efficient and truncated implementation for each algorithm along with a scikit-learn API.\n\n## Table of contents\n\n- [Usage](##Usage)\n - [Guidelines](###Guidelines)\n - [Principal component analysis (PCA)](#principal-component-analysis-pca)\n - [Correspondence analysis (CA)](#correspondence-analysis-ca)\n - [Multiple correspondence analysis (MCA)](#multiple-correspondence-analysis-mca)\n - [Multiple factor analysis (MFA)](#multiple-factor-analysis-mfa)\n - [Factor analysis of mixed data (FAMD)](#factor-analysis-of-mixed-data-famd)\n- [Going faster](#going-faster)\n\n\n\n\n`Truncated_FAMD` doesn't have any extra dependencies apart from the usual suspects (`sklearn`, `pandas`, `numpy`) which are included with Anaconda.\n\n## Usage\n\n```python\nimport numpy as np; np.random.set_state(42) # This is for doctests reproducibility\n```\n\n### Guidelines\n`Truncated_FAMD` integrates the power of automatic selection of `svd_solver` according to structure of data and to `n_components` parameter the [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) 
possesses and capacity of processing sparse input that [sklearn.decomposition.TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD) has and the support for processing data in a minibatch form,making it possible to processing big data,from [Incremental PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html#sklearn.decomposition.IncrementalPCA).Therefore,`Truncated_FAMD` is appropriate for processing big sparse input.\n\nEach base estimator(CA,PCA) provided by `Truncated_FAMD` extends scikit-learn's `(TransformerMixin,BaseEstimator)`.which means we could use directly `fit_transform`,and `(set_params,get_params)` methods.\n\nUnder the hood `Truncated_FAMD` uses `partial_fit` method on chunks of data fetched sequentially from the original data,only stores estimates of component and noise variances, in order update `explained_variance_ratio_` incrementally. This is why memory usage depends on the number of samples per batch, rather than the number of samples to be processed in the dataset.\n\n\nIn this package,inheritance relationship as shown below(A->B:A is superclass of B):\n\n- _BasePCA ->PCA ->CA -> MCA \n- _BasePCA ->PCA ->MFA -> FAMD\n\nYou are supposed to use each method depending on your situation:\n\n- All your variables are numeric: use principal component analysis (`PCA`)\n- You have a contingency table: use correspondence analysis (`CA`)\n- You have more than 2 variables and they are all categorical: use multiple correspondence analysis (`MCA`)\n- You have groups of categorical **or** numerical variables: use multiple factor analysis (`MFA`)\n- You have both categorical and numerical variables: use factor analysis of mixed data (`FAMD`)\n\nThe next subsections give an overview of each method along with usage information. 
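The minibatch mechanism described in the Guidelines above can be sketched with scikit-learn's `IncrementalPCA`, the estimator this behaviour is borrowed from. This is an illustrative sketch of the chunked `partial_fit` pattern, not `Truncated_FAMD`'s own code:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(42)
X = rng.rand(1000, 10)  # stand-in dataset; imagine it is too big to fit in memory

# Feed the estimator sequential chunks: memory usage depends on the
# batch size (200 rows here), not on the total number of samples.
ipca = IncrementalPCA(n_components=2, batch_size=200)
for start in range(0, X.shape[0], 200):
    ipca.partial_fit(X[start:start + 200])  # updates variance estimates per chunk

print(ipca.explained_variance_ratio_)  # one ratio per component, updated incrementally
```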
The following papers give a good overview of the field of factor analysis if you want to go deeper:\n\n- [A Tutorial on Principal Component Analysis](https://arxiv.org/pdf/1404.1100.pdf)\n- [Theory of Correspondence Analysis](http://statmath.wu.ac.at/courses/CAandRelMeth/caipA.pdf)\n- [Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions](https://arxiv.org/pdf/0909.4061.pdf)\n- [Computation of Multiple Correspondence Analysis, with code in R](https://core.ac.uk/download/pdf/6591520.pdf)\n- [Singular Value Decomposition Tutorial](https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf)\n- [Multiple Factor Analysis](https://www.utdallas.edu/~herve/Abdi-MFA2007-pretty.pdf)\n\n###\tPrincipal-Component-Analysis: PCA\n\n**PCA**(standard_scaler=True, n_components=2, svd_solver='auto', whiten=False, copy=True,\n tol=None, iterated_power=2, batch_size=None, random_state=None):\n\n**Args:**\n- `standard_scaler` (bool): Whether to standardize each column or not.\n- `n_components` (int, float, None or string): The number of principal components to compute. See [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA)\n- `svd_solver` (string {`auto`, `full`, `arpack`, `randomized`}): Note that if the input data is sparse, `svd_solver` must be one of {`arpack`, `randomized`}. See [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA)\n- `iterated_power` (int): The number of iterations used for computing the SVD when `svd_solver` == `randomized`.\n- `tol` (float >= 0, optional (default .0)): Tolerance for singular values computed by `svd_solver` == `arpack`.\n- `copy` (bool): Whether to perform the computations in place or not.\n- `batch_size` (int or None): The number of samples to use for each batch. Only used when calling `fit`. 
If `batch_size` is None, then `batch_size` is inferred from the data and set to 5 * n_features, to provide a balance between approximation accuracy and memory consumption.\n- `random_state` (int, RandomState instance or None, optional (default=None)): The seed of the pseudo-random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.\n\n**Returns:** ndarray of shape (M, k), where M is the number of samples and k is the number of components.\n\n**Examples:**\n```\n>>>import numpy as np\n>>>import pandas as pd\n>>>from Truncated_FAMD import PCA\n>>>X = pd.DataFrame(np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]),columns=list('AB'))\n>>>pca = PCA(n_components=2)\n>>>pca.fit(X)\nPCA(batch_size=None, copy=True, iterated_power=2, n_components=2,\n random_state=None, standard_scaler=True, svd_solver='auto', tol=None,\n whiten=False)\n\n>>>print(pca.explained_variance_)\n[2.37837661 0.02162339]\n\n>>>print(pca.explained_variance_ratio_)\n[0.99099025 0.00900975]\n\n>>>print(pca.column_correlation(X)) # available once the estimator is fitted: the Pearson correlation between each original column and each component, NaN where p-value >= 0.05\n 0 1\nA -0.995485 NaN\nB -0.995485 NaN\n\n>>>print(pca.transform(X))\n[[ 0.82732684 -0.17267316]\n [ 1.15465367 0.15465367]\n [ 1.98198051 -0.01801949]\n [-0.82732684 0.17267316]\n [-1.15465367 -0.15465367]\n [-1.98198051 0.01801949]]\n>>>print(pca.fit_transform(X))\n[[ 0.82732684 -0.17267316]\n [ 1.15465367 0.15465367]\n [ 1.98198051 -0.01801949]\n [-0.82732684 0.17267316]\n [-1.15465367 -0.15465367]\n [-1.98198051 0.01801949]]\n\n```\n###\tCorrespondence-Analysis: CA\nThe CA class inherits from the PCA class.\n\n**Examples:**\n```\n>>>import numpy as np\n>>>import pandas as pd\n>>>from Truncated_FAMD import CA\n>>>X = 
pd.DataFrame(data=np.random.randint(0,100,size=(10,4)),columns=list('ABCD'))\n>>>ca=CA(n_components=2,iterated_power=2)\n>>>ca.fit(X)\nCA(batch_size=None, copy=True, iterated_power=2, n_components=2,\n random_state=None, svd_solver='auto', tol=None)\n\n>>>print(ca.explained_variance_)\n[0.02811427 0.01346975]\n\n>>>print(ca.explained_variance_ratio_)\n[0.54703122 0.26208655]\n\n>>>print(ca.transform(X))\n[[-0.74276079 -0.24252589]\n [-0.02821543 0.27099114]\n [ 0.47655683 -0.53616059]\n [ 0.11871109 -0.10247506]\n [ 0.06085895 -0.15267951]\n [ 0.89766224 -0.19222481]\n [ 0.683192 0.67379238]\n [-0.66493196 -0.08886992]\n [-0.81955305 -0.08935231]\n [-0.21371233 0.48649714]]\n\n```\n\n###\tMultiple-Correspondence-Analysis: MCA\nThe MCA class inherits from the CA class.\n\n```\n>>>import numpy as np\n>>>import pandas as pd\n>>>from Truncated_FAMD import MCA\n>>>X=pd.DataFrame(np.random.choice(list('abcde'),size=(50,4),replace=True),columns=list('ABCD'))\n>>>print(X)\n A B C D\n0 e a a b\n1 b e c a\n2 e b a c\n3 e e b c\n4 b c d d\n5 c d a c\n6 a c e a\n7 d b d b\n8 e a e e\n9 c a e b\n...\n>>>mca=MCA(n_components=2,iterated_power=2)\n>>>mca.fit(X)\nMCA(batch_size=None, copy=True, iterated_power=2, n_components=2,\n random_state=None, svd_solver='auto', tol=None)\n\n>>>print(mca.column_correlation(X))\n 0 1\nA_a NaN NaN\nA_b -0.343282 -0.450314\nA_c NaN -0.525714\nA_d 0.606039 NaN\nA_e -0.482576 0.561833\nB_a NaN -0.303963\nB_b 0.622119 0.333704\nB_c NaN NaN\nB_d -0.396896 NaN\nB_e NaN 0.359454\nC_a NaN 0.586749\nC_b -0.478460 NaN\nC_c NaN -0.389922\nC_d NaN NaN\nC_e 0.624175 NaN\nD_a -0.579454 0.453070\nD_b NaN NaN\nD_c NaN -0.592007\nD_d 0.487821 NaN\nD_e NaN NaN\n\n>>>print(mca.explained_variance_)\n[0.0104482 0.00964206]\n\n>>>print(mca.explained_variance_ratio_)\n[0.00261205 0.00241051]\n\n>>>print(mca.transform(X))\n[[ 4.75897353e-01 1.18328365e+00]\n [-7.44935557e-01 8.80467767e-01]\n [ 8.75427551e-01 5.25160608e-01]\n [ 4.59454326e-01 -4.06521487e-01]\n [ 9.37769179e-01 -7.65735918e-01]\n [-8.34480014e-01 
9.82195557e-01]\n [-4.01418791e-03 -9.82014024e-01]\n [-9.98029713e-02 5.25646968e-01]\n [-4.70148309e-01 1.71969029e-03]\n [-8.88880685e-01 -3.95681877e-01]\n [ 1.73157292e+00 3.59962430e-01]\n [ 5.56384642e-01 -4.90593710e-01]\n [-8.34480014e-01 9.82195557e-01]\n [-4.66163214e-01 -1.04999945e+00]\n [-3.65088651e-01 5.85105538e-02]\n [ 1.02856977e+00 -5.33364595e-01]\n [-4.94864281e-01 -7.14484346e-01]\n [-5.47243985e-01 2.59249764e-01]\n [-1.20025145e-02 4.19830209e-01]\n [ 8.96709363e-01 1.29732542e-01]\n [-2.44747616e-01 -5.78512715e-01]\n...\n\n```\n###\tMultiple-Factor-Analysis: MFA\nThe MFA class inherits from the PCA class.\nSince the FAMD class inherits from MFA, and the only difference from its superclass is that FAMD determines the `groups` parameter itself, we skip this chapter and go directly to `FAMD`.\n\n###\tFactor-Analysis-of-Mixed-Data: FAMD\nThe `FAMD` class inherits from the `MFA` class, so you have access to all of `MFA`'s methods and properties.\n```\n>>>import numpy as np\n>>>import pandas as pd\n>>>from Truncated_FAMD import FAMD\n>>>X_n = pd.DataFrame(data=np.random.randint(0,100,size=(100,2)),columns=list('AB'))\n>>>X_c = pd.DataFrame(np.random.choice(list('abcde'),size=(100,4),replace=True),columns=list('CDEF'))\n>>>X=pd.concat([X_n,X_c],axis=1)\n>>>print(X)\n A B C D E F\n0 32 26 e b b c\n1 41 90 e c c e\n2 16 2 a b b c\n3 22 74 a d d a\n4 97 41 d b b a\n5 35 18 c d a a\n6 95 16 c d a a\n7 47 1 c e e c\n8 2 24 c b c e\n9 82 95 b c a a\n10 93 60 a d b e\n11 36 56 e c a a\n12 30 75 b c b e\n13 20 68 b e e a\n14 94 98 b c e c\n15 8 87 c b c c\n16 34 35 c a c b\n17 56 6 c a d d\n18 33 94 e a c d\n19 76 42 a c b c\n20 83 62 a e d c\n21 65 63 d e d d\n22 4 12 a d a a\n23 73 38 a e e b\n...\n\n\n>>>famd = FAMD(n_components=2)\n>>>famd.fit(X)\nFAMD(batch_size=None, copy=True, iterated_power=2, n_components=2,\n random_state=None, svd_solver='auto', tol=None)\n\n>>>print(famd.explained_variance_)\n[2.09216762 1.82830854]\n\n>>>print(famd.explained_variance_ratio_)\n[0.19719544 
0.17232563]\n\n>>> print(famd.column_correlation(X))\n 0 1\nC_a NaN -0.545526\nC_b NaN 0.329485\nC_c NaN NaN\nC_d NaN NaN\nC_e 0.233212 0.430677\nD_a 0.308279 NaN\nD_b NaN NaN\nD_c NaN 0.549633\nD_d 0.331463 -0.364919\nD_e -0.538894 -0.215123\nE_a 0.403095 -0.468114\nE_b NaN 0.203875\nE_c 0.199005 0.402092\nE_d -0.241785 NaN\nE_e -0.278375 NaN\nF_a 0.432976 -0.237807\nF_b 0.210518 NaN\nF_c -0.820450 NaN\nF_d NaN NaN\nF_e 0.240326 0.480436\n\n\n>>>print(famd.transform(X)) \n[[-10.63120157 9.03256199]\n [ 7.2594962 20.41834095]\n [-15.25371982 -2.37627147]\n [ 3.11348646 -14.51376937]\n [ 1.16397889 2.77352044]\n [ 14.62345611 -14.83857274]\n [ 14.62345611 -14.83857274]\n [-17.26519048 -6.58745196]\n [ 4.62106121 9.22232575]\n [ 9.60947121 2.49750339]\n [ 2.14508201 -4.2707566 ]\n [ 12.26245577 3.06721231]\n [ 1.2800622 17.53643902]\n [ -4.38821966 -2.6828045 ]\n [-12.4878278 8.57298192]\n...\n\n>>>print(famd.fit_transform(X))\n[[-10.63120157 9.03256199]\n [ 7.2594962 20.41834095]\n [-15.25371982 -2.37627147]\n [ 3.11348646 -14.51376937]\n [ 1.16397889 2.77352044]\n [ 14.62345611 -14.83857274]\n [ 14.62345611 -14.83857274]\n [-17.26519048 -6.58745196]\n [ 4.62106121 9.22232575]\n [ 9.60947121 2.49750339]\n [ 2.14508201 -4.2707566 ]\n [ 12.26245577 3.06721231]\n [ 1.2800622 17.53643902]\n [ -4.38821966 -2.6828045 ]\n [-12.4878278 8.57298192]\n...\n\n```\n\n\n\n\n\n\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Cauchemare/Truncated_FAMD", "keywords": "famd,factor analysis", "license": "", "maintainer": "", "maintainer_email": "", "name": "truncated-famd", "package_url": "https://pypi.org/project/truncated-famd/", "platform": "", "project_url": "https://pypi.org/project/truncated-famd/", "project_urls": { "Homepage": "https://github.com/Cauchemare/Truncated_FAMD" }, "release_url": 
"https://pypi.org/project/truncated-famd/0.0.1/", "requires_dist": [ "scikit-learn", "scipy", "pandas", "numpy" ], "requires_python": "", "summary": "Scalable Factor Analysis of Mixed and Sparse Data", "version": "0.0.1" }, "last_serial": 4818727, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "366b59d69ab3903f8fe0a38d2bd36c6b", "sha256": "21ced67c2f7145fa8ed2579e08a4d639ca26ca320b3fce4a5ac40aa0e9ab09e5" }, "downloads": -1, "filename": "truncated_famd-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "366b59d69ab3903f8fe0a38d2bd36c6b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 17238, "upload_time": "2019-02-14T04:44:20", "url": "https://files.pythonhosted.org/packages/24/1d/8141f7892274cbb7b8dfdf2298d09eef6912f5677e8c0efc2c499f266139/truncated_famd-0.0.1-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "366b59d69ab3903f8fe0a38d2bd36c6b", "sha256": "21ced67c2f7145fa8ed2579e08a4d639ca26ca320b3fce4a5ac40aa0e9ab09e5" }, "downloads": -1, "filename": "truncated_famd-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "366b59d69ab3903f8fe0a38d2bd36c6b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 17238, "upload_time": "2019-02-14T04:44:20", "url": "https://files.pythonhosted.org/packages/24/1d/8141f7892274cbb7b8dfdf2298d09eef6912f5677e8c0efc2c499f266139/truncated_famd-0.0.1-py3-none-any.whl" } ] }