{ "info": { "author": "telescopes", "author_email": "luyaoli88@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Science/Research", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: Python :: 3", "Topic :: Scientific/Engineering :: Mathematics", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "\n# Truncated_FAMD\n\n`Truncated_FAMD` is a library for prcessing [factor analysis of mixed data](https://www.wikiwand.com/en/Factor_analysis). This includes a variety of methods including [principal component analysis (PCA)](https://www.wikiwand.com/en/Principal_component_analysis) and [multiply correspondence analysis (MCA)](https://www.researchgate.net/publication/239542271_Multiple_Correspondence_Analysis). The goal is to provide an efficient and truncated implementation for each algorithm along with a scikit-learn API.\n\n## Table of contents\n\n- [Usage](##Usage)\n - [Guidelines](###Guidelines)\n - [Principal component analysis (PCA)](#principal-component-analysis-pca)\n - [Correspondence analysis (CA)](#correspondence-analysis-ca)\n - [Multiple correspondence analysis (MCA)](#multiple-correspondence-analysis-mca)\n - [Multiple factor analysis (MFA)](#multiple-factor-analysis-mfa)\n - [Factor analysis of mixed data (FAMD)](#factor-analysis-of-mixed-data-famd)\n- [Going faster](#going-faster)\n\n\n\n\n`Truncated_FAMD` doesn't have any extra dependencies apart from the usual suspects (`sklearn`, `pandas`, `numpy`) which are included with Anaconda.\n\n## Usage\n\n```python\nimport numpy as np; np.random.set_state(42) # This is for doctests reproducibility\n```\n\n### Guidelines\n`Truncated_FAMD` integrates the power of automatic selection of `svd_solver` according to structure of data and to `n_components` parameter the [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) 
possesses and capacity of processing sparse input that [sklearn.decomposition.TruncatedSVD](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html#sklearn.decomposition.TruncatedSVD) has and the support for processing data in a minibatch form,making it possible to processing big data,from [Incremental PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.IncrementalPCA.html#sklearn.decomposition.IncrementalPCA).Therefore,`Truncated_FAMD` is appropriate for processing big sparse input.\n\nEach base estimator(CA,PCA) provided by `Truncated_FAMD` extends scikit-learn's `(TransformerMixin,BaseEstimator)`.which means we could use directly `fit_transform`,and `(set_params,get_params)` methods.\n\nUnder the hood `Truncated_FAMD` uses `partial_fit` method on chunks of data fetched sequentially from the original data,only stores estimates of component and noise variances, in order update `explained_variance_ratio_` incrementally. This is why memory usage depends on the number of samples per batch, rather than the number of samples to be processed in the dataset.\n\n\nIn this package,inheritance relationship as shown below(A->B:A is superclass of B):\n\n- _BasePCA ->PCA ->CA -> MCA \n- _BasePCA ->PCA ->MFA -> FAMD\n\nYou are supposed to use each method depending on your situation:\n\n- All your variables are numeric: use principal component analysis (`PCA`)\n- You have a contingency table: use correspondence analysis (`CA`)\n- You have more than 2 variables and they are all categorical: use multiple correspondence analysis (`MCA`)\n- You have groups of categorical **or** numerical variables: use multiple factor analysis (`MFA`)\n- You have both categorical and numerical variables: use factor analysis of mixed data (`FAMD`)\n\nThe next subsections give an overview of each method along with usage information. 
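The minibatch mechanism described in the Guidelines above can be sketched with scikit-learn's `IncrementalPCA`, the estimator this behaviour is borrowed from. This is an illustrative sketch of the chunked `partial_fit` pattern, not `Truncated_FAMD`'s own code:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(42)
X = rng.rand(1000, 10)  # stand-in dataset; imagine it is too big to fit in memory

# Feed the estimator sequential chunks: memory usage depends on the
# batch size (200 rows here), not on the total number of samples.
ipca = IncrementalPCA(n_components=2, batch_size=200)
for start in range(0, X.shape[0], 200):
    ipca.partial_fit(X[start:start + 200])  # updates variance estimates per chunk

print(ipca.explained_variance_ratio_)  # one ratio per component, updated incrementally
```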
The following papers give a good overview of the field of factor analysis if you want to go deeper:\n\n- [A Tutorial on Principal Component Analysis](https://arxiv.org/pdf/1404.1100.pdf)\n- [Theory of Correspondence Analysis](http://statmath.wu.ac.at/courses/CAandRelMeth/caipA.pdf)\n- [Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions](https://arxiv.org/pdf/0909.4061.pdf)\n- [Computation of Multiple Correspondence Analysis, with code in R](https://core.ac.uk/download/pdf/6591520.pdf)\n- [Singular Value Decomposition Tutorial](https://davetang.org/file/Singular_Value_Decomposition_Tutorial.pdf)\n- [Multiple Factor Analysis](https://www.utdallas.edu/~herve/Abdi-MFA2007-pretty.pdf)\n\n###\tPrincipal-Component-Analysis: PCA\n\n**PCA**(standard_scaler=True, n_components=2, svd_solver='auto', whiten=False, copy=True,\n tol=None, iterated_power=2, batch_size=None, random_state=None):\n\n**Args:**\n- `standard_scaler` (bool): Whether to standardize each column or not.\n- `n_components` (int, float, None or string): The number of principal components to compute. See [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA)\n- `svd_solver` (string {`auto`, `full`, `arpack`, `randomized`}): Note that if the input data is sparse, `svd_solver` must be one of {`arpack`, `randomized`}. See [sklearn.decomposition.PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA)\n- `iterated_power` (int): The number of iterations used for computing the SVD when `svd_solver` == `randomized`.\n- `tol` (float >= 0, optional (default .0)): Tolerance for singular values computed by `svd_solver` == `arpack`.\n- `copy` (bool): Whether to perform the computations in place or not.\n- `batch_size` (int or None): The number of samples to use for each batch. Only used when calling `fit`. 
If `batch_size` is None, then `batch_size` is inferred from the data and set to 5 * n_features, to provide a balance between approximation accuracy and memory consumption.\n- `random_state` (int, RandomState instance or None, optional (default=None)): The seed of the pseudo-random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.\n\n**Returns:** ndarray of shape (M, k), where M is the number of samples and k is the number of components.\n\n**Examples:**\n```\n>>>import numpy as np\n>>>import pandas as pd\n>>>from Truncated_FAMD import PCA\n>>>X = pd.DataFrame(np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]]),columns=list('AB'))\n>>>pca = PCA(n_components=2)\n>>>pca.fit(X)\nPCA(batch_size=None, copy=True, iterated_power=2, n_components=2,\n random_state=None, standard_scaler=True, svd_solver='auto', tol=None,\n whiten=False)\n\n>>>print(pca.explained_variance_)\n[2.37837661 0.02162339]\n\n>>>print(pca.explained_variance_ratio_)\n[0.99099025 0.00900975]\n\n>>>print(pca.column_correlation(X)) # available once the estimator is fitted: the Pearson correlation between each original column and each component, NaN where p-value >= 0.05\n 0 1\nA -0.995485 NaN\nB -0.995485 NaN\n\n>>>print(pca.transform(X))\n[[ 0.82732684 -0.17267316]\n [ 1.15465367 0.15465367]\n [ 1.98198051 -0.01801949]\n [-0.82732684 0.17267316]\n [-1.15465367 -0.15465367]\n [-1.98198051 0.01801949]]\n>>>print(pca.fit_transform(X))\n[[ 0.82732684 -0.17267316]\n [ 1.15465367 0.15465367]\n [ 1.98198051 -0.01801949]\n [-0.82732684 0.17267316]\n [-1.15465367 -0.15465367]\n [-1.98198051 0.01801949]]\n\n```\n###\tCorrespondence-Analysis: CA\nThe CA class inherits from the PCA class.\n\n**Examples:**\n```\n>>>import numpy as np\n>>>import pandas as pd\n>>>from Truncated_FAMD import CA\n>>>X = 
pd.DataFrame(data=np.random.randint(0,100,size=(10,4)),columns=list('ABCD'))\n>>>ca=CA(n_components=2,iterated_power=2)\n>>>ca.fit(X)\nCA(batch_size=None, copy=True, iterated_power=2, n_components=2,\n random_state=None, svd_solver='auto', tol=None)\n\n>>>print(ca.explained_variance_)\n[0.02811427 0.01346975]\n\n>>>print(ca.explained_variance_ratio_)\n[0.54703122 0.26208655]\n\n>>>print(ca.transform(X))\n[[-0.74276079 -0.24252589]\n [-0.02821543 0.27099114]\n [ 0.47655683 -0.53616059]\n [ 0.11871109 -0.10247506]\n [ 0.06085895 -0.15267951]\n [ 0.89766224 -0.19222481]\n [ 0.683192 0.67379238]\n [-0.66493196 -0.08886992]\n [-0.81955305 -0.08935231]\n [-0.21371233 0.48649714]]\n\n```\n\n###\tMultiple-Correspondence-Analysis: MCA\nThe MCA class inherits from the CA class.\n\n```\n>>>import numpy as np\n>>>import pandas as pd\n>>>from Truncated_FAMD import MCA\n>>>X=pd.DataFrame(np.random.choice(list('abcde'),size=(50,4),replace=True),columns=list('ABCD'))\n>>>print(X)\n A B C D\n0 e a a b\n1 b e c a\n2 e b a c\n3 e e b c\n4 b c d d\n5 c d a c\n6 a c e a\n7 d b d b\n8 e a e e\n9 c a e b\n...\n>>>mca=MCA(n_components=2,iterated_power=2)\n>>>mca.fit(X)\nMCA(batch_size=None, copy=True, iterated_power=2, n_components=2,\n random_state=None, svd_solver='auto', tol=None)\n\n>>>print(mca.column_correlation(X))\n 0 1\nA_a NaN NaN\nA_b -0.343282 -0.450314\nA_c NaN -0.525714\nA_d 0.606039 NaN\nA_e -0.482576 0.561833\nB_a NaN -0.303963\nB_b 0.622119 0.333704\nB_c NaN NaN\nB_d -0.396896 NaN\nB_e NaN 0.359454\nC_a NaN 0.586749\nC_b -0.478460 NaN\nC_c NaN -0.389922\nC_d NaN NaN\nC_e 0.624175 NaN\nD_a -0.579454 0.453070\nD_b NaN NaN\nD_c NaN -0.592007\nD_d 0.487821 NaN\nD_e NaN NaN\n\n>>>print(mca.explained_variance_)\n[0.0104482 0.00964206]\n\n>>>print(mca.explained_variance_ratio_)\n[0.00261205 0.00241051]\n\n>>>print(mca.transform(X))\n[[ 4.75897353e-01 1.18328365e+00]\n [-7.44935557e-01 8.80467767e-01]\n [ 8.75427551e-01 5.25160608e-01]\n [ 4.59454326e-01 -4.06521487e-01]\n [ 9.37769179e-01 -7.65735918e-01]\n [-8.34480014e-01 
9.82195557e-01]\n [-4.01418791e-03 -9.82014024e-01]\n [-9.98029713e-02 5.25646968e-01]\n [-4.70148309e-01 1.71969029e-03]\n [-8.88880685e-01 -3.95681877e-01]\n [ 1.73157292e+00 3.59962430e-01]\n [ 5.56384642e-01 -4.90593710e-01]\n [-8.34480014e-01 9.82195557e-01]\n [-4.66163214e-01 -1.04999945e+00]\n [-3.65088651e-01 5.85105538e-02]\n [ 1.02856977e+00 -5.33364595e-01]\n [-4.94864281e-01 -7.14484346e-01]\n [-5.47243985e-01 2.59249764e-01]\n [-1.20025145e-02 4.19830209e-01]\n [ 8.96709363e-01 1.29732542e-01]\n [-2.44747616e-01 -5.78512715e-01]\n...\n\n```\n###\tMultiple-Factor-Analysis: MFA\nThe MFA class inherits from the PCA class.\nSince the FAMD class inherits from MFA, and the only difference from its superclass is that FAMD determines the `groups` parameter itself, we skip this chapter and go directly to `FAMD`.\n\n###\tFactor-Analysis-of-Mixed-Data: FAMD\nThe `FAMD` class inherits from the `MFA` class, so you have access to all of `MFA`'s methods and properties.\n```\n>>>import numpy as np\n>>>import pandas as pd\n>>>from Truncated_FAMD import FAMD\n>>>X_n = pd.DataFrame(data=np.random.randint(0,100,size=(100,2)),columns=list('AB'))\n>>>X_c = pd.DataFrame(np.random.choice(list('abcde'),size=(100,4),replace=True),columns=list('CDEF'))\n>>>X=pd.concat([X_n,X_c],axis=1)\n>>>print(X)\n A B C D E F\n0 32 26 e b b c\n1 41 90 e c c e\n2 16 2 a b b c\n3 22 74 a d d a\n4 97 41 d b b a\n5 35 18 c d a a\n6 95 16 c d a a\n7 47 1 c e e c\n8 2 24 c b c e\n9 82 95 b c a a\n10 93 60 a d b e\n11 36 56 e c a a\n12 30 75 b c b e\n13 20 68 b e e a\n14 94 98 b c e c\n15 8 87 c b c c\n16 34 35 c a c b\n17 56 6 c a d d\n18 33 94 e a c d\n19 76 42 a c b c\n20 83 62 a e d c\n21 65 63 d e d d\n22 4 12 a d a a\n23 73 38 a e e b\n...\n\n\n>>>famd = FAMD(n_components=2)\n>>>famd.fit(X)\nFAMD(batch_size=None, copy=True, iterated_power=2, n_components=2,\n random_state=None, svd_solver='auto', tol=None)\n\n>>>print(famd.explained_variance_)\n[2.09216762 1.82830854]\n\n>>>print(famd.explained_variance_ratio_)\n[0.19719544 
0.17232563]\n\n>>> print(famd.column_correlation(X))\n 0 1\nC_a NaN -0.545526\nC_b NaN 0.329485\nC_c NaN NaN\nC_d NaN NaN\nC_e 0.233212 0.430677\nD_a 0.308279 NaN\nD_b NaN NaN\nD_c NaN 0.549633\nD_d 0.331463 -0.364919\nD_e -0.538894 -0.215123\nE_a 0.403095 -0.468114\nE_b NaN 0.203875\nE_c 0.199005 0.402092\nE_d -0.241785 NaN\nE_e -0.278375 NaN\nF_a 0.432976 -0.237807\nF_b 0.210518 NaN\nF_c -0.820450 NaN\nF_d NaN NaN\nF_e 0.240326 0.480436\n\n\n>>>print(famd.transform(X)) \n[[-10.63120157 9.03256199]\n [ 7.2594962 20.41834095]\n [-15.25371982 -2.37627147]\n [ 3.11348646 -14.51376937]\n [ 1.16397889 2.77352044]\n [ 14.62345611 -14.83857274]\n [ 14.62345611 -14.83857274]\n [-17.26519048 -6.58745196]\n [ 4.62106121 9.22232575]\n [ 9.60947121 2.49750339]\n [ 2.14508201 -4.2707566 ]\n [ 12.26245577 3.06721231]\n [ 1.2800622 17.53643902]\n [ -4.38821966 -2.6828045 ]\n [-12.4878278 8.57298192]\n...\n\n>>>print(famd.fit_transform(X))\n[[-10.63120157 9.03256199]\n [ 7.2594962 20.41834095]\n [-15.25371982 -2.37627147]\n [ 3.11348646 -14.51376937]\n [ 1.16397889 2.77352044]\n [ 14.62345611 -14.83857274]\n [ 14.62345611 -14.83857274]\n [-17.26519048 -6.58745196]\n [ 4.62106121 9.22232575]\n [ 9.60947121 2.49750339]\n [ 2.14508201 -4.2707566 ]\n [ 12.26245577 3.06721231]\n [ 1.2800622 17.53643902]\n [ -4.38821966 -2.6828045 ]\n [-12.4878278 8.57298192]\n...\n\n```\n\n\n\n\n\n\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Cauchemare/Truncated_FAMD", "keywords": "famd,factor analysis", "license": "", "maintainer": "", "maintainer_email": "", "name": "truncated-famd", "package_url": "https://pypi.org/project/truncated-famd/", "platform": "", "project_url": "https://pypi.org/project/truncated-famd/", "project_urls": { "Homepage": "https://github.com/Cauchemare/Truncated_FAMD" }, "release_url": 
"https://pypi.org/project/truncated-famd/0.0.1/", "requires_dist": [ "scikit-learn", "scipy", "pandas", "numpy" ], "requires_python": "", "summary": "Scalable Factor Analysis of Mixed and Sparse Data", "version": "0.0.1" }, "last_serial": 4818727, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "366b59d69ab3903f8fe0a38d2bd36c6b", "sha256": "21ced67c2f7145fa8ed2579e08a4d639ca26ca320b3fce4a5ac40aa0e9ab09e5" }, "downloads": -1, "filename": "truncated_famd-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "366b59d69ab3903f8fe0a38d2bd36c6b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 17238, "upload_time": "2019-02-14T04:44:20", "url": "https://files.pythonhosted.org/packages/24/1d/8141f7892274cbb7b8dfdf2298d09eef6912f5677e8c0efc2c499f266139/truncated_famd-0.0.1-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "366b59d69ab3903f8fe0a38d2bd36c6b", "sha256": "21ced67c2f7145fa8ed2579e08a4d639ca26ca320b3fce4a5ac40aa0e9ab09e5" }, "downloads": -1, "filename": "truncated_famd-0.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "366b59d69ab3903f8fe0a38d2bd36c6b", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 17238, "upload_time": "2019-02-14T04:44:20", "url": "https://files.pythonhosted.org/packages/24/1d/8141f7892274cbb7b8dfdf2298d09eef6912f5677e8c0efc2c499f266139/truncated_famd-0.0.1-py3-none-any.whl" } ] }