{ "info": { "author": "Abubakar Abid", "author_email": "a12d@stanford.edu", "bugtrack_url": null, "classifiers": [], "description": "shuffled-stats\r\n===================\r\nA python library for performing inference on datasets with shuffled / unordered labels. \r\n\r\nThis library includes functions for generating datasets and performing linear regression on datasets whose labels (the \"y\") are shuffled with respect to the input features (the \"x\"). In other words, you should use this library to perform linear regression when you don't know which measurement comes from which data point.\r\n\r\nApplications include: experiments done on an entire population of particles at once (`flow cytometry `_), datasets shuffled to protect privacy (`medical records `_), measurements where the ordering is unclear (`signaling with identical tokens `_)\r\n\r\nInstallation\r\n--------------------\r\n\r\n.. code-block:: \r\n\r\n\t$ pip install shuffled_stats\r\n\r\n\r\nExamples (without noise)\r\n-------------------------------\r\nLet's start with some simple examples. We construct some random 2-dimensional input data, and the corresponding labels. We then apply shuffled regression using the :code:`shuffled_stats.linregress` function.\r\n\r\n.. code-block:: python\r\n\r\n\timport numpy as np, shuffled_stats\r\n\r\n\tnp.random.seed(1)\r\n\r\n\tx = np.random.normal(1, 1, (100,2)) #input features\r\n\ty = 3*x[:,0] - 7*x[:,1] #labels\r\n\r\n\tnp.random.shuffle(y) #in-place shuffling of the labels\r\n\r\n\tshuffled_stats.linregress(x,y) #performs shuffled linear regression\r\n\t>>> array([3., -7.])\r\n\r\n\r\nThe original weights, [3, -7], are recovered exactly. \r\n\r\nWe can do another example with defined data points:\r\n\r\n===== ===== =======\r\nx1 x2 y\r\n===== ===== =======\r\n1 2 3\r\n2 5 7\r\n-1 -2 -3\r\n5 5 10\r\n2 10 12\r\n===== ===== =======\r\n\r\nStandard linear regression would clearly reveal that 1*x1 + 1*x2 = y. Let's see what shuffled linear regression reveals with a permuted version of the labels:\r\n\r\n.. code-block:: python\r\n\r\n\tx = np.array([[1,2],[2,5],[-1, -2],[5,5],[2,10]])\r\n\ty = np.array([-3, 10, 7, 3, 12])\r\n\r\n\tshuffled_stats.linregress(x,y) #performs shuffled linear regression\r\n\t>>> array([1., 1.])\r\n\r\n\r\nAgain, the weights are recovered exactly.\r\n\r\nExamples (with noise)\r\n------------------------\r\n\r\n.. code-block:: python\r\n\t\r\n\tnp.random.seed(1) #for reproducibility\r\n\tx = np.random.normal(1, 1, (100,3)) #input features\r\n\tx[:,0] = 1 #making a bias/intercept column \r\n\ty = 4 + 2*x[:,1] - 3*x[:,2] #labels\r\n\r\n\ty = y + np.random.normal(0, .3, (100)) #adding Gaussian noise\r\n\r\n\tw = shuffled_stats.linregress(x,y)\r\n\tnp.round(w,2)\r\n\t>>> array([3.80, 2.09, -2.91])\r\n\r\nWe see that the recovered weights approximate the original weights (4, 2, -3), including the bias term.\r\n\r\nThe library includes a function, :code:`shuffled_stats.generate_dataset` to quickly generating datasets for testing. Here's an example:\r\n\r\n.. code-block:: python\r\n\t\r\n\tnp.random.seed(1) #for reproducibility\r\n\r\n\tx, y, w0 = shuffled_stats.generate_dataset(n=100, dim=3, bias=True, noise=0.3, mean=2)\r\n\tw = shuffled_stats.linregress(x,y)\r\n\r\n\tprint(np.round(w0,2))\r\n\t>>> array([2.07, -1.47, -0.83])\t\r\n\tprint(np.round(w,2))\r\n\t>>> array([1.79, 1.55, -0.63])\r\n\r\nThe weights are approximately recovered. We can quantify the relative error by using :code:`shuffled_stats.error_in_weights`.\r\n\r\n.. code-block:: python\r\n\t\r\n\tshuffled_stats.error_in_weights(w0,w)\r\n\t>>> 0.13010948373615697\t#13% error\r\n\r\nCan we improve performance by running three separate \"trials\" or \"replications\" of this experiment, each consisting of 100 unordered labels (within each trial, the ordering of the labels is unknown, but labels within a trial must correspond to data points from that trial)? We can test this easily with our library:\r\n\r\n.. code-block:: python\r\n\t\r\n\tnp.random.seed(1) #for reproducibility\r\n\tx, y, w0, groups = shuffled_stats.generate_dataset(n=300, dim=3, weights=[2.07, -1.47, -0.83], bias=True, noise=0.3, mean=2, n_groups=3) #fix weights to the same values as before\r\n\tw = shuffled_stats.linregress(x,y, groups=groups)\r\n\r\n\tprint(np.round(w,2))\r\n\t>>> array([2.09, -1.48, -0.83])\r\n\tshuffled_stats.error_in_weights(w0,w)\r\n\t>>> 0.0099665304764283077 #<1% error\r\n\r\nThe weights are a lot closer this time!\r\n\r\nThe library includes several different estimators (see paper for details). We can choose different estimators to compare results:\r\n\r\n.. code-block:: python\r\n\t\r\n\tnp.random.seed(1) #for reproducibility\r\n\tx, y, w0 = shuffled_stats.generate_dataset(n=100, dim=3, weights=[1,1,1], noise=0.3, mean=1) #the true weights are [1,1]\r\n\tw = shuffled_stats.linregress(x,y, estimator='SM')\r\n\tprint(np.round(w,2))\r\n\t>>> [0.98 0.98 1.03]\r\n\tw = shuffled_stats.linregress(x,y, estimator='LS')\r\n\tprint(np.round(w,2))\r\n\t>>> [0.99 0.92 1.09]\r\n\tw = shuffled_stats.linregress(x,y, estimator='EMD')\r\n\tprint(np.round(w,2))\r\n\t>>> [0.99 0.93 1.09]\r\n\r\n\r\nExamples (on datasets)\r\n---------------------------------\r\n\r\nFinally, we include methods to load datasets from .csv files (:code:`shuffled_stats.load_dataset_in_clusters`) so that the performance of shuffled regression can be compared to that of, for example, ordinary least-squares, on real-world data. Here's an example that uses the :code:`accidents.csv` dataset, from the UCI repository.\r\n\r\n.. code-block:: python\r\n\t\r\n\tfrom sklearn.linear_model import LinearRegression\r\n\t\r\n\tnp.random.seed(1) #for reproducibility\r\n\r\n\tx, y, groups = shuffled_stats.load_dataset_in_clusters('accidents.csv', normalize=True, n_clusters = 2)\r\n\r\n\tlr = LinearRegression(fit_intercept=False) #fit_intercept is false because x already includes a bias column\r\n\t\r\n\tprint(lr.fit(x,y).coef_)\r\n\t>>> [1.02859104, 0.03967381]\r\n\r\n\tprint(shuffled_stats.linregress(x,y))\r\n\t>>> [ 1.12348216 0.02539006]\r\n\r\nNot bad, if I do say so myself! Feel free to explore shuffled regression and reach out to me if you have any questions!", "description_content_type": null, "docs_url": null, "download_url": "https://github.com/abidlabs/shuffled-stats/archive/1.0.5.tar.gz", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/abidlabs/shuffled-stats", "keywords": "statistics,shuffled,permutation,regression", "license": "UNKNOWN", "maintainer": null, "maintainer_email": null, "name": "shuffled-stats", "package_url": "https://pypi.org/project/shuffled-stats/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/shuffled-stats/", "project_urls": { "Download": "https://github.com/abidlabs/shuffled-stats/archive/1.0.5.tar.gz", "Homepage": "https://github.com/abidlabs/shuffled-stats" }, "release_url": "https://pypi.org/project/shuffled-stats/1.0.6/", "requires_dist": null, "requires_python": null, "summary": "Python library for performing inference on datasets with shuffled labels", "version": "1.0.6" }, "last_serial": 2807851, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "688b326da0f2dff2bd930e2f59a7a719", "sha256": "da658e3531f7d2e904f3c96b79ece3696331595959ebfa03817bc2f7d31b06b0" }, "downloads": -1, "filename": "shuffled_stats-1.0.0.zip", "has_sig": false, "md5_digest": "688b326da0f2dff2bd930e2f59a7a719", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 879, "upload_time": "2017-04-16T04:46:14", "url": "https://files.pythonhosted.org/packages/c3/44/3998504654b67b2a2a58ac409597d2bfd639d32b1e44cf117809076ba2d9/shuffled_stats-1.0.0.zip" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "73a0f51c1d024c2992548e7483c88443", "sha256": "491f13d2886112096b56abd29fb91eb783c747adce6f74baf68f7c57ef6af9cf" }, "downloads": -1, "filename": "shuffled_stats-1.0.1.zip", "has_sig": false, "md5_digest": "73a0f51c1d024c2992548e7483c88443", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 879, "upload_time": "2017-04-16T05:16:48", "url": "https://files.pythonhosted.org/packages/0e/20/c0acd0d9664afcef6892cffdf3dc37731891ab60a7c92333e4eb6dc674ca/shuffled_stats-1.0.1.zip" } ], "1.0.3": [ { "comment_text": "", "digests": { "md5": "212e016d280fd2a6df2121e4e7df9ee4", "sha256": "b02feab6f57ff9acc53aa18d45916a0261199de39c6f62305d3fbff09e33b6bb" }, "downloads": -1, "filename": "shuffled_stats-1.0.3.zip", "has_sig": false, "md5_digest": "212e016d280fd2a6df2121e4e7df9ee4", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9168, "upload_time": "2017-04-17T03:57:55", "url": "https://files.pythonhosted.org/packages/a1/2a/caf378e47a4ea41485e62435e2feac230a649496192ef341c865e728dafc/shuffled_stats-1.0.3.zip" } ], "1.0.4": [ { "comment_text": "", "digests": { "md5": "1af4bcb24d4961adea0dd43f8439f97f", "sha256": "64020cbdb2fec3624faf236444de0e28cae9b11305bdd23cb50fe374a5f55c9b" }, "downloads": -1, "filename": "shuffled_stats-1.0.4.zip", "has_sig": false, "md5_digest": "1af4bcb24d4961adea0dd43f8439f97f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9191, "upload_time": "2017-04-17T04:20:59", "url": "https://files.pythonhosted.org/packages/9b/00/4c5d51055afac3417b4c48082f6e49b1e8bbd7e618c380097b9e4f602747/shuffled_stats-1.0.4.zip" } ], "1.0.5": [ { "comment_text": "", "digests": { "md5": "f0f87d050582f1f01b255ef6d38fab34", "sha256": "3dc7fb7415655acdebe2e4366c330e67c83748825c95eeab8f80f4e2f9d2d8a7" }, "downloads": -1, "filename": "shuffled_stats-1.0.5.zip", "has_sig": false, "md5_digest": "f0f87d050582f1f01b255ef6d38fab34", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9165, "upload_time": "2017-04-17T04:22:35", "url": "https://files.pythonhosted.org/packages/37/47/095c507ca4f531a8e8254f94dc2077d767201141a82cc4774deb423c49c2/shuffled_stats-1.0.5.zip" } ], "1.0.6": [ { "comment_text": "", "digests": { "md5": "b590ff597b72e8e6ef48473e9c3949ab", "sha256": "269d98e63dd41004ec994fd92045c588a56bcbd0e7a97f7bd491017ff8bf3601" }, "downloads": -1, "filename": "shuffled_stats-1.0.6.zip", "has_sig": false, "md5_digest": "b590ff597b72e8e6ef48473e9c3949ab", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9166, "upload_time": "2017-04-17T04:26:00", "url": "https://files.pythonhosted.org/packages/aa/7e/c929b871c45ef045bc96053315087812b15471c5827fcf2dc9db058fdb31/shuffled_stats-1.0.6.zip" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "b590ff597b72e8e6ef48473e9c3949ab", "sha256": "269d98e63dd41004ec994fd92045c588a56bcbd0e7a97f7bd491017ff8bf3601" }, "downloads": -1, "filename": "shuffled_stats-1.0.6.zip", "has_sig": false, "md5_digest": "b590ff597b72e8e6ef48473e9c3949ab", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9166, "upload_time": "2017-04-17T04:26:00", "url": "https://files.pythonhosted.org/packages/aa/7e/c929b871c45ef045bc96053315087812b15471c5827fcf2dc9db058fdb31/shuffled_stats-1.0.6.zip" } ] }