{ "info": { "author": "Johannes Haug", "author_email": "johannes-christian.haug@uni-tuebingen.de", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "\n\n[![PyPI version](https://badge.fury.io/py/pystreamfs.svg)](https://badge.fury.io/py/pystreamfs)\n\n*pystreamfs* is an Open-Source Python package that allows for quick and simple comparison of feature selection algorithms on a simulated data stream.\n\nThe user can simulate data streams with varying batch size on any dataset provided as a numpy.ndarray. \n*pystreamfs* applies a specified feature selection algorithm to every batch and computes performance metrics for the\nselected feature set at every time *t*. *pystreamfs* can also be used to plot the performance metrics.\n\n*pystreamfs* comes with 5 built-in feature selection algorithms for data streams. Additionally, you can find 3 datasets ready for download on [Github](https://github.com/haugjo/pystreamfs). \n*pystreamfs* has a modular structure and is thus easily expandable (see Section 2.5 for more information).\n\n**License:** MIT License
\n**Upcoming changes:**\n* ability to simulate feature streams\n* ability to generate artificial data streams\n* ability to test multiple feature selection algorithms at once\n\n## 1 Getting started\n### 1.1 Prerequesites\nThe following Python modules need to be installed (older versions than indicated might also work):\n* python >= 3.7.1\n* numpy >= 1.15.4\n* psutil >= 5.4.7\n* matplotlib >= 2.2.3\n* scikit-learn >= 0.20.1\n* ... any modules required by the feature selection algorithm \n\n### 1.2 How to get *pystreamfs*\nUsing pip: ``pip install pystreamfs``
\n**OR** Download and unpack the .tar.gz file in ``/dist``. Navigate to the unpacked folder and execute\n``python setup.py install``.\n\n## 2 The Package \n### 2.1 Files\nThe main module is ``/pystreamfs/pystreamfs.py``. Feature selection algorithms are stored in ``/algorithms``.\n\n### 2.2 Main module: ``pystreamfs.py``\n``pystreamfs.py`` provides the following functions:\n* ``X, Y = prepare_data(data, target, shuffle)``\n * **Description**: Prepare the data set for the simulation of a data stream: randomly sort the rows of a data matrix and extract the target variable ``Y`` and the features ``X``\n * **Input**:\n * ``data``: numpy.ndarray, data set\n * ``target``: int, index of the target variable\n * ``shuffle``: bool, if ``True`` sort samples randomly\n * **Output**:\n * ``X``: numpy.ndarray, features\n * ``Y``: numpy.ndarray, target variable\n* ``stats = simulate_stream(X, Y, fs_algorithm, model, param)``\n * **Description**: Iterate over all datapoints in the dataset to simulate a data stream. \n Perform given feature selection algorithm and return performance statistics.\n * **Input**:\n * ``X``: numpy.ndarray, this is the ``X`` returned by ``prepare_data()``\n * ``Y``: numpy.ndarray, this is the ``Y`` returned by ``prepare_data()``\n * ``fs_algorithm``: function, feature selection algorithm\n * ``ml_model``: object, the machine learning model to use for the computation of the accuracy score \n (remark on KNN: number of neighbours has to be greater or equal to batch size)\n * ``param``: dict, includes:\n * ``num_features``: integer, the number of features you want returned\n * ``batch_size``: integer, the number of instances processed in one iteration\n * ... additional algorithm specific parameters\n * **Output**:\n * ``stats``: dict\n * ``features``: list of lists, set of selected features for every batch\n * ``time_avg``: float, average computation time for one execution of the feature selection\n * ``time_measures``: list, time measures for every batch\n * ``memory_avg``: float, average memory usage after one execution of the feature selection, uses ``psutil.Process(os.getpid()).memory_full_info().uss``\n * ``memory_measures``: list, memory measures for every batch\n * ``acc_avg``: float, average accuracy for classification with the selected feature set\n * ``acc_measures``: list, accuracy measures for every batch\n * ``fscr_avg``: float, average feature selection change rate (fscr) per time window. \n The fscr is the percentage of selected features that changes in *t* with respect to *t-1* (fscr=0 if all selected features remain the same, fscr=1 if all selected features change)\n * ``fscr_measures`` list, fscr measures for every batch\n* ``plt = plot_stats(stats, ftr_names, param, fs_name, model_name):``\n * **Description**: Plot the statistics for time, memory, fscr and selected features over all time windows.\n * **Input**:\n * ``stats``: dict (see ``stats`` of ``simulate_stream()``)\n * ``ftr_names``: numpy.ndarray, contains all feature names\n * ``param``: dict, parameters\n * ``fs_name``: string, name of feature selection algorithm\n * ``model_name``: string, name of machine learning model\n * **Output**:\n * ``plt``: pyplot object: statistic plots\n\n### 2.3 Built-in feature selection algorithms\n* Online Feature Selection (OFS) based on the Perceptron algorithm by Wang et al. (2013) - [link to paper](https://ieeexplore.ieee.org/abstract/document/6522405)\n* Unsupervised Feature Selection on Data Streams (FSDS) using matrix sketching by Huang et al. (2015) - [link to paper](https://dl.acm.org/citation.cfm?id=2806521)\n* Feature Selection based on Micro Cluster Nearest Neighbors by Hamoodi et al. (2018) - [link to paper](https://www.sciencedirect.com/science/article/abs/pii/S0950705118304039)\n* Extremal Feature Selection based on a Modified Balanced Winnow classifier by Carvalho et al. (2006) - [link to paper](https://dl.acm.org/citation.cfm?id=1150466)\n* CancelOut Feature Selection based on a Neural Network by Vadim Borisov ([Github](https://github.com/unnir/CancelOut))\n\n### 2.4 Downloadable datasets\nAll datasets are cleaned and normalized. The target variable of all datasets is moved to the first column.\n* German Credit Score ([link](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)))\n* Binary version of Human Activity Recognition ([link](https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones)).\n * The original HAR dataset has a multivariate target. For its binary version we defined the class \"WALKING\" as our positive class (label=1) and all other classes as the negative (non-walking) class. \n We combined the 1722 samples of the original \"WALKING\" class with a random sample of 3000 instances from all other classes.\n* Usenet ([link](http://www.liaad.up.pt/kdus/products/datasets-for-concept-drift))\n\n### 2.5 How to add a feature selection algorithm\nIf you want to use *pystreamfs* to test your own feature selection algorithm, you have to encapsulate your algorithm in a function\nwith the following format:\n```python\ndef your_fs_algorithm(X, Y, w, param):\n \"\"\"Your feature selection algorithm\n\n :param numpy.nparray X: current data batch\n :param numpy.nparray Y: labels of current batch\n :param numpy.nparray w: feature weights\n :param dict param: any parameters the algorithm requires\n :return: w (updated feature weights), param\n :rtype numpy.ndarray, dict\n \"\"\"\n\n ...do feature selection...\n\n return w, param\n```\nAfterwards you can import and test your feature selection algorithm in the same way as for any built-in algorithm (see the example).\n\n## 3. Example\n```python\nfrom pystreamfs import pystreamfs\nimport numpy as np\nimport pandas as pd\nfrom pystreamfs.algorithms import ofs\nfrom sklearn.neighbors import KNeighborsClassifier\n\n# Load a dataset\ndata = pd.read_csv('../datasets/har.csv')\nfeature_names = np.array(data.drop('target', 1).columns)\ndata = np.array(data)\n\n# Extract features and target variable\nX, Y = pystreamfs.prepare_data(data, 0, False)\n\n# Load a FS algorithm\nfs_algorithm = ofs.run_ofs\n\n# Define parameters\nparam = dict()\nparam['num_features'] = 5 # number of features to return\nparam['batch_size'] = 50 # batch size\n\n# Define ML model\nmodel = KNeighborsClassifier(n_jobs=-1, n_neighbors=5)\n\n# Data stream simulation\nstats = pystreamfs.simulate_stream(X, Y, fs_algorithm, model, param)\n\n# Plot statistics\npystreamfs.plot_stats(stats, feature_names, param, 'Online feature selection (OFS)', 'K Nearest Neighbor').show()\n```\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/haugjo/pystreamfs", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "pystreamfs", "package_url": "https://pypi.org/project/pystreamfs/", "platform": "", "project_url": "https://pypi.org/project/pystreamfs/", "project_urls": { "Homepage": "https://github.com/haugjo/pystreamfs" }, "release_url": "https://pypi.org/project/pystreamfs/0.0.6/", "requires_dist": null, "requires_python": "", "summary": "A Python package for feature selection on a simulated data stream", "version": "0.0.6" }, "last_serial": 4874726, "releases": { "0.0.2": [ { "comment_text": "", "digests": { "md5": "09e31dfdfd1d3dcf020d39592b36605a", "sha256": "46ad57fdfced15cdcb9200023d5dbac213194d26a6a769e08485331322d35f7e" }, "downloads": -1, "filename": "pystreamfs-0.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "09e31dfdfd1d3dcf020d39592b36605a", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 18586, "upload_time": "2019-02-19T14:35:00", "url": "https://files.pythonhosted.org/packages/46/07/839e11e7b825e3216614afd0b1de7f10d729d644ef1e1ca68ea9806e7831/pystreamfs-0.0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1071d81b1143826e108427cf822a06c1", "sha256": "fc7aa0ded97052716cd2fa377b799e556be8d7816cfc70318014c3cf1ff60753" }, "downloads": -1, "filename": "pystreamfs-0.0.2.tar.gz", "has_sig": false, "md5_digest": "1071d81b1143826e108427cf822a06c1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23124771, "upload_time": "2019-02-19T14:35:03", "url": "https://files.pythonhosted.org/packages/66/8f/6d258499361b14c70c273c9c6f6fc300f0e2abb1d5d238b9033ebe0f1284/pystreamfs-0.0.2.tar.gz" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "b1d65066c4e5420e0c23433e09fc4efa", "sha256": "37ecf98e8ee7a6d2537ec911c85f9460f93758784bd8cdc50f1c4908dbdc6907" }, "downloads": -1, "filename": "pystreamfs-0.0.3-py3-none-any.whl", "has_sig": false, "md5_digest": "b1d65066c4e5420e0c23433e09fc4efa", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 18607, "upload_time": "2019-02-20T10:44:47", "url": "https://files.pythonhosted.org/packages/53/a8/448ae3388c891e482b921b25c02af6aa6ae575f09088d3e1c0dc465157d8/pystreamfs-0.0.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "399da10f0721d11dbb90ba3ad5106b68", "sha256": "90435870290a90cc844fc3f8d3066c03c70606f04903e14e3e781ac2d8280712" }, "downloads": -1, "filename": "pystreamfs-0.0.3.tar.gz", "has_sig": false, "md5_digest": "399da10f0721d11dbb90ba3ad5106b68", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23135961, "upload_time": "2019-02-20T10:44:53", "url": "https://files.pythonhosted.org/packages/90/3e/30be7ae965eca2771430345802c051c926cc7c1010db4d887ebc08afb96e/pystreamfs-0.0.3.tar.gz" } ], "0.0.4": [ { "comment_text": "", "digests": { "md5": "4d0d960d54b9c1d191259cbcb1854cc7", "sha256": "20931fe94a7cb6d596e7eaf59ef7cc17c1ff7a6918ff3deb1fb3509f695d654d" }, "downloads": -1, "filename": "pystreamfs-0.0.4-py3-none-any.whl", "has_sig": false, "md5_digest": "4d0d960d54b9c1d191259cbcb1854cc7", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 19844, "upload_time": "2019-02-27T14:45:56", "url": "https://files.pythonhosted.org/packages/fd/3a/a91bf539659db1d5905e5e65b89dbcc25d721c9c3451046ff57debb5d07a/pystreamfs-0.0.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "604924c5c7f26fa0dfa54dab91cb4f47", "sha256": "a8b2e74add89a782b75e511a53f6cc9167e8844412cfab95879ba7ff0623fec3" }, "downloads": -1, "filename": "pystreamfs-0.0.4.tar.gz", "has_sig": false, "md5_digest": "604924c5c7f26fa0dfa54dab91cb4f47", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23136898, "upload_time": "2019-02-27T14:46:05", "url": "https://files.pythonhosted.org/packages/af/c2/07bc9855528e088502a0366f645b526754276d1d43f2f7c5f90e927296a1/pystreamfs-0.0.4.tar.gz" } ], "0.0.5": [ { "comment_text": "", "digests": { "md5": "05d1c6bc8824d3c273460a0a517c3d75", "sha256": "e2ae75e5d44924b701ea02713010ec0aab398aa0bf1a4210bd6d90f401773bdf" }, "downloads": -1, "filename": "pystreamfs-0.0.5-py3-none-any.whl", "has_sig": false, "md5_digest": "05d1c6bc8824d3c273460a0a517c3d75", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 19817, "upload_time": "2019-02-27T15:00:50", "url": "https://files.pythonhosted.org/packages/c9/f1/192427793ca94a5fc345817eb9b2994485d1e9dc20083453554aa5b1ac5a/pystreamfs-0.0.5-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d957c8951885467943f5cfb0155dffe1", "sha256": "6b4da5a5224b1e3feea63d7b6c51cdb474f4ee740370571666c13830e41c17b4" }, "downloads": -1, "filename": "pystreamfs-0.0.5.tar.gz", "has_sig": false, "md5_digest": "d957c8951885467943f5cfb0155dffe1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23136854, "upload_time": "2019-02-27T15:01:01", "url": "https://files.pythonhosted.org/packages/05/9d/5515eb198aceb806de72a1ba8c7b8544f921203dbb659323f7971223a572/pystreamfs-0.0.5.tar.gz" } ], "0.0.6": [ { "comment_text": "", "digests": { "md5": "4726b3ffa03b62b6e60a7e0742c5dee7", "sha256": "e1babb1c4b39941986e2a22178173cb14ddc4dd37d06222ec8af7cfc77b49dea" }, "downloads": -1, "filename": "pystreamfs-0.0.6-py3-none-any.whl", "has_sig": false, "md5_digest": "4726b3ffa03b62b6e60a7e0742c5dee7", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 19753, "upload_time": "2019-02-27T15:20:07", "url": "https://files.pythonhosted.org/packages/bd/03/58a07cd3f627696b4cb3c6d4e422f55b51a04844cce1f48bd4c3228c000c/pystreamfs-0.0.6-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a092f277310956116fe60b582a1b371a", "sha256": "ffc20878b884bbbd837bf6c52a3192729a42f45ea9199554b81704dfa8d7fca5" }, "downloads": -1, "filename": "pystreamfs-0.0.6.tar.gz", "has_sig": false, "md5_digest": "a092f277310956116fe60b582a1b371a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23136842, "upload_time": "2019-02-27T15:20:23", "url": "https://files.pythonhosted.org/packages/87/db/1fd8fa912632858b0319b7a4a16f84aaf7a6323065bd89a593350eb2781c/pystreamfs-0.0.6.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4726b3ffa03b62b6e60a7e0742c5dee7", "sha256": "e1babb1c4b39941986e2a22178173cb14ddc4dd37d06222ec8af7cfc77b49dea" }, "downloads": -1, "filename": "pystreamfs-0.0.6-py3-none-any.whl", "has_sig": false, "md5_digest": "4726b3ffa03b62b6e60a7e0742c5dee7", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 19753, "upload_time": "2019-02-27T15:20:07", "url": "https://files.pythonhosted.org/packages/bd/03/58a07cd3f627696b4cb3c6d4e422f55b51a04844cce1f48bd4c3228c000c/pystreamfs-0.0.6-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a092f277310956116fe60b582a1b371a", "sha256": "ffc20878b884bbbd837bf6c52a3192729a42f45ea9199554b81704dfa8d7fca5" }, "downloads": -1, "filename": "pystreamfs-0.0.6.tar.gz", "has_sig": false, "md5_digest": "a092f277310956116fe60b582a1b371a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23136842, "upload_time": "2019-02-27T15:20:23", "url": "https://files.pythonhosted.org/packages/87/db/1fd8fa912632858b0319b7a4a16f84aaf7a6323065bd89a593350eb2781c/pystreamfs-0.0.6.tar.gz" } ] }