{ "info": { "author": "Technology Center, de Volksbank (NL)", "author_email": "tc@devolksbank.nl", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Software Development :: Build Tools" ], "description": "# dvb.datascience\n\nA Python [data science](https://en.wikipedia.org/wiki/Data_science) pipeline package\n\n![Travis (.org)](https://img.shields.io/travis/devolksbank/dvb.datascience.svg)\n\nAt [de Volksbank](https://www.devolksbank.nl/), our data scientists used to write a lot of overhead code for every experiment from scratch. To help them focus on the more exciting and value-adding parts of their jobs, we created this package.\nUsing this package you can easily create and reuse your pipeline code (consisting of frequently used data transformations and modeling steps) in experiments. 
\n\n![Sample Project Gif](docs/GIF_Sample_Project.gif)\n\nThis package has (among others) the following features:\n\n- Make easy-to-follow model pipelines of fits and transforms ([what exactly is a pipeline?](https://stackoverflow.com/questions/33091376/python-what-is-exactly-sklearn-pipeline-pipeline))\n- Make a graph of the pipeline\n- Output graphics, data, metadata, etc. from the pipeline steps\n- Data preprocessing such as filtering feature and observation outliers\n- Adding and merging intermediate DataFrames\n- Every pipe stores all intermediate output, so the output can be inspected later on\n- Transforms can store the outputs of previous runs, so the data from different transforms can be compared in one graph\n- Data is in [Pandas](https://pandas.pydata.org/) DataFrame format\n- Parameters for every pipe can be passed to the pipeline `fit_transform()` and `transform()` methods\n\n![logo](https://www.devolksbank.nl/upload/d201c68e-5401-4722-be68-6b201dbe8082_de_volksbank.png \"De Volksbank - The Netherlands\")\n\n\n## Scope\n\nThis package was developed specifically for fast prototyping with relatively small datasets on a single machine. Because it stores the intermediate output of each pipeline step, this package might underperform for bigger datasets (100,000 rows or more). 
\n\n## Getting Started\n\nThese instructions will get you a copy of the project up and running on your local machine for development and testing purposes.\nFor a more extensive overview of all the features, see the docs directory.\n\n### Prerequisites\n\nThis package requires [Python 3](https://www.python.org/) and has been developed and tested using Python 3.6.\n\n### Installing\n\nThe easiest way to install the library for use is:\n\n```bash\npip install dvb.datascience\n```\n\n#### Development\n\nTo set up a checkout of the repository for developing dvb.datascience, run the following in the checkout directory:\n\n```bash\npipenv install --dev\n```\n\nTo use dvb.datascience in your project:\n\n```bash\npipenv install dvb.datascience\n```\n\n#### Development - Anaconda\n\nIn the checkout directory, create and activate an environment and install the package:\n\n```bash\nconda create --name dvb.datascience\nconda activate dvb.datascience\npip install -e .\n```\n\nor use it via:\n\n```bash\npip install dvb.datascience\n```\n\n#### Jupyter table-of-contents\n\nWhen working with longer pipelines, the output in a Jupyter notebook can become quite long. It is advisable to install the\n[nbextensions](https://github.com/ipython-contrib/jupyter_contrib_nbextensions) for the [toc2](https://github.com/ipython-contrib/jupyter_contrib_nbextensions/tree/master/src/jupyter_contrib_nbextensions/nbextensions/toc2) extension:\n\n```bash\npip install jupyter_contrib_nbextensions\njupyter contrib nbextension install\n```\n\nNext, start a Jupyter notebook and navigate to [edit > nbextensions config](http://localhost:8888/nbextensions/) and enable the toc2 extension. 
Optionally, set other properties as well.\nAfter that, navigate back to your notebook (refresh) and click the icon in the menu to load the table of contents in the side panel.\n\n## Examples\n\nThis example loads the Iris dataset and makes some plots of it:\n\n```python\nimport dvb.datascience as ds\n\n\np = ds.Pipeline()\np.addPipe('read', ds.data.SampleData('iris'))\np.addPipe('split', ds.transform.TrainTestSplit(test_size=0.3), [(\"read\", \"df\", \"df\")])\np.addPipe('boxplot', ds.eda.BoxPlot(), [(\"split\", \"df\", \"df\")])\np.fit_transform(transform_params={'split': {'train': True}})\n```\n\nThis example shows a number of features of the package and its usage:\n\n- Adding 3 steps to the pipeline using `addPipe()`.\n- Linking the 3 steps using `[(\"read\", \"df\", \"df\")]`: this connects the `\"df\"` output (2nd element) of the `\"read\"` pipe (1st element) to the `\"df\"` input (3rd element) of the `\"split\"` pipe.\n- The usage of 3 subpackages: `ds.data`, `ds.transform` and `ds.eda`. The other 2 subpackages are: `ds.predictor` and `ds.score`.\n- The final call, `p.fit_transform()`, takes additional parameters for running the defined pipeline; these can differ for each call to `p.fit_transform()` or `p.transform()`.\n\nThis example applies the KNeighborsClassifier from sklearn to the Iris dataset:\n\n```python\nimport dvb.datascience as ds\n\nfrom sklearn.neighbors import KNeighborsClassifier\np = ds.Pipeline()\np.addPipe('read', ds.data.SampleData('iris'))\np.addPipe('clf', ds.predictor.SklearnClassifier(KNeighborsClassifier, n_neighbors=3), [(\"read\", \"df\", \"df\"), (\"read\", \"df_metadata\", \"df_metadata\")])\np.addPipe('score', ds.score.ClassificationScore(), [(\"clf\", \"predict\", \"predict\"), (\"clf\", \"predict_metadata\", \"predict_metadata\")])\np.fit_transform()\n```\n\nThis example shows:\n\n- The use of the `KNeighborsClassifier` from `sklearn`\n- Coupling multiple outputs to the inputs of a pipe: `[(\"read\", \"df\", \"df\"), (\"read\", 
\"df_metadata\", \"df_metadata\")]`\n\nFor a more extensive overview of all the features, see the docs directory.\n\n## Unit testing\n\nThe unit tests for the project can be run using [pytest](https://pytest.org/):\n\n```bash\npytest\n```\n\n### Code coverage\n\npytest will also output the coverage to the console.\n\nTo generate an HTML report, you can use:\n\n```bash\npytest --cov-report html\n```\n\n## Code styling\n\nCode styling is done using [Black](https://pypi.org/project/black/).\n\n## Built With\n\nFor an extensive list, see [setup.py](setup.py).\n\n- [scipy / numpy / pandas / matplotlib](https://www.scipy.org/) - For calculations and visualizations\n- [sklearn](http://scikit-learn.org/stable/) - Machine learning algorithms\n- [statsmodels](https://www.statsmodels.org/stable/index.html) - Statistics\n- [mlxtend](https://rasbt.github.io/mlxtend/) - Feature selection\n- [tabulate](https://pypi.org/project/tabulate/) - Printing tabular data\n- [imblearn](https://pypi.org/project/imblearn/) - SMOTE\n\n## Contributing\n\nPlease read [CONTRIBUTING.md](CONTRIBUTING.md) for details on our code of conduct, and the process for submitting pull requests to us.\n\n## Versioning\n\nWe use [SemVer](http://semver.org/) for versioning. 
For the versions available, see the [tags on this repository](https://github.com/devolksbank/dvb.datascience/tags).\n\n## Authors\n\n- **Marc Rijken** - _Initial work_ - [mrijken](https://github.com/mrijken)\n- **Wouter Poncin** - _Maintenance_ - [wpbs](https://github.com/wpbs)\n- **Daan Knoope** - _Contributor_ - [daanknoope](https://github.com/daanknoope)\n\nSee also the list of [contributors](https://github.com/devolksbank/dvb.datascience/CONTRIBUTORS) who participated in this project.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details\n\n## Contact\n\nFor any questions please don't hesitate to contact us at [tc@devolksbank.nl](mailto:tc@devolksbank.nl)\n\n## Work in progress\n\n- Adding support for multiclass classification problems\n- Adding support for regression problems\n- Adding support for Apache Spark ML", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/devolksbank/dvb.datascience", "keywords": "datascience sklearn pipeline pandas eda train", "license": "MIT License", "maintainer": "", "maintainer_email": "", "name": "dvb.datascience", "package_url": "https://pypi.org/project/dvb.datascience/", "platform": "", "project_url": "https://pypi.org/project/dvb.datascience/", "project_urls": { "Homepage": "https://github.com/devolksbank/dvb.datascience" }, "release_url": "https://pypi.org/project/dvb.datascience/0.12/", "requires_dist": null, "requires_python": "", "summary": "Some helpers for our data scientist", "version": "0.12" }, "last_serial": 4287754, "releases": { "0.12": [ { "comment_text": "", "digests": { "md5": "edeaf370270a01858879e71150f4db7c", "sha256": "0e77eda73b2342dde4a059211b3a04a93a57cfb12ff1eba2f839a3b3f36269b8" }, "downloads": -1, "filename": "dvb.datascience-0.12.tar.gz", "has_sig": false, "md5_digest": "edeaf370270a01858879e71150f4db7c", 
"packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7401419, "upload_time": "2018-09-19T09:40:59", "url": "https://files.pythonhosted.org/packages/82/42/cf03e76af94dad28534a8032dc95adf8c83ea6a8b94201b6df959a4cdf69/dvb.datascience-0.12.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "edeaf370270a01858879e71150f4db7c", "sha256": "0e77eda73b2342dde4a059211b3a04a93a57cfb12ff1eba2f839a3b3f36269b8" }, "downloads": -1, "filename": "dvb.datascience-0.12.tar.gz", "has_sig": false, "md5_digest": "edeaf370270a01858879e71150f4db7c", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7401419, "upload_time": "2018-09-19T09:40:59", "url": "https://files.pythonhosted.org/packages/82/42/cf03e76af94dad28534a8032dc95adf8c83ea6a8b94201b6df959a4cdf69/dvb.datascience-0.12.tar.gz" } ] }