{ "info": { "author": "Ivan Ogasawara", "author_email": "ivan.ogasawara@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6" ], "description": "===============================\nSciKit Data\n===============================\n\n\n.. image:: https://img.shields.io/pypi/v/scikit-data.svg\n :target: https://pypi.python.org/pypi/scikit-data\n\n.. image:: https://img.shields.io/travis/OpenDataScienceLab/skdata.svg\n :target: https://travis-ci.org/OpenDataScienceLab/skdata\n\n.. image:: https://readthedocs.org/projects/skdata/badge/?version=latest\n :target: https://skdata.readthedocs.io/en/latest/?badge=latest\n :alt: Documentation Status\n\n\nConda package current release info\n==================================\n\n.. image:: https://anaconda.org/conda-forge/scikit-data/badges/version.svg\n :target: https://anaconda.org/conda-forge/scikit-data\n :alt: Anaconda-Server Badge\n\n.. image:: https://anaconda.org/conda-forge/scikit-data/badges/downloads.svg\n :target: https://anaconda.org/conda-forge/scikit-data\n :alt: Anaconda-Server Badge\n\n\nAbout SciKit Data\n=================\n\nThe propose of this library is to allow the data analysis process more easy and automatic.\n\n The data analysis process is composed of following steps:\n\n * The statement of problem\n * Collecting your data\n * Cleaning the data\n * Normalizing the data\n * Transforming the data\n * Exploratory statistics\n * Exploratory visualization\n * Predictive modeling\n * Validating your model\n * Visualizing and interpreting your results\n * Deploying your solution\n\n (Cuesta, Hector and Kumar, Sampath; 2016)\n\nThis project contemplates the follow features:\n\n* Data Preparation\n* Data Exploration\n* Prepare data to Predictive modeling\n* Visualizing results\n* Reproducible data analysis\n\n\nData Preparation\n----------------\n\n Data preparation is about how to obtain, clean, normalize, and transform the data into an\n optimal dataset, trying to avoid any possible data quality issues such as invalid, ambiguous,\n out-of-range, or missing values.\n\n (...)\n\n Scrubbing data, also called data cleansing, is the process of correcting or\n removing data in a dataset that is incorrect, inaccurate, incomplete,\n improperly formatted, or duplicated.\n\n (...)\n\n In order to avoid dirty data, our dataset should possess the following characteristics:\n\n * Correct\n * Completeness\n * Accuracy\n * Consistency\n * Uniformity\n\n (...)\n\n **Data transformation**\n\n Data transformation is usually related to databases and data warehouses where values from\n a source format are extract, transform, and load in a destination format.\n\n Extract, Transform, and Load (ETL) obtains data from various data sources, performs some\n transformation functions depending on our data model, and loads the resulting data into\n the destination.\n\n (...)\n\n Some important transformations:\n\n * Text facet and Clustering\n * Numeric fact\n * Replace\n\n **Data reduction methods**\n\n Data reduction is the transformation of numerical or alphabetical digital information\n derived empirically or experimentally into a corrected, ordered, and simplified form.\n Reduced data size is very small in volume and comparatively original, hence, the storage\n efficiency will increase and at the same time we can minimize the data handling costs and\n will minimize the analysis time also.\n\n We can use several types of data reduction methods, which are listed as follows:\n\n * Filtering and sampling\n * Binned algorithm\n * Dimensionality reduction\n\n (Cuesta, Hector and Kumar, Sampath; 2016)\n\n\nData exploration\n----------------\n\n Data exploration is essentially looking at the processed data in a graphical or statistical form\n and trying to find patterns, connections, and relations in the data. Visualization is used to\n provide overviews in which meaningful patterns may be found.\n\n (...)\n\n The goals of exploratory data analysis (EDA) are as follows:\n\n * Detection of data errors\n * Checking of assumptions\n * Finding hidden patters (like tendency)\n * Preliminary selection of appropriate models\n * Determining relationships between the variables\n\n (...)\n\n The four types of EDA are univariate nongraphical, multivariate nongraphical, univariate\n graphical, and multivariate graphical. The nongraphical methods refer to the calculation of\n summary statistics or the outlier detection. In this book, we will focus on the univariate and\n\n (Cuesta, Hector and Kumar, Sampath; 2016)\n\n**Outlier Detection**\n\nTwo outlier detection method should be used, initially, for SkData are:\n\n* IQR;\n* Chauvenet.\n\nAnother methods should be implemented soon [1].\n\n\nPrepare data to Predictive modeling\n-----------------------------------\n\n From the galaxy of information we have to extract usable hidden patterns and trends using\n relevant algorithms. To extract the future behavior of these hidden patterns, we can use\n predictive modeling. Predictive modeling is a statistical technique to predict future\n behavior by analyzing existing information, that is, historical data. We have to use proper\n statistical models that best forecast the hidden patterns of the data or\n information (Cuesta, Hector and Kumar, Sampath; 2016).\n\nSkData, should allow you to format your data to send it to some predictive library\nas scikit-learn.\n\n\nVisualizing results\n-------------------\n\n In an explanatory data analysis process, simple visualization techniques are very useful for\n discovering patterns, since the human eye plays an important role. Sometimes, we have to\n generate a three-dimensional plot for finding the visual pattern. But, for getting better\n visual patterns, we can also use a scatter plot matrix, instead of a three-dimensional plot. In\n practice, the hypothesis of the study, dimensionality of the feature space, and data all play\n important roles in ensuring a good visualization technique (Cuesta, Hector and Kumar, Sampath; 2016).\n\n\nQuantitative and Qualitative data analysis\n------------------------------------------\n\n Quantitative data are numerical measurements expressed in terms of numbers.\n\n Qualitative data are categorical measurements expressed in terms of natural language\n descriptions.\n\n Quantitative analytics involves analysis of numerical data. The type of the analysis will\n depend on the level of measurement. There are four kinds of measurements:\n\n * Nominal data has no logical order and is used as classification data.\n * Ordinal data has a logical order and differences between values are not constant.\n * Interval data is continuous and depends on logical order. The data has standardized differences between values, but do not include zero.\n * Ratio data is continuous with logical order as well as regular intervals differences between values and may include zero.\n\n Qualitative analysis can explore the complexity and meaning of social phenomena. Data for\n qualitative study may include written texts (for example, documents or e-mail) and/or\n audible and visual data (digital images or sounds).\n\n (Cuesta, Hector and Kumar, Sampath; 2016)\n\n\nReproducibility for Data Analysis\n---------------------------------\n\nA good way to promote reproducibility for data analysis is store the\noperation history. This history can be used to prepare another dataset\nwith the same steps (operations).\n\n\nBooks used as reference to guide this project:\n----------------------------------------------\n\n- https://www.packtpub.com/big-data-and-business-intelligence/clean-data\n- https://www.packtpub.com/big-data-and-business-intelligence/python-data-analysis\n- https://www.packtpub.com/big-data-and-business-intelligence/mastering-machine-learning-scikit-learn\n- https://www.packtpub.com/big-data-and-business-intelligence/practical-data-analysis-second-edition\n\nSome other materials used as reference:\n---------------------------------------\n\n- https://github.com/rsouza/MMD/blob/master/notebooks/3.1_Kaggle_Titanic.ipynb\n- https://github.com/agconti/kaggle-titanic/blob/master/Titanic.ipynb\n- https://github.com/donnemartin/data-science-ipython-notebooks/blob/master/kaggle/titanic.ipynb\n\n\nInstalling scikit-data\n======================\n\nUsing conda\n-----------\n\nInstalling `scikit-data` from the `conda-forge` channel can be achieved by adding `conda-forge` to your channels with:\n\n.. code-block:: console\n\n $ conda config --add channels conda-forge\n\n\nOnce the `conda-forge` channel has been enabled, `scikit-data` can be installed with:\n\n.. code-block:: console\n\n $ conda install scikit-data\n\n\nIt is possible to list all of the versions of `scikit-data` available on your platform with:\n\n.. code-block:: console\n\n $ conda search scikit-data --channel conda-forge\n\n\nUsing pip\n---------\n\nTo install scikit-data, run this command in your terminal:\n\n.. code-block:: console\n\n $ pip install skdata\n\nIf you don't have `pip`_ installed, this `Python installation guide`_ can guide\nyou through the process.\n\n.. _pip: https://pip.pypa.io\n.. _Python installation guide: http://docs.python-guide.org/en/latest/starting/installation/\n\n\nMore Information\n----------------\n\n* License: MIT\n* Documentation: https://skdata.readthedocs.io\n\n\nReferences\n----------\n\n* CUESTA, Hector; KUMAR, Sampath. Practical Data Analysis. Packt Publishing Ltd, 2016.\n\n\n**Electronic materials**\n\n* [1] http://www.datasciencecentral.com/profiles/blogs/introduction-to-outlier-detection-methods\n\n\n=======\nHistory\n=======\n\n0.1.0 (2016-08-14)\n------------------\n\n* First release on PyPI.", "description_content_type": null, "docs_url": null, "download_url": "https://github.com/OpenDataScienceLab/skdata/archive/master.tar.gz", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/OpenDataScienceLab/skdata", "keywords": "scikit data analysis", "license": "MIT license", "maintainer": "", "maintainer_email": "", "name": "scikit-data", "package_url": "https://pypi.org/project/scikit-data/", "platform": "", "project_url": "https://pypi.org/project/scikit-data/", "project_urls": { "Download": "https://github.com/OpenDataScienceLab/skdata/archive/master.tar.gz", "Homepage": "https://github.com/OpenDataScienceLab/skdata" }, "release_url": "https://pypi.org/project/scikit-data/0.1.3/", "requires_dist": null, "requires_python": "", "summary": "The propose of this library is to allow the data analysis process more easy and automatic.", "version": "0.1.3" }, "last_serial": 3355471, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "5daf832cc2652165bfc5adadb2994d1f", "sha256": "bf1a05bb590a202b48bfaf154d875e4e1ab6261ab42e20a2c389b8e1c626c575" }, "downloads": -1, "filename": "scikit-data-0.1.0.tar.gz", "has_sig": false, "md5_digest": "5daf832cc2652165bfc5adadb2994d1f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 14480, "upload_time": "2017-07-06T16:16:03", "url": "https://files.pythonhosted.org/packages/80/94/db0cfc657ac47402177a7e7d54bae3e3a768e6dcf0d5a71ffffcc142498e/scikit-data-0.1.0.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "4a4b38dc8ec00f09c1f368b6ff33d402", "sha256": "d0f17db22e718272f9914cfe11069bea890ef6912b7409beebc72beb52a69253" }, "downloads": -1, "filename": "scikit-data-0.1.2.tar.gz", "has_sig": false, "md5_digest": "4a4b38dc8ec00f09c1f368b6ff33d402", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 18863, "upload_time": "2017-07-25T02:36:50", "url": "https://files.pythonhosted.org/packages/24/fa/339451f8113e6387f0a8c900e394be950646f54b61a0eafbe79dc11d94a0/scikit-data-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "558b1d24d59e5a8df5da995a10592449", "sha256": "d933e9822f7f8d1bb120ba313b64c45ea73e0d97c9b322797f3cfe42e3113428" }, "downloads": -1, "filename": "scikit-data-0.1.3.tar.gz", "has_sig": false, "md5_digest": "558b1d24d59e5a8df5da995a10592449", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 27179, "upload_time": "2017-11-22T14:57:09", "url": "https://files.pythonhosted.org/packages/23/b6/409c946069d3f9a3f1c4153cfc9e0c6a6762f6bbccbec9813eed4d6baf67/scikit-data-0.1.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "558b1d24d59e5a8df5da995a10592449", "sha256": "d933e9822f7f8d1bb120ba313b64c45ea73e0d97c9b322797f3cfe42e3113428" }, "downloads": -1, "filename": "scikit-data-0.1.3.tar.gz", "has_sig": false, "md5_digest": "558b1d24d59e5a8df5da995a10592449", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 27179, "upload_time": "2017-11-22T14:57:09", "url": "https://files.pythonhosted.org/packages/23/b6/409c946069d3f9a3f1c4153cfc9e0c6a6762f6bbccbec9813eed4d6baf67/scikit-data-0.1.3.tar.gz" } ] }