{ "info": { "author": "Lucas Miguel Ponce", "author_email": "lucasmsp@dcc.ufmg.br", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: Apache Software License", "Operating System :: POSIX :: Linux", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6", "Topic :: Software Development :: Libraries", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: System :: Distributed Computing" ], "description": "

\n \"Distributed \n

\n\n

A PyCOMPSs library for Big Data scenarios.

\n\n

\n Documentation \u2022\n Releases\n\n\n

\n\n\n\n## Introduction\n\nThe Distributed DataFrame Library provides distributed algorithms and operations ready to use as a library \nimplemented over [PyCOMPSs](https://pypi.org/project/pycompss/) programming model. Currently, is highly focused on \nETL (extract-transform-load) and Machine Learning algorithms to Data Science tasks. DDF is greatly inspired by Spark's \nDataFrame and its operators.\n\nCurrently, an operation can be of two types, transformations or actions. Action operations are those that produce \na final result (whether to save to a file or to display on screen). Transformation operations are those that will \ntransform an input DDF into another output DDF. Besides this classification, there are operations with one processing \nstage and those with two or more stages of processing (those that need to exchange information between the partitions).\n\nWhen running DDF operation/algorithms, a context variable (COMPSs Context) will check the possibility of \noptimizations during the scheduling of COMPS tasks. These optimizations can be of the type: grouping one stage \noperations to a single task COMPSs and stacking operations until an action operation is found.\n\n\n## Contents\n\n- [Summary of algorithms and operations](#summary-of-algorithms-and-operations)\n- [Example of use](#example-of-use)\n- [Installation](#Installation)\n- [Publications](#publications)\n- [License](#license)\n\n \n## Summary of algorithms and operations:\n\n - **ETL**: Add Columns, Aggregation, Cast types, Clean Missing Values, \n Difference (distinct or all), Distinct (Remove Duplicated Rows), Drop Columns, Filter rows, Set-Intersect (distinct or all), \n Join (Inner, Left or Right), Load Data, Rename columns, Replace Values, Sample Rows, Save data, Select Columns, Sort, \n Split, Transform (Map operations), Union (by column position or column name)\n - **Statistics**: Covariance, CrossTab, Describe, Frequent Items, Kolmogorov-Smirnov test, Pearson Correlation,\n - **ML**:\n - **Evaluator Models**: Binary Classification Metrics, Multi-label Metrics and Regression Metrics\n - **Feature Operations**: Vector Assembler, Vector Slicer, Binarizer, Tokenizer (Simple and Regex), \n Remove Stop-Words, N-Gram, Count Vectorizer, Tf-idf Vectorizer, String Indexer,\n Index To String, Scalers (Max-Abs, Min-Max and Standard), PCA and Polynomial Expansion\n \n - **Frequent Pattern Mining**: FP-Growth and Association Rules\n - **Classification**: K-Nearest Neighbors, Gaussian Naive Bayes, Logistic Regression, SVM\n - **Clustering**: K-means (using random or k-means|| initialization method), DBSCAN\n - **Regression**: Linear Regression using method of least squares (works only for 2-D data) or using Gradient Descent\n - **Geographic Operations**: Load data from shapefile, Geo Within (Select rows that exists within a specified shape)\n - **Graph Operations**: Initially, only Page Rank are present\n\n \n### Example of use:\n\nThe following code is an example of how to use this library for Data Science purposes. In this example, we want\nto know the number of men, women and children who survived or died in the Titanic crash.\n\nIn the first part, we will perform some pre-processing (remove some columns, clean some rows that\nhave missing values, replace some value and filter rows) and after that, aggregate the information for adult women.\n\nFor explanatory aspects, the input data (Pandas DataFrame) is distributed by COMPSs in 4 fragments using `parallelize()`. \nAt this point, the programmer no longer has to worry about partitioning the data. All operations will be able to \nwork transparently to the user. The COMPS tasks will be executed in parallel, one for each fragment. \n\n```python\nfrom ddf_library.ddf import DDF\nimport pandas as pd\n\nurl = 'https://raw.githubusercontent.com/eubr-bigsea/' \\\n 'Compss-Python/tests/titanic.csv'\ndf = pd.read_csv(url, sep='\\t')\n\nddf1 = DDF().parallelize(df, num_of_parts=4)\\\n .select(['Sex', 'Age', 'Survived'])\\\n .dropna(['Sex', 'Age'], mode='REMOVE_ROW')\\\n .replace({0: 'No', 1: 'Yes'}, subset=['Survived']).cache()\n\nddf_women = ddf1.filter('(Sex == \"female\") and (Age >= 18)')\\\n .group_by(['Survived']).count(['Survived'], alias=['Women'])\n\nddf_women.show()\n```\n\nThe image shows the DAG created by COMPSs during the execution. The operations `select(), dropna(), replace() and filter()` \nare some of them that are 'one processing stage' and then, the library was capable of group into a single COMPSs task \n(which was named `task_bundle()`). In this DAG, the other tasks are referring to the operation of `group_by.count()`. This operations \nneeds certain exchanges of information, so it performs a synchronization of some indices (light data) for submit the\n minimum amount of tasks from master node. Finally, the last synchronization is performed by `show()` function \n (which is an action) to receives the data produced.\n\n![usecase1](./docs/source/use_case_1.png)\n\n\nNext, we extend the previous code to computate the result also for men and kids. \n\n\n```python\nfrom ddf_library.ddf import DDF\nimport pandas as pd\n\nurl = 'https://raw.githubusercontent.com/eubr-bigsea/' \\\n 'Compss-Python/tests/titanic.csv'\ndf = pd.read_csv(url, sep='\\t')\n\nddf1 = DDF().parallelize(df, num_of_parts=4)\\\n .select(['Sex', 'Age', 'Survived'])\\\n .dropna(['Sex', 'Age'], mode='REMOVE_ROW')\\\n .replace({0: 'No', 1: 'Yes'}, subset=['Survived']).cache()\n\nddf_women = ddf1.filter('(Sex == \"female\") and (Age >= 18)')\\\n .group_by(['Survived']).count(['Survived'], alias=['Women'])\n\nddf_kids = ddf1.filter('Age < 18')\\\n .group_by(['Survived']).count(['Survived'], alias=['Kids'])\n\nddf_men = ddf1.filter('(Sex == \"male\") and (Age >= 18)')\\\n .group_by(['Survived']).count(['Survived'], alias=['Men'])\n\nddf_final = ddf_women\\\n .join(ddf_men, key1=['Survived'], key2=['Survived'], mode='inner')\\\n .join(ddf_kids, key1=['Survived'], key2=['Survived'], mode='inner')\n\nddf_final.show()\n```\n\nThis code will produce following result:\n\n\n| Survived | Women | Men | Kids |\n| ----------|------ | ----|----- |\n| No | 8 | 63 | 14 |\n| Yes | 24 | 7 | 10 |\n\n\n## Installation\n\nDDF Library can be installed via pip from PyPI:\n\n pip install ddf-pycompss\n\n\n### Requirements\n\nBesides the PyCOMPSs installation, DDF uses others third-party libraries. If you want to read and save data from HDFS, \nyou need to install [hdfspycompss](https://github.com/eubr-bigsea/compss-hdfs/tree/master/Python) library. The others \ndependencies will be installed automatically from PyPI or can be manually installed by using the command `$ pip install -r requirements.txt` \n\n```\npandas>=0.23.4\nscikit-learn>=0.20.3\nnumpy>=1.16.0\nscipy>=1.2.1\nPyqtree>=0.24\nmatplotlib>=1.5.1\nnetworkx>=1.11\npyshp>=1.2.11\nnltk>=3.4\n```\n\n\n\n## Publications\n\n```\nPONCE, Lucas M.; SANTOS, Walter dos; MEIRA JR., Wagner; GUEDES, Dorgival. \nExtens\u00e3o de um ambiente de computa\u00e7\u00e3o de alto desempenho para o processamento de dados massivos. \nIn: BRAZILIAN SYMPOSIUM ON COMPUTER NETWORKS AND DISTRIBUTED SYSTEMS (SBRC), 2018 \nProceedings of the 36th Brazilian Symposium on Computer Networks and Distributed Systems. \nPorto Alegre: Sociedade Brasileira de Computa\u00e7\u00e3o, may 2018. ISSN 2177-9384.\n```\n## License\n\nApache License Version 2.0, see [LICENSE](LICENSE)", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/eubr-bigsea/Compss-Python", "keywords": "", "license": "Apache License, Version 2.0", "maintainer": "", "maintainer_email": "", "name": "ddf-pycompss", "package_url": "https://pypi.org/project/ddf-pycompss/", "platform": "Linux", "project_url": "https://pypi.org/project/ddf-pycompss/", "project_urls": { "Homepage": "https://github.com/eubr-bigsea/Compss-Python" }, "release_url": "https://pypi.org/project/ddf-pycompss/0.4/", "requires_dist": null, "requires_python": "", "summary": "A PyCOMPSs library for Big Data scenarios.", "version": "0.4" }, "last_serial": 5886436, "releases": { "0.2": [ { "comment_text": "", "digests": { "md5": "aae54261912237ed264795f0e3ba70ca", "sha256": "a6e237cf8f471675b42ad47611b873bf77a94c3cd14b631822bca2ac42d6cd0b" }, "downloads": -1, "filename": "ddf-pycompss-0.2.tar.gz", "has_sig": false, "md5_digest": "aae54261912237ed264795f0e3ba70ca", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 70381, "upload_time": "2019-02-20T14:45:40", "url": "https://files.pythonhosted.org/packages/aa/8a/8e1d685854db46b28dde29fb49cf7ed69591463acdf2f8798d54cba89205/ddf-pycompss-0.2.tar.gz" } ], "0.3": [ { "comment_text": "", "digests": { "md5": "8936d294a89f9c3ce230406fe98a70f7", "sha256": "c686b0908656332cf2e1ebca27e3f23f5886e4ad14f4a911c217d0195dbfb488" }, "downloads": -1, "filename": "ddf-pycompss-0.3.tar.gz", "has_sig": false, "md5_digest": "8936d294a89f9c3ce230406fe98a70f7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 369016, "upload_time": "2019-03-22T02:31:02", "url": "https://files.pythonhosted.org/packages/f8/55/43fbe377fc7f8684c3d53492be0af40eafd94684c0168cb23485635a9ddf/ddf-pycompss-0.3.tar.gz" } ], "0.4": [ { "comment_text": "", "digests": { "md5": "3e511f5e53f7f1113224f5c322db9dbb", "sha256": "5c34be1644b051d2ee116d9848e434b383d468ba842722bf4ea01ea178b61507" }, "downloads": -1, "filename": "ddf-pycompss-0.4.tar.gz", "has_sig": false, "md5_digest": "3e511f5e53f7f1113224f5c322db9dbb", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 101897, "upload_time": "2019-09-25T17:25:08", "url": "https://files.pythonhosted.org/packages/a9/5f/8ca60808c1557a14db01fa3678eaae108ea63f76fab4ad874ae863feeb42/ddf-pycompss-0.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "3e511f5e53f7f1113224f5c322db9dbb", "sha256": "5c34be1644b051d2ee116d9848e434b383d468ba842722bf4ea01ea178b61507" }, "downloads": -1, "filename": "ddf-pycompss-0.4.tar.gz", "has_sig": false, "md5_digest": "3e511f5e53f7f1113224f5c322db9dbb", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 101897, "upload_time": "2019-09-25T17:25:08", "url": "https://files.pythonhosted.org/packages/a9/5f/8ca60808c1557a14db01fa3678eaae108ea63f76fab4ad874ae863feeb42/ddf-pycompss-0.4.tar.gz" } ] }