{ "info": { "author": "Tiago Alves", "author_email": "tiagohcalves@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# Data Pipe ML\nPipeline API to manipulate dataframes for machine learning.\n\nData Pipe is a framework that wraps Pandas Data Frames to provide a more fluid method to manipulate data. \n\nBasic concepts:\n- Every operation is performed in place. The Data Pipe object keeps one and only one reference to a pandas Data Frame that is constantly updated. \n- \u200eEvery operation returns a reference to self, which allows chaining methods fluidly. \n- Every method called is recorded internally to provide improved reproducibility and understanding of the preparation pipeline. The exception is the \"load\" method.\n- \u200eData Pipe calls of unimplemented methods default to the internal Data Frame object. This allows quickly accessing some methods, such as shape and head, but please be aware that those calls are not recorded and do not return Data Pipe objects. If it's necessary to use an unimplemented function, please use the Update method to keep manipulating the Data Pipe. \n\n## Example\n\n### Full pipeline with time split\n```\n>>> from datapipeml import DataPipe\n\n>>> X, y = DataPipe.load(\"data/kiva_loans_sample.csv.gz\")\\\n>>> .anonymize(\"id\")\\\n>>> .set_index(\"id\")\\\n>>> .drop(\"tags\")\\\n>>> .drop_sparse()\\\n>>> .drop_duplicates()\\\n>>> .fill_null()\\\n>>> .remove_outliers()\\\n>>> .normalize()\\\n>>> .set_one_hot()\\\n>>> .split_train_test(by=\"date\")\n\nAnonymizing id\nNo sparse columns to drop\nFound 0 duplicated rows\nFillings columns ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']\nRemoving outliers from ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']\nNormalizing ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']\nEncoding columns ['activity', 'sector', 'country_code', 'country', 'currency', 'repayment_interval']\n \n>>> X.keep_numerics()\n>>> y.keep_numerics()\n\nDropping columns {'region', 'posted_time', 'date', 'funded_time', 'borrower_genders', 'disbursed_time', 'use'}\nDropping columns {'region', 'posted_time', 'date', 'funded_time', 'borrower_genders', 'disbursed_time', 'use'}\n\n>>> print(X.summary())\n___________________________________________________________|\nMethod Name |Args |Kwargs |\n___________________________________________________________|\nanonymize |('id',) |{} |\nset_index |('id',) |{} |\ndrop |('tags',) |{} |\ndrop_sparse |() |{} |\ndrop_duplicates |() |{} |\nfill_null |() |{} |\nremove_outliers |() |{} |\nnormalize |() |{} |\nset_one_hot |() |{} |\nsplit_train_test |() |{'by': 'date'} |\nkeep_numerics |() |{} |\n___________________________________________________________|\n```\n\n### Create target column and stratified folds\n```\n>>> folds = DataPipe.load(\"data/kiva_loans_sample.csv.gz\")\\\n>>> .set_index(\"id\")\\\n>>> .drop_duplicates()\\\n>>> .fill_null()\\\n>>> .remove_outliers()\\\n>>> .normalize()\\\n>>> .set_one_hot()\\\n>>> .create_column(\"high_loan\", lambda x: 1 if x[\"loan_amount\"] > 2000 else 0)\\\n>>> .keep_numerics()\\\n>>> .create_folds(stratify_by=\"high_loan\")\n \nFound 0 duplicated rows\nFillings columns ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']\nRemoving outliers from ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']\nNormalizing ['funded_amount', 'loan_amount', 'partner_id', 'term_in_months', 'lender_count']\nOne-hot encoding columns ['activity', 'sector', 'country_code', 'country', 'currency', 'borrower_genders', 'repayment_interval']\nCreating column high_loan\nDropping columns {'tags', 'funded_time', 'disbursed_time', 'region', 'use', 'posted_time', 'date'}\n```\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/tiagohcalves/data-pipeline", "keywords": "pandas pipeline machine learning dataset etl", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "datapipeml", "package_url": "https://pypi.org/project/datapipeml/", "platform": "", "project_url": "https://pypi.org/project/datapipeml/", "project_urls": { "Homepage": "https://github.com/tiagohcalves/data-pipeline" }, "release_url": "https://pypi.org/project/datapipeml/0.8/", "requires_dist": null, "requires_python": "", "summary": "Framework to manipulate dataframes fluidly in a pipeline.", "version": "0.8" }, "last_serial": 3746928, "releases": { "0.8": [ { "comment_text": "", "digests": { "md5": "a33b4f32f85fc004a0ab02f76d182f4b", "sha256": "64358e5cc8e75c5694ba55867a81b604dcf1a4115c398316df56bf2fc50fb131" }, "downloads": -1, "filename": "datapipeml-0.8.tar.gz", "has_sig": false, "md5_digest": "a33b4f32f85fc004a0ab02f76d182f4b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10714, "upload_time": "2018-04-08T22:28:41", "url": "https://files.pythonhosted.org/packages/1e/96/ab3ab7e2fd329e0af03a16b62cd44f6e8e51d168723f4b7c64df04706ccd/datapipeml-0.8.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "a33b4f32f85fc004a0ab02f76d182f4b", "sha256": "64358e5cc8e75c5694ba55867a81b604dcf1a4115c398316df56bf2fc50fb131" }, "downloads": -1, "filename": "datapipeml-0.8.tar.gz", "has_sig": false, "md5_digest": "a33b4f32f85fc004a0ab02f76d182f4b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10714, "upload_time": "2018-04-08T22:28:41", "url": "https://files.pythonhosted.org/packages/1e/96/ab3ab7e2fd329e0af03a16b62cd44f6e8e51d168723f4b7c64df04706ccd/datapipeml-0.8.tar.gz" } ] }