{ "info": { "author": "Vutsal Singhal", "author_email": "vutsalsinghal@nyu.edu", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Programming Language :: Python", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6" ], "description": "
\n

\n
\n\nCleanFlow is a framework for cleaning, pre-processing and exploring data in a scalable and distributed manner. Being built on top of Apache Spark, it is highly scalable.\n\n## Features\n* Explore data\n* Clean Data\n* Get output in different formats\n\n## Installation\n`pip install CleanFlow`\n\n## Sample usage\n\nStart Pyspark session\n```\n$ pyspark\n\nWelcome to\n ____ __\n / __/__ ___ _____/ /__\n _\\ \\/ _ \\/ _ `/ __/ '_/\n /__ / .__/\\_,_/_/ /_/\\_\\ version 2.2.1\n /_/\n\nUsing Python version 3.5.2 (default, Nov 23 2017 16:37:01)\nSparkSession available as 'spark'.\n```\n## Load data\n```python\n# DataFrame (df)\n>>> df = sqlContext.read.format('csv').options(header='true',inferschema='true').load('sample.csv')\n>>> type(df)\n\npyspark.sql.dataframe.DataFrame\n\n>>> df.printSchema()\n'''\nroot\n |-- summons_number: long (nullable = true)\n |-- issue_date: timestamp (nullable = true)\n |-- violation_code: integer (nullable = true)\n |-- violation_county: string (nullable = true)\n |-- violation_description: string (nullable = true)\n |-- violation_location: integer (nullable = true)\n |-- violation_precinct: integer (nullable = true)\n |-- violation_time: string (nullable = true)\n |-- time_first_observed: string (nullable = true)\n |-- meter_number: string (nullable = true)\n |-- issuer_code: integer (nullable = true)\n |-- issuer_command: string (nullable = true)\n |-- issuer_precinct: integer (nullable = true)\n |-- issuing_agency: string (nullable = true)\n |-- plate_id: string (nullable = true)\n |-- plate_type: string (nullable = true)\n |-- registration_state: string (nullable = true)\n |-- street_name: string (nullable = true)\n |-- vehicle_body_type: string (nullable = true)\n |-- vehicle_color: string (nullable = true)\n |-- vehicle_make: string (nullable = true)\n |-- vehicle_year: string (nullable = true)\n '''\n ```\n ## Explore Data\n ```python\n >>> from cleanflow.exploratory import describe\n >>> describe(df)\n```\n| | summons_number | violation_code | violation_location | violation_precinct | issuer_code |\n|-------|----------------|----------------|--------------------|--------------------|---------------|\n| count | 1.000000e+05 | 100000.000000 | 100000.000000 | 100000.000000 | 100000.00000 |\n| mean | 6.046625e+09 | 36.468890 | 66.483920 | 66.483980 | 445461.251460 |\n| std | 2.384666e+09 | 19.455201 | 34.810481 | 34.810365 | 199702.620407 |\n| min | 1.119098e+09 | 1.000000 | -1.000000 | 0.000000 | 0.000000 |\n| 25% | 7.014648e+09 | 21.000000 | 43.000000 | 43.000000 | 355455.000000 |\n| 50% | 7.100271e+09 | 37.000000 | 66.000000 | 66.000000 | 361282.000000 |\n| 75% | 7.451945e+09 | 41.000000 | 103.000000 | 103.000000 | 363040.000000 |\n| max | 7.698377e+09 | 99.000000 | 803.000000 | 803.000000 | 999843.000000 |\n```python\n\n>>> # Choose a subset of data\n>>> df = df.select('summons_number', 'violation_code', 'violation_county', 'plate_type', 'vehicle_year')\n>>> df.show(10)\n'''\n+--------------+--------------+----------------+----------+------------+\n|summons_number|violation_code|violation_county|plate_type|vehicle_year|\n+--------------+--------------+----------------+----------+------------+\n| 1307964308| 14| NY! | PAS| 2008|\n| 1362655727| 98| BX$ | PAS| 1999|\n| 1363178234| 21| NY$ | COM| 0|\n| 1365797030| 74| K$! | PAS| 1999|\n| 1366529595| 38| NY | COM| 2005|\n| 1366571757| 20| NY| COM| 2013|\n| 1363178192| 21| NY| PAS| 0|\n| 1362906062| 21| BX| PAS| 2008|\n| 1367591351| 40| K| PAS| 2005|\n| 1354042244| 20| NY| COM| 0|\n+--------------+--------------+----------------+----------+------------+\n'''\nonly showing top 10 rows\n```\n\n```python\n>>> from cleanflow.exploratory import check_duplicates, find_unique\n>>> check_duplicates(df, column='violation_county').show()\n'''\n+--------------+-----+\n|DuplicateValue|Count|\n+--------------+-----+\n| K|29123|\n| Q|25961|\n| BX|21203|\n| NY|20513|\n| R| 2728|\n| null| 467|\n+--------------+-----+\n'''\n>>> find_unique(df, column='violation_county').show()\n'''\n+------------+-----+ \n|UniqueValues|Count|\n+------------+-----+\n| K|29123|\n| Q|25961|\n| BX|21203|\n| NY|20513|\n| R| 2728|\n| null| 467|\n| NY | 1|\n| NY$ | 1|\n| K$! | 1|\n| BX$ | 1|\n| NY! | 1|\n+------------+-----+\n'''\n```\n## Clean data\n```python\n>>> import cleanflow.preprocessing as cfpr\n>>> cfpr.rmSpChars(cfpr.trim_col(df)).show(10)\n'''\n+--------------+--------------+----------------+----------+------------+\n|summons_number|violation_code|violation_county|plate_type|vehicle_year|\n+--------------+--------------+----------------+----------+------------+\n| 1307964308| 14| NY| PAS| 2008|\n| 1362655727| 98| BX| PAS| 1999|\n| 1363178234| 21| NY| COM| 0|\n| 1365797030| 74| K| PAS| 1999|\n| 1366529595| 38| NY| COM| 2005|\n| 1366571757| 20| NY| COM| 2013|\n| 1363178192| 21| NY| PAS| 0|\n| 1362906062| 21| BX| PAS| 2008|\n| 1367591351| 40| K| PAS| 2005|\n| 1354042244| 20| NY| COM| 0|\n+--------------+--------------+----------------+----------+------------+\n'''\nonly showing top 10 rows\n\n>>> cfpr.rmSpChars(trim_col(df), regex='[^A-Za-z0-9]+').show(10)\n'''\n+--------------+--------------+----------------+----------+------------+\n|summons_number|violation_code|violation_county|plate_type|vehicle_year|\n+--------------+--------------+----------------+----------+------------+\n| 1307964308| 14| NY| PAS| 2008|\n| 1362655727| 98| BX| PAS| 1999|\n| 1363178234| 21| NY| COM| 0|\n| 1365797030| 74| K| PAS| 1999|\n| 1366529595| 38| NY| COM| 2005|\n| 1366571757| 20| NY| COM| 2013|\n| 1363178192| 21| NY| PAS| 0|\n| 1362906062| 21| BX| PAS| 2008|\n| 1367591351| 40| K| PAS| 2005|\n| 1354042244| 20| NY| COM| 0|\n+--------------+--------------+----------------+----------+------------+\n'''\nonly showing top 10 rows\n```\n\n```python\n>>> cfpr.upper_case(cfpr.lower_case(df), columns='violation_county').show(10)\n'''\n+--------------+--------------+----------------+----------+------------+\n|summons_number|violation_code|violation_county|plate_type|vehicle_year|\n+--------------+--------------+----------------+----------+------------+\n| 1307964308| 14| NY! | pas| 2008|\n| 1362655727| 98| BX$ | pas| 1999|\n| 1363178234| 21| NY$ | com| 0|\n| 1365797030| 74| K$! | pas| 1999|\n| 1366529595| 38| NY | com| 2005|\n| 1366571757| 20| NY| com| 2013|\n| 1363178192| 21| NY| pas| 0|\n| 1362906062| 21| BX| pas| 2008|\n| 1367591351| 40| K| pas| 2005|\n| 1354042244| 20| NY| com| 0|\n+--------------+--------------+----------------+----------+------------+\n'''\nonly showing top 10 rows\n```", "description_content_type": "", "docs_url": null, "download_url": "https://github.com/vutsalsinghal/CleanFlow/archive/master.zip", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/vutsalsinghal/CleanFlow", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "CleanFlow", "package_url": "https://pypi.org/project/CleanFlow/", "platform": "", "project_url": "https://pypi.org/project/CleanFlow/", "project_urls": { "Download": "https://github.com/vutsalsinghal/CleanFlow/archive/master.zip", "Homepage": "https://github.com/vutsalsinghal/CleanFlow" }, "release_url": "https://pypi.org/project/CleanFlow/1.3.3a1/", "requires_dist": null, "requires_python": "", "summary": "A a framework for cleaning, pre-processing and exploring data in a scalable and distributed manner.", "version": "1.3.3a1" }, "last_serial": 3852415, "releases": { "1.0.3a1": [ { "comment_text": "", "digests": { "md5": "12bb244a295494390c582e92691d9421", "sha256": "e9c70c0c9fdf3589fa81a858ce1acfd66af9747c7a4de70f58e8af514ca8b98e" }, "downloads": -1, "filename": "CleanFlow-1.0.3a1.tar.gz", "has_sig": false, "md5_digest": "12bb244a295494390c582e92691d9421", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1022, "upload_time": "2018-05-02T21:41:33", "url": "https://files.pythonhosted.org/packages/31/84/839f2622ce78bb28921e191dc02400cc2b1871f55a0ca6bf9d89df7892d3/CleanFlow-1.0.3a1.tar.gz" } ], "1.0.4a1": [ { "comment_text": "", "digests": { "md5": "df108f6c5778653b13ef6678abcbe398", "sha256": "686225f7623f4351fa92f748d1888ae4650f447aaeeca2c2fb2286429a110355" }, "downloads": -1, "filename": "CleanFlow-1.0.4a1.tar.gz", "has_sig": false, "md5_digest": "df108f6c5778653b13ef6678abcbe398", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6753, "upload_time": "2018-05-02T22:04:22", "url": "https://files.pythonhosted.org/packages/06/4a/65488f779f7e6586b6066a2f39594ac08002603d60210cca47ed2286abf8/CleanFlow-1.0.4a1.tar.gz" } ], "1.0.5a1": [ { "comment_text": "", "digests": { "md5": "c4f9a888b44aec834c8db8eff34954ad", "sha256": "83753ac4be19887f39b020a382b77ee00b0a0310aff37e7fd0460e447d48f4a2" }, "downloads": -1, "filename": "CleanFlow-1.0.5a1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "c4f9a888b44aec834c8db8eff34954ad", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 13294, "upload_time": "2018-05-02T22:15:48", "url": "https://files.pythonhosted.org/packages/91/89/4b80f2de2654422a0551d07735a74facfa3a6c1d384dffa990708461700a/CleanFlow-1.0.5a1-py2.py3-none-any.whl" } ], "1.0.6a1": [ { "comment_text": "", "digests": { "md5": "42b6cad963d3a3026b65c55e7745eb76", "sha256": "39a7766c3a3bafe092eda04d2b579831e7d39d8fbebc11e69075ece255e60e19" }, "downloads": -1, "filename": "CleanFlow-1.0.6a1.tar.gz", "has_sig": false, "md5_digest": "42b6cad963d3a3026b65c55e7745eb76", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6827, "upload_time": "2018-05-02T23:50:30", "url": "https://files.pythonhosted.org/packages/0a/28/84859bfcd2a4d6147026c0eee21da903fa49481a7cdac723072a2b3337d2/CleanFlow-1.0.6a1.tar.gz" } ], "1.0.7a1": [ { "comment_text": "", "digests": { "md5": "4c91f6b813305989adfb879904b7c1ab", "sha256": "e20fc7347b81b78be004ef7b04e98e167d1af6f3d4fd94bd3d8ab1b18a1f9897" }, "downloads": -1, "filename": "CleanFlow-1.0.7a1.tar.gz", "has_sig": false, "md5_digest": "4c91f6b813305989adfb879904b7c1ab", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6823, "upload_time": "2018-05-03T00:01:39", "url": "https://files.pythonhosted.org/packages/d6/72/75942909087ba9544ee2db766e651bfa365f1eb0c3d831ed250f089a14a0/CleanFlow-1.0.7a1.tar.gz" } ], "1.0.8a1": [ { "comment_text": "", "digests": { "md5": "d4305fd97578ecf2ff3e818b677671a6", "sha256": "b64a5b120bd12f6c7da2c0cc3f6478215b287f636f72f97a958864a12ef649a3" }, "downloads": -1, "filename": "CleanFlow-1.0.8a1.tar.gz", "has_sig": false, "md5_digest": "d4305fd97578ecf2ff3e818b677671a6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7123, "upload_time": "2018-05-03T05:39:27", "url": "https://files.pythonhosted.org/packages/b7/fb/da0d1b6cf88b3a32a4936706a3a8e35a1e3a02516487cae41191ef3edec7/CleanFlow-1.0.8a1.tar.gz" } ], "1.2.8a1": [ { "comment_text": "", "digests": { "md5": "b00412ad6d7e9969a7a2afbc110fbbb8", "sha256": "28acee7bdc8b86e42aa5248b7ae1504651803cff696221347440a3b78348cc4d" }, "downloads": -1, "filename": "CleanFlow-1.2.8a1.tar.gz", "has_sig": false, "md5_digest": "b00412ad6d7e9969a7a2afbc110fbbb8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7000, "upload_time": "2018-05-03T06:19:06", "url": "https://files.pythonhosted.org/packages/23/04/4204de4f27a0559bff7afca64438f4d23684fdd8bd438862e0f17a3f54cb/CleanFlow-1.2.8a1.tar.gz" } ], "1.2.9a1": [ { "comment_text": "", "digests": { "md5": "f077d9f0bf6f541b6ab638f755af000f", "sha256": "7dbd41dffa34bbfbbcc66549edd12abac6fd283b86685897e3b7ad20c5b6be6f" }, "downloads": -1, "filename": "CleanFlow-1.2.9a1.tar.gz", "has_sig": false, "md5_digest": "f077d9f0bf6f541b6ab638f755af000f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7303, "upload_time": "2018-05-03T06:19:07", "url": "https://files.pythonhosted.org/packages/dd/9b/81ff78ca8dae3634f211737acc89fc1698d7e133d5227bfa9728d1fc2055/CleanFlow-1.2.9a1.tar.gz" } ], "1.3.0a1": [ { "comment_text": "", "digests": { "md5": "e348c24208086102190175a95ca93f50", "sha256": "7664c8dd02f9a87b788ed3117c8078705e0dc3eb518a8ae88680cf97428393ce" }, "downloads": -1, "filename": "CleanFlow-1.3.0a1.tar.gz", "has_sig": false, "md5_digest": "e348c24208086102190175a95ca93f50", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7322, "upload_time": "2018-05-03T06:25:46", "url": "https://files.pythonhosted.org/packages/ab/d1/ff84ab50211938a1e28d9a07733839d354c8f6f676ed66cb163523c67eef/CleanFlow-1.3.0a1.tar.gz" } ], "1.3.1.1a1": [ { "comment_text": "", "digests": { "md5": "d93e1c62120fb6ffec4a31c172f73a1f", "sha256": "ceb1cd846e3c89204be7fc73018425f53e1bffdc80f614f266e1cbb3a0e3e4a9" }, "downloads": -1, "filename": "CleanFlow-1.3.1.1a1.tar.gz", "has_sig": false, "md5_digest": "d93e1c62120fb6ffec4a31c172f73a1f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10149, "upload_time": "2018-05-04T18:35:50", "url": "https://files.pythonhosted.org/packages/f1/c2/719d3af12375a9303797919cebc268be1e7c23cf82e7c0985574276ae18a/CleanFlow-1.3.1.1a1.tar.gz" } ], "1.3.3a1": [ { "comment_text": "", "digests": { "md5": "85c541c6b9b8ede562ab8be0442b3cbe", "sha256": "40425f509102021c537157d5481856b0835b5243fb2a80a2ade2ee42abfad8e5" }, "downloads": -1, "filename": "CleanFlow-1.3.3a1.tar.gz", "has_sig": false, "md5_digest": "85c541c6b9b8ede562ab8be0442b3cbe", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10805, "upload_time": "2018-05-10T23:42:21", "url": "https://files.pythonhosted.org/packages/cb/ec/63d976e655521f51d71737b6aaa046c0cb03aeb6910ce14a1096dce1d47f/CleanFlow-1.3.3a1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "85c541c6b9b8ede562ab8be0442b3cbe", "sha256": "40425f509102021c537157d5481856b0835b5243fb2a80a2ade2ee42abfad8e5" }, "downloads": -1, "filename": "CleanFlow-1.3.3a1.tar.gz", "has_sig": false, "md5_digest": "85c541c6b9b8ede562ab8be0442b3cbe", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10805, "upload_time": "2018-05-10T23:42:21", "url": "https://files.pythonhosted.org/packages/cb/ec/63d976e655521f51d71737b6aaa046c0cb03aeb6910ce14a1096dce1d47f/CleanFlow-1.3.3a1.tar.gz" } ] }