{ "info": { "author": "Chu-Hsuan Lee", "author_email": "joseph.chuhsuanlee@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# guacamoleETL - Document\n\nAn ETL pipeline tool that\n* pre-process\n * extracts data from a .txt file ([Challenge_me.txt](Challenge_me.txt))\n * cleans up the data with invalid information\n* transforms the data through given specifications into a matrix (list of lists)\n* loads the data into a .csv file ([output.csv](output.csv))\n\n## Installation\nThis tool can be installed with `pip`
\nCopy-paste and run this command in the terminal\n```\npip install guacamoleETL\n```\n\n## Usage\nThis ETL pipeline can be part of predictive model training and feed the data directly to the model\n```py\nimport guacamoleETL\n\ndataFile = 'Challenge_me.txt'\n\nguacamoleETL.load(dataFile)\nresult = guacamoleETL.transform(dataFile)\npredictive_model = model_training(result)\n```\n\n## Functions\n* __extract_data(txt_file):__\n __Extract data from a .txt file to a temporary .csv file__
\n Leading or trailing whitespace are removed during the extraction\n* __clean_up():__\n __Clean up the data with invalid information__
\n Rows with the placeholder '-' (NA) in any of the specified columns are excluded\n* __transform(path):__\n __Transform the data from pre-process through given specifications into a matrix__
\n `engine-location` is split into two columns, `engine-location_front` and `engine-location_rear` and one-hot-encoded
\n `num-of-cylinders` is transformed from word into integer through a pre-defined dictionary
\n `engine-size` is transformed into integer
\n `weight` is transformed into integer
\n `horsepower` is transformed from German decimal notation string into float number
\n `aspiration` is modified as `aspiration_turbo` so that turbo engines are marked as 1
\n `price` is converted from minor units to major units
\n `make` is not transformed but kept in the dataset\n* __load(path):__\n __Load the data from previous transformation into a .csv file__\n\n\n## Architecture\nAll the functions are implemented in the [\\_\\_init\\_\\_.py](guacamoleETL/__init__.py), this decision is made based on the following reasons:\n* After the package is imported, if we want to use the transform and load functions directly as sub-module, the functions must be imported or defined in `__init__.py`.\n* Since they are all connected to each other, such as the transform function takes the result from pre-process (extract and clean up) and the load function also takes the result from transform function, it's easier to follow the flow if they are all in the same file.\n* This might not be the best architecture implementation, but while starting from small, simplicity is always a good consideration.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/chuhsuanlee/guacamoleETL", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "guacamoleETL", "package_url": "https://pypi.org/project/guacamoleETL/", "platform": "", "project_url": "https://pypi.org/project/guacamoleETL/", "project_urls": { "Homepage": "https://github.com/chuhsuanlee/guacamoleETL" }, "release_url": "https://pypi.org/project/guacamoleETL/0.3.0/", "requires_dist": null, "requires_python": "", "summary": "ETL package for AUTO1 challenge", "version": "0.3.0" }, "last_serial": 4503889, "releases": { "0.3.0": [ { "comment_text": "", "digests": { "md5": "175c6fd4fa2ef3002f7a12d929b355ff", "sha256": "c8cd67437a41fb259151da2682366e5c6806c5cb8316f02b3e97aaeb24a57fb3" }, "downloads": -1, "filename": "guacamoleETL-0.3.0-py3-none-any.whl", "has_sig": false, "md5_digest": "175c6fd4fa2ef3002f7a12d929b355ff", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6456, "upload_time": "2018-11-19T17:55:16", "url": "https://files.pythonhosted.org/packages/bc/f4/c0508079f244172f00f7353c59999e2ced8ab4e9c5b662facdcc0ae554b1/guacamoleETL-0.3.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a331408b17a67742197856510acd52ee", "sha256": "2c5ccdcfec63b51a6c6e39ff6177b3b17f297e53927e12c7fbc178319b50c47a" }, "downloads": -1, "filename": "guacamoleETL-0.3.0.tar.gz", "has_sig": false, "md5_digest": "a331408b17a67742197856510acd52ee", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3111, "upload_time": "2018-11-19T17:55:18", "url": "https://files.pythonhosted.org/packages/fe/f8/f32e0da8ab4a45f488d7dd1cb49b31664624e831e45c5ef552d9e9e1e0fe/guacamoleETL-0.3.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "175c6fd4fa2ef3002f7a12d929b355ff", "sha256": "c8cd67437a41fb259151da2682366e5c6806c5cb8316f02b3e97aaeb24a57fb3" }, "downloads": -1, "filename": "guacamoleETL-0.3.0-py3-none-any.whl", "has_sig": false, "md5_digest": "175c6fd4fa2ef3002f7a12d929b355ff", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 6456, "upload_time": "2018-11-19T17:55:16", "url": "https://files.pythonhosted.org/packages/bc/f4/c0508079f244172f00f7353c59999e2ced8ab4e9c5b662facdcc0ae554b1/guacamoleETL-0.3.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a331408b17a67742197856510acd52ee", "sha256": "2c5ccdcfec63b51a6c6e39ff6177b3b17f297e53927e12c7fbc178319b50c47a" }, "downloads": -1, "filename": "guacamoleETL-0.3.0.tar.gz", "has_sig": false, "md5_digest": "a331408b17a67742197856510acd52ee", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3111, "upload_time": "2018-11-19T17:55:18", "url": "https://files.pythonhosted.org/packages/fe/f8/f32e0da8ab4a45f488d7dd1cb49b31664624e831e45c5ef552d9e9e1e0fe/guacamoleETL-0.3.0.tar.gz" } ] }