{ "info": { "author": "cseHdz", "author_email": "carlos.hdz@me.com", "bugtrack_url": null, "classifiers": [ "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "dExtract - A Layered Extractor of Data\n======================================\n\nPackage Overview\n----------------\n\ndExtract is based on an Object-Oriented Framework to perform\ntransformations on data on a sequential basis.\n\nThe minimal structure requires: 1. An object of the Sequential Class 2.\nObjects inheriting from the Base Layer Class\n\nData flows between layers by sending the transform output to the\nsubsequent layer. The model can be run completely or from a specific\nindex.\n\nSequential Class\n~~~~~~~~~~~~~~~~\n\nThe sequential class describes a pipeline of transformations. - Layers\nare saved in a dictionary keyed by Layer Types and count by type. - The\nmethod summary returns the components of the model for visualization.\n\nTypes of Layers\n~~~~~~~~~~~~~~~\n\nThere are 4 main types of layers inheriting from the BaseLayer.\n\n1.Slicer (DataLayer)\n^^^^^^^^^^^^^^^^^^^^\n\nThe slicer layer requires a data input that will be split into a series\nof \u2018slices\u2019 based on data sparsity/density. Row and column slices are\ncalculated and then concatenated to fined \u2018boxed areas\u2019.\n\nThe output can be defined as a list or as search dictionary to find\nspecific values in the resulting slices for quick identification.\n\nThe type of each row and column are defined by simple statistical\nthresholds. These can be customized according to user parameters.\n\nThere are 3 main data types: DATA (2), HEADING (1), EMPTY (0)\n\nThe data is then divided according to the DECISIONS matrix on\n``slicing.py``.\n\nDecisions Matrix to create slices (prev_type, cur_type) as key 1.\nstart_head (SH) - Begin counting records for a new HEADING section 2.\nend_head (EH) - Finish counting records for current HEADING section 3.\nstart_data (SD) - Begin counting records for a new DATA section 4.\nend_data (ED) - Finish counting records for current DATA section 5.\nstart_slice (SS) - Begin counting records for a new SLICE 6. end_slice\n(ES) - Finish counting records for current SLICE\n\n::\n\n SH, EH, SD, ED, SS, ES = 0, 1, 2, 3, 4, 5\n\n DECISIONS = {(-1,0): [0,0,0,0,0,0],\n (-1,1): [1,0,0,0,1,0],\n (-1,2): [0,0,1,0,1,0],\n (0,0) : [0,0,0,0,0,0],\n (0,1) : [1,0,0,0,1,0],\n (0,2) : [0,0,1,0,1,0],\n (1,0) : [0,1,0,0,0,1],\n (1,1) : [0,0,0,0,0,0],\n (1,2) : [0,1,1,0,0,0],\n (2,0) : [0,0,0,1,0,1],\n (2,1) : [1,0,0,1,1,1],\n (2,2) : [0,0,0,0,0,0]}\n\n2.Cleaner (DataLayer)\n^^^^^^^^^^^^^^^^^^^^^\n\nThe Cleaner layer will apply predefined transformations based on kwargs.\nEach transformation is applied on an individual basis.\n\nThe ``clean.py`` helper includes all transformations and it is easily\nextendible. Open the helper for a complete definition of each\ntransformation\n\n3.Extractor (BaseLayer)\n^^^^^^^^^^^^^^^^^^^^^^^\n\nThe Extractor layer provides an interface to retrieve external data.\n\nOn **v0.9dev1**: csv, xl (Excel) and single Excel sheet are supported\n\nFuture development includes: - SAS files through the SAS7BDAT Package -\nAsynchronous feed-forward extraction (allow the model to run in chunks)\n- Web Scrapping (Both files and websites)\n\n4.Flattener (BaseLayer)\n^^^^^^^^^^^^^^^^^^^^^^^\n\nThe Flattener layer transforms a nested dictionary of data into a single\nlevel dictionary.\n\nIt transforms all inputs into dataframes and identifies the result names\nby adding dictionary keys as \u2018levels\u2019 and concatenates them into a\nDataFrame ID.\n\nBased on the input names dictionary or list, each dataframe is then\nassigned a new name matching the resulting ID.\n\n--------------\n\nSample Usage\n^^^^^^^^^^^^\n\n::\n\n model = Sequential()\n model.add(Extractor(ext_type, path, file, **kwargs))\n\n model.add(Cleaner(clean_type = 'data', ignore_empty_cols = True,\n ignore_empty_rows = True, delete_by_threshold = 0.82))\n\n model.add(Cleaner(treat_axis_as_data = 'both', header_row = 0,\n delete_escape_chars = True, drop_empty_columns = True))\n\n model.add(Cleaner(compress_header = True, columns_as_row = True,\n treat_axis_as_data = 'columns'))\n\n model.add(Cleaner(index_as_col = True, transpose_output = True,\n rename_columns = {'iH':'Row_ID', 'iH_y': 'Measure',\n 'value':'Value'},\n add_columns = {'Sheet': kwargs.get('sheet_name'),\n 'File_Name': file,\n 'Country': country,\n 'File_Date': date,}))\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/csehdz/dextract", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "dextract", "package_url": "https://pypi.org/project/dextract/", "platform": "", "project_url": "https://pypi.org/project/dextract/", "project_urls": { "Homepage": "http://github.com/csehdz/dextract" }, "release_url": "https://pypi.org/project/dextract/0.1.dev7/", "requires_dist": [ "pandas (>=0.9.1)", "numpy (>=1.15.4)", "xlrd (>=1.2.0)", "pyexcel (>=0.5.10)", "pyexcel-xls (>=0.5.8)", "pyexcel-xlsx (>=0.5.7)", "statistics (>=1.0.3.5)" ], "requires_python": ">=3", "summary": "Layered extractor of data", "version": "0.1.dev7" }, "last_serial": 4837296, "releases": { "0.1.dev1": [ { "comment_text": "", "digests": { "md5": "2a496d92aea0b3402657260e086a0976", "sha256": "2b01f6b013241d77988e4b5e926970b5f0dc26298780dc3307640b584374084c" }, "downloads": -1, "filename": "dextract-0.1.dev1-py3-none-any.whl", "has_sig": false, "md5_digest": "2a496d92aea0b3402657260e086a0976", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 17626, "upload_time": "2018-12-26T20:42:23", "url": "https://files.pythonhosted.org/packages/e8/a2/7427a3373493e216e3c33a0815c3a6efd3a379c9836a7ce9cedd1dbb2082/dextract-0.1.dev1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "59ce2154a363ff3d3c7d247c941e6bdd", "sha256": "f84760009e680bac016452957c25cb9622241a95edd4618b42164d2c7ae43f66" }, "downloads": -1, "filename": "dextract-0.1.dev1.tar.gz", "has_sig": false, "md5_digest": "59ce2154a363ff3d3c7d247c941e6bdd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16616, "upload_time": "2018-12-26T20:42:25", "url": "https://files.pythonhosted.org/packages/61/97/a58ec9a92c5d13afddf7a51a9d2f1433949b2fb700c47113abdb7b756e2c/dextract-0.1.dev1.tar.gz" } ], "0.1.dev3": [ { "comment_text": "", "digests": { "md5": "c1020aba589e827452110e4a645d6c17", "sha256": "48bafd534dc1a83b9843a9caa00526cfb2796bd826994fb01a2be2b12d3692d0" }, "downloads": -1, "filename": "dextract-0.1.dev3-py3-none-any.whl", "has_sig": false, "md5_digest": "c1020aba589e827452110e4a645d6c17", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 17543, "upload_time": "2018-12-26T21:09:24", "url": "https://files.pythonhosted.org/packages/22/58/ea58370afbb07725edf1b0db094729dfa1bd9d8180493571c77c24fee92a/dextract-0.1.dev3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "284e6279d411bca0648556522232f885", "sha256": "329e49c2e625ed45a2e9e7b4a0f7273b528c7fd9e9a228e23a8c85d80f0399a9" }, "downloads": -1, "filename": "dextract-0.1.dev3.tar.gz", "has_sig": false, "md5_digest": "284e6279d411bca0648556522232f885", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16485, "upload_time": "2018-12-26T21:09:26", "url": "https://files.pythonhosted.org/packages/c4/45/ecf5c539066bebe5bc048ef9ae618d4e92c4dd882455d861c3fe5ca764ba/dextract-0.1.dev3.tar.gz" } ], "0.1.dev7": [ { "comment_text": "", "digests": { "md5": "2901a045cd03789a6599ecae49584d7e", "sha256": "82f02f108a2c138d8f25878c9e7b0e9189ffa086bb7d01b54fcdb27b3d0198fb" }, "downloads": -1, "filename": "dextract-0.1.dev7-py3-none-any.whl", "has_sig": false, "md5_digest": "2901a045cd03789a6599ecae49584d7e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 4067, "upload_time": "2019-02-18T22:46:08", "url": "https://files.pythonhosted.org/packages/14/40/37bc8eebb02b42b507a1b43754137c2f0a099bfef729646251c52ed20877/dextract-0.1.dev7-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6229eaa1a476de9d5071e9a929a3e93d", "sha256": "6e715c251ee94d8867ad1f72a137f9abedb0bdafc0107f2354e22a17ac21eec8" }, "downloads": -1, "filename": "dextract-0.1.dev7.tar.gz", "has_sig": false, "md5_digest": "6229eaa1a476de9d5071e9a929a3e93d", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 15663, "upload_time": "2019-02-18T22:46:09", "url": "https://files.pythonhosted.org/packages/95/65/95be24fc655c92f2593205ba7442966ffe4b27c58b27e8bbd14f5e1bfa45/dextract-0.1.dev7.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "2901a045cd03789a6599ecae49584d7e", "sha256": "82f02f108a2c138d8f25878c9e7b0e9189ffa086bb7d01b54fcdb27b3d0198fb" }, "downloads": -1, "filename": "dextract-0.1.dev7-py3-none-any.whl", "has_sig": false, "md5_digest": "2901a045cd03789a6599ecae49584d7e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3", "size": 4067, "upload_time": "2019-02-18T22:46:08", "url": "https://files.pythonhosted.org/packages/14/40/37bc8eebb02b42b507a1b43754137c2f0a099bfef729646251c52ed20877/dextract-0.1.dev7-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "6229eaa1a476de9d5071e9a929a3e93d", "sha256": "6e715c251ee94d8867ad1f72a137f9abedb0bdafc0107f2354e22a17ac21eec8" }, "downloads": -1, "filename": "dextract-0.1.dev7.tar.gz", "has_sig": false, "md5_digest": "6229eaa1a476de9d5071e9a929a3e93d", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3", "size": 15663, "upload_time": "2019-02-18T22:46:09", "url": "https://files.pythonhosted.org/packages/95/65/95be24fc655c92f2593205ba7442966ffe4b27c58b27e8bbd14f5e1bfa45/dextract-0.1.dev7.tar.gz" } ] }