{ "info": { "author": "Adam Kariv", "author_email": "adam.kariv@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "# Dataflows Resource write to db normalized\n\nThis library provides some dataflows processing for normalizing a resource.\n\nIt has special support for storing normalized data into DB tables.\n\n## What is normalization?\n\nIn short, it is the process of reducing duplication in a dataset.\n\nMore can be read about this concept [here](https://en.wikipedia.org/wiki/Database_normalization).\n\n## Example\n\nLet's take, as an example, this world cities dataset (we shall call it the *fact* resource):\n\n```python\nfrom dataflows import Flow, load, printer\n\nFlow(\n load('https://datahub.io/core/world-cities/r/world-cities.csv', name='cities'),\n printer(num_rows=1)\n).process()\n```\n\n*cities:*\n|# |name |country |subcountry |geonameid\n|-----|----------------|----------|------------------|-----------\n |1 |les Escaldes |Andorra |Escaldes-Engordany |3040051\n |2 |Andorra la Vella|Andorra |Andorra la Vella |3041563\n |...\n |23018|**Chitungwiza** |**Zimbabwe** |**Harare** |1106542\n\n\n\nIt seems that the `country` and `subcountry` columns are quite repetitive - let's extract them into a separate, deduplicated resource (we will call that a *dimension* resource).\n\nTo do that we use the `normalize` processor.\n\nThis processor receives a single resource name, and a list of `NormGroup` instances. Each of these groups specifies one new *dimension* resource to be extracted and deduplicated.\n\nLet's see it in action:\n\n\n```python\nfrom dataflows_normalize import normalize, NormGroup\n\nFlow(\n load('https://datahub.io/core/world-cities/r/world-cities.csv', name='cities'),\n normalize([\n NormGroup(['country', 'subcountry'], 'country_id', 'id') \n ], resource='cities'),\n printer()\n).process()\n```\n\n*cities:*\n |# |name |geonameid |country_id\n |-----|-----------------|-----------|------------\n |1 |les Escaldes |3040051 |0\n |2 |Andorra la Vella |3041563 |1\n |3 |Umm al Qaywayn |290594 |2\n |4 |Ras al-Khaimah |291074 |3\n |5 |Khawr Fakk\u0101n |291696 |4\n |...\n |23014|Bulawayo |894701 |2677\n |23015|Bindura |895061 |2678\n |23016|Beitbridge |895269 |2679\n |23017|Epworth |1085510 |2676\n |23018|**Chitungwiza** |1106542 |**2676**\n\n*cities_country_id:*\n |# |id|country |subcountry\n |----|-----------|-------------------|----------------------\n |1 |30|Afghanistan |Badakhshan\n |2 |27|Afghanistan |Badghis\n |3 |21|Afghanistan |Balkh\n |4 |33|Afghanistan |B\u0101m\u012b\u0101n\n |5 |31|Afghanistan |Farah\n |6 |19|Afghanistan |Faryab\n |7 |28|Afghanistan |Ghazn\u012b\n |8 |13|Afghanistan |Ghowr\n |9 |22|Afghanistan |Helmand\n |10 |11|Afghanistan |Herat\n |...\n |2671 |2677|Zimbabwe |Bulawayo\n |2672 |**2676**|**Zimbabwe** |**Harare**\n |2673 |2673|Zimbabwe |Manicaland\n |2674 |2678|Zimbabwe |Mashonaland Central\n |2675 |2675|Zimbabwe |Mashonaland East\n |2676 |2674|Zimbabwe |Mashonaland West\n |2677 |2670|Zimbabwe |Masvingo\n |2678 |2671|Zimbabwe |Matabeleland North\n |2679 |2679|Zimbabwe |Matabeleland South\n |2680 |2672|Zimbabwe |Midlands\n\nIf we follow the last line in the dataset (`Chitungwiza`), we can see that an entry for its region (`Zimbabwe/Harare`) was created with id `2676`, and that id was added to the original row instead of the original values.\n\n**How much did we gain?**\n\nThe original CSV file has a size of 895,586 bytes.\n\nIf we save the two new resources as CSVs, we would get\n\n542,299 bytes for the *fact* resource and 68,023 for the regions *dimension* resource - a total of 610,322 bytes (or a reduction of 31% in size).\n\nNot only this helps with size, it also improves greatly DB performance to store data in normalized form.\n\n## DB Normalization\n\nRunning similar code to above, only using `normalize_to_db` will do the following:\n- Load existing values from database *dimension* tables (in case these tables exist)\n- Normalize the input data, and split into *fact* and *dimension* resources\n- Update the DB tables with new values, while reusing existing references\n\nThe main difference in usage from `normalize` is that the names of DB tables are provided.\n\n\n```python\nfrom dataflows_normalize import normalize_to_db, NormGroup\n\nFlow(\n load('https://datahub.io/core/world-cities/r/world-cities.csv', name='cities'),\n normalize_to_db(\n [\n NormGroup(['country', 'subcountry'], 'country_id', 'id', db_table='countries_db_table') \n ], \n 'cities_db_table', 'cities',\n db_connection_str='...'\n ),\n).process()\n```\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/dataspot/dataflows-normalize", "keywords": "data", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "dataflows-normalize", "package_url": "https://pypi.org/project/dataflows-normalize/", "platform": "", "project_url": "https://pypi.org/project/dataflows-normalize/", "project_urls": { "Homepage": "https://github.com/dataspot/dataflows-normalize" }, "release_url": "https://pypi.org/project/dataflows-normalize/0.0.8/", "requires_dist": [ "dataflows (>=0.0.51)", "kvfile (>=0.0.7)", "psycopg2-binary", "pylama ; extra == 'develop'", "tox ; extra == 'develop'" ], "requires_python": "", "summary": "A resource normalizer for dataflows", "version": "0.0.8" }, "last_serial": 5598084, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "caa5d218e403af49cdaf9f68e0435647", "sha256": "4f00bd30e24707cbffd801b0707ea05e5baf47b81181c1a04b41ad7c35f09691" }, "downloads": -1, "filename": "dataflows-normalize-0.0.1.tar.gz", "has_sig": false, "md5_digest": "caa5d218e403af49cdaf9f68e0435647", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5909, "upload_time": "2019-05-02T09:40:06", "url": "https://files.pythonhosted.org/packages/0d/a3/83c5e57865a6461c24175e5617dcee537127f831c9f68a3af113b519a2e5/dataflows-normalize-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "290a590a1220479379170c0f9c9f0039", "sha256": "b251777063c81a1c0b3085d6eef76e75c52bd6f9c304958159ba0457d9f0c131" }, "downloads": -1, "filename": "dataflows_normalize-0.0.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "290a590a1220479379170c0f9c9f0039", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6151, "upload_time": "2019-07-18T12:23:27", "url": "https://files.pythonhosted.org/packages/fe/ee/66b445568992403d8860e85ced0dc36acc606b379d8baa613eff20b3b5f6/dataflows_normalize-0.0.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "32c06e37145dbe9046e2d0b677032ecb", "sha256": "98b4f6ee68f44a26cd726e41d8ad75ffb1e5ce0d1d46c7ef214308f17f46adb1" }, "downloads": -1, "filename": "dataflows-normalize-0.0.2.tar.gz", "has_sig": false, "md5_digest": "32c06e37145dbe9046e2d0b677032ecb", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6037, "upload_time": "2019-07-18T12:23:28", "url": "https://files.pythonhosted.org/packages/21/91/43da99c5556cc654c0e8d42b2b262d1edcb7b92f7663ffef58f806599d78/dataflows-normalize-0.0.2.tar.gz" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "273e5a1d97572fedfbbf96e1655166f1", "sha256": "f35e160fd2e1543fbcdc0c3db9d191abafdca8500881b9acf45618d7b6960978" }, "downloads": -1, "filename": "dataflows_normalize-0.0.3-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "273e5a1d97572fedfbbf96e1655166f1", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6207, "upload_time": "2019-07-25T16:55:31", "url": "https://files.pythonhosted.org/packages/71/01/5ca80e29a10395a63c397100ea9475566ac13ae3392c00bac0c38e458ca2/dataflows_normalize-0.0.3-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "00c573339c530f752e4edf3dee97158d", "sha256": "42471eadb06b7ac27fd9b310bc0ac613aaaec1c1e0b5ad8affb5237ea625a3f4" }, "downloads": -1, "filename": "dataflows-normalize-0.0.3.tar.gz", "has_sig": false, "md5_digest": "00c573339c530f752e4edf3dee97158d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6089, "upload_time": "2019-07-25T16:55:33", "url": "https://files.pythonhosted.org/packages/92/2e/3314af4c4cbe4a8e03209f5f319760e0baf6628c2ca75f11efb148c18dc4/dataflows-normalize-0.0.3.tar.gz" } ], "0.0.4": [ { "comment_text": "", "digests": { "md5": "8526ffc7c7e879c2b5ad50c5a0fc07f6", "sha256": "64a8e079c5b1f916690fb3f2363688c6e44082e096d5530d3f3149c4c730cd72" }, "downloads": -1, "filename": "dataflows_normalize-0.0.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "8526ffc7c7e879c2b5ad50c5a0fc07f6", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6295, "upload_time": "2019-07-26T05:46:40", "url": "https://files.pythonhosted.org/packages/1f/43/a373295b121a6e2980cd00ae1d17c13f7b31c1250b52dfec8621034f1005/dataflows_normalize-0.0.4-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "65753fba488f365eaaa085fb91cd63df", "sha256": "22d5c35f4850ffd0a7581170dacbf479b8c05d2832af5878e5b65257e1431ec0" }, "downloads": -1, "filename": "dataflows-normalize-0.0.4.tar.gz", "has_sig": false, "md5_digest": "65753fba488f365eaaa085fb91cd63df", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6166, "upload_time": "2019-07-26T05:46:43", "url": "https://files.pythonhosted.org/packages/0e/79/bb45741a1261fb4cc4fdf1d49de42290926f6ac401e203a8de76179f6657/dataflows-normalize-0.0.4.tar.gz" } ], "0.0.5": [ { "comment_text": "", "digests": { "md5": "f0c50a03ce572aa19ebfacd26d858bdd", "sha256": "e887989f1f599425c5bf37b573c6b9a6360e571e704fc7f835e878ae929bf47a" }, "downloads": -1, "filename": "dataflows_normalize-0.0.5-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "f0c50a03ce572aa19ebfacd26d858bdd", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6312, "upload_time": "2019-07-28T19:54:56", "url": "https://files.pythonhosted.org/packages/00/90/b4afa034e80a4f370a74ae7da3363f2cab634a6aa226bb8be49fd9e63cf2/dataflows_normalize-0.0.5-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d5702e01f9bd81ea2284aec4628536d5", "sha256": "14c4ee05473a81ac521dcae47dca1c0c149d432ca9e6e2ee1684b38a2edf2990" }, "downloads": -1, "filename": "dataflows-normalize-0.0.5.tar.gz", "has_sig": false, "md5_digest": "d5702e01f9bd81ea2284aec4628536d5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6186, "upload_time": "2019-07-28T19:54:58", "url": "https://files.pythonhosted.org/packages/9b/a3/a08b6303bbc49e89979b1c6d9c7408b8e56ec2e5d987786799de98c0e148/dataflows-normalize-0.0.5.tar.gz" } ], "0.0.6": [ { "comment_text": "", "digests": { "md5": "ccf6b01a60c6b8650e73bfb9eda58612", "sha256": "c6847c740438f3667381c5472dda2a978f2afde8a70032f7a17a4e0267a8941f" }, "downloads": -1, "filename": "dataflows_normalize-0.0.6-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "ccf6b01a60c6b8650e73bfb9eda58612", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6703, "upload_time": "2019-07-28T20:32:29", "url": "https://files.pythonhosted.org/packages/94/5a/eefb5f2babd96c01a0852eeeee7ac44115eedddacd0f3ff6f33cb9574572/dataflows_normalize-0.0.6-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2eeacb806c137eabe74ad3c96f1cdab6", "sha256": "9dcca68d709627ab88ef0fd832af3b65ba910f6063bdb14e472c9c98a4c419e9" }, "downloads": -1, "filename": "dataflows-normalize-0.0.6.tar.gz", "has_sig": false, "md5_digest": "2eeacb806c137eabe74ad3c96f1cdab6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6401, "upload_time": "2019-07-28T20:32:31", "url": "https://files.pythonhosted.org/packages/b4/17/8130987bc5e50b02d4348c2a06799ff2b7ac931eccd3ee035529d1ef5740/dataflows-normalize-0.0.6.tar.gz" } ], "0.0.7": [ { "comment_text": "", "digests": { "md5": "0dbf0fe3267dd4513ddea0e7955b5756", "sha256": "a3e5d0e36084488bb5940eccbd539344f5d64386ee8de6d004386b92376e9987" }, "downloads": -1, "filename": "dataflows_normalize-0.0.7-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "0dbf0fe3267dd4513ddea0e7955b5756", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6733, "upload_time": "2019-07-28T21:13:42", "url": "https://files.pythonhosted.org/packages/30/03/94960efa15cee42f28c0feb08586b8044b115cb223af4a32213e77852f5c/dataflows_normalize-0.0.7-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f4dd5277559939ae45d660b0df34caff", "sha256": "76813ec7cc83f6fecd9503517562a87d9928d388174707ed13e766b61bc53619" }, "downloads": -1, "filename": "dataflows-normalize-0.0.7.tar.gz", "has_sig": false, "md5_digest": "f4dd5277559939ae45d660b0df34caff", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6432, "upload_time": "2019-07-28T21:13:44", "url": "https://files.pythonhosted.org/packages/72/d3/1f2fb9a39ceebf90c297a03f446987e83a93b6f84b7cd7ee8ed07ad33ab6/dataflows-normalize-0.0.7.tar.gz" } ], "0.0.8": [ { "comment_text": "", "digests": { "md5": "fb7acdc263f6cbf32e0409e90d537f83", "sha256": "c45dcab1ff59a9ce81bb1a0cf7a1baaa064dec2d0bd0dabe94469a258b50682d" }, "downloads": -1, "filename": "dataflows_normalize-0.0.8-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "fb7acdc263f6cbf32e0409e90d537f83", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6752, "upload_time": "2019-07-29T07:46:38", "url": "https://files.pythonhosted.org/packages/25/10/1926e9d9e40a10134b673382869e37bcad81628b9f2131b2564cef4d50a4/dataflows_normalize-0.0.8-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "778e2183d905ac9d7acf98a71ee676cc", "sha256": "075988a96c5cfbcc919758c097cb17b08e0998b138d0a3c764fe44f12bcc68e2" }, "downloads": -1, "filename": "dataflows-normalize-0.0.8.tar.gz", "has_sig": false, "md5_digest": "778e2183d905ac9d7acf98a71ee676cc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6448, "upload_time": "2019-07-29T07:46:40", "url": "https://files.pythonhosted.org/packages/32/a6/552374f32239a4a06970b8ab61eb456bff56f4be0548b979c302f9ff11de/dataflows-normalize-0.0.8.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "fb7acdc263f6cbf32e0409e90d537f83", "sha256": "c45dcab1ff59a9ce81bb1a0cf7a1baaa064dec2d0bd0dabe94469a258b50682d" }, "downloads": -1, "filename": "dataflows_normalize-0.0.8-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "fb7acdc263f6cbf32e0409e90d537f83", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 6752, "upload_time": "2019-07-29T07:46:38", "url": "https://files.pythonhosted.org/packages/25/10/1926e9d9e40a10134b673382869e37bcad81628b9f2131b2564cef4d50a4/dataflows_normalize-0.0.8-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "778e2183d905ac9d7acf98a71ee676cc", "sha256": "075988a96c5cfbcc919758c097cb17b08e0998b138d0a3c764fe44f12bcc68e2" }, "downloads": -1, "filename": "dataflows-normalize-0.0.8.tar.gz", "has_sig": false, "md5_digest": "778e2183d905ac9d7acf98a71ee676cc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6448, "upload_time": "2019-07-29T07:46:40", "url": "https://files.pythonhosted.org/packages/32/a6/552374f32239a4a06970b8ab61eb456bff56f4be0548b979c302f9ff11de/dataflows-normalize-0.0.8.tar.gz" } ] }