{ "info": { "author": "Thomas Levine", "author_email": "_@thomaslevine.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers" ], "description": "Searching spreadsheets\n-------------------------\nWhen we search for ordinary written documents, we send words into a search\nengine and get pages of words back.\n\nWhat if we could search for spreadsheets\nby sending spreadsheets into a search engine and getting spreadsheets back?\nThe order of the results would be determined by various specialized statistics;\njust as we use PageRank to find relevant hypertext documents, we can develop\nother statistics that help us find relevant spreadsheets.\n\nData tables\n-------------\nI think a lot about rows and columns. When we define tables in relational\ndatabases, we can say reasonably well what each column means, based on\nnames and types, and what a row means, based on unique indices.\nIn spreadsheets, we still have column names, but we don't get everything\nelse.\n\nI define a data table as a document that describes a collection of\nsimilar things, with similar information about each thing\n(http://www.datakind.org/blog/whats-in-a-table/).\nWhen we represent a data table in a grid, each row is an observation\n(a thing), and each column is a variable.\nBlizzard uses this table structure to find connections between arbitrary\ndata tables.\n\nUnique indices\n----------------\nI define a unique index as a column in a datatable, or combination of columns,\nfor which each row has a different value.\n\nThe unique indices tell us quite a lot; they give us an idea about the\nobservational unit of the table and what other tables we can nicely\njoin or union with that table. So I made a package that finds unique\nindices in ordinary CSV files. ::\n\n pip3 install special_snowflake\n\nIt's called \"special snowflake\", but it needs a better name.\n\nIf we pass the iris dataset to it, ::\n\n \"Sepal.Length\",\"Sepal.Width\",\"Petal.Length\",\"Petal.Width\",\"Species\"\n 5.1,3.5,1.4,0.2,\"setosa\"\n 4.9,3,1.4,0.2,\"setosa\"\n 4.7,3.2,1.3,0.2,\"setosa\"\n 4.6,3.1,1.5,0.2,\"setosa\"\n ...\n\nwe get no unique keys ::\n\n >>> special_snowflake.fromcsv(open('iris.csv')) \n set()\n\nbecause no combination of columns uniquely identifies the rows.\nOf course, if we add an identifier column, ::\n\n \"Id\",\"Sepal.Length\",\"Sepal.Width\",\"Petal.Length\",\"Petal.Width\",\"Species\"\n 1,5.1,3.5,1.4,0.2,\"setosa\"\n 2,4.9,3,1.4,0.2,\"setosa\"\n 3,4.7,3.2,1.3,0.2,\"setosa\"\n 4,4.6,3.1,1.5,0.2,\"setosa\"\n ...\n\nthat one gets returned. ::\n\n >>> special_snowflake.fromcsv(open('iris.csv')) \n {('Id',)}\n\nFor a more interesting example, let's look at chickweight.\n\n \"weight\",\"Time\",\"Chick\",\"Diet\"\n 42,0,\"1\",\"1\"\n 51,2,\"1\",\"1\"\n 59,4,\"1\",\"1\"\n 64,6,\"1\",\"1\"\n 76,8,\"1\",\"1\"\n ...\n\nI could read the documentation on this dataset and tell you\nwhat its statistical unit is (`?ChickWeight` in R), or I could\njust let `special_snowflake` figure it out for me.\n\n >>> special_snowflake.fromcsv(open('chick.csv'))\n {('Time', 'Chick')}\n\nThe statistical unit is chicks in time. That is, something was\nobserved across multiple chick, and multiple observations were\ntaken from each (well, at least one) chick.\n\nSome spreadsheets are have less obvious identifiers. In this\ntable of 1219 rows and 33 columns,\n\n >>> from urllib.request import urlopen\n >>> url = 'http://data.iledefrance.fr/explore/dataset/liste-des-points-de-contact-du-reseau-postal-dile-de-france/download/?format=csv'\n >>> fp = urlopen(url)\n >>> special_snowflake.fromcsv(fp, delimiter = ';')\n {('adresse', 'code_postal'),\n ('adresse', 'localite'),\n ('identifiant',),\n ('libelle_du_site',),\n ('wgs84',)}\n\nwe find five functional unique keys. Just by looking at the column names,\nI'm gussing that the first two are combinations of parts of the postal address\nand that the latter three look are formal identifiers.\nAnd when I do things correctly and look at the\n`data dictionary `_,\nI come to the same interpretation.\n\nThis tells me that this dataset is about postal service locations,\nwith one location per row. It also gives me some ideas as to things that can\nact as unique identifiers for postal service locations.\n\nIt's kind of cool to run this on individual spreadsheets, but it's even cooler\nto run this on lots of spreadsheets.\nIn blizzard, I find spreadsheets with\nthe same unique indices, and then I look for overlap between those spreadsheets.\nSpreadsheets with high overlap might be good to join to each other, and\nspreadsheets with low overlap might be good to union with each other.\n\nAll of this is quite crude at the moment, so I'm somewhat surprised that\nanything interesting comes out.\n", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/tlevine/blizzard.git", "keywords": null, "license": "AGPL", "maintainer": null, "maintainer_email": null, "name": "blizzard", "package_url": "https://pypi.org/project/blizzard/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/blizzard/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/tlevine/blizzard.git" }, "release_url": "https://pypi.org/project/blizzard/0.0.3/", "requires_dist": null, "requires_python": null, "summary": "Find the unique indices for a ton of datasets", "version": "0.0.3" }, "last_serial": 1089718, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "40ee6bbb6347692b557a313532653e35", "sha256": "ea421e02d06f27fc1dec556440344a11859ea1ffa541560390810ae1c0751157" }, "downloads": -1, "filename": "blizzard-0.0.1.tar.gz", "has_sig": false, "md5_digest": "40ee6bbb6347692b557a313532653e35", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3309, "upload_time": "2014-04-23T15:16:49", "url": "https://files.pythonhosted.org/packages/2e/a5/6fa9d20f288475bc48b4e323f39c412a1c82bdfe79455e14a4f60bbb8137/blizzard-0.0.1.tar.gz" } ], "0.0.2": [ { "comment_text": "", "digests": { "md5": "40cd60d272f4b8a83b9be60cb06261f7", "sha256": "fc38a905c6d390215a5453233553e9d941a2a111e5422c54973c280da2e4c60d" }, "downloads": -1, "filename": "blizzard-0.0.2.tar.gz", "has_sig": false, "md5_digest": "40cd60d272f4b8a83b9be60cb06261f7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 3351, "upload_time": "2014-05-02T13:09:31", "url": "https://files.pythonhosted.org/packages/be/99/453db2f259c7559ff6d76a98fc63b34a5dadb20429befe99745c9961ee4d/blizzard-0.0.2.tar.gz" } ], "0.0.3": [ { "comment_text": "", "digests": { "md5": "f8deb2994beeb4f53c3f4ed310614050", "sha256": "83fe7738d5477e4b964574735e15109675a95a462e4a71e8c1d3211fad7a7eae" }, "downloads": -1, "filename": "blizzard-0.0.3.tar.gz", "has_sig": false, "md5_digest": "f8deb2994beeb4f53c3f4ed310614050", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5108, "upload_time": "2014-05-12T16:33:40", "url": "https://files.pythonhosted.org/packages/78/2c/95ea864b01ef57dbe24b936c75eff0129ced2c8bc9737a07e3564f008895/blizzard-0.0.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f8deb2994beeb4f53c3f4ed310614050", "sha256": "83fe7738d5477e4b964574735e15109675a95a462e4a71e8c1d3211fad7a7eae" }, "downloads": -1, "filename": "blizzard-0.0.3.tar.gz", "has_sig": false, "md5_digest": "f8deb2994beeb4f53c3f4ed310614050", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5108, "upload_time": "2014-05-12T16:33:40", "url": "https://files.pythonhosted.org/packages/78/2c/95ea864b01ef57dbe24b936c75eff0129ced2c8bc9737a07e3564f008895/blizzard-0.0.3.tar.gz" } ] }