{ "info": { "author": "Microsoft Corporation", "author_email": "kustalk@microsoft.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Environment :: Console", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3.7", "Topic :: Software Development" ], "description": "# Kusto Ingestion Tools (Kit) \n\n\n[*azure-kusto-ingestion-tools*](\"https://github.com/Azure/azure-kusto-ingestion-tools/tree/master/kit\") a simple toolkit to help with ingestions, avialble here\n
\n[![PyPI version](https://badge.fury.io/py/azure-kusto-ingestion-tools.svg)](https://badge.fury.io/py/azure-kusto-ingestion-tools)\n[![Downloads](https://pepy.tech/badge/azure-kusto-ingestion-tools)](https://pepy.tech/project/azure-kusto-ingestion-tools)
\n\n## Purpose \nMake ingestion simpler (*at least for common cases*).\nAfter creating an ADX (Kusto) cluster via Azure portal, we want to explore / visualize some data. \nWhen evaluating data stores / tools we usually want to just POC capabilities and move fast. \n\nThat is what this project was created for. \nIt contains features to support:\n\n1. Data source **schema** inference (csv / kusto/ ...)\n2. Common **ingestion** scenarios (from file /entire folder / ...)\n3. Other helpful utilities (kql generator, ...)\n\n## TOC\n* [Concept](#concept) \n* [Install](#install)\n* [Usage](#usage)\n* [Examples](#examples)\n\n## Concept\nGiven a data source, usually the workflow would consist of: \n \n1. Describing the data source. \n2. Preparing the target data store (in our case, Kusto) \n3. Mapping Source to Target \n4. Loading the data \n5. *Optional* : Automation / Moving to Production \n \n## Install\n\n### Minimum Requirements\n* Python 3.7+\n* See setup.py for dependencies\n\n### Pip\n\nTo install via the Python Package Index (PyPI), type:\n\n`pip install azure-kusto-ingestion-tools`\n\nThis will expose a new cli commands : `kit --help`\n\n\n## Usage \n\n### Basic \n \n`kit ingest -d /path/to/data/imdb -h mycluster.westus` \n \nThe following command will try to ingest all files in `/path/to/data/imdb` (non-recursive) using type inference. \n \n \n**NOTICE**: without providing any other arguments, this command is extremely *opinionated*, and will assume the following: \n \n### Options \n \n#### Auth \nEvery command that needs to authenticate against kusto, will require authentication arguemnts.\n\nBy default, will try to grab token from [azure cli](https://docs.microsoft.com/en-us/cli/azure/?view=azure-cli-latest)\n\nOther options are:\n\nApp:\n\n`kit [command] -app {app_id}:{app_secret}`\n\nUser:\n\n`kit [command] -user {user_id}:{password}` \n \n#### Naming \n* **database** will be set to is the dir in which data sits, so `/path/to/data/imdb` will look for, and create if missing, a database named `imdb`. \nIf more control is need, try `--database` \n* **tables** are actual file names, so `/path/to/data/imdb/aka_name.csv` will look for, and create if missing, a table named `aka_name`. \nThis can be tweaked by making sure data is split into folder, where any folder would be a table. \nThis recursive mode assumes that the table structure is the same for all files. \n\n### Files \n \n#### Database Schema file \nThis is a simple way to describe a database. \n \nThis can be used to describe a db schema using plain JSON format, and thus easily copy entire database schemas. \n \n```json \n{\n \"name\": \"imdb\",\n \"tables\": [{\n \"name\": \"aka_name\",\n \"columns\": [{\n \"dtype\": \"int\",\n \"name\": \"id\",\n \"index\": 0\n }, {\n \"dtype\": \"int\",\n \"name\": \"person_id\",\n \"index\": 1\n }, {\n \"dtype\": \"string\",\n \"name\": \"name\",\n \"index\": 2\n }, {\n \"dtype\": \"string\",\n \"name\": \"imdb_index\",\n \"index\": 3\n }, {\n \"dtype\": \"string\",\n \"name\": \"name_pcode_cf\",\n \"index\": 4\n }, {\n \"dtype\": \"string\",\n \"name\": \"name_pcode_nf\",\n \"index\": 5\n }, {\n \"dtype\": \"string\",\n \"name\": \"surname_pcode\",\n \"index\": 6\n }, {\n \"dtype\": \"string\",\n \"name\": \"md5sum\",\n \"index\": 7\n }]\n },\n ...\n ] \n}\n``` \n\n**From an existing cluster** \n \n`kit schema create -h 'https://mycluster.kusto.windows.net' -db imdb > imdb_schema.json` \n \n**From an sql file** \n \n`kit schema create -sql imdb.sql > schema.json` \n \n**From a folder with raw data** \n \n`kit schema create -d path/to/dir > schema.json` \n \n**More to come...** \n \n#### Manifest file \nA file to describe the details of an ingestion which can be run later \n \n```json \n{ \n \"databases\": [ \"same as schema.json\" ], \n \"mappings\": [{ \n \"name\": \"aka_name_from_csv\", \n \"columns\": [{ \n \"source\": { \n \"index\": 0, \n \"data_type\": \"str\" \n }, \n \"target\": { \n \"index\": 0, \n \"data_type\": \"str\" \n }\n }] \n }], \n \"operations\": [{ \n \"database\": \"imdb\", \n \"sources\": [{ \n \"files\": [\"1.csv\", \"...\", \"99.csv\"], \n \"mapping\": \"aka_name_from_csv\" \n }], \n \"target\": [ \"aka_name\" ] \n }]\n } \n``` \n \n## Examples\n \n### Example 1 : Ingest IMDB Dataset , CSV files (used in Join Order Benchmark) \n \nOne useful scenario would be to load an entire existing dataset into Kusto. \nLet's take for example the [Join Order Benchmark](https://github.com/gregrahn/join-order-benchmark) used in the paper [How good are query optimizers really?](http://www.vldb.org/pvldb/vol9/p204-leis.pdf). \n \n#### 1. Copy files to local dir: \n \n[Download](https://imdb2013dataset.blob.core.windows.net/data/imdb.tgz) from Azure Storage \n`wget https://imdb2013dataset.blob.core.windows.net/data/imdb.tgz --no-check-certificate` \n or \n`curl https://imdb2013dataset.blob.core.windows.net/data/imdb.tgz --output imdb.tgz` \n \n \nOriginal Files [are available](https://homepages.cwi.nl/~boncz/job/imdb.tgz), but are malformed (don't conform to https://tools.ietf.org/html/rfc4180). \nOne can fix them using tools like [xsv](https://github.com/BurntSushi/xsv/releases/tag/0.13.0), \nbut this is we shall leave error handling for another section. \n \n#### 2. Extract files: \n \n`tar -xvzf imdb.tgz` \n \n \n#### 3. Download SQL Create commands: \n \n`wget https://raw.githubusercontent.com/gregrahn/join-order-benchmark/master/schema.sql -O imdb.sql --no-check-certificate` \n \nor \n \n`curl https://raw.githubusercontent.com/gregrahn/join-order-benchmark/master/schema.sql --output imdb.sql` \n \n#### 4. Create schema from sql statement \n \n`kit schema create -sql schema.sql > imdb_schema.json` \n \n#### 5. Apply schema on cluster \nAssuming we already have a cluster, and we are signed in using az cli, we can just apply the schema on a database we choose: \n \n`kit schema apply -f imdb_schema.json -h mycluster.westus -db imdb` \n \n#### 6. Ingest data from local files \n \n`kit ingest -d . --pattern \"*.csv\" -h mycluster.westus -db imdb` \n \n#### 7. Query \n \nUsing the Azure portal, you can now easily login and query your data. \n \nYou can always make sure that data was loaded by comparing the source line count with target column count:\n\n`xsv count aka_name.csv` - should show 901343 rows\n\nor\n\n`wc -l aka_name.csv` - should show 901343 rows\n\nQuery from kusto should show the same:\n\n`kit count --table aka_name -h mycluster.westus -db imdb` - should show 901343\n\nAnd take a peek at the data:\n`kit peek --table aka_name -n 10 -h mycluster.westus -db imdb`\n\n \n### Example 2 : Ingest Kaggle ML Datasets, CSV and JSON\n\nKaggale has tons of interesting dataset for ML/AI purposes.\n\nLet's try and ingest some:\n\nhttps://www.kaggle.com/mlg-ulb/creditcardfraud/\nhttps://www.kaggle.com/START-UMD/gtd/\n\nUploaded to our azure storage for convenience:\n\n```\nwget https://imdb2013dataset.blob.core.windows.net/data/creditcard.csv.gz --no-check-certificate \nwget https://imdb2013dataset.blob.core.windows.net/data/globalterrorism.csv.gz --no-check-certificate\nwget https://imdb2013dataset.blob.core.windows.net/data/arxivData.csv.gz --no-check-certificate\n```\n or \n```\ncurl https://imdb2013dataset.blob.core.windows.net/data/creditcard.csv.gz --output creditcard.csv.gz\ncurl https://imdb2013dataset.blob.core.windows.net/data/globalterrorism.csv.gz --output globalterrorism.csv.gz \ncurl https://imdb2013dataset.blob.core.windows.net/data/arxivData.json.gz --output arxivData.json.gz\n```\n \n Once downloaded and unzipped, same idea, only this time files contain headers, so schema is infered:\n\n`kit ingest -d . -h mycluster.westus -db ml --headers`\n\n### Example 3 : Complex nested JSON mappings\n\nLet's look at a more advance use case:\n\n`wget https://imdb2013dataset.blob.core.windows.net/data/demo.json --no-check-certificate`\n \n or\n\n`curl https://imdb2013dataset.blob.core.windows.net/data/demo.json --output demo.json`\n\nSay our data is a json lines files, where each item looks like:\n\n`{\"header\":{\"time\":\"24-Aug-18 09:42:15\", \"id\":\"0944f542-a637-411b-94dd-8874992d6ebc\", \"api_version\":\"v2\"}, \"payload\":{\"data\":\"NEEUGQSPIPKDPQPIVFE\", \"user\":\"owild@fabrikam.com\"}}`\n\nIt seems that we have a nested object. \nBecause we are not sure what will happen, let's dry run.\nLet's try and `--dry` run an ingestion with `--object-depth 2`.\n\n`kit ingest -f demo.json --object-depth 2 -h mycluster.westus -db ml --dry > manifest.json`\n\nThis produces the following `manifest.json` which contains the operations to be executed.\n\n```json\n{\n \"databases\": [\n {\n \"name\": \"ml\",\n \"tables\": []\n }\n ],\n \"mappings\": [\n {\n \"name\": \"demo_from_json\",\n \"columns\": [\n {\n \"source\": {\n \"dtype\": \"string\",\n \"name\": \"header.time\",\n \"index\": null\n },\n \"target\": {\n \"dtype\": \"string\",\n \"name\": \"header.time\",\n \"index\": null\n }\n },\n {\n \"source\": {\n \"dtype\": \"string\",\n \"name\": \"header.id\",\n \"index\": null\n },\n \"target\": {\n \"dtype\": \"string\",\n \"name\": \"header.id\",\n \"index\": null\n }\n },\n {\n \"source\": {\n \"dtype\": \"string\",\n \"name\": \"header.api_version\",\n \"index\": null\n },\n \"target\": {\n \"dtype\": \"string\",\n \"name\": \"header.api_version\",\n \"index\": null\n }\n },\n {\n \"source\": {\n \"dtype\": \"string\",\n \"name\": \"payload.data\",\n \"index\": null\n },\n \"target\": {\n \"dtype\": \"string\",\n \"name\": \"payload.data\",\n \"index\": null\n }\n },\n {\n \"source\": {\n \"dtype\": \"string\",\n \"name\": \"payload.user\",\n \"index\": null\n },\n \"target\": {\n \"dtype\": \"string\",\n \"name\": \"payload.user\",\n \"index\": null\n }\n }\n ]\n }\n ],\n \"operations\": [\n {\n \"database\": \"ml\",\n \"sources\": [\n {\n \"files\": [\n \"C:\\\\Users\\\\dadubovs\\\\temp\\\\ml_datasets\\\\demo.json\"\n ],\n \"mapping\": \"demo_from_json\",\n \"options\": {},\n \"data_format\": \"json\"\n }\n ],\n \"target\": \"demo\"\n }\n ]\n}\n```\n\nNow, let's say that we don't need the `id` field, we can edit the mapping and save it.\n\n\nIf we are still unsure, and want to get a better understanding are the commands that will be created, we can inspect the kql\n\n`kit kql -m manifest.json`\n\nWhich should output something like:\n\n```\n// Table Creation Commands:\n.create table demo (['header.time']:string,['header.api_version']:string,['payload.data']:string,['payload.user']:string)\n\n// Ingestion Mapping Creation Commands:\n.create table demo ingestion json mapping \"demo_from_json\" '[{\"column\":\"header.time\",\"path\":\"$.header.time\",\"datatype\":\"string\"},{\"column\":\"header.api_version\",\"path\":\"$.header.api_version\",\"datatype\":\"string\"},{\"column\":\"payload.data\",\"path\":\"$.payload.data\",\"datatype\":\"string\"},{\"column\":\"payload.user\",\"path\":\"$.payload.user\",\"datatype\":\"string\"}]'\n```\n\nOnce we are ready, we can resume our ingestion based on the manifest\n\n`kit ingest -m manifest.json -h mycluster.westus`", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Azure/azure-kusto-ingestion-tools", "keywords": "kusto wrapper client library", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "azure-kusto-ingestion-tools", "package_url": "https://pypi.org/project/azure-kusto-ingestion-tools/", "platform": "", "project_url": "https://pypi.org/project/azure-kusto-ingestion-tools/", "project_urls": { "Homepage": "https://github.com/Azure/azure-kusto-ingestion-tools" }, "release_url": "https://pypi.org/project/azure-kusto-ingestion-tools/0.3.1/", "requires_dist": null, "requires_python": "", "summary": "Kusto Ingestion Tools (Kit)", "version": "0.3.1" }, "last_serial": 5420278, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "6c0c9497d75919527bdfc37fe3719b16", "sha256": "a45110210079b2c1277ee24740b7af9ebd97108ee6550e1874fab387f354a4f2" }, "downloads": -1, "filename": "azure-kusto-ingestion-tools-0.1.0.tar.gz", "has_sig": false, "md5_digest": "6c0c9497d75919527bdfc37fe3719b16", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 18363, "upload_time": "2019-06-03T12:54:08", "url": "https://files.pythonhosted.org/packages/1d/1a/fbe26fcd092efcddb9273c7ecc6359d107f57838fc45f10dd71a61d9980a/azure-kusto-ingestion-tools-0.1.0.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "37eaf4b1f4d488eb555c639ef4caad33", "sha256": "a56781e41cd5bdd41f2b5e860a5e7cda5d7f0085f75e9c25ef7eb7f1729d1d4e" }, "downloads": -1, "filename": "azure-kusto-ingestion-tools-0.2.0.tar.gz", "has_sig": false, "md5_digest": "37eaf4b1f4d488eb555c639ef4caad33", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23127, "upload_time": "2019-06-13T13:58:06", "url": "https://files.pythonhosted.org/packages/40/b8/34ac3c2bd6136bd5836d13b91e310c0b3aafe65283ea69c62d8443b1f0db/azure-kusto-ingestion-tools-0.2.0.tar.gz" } ], "0.2.2": [ { "comment_text": "", "digests": { "md5": "9d0cac78a1ee4b220a31f836dcb742c9", "sha256": "f56b31a120e9f495285e9e45170fd2bc62d695c31b46a19a6c3c223d06853c8b" }, "downloads": -1, "filename": "azure-kusto-ingestion-tools-0.2.2.tar.gz", "has_sig": false, "md5_digest": "9d0cac78a1ee4b220a31f836dcb742c9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24440, "upload_time": "2019-06-13T14:16:39", "url": "https://files.pythonhosted.org/packages/d8/7e/a54600a625dfa6608110eebdf943db7c4dba33f1d57dd5e2cab74b42bcd0/azure-kusto-ingestion-tools-0.2.2.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "ac684b19eea91ef7a5acc8353b67febe", "sha256": "fb7d4c615884ebdaa66a558a483773171271b2d90dd75d93fc9a8db187f7bcdd" }, "downloads": -1, "filename": "azure-kusto-ingestion-tools-0.3.0.tar.gz", "has_sig": false, "md5_digest": "ac684b19eea91ef7a5acc8353b67febe", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24746, "upload_time": "2019-06-18T09:02:15", "url": "https://files.pythonhosted.org/packages/56/d5/0acda91fa7a70e9c98b1e4b6e9bddd2fb0c2468394e76467711a2048d753/azure-kusto-ingestion-tools-0.3.0.tar.gz" } ], "0.3.1": [ { "comment_text": "", "digests": { "md5": "fa99f27048706232cd482c951900e009", "sha256": "fee1bb71cf9f9a9421ea16fd216b78c37889b644100166d7bfa1b20aa982afc8" }, "downloads": -1, "filename": "azure-kusto-ingestion-tools-0.3.1.tar.gz", "has_sig": false, "md5_digest": "fa99f27048706232cd482c951900e009", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24861, "upload_time": "2019-06-19T13:33:35", "url": "https://files.pythonhosted.org/packages/b0/72/f2f6844c77de934ba06cbabe6df4022f2653f224f5540238f1c7674a4ce4/azure-kusto-ingestion-tools-0.3.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "fa99f27048706232cd482c951900e009", "sha256": "fee1bb71cf9f9a9421ea16fd216b78c37889b644100166d7bfa1b20aa982afc8" }, "downloads": -1, "filename": "azure-kusto-ingestion-tools-0.3.1.tar.gz", "has_sig": false, "md5_digest": "fa99f27048706232cd482c951900e009", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 24861, "upload_time": "2019-06-19T13:33:35", "url": "https://files.pythonhosted.org/packages/b0/72/f2f6844c77de934ba06cbabe6df4022f2653f224f5540238f1c7674a4ce4/azure-kusto-ingestion-tools-0.3.1.tar.gz" } ] }