{ "info": { "author": "R Hodges", "author_email": "info@altinity.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "Intended Audience :: System Administrators", "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "# Altinity Datasets for ClickHouse\n\nWelcome! `altinity-datasets` loads test datasets for ClickHouse. It is \ninspired by Python libraries that [automatically load standard datasets](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris) \nfor quick testing. \n\n## Getting Started\n\nAltinity-datasets requires Python 3.5 or greater. The `clickhouse-client` \nexecutable must be in the path to load data. \n\nBefore starting you must install the altinity-datasets package using\npip3. Following example shows install into a Python virtual environment. \nFirst command is only required if you don't have clickhouse-client already\ninstalled on the host. \n\n```\nsudo apt install clickhouse-client\nsudo pip3 install altinity-datasets\n```\n\nMany users will prefer to install within a Python3 virtual environment, \nfor example:\n\n```\npython3 -m venv my-env\n. my-env/bin/activate\npip3 install altinity-datasets\n```\n\nYou can also install a current version directly from Github:\n```\npip3 install git+https://github.com/altinity/altinity-datasets.git\n```\nTo remove altinity-datasets run the following command:\n```\npip3 uninstall altinity-datasets\n```\n\n## Using datasets\n\nThe `ad-cli` command manages datasets. You can see available commands by\ntyping `ad-cli -h/--help`. All subcommands also accept -h/--help options.\n\n### Listing repos\n\nLet's start by listing repos, which are locations that contain datasets. \n\n```\nad-cli repo list\n```\n\nThis will return a list of repos that have datasets. For the time being there\nis just a built-in repo that is part of the altinity-datasets package. \n\n### Finding datasets\n\nNext let's see the available datasets. \n```\nad-cli dataset search\n```\nThis gives you a list of datasets with detailed descriptions. You can \nrestrict the search to a single dataset by typing the name, for example\n`ad-cli search wine`. You can also search other repos using the repo \nfile system location, e.g., `ad-cli search wine --repo-path=$HOME/myrepo`.\n\n### Loading datasets\n\nNow, let's load a dataset. Here's a command to load the iris dataset\nto a ClickHouse server running on localhost.\n\n```\nad-cli dataset load iris\n```\n\nHere is a more complex example. It loads the iris dataset to the `iris_new`\ndatabase on a remote server. Also, we parallize the upload with 10 threads. \n```\nad-cli load iris --database=iris_new --host=my.remote.host.com --parallel=10\n```\n\nThe command shown above is typical of the invocation when loading on a \nserver that has a large number of cores and fast storage. \n\nNote that it's common to reload datasets expecially during development.\nYou can do this using `ad-cli load --clean`. IMPORTANT: This drops the\ndatabase to get rid of dataset tables. If you have other tables in the\nsame database they will be dropped as well.\n\n### Dumping datasets\n\nYou can make a dataset from any existing table or tables in ClickHouse \nthat reside in a single database. Here's a simple example that shows \nhow to dump the weather dataset to create a new dataset. (The weather\ndataset is a built-in that loads by default to the weather database.)\n\n```\nad-cli dataset dump weather\n```\n\nThere are additional options to control dataset dumps. For example,\nwe can rename the dateset, restrict the dump to tables that start with\n'central', compress data, and overwrite any existing data in the output\ndirectory.\n\n```\nad-cli dataset dump new_weather -d weather --tables='^central' --compress \\\n --overwrite\n```\n\n### Extra Connection Options\n\nThe dataset load and dump commands by default connect to ClickHouse\nrunning on localhost with default user and empty password. The following\nexample options connect using encrypted communications to a specific\nserver with explicit user name and password. The last option suppresses\ncertificate verification. \n\n```\nad-cli dataset load iris -H 127.0.0.1 -P 9440 \\\n-u special -p secret --secure --no-verify \n```\n\nNote: To use --no-verify you must also ensure that clickhouse-client is\nconfigured to accept invalid certificates. Validate by logging in using\nclickhouse-client with the --secure option. Check and correct settings\nin /etc/clickhouse-client/config.xml if you have problems.\n\n## Repo and Dataset Format\n\nRepos are directories on the file system. The exact location of the repo is \nknown as the repo path. Data sets under the repo are child directories that\nin turn have subdirectories for DDL commands and data. The following listing \nshows part of the organization of the built-ins repo. \n\n```\nbuilt-ins/\n iris/\n data/\n iris/\n iris.csv\n ddl/\n iris.sql\n manifest.yaml\n wine/\n data/\n wine/\n wine.csv\n ddl/\n wine.sql\n manifest.yaml\n```\n\nTo create your own dataset you can dump existing tables using `ad-cli dataset \ndump` or copy the examples in built-ins. The format is is simple. \n\n* The manifest.yaml file describes the dataset. If you put in extra fields \n they will be ignored. \n* The DDL directory contains SQL scripts to run. By convention these should\n be named for the objects (i.e., tables) that they create. \n* The data directory contains CSV data. There is a separate subdirectory \n for each table to be loaded. Its name must match the table name exactly.\n* CSV files can be uncompressed .csv or gzipped .csv.gz. No other formats\n are supported and the file types must be correctly specified. \n\nYou can place new repos in any location you please. To load from your \nown repo run a load command and use the --repo-path option to point to the\nrepo location. Here's an example:\n\n```\nad-cli dataset load mydataset --repo-path=$HOME/my-repo\n```\n\n## Development\n\nTo work on altinity-datasets clone from Github and install. \n```\ngit clone https://github.com/altinity/altinity-datasets.git\ncd altinity-datasets\npython3 setup.py develop \n```\n\nAfter making changes you should run tests.\n```\ncd tests\npython3 -m unittest --verbose\n```\n\nThe following commands build an installable and push to pypi.org.\nPyPI account credentials must be set in TWINE_USERNAME and TWINE_PASSWORD.\n\n```\npython3 setup.py sdist\ntwine upload --repository-url https://upload.pypi.org/legacy/ dist/*\n```\n\nCode conventions are enforced using yapf and flake8. Run the\ndev-format-code.sh script to check formatting.\n\nRun tests as follows with virtual environment set. You will need a\nClickHouse server with a null password on the default user.\n\n```\ncd tests\npython3 -m unittest -v\n```\n\n## Errors\n\n### Out-of-date pip3 causes installation failure\n\nIf pip3 installs with the message `error: invalid command 'bdist_wheel'` you \nmay need to upgrade pip. Run `pip3 install --upgrade pip` to correct the\nproblem. \n\n### Materialized views cannot be dumped\n\nad-cli will fail with an error if you try to dump a database that has\nmaterialized views. The workaround is to omit them from the dump operation \nusing a table regex as shown in the following example: \n\n```\nad-cli dataset dump nyc_taxi_rides --repo-path=. --compress --parallel=6 \\\n--tables='^(tripdata|taxi_zones|central_park_weather_observations)$'\n```\n\n### --no-verify option fails on self-signed certs\n\nWhen using ad-cli --secure together with --no-verify options you need\nto also configure clickhouse-client to skip certificate verification.\nThis only applies when the certificate is self-signed. You must\nchange /etc/clickhouse-client/config.xml as follows to skip certificate\nvalidation:\n\n```\n\n \n \n ...\n \n AcceptCertificateHandler\n \n \n \n ...\n\n\n```\n\n## Limitations\n\nThe most important are:\n\n* Error handling is spotty. If clickhouse-client is not in the path \n things may fail mysteriously. \n* Datasets have to be on the local file system. In the future we will \n use cloud object storage such as S3.\n\nPlease file issues at https://github.com/Altinity/altinity-datasets/issues.\nPull requests to fix problems are welcome.", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/Altinity/altinity-datasets", "keywords": "", "license": "Apache 2.0", "maintainer": "", "maintainer_email": "", "name": "altinity-datasets", "package_url": "https://pypi.org/project/altinity-datasets/", "platform": "", "project_url": "https://pypi.org/project/altinity-datasets/", "project_urls": { "Homepage": "https://github.com/Altinity/altinity-datasets" }, "release_url": "https://pypi.org/project/altinity-datasets/0.1.2/", "requires_dist": null, "requires_python": "", "summary": "Altinity Datasets for ClickHouse", "version": "0.1.2" }, "last_serial": 5463738, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "5f9148de8c0564204b92d776eba998f9", "sha256": "8d58978b4cff0c3d3de591bf42622fa3448f6594a83c5912e7285ef3744da3a8" }, "downloads": -1, "filename": "altinity_datasets-0.1.1-py3-none-any.whl", "has_sig": false, "md5_digest": "5f9148de8c0564204b92d776eba998f9", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 34658, "upload_time": "2019-06-18T00:54:35", "url": "https://files.pythonhosted.org/packages/52/ce/64539aa9e7ca9f2e232383857749666a131faa0386c902647ccdc343ecfa/altinity_datasets-0.1.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5645754b9a52e583ef2febd17eab9337", "sha256": "7f978a125993f6e99e7b085e26f37f04a30a9edc3307cb246b4bc989b48799ca" }, "downloads": -1, "filename": "altinity_datasets-0.1.1.tar.gz", "has_sig": false, "md5_digest": "5645754b9a52e583ef2febd17eab9337", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 30251, "upload_time": "2019-06-18T00:54:38", "url": "https://files.pythonhosted.org/packages/c6/37/8c5340aefab6fce8bdea1f792b54dc334a5ffd7a6a5c077cfe36f40a4bf1/altinity_datasets-0.1.1.tar.gz" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "b3c0b03f8a9c4d6da9a1002dae6ec7ec", "sha256": "0706460e88bee58100bd2480c595157112abad509bd6d0ebdc146a1197339641" }, "downloads": -1, "filename": "altinity_datasets-0.1.2.tar.gz", "has_sig": false, "md5_digest": "b3c0b03f8a9c4d6da9a1002dae6ec7ec", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 33038, "upload_time": "2019-06-28T22:16:20", "url": "https://files.pythonhosted.org/packages/e5/96/a38fd5cf8617199ffb6c3044c669835dbeb24397cfda1cb63e5dcb3e65df/altinity_datasets-0.1.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "b3c0b03f8a9c4d6da9a1002dae6ec7ec", "sha256": "0706460e88bee58100bd2480c595157112abad509bd6d0ebdc146a1197339641" }, "downloads": -1, "filename": "altinity_datasets-0.1.2.tar.gz", "has_sig": false, "md5_digest": "b3c0b03f8a9c4d6da9a1002dae6ec7ec", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 33038, "upload_time": "2019-06-28T22:16:20", "url": "https://files.pythonhosted.org/packages/e5/96/a38fd5cf8617199ffb6c3044c669835dbeb24397cfda1cb63e5dcb3e65df/altinity_datasets-0.1.2.tar.gz" } ] }