{ "info": { "author": "Karik Isichei", "author_email": "karik.isichei@digital.justice.gov.uk", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "# pydbtools\n\nThis is a simple package that lets you query databases using Amazon Athena and get the s3 path to the Athena output (as a csv). This is significantly faster than using the database drivers, so it might be a good option when pulling in large amounts of data. By default, data is converted into a pandas dataframe with column data types equivalent to those of the Athena table - see the \"Meta data\" section below.\n\nNote: to use this package you need to be added to the StandardDatabaseAccess IAM Policy on the Analytical Platform. Please contact the team if you require access.\n\nTo install...\n\n```\npip install pydbtools\n```\n\nOr from GitHub...\n\n```\npip install git+git://github.com/moj-analytical-services/pydbtools.git#egg=pydbtools\n```\n\nPackage requirements are:\n\n* `pandas` _(preinstalled)_\n* `boto3` _(preinstalled)_\n* `numpy` _(preinstalled)_\n* `s3fs`\n* `gluejobutils`\n\n## Usage\n\nThe simplest way to use pydbtools is `read_sql`. 
This will return a pandas df representation of the data (with matching meta data).\n\n```python\nimport pydbtools as pydb\n\n# Run SQL query and return as a pandas df\ndf = pydb.read_sql(\"SELECT * from database.table limit 10000\")\ndf.head()\n```\n\nYou might want to cast the data yourself or read all the columns as strings.\n\n```python\nimport pydbtools as pydb\n\n# Run SQL query and return as a pandas df\ndf = pydb.read_sql(\"SELECT * from database.table limit 10000\", cols_as_str=True)\ndf.head()\n\ndf.dtypes # all objects\n```\n\nYou can also pass additional arguments to the `pandas.read_csv` call that reads the output of the Athena SQL query.\nNote that you cannot pass `dtype`, as this is specified within the `read_sql` function.\n\n```python\nimport pydbtools as pydb\n\n# pass nrows parameter to pandas.read_csv function\npydb.read_sql(\"SELECT * from database.table limit 10000\", nrows=20)\n```\n\nIf you don't want to read the data into pandas, you can run the SQL query and get the s3 path and meta data\nof the output using `get_athena_query_response`. In the example below the data is then read in using `boto3` and `csv`.\n\n\n```python\nimport pydbtools as pydb\nimport csv\nimport boto3\n\nresponse = pydb.get_athena_query_response(\"SELECT * from database.table limit 10000\")\n\n# print out path to athena query output (as a csv)\nprint(response['s3_path'])\n\n# print out meta data\nprint(response['meta'])\n\n# Read the csv into a string in memory\ns3_resource = boto3.resource('s3')\nbucket, key = response['s3_path'].replace(\"s3://\", \"\").split('/', 1)\nobj = s3_resource.Object(bucket, key)\ntext = obj.get()['Body'].read().decode('utf-8')\n\n# Use csv reader to print the output csv\nreader = csv.reader(text.split('\\n'), delimiter=',')\nfor row in reader:\n    print('\\t'.join(row))\n```\n\n## Meta data\n\nThe output from `get_athena_query_response(...)` is a dictionary; one of its keys is `meta`. 
The meta key is a list in which each element gives the name (`name`) and data type (`type`) of one column in your Athena query output. For example, this table output:\n\n|col1|col2|\n|---|---|\n|1|2018-01-01|\n|2|2018-01-02|\n...\n\nwould have a meta like:\n\n```python\nfor m in response['meta']:\n print(m['name'], m['type'])\n```\n\noutput:\n\n```\n> col1 int\n> col2 date\n```\n\nThe meta types follow those listed as the generic meta data types used in [etl_manager](https://github.com/moj-analytical-services/etl_manager). If you want the actual Athena data types instead of the generic meta data types, set the `return_athena_types` input parameter to `True`, e.g.\n\n```python\nresponse = pydb.get_athena_query_response(\"SELECT * from database.table limit 10000\", return_athena_types=True)\n\nprint(response['meta'])\n```\n\nIf you wish to read your SQL query directly into a pandas dataframe you can use the `read_sql` function. Any `*args` or `**kwargs` given to this function are passed down to `pd.read_csv()`.\n\n```python\nimport pydbtools as pydb\n\ndf = pydb.read_sql(\"SELECT * FROM database.table limit 1000\")\ndf.head()\n```\n\n### Meta data conversion\n\nThe table below shows how our data types are converted to pandas column types (using the `read_sql` function):\n\n| data type | pandas column type| Comment |\n|-----------|-------------------|-----------------------------------------------------------------------------------------|\n| character | object | [see here](https://stackoverflow.com/questions/34881079/pandas-distinction-between-str-and-object-types)|\n| int | np.float64 | Pandas integers do not allow nulls so using floats |\n| long | np.float64 | Pandas integers do not allow nulls so using floats |\n| date | pandas timestamp | |\n| datetime | pandas timestamp | |\n| boolean | np.bool | |\n| float | np.float64 | |\n| double | np.float64 | |\n\n#### Notes:\n\n- Amazon Athena uses a flavour 
of SQL called Presto. Docs can be found [here](https://prestodb.io/docs/current/)\n- To query a date column in Athena you need to specify that your value is a date e.g. `SELECT * FROM db.table WHERE date_col > date '2018-12-31'`\n- To query a datetime or timestamp column in Athena you need to specify that your value is a timestamp e.g. `SELECT * FROM db.table WHERE datetime_col > timestamp '2018-12-31 23:59:59'`\n- Note the date and datetime formatting used above. See more specifics on dates and datetimes [here](https://prestodb.io/docs/current/functions/datetime.html)\n- To specify a string in the SQL query always use '' not \"\". Using \"\" means that you are referencing a database, table, column, etc.\n- When data is pulled back into RStudio the column types are either R characters (for any column that was a date, datetime or character) or doubles (for everything else).\n\nSee changelog for release changes\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "", "keywords": "", "license": "MIT", "maintainer": "Karik Isichei", "maintainer_email": "karik.isichei@digital.justice.gov.uk", "name": "pydbtools", "package_url": "https://pypi.org/project/pydbtools/", "platform": "", "project_url": "https://pypi.org/project/pydbtools/", "project_urls": null, "release_url": "https://pypi.org/project/pydbtools/1.0.3/", "requires_dist": [ "boto3 (>=1.7.4)", "gluejobutils (>=v1.0.0)", "numpy (>=1.16.1)", "pandas (>=0.23.4)", "s3fs (>=0.2.2)" ], "requires_python": ">=3.5,<4.0", "summary": "A python package to query data via amazon athena and bring it into a pandas df", "version": "1.0.3" }, "last_serial": 5862391, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "1351423f56c247f83763211d00f86ee1", "sha256": "4618c8edcd7b2050ba5c61861f6a5bcfaaab9dd963f453a5623f4a4f053541e7" }, "downloads": -1, "filename": "pydbtools-1.0.0-py3-none-any.whl", 
"has_sig": false, "md5_digest": "1351423f56c247f83763211d00f86ee1", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.6,<4.0", "size": 7077, "upload_time": "2019-06-13T13:53:04", "url": "https://files.pythonhosted.org/packages/53/bd/eab8a95753b27a78e369adbd27fd75ebde7dd6a8f126a319256ce22510cf/pydbtools-1.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "cf7dcf97d6b99c0038919b535e590b41", "sha256": "532ad0a7dba8a867cfa1c8f4a1a0ad1a83a466d1ccef2f6967f2953dc4fed94e" }, "downloads": -1, "filename": "pydbtools-1.0.0.tar.gz", "has_sig": false, "md5_digest": "cf7dcf97d6b99c0038919b535e590b41", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6,<4.0", "size": 6872, "upload_time": "2019-06-13T13:53:07", "url": "https://files.pythonhosted.org/packages/63/95/9280bf5e1593eeefe094c2efebbc4bf78db7babf8fdfcae0c36d441a32ce/pydbtools-1.0.0.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "81c69036863b2529fb69ea0646df5e1d", "sha256": "d64d260f79e2424bc2320b7687a71dd6614ba4373b675850e70c279d1a4909ee" }, "downloads": -1, "filename": "pydbtools-1.0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "81c69036863b2529fb69ea0646df5e1d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5,<4.0", "size": 7091, "upload_time": "2019-06-18T12:27:17", "url": "https://files.pythonhosted.org/packages/52/bc/0b614615f2c075b6bb26fcd621f40e2c8c5a9a6f5490e28588e85d9fa5e8/pydbtools-1.0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "a5e69f715b176fbd991f8feaa890a3c2", "sha256": "9bec9afbdb538c9a4ac2c858b83278bf544bbe0fd4a5d23fc8d3b039374a2bbb" }, "downloads": -1, "filename": "pydbtools-1.0.1.tar.gz", "has_sig": false, "md5_digest": "a5e69f715b176fbd991f8feaa890a3c2", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5,<4.0", "size": 6879, "upload_time": "2019-06-18T12:27:18", "url": 
"https://files.pythonhosted.org/packages/c3/e1/32a1fe52808b4cc248febd18ab1769b8a98973180847da22f33f65677ee7/pydbtools-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "f9c2db7b7d10edb1a9d9f760594df390", "sha256": "2770b9d13da4a64ef5c2a59f2e420a9be2877f4453b0c642c00af28cc87f532c" }, "downloads": -1, "filename": "pydbtools-1.0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "f9c2db7b7d10edb1a9d9f760594df390", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5,<4.0", "size": 7210, "upload_time": "2019-09-17T12:31:07", "url": "https://files.pythonhosted.org/packages/5b/43/31e4cb5b8dacda6056f82b0c206b77b3ae428ba7db7a14455e9c6e037c93/pydbtools-1.0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "290320d51d1cc587632f317b646eb2a0", "sha256": "60e06e5cae3832de99cc87a291d7f3bbc4f9574c7d0c292f69dd51b2bdeeb618" }, "downloads": -1, "filename": "pydbtools-1.0.2.tar.gz", "has_sig": false, "md5_digest": "290320d51d1cc587632f317b646eb2a0", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5,<4.0", "size": 6992, "upload_time": "2019-09-17T12:31:08", "url": "https://files.pythonhosted.org/packages/25/9b/228ab365011d03dcd8d526831f614dd976117f107ee0fe74617b6dd078ed/pydbtools-1.0.2.tar.gz" } ], "1.0.3": [ { "comment_text": "", "digests": { "md5": "01218a6ba6fc14da6aede47e442e21f0", "sha256": "2a6c538d2eebdab37b8fa0fd64cd696325670ef0739f80134b29c60b64863178" }, "downloads": -1, "filename": "pydbtools-1.0.3-py3-none-any.whl", "has_sig": false, "md5_digest": "01218a6ba6fc14da6aede47e442e21f0", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5,<4.0", "size": 7707, "upload_time": "2019-09-20T13:44:42", "url": "https://files.pythonhosted.org/packages/06/0e/61adeac3e61869add3cfb91563a9b08c386978939c14f74e121a5a2756ad/pydbtools-1.0.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1d2603249d2b5f2272cf9f5559e50840", "sha256": 
"ec3169d06d82346354c338b157625dfb3568bc1f3acaa2ba6b3ddef26722920e" }, "downloads": -1, "filename": "pydbtools-1.0.3.tar.gz", "has_sig": false, "md5_digest": "1d2603249d2b5f2272cf9f5559e50840", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5,<4.0", "size": 7384, "upload_time": "2019-09-20T13:44:44", "url": "https://files.pythonhosted.org/packages/ac/c0/92434a59340e400abd6f061cbd522af9b2d21d16d985d5d0eba3e453f68c/pydbtools-1.0.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "01218a6ba6fc14da6aede47e442e21f0", "sha256": "2a6c538d2eebdab37b8fa0fd64cd696325670ef0739f80134b29c60b64863178" }, "downloads": -1, "filename": "pydbtools-1.0.3-py3-none-any.whl", "has_sig": false, "md5_digest": "01218a6ba6fc14da6aede47e442e21f0", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5,<4.0", "size": 7707, "upload_time": "2019-09-20T13:44:42", "url": "https://files.pythonhosted.org/packages/06/0e/61adeac3e61869add3cfb91563a9b08c386978939c14f74e121a5a2756ad/pydbtools-1.0.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1d2603249d2b5f2272cf9f5559e50840", "sha256": "ec3169d06d82346354c338b157625dfb3568bc1f3acaa2ba6b3ddef26722920e" }, "downloads": -1, "filename": "pydbtools-1.0.3.tar.gz", "has_sig": false, "md5_digest": "1d2603249d2b5f2272cf9f5559e50840", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5,<4.0", "size": 7384, "upload_time": "2019-09-20T13:44:44", "url": "https://files.pythonhosted.org/packages/ac/c0/92434a59340e400abd6f061cbd522af9b2d21d16d985d5d0eba3e453f68c/pydbtools-1.0.3.tar.gz" } ] }