{ "info": { "author": "Dima Gerasimov", "author_email": "karlicoss@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Programming Language :: Python", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Programming Language :: Python :: 3 :: Only", "Topic :: Database" ], "description": "\n\n\n\n\n\n[![CircleCI](https://circleci.com/gh/karlicoss/cachew.svg?style=svg)](https://circleci.com/gh/karlicoss/cachew)\n\n# Cachew: quick NamedTuple/dataclass cache\nTLDR: cachew can persistently cache any sequence (an [Iterator](https://docs.python.org/3/library/typing.html#typing.Iterator)) over [NamedTuples](https://docs.python.org/3/library/typing.html#typing.NamedTuple) or [dataclasses](https://docs.python.org/3/library/dataclasses.html) into an sqlite database on your disk.\nThe database schema is automatically inferred from type annotations ([PEP 526](https://www.python.org/dev/peps/pep-0526)).\n\nIt works in a similar manner to [functools.lru_cache](https://docs.python.org/3/library/functools.html#functools.lru_cache): caching your data is just a matter of decorating it.\n\nThe difference from `functools.lru_cache` is that data is preserved between program runs.\n\n## Motivation\n\nI often find myself processing big chunks of data, computing some aggregates on it or extracting only the bits I'm interested in. While I'm trying to use the REPL as much as I can, some things are still fragile and often you just have to rerun the whole thing in the process of development. This can be frustrating if data parsing and processing takes seconds, let alone minutes in some cases. \n\nThe conventional way of dealing with it is serializing results along with some sort of hash (e.g. 
md5) of input files,\ncomparing on the next run and returning cached data if nothing changed.\n\nSimple as it sounds, this is pretty tedious to do every time you need to memoize some data; it contaminates your code with routine and distracts you from your main task.\n\n\n# Example\nImagine you're working on a data analysis pipeline for some huge dataset, say, extracting urls and their titles from a Wikipedia archive.\nParsing it (the `extract_links` function) takes hours; however, the archive is presumably not updated very frequently, so the results are worth caching.\n\n\nWith this library you can cache the results with a single `@cachew` decorator.\n\n\n```python\n>>> from typing import NamedTuple, Iterator\n>>> from cachew import cachew\n>>> class Link(NamedTuple):\n...     url : str\n...     text: str\n...\n>>> @cachew\n... def extract_links(archive: str) -> Iterator[Link]:\n...     for i in range(5):\n...         import time; time.sleep(1) # simulate slow IO\n...         yield Link(url=f'http://link{i}.org', text=f'text {i}')\n...\n>>> list(extract_links(archive='wikipedia_20190830.zip')) # that would take about 5 seconds on first run\n[Link(url='http://link0.org', text='text 0'), Link(url='http://link1.org', text='text 1'), Link(url='http://link2.org', text='text 2'), Link(url='http://link3.org', text='text 3'), Link(url='http://link4.org', text='text 4')]\n\n>>> from timeit import Timer\n>>> res = Timer(lambda: list(extract_links(archive='wikipedia_20190830.zip'))).timeit(number=1) # second run is cached, so should take less time\n>>> print(f\"took {int(res)} seconds to query cached items\")\ntook 0 seconds to query cached items\n```\n\n\n\n\n# How it works\nBasically, your data object gets [flattened out](src/cachew/__init__.py:272)\nand python types are mapped [onto sqlite types and back](src/cachew/__init__.py:324).\n\nWhen the function is called, `cachew` [computes the hash](src/cachew/__init__.py:544) of your function's arguments \nand compares it against the previously stored hash value.\n\nIf they match, it would deserialize and yield whatever is stored in 
the cache database; if the hashes do not match, the original data provider is called and the new data is stored along with the new hash.\n\n\n\n\n# Features\n\n\n\n\n* supports primitive types: `str`, `int`, `float`, `bool`, `datetime`, `date`\n* supports [Optional](src/cachew/tests/test_cachew.py:325)\n* supports [nested datatypes](src/cachew/tests/test_cachew.py:241)\n* supports return type inference: [1](src/cachew/tests/test_cachew.py:185), [2](src/cachew/tests/test_cachew.py:199)\n* detects [datatype schema changes](src/cachew/tests/test_cachew.py:271) and discards old data automatically \n\n\n\n\n\n# Using\nSee the [docstring](src/cachew/__init__.py:462) for up-to-date documentation on parameters and return types. \nYou can also use the [extensive unit tests](src/cachew/tests/test_cachew.py) as a reference.\n\nSome highlights:\n\n* `cache_path` can be a filename, or you can specify a callable [returning a path](src/cachew/tests/test_cachew.py:221) that depends on the function's arguments.\n\n It's not required to specify the path (it will be created in `/tmp`), but it is recommended.\n\n* `hashf` by default just hashes all the arguments; you can also specify a custom callable.\n\n For instance, it can be used to [discard the cache](src/cachew/tests/test_cachew.py:51) if the input file was modified.\n\n* `cls` is deduced from return type annotations by default, but can be specified if you don't control the code you want to cache. 
\n\n\n\n# Installing\n\n TODO\n\n# Implementation\n\n* why tuples and dataclasses?\n\n Tuples are natural in Python for quickly grouping together return results.\n `NamedTuple` and `dataclass` specifically provide a very straightforward and self-documenting way to represent a bit of data in Python.\n Their very compact syntax makes them extremely convenient even as a one-off means of communicating between a couple of functions.\n\n If you want to find out more about why you should use more dataclasses in your code, I suggest these links:\n [What are data classes?](https://stackoverflow.com/questions/47955263/what-are-data-classes-and-how-are-they-different-from-common-classes), [basic data classes](https://realpython.com/python-data-classes/#basic-data-classes).\n\n\n* why not [pickle](https://docs.python.org/3/library/pickle.html)?\n\n Pickling is a bit heavyweight for a plain data class. There are many reports of pickle being slower than even JSON, and it's also a security risk. Lastly, it can only be loaded via Python.\n\n* why `sqlite` database for storage?\n\n It's pretty efficient, and a sequence of namedtuples maps onto database rows in a very straightforward manner.\n\n* why not `pandas.DataFrame`?\n\n DataFrames are great and can be serialised to csv or pickled.\n They are good to have as one of the ways you can interface with your data; however, they are hardly convenient to reason about abstractly due to their dynamic nature.\n They also can't be nested.\n\n* why not [ORM](https://en.wikipedia.org/wiki/Object-relational_mapping)?\n\n ORMs tend to be pretty invasive, which might complicate your scripts or even ruin performance. An ORM is also somewhat overkill for such a specific purpose.\n\n * E.g. 
[SQLAlchemy](https://docs.sqlalchemy.org/en/13/orm/tutorial.html#declare-a-mapping) requires you to use custom SQLAlchemy-specific types and inherit from a base class.\n Also, it doesn't support nested types.\n\n* why not [marshmallow](https://marshmallow.readthedocs.io/en/3.0/nesting.html)?\n\n Marshmallow is a common way to map data into a db-friendly format, but it requires an explicit schema, which is an overhead when you already have it in the form of type annotations. I've looked at existing projects that utilise type annotations, but didn't find them covering all I wanted:\n\n * https://marshmallow-annotations.readthedocs.io/en/latest/ext/namedtuple.html#namedtuple-type-api\n * https://pypi.org/project/marshmallow-dataclass\n\n\n", "description_content_type": "text/markdown; charset=UTF-8", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/karlicoss/cachew", "keywords": "", "license": "mit", "maintainer": "", "maintainer_email": "", "name": "cachew", "package_url": "https://pypi.org/project/cachew/", "platform": "any", "project_url": "https://pypi.org/project/cachew/", "project_urls": { "Homepage": "https://github.com/karlicoss/cachew" }, "release_url": "https://pypi.org/project/cachew/0.4/", "requires_dist": [ "sqlalchemy", "dataclasses ; python_version<\"3.7\"", "bandit ; extra == 'testing'", "mypy ; extra == 'testing'", "pylint ; extra == 'testing'", "pytest ; extra == 'testing'", "pytz ; extra == 'testing'" ], "requires_python": "", "summary": "Easy sqlite-backed persistent cache for dataclasses", "version": "0.4" }, "last_serial": 5695206, "releases": { "0.4": [ { "comment_text": "", "digests": { "md5": "8b457cfe64f6693d9ab2d8db8c7316e0", "sha256": "bdc7c137b221691d5a6a08102a61e11117a663add96e321635dda5e61ad3c562" }, "downloads": -1, "filename": "cachew-0.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "8b457cfe64f6693d9ab2d8db8c7316e0", "packagetype": "bdist_wheel", 
"python_version": "py2.py3", "requires_python": null, "size": 17506, "upload_time": "2019-08-18T16:48:19", "url": "https://files.pythonhosted.org/packages/bb/56/a10c167757590de61c02b4f731d79e4a39ce300e3eb6e833ad365b69ddf9/cachew-0.4-py2.py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "8b457cfe64f6693d9ab2d8db8c7316e0", "sha256": "bdc7c137b221691d5a6a08102a61e11117a663add96e321635dda5e61ad3c562" }, "downloads": -1, "filename": "cachew-0.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "8b457cfe64f6693d9ab2d8db8c7316e0", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": null, "size": 17506, "upload_time": "2019-08-18T16:48:19", "url": "https://files.pythonhosted.org/packages/bb/56/a10c167757590de61c02b4f731d79e4a39ce300e3eb6e833ad365b69ddf9/cachew-0.4-py2.py3-none-any.whl" } ] }