{ "info": { "author": "Robert Stojnic", "author_email": "robert.stojnic@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "[![CircleCI](https://circleci.com/gh/rstojnic/lazydata/tree/master.svg?style=shield)](https://circleci.com/gh/rstojnic/lazydata/tree/master)\n\n# lazydata: scalable data dependencies\n\n`lazydata` is a minimalist library for including data dependencies into Python projects. \n\n**Problem**: Keeping all data files in git (e.g. via git-lfs) results in a bloated repository copy that takes ages to pull. Keeping code and data out of sync is a disaster waiting to happen. \n\n**Solution**: `lazydata` only stores references to data files in git, and syncs data files on-demand when they are needed.\n\n**Why**: The semantics of code and data are different - code needs to be versioned to merge it, and data just needs to be kept in sync. `lazydata` achieves exactly this in a minimal way. \n\n**Benefits**:\n\n- Keeps your git repository clean with just code, while enabling seamless access to any number of linked data files \n- Data consistency assured using file hashes and automatic versioning\n- Choose your own remote storage backend: AWS S3 or (coming soon:) directory over SSH\n\n`lazydata` is primarily designed for machine learning and data science projects. \nSee [this medium post](https://medium.com/@rstojnic/structuring-ml-projects-so-they-can-grow-b63e89c8be8f) for more. \n\n
\n\n
\n\n## Getting started \n\nIn this section we'll show how to use `lazydata` on an example project.\n\n### Installation\n\nInstall with pip (requires Python 3.5.2+):\n\n```bash\n$ pip install lazydata\n```\n\n### Add to your project\n\nTo enable `lazydata`, run this in the project root:\n\n```bash\n$ lazydata init \n```\n\nThis will initialise `lazydata.yml`, which will hold the list of files managed by lazydata. \n\n### Tracking a file\n\nTo start tracking a file use `track(\"\")` in your code:\n\n**my_script.py**\n```python\nfrom lazydata import track\n\n# store the file when loading \nimport pandas as pd\ndf = pd.read_csv(track(\"data/my_big_table.csv\"))\n\n# df.shape is a tuple, so pass it as a separate argument instead of concatenating\nprint(\"Data shape:\", df.shape)\n\n```\n\nRunning the script the first time will start tracking the file:\n\n```bash\n$ python my_script.py\n## lazydata: Tracking a new file data/my_big_table.csv\n## Data shape: (10000, 100)\n```\n\nThe file is now tracked, has been backed up in your local lazydata cache in `~/.lazydata`, and has been added to **lazydata.yml**:\n```yaml\nfiles:\n - path: data/my_big_table.csv\n hash: 2C94697198875B6E...\n usage: my_script.py\n\n```\n\nIf you re-run the script without modifying the data file, lazydata will just quickly check that the data file hasn't changed and won't do anything else. \n\nIf you modify the data file and re-run the script, this will add another entry to the yml file with the new hash of the data file, i.e. data files are automatically versioned. If you don't want to keep past versions, simply remove them from the yml. \n\nAnd you are done! This data file is now tracked and linked to your local repository.\n\n### Sharing your tracked files\n\nTo access your tracked files from multiple machines, add a remote storage backend where they can be uploaded. To use S3 as a remote storage backend run:\n\n```bash\n$ lazydata add-remote s3://mybucket/lazydata\n```\n\nThis will configure the S3 backend and also add it to `lazydata.yml` for future reference. 
\n\nYou can now git commit and push your `my_script.py` and `lazydata.yml` files as you normally would. \n\nTo copy the stored data files to S3 use:\n\n```bash\n$ lazydata push\n```\n\nWhen your collaborator pulls the latest version of the git repository, they will get the script and the `lazydata.yml` file as usual. \n\nData files will be downloaded when your collaborator runs `my_script.py` and the `track(\"data/my_big_table.csv\")` call is executed:\n\n```bash\n$ python my_script.py\n## lazydata: Downloading stored file my_big_table.csv ...\n## Data shape: (10000, 100)\n``` \n\nTo get the data files without running the code, you can also use the command line utility:\n\n```bash\n# download just this file\n$ lazydata pull my_big_table.csv\n\n# download everything used in this script\n$ lazydata pull my_script.py\n\n# download everything stored in the data/ directory and subdirs\n$ lazydata pull data/\n\n# download the latest version of all data files\n$ lazydata pull\n```\n\nBecause `lazydata.yml` is tracked by git, you can safely create and switch git branches. \n\n### Data dependency scenarios\n\nYou can achieve multiple data dependency scenarios by putting `lazydata.track()` into different parts of the code:\n\n- Jupyter notebook data dependencies by using tracking in notebooks\n- Data pipeline output tracking by tracking saved files \n- Class-level data dependencies by tracking files in `__init__(self)`\n- Module-level data dependencies by tracking files in `__init__.py`\n- Package-level data dependencies by tracking files in `setup.py` \n\n### Coming soon... \n\n- Examine stored file provenance and properties\n- Faceting multiple files into portable datasets\n- Storing data coming from databases and APIs\n- More remote storage options\n\n## Stay in touch\n\nThis is an early stable beta release. To find out about new releases, subscribe to our [new releases mailing list](http://eepurl.com/dFYLIL). \n\n## Contributing\n\nThe library is licensed under the Apache-2 licence. 
All contributions are welcome!\n\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/rstojnic/lazydata", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "lazydata", "package_url": "https://pypi.org/project/lazydata/", "platform": "", "project_url": "https://pypi.org/project/lazydata/", "project_urls": { "Homepage": "https://github.com/rstojnic/lazydata" }, "release_url": "https://pypi.org/project/lazydata/1.0.19/", "requires_dist": [ "pyyaml", "peewee", "boto3", "lazy-import" ], "requires_python": ">=3.5.2", "summary": "Scalable data dependencies", "version": "1.0.19" }, "last_serial": 4286018, "releases": { "1.0.18": [ { "comment_text": "", "digests": { "md5": "381d25b7cbdbc33836cfa17d4023359c", "sha256": "7a39c7a549d986880ec26095938e8b089f91e83747e9bbab2a5e2856095c0980" }, "downloads": -1, "filename": "lazydata-1.0.18-py3-none-any.whl", "has_sig": false, "md5_digest": "381d25b7cbdbc33836cfa17d4023359c", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5.2", "size": 18970, "upload_time": "2018-09-07T17:13:11", "url": "https://files.pythonhosted.org/packages/05/df/d0090b3c9ca17734365feccf6e7813904e6fc0b6354efe37111d327feac5/lazydata-1.0.18-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "65d8fd5d64e41b30df3cf1c3d5dd1707", "sha256": "8436618dfbb162100531c0757e16fd339a88d7e6816f882637523259be3e163c" }, "downloads": -1, "filename": "lazydata-1.0.18.tar.gz", "has_sig": false, "md5_digest": "65d8fd5d64e41b30df3cf1c3d5dd1707", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5.2", "size": 15255, "upload_time": "2018-09-07T17:13:14", "url": "https://files.pythonhosted.org/packages/22/30/a85785d0091bfcea6b109d48cd8f0f05c46949c430ef68d8db64771ee698/lazydata-1.0.18.tar.gz" } ], "1.0.19": [ { "comment_text": "", "digests": { "md5": 
"698399c47e7f0d22f1455377023aa2f5", "sha256": "33fe7a6b4594acde9099ef3f8b0ea7889364610f3a8d61d4c9e559d01a4317d9" }, "downloads": -1, "filename": "lazydata-1.0.19-py3-none-any.whl", "has_sig": false, "md5_digest": "698399c47e7f0d22f1455377023aa2f5", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5.2", "size": 19385, "upload_time": "2018-09-18T21:25:30", "url": "https://files.pythonhosted.org/packages/73/ec/868e567c36b533c722402d8703aca648dfbab855a808091bc2a241d72406/lazydata-1.0.19-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "369cf64d897267310104570cc9f17df6", "sha256": "9a51fa330ece167180a85207e92dca6d61aa58bdf0f23e246a82e2434132bd3c" }, "downloads": -1, "filename": "lazydata-1.0.19.tar.gz", "has_sig": false, "md5_digest": "369cf64d897267310104570cc9f17df6", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5.2", "size": 15483, "upload_time": "2018-09-18T21:25:32", "url": "https://files.pythonhosted.org/packages/ac/7c/82041f9bf48ee8264a24de2efcd183d8be83664aa5ae0cc436d5da7b6962/lazydata-1.0.19.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "698399c47e7f0d22f1455377023aa2f5", "sha256": "33fe7a6b4594acde9099ef3f8b0ea7889364610f3a8d61d4c9e559d01a4317d9" }, "downloads": -1, "filename": "lazydata-1.0.19-py3-none-any.whl", "has_sig": false, "md5_digest": "698399c47e7f0d22f1455377023aa2f5", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": ">=3.5.2", "size": 19385, "upload_time": "2018-09-18T21:25:30", "url": "https://files.pythonhosted.org/packages/73/ec/868e567c36b533c722402d8703aca648dfbab855a808091bc2a241d72406/lazydata-1.0.19-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "369cf64d897267310104570cc9f17df6", "sha256": "9a51fa330ece167180a85207e92dca6d61aa58bdf0f23e246a82e2434132bd3c" }, "downloads": -1, "filename": "lazydata-1.0.19.tar.gz", "has_sig": false, "md5_digest": "369cf64d897267310104570cc9f17df6", "packagetype": "sdist", 
"python_version": "source", "requires_python": ">=3.5.2", "size": 15483, "upload_time": "2018-09-18T21:25:32", "url": "https://files.pythonhosted.org/packages/ac/7c/82041f9bf48ee8264a24de2efcd183d8be83664aa5ae0cc436d5da7b6962/lazydata-1.0.19.tar.gz" } ] }