{ "info": { "author": "Lukasz Bolikowski", "author_email": "lukasz@bolikowski.eu", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Environment :: Other Environment", "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "Elkhound is an opinionated, data-centric workflow engine. It makes the following assumptions about your project:\n\n* The project workflow can be split into several tasks; each task has clearly defined input and output data files, ideally with no side effects. Tasks form a directed acyclic graph (i.e., no loops).\n* Each data file has a clearly defined format and schema. Information between consecutive tasks is transferred primarily via the files. CSV or gzipped CSV files are preferred.\n\nElkhound will help you by:\n\n* **Versioning** data files by timestamping them (tasks read input files with the latest timestamp and write output files with the current timestamp).\n* Supporting **big data files** by providing convenient schema-aware iterators over rows (using Python generators), which lowers the memory footprint.\n* Managing **workflows** of tasks, easily running arbitrary lists of tasks.\n* Automatically **checkpointing** intermediate results, thanks to inter-task communication via data files and data file versioning.\n* Managing the project's parameters by injecting contents of **configuration files and command line parameters** as tasks' contexts (less temptation to hardcode constants).\n* **Logging** workflow executions, reporting context and execution progress. 
Automatically collecting and archiving logs from different runs in one place to facilitate reproducibility.\n* Facilitating **unit testing** by injecting mockable data file objects as inputs and outputs of task classes.\n\nIn order to run Elkhound workflows, you need to create\nan engine configuration file\nand implement business logic in ``Task`` subclasses.\n\nEngine configuration\n--------------------\n\nThe engine configuration file (in our example we'll name it ``engine.yaml``)\nhas three sections:\n\n* ``specs``, where you'll describe data files (names, formats, schemas; these files are outputs for some tasks, inputs for other tasks),\n* ``tasks``, where you'll point to implementations of business logic. Tasks define transformations of data files (how to build an output file Z given input files X and Y),\n* ``workflows``, where you'll bundle groups of target data files.\n\nData files are first-class citizens in Elkhound.\nThey are identified by four-digit codes (e.g. ``1230``, ``2110``, ``4315``, ``5214``).\nThe design of a new workflow should begin with registering\ndata files and, where applicable, defining their schemas.\nHere is an example of data files defined in an engine configuration file:\n\n.. 
code:: yaml\n\n    specs:\n      - code: 1230\n        name: people\n        extension: csv.gz\n        flags:\n          - gzipped\n        schema:\n          - name: name\n            type: str\n          - name: dob\n            type: datetime\n          - name: is_employee\n            type: bool\n      - code: 2110\n        name: budget\n        extension: xlsx\n        flags:\n          - binary\n      - code: 4315\n        name: report\n        extension: txt\n      - code: 5214\n        name: plots\n        extension: dir\n        flags:\n          - directory\n\nTasks are Python classes that take zero or more data files as input\nand produce zero or more data files as output.\nEach task class has to implement three methods:\n\n* ``get_input_data_file_codes(self)`` returning a list of input data file codes,\n* ``get_output_data_file_codes(self)`` returning a list of output data file codes,\n* ``run(self, input_files, output_files, context)`` executing business logic.\n\nHere is an example of tasks registered in an engine configuration file:\n\n.. code:: yaml\n\n    tasks:\n      - class: myapp.DownloadDataTask\n      - class: myapp.GenerateReportTask\n      - class: myapp.PlotBudgetTask\n\nIn our example we will assume that:\n\n* ``DownloadDataTask`` takes no data files as input and produces ``1230`` and ``2110`` as output.\n* ``GenerateReportTask`` takes ``1230`` and ``2110`` as input and produces ``4315`` as output.\n* ``PlotBudgetTask`` takes ``2110`` as input and produces ``5214`` as output.\n\nWorkflows are named lists of targets, i.e., data files to be created.\nHere is an example (excerpt of an engine configuration file):\n\n.. code:: yaml\n\n    workflows:\n      monthly_briefing:\n        - 4315\n        - 5214\n\nBusiness logic implementation\n-----------------------------\n\nEach task is implemented as a subclass of ``elkhound.Task``.\nIts job is to read the input files it needs and create\nthe output files.\nHere is a simple example:\n\n.. 
code:: python\n\n    from elkhound import Task\n\n\n    class GenerateReportTask(Task):\n        def get_input_data_file_codes(self):\n            # Codes of the data files this task reads.\n            return [1230, 2110]\n\n        def get_output_data_file_codes(self):\n            # Codes of the data files this task writes.\n            return [4315]\n\n        def run(self, input_files, output_files, context=None):\n            # Write one line to the report for every input file.\n            with output_files[4315].open() as f:\n                for _, input_file in input_files.items():\n                    f.write('Used input file {}\\n'.format(input_file.get_path()))\n\nWhen the engine calls the ``run`` method,\nthe ``input_files`` and ``output_files`` arguments\ncontain ``DataFile`` objects that know the exact paths of the files\nand can assist in opening them in the right mode (read or write, text or binary, gzipped or not).\nData file objects have utility methods for specific situations.\nFor example, when an input file is in CSV format, the corresponding data file object\nhas methods such as ``read_data_frame()``, which returns a pandas data frame,\nand ``iterate_records()``, which returns a generator yielding records one by one\n(useful when scanning huge files that won't fit into memory).\n\nRunning workflows\n-----------------\n\nHere's an example of how to run a workflow:\n\n.. 
code:: bash\n\n    python -m elkhound.runner --dir /workspace/foo --engine engine.yaml --targets monthly_briefing --deps\n\n\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/elkhound/elkhound", "keywords": "data,workflow,engine", "license": "", "maintainer": "", "maintainer_email": "", "name": "elkhound", "package_url": "https://pypi.org/project/elkhound/", "platform": "", "project_url": "https://pypi.org/project/elkhound/", "project_urls": { "Homepage": "https://github.com/elkhound/elkhound" }, "release_url": "https://pypi.org/project/elkhound/0.0.4/", "requires_dist": [ "PyYAML" ], "requires_python": "", "summary": "Elkhound workflow engine", "version": "0.0.4" }, "last_serial": 3306139, "releases": { "0.0.4": [ { "comment_text": "", "digests": { "md5": "bfe05dda7da0a363118e920b9a779c49", "sha256": "9866b278cb8fa1b9b04a16b675a2bc6bd0a14865749a7fdab50f16a62e9e9b71" }, "downloads": -1, "filename": "elkhound-0.0.4-py3-none-any.whl", "has_sig": true, "md5_digest": "bfe05dda7da0a363118e920b9a779c49", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 13891, "upload_time": "2017-11-04T21:10:25", "url": "https://files.pythonhosted.org/packages/bb/51/465fa74d1d39c3e7af336f51e63ed1d868bc5732ad92109077683e4e0cda/elkhound-0.0.4-py3-none-any.whl" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "bfe05dda7da0a363118e920b9a779c49", "sha256": "9866b278cb8fa1b9b04a16b675a2bc6bd0a14865749a7fdab50f16a62e9e9b71" }, "downloads": -1, "filename": "elkhound-0.0.4-py3-none-any.whl", "has_sig": true, "md5_digest": "bfe05dda7da0a363118e920b9a779c49", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 13891, "upload_time": "2017-11-04T21:10:25", "url": 
"https://files.pythonhosted.org/packages/bb/51/465fa74d1d39c3e7af336f51e63ed1d868bc5732ad92109077683e4e0cda/elkhound-0.0.4-py3-none-any.whl" } ] }