{ "info": { "author": "Al Niessner", "author_email": "Al.Niessner@jpl.nasa.gov", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "License :: Free To Use But Restricted", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "## DAWGIE\n\n### Data and Algorithm Work-flow Generation, Introspection, and Execution\n\nThe DAWGIE software accomplishes:\n\n1. Data anonymity is required by the framework because the implementation DAWGIE is independent of the Algorithm Engine element of the framework. The framework allows DAWGIE to identify the author of the data through naming and version even though DAWGIE as no knowledge of the data itself. Because the language of the framework and further implementations is Python, DAWGIE forces a further requirement on the data that it be \"pickle-able\". The additional requirement was accepted by the framework and pushed to the Algorithm Engine element. DAWGIE uses a lazy load implementation -- loaded only when an Algorithm Engine sub-element requests data from a specific author -- to remove routing and, thus, requiring knowledge about the data itself.\n1. Data persistence has two independent implementations. In the case of small and simple data author relationships, the Python shelve module is used for persistence. As the number of unique authors grows, the relations grow faster (non-linear) and requires a more sophisticated persistence tool. In the latter case, postgresql is used for persistence. Both implementations are the design making them quickly interchangeable. Neither actually store the data blobs themselves. They -- shelve and postgresql -- only store the author's unique ID -- name and version -- and then reference the blob on disk. Using the relational DB proprieties of shelve (Python dictionaries) and postgresql (relational DB) any request for data from a specific author can quickly be resolved. Creating unique names for all of the data blobs is achieved using {md5 hash}_{sha1 hash}. These names also allow the persistence to recognize if the data is already known.\n1. For pipeline management, DAWGIE implements a worker farm, a work flow Abstract Syntax Tree (AST), and signaling from data persistence to execute the tasks within the framework Algorithm Engine element. The data persistence implementation signals the pipeline management element when new data has been authored. The manager searches the AST for all tasks that depend on the author and schedules all tasks that depend on the author starting with the earliest dependent. The new data signals can be generated at the end any task. When a task moves from being scheduled to executing, the foreman of the worker farm passes the task to a waiting worker on a compute node. The worker then loads the data via data persistence for the task and begins executing the task. Upon completion of the task, the worker saves the data via data persistence and notifies the foreman it is ready for another task. In this fashion, DAWGIE walks the minimum and complete nodes in the AST that depend on any new data that has been generated. The pipeline management offers periodic tasks as well that treat a temporal event as new data.\n\n### Organization\n\nDAWGIE is configured and controlled by ENV variables and command line arguments. The command line arguments override the ENV variables. While this section cover nearly all of the command line options, please use the --help switch to see all of the available options.\n\n#### Access to Running DAWGIE\n\n