{ "info": { "author": "Ken Farmer", "author_email": "kenfar@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Console", "Intended Audience :: Developers", "Intended Audience :: Information Technology", "Intended Audience :: Science/Research", "License :: OSI Approved :: BSD License", "Operating System :: POSIX", "Programming Language :: Python", "Topic :: Database", "Topic :: Scientific/Engineering", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Text Processing", "Topic :: Utilities" ], "description": "| Datagristle is a toolbox of tough and flexible data connectors and\n analyzers.\n| It's kind of an interactive mix between ETL and data analysis\n optimized for rapid analysis and manipulation of a wide variety of\n data.\n\nIt's neither an enterprise ETL tool, nor an enterprise analysis,\nreporting, or data mining tool. It's intended to be an easily-adopted\ntool for technical analysts that combines the most useful subset of data\ntransformation and analysis capabilities necessary to do 80% of the\nwork. 
Its open source python codebase allows it to be easily extended\nwith custom code to handle that always challenging last 20%.\n\nCurrent Status: Strong support for easy analysis, simple transformations\nof csv files, ability to create data dictionaries, change detection, and\nemerging data quality capabilities.\n\nMore info is on the DataGristle wiki here:\nhttps://github.com/kenfar/DataGristle/wiki\n\nNext Steps:\n===========\n\n- attractive PDF output of gristle\\_determinator.py\n- metadata database population\n\nIts objectives include:\n=======================\n\n- multi-platform (unix, linux, mac os, windows with effort)\n- multi-language (primarily python)\n- free - no cripple-licensing\n- primary audience is programming data analysts - not non-technical\n analysts\n- primary environment is command-line rather than windows, graphical\n desktop or eclipse\n- extensible\n- allow a bi-directional iteration between ETL & data analysis\n- can quickly perform initial data analysis prior to longer-duration,\n deeper analysis with heavier-weight tools.\n\nInstallation\n============\n\n- Using `pip `__:\n\n :sub:`~` $ pip install datagristle :sub:`~`\n\nDependencies\n============\n\n- Python 3.7\n\nUtilities Provided in This Release:\n===================================\n\n- gristle\\_slicer\n\n - Used to extract a subset of columns and rows out of an input file.\n\n- gristle\\_freaker\n\n - Produces a frequency distribution of multiple columns from input\n file.\n\n- gristle\\_determinator\n\n - Identifies file formats, generates metadata, prints file analysis\n report\n - This is the most mature - and also used by the other utilities so\n that you generally do not need to enter file structure info.\n\n- gristle\\_differ\n\n - Allows two identically-structured files to be compared by key\n columns and split into same, inserts, deletes, chgold and chgnew\n files.\n - The user can configure which columns are included in the\n comparison.\n - Post delta 
transformations can include assigning sequence numbers,\n copying field values, etc.\n\n- gristle\\_validator\n\n - Validates csv files by confirming that all records have the right\n number of fields, and by applying a json schema full of requirements\n to each record.\n\n- gristle\\_dir\\_merger\n\n - Used to consolidate large directories with options to control\n matching criteria as well as matching actions.\n\n- gristle\\_processor\n\n - Used to apply actions, like delete, compress, etc, to files based\n on very flexible criteria.\n\n- gristle\\_viewer\n\n - Shows one record from a file at a time - formatted based on\n metadata.\n\ngristle\\_validator\n==================\n\n::\n\n Splits a csv file into two separate files based on how records pass or fail\n validation checks:\n - Field count - checks the number of fields in each record against the\n number required. The correct number of fields can be provided in an\n argument or will default to using the number from the first record.\n - Schema - uses csv file requirements defined in a json-schema file for\n quality checking. These requirements include the number of fields, \n and for each field - the type, min & max length, min & max value,\n whether or not it can be blank, existence within a list of valid\n values, and finally compliance with a regex pattern.\n\n The output can just be the return code (0 for success, 1+ for errors), can\n be some high level statistics, or can be the csv input records split between\n good and erroneous files. 
Output can also be limited to a random subset.\n\n Examples:\n $ gristle_validator sample.csv -f 3\n Prints all valid input rows to stdout, prints all records with \n other than 3 fields to stderr along with an extra final field that\n describes the error.\n $ gristle_validator sample.csv \n Prints all valid input rows to stdout, prints all records with \n other than the same number of fields found on the first record to\n stderr along with an extra final field that describes the error.\n $ gristle_validator sample.csv -d '|' --hasheader\n Same comparison as above, but in this case the file was too small\n or complex for the pgm to automatically determine csv dialect, so\n we had to explicitly give that info to program.\n $ gristle_validator sample.csv --outgood sample_good.csv --outerr sample_err.csv\n Same comparison as above, but explicitly splits good and bad data\n into separate files.\n $ gristle_validator sample.csv --randomout 1\n Same comparison as above, but only writes a random 1% of data out.\n $ gristle_validator sample.csv --silent\n Same comparison as above, but writes nothing out. Exit code can be\n used to determine if any bad records were found.\n $ gristle_validator sample.csv --validschema sample_schema.csv \n The above command checks both field count as well as validations\n described in the sample_schema.csv file. Here's an example of what \n that file might look like:\n items:\n - title: rowid\n blank: False\n required: True\n dg_type: integer\n dg_minimum: 1\n dg_maximum: 60\n - title: start_date\n blank: False\n minLength: 8\n maxLength: 10\n pattern: '[0-9]*/[0-9]*/[1-2][0-9][0-9][0-9]'\n - title: location\n blank: False\n minLength: 2\n maxLength: 2\n enum: ['ny','tx','ca','fl','wa','ga','al','mo']\n\ngristle\\_slicer\n===============\n\n::\n\n Extracts subsets of input files based on user-specified columns and rows.\n The input csv file can be piped into the program through stdin or identified\n via a command line option. 
The output will default to stdout, or can be redirected\n to a filename via a command line option.\n\n The columns and rows are specified using python list slicing syntax -\n so individual columns or rows can be listed as can ranges. Inclusion\n or exclusion logic can be used - and even combined.\n\n Examples:\n $ gristle_slicer sample.csv\n Prints all rows and columns\n $ gristle_slicer sample.csv -c\":5, 10:15\" -C 13\n Prints columns 0-4 and 10,11,12,14 for all records\n $ gristle_slicer sample.csv -C:-1\n Prints all columns except for the last for all records\n $ gristle_slicer sample.csv -c:5 -r-100\n Prints columns 0-4 for the last 100 records\n $ gristle_slicer sample.csv -c:5 -r-100 -d'|' --quoting=quote_all\n Prints columns 0-4 for the last 100 records, with csv\n dialect info (delimiter, quoting) provided manually\n $ cat sample.csv | gristle_slicer -c:5 -r-100 -d'|' --quoting=quote_all\n Prints columns 0-4 for the last 100 records, with csv\n dialect info (delimiter, quoting) provided manually\n \n\ngristle\\_freaker\n================\n\n::\n\n Creates a frequency distribution of values from columns of the input file\n and prints it out in columns - the first being the unique key and the last \n being the count of occurrences.\n\n\n Examples:\n $ gristle_freaker sample.csv -d '|' -c 0\n Creates two columns from the input - the first with\n unique keys from column 0, the second with a count of\n how many times each exists.\n $ gristle_freaker sample.csv -d '|' -c 0 --sortcol 1 --sortorder forward --writelimit 25\n In addition to what was described in the first example, \n this example adds sorting of the output by count ascending \n and just prints the first 25 entries.\n $ gristle_freaker sample.csv -d '|' -c 0 --sampling_rate 3 --sampling_method interval\n In addition to what was described in the first example,\n this example adds a sampling in which it only references\n every third record.\n $ gristle_freaker sample.csv -d '|' -c 0,1\n Creates three columns from the 
input - the first two\n with unique key combinations from columns 0 & 1, the \n third with the number of times each combination exists.\n $ gristle_freaker sample.csv -d '|' -c -1\n Creates two columns from the input - the first with unique\n keys from the last column of the file (negative numbers \n wrap), then a second with the number of times each exists.\n $ gristle_freaker sample.csv -d '|' --columntype all\n Creates two columns from the input - all columns combined\n into a key, then a second with the number of times each\n combination exists.\n $ gristle_freaker sample.csv -d '|' --columntype each\n Unlike the other examples, this one performs a separate\n analysis for every single column of the file. Each analysis\n produces three columns from the input - the first is a \n column number, second is a unique value from the column, \n and the third is the number of times that value appeared. \n This output is repeated for each column.\n\ngristle\\_viewer\n===============\n\n::\n\n Displays a single record of a file, one field per line, with field names \n displayed as labels to the left of the field values. Also allows simple \n navigation between records.\n\n Examples:\n $ gristle_viewer sample.csv -r 3 \n Presents the third record in the file with one field per line\n and field names from the header record as labels in the left\n column.\n $ gristle_viewer sample.csv -r 3 -d '|' -q quote_none\n In addition to what was described in the first example this\n adds explicit csv dialect overrides.\n \n\ngristle\\_determinator\n=====================\n\n::\n\n Analyzes the structures and contents of csv files in the end producing a \n report of its findings. It is intended to speed analysis of csv files by\n automating the most common and frequently-performed analysis tasks. 
It's\n useful in both understanding the format and data and quickly spotting issues.\n\n Examples:\n $ gristle_determinator japan_station_radiation.csv\n This command will analyze a file with radiation measurements\n from various Japanese radiation stations.\n\n File Structure:\n format type: csv\n field cnt: 4\n record cnt: 100\n has header: True\n delimiter: \n csv quoting: False \n skipinitialspace: False \n quoting: QUOTE_NONE \n doublequote: False \n quotechar: \" \n lineterminator: '\\n' \n escapechar: None \n\n Field Analysis Progress: \n Analyzing field: 0\n Analyzing field: 1\n Analyzing field: 2\n Analyzing field: 3\n\n Fields Analysis Results: \n\n ------------------------------------------------------\n Name: station_id \n Field Number: 0 \n Wrong Field Cnt: 0 \n Type: timestamp \n Min: 1010000001 \n Max: 1140000006 \n Unique Values: 99 \n Known Values: 99 \n Top Values not shown - all values are unique\n\n ------------------------------------------------------\n Name: datetime_utc \n Field Number: 1 \n Wrong Field Cnt: 0 \n Type: timestamp \n Min: 2011-02-28 15:00:00 \n Max: 2011-02-28 15:00:00 \n Unique Values: 1 \n Known Values: 1 \n Top Values: \n 2011-02-28 15:00:00 x 99 occurrences\n\n ------------------------------------------------------\n Name: sa \n Field Number: 2 \n Wrong Field Cnt: 0 \n Type: integer \n Min: -999 \n Max: 52 \n Unique Values: 35 \n Known Values: 35 \n Mean: 2.45454545455 \n Median: 38.0 \n Variance: 31470.2681359 \n Std Dev: 177.398613681 \n Top Values: \n 41 x 7 occurrences\n 42 x 7 occurrences\n 39 x 6 occurrences\n 37 x 5 occurrences\n 46 x 5 occurrences\n 17 x 4 occurrences\n 38 x 4 occurrences\n 40 x 4 occurrences\n 45 x 4 occurrences\n 44 x 4 occurrences\n\n ------------------------------------------------------\n Name: ra \n Field Number: 3 \n Wrong Field Cnt: 0 \n Type: integer \n Min: -888 \n Max: 0 \n Unique Values: 2 \n Known Values: 2 \n Mean: -556.121212121 \n Median: -888.0 \n Variance: 184564.833792 \n Std 
Dev: 429.610095077 \n Top Values: \n -888 x 62 occurrences\n 0 x 37 occurrences\n\ngristle\\_differ\n===============\n\n::\n\n gristle_differ compares two files, typically an old and a new file, based \n on explicit keys in a way that is far more accurate than diff. It can also\n compare just subsets of columns, and perform post-delta transforms to \n populate fields with static values, values from other fields, variables\n from the command line, or incrementing sequence numbers.\n\n Examples:\n\n $ gristle_differ file0.dat file1.dat --key-cols '0, 2' --ignore_cols '19,22,33'\n\n - Sorts both files on columns 0 & 2\n - Dedupes both files on column 0\n - Compares all fields except fields 19, 22, and 33\n - Automatically determines the csv dialect\n - Produces the following files:\n - file1.dat.insert\n - file1.dat.delete\n - file1.dat.same\n - file1.dat.chgnew\n - file1.dat.chgold\n\n $ gristle_differ file0.dat file1.dat --key-cols '0' --compare_cols '1,2,3,4,5,6,7' -d '|'\n\n - Sorts both files on column 0 \n - Dedupes both files on column 0\n - Compares fields 1,2,3,4,5,6,7\n - Uses '|' as the field delimiter\n - Produces the same output file names as example 1.\n\n\n $ gristle_differ file0.dat file1.dat --config-fn ./foo.yml \\\n --variables batchid:919 --variables pkid:82304\n\n - Produces the same output file names as example 1.\n - But in this case it gets the majority of its configuration items from\n the config file ('foo.yml'). This could include key columns, comparison\n columns, ignore columns, post-delta transformations, and other information.\n - The two variables options are used to pass in user-defined variables that\n can be referenced by the post-delta transformations. 
The batchid will get\n copied into a batch_id column for every file, and the pkid is a sequence\n that will get incremented and used for new rows in the insert, delete and\n chgnew files.\n\ngristle\\_metadata\n=================\n\n::\n\n Gristle_metadata provides a command-line interface to the metadata database.\n It's mostly useful for scripts, but also useful for occasional direct\n command-line access to the metadata.\n\n Examples:\n $ gristle_metadata --table schema --action list\n Prints a list of all rows for the schema table.\n $ gristle_metadata --table element --action put --prompt\n Allows the user to input a row into the element table and \n prompts the user for all fields necessary.\n \n\ngristle\\_md\\_reporter\n=====================\n\n::\n\n Gristle_md_reporter allows the user to create data dictionary reports that\n combine information about the collection and fields along with field value\n descriptions and frequencies.\n\n Examples:\n $ gristle_md_reporter --report datadictionary --collection_id 2\n Prints a data dictionary report of collection_id 2.\n $ gristle_md_reporter --report datadictionary --collection_name presidents\n Prints a data dictionary report of the president collection.\n $ gristle_md_reporter --report datadictionary --collection_id 2 --field_id 3\n Prints a data dictionary report of the president collection,\n only shows field-level information for field_id 3.\n\ngristle\\_dir\\_merger\n====================\n\n::\n\n Gristle_dir_merger consolidates directory structures of files. 
It is both fast\n and flexible with a variety of options for choosing which file to use based\n on full (name and md5) and partial matches (name only).\n\n Examples:\n $ gristle_dir_merger /tmp/foo /data/foo\n - Compares source of /tmp/foo to dest of /data/foo.\n - Files will be consolidated into /data/foo, and deleted from /tmp/foo.\n - Comparison will be: match-on-name-and-md5 (default)\n - Full matches will use: keep_dest (default)\n - Partial matches will use: keep_newest (default)\n - Bottom line: this is what you normally want.\n $ gristle_dir_merger /tmp/foo /data/foo --dry-run\n - Same as the first example - except it only prints what it would do\n without actually doing it.\n - Bottom line: this is a good step to take prior to running it for real.\n $ gristle_dir_merger /tmp/foo /data/foo -r\n - Same as the first example - except it runs recursively through\n the directories.\n $ gristle_dir_merger /tmp/foo /data/foo --on-partial-match keep-biggest\n - Comparison will be: match-on-name-and-md5 (default)\n - Full matches will use: keep_dest (default)\n - Partial matches will use: keep_biggest (override)\n - Bottom line: this is a good combo if you know that some files\n have been modified on both source & dest, and newest isn't the best.\n $ gristle_dir_merger /tmp/foo /data/foo --match-on-name-only --on-full-match keep-source\n - Comparison will be: match-on-name-only (override)\n - Full matches will use: keep_source (override)\n - Bottom line: this is a good way to go if you have\n files that have changed in both directories, but always want to\n use the source files.\n\nDevelopment & Testing\n=====================\n\n- If you're going to test directly out of the source code then set up\n the pathing to point to the parent directory. 
If using\n virtualenvwrapper then just run:\n\n - $ add2virtualenv .\n\nLicensing\n=========\n\n- Gristle uses the BSD license - see the separate LICENSE file for\n further information\n\nCopyright\n=========\n\n- Copyright 2011,2012,2013,2014,2015,2017 Ken Farmer\n\n\nv0.1.6 - 2019-02\n================\n\n- upgraded to support and require python3.7\n\nv0.1.5 - 2018-05\n================\n\n- fixed setup.py bug in which pip10 no longer includes req module\n\nv0.1.4 - 2017-12\n================\n\n- fixed gristle\\_validator bug in which checks on dg\\_maximum were not\n being run\n\nv0.1.3 - 2017-08\n================\n\n- additional improvements to code quality, but with some breaking\n changes\n- changed argument handling for multiple utilities to simplify code and\n get more consistency.\n\n - affects: gristle\\_freaker, gristle\\_slicer, and gristle\\_viewer\n - This means words are separated by hyphens, not underscores.\n --sortorder is --sort-order.\n\n- changed file handling for multiple utilities to simplify code and get\n more consistency.\n\n - affects: gristle\\_freaker, gristle\\_slicer, gristle\\_validator,\n and gristle\\_viewer\n - This means that behavior in handling multiple files, piped input,\n and other edge cases is more consistent between utilities.\n\nv0.1.2 - 2017-06\n================\n\n- long-overdue code quality updates\n\nv0.1.1 - 2017-05\n================\n\n- upgraded to use python3.6\n- changed versioning format, which has broken pypy for history\n\nv0.59 - 2016-11\n===============\n\n- gristle\\_differ\n\n - totally rewritten. Can now handle very large files, perform\n post-transform transformations, handle more complex comparisons,\n and use column names rather than just positions.\n\n- gristle\\_determinator\n\n - added read-limit argument. This allows the tool to be easily run\n against a subset of a very large input file.\n\n- gristle\\_scalar\n\n - removed from toolkit. 
There are better tools in other solutions\n that can be used instead. This tool may come back again later, but only\n if enormously rewritten.\n\n- gristle\\_filter\n\n - removed from toolkit. There are better tools in other solutions\n that can be used instead. This tool may come back again later, but only\n if enormously rewritten.\n\n- minor:\n\n - gristle\\_md\\_reporter - slight formatting change: text\n descriptions of fields are now included, and column widths were\n tweaked.\n - all utilities - a substantial performance improvement for large\n files when quoting information is not provided.\n\nv0.58 - 2014-08\n===============\n\n- gristle\\_dir\\_merger\n\n - initial addition to toolkit. Merges directories of files using a\n variety of matching criteria and matching actions.\n\nv0.57 - 2014-07\n===============\n\n- gristle\\_processor\n\n - initial addition to toolkit. Provides ability to scan through\n directory structure recursively, and delete files that match\n config criteria.\n\nv0.56 - 2014-03\n===============\n\n- gristle\\_determinator\n\n - added hasnoheader arg\n - fixed problem printing top\\_values on empty file with header\n\n- gristle\\_validator\n\n - added hasnoheader arg\n\n- gristle\\_freaker\n\n - added hasnoheader arg\n\nv0.55 - 2014-02\n===============\n\n- gristle\\_determinator - fixed a few problems:\n\n - the 'Top Values not shown - all unique' message being truncated\n - floats not handled correctly for stddev & variance\n - quoted ints & floats not handled\n\nv0.54 - 2014-02\n===============\n\n- gristle\\_validator - major updates to allow validation of csv files\n based on the json schema standard, with help from the Validictory\n module.\n\nv0.53 - 2014-01\n===============\n\n- gristle\\_freaker - major updates to enable distributions on all columns\n to be automatically gathered through either (all or each) args. 'All'\n combines all columns into a single tuple prior to producing\n distribution. 
'Each' creates a separate distribution for every column\n within the csv file.\n- travisci - added support and started using this testing service.\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/kenfar/DataGristle", "keywords": "data analysis quality utility etl", "license": "BSD", "maintainer": "", "maintainer_email": "", "name": "datagristle", "package_url": "https://pypi.org/project/datagristle/", "platform": "", "project_url": "https://pypi.org/project/datagristle/", "project_urls": { "Homepage": "http://github.com/kenfar/DataGristle" }, "release_url": "https://pypi.org/project/datagristle/0.1.6/", "requires_dist": null, "requires_python": "", "summary": "A toolbox and library of ETL, data quality, and data analysis tools", "version": "0.1.6" }, "last_serial": 4877400, "releases": { "0.1.2": [ { "comment_text": "", "digests": { "md5": "685282908a8e2de76b9f510ff4173db7", "sha256": "9eb1b3a4d7fe6cb6bfb47a5083318447aba6216bf6c4e7cac87f63fed8682cf8" }, "downloads": -1, "filename": "datagristle-0.1.2.tar.gz", "has_sig": false, "md5_digest": "685282908a8e2de76b9f510ff4173db7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 429187, "upload_time": "2017-07-08T05:02:27", "url": "https://files.pythonhosted.org/packages/10/6c/0719dab6e100f04bd3c96d64627b43f3eb3e08143d7b8d5329038206dda3/datagristle-0.1.2.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "f8a375fcdb0a7bfb121c285e487cf5f2", "sha256": "c50c99fa243c7ed9981d861914f08b00651f797182750591b7c1892b2a74c340" }, "downloads": -1, "filename": "datagristle-0.1.4.tar.gz", "has_sig": false, "md5_digest": "f8a375fcdb0a7bfb121c285e487cf5f2", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 429745, "upload_time": "2017-12-13T06:00:48", "url": 
"https://files.pythonhosted.org/packages/52/6e/30fcbddd36057086b54e7480c938de083dceb7ed9242679ed8186e2e08f4/datagristle-0.1.4.tar.gz" } ], "0.1.5": [ { "comment_text": "", "digests": { "md5": "5e01a76994b4aceddcc5fa1be9c1dda8", "sha256": "c7337999447a9fefc905796ac3749ce8f511da1121990e4b57f798453ae8bd4d" }, "downloads": -1, "filename": "datagristle-0.1.5.tar.gz", "has_sig": false, "md5_digest": "5e01a76994b4aceddcc5fa1be9c1dda8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 434739, "upload_time": "2018-05-04T03:27:12", "url": "https://files.pythonhosted.org/packages/41/51/9bec9078451ba729799d9300bbc8a33a569ceeae7faf2d19ba2614f73620/datagristle-0.1.5.tar.gz" } ], "0.1.6": [ { "comment_text": "", "digests": { "md5": "07a1089e70ed17d8dd362c4f40ac820d", "sha256": "ee09f2425907506aa83ed0284216f93243a03e18a604b88ec17beacdfc6fd85d" }, "downloads": -1, "filename": "datagristle-0.1.6.tar.gz", "has_sig": false, "md5_digest": "07a1089e70ed17d8dd362c4f40ac820d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 427481, "upload_time": "2019-02-28T04:56:22", "url": "https://files.pythonhosted.org/packages/a0/c4/3e1a1f6c9be5b285668c3f69a2c605a9559baf886f42e4161f12280b4b83/datagristle-0.1.6.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "07a1089e70ed17d8dd362c4f40ac820d", "sha256": "ee09f2425907506aa83ed0284216f93243a03e18a604b88ec17beacdfc6fd85d" }, "downloads": -1, "filename": "datagristle-0.1.6.tar.gz", "has_sig": false, "md5_digest": "07a1089e70ed17d8dd362c4f40ac820d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 427481, "upload_time": "2019-02-28T04:56:22", "url": "https://files.pythonhosted.org/packages/a0/c4/3e1a1f6c9be5b285668c3f69a2c605a9559baf886f42e4161f12280b4b83/datagristle-0.1.6.tar.gz" } ] }