{ "info": { "author": "Open Knowledge International", "author_email": "info@okfn.org", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Web Environment", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Topic :: Internet :: WWW/HTTP :: Dynamic Content", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "# goodtables-py\n\n[](https://travis-ci.org/frictionlessdata/goodtables-py)\n[](https://coveralls.io/r/frictionlessdata/goodtables-py?branch=master)\n[](https://pypi.python.org/pypi/goodtables)\n[](https://github.com/frictionlessdata/goodtables-py)\n[](https://gitter.im/frictionlessdata/chat)\n\nGoodtables is a framework to validate tabular data. It can check the structure\nof your data (e.g. all rows have the same number of columns), and its contents\n(e.g. all dates are valid).\n\n> **[Important Notice]** `goodtables` was renamed to `frictionless` as of version 3. The framework got various improvements and was extended to be a complete data solution. The change is not breaking for the existing software, so no actions are required. 
Please read the [Migration Guide](https://framework.frictionlessdata.io/docs/development/migration#from-goodtables) to start working with Frictionless for Python.\n> - we continue to bug-fix `goodtables@2.x` in this [branch](https://github.com/frictionlessdata/goodtables-py/tree/goodtables), and it's still available on [PyPi](https://pypi.org/project/goodtables/) as before\n> - please note that the `frictionless@3.x` API, which we're working on at the moment, is not stable\n> - we will release `frictionless@4.x` by the end of 2020 to be the first SemVer/stable version\n\n## Features\n\n* **Structural checks**: Ensure that there are no empty rows, no blank headers, etc.\n* **Content checks**: Ensure that the values have the correct types (\"string\", \"number\", \"date\", etc.), that their format is valid (\"string must be an e-mail\"), and that they respect the constraints (\"age must be a number greater than 18\").\n* **Support for multiple tabular formats**: CSV, Excel files, LibreOffice, Data Package, etc.\n* **Parallelized validations for multi-table datasets**\n* **Command line interface**\n\n## Contents\n\n\n\n - [Getting Started](#getting-started)\n - [Installing](#installing)\n - [Running on CLI](#running-on-cli)\n - [Running on Python](#running-on-python)\n - [Documentation](#documentation)\n - [Report](#report)\n - [Checks](#checks)\n - [Presets](#presets)\n - [Data Quality Errors](#data-quality-errors)\n - [Frequently Asked Questions](#frequently-asked-questions)\n - [API Reference](#api-reference)\n - [`cli`](#cli)\n - [`validate`](#validate)\n - [`preset`](#preset)\n - [`check`](#check)\n - [`Error`](#error)\n - [`spec`](#spec)\n - [`GoodtablesException`](#goodtablesexception)\n - [Contributing](#contributing)\n - [Changelog](#changelog)\n\n\n\n## Getting Started\n\n> For faster goodtables-compatible Pandas dataframe validation, take a look at https://github.com/ezwelty/goodtables-pandas-py\n\n### Installing\n\n```\npip install goodtables\npip 
install goodtables[ods] # If you need LibreOffice's ODS file support\n```\n\n### Running on CLI\n\n```\ngoodtables data.csv\n```\n\nUse `goodtables --help` to see the different options.\n\n### Running on Python\n\n```python\nfrom goodtables import validate\n\nreport = validate('invalid.csv')\nreport['valid'] # false\nreport['table-count'] # 1\nreport['error-count'] # 3\nreport['tables'][0]['valid'] # false\nreport['tables'][0]['source'] # 'invalid.csv'\nreport['tables'][0]['errors'][0]['code'] # 'blank-header'\n```\n\nYou can read a more in-depth explanation of using goodtables with Python in\nthe [developer documentation](#developer-documentation) section. Also check\nthe [examples](examples) folder for other examples.\n\n## Documentation\n\nGoodtables validates your tabular dataset to find structural and content\nerrors. Suppose you have a file named `invalid.csv`. Let's validate it:\n\n```python\nreport = validate('invalid.csv')\n```\n\nWe could also pass a remote URI instead of a local path. It supports CSV, XLS,\nXLSX, ODS, JSON, and all other formats supported by the [tabulator][tabulator]\nlibrary.\n\n### Report\n\n> The validation report follows the JSON Schema defined on [goodtables/schemas/report.json][validation-jsonschema].\n\nThe output of the `validate()` method is a report dictionary. It includes\ninformation on whether the data was valid, the error count, a list of table reports, which\nindividual checks failed, etc. 
A report looks like this:\n\n```json\n{\n \"time\": 0.009,\n \"error-count\": 1,\n \"warnings\": [\n \"Table \\\"data/invalid.csv\\\" inspection has reached 1 error(s) limit\"\n ],\n \"preset\": \"table\",\n \"valid\": false,\n \"tables\": [\n {\n \"errors\": [\n {\n \"row-number\": null,\n \"message\": \"Header in column 3 is blank\",\n \"row\": null,\n \"column-number\": 3,\n \"code\": \"blank-header\"\n }\n ],\n \"error-count\": 1,\n \"headers\": [\n \"id\",\n \"name\",\n \"\",\n \"name\"\n ],\n \"scheme\": \"file\",\n \"row-count\": 2,\n \"valid\": false,\n \"encoding\": \"utf-8\",\n \"time\": 0.007,\n \"schema\": null,\n \"format\": \"csv\",\n \"source\": \"data/invalid\"\n }\n ],\n \"table-count\": 1\n}\n```\n\nThe errors are divided into the following categories:\n\n- `source` - data can't be loaded or parsed\n- `structure` - general tabular errors like duplicate headers\n- `schema` - errors from checks against [Table Schema](http://specs.frictionlessdata.io/table-schema/)\n- `custom` - custom check errors\n\n### Checks\n\nChecks are the main validation actors in goodtables. The list of enabled checks can\nbe changed using the `checks` and `skip_checks` arguments. Let's explore the options\nwith an example:\n\n```python\nreport = validate('data.csv') # by default structure and schema (if available) checks\nreport = validate('data.csv', checks=['structure']) # only structure checks\nreport = validate('data.csv', checks=['schema']) # only schema (if available) checks\nreport = validate('data.csv', checks=['bad-headers']) # check only 'bad-headers'\nreport = validate('data.csv', skip_checks=['bad-headers']) # exclude 'bad-headers'\n```\n\nBy default, a dataset will be validated against all available Data Quality Spec\nerrors. Some checks may be unavailable for validation. For example, if the\nschema isn't provided, only the `structure` checks will be done.\n\n### Presets\n\nGoodtables supports different formats of tabular datasets. They're called\npresets. 
A tabular dataset is some data that can be split into a list of data\ntables:\n\n\n\nWe can change the preset using the `preset` argument for `validate()`. By\ndefault, it'll be inferred from the source, falling back to `table`. To validate\na [data package][datapackage], we can do:\n\n```python\nreport = validate('datapackage.json') # implicit preset\nreport = validate('datapackage.json', preset='datapackage') # explicit preset\n```\n\nThis will validate all tabular resources in the datapackage.\n\nIt's also possible to validate a list of files using the \"nested\" preset. To do\nso, the first argument to `validate()` should be a list of dictionaries, where\neach key in the dictionary is named after a parameter on `validate()`. For example:\n\n```python\nreport = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}]) # implicit preset\nreport = validate([{'source': 'data1.csv'}, {'source': 'data2.csv'}], preset='nested') # explicit preset\n```\n\nThis is similar to:\n\n```python\nreport_data1 = validate('data1.csv')\nreport_data2 = validate('data2.csv')\n```\n\nThe difference is that goodtables validates multiple tables in parallel, so\nusing the \"nested\" preset should run faster.\n\n### Data Quality Errors\n\nBase report errors are standardized and described in\n[Data Quality Spec](https://github.com/frictionlessdata/data-quality-spec/blob/master/spec.json).\n\n#### Source errors\n\nThe basic checks can't be disabled, as they deal with goodtables being able to read the files.\n\n| check | description |\n| --- | --- |\n| io-error | Data reading error because of IO error. |\n| http-error | Data reading error because of HTTP error. |\n| source-error | Data reading error because of unsupported or inconsistent contents. |\n| scheme-error | Data reading error because of incorrect scheme. |\n| format-error | Data reading error because of incorrect format. |\n| encoding-error | Data reading error because of an encoding problem. 
|\n\n#### Structure errors\n\nThese checks validate that the structure of the file is valid.\n\n| check | description |\n| --- | --- |\n| blank-header | There is a blank header name. All cells in the header row must have a value. |\n| duplicate-header | There are multiple columns with the same name. All column names must be unique. |\n| blank-row | Rows must have at least one non-blank cell. |\n| duplicate-row | Rows can't be duplicated. |\n| extra-value | A row has more columns than the header. |\n| missing-value | A row has fewer columns than the header. |\n\n#### Schema errors\n\nThese checks validate the contents of the file. To use them, you need to pass a [Table Schema][tableschema]. If you don't have a schema, goodtables can infer it if you use the `infer_schema` option.\n\nIf your schema only covers part of the data, you can use the `infer_fields` option to infer the remaining fields.\n\nLastly, if the order of the fields in the data is different from that in your schema, enable the `order_fields` option.\n\n| check | description |\n| --- | --- |\n| schema-error | Schema is not valid. |\n| non-matching-header | The header's name in the schema is different from what's in the data. |\n| extra-header | The data contains a header not defined in the schema. |\n| missing-header | The data doesn't contain a header defined in the schema. |\n| type-or-format-error | The value can\u2019t be cast based on the schema type and format for this field. |\n| required-constraint | This field is a required field, but it contains no value. |\n| pattern-constraint | This field's value should conform to the defined pattern. |\n| unique-constraint | This field is a unique field, but it contains a value that has been used in another row. |\n| enumerable-constraint | This field's value should be equal to one of the values in the enumeration constraint. |\n| minimum-constraint | This field's value should be greater than or equal to the constraint value. 
|\n| maximum-constraint | This field's value should be less than or equal to the constraint value. |\n| minimum-length-constraint | The length of this field's value should be greater than or equal to the schema constraint value. |\n| maximum-length-constraint | The length of this field's value should be less than or equal to the schema constraint value. |\n\n#### Custom errors\n\n| check | description |\n| --- | --- |\n| [blacklisted-value](#blacklisted-value) | Ensure there are no cells with the blacklisted values. |\n| [deviated-value](#deviated-value) | Ensure numbers are within a number of standard deviations from the average. |\n| [foreign-key](#foreign-key) | Ensure foreign keys are valid within a data package. |\n| [sequential-value](#sequential-value) | Ensure numbers are sequential. |\n| [truncated-value](#truncated-value) | Detect values that were potentially truncated. |\n| [custom-constraint](#custom-constraint) | Defines a constraint based on the values of other columns (e.g. `value * quantity == total`). |\n\n##### blacklisted-value\n\nSometimes we have to check for values we don't want to have in our dataset. This check accepts the following options:\n\n| option | type | description |\n| --- | --- | --- |\n| column | int/str | Column number or name |\n| blacklist | list of str | List of blacklisted values |\n\nConsider the following CSV file:\n\n```csv\nid,name\n1,John\n2,bug\n3,bad\n5,Alex\n```\n\nLet's check that the `name` column doesn't contain rows with `bug` or `bad`:\n\n```python\nfrom goodtables import validate\n\nreport = validate('data.csv', checks=[\n {'blacklisted-value': {'column': 'name', 'blacklist': ['bug', 'bad']}},\n])\n# error on row 3 with code \"blacklisted-value\"\n# error on row 4 with code \"blacklisted-value\"\n```\n\n##### deviated-value\n\nThis check helps to find outliers in a column containing positive numbers. 
It accepts the following options:\n\n| option | type | description |\n| --- | --- | --- |\n| column | int/str | Column number or name |\n| average | str | Average type, either \"mean\", \"median\" or \"mode\" |\n| interval | int | Values must be inside range `average \u00b1 standard deviation * interval` |\n\nConsider the following CSV file:\n\n```csv\ntemperature\n1\n-2\n7\n0\n1\n2\n5\n-4\n100\n8\n3\n```\n\nWe use `median` to get the average of the column values and allow an interval of 3 standard deviations. In this case the median is `2.0` and the standard deviation is `29.73`, so all valid values must be inside the `[-87.19, 91.19]` interval.\n\n```python\nreport = validate('data.csv', checks=[\n {'deviated-value': {'column': 'temperature', 'average': 'median', 'interval': 3}},\n])\n# error on row 10 with code \"deviated-value\"\n```\n\n##### foreign-key\n\n> Relative paths are supported here. This MUST be used only for trusted data sources.\n\nThis check validates foreign keys within a data package. Consider the data package defined below:\n\n```python\nDESCRIPTOR = {\n 'resources': [\n {\n 'name': 'cities',\n 'data': [\n ['id', 'name', 'next_id'],\n [1, 'london', 2],\n [2, 'paris', 3],\n [3, 'rome', 4],\n # [4, 'rio', None],\n ],\n 'schema': {\n 'fields': [\n {'name': 'id', 'type': 'integer'},\n {'name': 'name', 'type': 'string'},\n {'name': 'next_id', 'type': 'integer'},\n ],\n 'foreignKeys': [\n {\n 'fields': 'next_id',\n 'reference': {'resource': '', 'fields': 'id'},\n },\n {\n 'fields': 'id',\n 'reference': {'resource': 'people', 'fields': 'label'},\n },\n ],\n },\n }, {\n 'name': 'people',\n 'data': [\n ['label', 'population'],\n [1, 8],\n [2, 2],\n # [3, 3],\n # [4, 6],\n ],\n },\n ],\n}\n```\n\nRunning `goodtables` on it will raise a few `foreign-key` errors because we have commented out some rows in the data package's data:\n\n```python\nreport = validate(DESCRIPTOR, checks=['structure', 'schema', 'foreign-key'])\nprint(report)\n```\n\n```\n{'error-count': 2,\n 'preset': 
'datapackage',\n 'table-count': 2,\n 'tables': [{'datapackage': '...',\n 'error-count': 2,\n 'errors': [{'code': 'foreign-key',\n 'message': 'Foreign key \"[\\'next_id\\']\" violation in '\n 'row 4',\n 'message-data': {'fields': ['next_id']},\n 'row-number': 4},\n {'code': 'foreign-key',\n 'message': 'Foreign key \"[\\'id\\']\" violation in row 4',\n 'message-data': {'fields': ['id']},\n 'row-number': 4}],\n 'format': 'inline',\n 'headers': ['id', 'name', 'next_id'],\n 'resource-name': 'cities',\n 'row-count': 4,\n 'schema': 'table-schema',\n 'source': 'inline',\n 'time': 0.031,\n 'valid': False},\n {'datapackage': '...',\n 'error-count': 0,\n 'errors': [],\n 'format': 'inline',\n 'headers': ['label', 'population'],\n 'resource-name': 'people',\n 'row-count': 3,\n 'source': 'inline',\n 'time': 0.038,\n 'valid': True}],\n 'time': 0.117,\n 'valid': False,\n 'warnings': []}\n```\n\nIt experimentally supports external resource checks, for example, for `foreignKey` definitions like these:\n\n```json\n{\"package\": \"../people/datapackage.json\", \"resource\": \"people\", \"fields\": \"label\"}\n{\"package\": \"http://example.com/datapackage.json\", \"resource\": \"people\", \"fields\": \"label\"}\n```\n\n##### sequential-value\n\nThis check covers the pretty common case when a column should contain sequentially incrementing integers. It accepts the following options:\n\n| option | type | description |\n| --- | --- | --- |\n| column | int/str | Column number or name |\n\nConsider the following CSV file:\n\n```csv\nid,name\n1,one\n2,two\n3,three\n5,five\n```\n\nLet's check if the `id` column contains sequential integers:\n\n```python\nfrom goodtables import validate\n\nreport = validate('data.csv', checks=[\n {'sequential-value': {'column': 'id'}},\n])\n# error on row 5 with code \"sequential-value\"\n```\n\n##### truncated-value\n\nSome database or spreadsheet software (like MySQL or Excel) can cut off values when saving. 
There are some well-known heuristics to find these bad values. See https://github.com/propublica/guides/blob/master/data-bulletproofing.md for more detailed information.\n\nConsider the following CSV file:\n\n```csv\nid,amount,comment\n1,14000000,good\n2,2147483647,bad\n3,32767,bad\n4,234234234,bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbad\n```\n\nTo detect all probably-truncated values, we can use the `truncated-value` check:\n\n```python\nreport = validate('data.csv', checks=[\n 'truncated-value',\n])\n# error on row 3 with code \"truncated-value\"\n# error on row 4 with code \"truncated-value\"\n# error on row 5 with code \"truncated-value\"\n```\n\n##### custom-constraint\n\nWith Table Schema we can create constraints for an individual field, but sometimes that's not enough. With the custom-constraint check, every row can be checked against a given limited Python expression in which variable names resolve to column values. See the list of [available operators](https://github.com/danthedeckie/simpleeval#operators). It accepts the following options:\n\n
col1 + col2 == col3)
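
The custom-constraint idea above can be sketched with a small, stdlib-only evaluator. This is an illustration of expression-based row checks, not goodtables' actual implementation (which relies on the `simpleeval` library for safe evaluation); the `check_rows` helper and its signature are hypothetical names used only for demonstration:

```python
# Illustrative sketch of an expression-based row check, similar in spirit
# to the custom-constraint check described above. Goodtables itself uses
# simpleeval; here we use Python's eval() on a restricted namespace purely
# for demonstration (never eval untrusted expressions in real code).

def check_rows(headers, rows, constraint):
    """Return 1-based numbers of rows that violate the constraint expression."""
    violations = []
    for number, row in enumerate(rows, start=1):
        # Column names resolve to this row's cell values
        namespace = dict(zip(headers, row))
        # Empty __builtins__ keeps the expression limited to the row's columns
        if not eval(constraint, {"__builtins__": {}}, namespace):
            violations.append(number)
    return violations

headers = ["value", "quantity", "total"]
rows = [
    [10, 2, 20],
    [5, 3, 15],
    [4, 4, 12],  # violates value * quantity == total
]
print(check_rows(headers, rows, "value * quantity == total"))  # [3]
```

In goodtables itself the expression would instead be passed via the check's options when calling `validate()`, and rows failing it are reported with the `custom-constraint` error code.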