{ "info": { "author": "", "author_email": "", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "Programming Language :: Python :: 3.7" ], "description": "# Arche\n\n[![PyPI](https://img.shields.io/pypi/v/arche.svg)](https://pypi.org/project/arche)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/arche.svg)](https://pypi.org/project/arche)\n![GitHub](https://img.shields.io/github/license/scrapinghub/arche.svg)\n[![Build Status](https://travis-ci.com/scrapinghub/arche.svg?branch=master)](https://travis-ci.com/scrapinghub/arche)\n[![Codecov](https://img.shields.io/codecov/c/github/scrapinghub/arche.svg)](https://codecov.io/gh/scrapinghub/arche)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black)\n[![GitHub commit activity](https://img.shields.io/github/commit-activity/m/scrapinghub/arche.svg)](https://github.com/scrapinghub/arche/commits/master)\n\n pip install arche\n\nArche (pronounced *Arkey*) helps verify scraped data using a set of defined rules, for example:\n * Validation with [JSON schema](https://json-schema.org/)\n * Coverage (items, fields, categorical data, including booleans and enums)\n * Duplicates\n * Garbage symbols\n * Comparison of two jobs\n \n_We use it at Scrapinghub, among other tools, to ensure the quality of scraped data_\n\n## Installation\n\nArche requires a [Jupyter](https://jupyter.org/install) environment and supports both the [JupyterLab](https://github.com/jupyterlab/jupyterlab#installation) and [Notebook](https://github.com/jupyter/notebook) UIs\n\nFor JupyterLab, you will need to properly install [plotly extensions](https://github.com/plotly/plotly.py#jupyterlab-support-python-35)\n\nThen just `pip install arche`\n\n## Why\nTo check the quality of scraped data continuously. For example, if you scraped a website, a typical approach would be to validate the data with Arche. 
You can also create a schema and then set up [Spidermon](https://spidermon.readthedocs.io/en/latest/item-validation.html#with-json-schema)\n\n## Developer Setup\n\n\tpipenv install --dev\n\tpipenv shell\n\ttox\n\n## Contribution\nAny contributions are welcome! See https://github.com/scrapinghub/arche/issues if you want to take on something or suggest an improvement/report a bug.\n\n# Changes\nMost recent releases are shown at the top. Each release shows:\n\n- **Added**: New classes, methods, functions, etc\n- **Changed**: Additional parameters, changes to inputs or outputs, etc\n- **Fixed**: Bug fixes that don't change documented behaviour\n\nNote that the top-most release lists changes in the unreleased master branch on GitHub. Parentheses after an item show the name or GitHub ID of the contributor of that change.\n\n[Keep a Changelog](https://keepachangelog.com/en/1.0.0/), [Semantic Versioning](https://semver.org/spec/v2.0.0.html).\n\n\n## [0.3.6] (2019-07-12)\n### Added\n- **Categories** rule with a plot showing unique values and counts per field. By default, `report_all()` only includes fields which have 10 or fewer unique values. 
See https://arche.readthedocs.io/en/latest/nbs/Rules.html#Category-fields, #100\n- Category documentation\n### Changed\n- `Arche.report_all()` does not shorten the report by default; added the `short` parameter.\n- Data is consistent with Dash and Spidermon: `_type, _key` fields are dropped from dataframe, raw data, basic schema, #104, #106\n- `df.index` now stores `_key` instead\n- `basic_json_schema()` works with `deleted` jobs\n- `start` is supported for Collections, #112\n- `enum` is counted as a `category` tag, #18\n- `Garbage Symbols` searches in the str representation of nested fields instead of the expanded df, #130\n- Show real coverage difference (negative\\positive) instead of absolute, #114\n### Fixed\n- `Arche.glance()`, #88\n- Item links in Schema validation errors, #89\n- Empty NAN bars on category graphs, #93\n- `data_quality_report()`, #95\n- Wrong number of Collection Items if it contains item 0, #112\n### Removed\n- **Responses Per Item Ratio** rule\n- Deprecated `expand` parameter and removed `flat_df`, since `Garbage Rule` deals with nested data itself, #133\n\n\n## [0.3.5] (2019-05-14)\n### Added\n- `Arche()` supports any iterables with item dicts, fixing jsonschema consistency, #83\n- `Items.from_array` to read raw data from iterables, #83\n### Changed\n- If reading from pandas df directly, store raw data in a numpy array. 
See gotchas http://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#support-for-integer-na\n### Fixed\n### Removed\n\n\n## [0.3.4] (2019-05-06)\n### Fixed\n- basic_json_schema() fails with long `1.0` types, #80\n\n\n## [0.3.3] (2019-05-03)\n### Added\n- Accept dataframes as source or target, #69\n### Changed\n- data_quality_report plots the same \"Fields Coverage\" instead of green \"Scraped Fields Coverage\"\n- Plot theme changed from ggplot2 to seaborn, #62\n- Same target and source raise an error, was a warning before\n- Passed rules marked with green PASSED.\n### Fixed\n- Online documentation now renders graphs https://arche.readthedocs.io/en/latest/, #41\n- Error colours are back in `report_all()`. \n### Removed\n- Deprecated `Arche.basic_json_schema()`, use `basic_json_schema()`\n- Removed Quickstart.md as redundant - documentation lives in notebooks\n\n\n## [0.3.2] (2019-04-18)\n### Added\n- Allow reading private raw schemas directly from bitbucket, #58\n### Changed\n- Progress widgets are removed before printing graphs\n- New plotly v4 API\n### Fixed\n- Failing `Compare Prices For Same Urls` when url is `nan`, #67\n- Empty graphs in Jupyter Notebook, #63\n### Removed\n- Scraped Items History graphs\n\n\n## [0.3.1] (2019-04-12)\n### Fixed\n- Empty graphs due to lack of plotlyjs, #61\n\n\n## [0.3.0] (2019-04-12)\n### Fixed\n- Big notebook size, replaced cufflinks with plotly and ipython, #39\n### Changed\n- *Fields Coverage* now is printed as a bar plot, #9\n- *Fields Counts* renamed to *Coverage Difference* and results in 2 bar plots, #9, #51:\n * *Coverage from job stats fields counts* which reflects coverage for each field for both jobs\n * *Coverage difference more than 5%* which prints >5% difference between the coverages (was ratio difference before)\n- *Compare Scraped Categories* renamed to *Category Coverage Difference* and results in 2 bar plots for each category, #52:\n * *Coverage for `field`* which reflects value counts (categories) 
coverage for the field for both jobs\n * *Coverage difference more than 10% for `field`* which shows >10% differences between the category coverages\n- *Boolean Fields* plots *Coverage for boolean fields* graph which reflects normalized value counts for boolean fields for both jobs, #53\n### Removed\n- `cufflinks` dependency\n- Deprecated `category_field` tag\n\n\n## [2019.03.25]\n### Added\n- CHANGES.md\n- new `arche.rules.duplicates.find_by()` to find duplicates by chosen columns\n```\nimport arche\nfrom arche.readers.items import JobItems\ndf = JobItems(0, \"235801/1/15\").df\narche.rules.duplicates.find_by(df, [\"title\", \"category\"]).show()\n```\n- `basic_json_schema().json()` prints a schema in JSON format\n- `Result.show()` to print a rule result, e.g.\n```\nfrom arche.rules.garbage_symbols import garbage_symbols\nfrom arche.readers.items import JobItems\nitems = JobItems(0, \"235801/1/15\")\ngarbage_symbols(items).show()\n```\n- notebooks to documentation\n### Changed\n- Tags rule returns unused tags, #2\n- `basic_json_schema()` prints a schema as a python dict\n### Deprecated\n- `Arche().basic_json_schema()` deprecated in favor of `arche.basic_json_schema()`\n### Removed\n### Fixed\n- `Arche().basic_json_schema()` not using `items_numbers` argument\n\n\n## 2019.03.18\n- Last release without CHANGES updates", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/scrapinghub/arche", "keywords": "scrapinghub,scraping,data,data-visualization,data-analysis,pandas", "license": "MIT", "maintainer": "Scrapinghub, manycoding", "maintainer_email": "info@scrapinghub.com", "name": "arche", "package_url": "https://pypi.org/project/arche/", "platform": "", "project_url": "https://pypi.org/project/arche/", "project_urls": { "Homepage": "http://github.com/scrapinghub/arche" }, "release_url": "https://pypi.org/project/arche/0.3.6/", 
"requires_dist": null, "requires_python": "", "summary": "Analyze Scrapy Cloud data", "version": "0.3.6" }, "last_serial": 5524468, "releases": { "0.3.0": [ { "comment_text": "", "digests": { "md5": "4153fc7a8529482c0d5b991af6563f7d", "sha256": "0295f7752d0f8c7805103e5470ad2c2e51231e38f3be0fbb8996a3843bec8196" }, "downloads": -1, "filename": "arche-0.3.0.tar.gz", "has_sig": false, "md5_digest": "4153fc7a8529482c0d5b991af6563f7d", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.7", "size": 26531, "upload_time": "2019-04-12T23:49:53", "url": "https://files.pythonhosted.org/packages/2d/a9/2350b81db2e7b040572835e035eaf3942abdcef35b835e6c7a0ac5627c1e/arche-0.3.0.tar.gz" } ], "0.3.1": [ { "comment_text": "", "digests": { "md5": "967fbdc46adb60f9ff85d80d0f5fee9c", "sha256": "5104fc33d03a5786ede58c4f086d390b54cac285ee0bf4ff16ad0d61a95180b9" }, "downloads": -1, "filename": "arche-0.3.1.tar.gz", "has_sig": false, "md5_digest": "967fbdc46adb60f9ff85d80d0f5fee9c", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.7", "size": 26575, "upload_time": "2019-04-13T00:57:03", "url": "https://files.pythonhosted.org/packages/7d/ba/74e06088b53796039d5821f705974becc081f07091ca76e560b21cb19a76/arche-0.3.1.tar.gz" } ], "0.3.2": [ { "comment_text": "", "digests": { "md5": "2a6148ded293af8460d875ea030febfc", "sha256": "a58eee20e960ee77a19342d975e1cb12b24156935b5903c6f6c4f4f5f23b1e78" }, "downloads": -1, "filename": "arche-0.3.2.tar.gz", "has_sig": false, "md5_digest": "2a6148ded293af8460d875ea030febfc", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 442651, "upload_time": "2019-04-18T20:45:57", "url": "https://files.pythonhosted.org/packages/c6/2b/ffe6ffe40d8bb4796c677c1d399ae91e02e5eb6b849f7212e7892e7572a6/arche-0.3.2.tar.gz" } ], "0.3.3": [ { "comment_text": "", "digests": { "md5": "965e7e970685c7eab83d5f9df430f3f5", "sha256": "4bad14c3abc187a3415418d5e9ab35ca6ba60915a0214e3bda445a5c3ab081a3" }, 
"downloads": -1, "filename": "arche-0.3.3.tar.gz", "has_sig": false, "md5_digest": "965e7e970685c7eab83d5f9df430f3f5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1213686, "upload_time": "2019-05-03T23:10:56", "url": "https://files.pythonhosted.org/packages/99/46/83c47352b5a3069c213ad6cc6537dee56bc67a6b41c5239f3782e59f0d79/arche-0.3.3.tar.gz" } ], "0.3.4": [ { "comment_text": "", "digests": { "md5": "470e71a8e055ea3ed09fa610470c8e7d", "sha256": "54bf46b03943e5136fa54955e5e33f78c6b60652d2cbd636f5b4d98895d63ca8" }, "downloads": -1, "filename": "arche-0.3.4.tar.gz", "has_sig": false, "md5_digest": "470e71a8e055ea3ed09fa610470c8e7d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1213759, "upload_time": "2019-05-06T18:35:30", "url": "https://files.pythonhosted.org/packages/29/33/4267857e162b3279a2264017f09650802f3b39111c369ff05ef7d3080c44/arche-0.3.4.tar.gz" } ], "0.3.5": [ { "comment_text": "", "digests": { "md5": "0a7e47f69ffacc12d4aa292a08183793", "sha256": "b948983b97a7ebdb93ca4b2764d9f195b21e36a97ade9c59e901366f6df560a3" }, "downloads": -1, "filename": "arche-0.3.5.tar.gz", "has_sig": false, "md5_digest": "0a7e47f69ffacc12d4aa292a08183793", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1693489, "upload_time": "2019-05-14T19:24:35", "url": "https://files.pythonhosted.org/packages/46/b8/c85d1500a846434525325953d081490a5f51800df523211a973e25755553/arche-0.3.5.tar.gz" } ], "0.3.6": [ { "comment_text": "", "digests": { "md5": "47c8a8f3dcc0879a43bc6a2868c48c86", "sha256": "f19aca0d572e4cb25da064adbec898a27924b80c36205d5c50717ae83a986e4d" }, "downloads": -1, "filename": "arche-0.3.6.tar.gz", "has_sig": false, "md5_digest": "47c8a8f3dcc0879a43bc6a2868c48c86", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2691897, "upload_time": "2019-07-12T18:27:05", "url": 
"https://files.pythonhosted.org/packages/3c/20/ba1b60a885cd417d18ec00c4a86246cb6ffbf71745332db0b65be5d8f64f/arche-0.3.6.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "47c8a8f3dcc0879a43bc6a2868c48c86", "sha256": "f19aca0d572e4cb25da064adbec898a27924b80c36205d5c50717ae83a986e4d" }, "downloads": -1, "filename": "arche-0.3.6.tar.gz", "has_sig": false, "md5_digest": "47c8a8f3dcc0879a43bc6a2868c48c86", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 2691897, "upload_time": "2019-07-12T18:27:05", "url": "https://files.pythonhosted.org/packages/3c/20/ba1b60a885cd417d18ec00c4a86246cb6ffbf71745332db0b65be5d8f64f/arche-0.3.6.tar.gz" } ] }