{ "info": { "author": "Cathal Garvey", "author_email": "cathalgarvey@cathalgarvey.me", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Environment :: Console", "License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)", "Natural Language :: English", "Programming Language :: Python", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Topic :: Utilities" ], "description": "# JLTool - Tools for JSON-Lines Records\nby Cathal Garvey, \u00a92016, Released under terms of the GNU AGPLv3 or later\n\nThe [JSON-Lines format](http://jsonlines.org) is a clean alternative to\ncomma-separated values as a means to store data records in a scalable, flat\nmanner, for cases where a database is too much but a flat JSON file is\ninefficient.\n\nJLTool is a tool for working with JSON-Lines records; it assists in schema\nvalidation, duplicate detection, de-duplication and normalisation, and\n'grepping' using objectpath queries.\n\n[ObjectPath](http://objectpath.org/) is supported for most operations; in\nparticular, for fetching a unique, representative ID from objects for purposes\nof deduplication or diffing documents. For grepping, ObjectPath can be used to\nquery for matching rows.\n\nInstalling JLTool with `python3 setup.py install` or `pip install jltool` will install\nthe `jltool` command-line tool, which is the primary intended interface. However,\nfor operations on files, the subcommands of `jltool` are all available in the `jltool`\nimport if desired. Just open it in `ipython` and take a look at the docs on the\ncommand functions for more information.\n\n### Usage Examples\nSay you have a JSON-Lines file '`records.jsonl`' containing records that look like this:\n\n```json\n{\"type\": \"email\", \"value\": \"cathal@isgre.at\", \"meta\": {\"foo\": \"bar\"} }\n```\n\n..which is similar enough to the job I needed doing, when I wrote `jltool`. 
:)\n\nMany of the commands use [objectpath](http://objectpath.org) as an optional\nway of selecting or uniquifying records. Check the documentation there for info.\n\nFor some commands that require a 'fingerprint' for a record in order to work\n(dedupe, report, diff, clean), if an objectpath selector is not given then a\nfingerprint will be generated by normalising objects (sorted keys) and hashing\nthe resulting JSON.\n\nThis may be highly misleading for some kinds of data, for\nexample when a record may represent an updated form of another record, differing\nonly in timestamp. However, do note that in such cases (update records), the first\nmatching result is kept, discarding the rest, by default. This may also not be\ndesired behaviour. An option to reverse this behaviour may be added in the future,\nbut would mean loading everything into memory. Meanwhile, pipe files backwards-linewise\nusing `tac` (on Linux, obviously) to approximate a reversal of this behaviour.\n\n#### Get a Report\nThe `report` subcommand returns a report on the size and structure of a file,\nincluding reporting common keys and keys that have an uncertain type/schema:\n\n```bash\n$ jltool report records.jsonl\nNumber of records: 13\nNumber of Duplicates: 0\nCommon keys: {'type': {'string'}, 'value': {'string'}, 'meta': {'object'}}\n```\n\n#### Filtering Reports\nThe `grep` subcommand allows the use of objectpath queries to filter a JSONL file.\nThe objectpath query must evaluate to a boolean. 
If desired, deduplication\nmay be done prior to selection, by passing a `-s` selector by which to deduplicate\nrecords, but if no `-s` selector is given then no deduplication is performed.\n\n```bash\n$ cat records.jsonl\n{\"type\": \"email\", \"value\": \"foo@bar.com\", \"meta\": {}}\n{\"type\": \"twitter\", \"value\": \"onetruecathal\", \"meta\": {\"awesomeness\": 9001}}\n{\"type\": \"email\", \"value\": \"baz@qux.tld\", \"meta\": {\"lol\": \"wut\"}}\n$ jltool grep '$.type is \"twitter\"' records.jsonl\n{\"type\": \"twitter\", \"value\": \"onetruecathal\", \"meta\": {\"awesomeness\": 9001}}\n```\n\n#### Difference Between Two Files\nThe `diff` command reports records that are present in one file and not the\nother. This is done without regard to order, and hashes or representative\nextracted strings are stored in memory during this operation, so for very large\nfiles this may consume a lot of RAM.\n\nBy default, this uses the hash of a normalised form of each line as a fingerprint,\nbut this is obviously not ideal in cases where metadata, timestamps or other\nbits may cause two functionally identical records to appear different.\n\nTo fix this, you can use objectpath queries to extract a representative string\naccording to your needs, by passing an objectpath query with the `-s` flag.\nThis is also true of many ensuing commands, not just `diff`.\n\n```bash\n$ # Observe query that pulls out type and value for a unique reference..\n$ jltool diff -s '$.type + \":\" + $.value' records.jsonl others.jsonl\n<<< 50: {\"meta\": {\"job\": \"http://www.lol.org\"}, \"type\": \"email\", \"value\": \"kboo@lol.foo\"}\n<<< 51: {\"meta\": {\"job\": \"http://www.baaa.com\"}, \"type\": \"email\", \"value\": \"adonis@rap.com\"}\n>>> 0: {\"meta\": {\"job\": \"http://nonsense.com/\"}, \"type\": \"twitter\", \"value\": \"nonsense\"}\n```\n\n#### Deduplicate\nThe `dedupe` command reports duplicate records. 
This is where objectpath queries\nmay become relevant, because the same \"record\" may have different metadata\nattached, and may therefore appear to be different if simply serialised as\nordered JSON, which is the default.\n\nNote that due to the linewise way reports are made, this may issue notifications\nof duplicates several times as additional duplicates appear, as in the below\nexample.\n\n```bash\n$ jltool dedupe records.jsonl\nDuplicate of line 0 at lines: [13]\nDuplicate of line 2 at lines: [15]\nDuplicate of line 2 at lines: [15, 28]\nDuplicate of line 5 at lines: [18, 31]\nDuplicate of line 10 at lines: [23, 36, 49]\nDuplicate of line 12 at lines: [25, 38, 51]\nFound 39 duplicates.\n```\n\n#### Clean\nThe `clean` subcommand normalises, minifies, and deduplicates jsonl files.\nIt should be used with similar care to other optional-query commands as, if\na query is incorrectly formed, it may result in loss of data.\n\n```bash\n$ ls -lah\ndrwxr-sr-x 4 cathal cathal 4.0K May 31 16:39 .\ndrwxrwxr-x 3 cathal cathal 4.0K May 30 19:52 ..\n-rw-rw-r-- 1 cathal cathal 10K May 31 15:43 records.jsonl\n$ jltool clean records.jsonl dedupe.jsonl\n$ ls -lah\ndrwxr-sr-x 4 cathal cathal 4.0K May 31 16:39 .\ndrwxrwxr-x 3 cathal cathal 4.0K May 30 19:52 ..\n-rw-rw-r-- 1 cathal cathal 2.4K May 31 16:42 dedupe.jsonl\n-rw-rw-r-- 1 cathal cathal 10K May 31 15:43 records.jsonl\n```", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/cathalgarvey/jltool", "keywords": null, "license": "GNU Affero General Public License v3", "maintainer": null, "maintainer_email": null, "name": "jltool", "package_url": "https://pypi.org/project/jltool/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/jltool/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/cathalgarvey/jltool" }, "release_url": 
"https://pypi.org/project/jltool/1.1.0/", "requires_dist": null, "requires_python": null, "summary": "Tools for inspecting, comparing, & cleaning JSON-Lines files", "version": "1.1.0" }, "last_serial": 2143245, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "902d9bff92795c70a0193ae1025e7572", "sha256": "c933cab0859744d6c9ecf191b9bcb415a004a6263ed323cb0e569ecc51ad84d0" }, "downloads": -1, "filename": "jltool-1.0.0.tar.gz", "has_sig": false, "md5_digest": "902d9bff92795c70a0193ae1025e7572", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6292, "upload_time": "2016-05-31T17:32:51", "url": "https://files.pythonhosted.org/packages/63/df/91bd572e7127d3a4d3d89936e82c99333c5bbfd6ad7094e529bc3e2d3b91/jltool-1.0.0.tar.gz" } ], "1.1.0": [ { "comment_text": "", "digests": { "md5": "01e7d922fee0e5f0001145ebd348a9ef", "sha256": "c8ca39b5efe03334972d20944ca1262413c14beafc38fbf82bea3210836950b6" }, "downloads": -1, "filename": "jltool-1.1.0.tar.gz", "has_sig": false, "md5_digest": "01e7d922fee0e5f0001145ebd348a9ef", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6712, "upload_time": "2016-05-31T19:32:30", "url": "https://files.pythonhosted.org/packages/9a/c9/dd046c73fc93eccc08596cbe36947e5fa16cac3056c44aa02518e76a402b/jltool-1.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "01e7d922fee0e5f0001145ebd348a9ef", "sha256": "c8ca39b5efe03334972d20944ca1262413c14beafc38fbf82bea3210836950b6" }, "downloads": -1, "filename": "jltool-1.1.0.tar.gz", "has_sig": false, "md5_digest": "01e7d922fee0e5f0001145ebd348a9ef", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6712, "upload_time": "2016-05-31T19:32:30", "url": "https://files.pythonhosted.org/packages/9a/c9/dd046c73fc93eccc08596cbe36947e5fa16cac3056c44aa02518e76a402b/jltool-1.1.0.tar.gz" } ] }