{ "info": { "author": "James C. Palmer", "author_email": "me@jcp.io", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "# Data Filter\n\n[![pypi](https://img.shields.io/pypi/v/datafilter.svg?color=brightgreen)](https://pypi.org/project/datafilter/)\n[![pypi](https://img.shields.io/pypi/pyversions/datafilter.svg)](https://pypi.org/project/datafilter/)\n[![codecov](https://codecov.io/gh/jcp/datafilter/branch/master/graph/badge.svg)](https://codecov.io/gh/jcp/datafilter/)\n[![Build Status](https://travis-ci.org/jcp/datafilter.svg?branch=master)](https://travis-ci.org/jcp/datafilter/)\n\nQuickly find tokens (words, phrases, etc) within your data.\n\nData Filter is a lightweight [data cleansing](https://en.wikipedia.org/wiki/Data_cleansing) framework that can be easily extended to support different data types, structures or processing requirements. It natively supports the following data types:\n\n* CSV files\n* Text files\n* Text strings\n\n# Table of Contents\n\n* [Requirements](#requirements)\n* [Installation](#installation)\n* [Basic Usage](#basic-usage)\n* [Features](#features)\n * [Base](#base)\n * [Filters](#filters)\n * [CSV](#csv)\n * [Text](#text)\n * [TextFile](#textfile)\n\n# Requirements\n\n* Python 3.6+\n\n# Installation\n\nTo install, simply use [pipenv](http://pipenv.org/) (or pip):\n\n```bash\n>>> pipenv install datafilter\n```\n\n# Basic Usage\n\n## CSV\n\n```python\nfrom datafilter import CSV\n\ntokens = [\"Lorem\", \"ipsum\", \"Volutpat est\", \"mi sit amet\"]\ndata = CSV(\"test.csv\", tokens=tokens)\ndata.save(\"filtered.csv\")\n```\n\nIn this example, we open a CSV file, search all rows for normalized tokens and flag them. The `save` method creates a new CSV file with all rows that weren't flagged.\n\n## Text\n\n```python\nfrom datafilter import Text\n\ntext = \"Lorem ipsum dolor sit amet, consectetur adipiscing elit\"\ndata = Text(text, tokens=[\"Lorem\"])\nprint(next(data.results()))\n```\n\nIn this example, we search a text string for normalized tokens. We can then iterator over the results using the `.results()` method, which returns a generator that yields [formatted results](#parse).\n\n## Text File\n\n```python\nfrom datafilter import TextFile\n\ndata = TextFile(\"test.txt\", tokens=[\"Lorem\", \"ipsum\"], re_split=r\"(?<=\\.)\")\nprint(next(data.results()))\n```\n\nIn this example, we open a text file and split the data based on a regular expression defined by `re_split`. \n\n# Features\n\nData Filter was designed to be highly extensible. Common or useful filters can be easily reused and shared. A few example use cases include:\n\n* Filters that can handle different data types such as Microsoft Word, Google Docs, etc.\n* Filters that can handle incoming data from external APIs.\n\n## Base\n\nAbstract base class that's subclassed by every filter.\n\n`Base` includes several methods to ensure data is properly normalized, formatted and returned. The `.results()` method is an `@abstractmethod` to enforce its use in subclasses.\n\n### Parameters\n\n#### `tokens`\n\n`type `\n\nA list of strings that will be searched for within a set of data.\n\n#### `translations`\n\n`type `\n\nA list of strings that will be removed during normalization.\n\n**Default**\n\n```python\n['0123456789', '(){}[]<>!?.:;,`\\'\"@#$%^&*+-|=~\u2013\u2014/\\\\_', '\\t\\n\\r\\x0c\\x0b']\n```\n\n#### `bidirectional`\n\n`type `\n\nWhen `True`, token matching will be bidirectional. \n\n**Default**\n\n```python\nTrue\n```\n\n> **Note:**\n>\n> A common obfuscation method is to reverse the offending string or phrase. This helps detect those instances.\n\n#### `caseinsensitive`\n\n`type `\n\nWhen `True`, tokens and data are converted to lowercase during normalization.\n\n**Default**\n\n```python\nTrue\n```\n\n### Methods\n\n#### `.results()`\n\nAbstract method used to return results within a filter. This is defined by a `Base` subclass\n\n#### `.maketrans()`\n\nReturns a translation table used during normalization.\n\n**Returns**\n\n`type `\n\n#### `.normalize(data)`\n\nReturns normalized data. Normalization includes converting data to [lowercase](#caseinsensitive) and [removing strings](#translations).\n\nAccepts parameter `data`.\n\n**Returns**\n\n`type `\n\n> **Note:**\n>\n> Normalized data is returned as a tuple. The first element is the original data. The second element is the normalized data.\n>\n\n#### `.parse(data)`\n\nReturns parsed and formatted data.\n\nAccepts parameter `data`.\n\n**Returns**\n\n`type `\n\n**Example**\n\nAssume we're searching for the token \"Lorem\" in a very short text string.\n\n```python\ndata = Text(\"Lorem ipsum dolor sit amet\", tokens=[\"Lorem\"])\nprint(next(data.results()))\n```\n\nThe returned result would be formatted as:\n\n```python\n{\n \"data\": \"Lorem ipsum dolor sit amet\",\n \"flagged\": True,\n \"describe\": {\n \"tokens\": {\n \"detected\": [\"Lorem\"],\n \"count\": 1,\n \"frequency\": {\n \"Lorem\": 1,\n },\n }\n },\n}\n```\n\n> **Note:**\n>\n> `.parse()` should never be called directly. Use `.results()` instead.\n\n## Filters\n\nFilters subclass and extend the `Base` class to support various data types and structure. This extensibility allows for the creation of powerful custom filters specifically tailored to a given task, data type or structure.\n\n## CSV\n\n### Parameters\n\n`CSV` is a subclass of `Base` and inherits all parameters.\n\n#### `path`\n\n`type `\n\nPath to a CSV file.\n\n### Methods\n\n`CSV` is a subclass of `Base` and inherits all methods.\n\n#### `.save(path)`\n\nSaves results to a file.\n\nAccepts parameter `path`. `path` is the absolute path and filename of the new file.\n\n## Text\n\n### Parameters\n\n`Text` is a subclass of `Base` and inherits all parameters.\n\n#### `text`\n\n`type `\n\nA text string.\n\n#### `re_split`\n\n`type `\n\nA regular expression pattern or string that will be applied to `text` with `re.split` before normalization.\n\n### Methods\n\n`Text` is a subclass of `Base` and inherits all methods.\n\n#### `.save(path, endofline=\" \")`\n\nSaves results to a file.\n\nAccepts parameter `path` and `endofline`. `path` is the absolute path and filename of the new file. `endofline` is a line delimiter that will be added to the end of every row.\n\n## TextFile\n\n### Parameters\n\n`TextFile` is a subclass of `Base` and inherits all parameters.\n\n#### `path`\n\n`type `\n\nPath to a text file.\n\n#### `re_split`\n\n`type `\n\nA regular expression pattern or string that will be applied to `text` with `re.split` before normalization.\n\n### Methods\n\n`TextFile` is a subclass of `Base` and inherits all methods.\n\n#### `.save(path, endofline=\" \")`\n\nSaves results to a file.\n\nAccepts parameter `path` and `endofline`. `path` is the absolute path and filename of the new file. `endofline` is a line delimiter that will be added to the end of every row.\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/jcp/datafilter", "keywords": "", "license": "BSD 3-Clause", "maintainer": "", "maintainer_email": "", "name": "datafilter", "package_url": "https://pypi.org/project/datafilter/", "platform": "", "project_url": "https://pypi.org/project/datafilter/", "project_urls": { "Homepage": "https://github.com/jcp/datafilter" }, "release_url": "https://pypi.org/project/datafilter/0.4.2/", "requires_dist": null, "requires_python": ">=3.6", "summary": "Quickly find tokens (words, phrases, etc) within your data.", "version": "0.4.2" }, "last_serial": 5705954, "releases": { "0.1.2": [ { "comment_text": "", "digests": { "md5": "8f59e4f118cecd0d3eb7e5dd2ca4559b", "sha256": "1a18aa407c17adff6902066fa6ee94225cbdce62673f9f451c8e211cd087159b" }, "downloads": -1, "filename": "datafilter-0.1.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "8f59e4f118cecd0d3eb7e5dd2ca4559b", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6", "size": 7915, "upload_time": "2019-08-11T17:03:23", "url": "https://files.pythonhosted.org/packages/52/a6/ec50058e3f63c50bd6131c02b5e599423e844d3781dbb83286324734f85f/datafilter-0.1.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "54a21bdd320a0948c51f9c685890eb1a", "sha256": "3fe053773908a907359db0ffc678138aaa89fc76ede229d9bcc048f170a759c1" }, "downloads": -1, "filename": "datafilter-0.1.2.tar.gz", "has_sig": false, "md5_digest": "54a21bdd320a0948c51f9c685890eb1a", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 7412, "upload_time": "2019-08-11T17:03:24", "url": "https://files.pythonhosted.org/packages/4b/c5/d5f613338259de7e52851f6d377c5eae58366626b3bd499d343c042a53ea/datafilter-0.1.2.tar.gz" } ], "0.2.0": [ { "comment_text": "", "digests": { "md5": "af3a97e27ce16be76b1b049c0e99bfe1", "sha256": "588eeceafb17ec49a8bb0f7c166b429e67914ff180bb40e46b6ca6d4910dbd45" }, "downloads": -1, "filename": "datafilter-0.2.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "af3a97e27ce16be76b1b049c0e99bfe1", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6", "size": 6929, "upload_time": "2019-08-12T00:02:32", "url": "https://files.pythonhosted.org/packages/db/cf/4df710f72684c68d962c6c2697070e8d576433f5f7d87f4e8a8f53eaa5e2/datafilter-0.2.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "2e8575c302260fe1f4ebaa9e277fac99", "sha256": "20787de8c01ae1f5d862c41c06914dc9cb065e6cd9e9726ac8738edb06cfa76b" }, "downloads": -1, "filename": "datafilter-0.2.0.tar.gz", "has_sig": false, "md5_digest": "2e8575c302260fe1f4ebaa9e277fac99", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 6772, "upload_time": "2019-08-12T00:02:33", "url": "https://files.pythonhosted.org/packages/7d/a0/5843b09c1a5b767d95de1513ab4011dedeef10db9b88e59482120bb6c81c/datafilter-0.2.0.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "892d44b41643e2a5952cc1ce3d8cd66f", "sha256": "f215e4c0bd8a6943ffa12b8e5e3ebd6ccf2f8e68e086194340025bca4041b0e6" }, "downloads": -1, "filename": "datafilter-0.3.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "892d44b41643e2a5952cc1ce3d8cd66f", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6", "size": 6814, "upload_time": "2019-08-13T03:34:09", "url": "https://files.pythonhosted.org/packages/9a/6a/62e7237fd101d7fd6dd5e56655b7a2c9d260852cf9682e38965991f6ceac/datafilter-0.3.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e5e7ec692c3a5b0b55f42e72437e3c1b", "sha256": "89776351b2eae8b33aebe475d77a69ec15e566365e3590a0091386aeee82ac04" }, "downloads": -1, "filename": "datafilter-0.3.0.tar.gz", "has_sig": false, "md5_digest": "e5e7ec692c3a5b0b55f42e72437e3c1b", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 6491, "upload_time": "2019-08-13T03:34:11", "url": "https://files.pythonhosted.org/packages/4f/33/acb6b16dc913e6dec441661c2f323fb14a90e0d27afbffcc5a8a74da146a/datafilter-0.3.0.tar.gz" } ], "0.4.1": [ { "comment_text": "", "digests": { "md5": "489496059866487273406802d1ae3226", "sha256": "9c2cfa4e7f688e2c4dffb98791ff6fd25ba62d40886e22fd2497b1290bc11831" }, "downloads": -1, "filename": "datafilter-0.4.1-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "489496059866487273406802d1ae3226", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6", "size": 7574, "upload_time": "2019-08-14T16:21:14", "url": "https://files.pythonhosted.org/packages/02/3c/417036ca167c10067ce6a31f27a5966c5c4231465c5f6bd135ef7d09f9f0/datafilter-0.4.1-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1ede077867fae05408f29ca62fab4ab2", "sha256": "8eb943af91df0989ab00570a04d1a9149c3d2bd689b47ab94985666ce31cfbc0" }, "downloads": -1, "filename": "datafilter-0.4.1.tar.gz", "has_sig": false, "md5_digest": "1ede077867fae05408f29ca62fab4ab2", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 6963, "upload_time": "2019-08-14T16:21:16", "url": "https://files.pythonhosted.org/packages/c3/52/9d453c4514f303c6c95e48c5e0c12498d61ebe07b3f30577751467663ce4/datafilter-0.4.1.tar.gz" } ], "0.4.2": [ { "comment_text": "", "digests": { "md5": "662da2bb309338c58d997955f98594bf", "sha256": "c44cda907badbdd3151e75968f487f8c581db26c2349820fe497517a23ef0a49" }, "downloads": -1, "filename": "datafilter-0.4.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "662da2bb309338c58d997955f98594bf", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6", "size": 7574, "upload_time": "2019-08-20T22:24:22", "url": "https://files.pythonhosted.org/packages/11/6f/d74a3291e26831a4148108376f50d6df2dde9550448733d0cb33022d8bed/datafilter-0.4.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d3d7cd39ca78ac6dc9076f6d9f43d036", "sha256": "32588f7a941653f27b5d0ed195a3bcc34f0e5d8465b5c5efb2fd5d2273085110" }, "downloads": -1, "filename": "datafilter-0.4.2.tar.gz", "has_sig": false, "md5_digest": "d3d7cd39ca78ac6dc9076f6d9f43d036", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 6962, "upload_time": "2019-08-20T22:24:24", "url": "https://files.pythonhosted.org/packages/f8/41/c512c54c51acc2b6bb7cc6e5842136cf7a38fa819ead5d02bed74b538fe9/datafilter-0.4.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "662da2bb309338c58d997955f98594bf", "sha256": "c44cda907badbdd3151e75968f487f8c581db26c2349820fe497517a23ef0a49" }, "downloads": -1, "filename": "datafilter-0.4.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "662da2bb309338c58d997955f98594bf", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6", "size": 7574, "upload_time": "2019-08-20T22:24:22", "url": "https://files.pythonhosted.org/packages/11/6f/d74a3291e26831a4148108376f50d6df2dde9550448733d0cb33022d8bed/datafilter-0.4.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "d3d7cd39ca78ac6dc9076f6d9f43d036", "sha256": "32588f7a941653f27b5d0ed195a3bcc34f0e5d8465b5c5efb2fd5d2273085110" }, "downloads": -1, "filename": "datafilter-0.4.2.tar.gz", "has_sig": false, "md5_digest": "d3d7cd39ca78ac6dc9076f6d9f43d036", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6", "size": 6962, "upload_time": "2019-08-20T22:24:24", "url": "https://files.pythonhosted.org/packages/f8/41/c512c54c51acc2b6bb7cc6e5842136cf7a38fa819ead5d02bed74b538fe9/datafilter-0.4.2.tar.gz" } ] }