{ "info": { "author": "Igor Grillo Peternella", "author_email": "igor.feq@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "# regex4ocr\n\nA simple library for plugging in regular expression models (Document Regexp Models) to parse your favorite OCR string output and extract important fields\nfrom the OCR picture.\n\n## Built with\n\n* Python 3.7.1 (but the library supports any Python 3.x);\n* PyYAML for yml parsing;\n* Tox and Pytest for testing.\n\n## Document Regexp Model (DRM)\n\nDRMs are yml files that describe the desired documents with regular expressions. These are used to extract the document data in order to transform the OCR results into a structured Python dict.\n\nAn example is as follows:\n\n```yml\nidentifiers:\n - id_regexp_1\n - id_regexp_2\n - id_regexp_3\nfields:\n key1: regexp1\n key2: regexp2\noptions:\n lowercase: true\n remove_whitespace: false\n force_ascii: true\n replace:\n - ['replace_regex_1', 'substitution_str_1']\n - ['replace_regex_2', 'substitution_str_2']\ntable:\n header: table header regex\n line_start: table row beginning regex\n footer: table footer regex\n```\n\n### DRM (Document Regexp Model) fields\n\nIn order for the OCR string of a picture to be parsed properly, regular expressions that model the given document picture must be set in a yml file. The following fields are allowed to model the OCR document data string:\n\n```identifiers```: list of regexps that determine whether this DRM should be used for the given OCR string. ALL regexp identifiers must match or the DRM will not be used for the given document. When using the ```regex4ocr.parse``` function, a path to the folder which contains the yml DRMs is passed. Hence, if a DRM does not match the OCR document because not all of its identifier regexps were found in the OCR string, the next DRM in the folder will be tested until a matching one is found.\n\n```fields```: defines key and regexp pairs. 
The regexps will be matched against the OCR string in order to extract the desired data. The key names will be used in the final Python dictionary.\n\n```options```: defines optional preprocessing of the OCR string. Regexp replacements can be performed on the OCR string, the OCR string can be lowercased, whitespace can be removed and non-ASCII characters can be coerced to their closest ASCII match, e.g., ```\u00e7``` is cast to ```c```.\n\n```table```: defines tabular data to be extracted. Such data is very common on purchase receipts, where the table header marks the beginning of the table.\n\n```header```: regexp which marks the header of the table data in the OCR string.\n\n```line_start```: regexp that matches the beginning of EVERY new line of the table. The rows field in the final dictionary requires this field.\n\n```footer```: regexp that matches the end of the table data. This is used to stop trying to parse the table rows.\n\n## Transform OCR images into structured data\n\nThis library allows one to convert the OCR result string to the following structured format:\n\n```python\n{\n \"fields\": {\n \"user_defined_field_1\": \"result_1\",\n \"user_defined_field_2\": \"result_2\",\n },\n \"table\": {\n \"header\": \"table header\",\n \"all_rows\": \"row 1 result row 2 result\",\n \"rows\": [\"row 1 result\", \"row 2 result\"],\n \"footer\": \"table footer\"\n }\n}\n```\n\n## Installing and using\n\nTo install the library, just run pip install:\n\n```bash\npip install regex4ocr\n```\n\nTo use it, just import the library:\n\n```python\nimport regex4ocr # use the regex4ocr.parse function\nfrom regex4ocr import parse # import the parse function directly\n```\n\n### Parse function\n\nThis OCR string parsing library is based on one function: ```parse```. 
Use it as follows:\n\n```python\nfrom regex4ocr import parse\n\n# ocr string is the result string generated by your OCR system (Google Vision, etc.)\n# drms_folder_path is the os folder path to the folder which contains the yml DRM models\nparse(ocr_string, drms_folder_path)\n```\n\n## Getting ready with local development\n\nIn a system with Python pip, install the dev requirements:\n\n```bash\npip install -r requirements_dev.txt\n```\n\nTo locally use the main parse function, create a Python module at the root of the project and import the parse function:\n\n```python\nimport regex4ocr\n\nregex4ocr.parse(ocr_result_string, folder_with_drms)\n```\n\n## Testing\n\nRegex4ocr has been tested under Python 3.x environments with tox and pytest. In order to run such tests:\n\n```bash\ntox # executes all tests for Python 3.0, 3.3, 3.5 and 3.7\n```\n\nTo execute the tests for a single environment (faster testing):\n\n```bash\ntox -e 3.7\n```\n\n## Code Linting\n\nThis project uses the following linters:\n\n* pylint;\n* flake8;\n* pep8.\n\nAlso, the code is formatted with the Black formatter (https://github.com/ambv/black). 
Some Black configuration can be found\nin the ```pyproject.toml``` file.\n\n## OCR and DRM example:\n\nGiven the following DRM, stored in a folder named ```my_drms```:\n\n```yml\nidentifiers:\n - cupom fiscal\nfields:\n cnpj: 'cnpj:\\s*(\\d{2}\\.\\d{3}\\.\\d{3}\\/\\d{4}-\\d{2})'\n coo: 'coo:\\s*(\\d{6})'\n date: '\\d{2}\\/\\d{2}\\/\\d{4}\\s*\\d{2}:\\d{2}:\\d{2}'\noptions:\n lowercase: true\n remove_whitespace: false\n force_ascii: true\n replace:\n - ['c00', 'coo']\n - ['c10', 'coo']\ntable:\n header: (item|iten)\\s+codigo.*vl.*(?=\\n)\n line_start: \\n\\d*.+\\d+\n footer: total\\s*r\\$\n```\n\nAnd given the OCR string result for a picture (from Google Vision, Tesseract OCR, etc.), import the\nregex4ocr library and use its parse function in order to extract structured data (Python dict) from it:\n\n\n```python\nimport regex4ocr # import the library\n\n# stores the OCR string result generated by a computer vision package (Google Vision, etc.)\nocr_result = \"\"\"\ncnpj: 11.123.456/0001-99\nie:111.111.111. 
111\nim: 123456-7\n2570972078 17:54: ttccf 3045759 **** c00:047621\ncupom fiscal\niten codigo descricao qid un vl unit r$ ) st vl item(r$)\n17273 breit grossa -7mts\" bunx373 ft 288 026\n2 $17 pedra 1 (ht) 2unx84 694 f1\n169 38g\n003 515 cimento votoran todas as obras 50 kg\ncred)\nboun x 26.489 f1\n794,676\ntotal r$\n1.247.07\ncheque\n1.247.09\ntroco r$\n\"\"\"\n\n# insert the path to a folder where the DRM (yml files) are stored\ndrms_folder_path = \"./my_drms\"\n\n# generate structured data (JSON-like) from the unstructured OCR string\nextracted_data = regex4ocr.parse(ocr_result, drms_folder_path)\n\n# prints the result\nprint(extracted_data)\n```\n\nHere's the final result:\n\n```python\n{\n \"fields\": {\n \"cnpj\": \"11.123.456/0001-99\",\n \"coo\": \"047621\",\n },\n \"table\": {\n \"header\": \"iten codigo descricao qid un vl unit r$ ) st vl item(r$)\",\n \"all_rows\": \"\\n17273 breit grossa -7mts bunx373 ft 288 026\\n2 $17 pedra 1 (ht) 2unx84 694 f1\\n169 38g\\n003 515 cimento votoran todas as obras 50 kg\\ncred)\\nboun x 26.489 f1\\n794,676\\n\",\n \"rows\": [\"17273 breit grossa -7mts bunx373 ft 288 026\", \"2 $17 pedra 1 (ht) 2unx84 694 f1\", \"169 38g\", \"003 515 cimento votoran todas as obras 50 kg cred) boun x 26.489 f1794,676\"],\n \"footer\": \"total r$\"\n }\n}\n```", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/juntossomosmais/regex4ocr", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "regex4ocr", "package_url": "https://pypi.org/project/regex4ocr/", "platform": "", "project_url": "https://pypi.org/project/regex4ocr/", "project_urls": { "Homepage": "https://github.com/juntossomosmais/regex4ocr" }, "release_url": "https://pypi.org/project/regex4ocr/1.4.1/", "requires_dist": null, "requires_python": "", "summary": "Extract data from OCR string results based on Document Regexp Models 
(DRMs).", "version": "1.4.1" }, "last_serial": 4717040, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "5dad7f655d58a01a6605aa1361a52ed8", "sha256": "bc6dfbd8bb733cbad9bf4c5598eb9003b2e008cdd133517bd32b32218cb71209" }, "downloads": -1, "filename": "regex4ocr-1.0.0.tar.gz", "has_sig": false, "md5_digest": "5dad7f655d58a01a6605aa1361a52ed8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7846, "upload_time": "2019-01-04T16:19:07", "url": "https://files.pythonhosted.org/packages/02/29/08df9ebe96a9bd16c41e8a65b9aefbcef84a693bc7707bedcb8ac1ae58be/regex4ocr-1.0.0.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "37b10e74cb02675de1fd629ac7e2a33f", "sha256": "2e7339a3526657e1c3562a927703ab36a4df364569866d80e82826cd3ebcf157" }, "downloads": -1, "filename": "regex4ocr-1.0.1.tar.gz", "has_sig": false, "md5_digest": "37b10e74cb02675de1fd629ac7e2a33f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10676, "upload_time": "2019-01-04T17:44:30", "url": "https://files.pythonhosted.org/packages/84/2a/fae27a88d0f92f3dfe991b902ea040e0c2d215bc3e2f10bb700324883353/regex4ocr-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "73d8ad4071c7710edc87e29ad633e12b", "sha256": "9020d337fc675897f45a5d913901dc3d8372722314be2fee2554c8c47143e78b" }, "downloads": -1, "filename": "regex4ocr-1.0.2.tar.gz", "has_sig": false, "md5_digest": "73d8ad4071c7710edc87e29ad633e12b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10614, "upload_time": "2019-01-04T23:46:23", "url": "https://files.pythonhosted.org/packages/b2/a1/ce31b141d144e1e8e1db2b08b9b8f390e12f08563ef3425590c6abeff4da/regex4ocr-1.0.2.tar.gz" } ], "1.0.3": [ { "comment_text": "", "digests": { "md5": "b6acb545c57a365c825d2dd5f4594660", "sha256": "57fbca16cfc006ad18d1ad6a44756ab4abf90ebce71fa1cf727f3e8d93722bbe" }, "downloads": -1, "filename": "regex4ocr-1.0.3.tar.gz", 
"has_sig": false, "md5_digest": "b6acb545c57a365c825d2dd5f4594660", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10814, "upload_time": "2019-01-11T11:23:26", "url": "https://files.pythonhosted.org/packages/a2/6f/3efd3a098225decca4bb6463606a50d0f43a34af905b9f20f927f901762e/regex4ocr-1.0.3.tar.gz" } ], "1.0.4": [ { "comment_text": "", "digests": { "md5": "ba8d87784783b9692da9b6d3b5978bb1", "sha256": "f11290260ea18975e49f89f7c3d3a9c6fbc473cde94d2bd9a332c50ed5e81a1e" }, "downloads": -1, "filename": "regex4ocr-1.0.4.tar.gz", "has_sig": false, "md5_digest": "ba8d87784783b9692da9b6d3b5978bb1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10809, "upload_time": "2019-01-11T11:28:37", "url": "https://files.pythonhosted.org/packages/2d/ef/c685d2cb1803818917093ec5e4d78ad05fa544d5d15bef51e8e9d4f06e3f/regex4ocr-1.0.4.tar.gz" } ], "1.1.0": [ { "comment_text": "", "digests": { "md5": "6e81275fb685817e310dd56310a9a753", "sha256": "a8b5d269b82c823d1378aa750e924049030400f3d11ff7bf70034538cea95b30" }, "downloads": -1, "filename": "regex4ocr-1.1.0.tar.gz", "has_sig": false, "md5_digest": "6e81275fb685817e310dd56310a9a753", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10841, "upload_time": "2019-01-17T12:30:29", "url": "https://files.pythonhosted.org/packages/4d/61/1b1eeb3c3674d33402cbfb3768a0847dcbb2ee485663b04560d242ff911f/regex4ocr-1.1.0.tar.gz" } ], "1.2.0": [ { "comment_text": "", "digests": { "md5": "b6f93dbfc967fdfa30b20b481226ca11", "sha256": "7aac429cd02b77da69c98d58121d27dbdbc8bc3a7e666dd06d3e50133b7770ac" }, "downloads": -1, "filename": "regex4ocr-1.2.0.tar.gz", "has_sig": false, "md5_digest": "b6f93dbfc967fdfa30b20b481226ca11", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11997, "upload_time": "2019-01-17T19:16:38", "url": 
"https://files.pythonhosted.org/packages/9f/da/568ee3d1e73b4f4bd0c60910b93825433f678f12896755bad9d39fda9cd0/regex4ocr-1.2.0.tar.gz" } ], "1.3.0": [ { "comment_text": "", "digests": { "md5": "b750a7f29b4c4e73d5e38113ae05b2f5", "sha256": "e653ccaaa0e65fb90894bb2eeb09da2154db1958999a14e2068cef455fb72435" }, "downloads": -1, "filename": "regex4ocr-1.3.0.tar.gz", "has_sig": false, "md5_digest": "b750a7f29b4c4e73d5e38113ae05b2f5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12401, "upload_time": "2019-01-19T15:54:48", "url": "https://files.pythonhosted.org/packages/82/1d/384e2eaebc02b39755d5ed356f9936f1421b516bb4bfa684998b2c36c139/regex4ocr-1.3.0.tar.gz" } ], "1.4.0": [ { "comment_text": "", "digests": { "md5": "8a1b78df11c152eebdba8f4e04d386ba", "sha256": "8c5b8f96f0b968f273503fc47474159cf51bddf75a161f1c45d2c03a95e351a4" }, "downloads": -1, "filename": "regex4ocr-1.4.0.tar.gz", "has_sig": false, "md5_digest": "8a1b78df11c152eebdba8f4e04d386ba", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12719, "upload_time": "2019-01-20T01:47:28", "url": "https://files.pythonhosted.org/packages/79/57/f97c64ffb94f6583b23feeaa9e09fd966a28b5d6072ab37c51d487bf16c4/regex4ocr-1.4.0.tar.gz" } ], "1.4.1": [ { "comment_text": "", "digests": { "md5": "e2070fb6ee17d51ff77d48e829d99d19", "sha256": "f36d351c7a600d432d32fd50f6d3a9615ecfe036b41a362a34c42511ba6b0aa5" }, "downloads": -1, "filename": "regex4ocr-1.4.1.tar.gz", "has_sig": false, "md5_digest": "e2070fb6ee17d51ff77d48e829d99d19", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12718, "upload_time": "2019-01-20T01:57:32", "url": "https://files.pythonhosted.org/packages/55/f3/2cd7d6d3a76aab8dd0b147983ead9fc16feff8776646ee7abd905e018960/regex4ocr-1.4.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "e2070fb6ee17d51ff77d48e829d99d19", "sha256": 
"f36d351c7a600d432d32fd50f6d3a9615ecfe036b41a362a34c42511ba6b0aa5" }, "downloads": -1, "filename": "regex4ocr-1.4.1.tar.gz", "has_sig": false, "md5_digest": "e2070fb6ee17d51ff77d48e829d99d19", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12718, "upload_time": "2019-01-20T01:57:32", "url": "https://files.pythonhosted.org/packages/55/f3/2cd7d6d3a76aab8dd0b147983ead9fc16feff8776646ee7abd905e018960/regex4ocr-1.4.1.tar.gz" } ] }