{ "info": { "author": "Robertus Johansyah", "author_email": "me@kororo.co", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: Implementation :: CPython", "Programming Language :: Python :: Implementation :: PyPy" ], "description": "ExcelCy\n=======\n\n.. image:: https://badge.fury.io/py/excelcy.svg\n :target: https://badge.fury.io/py/excelcy\n\n.. image:: https://travis-ci.com/kororo/excelcy.svg?branch=master\n :target: https://travis-ci.com/kororo/excelcy\n\n.. image:: https://coveralls.io/repos/github/kororo/excelcy/badge.svg?branch=master\n :target: https://coveralls.io/github/kororo/excelcy?branch=master\n\n.. image:: https://badges.gitter.im/excelcy.png\n :target: https://gitter.im/excelcy\n :alt: Gitter\n\n------\n\nExcelCy is a toolkit that integrates Excel with spaCy NLP training. It trains NER from XLSX files whose sentences can come from PDF, DOCX, PPT, PNG or JPG sources. ExcelCy has a pipeline to match entities with PhraseMatcher, or with Matcher using regular expressions.\n\nExcelCy is Powerful\n-------------------\n\n`Simple Style Training `__, from the spaCy documentation, demonstrates how to train NER using spaCy:\n\n.. code-block:: python\n\n import random\n\n import spacy\n\n TRAIN_DATA = [\n     (\"Uber blew through $1 million a week\", {'entities': [(0, 4, 'ORG')]}),\n     (\"Google rebrands its business apps\", {'entities': [(0, 6, \"ORG\")]})]\n\n nlp = spacy.blank('en')\n optimizer = nlp.begin_training()\n for i in range(20):\n     random.shuffle(TRAIN_DATA)\n     for text, annotations in TRAIN_DATA:\n         nlp.update([text], [annotations], sgd=optimizer)\n nlp.to_disk('/model')\n\n**TRAIN_DATA** describes the sentences and annotated entities to be trained. Counting character offsets for every entity is cumbersome. With ExcelCy, the (start, end) character offsets can be omitted.\n\n.. 
code-block:: python\n\n # download the en model from spacy\n # python -m spacy download en\n from excelcy import ExcelCy\n # collect sentences, annotate Entities and train NER using spaCy\n excelcy = ExcelCy.execute(file_path='https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx')\n # use the nlp object as per spaCy API\n doc = excelcy.nlp('Google rebrands its business apps')\n # or save it to disk for faster application bootstrap\n excelcy.nlp.to_disk('/model')\n\n\nExcelCy is Friendly\n-------------------\n\nBy default, ExcelCy training is divided into phases. The example Excel file can be found in `tests/data/test_data_01.xlsx `__:\n\n1. Discovery\n^^^^^^^^^^^^\n\nThe first phase is to collect sentences from the data sources in sheet \"source\". A data source can be either:\n\n- Text: Direct sentence values.\n- Files: PDF, DOCX, PPT, PNG or JPG will be parsed using `textract `__.\n\nNote: See textract source examples in `tests/data/test_data_03.xlsx `__\n\n2. Preparation\n^^^^^^^^^^^^^^\n\nIn the next phase, the Gold annotations need to be defined in sheet \"prepare\", based on:\n\n- Current Data Model: Using spaCy API of **nlp(sentence).ents**\n- Phrase pattern: Robertus Johansyah, Uber, Google, Amazon\n- Regex pattern: ^([0-1]?[0-9]|2[0-3]):[0-5][0-9]$\n\nAll annotations here are considered Gold annotations, which are described `here `__.\n\n3. Training\n^^^^^^^^^^^\n\nThis is the main phase of NER training, which is described in `Simple Style Training `__. The data is iterated from sheet \"train\"; check sheet \"config\" to control the parameters.\n\n4. Consolidation\n^^^^^^^^^^^^^^^^\n\nThe last phase is to test and save the results, and to repeat the phases if required.\n\nExcelCy is Flexible\n-------------------\n\nNeed more specific exports and phases? They can be controlled using the phase API. The following illustrates a real-world scenario:\n\n1. Train from `tests/data/test_data_05.xlsx `__\n\n .. 
code-block:: bash\n\n # download the dataset\n $ wget https://github.com/kororo/excelcy/raw/master/tests/data/test_data_05.xlsx\n # this will create a directory and file \"export/train_05.xlsx\"\n $ excelcy execute test_data_05.xlsx\n\n2. Open the result in \"export/train_05.xlsx\". It shows all sentences identified from the given source. However, there is an error: \"Himalayas\" is identified as \"PRODUCT\".\n3. To fix this, add a phrase matcher for \"Himalayas = FAC\". This is illustrated in `tests/data/test_data_05a.xlsx `__\n4. Train again and check the result in \"export/train_05a.xlsx\"\n\n .. code-block:: bash\n\n # download the dataset\n $ wget https://github.com/kororo/excelcy/raw/master/tests/data/test_data_05a.xlsx\n # this will create a directory \"nlp/data\" and file \"export/train_05a.xlsx\"\n $ excelcy execute test_data_05a.xlsx\n\n5. Check that there is a backed-up nlp data model in \"nlp\" and that the result is corrected in \"export/train_05a.xlsx\".\n6. Keep training the data model. If there is unexpected behaviour, the backup data model is available if needed.\n\nExcelCy is Comprehensive\n------------------------\n\nUnder the hood, ExcelCy has a strong and well-defined data storage. At any of the phases above, the data can be inspected.\n\n.. 
code-block:: python\n\n import json\n\n from excelcy import ExcelCy\n # assumption: Config is importable from excelcy.storage (see storage.py)\n from excelcy.storage import Config\n\n excelcy = ExcelCy()\n # load configuration from XLSX or YML or JSON\n # excelcy.load(file_path='test_data_01.xlsx')\n # or define manually\n excelcy.storage.config = Config(nlp_base='en_core_web_sm', train_iteration=2, train_drop=0.2)\n print(json.dumps(excelcy.storage.as_dict(), indent=2))\n\n # add sources\n excelcy.storage.source.add(kind='text', value='Robertus Johansyah is the maintainer ExcelCy')\n excelcy.storage.source.add(kind='textract', value='tests/data/source/test_source_01.txt')\n excelcy.discover()\n print(json.dumps(excelcy.storage.as_dict(), indent=2))\n\n # add phrase matcher Robertus Johansyah -> PERSON\n excelcy.storage.prepare.add(kind='phrase', value='Robertus Johansyah', entity='PERSON')\n excelcy.prepare()\n print(json.dumps(excelcy.storage.as_dict(), indent=2))\n\n # train it\n excelcy.train()\n print(json.dumps(excelcy.storage.as_dict(), indent=2))\n\n # test it\n doc = excelcy.nlp('Robertus Johansyah is maintainer ExcelCy')\n print(json.dumps(excelcy.storage.as_dict(), indent=2))\n\n\nFeatures\n--------\n\n- Load multiple data sources such as Word documents, PowerPoint presentations, PDF or images.\n- Import/Export configuration with JSON, YML or Excel.\n- Add custom Entity labels.\n- Rule-based phrase matching using `PhraseMatcher `__\n- Rule-based matching using `regex + Matcher `__\n- Train Named Entity Recogniser with ease\n\nInstall\n-------\n\nEither install with pip or clone this repository and run the setup.py file.\n\n.. code-block:: bash\n\n $ pip install excelcy\n # ensure the language model is installed before training\n $ python -m spacy download en\n\nTrain\n-----\n\nTo train the spaCy model:\n\n.. code-block:: python\n\n from excelcy import ExcelCy\n excelcy = ExcelCy.execute(file_path='test_data_01.xlsx')\n\nNote: `tests/data/test_data_01.xlsx `__\n\nCLI\n---\n\nExcelCy has a basic CLI command for execute:\n\n.. 
code-block:: bash\n\n $ excelcy execute https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx\n\n\nData Definition\n---------------\n\nExcelCy has a data definition, which is expressed in `api.yml `__. As long as the data is given in this specific format and structure, ExcelCy will be able to support any type of data format. Check out the Excel file format in `api.xlsx `__. Data classes are defined with `attrs `__; see `storage.py `__ for more detail.\n\n\nTODO\n----\n\n- [X] Start get cracking into spaCy\n\n- [ ] More features and enhancements listed `here `__\n\n - [ ] [`link `__] JSONL integration with Prodigy\n - [ ] [`link `__] Add logging and the settings\n - [ ] Add special case for tokenisation described `here `__\n - [ ] Add custom tags.\n - [ ] Add classifier text training described `here `__\n - [ ] Add exception subtext when there are multiple occurrences in text. (Google Pay is awesome Google product)\n - [ ] Add tag annotation in sheet: train\n - [ ] Add ref in data storage\n - [ ] Improve speed and performance\n - [X] Add list of patterns easily (such as kitten breeds).\n - [X] Add more data structure checks in Excel and more warning messages\n - [X] Add plugin, otherwise just extends for now.\n - [X] [`link `__] Add enabled, notes columns\n - [X] [`link `__] Add export outputs such as identified Entities, Tags\n - [X] [`link `__] Add CLI support\n - [X] [`link `__] Improve experience\n - [X] [`link `__] Add more file formats such as YML, JSON. Standardise and document the data structure.\n - [X] Add support to accept sentences to Excel\n\n\n- [X] Submit to Prodigy Universe\n\nFAQ\n---\n\n**What are the idx columns in the Excel sheet?**\n\nThe idea is to provide a reference between two things. For example, in sheet \"train\" you may want to know which row in sheet \"source\" a sentence was generated from. 
Also, because Excel lets you sort rows, the idx column is a safeguard that keeps things in the correct order.\n\n**Can ExcelCy import/export to X, Y, Z data format?**\n\nExcelCy has a strong and well-defined data storage, thanks to `attrs `__. It is possible to import/export data in any format.\n\n**Error: ModuleNotFoundError: No module named 'pip'**\n\nThere are many possible causes. Try downgrading pip (version 19.0.3 was buggy).\n\n**Does ExcelCy accept suggestions/ideas?**\n\nYes! Please submit them as a new issue with the label \"enhancement\".\n\nAcknowledgement\n---------------\n\nThis project uses other awesome projects:\n\n- `attrs `__: Python Classes Without Boilerplate.\n- `pyexcel `__: Single API for reading, manipulating and writing data in csv, ods, xls, xlsx and xlsm files.\n- `pyyaml `__: The next generation YAML parser and emitter for Python.\n- `spacy `__: Industrial-strength Natural Language Processing (NLP) with Python and Cython.\n- `textract `__: extract text from any document. no muss. no fuss.\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/kororo/excelcy", "keywords": "spacy,spacy-pipeline,spacy-nlp,nlp,python,python3,entity,training,excel,xlsx,spacy-extensions", "license": "MIT", "maintainer": "Robertus Johansyah", "maintainer_email": "me@kororo.co", "name": "excelcy", "package_url": "https://pypi.org/project/excelcy/", "platform": "", "project_url": "https://pypi.org/project/excelcy/", "project_urls": { "Homepage": "https://github.com/kororo/excelcy" }, "release_url": "https://pypi.org/project/excelcy/0.3.3/", "requires_dist": [ "attrs (<19.0.0,>=18.1.0)", "pyexcel (<1.0,>=0.5.0)", "pyexcel-xlsx (<1.0,>=0.5.6)", "pyyaml (<5.0,>=4.2b1)", "spacy (<3.0,>=2.0.11)" ], "requires_python": "", "summary": "Excel Integration with SpaCy. 
Includes, Entity training, Entity matcher pipe.", "version": "0.3.3" }, "last_serial": 4933353, "releases": { "0.1.2": [ { "comment_text": "", "digests": { "md5": "608d6025c05c2e0f4c418f3709990131", "sha256": "b9d4185e0e5287d8f0bfccb25cfcc1b85c8789919049c78a0860f8d4ed9cda27" }, "downloads": -1, "filename": "excelcy-0.1.2-py3-none-any.whl", "has_sig": false, "md5_digest": "608d6025c05c2e0f4c418f3709990131", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 11192, "upload_time": "2018-07-19T12:13:40", "url": "https://files.pythonhosted.org/packages/97/61/737002f32f769d320d38900b084792aa4c6ae1ad7df9a707078894e2c75f/excelcy-0.1.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f0efb63f9ea501591cdba6c0ddece476", "sha256": "6723b9509f457005302dcf5bfb531c446a76ecbe4088c07aca346226d665ad90" }, "downloads": -1, "filename": "excelcy-0.1.2.tar.gz", "has_sig": false, "md5_digest": "f0efb63f9ea501591cdba6c0ddece476", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 12660, "upload_time": "2018-07-19T12:13:41", "url": "https://files.pythonhosted.org/packages/a9/0a/f5b0667e6b8d21c146de6b618cc1fe71c086aa616b15885879418b881a86/excelcy-0.1.2.tar.gz" } ], "0.2.4": [ { "comment_text": "", "digests": { "md5": "eb97fe32017de1a91206ce2048e14851", "sha256": "2c0709641a8f7bc54f0ca35b9a4b85574c2cc65d46cf95eb58809c62ecc27804" }, "downloads": -1, "filename": "excelcy-0.2.4-py3-none-any.whl", "has_sig": false, "md5_digest": "eb97fe32017de1a91206ce2048e14851", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 16505, "upload_time": "2018-07-23T04:59:29", "url": "https://files.pythonhosted.org/packages/85/95/9d4520888caaa7be95057fc5db2be3de503630644531d5e3514fd516c0f6/excelcy-0.2.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "1560973dde478b0512e487c82ca1d9ce", "sha256": "51ba65380a607511f31ec11eb366acb11e064a852e0e28406d8dd99eb0d6d2e9" }, "downloads": 
-1, "filename": "excelcy-0.2.4.tar.gz", "has_sig": false, "md5_digest": "1560973dde478b0512e487c82ca1d9ce", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16189, "upload_time": "2018-07-23T04:59:31", "url": "https://files.pythonhosted.org/packages/76/3e/7e7b25662323e3c8ba3d41633300a3bdaace74d644186e5c47ee14415daf/excelcy-0.2.4.tar.gz" } ], "0.3.0": [ { "comment_text": "", "digests": { "md5": "50ff823a2d276ad831c3d89d8121dd4e", "sha256": "dfca0f65c1711ccfd3e1078578ff72467ae95b5bf082dd23fc48da1789564696" }, "downloads": -1, "filename": "excelcy-0.3.0-py3-none-any.whl", "has_sig": false, "md5_digest": "50ff823a2d276ad831c3d89d8121dd4e", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 19324, "upload_time": "2018-07-29T11:14:15", "url": "https://files.pythonhosted.org/packages/aa/f3/ff79081abb63f66a377c60b335477017bd049383a42f1df0bca3929eb780/excelcy-0.3.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "760cdb53777ee9b940220deeba37baa6", "sha256": "825f8e9b1b701c3eb69e0dc0ca1b0ae73bd9b97c79c62a2536b3aeaee77dda35" }, "downloads": -1, "filename": "excelcy-0.3.0.tar.gz", "has_sig": false, "md5_digest": "760cdb53777ee9b940220deeba37baa6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19230, "upload_time": "2018-07-29T11:14:17", "url": "https://files.pythonhosted.org/packages/5a/07/83eb4ff8462778fea0822047b1ee4d82dc227b3742fa819cdd82f7a1ddac/excelcy-0.3.0.tar.gz" } ], "0.3.1": [ { "comment_text": "", "digests": { "md5": "11a7fed76c62f7270b20720668615b50", "sha256": "33941b5be510d660f79ff313d7e2abefbc2f17fc794172ec791af125a0a3e3c4" }, "downloads": -1, "filename": "excelcy-0.3.1-py3-none-any.whl", "has_sig": false, "md5_digest": "11a7fed76c62f7270b20720668615b50", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 19407, "upload_time": "2018-07-29T11:22:42", "url": 
"https://files.pythonhosted.org/packages/46/78/860db701f185698d366b55883a8c134385baf6787e55f8d4156a4953855e/excelcy-0.3.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "10844e6c78a0a9cbdefd414d56f107d7", "sha256": "c85deb1ec7fa288d3d7b308d0fb358fb06eeb68226f722be9f79af8c98df43ac" }, "downloads": -1, "filename": "excelcy-0.3.1.tar.gz", "has_sig": false, "md5_digest": "10844e6c78a0a9cbdefd414d56f107d7", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19394, "upload_time": "2018-07-29T11:22:44", "url": "https://files.pythonhosted.org/packages/78/33/b3cfa5b67b9e6a95bcb70fa57f00f02b9f15595f6e275567e1c0867365d9/excelcy-0.3.1.tar.gz" } ], "0.3.2": [ { "comment_text": "", "digests": { "md5": "c35bd94661a3ad5c733d9ccc48b7060f", "sha256": "b3c60fbbd51aa49c84a2712d7e40e377fd031fcf587ff7fb10ee132c654a3ab6" }, "downloads": -1, "filename": "excelcy-0.3.2-py3-none-any.whl", "has_sig": false, "md5_digest": "c35bd94661a3ad5c733d9ccc48b7060f", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 19597, "upload_time": "2018-08-12T09:40:14", "url": "https://files.pythonhosted.org/packages/c4/51/9851cc7970033583366582e4e37a9e4ce66a65feff769768aa6da13b8c6f/excelcy-0.3.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "ebaaf261e140c92155898c83eceeffb0", "sha256": "12679d3fef888b1e32efb2414ece4e6522a5a5d8be3e844e5df6ad10844a4add" }, "downloads": -1, "filename": "excelcy-0.3.2.tar.gz", "has_sig": false, "md5_digest": "ebaaf261e140c92155898c83eceeffb0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19599, "upload_time": "2018-08-12T09:40:17", "url": "https://files.pythonhosted.org/packages/fd/9a/f9d71688140e24c15f205f4d6ed798c59aaeae286eb210fa41693b72fe4f/excelcy-0.3.2.tar.gz" } ], "0.3.3": [ { "comment_text": "", "digests": { "md5": "001dd77f96ce1f254bffeff0f5ed46d0", "sha256": "6d891937f804888bd602d858b40e91eab4825f73b96e4c9c20295d77e6542065" }, 
"downloads": -1, "filename": "excelcy-0.3.3-py3-none-any.whl", "has_sig": false, "md5_digest": "001dd77f96ce1f254bffeff0f5ed46d0", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 19707, "upload_time": "2019-03-13T07:11:10", "url": "https://files.pythonhosted.org/packages/e6/c1/3acfc3c982a0e882309ee768c54f9e5e7412771e88954c777da0207ff196/excelcy-0.3.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f5a33051b689e2b029dbc8f63a871027", "sha256": "696c5592fbeac8054b793956a69f9545d3c87c02d5d6f59e426fefc68d1c27f0" }, "downloads": -1, "filename": "excelcy-0.3.3.tar.gz", "has_sig": false, "md5_digest": "f5a33051b689e2b029dbc8f63a871027", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19779, "upload_time": "2019-03-13T07:11:12", "url": "https://files.pythonhosted.org/packages/f4/dc/952aec90c82935f3af82dce3fc59864436ba50ba9e4b3472177605b10e0e/excelcy-0.3.3.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "001dd77f96ce1f254bffeff0f5ed46d0", "sha256": "6d891937f804888bd602d858b40e91eab4825f73b96e4c9c20295d77e6542065" }, "downloads": -1, "filename": "excelcy-0.3.3-py3-none-any.whl", "has_sig": false, "md5_digest": "001dd77f96ce1f254bffeff0f5ed46d0", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 19707, "upload_time": "2019-03-13T07:11:10", "url": "https://files.pythonhosted.org/packages/e6/c1/3acfc3c982a0e882309ee768c54f9e5e7412771e88954c777da0207ff196/excelcy-0.3.3-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f5a33051b689e2b029dbc8f63a871027", "sha256": "696c5592fbeac8054b793956a69f9545d3c87c02d5d6f59e426fefc68d1c27f0" }, "downloads": -1, "filename": "excelcy-0.3.3.tar.gz", "has_sig": false, "md5_digest": "f5a33051b689e2b029dbc8f63a871027", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19779, "upload_time": "2019-03-13T07:11:12", "url": 
"https://files.pythonhosted.org/packages/f4/dc/952aec90c82935f3af82dce3fc59864436ba50ba9e4b3472177605b10e0e/excelcy-0.3.3.tar.gz" } ] }