{ "info": { "author": "Maarten van Gompel", "author_email": "proycon@anaproy.nl", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "Operating System :: POSIX", "Programming Language :: Python :: 3.2", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: 3.4", "Topic :: Text Processing :: Linguistic" ], "description": "[![Language Machines Badge](http://applejack.science.ru.nl/lamabadge.php/gecco)](http://applejack.science.ru.nl/languagemachines/)\n[![Codacy Badge](https://api.codacy.com/project/badge/grade/56e381c80d6a48f2831dd00f76f3848c)](https://www.codacy.com/app/proycon/gecco)\n\n========================================================================\nGECCO - Generic Environment for Context-Aware Correction of Orthography\n=======================================================================\n\n by Maarten van Gompel\n Centre for Language and Speech Technology, Radboud University Nijmegen\n Sponsored by Revisely (http://revise.ly)\n Licensed under the GNU Public License v3\n\nGecco is a generic, modular and distributed framework for spelling correction. It aims to let you\nbuild a complete context-aware spelling correction system given your own data\nset. Most modules will be language-independent and trainable from a source\ncorpus. Training is explicitly included in the framework. The framework aims to be\neasily extensible; modules can be written in Python 3. Moreover, the framework\nis scalable and can be distributed over multiple servers. \n\nGiven an input text, Gecco will add various suggestions for correction. 
\n\nThe system can be invoked from the command line, as a Python binding, as a\nRESTful webservice, or through the web application (two interfaces).\n\n**Modules**:\n - Generic built-in modules:\n - **Confusible Module**\n - A confusible module is able to discern which version of an often\n confused word is correct given the context. For example, the words\n \"then\" and \"than\" are commonly confused in English.\n - Your configuration should specify between which confusibles the module disambiguates.\n - The module is implemented using the IGTree classifier (a k-Nearest Neighbour\n approximation) in Timbl.\n - **Suffix Confusible Module**\n - A variant of the confusible module that checks commonly confused morphological\n suffixes, rather than words.\n - Your configuration should specify between which suffixes the module disambiguates.\n - The module is implemented using the IGTree classifier (a k-Nearest Neighbour\n approximation) in Timbl.\n - **Language Model Module**\n - A language model predicts what words are likely to follow others,\n similar to predictive typing applications commonly found on\n smartphones.\n - The module is implemented using the IGTree classifier (a k-Nearest Neighbour\n approximation) in Timbl.\n - **Aspell Module**\n - Aspell is open-source lexicon-based software for spelling correction.\n This module enables aspell to be used from gecco. This is not a\n context-sensitive method.\n - **Hunspell Module**\n - Hunspell is open-source lexicon-based software for spelling correction.\n This module enables hunspell to be used from gecco. This is not a\n context-sensitive method.\n - **Lexicon Module**\n - The lexicon module enables you to automatically generate a lexicon\n from corpus data and use it. This is not a context-sensitive method.\n - Typed words are matched against the lexicon and the module will come\n up with suggestions within a certain Levenshtein distance. 
\n - **Errorlist Module**\n - The errorlist module is a very simple module that checks whether a\n word is in a known error list, and if so, provides the suggestions\n from that list. This is not a context-sensitive method.\n - **Split Module**\n - The split module detects words that are split but should be written\n together.\n - Implemented using Colibri Core\n - **Runon Module**\n - The runon module detects words that are written as one but should be\n split.\n - Implemented using Colibri Core\n - **Punctuation & Recase Module**\n - The punctuation & recase module attempts to detect missing\n punctuation, superfluous punctuation, and missing capitals.\n - The module is implemented using the IGTree classifier (a k-Nearest Neighbour\n approximation) in Timbl.\n - Modules suggested but not implemented yet:\n - *Language Detection Module*\n - (Not written yet, option for later)\n - *Sound-alike Module*\n - (Not written yet, option for later)\n\n**Features**\n - Easily extendible by adding modules using the gecco module API\n - Language independent\n - Built-in training pipeline (given corpus input): Create models from sources\n - Built-in testing pipeline (given an error-annotated test corpus), returns report of evaluation metrics per module\n - **Distributed**, **Multithreaded** & **Scalable**:\n - Load balancing: backend servers can run on multiple hosts, master process distributes amongst these\n - Multithreaded, modules can be invoked in parallel, module servers themselves may be multithreaded too\n - Input and output is **FoLiA XML** (http://proycon.github.io/folia)\n - Automatic input conversion from plain text using ucto\n\nGecco is the successor of Valkuil.net and Fowlt.net.\n \n-----------------------\nInstallation\n-----------------------\n\nGecco relies on a large number of dependencies, including but not limited to:\n\nDependencies:\n - *Generic*:\n - python 3.3 or higher\n - [PyNLPl](https://github.com/proycon/pynlpl), needed for FoLiA support 
(https://proycon.github.io/folia)\n - [python-ucto](http://proycon.github.com/python-ucto) & [ucto](https://languagemachines.github.io/ucto) (in turn depending on libfolia, ticcutils)\n - *Module-specific*:\n - [Timbl](https://languagemachines.github.io/timbl) *(mandatory)*\n - [python-timbl](https://github.com/proycon/python-timbl)\n - [Colibri Core](https://github.com/proycon/colibri-core/) *(mandatory)*\n - For the Aspell Module: *(optional)*\n - [Aspell](http://aspell.net)\n - aspell-python-py3\n - For the Hunspell Module: *(optional)*\n - [Hunspell](http://hunspell.github.io)\n - [PyHunspell](https://github.com/smathot/pyhunspell) *(not supported out of the box on Mac OS X)*\n - *Webservice*: *(optional)*\n - [CLAM](https://proycon.github.io/clam)\n\nTo install Gecco, we *strongly* recommend that you use our LaMachine\ndistribution, which can be obtained from https://github.com/proycon/lamachine .\n\nLaMachine includes Gecco and can be run in multiple ways: as a virtual machine,\nas a Docker app, or as a compilation script setting up a Python virtual\nenvironment.\n\nGecco uses memory-based technologies, and depending on the models you train,\nmay take up considerable memory. Therefore we recommend *at least* 16GB RAM;\ntraining may require even more. For various modules, model size may be reduced\nby increasing frequency thresholds, but this will come at the cost of reduced\naccuracy.\n\nGecco will only run on POSIX-compliant operating systems (e.g. Linux, BSD, Mac OS X), not on Windows.\n\n----------------\nConfiguration\n----------------\n\nTo build an actual spelling correction system, you need to have corpus sources\nand create a gecco configuration that enables the modules you desire with the\nparameters you want. 
\n\nA Gecco system consists of a configuration, either in the form of a simple Python\nscript or an external YAML configuration file.\n\nExample YAML configuration:\n\n    name: fowlt\n    path: /path/to/fowlt\n    language: en\n    modules:\n        - module: gecco.modules.confusibles.TIMBLWordConfusibleModule\n          id: confusibles\n          source:\n            - train.txt\n          model:\n            - confusible.model\n          confusibles: [then,than]\n\nTo list all available modules and the parameters they may take, run ``gecco --helpmodules``.\n\nAlternatively, the configuration can be done in Python directly, in which case\nthe script will be the tool that exposes all functionality:\n\n    from gecco import Corrector\n    from gecco.modules.confusibles import TIMBLWordConfusibleModule\n\n    corrector = Corrector(id=\"fowlt\", root=\"/path/to/fowlt/\")\n    corrector.append(TIMBLWordConfusibleModule(\"thenthan\", source=\"train.txt\", test_crossvalidate=True, test=0.1, tune=0.1, model=\"confusibles.model\", confusible=('then','than')))\n    corrector.append(TIMBLWordConfusibleModule(\"its\", source=\"train.txt\", test_crossvalidate=True, test=0.1, tune=0.1, model=\"confusibles.model\", confusible=('its',\"it's\")))\n    corrector.append(TIMBLWordConfusibleModule(\"errorlist\", source=\"errorlist.txt\", model=\"errorlist.model\", servers=[(\"blah\",1234),(\"blah2\",1234)]))\n    corrector.append(TIMBLWordConfusibleModule(\"lexicon\", source=[\"lexicon.txt\",\"lexicon2.txt\"], model=[\"lexicon.model\",\"lexicon2.model\"], servers=[(\"blah\",1235)]))\n    corrector.main()\n\n\nIt is recommended to adopt a file/directory structure as described below. 
If you plan on using multiple hosts, you should store it on a shared network drive so all hosts can access the models:\n\n - yourconfiguration.yml\n - sources/\n - models/\n\nAn example spelling correction system for English is provided with Gecco and resides in the ``example/`` directory.\n\n \n\n----------------\nServer setup\n----------------\n\nWhen `gecco run ` is executed to process a given\nFoLiA document or plaintext document, it starts a master process that will\ninvoke all the modules, which may be distributed over multiple servers. If\nmultiple server instances of the same module are available, the load will be\ndistributed over them. Output will be delivered in the FoLiA XML format and\nwill contain suggestions for correction. \n\nTo start module servers on a host, issue `gecco startservers`.\nYou can optionally specify which servers you want to start, if you do not want\nto start all. You can start servers multiple times, either on the same or on\nmultiple hosts. The master process will distribute the load amongst all\nservers. \n\nTo stop the servers, run `gecco stopservers` on each host that\nhas servers running. A list of all running servers can be obtained by `gecco\n listservers`.\n\nModules can also run locally within the master process rather than as servers;\nthis is done either by adding `local: true` in the configuration, or by\nadding the ``--local`` option when starting a run. 
But this will have a\nsignificant negative impact on performance and should therefore be avoided.\n\n-----------------\nArchitecture\n-----------------\n\n![Gecco Architecture](https://raw.github.com/proycon/gecco/master/gecco_architecture.png \"Gecco Architecture\")\n\n---------------------\nCommand line usage\n---------------------\n\nInvoke all gecco functionality through a single command line tool:\n\n $ gecco myconfig.yml [subcommand] \n\nor \n\n $ myspellingcorrector.py [subcommand]\n\n\nSyntax:\n\n usage: gecco [-h]\n {run,startservers,stopservers,startserver,train,evaluate,reset}\n ...\n\n Gecco is a generic, scalable and modular spelling correction framework\n\n Commands:\n {run,startservers,stopservers,startserver,train,evaluate,reset}\n run Run the spelling corrector on the specified input file\n startservers Starts all the module servers that are configured to\n run on the current host. Issue once for each host.\n stopservers Stops all the module servers that are configured to\n run on the current host. Issue once for each host.\n listservers Lists all the module servers on all hosts\n startserver Start one module's server on the specified port, use\n 'startservers' instead\n train Train modules\n evaluate Runs the spelling corrector on input data and compares\n it to reference data, produces an evaluation report\n reset Reset modules, deletes all trained models that have\n sources. Issue prior to train if you want to start\n anew.\n\n\nVital documentation regarding all modules and the settings they take can be obtained through:\n\n $ gecco --helpmodules\n\n----------------------------------------\nGecco as a webservice\n----------------------------------------\n\nRESTful webservice access will be available through CLAM. We are still working\non better integration of this in Gecco. 
For now, an example implementation of\nthis can be seen here:\nhttps://github.com/proycon/valkuil-gecco/tree/master/valkuilwebservice\n\n------------------------------\nGecco as a web-application\n------------------------------\n\nA web-application will eventually be available, modelled after Valkuil.net/Fowlt.net.\n\n\n\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/proycon/gecco", "keywords": "spelling corrector spell check nlp computational_linguistics rest", "license": "GPL", "maintainer": "", "maintainer_email": "", "name": "Gecco", "package_url": "https://pypi.org/project/Gecco/", "platform": "", "project_url": "https://pypi.org/project/Gecco/", "project_urls": { "Homepage": "https://github.com/proycon/gecco" }, "release_url": "https://pypi.org/project/Gecco/0.2.5/", "requires_dist": null, "requires_python": "", "summary": "Generic Environment for Context-Aware Correction of Orthography", "version": "0.2.5" }, "last_serial": 3892249, "releases": { "0.2.3": [ { "comment_text": "", "digests": { "md5": "43663d5d7af92a6d78d0ed18431f3263", "sha256": "5d44cd9cddbb6d079829d7620b7201cd3bc7fd2e6b8aea5de03673fcd18a0e8d" }, "downloads": -1, "filename": "Gecco-0.2.3.tar.gz", "has_sig": false, "md5_digest": "43663d5d7af92a6d78d0ed18431f3263", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 44425, "upload_time": "2017-09-07T15:34:51", "url": "https://files.pythonhosted.org/packages/11/98/2d607a4d0eabc2eb72e594e2854e3cf58a1edb2fa34ced15ee55e6c757d8/Gecco-0.2.3.tar.gz" } ], "0.2.4": [ { "comment_text": "", "digests": { "md5": "79b093f452b154c444c1e0902869e7af", "sha256": "0413a35bd19b91ee13f54055616134e054a189e89a0ef76d0966581bf84d0dbd" }, "downloads": -1, "filename": "Gecco-0.2.4.tar.gz", "has_sig": false, "md5_digest": "79b093f452b154c444c1e0902869e7af", "packagetype": "sdist", "python_version": "source", 
"requires_python": null, "size": 60944, "upload_time": "2018-03-27T10:31:31", "url": "https://files.pythonhosted.org/packages/3a/9d/683229da3e0e91c2ffeb303681f1b5eb69b621637539030b74c176c117fb/Gecco-0.2.4.tar.gz" } ], "0.2.5": [ { "comment_text": "", "digests": { "md5": "2eac6b7c95c0aa3690b6259cea59c30f", "sha256": "00b964d53e65c1a15c688ad7f034b9caf2983ab47bb898aa8d329fc3cff3f2ac" }, "downloads": -1, "filename": "Gecco-0.2.5.tar.gz", "has_sig": false, "md5_digest": "2eac6b7c95c0aa3690b6259cea59c30f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 60770, "upload_time": "2018-05-23T17:59:12", "url": "https://files.pythonhosted.org/packages/6a/07/70944f325d9bed57c2462960e8c1cacf756e2e60de260987447aecf69637/Gecco-0.2.5.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "2eac6b7c95c0aa3690b6259cea59c30f", "sha256": "00b964d53e65c1a15c688ad7f034b9caf2983ab47bb898aa8d329fc3cff3f2ac" }, "downloads": -1, "filename": "Gecco-0.2.5.tar.gz", "has_sig": false, "md5_digest": "2eac6b7c95c0aa3690b6259cea59c30f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 60770, "upload_time": "2018-05-23T17:59:12", "url": "https://files.pythonhosted.org/packages/6a/07/70944f325d9bed57c2462960e8c1cacf756e2e60de260987447aecf69637/Gecco-0.2.5.tar.gz" } ] }