{ "info": { "author": "Tarek Amr", "author_email": "UNKNOWN", "bugtrack_url": null, "classifiers": [], "description": "# Do you speak London?\n\nCommand line tool for naturla language identification, also known as langID. Currently supporting 4 languages only, English, Spanish, Portuguese and Arabic:\n\n`$ python dysl.py Can you tell if this is in English?`\n\nYou can let dysl read from custom training data with your own set of languages. \n\n`$ python dysl.py --corpus /PATH/TO/TRAINING/DATA \u00bfPuede usted decir si esto es en espa\u00f1ol?`\n\nYou also can use the CLI to add more samples to your training data. \n\n`$ python dysl.py --corpus /PATH/TO/TRAINING/DATA --lang en Here you are some text in English to be add to your training data`\n\nTo show the currently installed version:\n\n`$ python dysl.py --version`\n\nFor listing of all CLI options:\n\n`$ python dysl.py --help`\n\n## Use in Code\n\nYou can also use dysl's LangID within your code\n\n```python\nfrom dysl.langid import LangID\n\nl = LangID()\nl.train()\n\ntext = u'hello, world'\n\nlang = l.classify(text)\nprint text, 'Language:', lang\n```\n## Use your own Training Data\n\nAs stated earlier, we currently support English, Spanish, Portuguese and Arabic out of the box, however, this doesn't stop you from using your own data to train dysl. Thus, you can even use it to classify your choice of fictional languages such as George Orwell's Newspeak, Valyrian Languages, and The Klingon language if you have the needed data.\n\nFor compatibility with [langid.py](https://github.com/saffsd/langid.py), we require the training data to be in the following format:\n\n> To train a model, we require multiple corpora of monolingual documents. Each document should be a single file, and each file should be in a 2-deep folder hierarchy, with language nested within domain. For example, we may have a number of English files:\n\n`./corpus/domain1/en/File1.txt ./corpus/domainX/en/001-file.xml`\n\nWe, however do not rely on domains as they do in langid.py, so you can use a single domain if you want to.\n\nIn brief, if you want to classify Valyrian and Klingon for example, your corpus should look somehow as follows:\n\n`/user/me/corpus/domain/klingon/file01.txt`\n`/user/me/corpus/domain/klingon/file02.txt` \n\n`/user/me/corpus/domain/valyrian/ex12.xml`\n`/user/me/corpus/domain/valyrian/ex13.xml`\n\nAs you can see, domain and filename can be anything, just the folders containing the example files should be named after the languages you want to classify. \n\nFinally, you need to specify the path to the training example in your code.\n\n```python\nfrom dysl.langid import LangID\n\nl = LangID()\nl.train(root='/user/me/corpus')\n\ntext = u'hello, world'\n\nlang = l.classify(text)\nprint text, 'Language:', lang\n```\n## Adding New Samples to Your Training Data\n\nYou can add new training samples to your custom training-set. \nYou do that on two stages.\n\nTo add a new samples:\n\n`l.add_training_sample(text=u'tlhIngan Hol Dajatlha?', lang='Klingon')`\n\n`l.add_training_sample(text=u'jIyajbe', lang='Klingon')`\n\nThen to save changes to disk:\n\n`l.save_training_samples()`\n\n_Note_: \n\nBy default, a folder name generated from the current timestamp, `/batchTSYYMMDDHHMMSS` is used for the domain, and `file.txt` is used for sample files in language folders. To specify the domain folder name and file name, you can use the following command instead:\n\n`l.save_training_samples(domain='MyDomain', filename='MyFile.txt')`\n\nNotice that saving new sample data only works when using custom training-set, \nadding samples to builtin training-set is not permitted.\n\n# Retrain When Modified\n\nLet's say you are going to build a daemon that calls uses dysl. Normally, you would train dysl when starting then use that data for any further language classifications. \n\nWhat if some other process, or human, updates the training dataset during that? How will your daemon know that it neads to retrain again?\n\nWell, in LangID, there is a method, `is_training_modified()`, that can tell you if the training data was modified after it was first trained on it or not:\n\n`l.is_training_modified()`\n\nIt returns `True` if the data was modified, and `False` otherwise. Of course, when using the builtin dataset, it will always return `False` as it not allowed to be modified.\n\n## Unknown Languages (_Experimantal_)\n\nBy default, we try to classify a given text as one of the languages dysl is trained on. However, by setting `unk=True` we allow dysl to return 'unk' when the given text is not in any of the languages it has seen before. \n\n`l = LangID(unk=True)` \n\n***\n\n# Contacts\n \n+ Project Owner: [Tarek Amr](http://tarekamr.appspot.com/)", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/gr33ndata/dysl", "keywords": null, "license": "MIT", "maintainer": null, "maintainer_email": null, "name": "dysl", "package_url": "https://pypi.org/project/dysl/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/dysl/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/gr33ndata/dysl" }, "release_url": "https://pypi.org/project/dysl/0.13/", "requires_dist": null, "requires_python": null, "summary": "Do you speak London?", "version": "0.13" }, "last_serial": 1288319, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "4ed134a9c49587abf0aa4fc0250a1304", "sha256": "424ed3f68e323c5af4da08df1dd9d730a0f876a33a77947dca33098459d52e67" }, "downloads": -1, "filename": "dysl-0.1.tar.gz", "has_sig": false, "md5_digest": "4ed134a9c49587abf0aa4fc0250a1304", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 58843, "upload_time": "2014-03-16T14:24:21", "url": "https://files.pythonhosted.org/packages/77/53/e9899048ee213087c0f49e13a4d4bb04271645a80811cb12f4a2087dbb47/dysl-0.1.tar.gz" } ], "0.11": [ { "comment_text": "", "digests": { "md5": "61f35770090ee43a98ae46ebeca9a542", "sha256": "0591c352218bfcda28845204035f66627e87ca4c95d9633c178725da750666d1" }, "downloads": -1, "filename": "dysl-0.11.tar.gz", "has_sig": false, "md5_digest": "61f35770090ee43a98ae46ebeca9a542", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 240126, "upload_time": "2014-03-30T13:10:30", "url": "https://files.pythonhosted.org/packages/0b/22/09dbb868c97e5e847cd579abc81212bc7ad55ab5b008e428b747576fb22b/dysl-0.11.tar.gz" } ], "0.12": [ { "comment_text": "", "digests": { "md5": "6be9bb9ac099175556530985ac016f6b", "sha256": "b8645374d3bb61a61703ab4518f6e008dd22c76d7d1ecf21a189d75f360a6ca3" }, "downloads": -1, "filename": "dysl-0.12.tar.gz", "has_sig": false, "md5_digest": "6be9bb9ac099175556530985ac016f6b", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1764625, "upload_time": "2014-04-09T16:47:16", "url": "https://files.pythonhosted.org/packages/f1/cc/fe6fb1b56b571271f8d4fd5dca9444748e04a61fff88a422bf58a523220b/dysl-0.12.tar.gz" } ], "0.13": [ { "comment_text": "", "digests": { "md5": "f5f8034284df7efc4bdc502f9be0c610", "sha256": "b65d5f7aa1933d14a761daf4ee53c9752eff78c86e202cba92012778331da7c4" }, "downloads": -1, "filename": "dysl-0.13.tar.gz", "has_sig": false, "md5_digest": "f5f8034284df7efc4bdc502f9be0c610", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1768472, "upload_time": "2014-10-30T09:57:38", "url": "https://files.pythonhosted.org/packages/42/0b/e4427b6d36c30813fedede29970b8e63fa4d891c3d6eea5bba922bc2ef1e/dysl-0.13.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f5f8034284df7efc4bdc502f9be0c610", "sha256": "b65d5f7aa1933d14a761daf4ee53c9752eff78c86e202cba92012778331da7c4" }, "downloads": -1, "filename": "dysl-0.13.tar.gz", "has_sig": false, "md5_digest": "f5f8034284df7efc4bdc502f9be0c610", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1768472, "upload_time": "2014-10-30T09:57:38", "url": "https://files.pythonhosted.org/packages/42/0b/e4427b6d36c30813fedede29970b8e63fa4d891c3d6eea5bba922bc2ef1e/dysl-0.13.tar.gz" } ] }