{ "info": { "author": "Robert Lujo", "author_email": "trebor74hr@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Console", "Intended Audience :: Developers", "Intended Audience :: Education", "Intended Audience :: End Users/Desktop", "Intended Audience :: Science/Research", "License :: OSI Approved :: BSD License", "Natural Language :: Croatian", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python", "Topic :: Internet :: WWW/HTTP :: Indexing/Search", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Text Processing", "Topic :: Text Processing :: Indexing", "Topic :: Text Processing :: Linguistic" ], "description": "Morphological/Inflection/Lemmatization Engine for Croatian language\n===================================================================\n\"text-hr\" is Morphological/Inflectional/Lemmatization Engine for Croatian\nlanguage written in Python programming language. Includes stopwords and\nPart-Of-Speech tagging engine (POS tagging) based on inverse inflection\nalgorithm for detection.\n\nSince API is not freezed, this project is still in alpha.\n\nTAGS \n----\n Croatian language, lemmatization, stemming, inflection, python, natural\n language processing (NLP), Part-of-speech (POS) tagging, stopwords, inverse\n inflection, morphological lexicon\n\n\nOZNAKE\n------\n Hrvatski jezik, lematizacija, Python biblioteka, morfologija, infleksija,\n obrnuta infleksija, prepoznavanje vrsta rije\u010di, ra\u010dunalna obrada govornog\n jezika, zaustavne rije\u010di, morfolo\u0161ki leksikon\n\nAUTHOR\n======\nRobert Lujo, Zagreb, Croatia, find mail address in LICENCE\n\n\nFEATURES\n========\nTo name the most important:\n - inflection system - for producing all forms of one word\n - detection of word types (POS tagging) - from existing list of word forms\n - list of stopwords\n\nSystem is based on unicode strings, default codepage to convert from and to \nstring is cp-1250.\n\nCheck `Getting started`_.\n\nINSTALLATION\n============\nInstallation instructions - if you have installed pip package \nhttp://pypi.python.org/pypi/pip::\n\n pip install text-hr\n\nIf not, then do it old-fashioned way:\n - download zip from http://pypi.python.org/pypi/text-hr/\n - unzip\n - open shell\n - go to distribution directory\n - python setup.py install\n\n\nGETTING STARTED\n===============\nThere are three important parts that this project provides:\n - `Inflection system`_ - for producing all forms of one word\n - `Detection of word types (POS tagging)`_ - from existing list of word forms\n - `List of stopwords`_\n\nInflection system\n-----------------\nUsage example - start python shell::\n\n >>> from text_hr import Verb\n >>> v = Verb(\"platiti\")\n >>> for k in sorted(v.forms.keys()):\n ... print k, v.forms[k]\n ...\n AOR/P/1 [u'platismo']\n AOR/P/2 [u'platiste']\n AOR/P/3 [u'plati\\u0161e']\n AOR/S/1 [u'platih']\n AOR/S/2 [u'plati']\n AOR/S/3 [u'plati']\n IMP/P/1 [u'platasmo', u'pla\\u0107asmo', u'platijasmo']\n IMP/P/2 [u'plataste', u'pla\\u0107aste', u'platijaste']\n IMP/P/3 [u'platahu', u'pla\\u0107ahu', u'platijahu']\n ...\n VA_PA//P_O+S+V+N [u'pla\\u0107eno']\n X_INF// [u'platiti']\n X_VAD_PAS// [u'plativ\\u0161i']\n X_VAD_PRE// [u'plate\\u0107i']\n X_VAD_PRE// [u'plate\\u0107i']\n\nDetection of word types (POS tagging)\n-------------------------------------\nTODO: to be done - check test_detect.txt for samples, and detect.py for the logic:\n\nFirst example in test_detect.txt::\n\n >>> from text_hr.detect import WordTypeRecognizerExample\n >>> def test_it(word_list, wt_filter=None, level=2):\n ... wdh = WordTypeRecognizerExample(word_list, silent=True)\n ... if not wt_filter is None:\n ... wdh.detect(wt_filter=wt_filter, level=level) # e.g. wt_filter=[\"N\"]\n ... else:\n ... wdh.detect(level=level) # all word types\n ... lines_file = LinesFile()\n ... wdh.dump_result(lines_file) # doctest: +NORMALIZE_WHITESPACE +ELLIPSIS\n ... print \"\\n\".join(lines_file.lines)\n ... return wdh\n\n >>> class LinesFile(object):\n ... def __init__(self):\n ... self.lines = []\n ... def write(self, s):\n ... self.lines.append(repr(s.rstrip()))\n\n >>> word_list = [\n ... \"Broj 84\"\n ... , \"broji 34\"\n ... , \"Brojila 28\"\n ... , \"broje 23\"\n ... , \"broje\u0107i 22\"\n ... , \"brojim 7\"\n ... , \"brojimo 5\"\n ... , \"broji\u0161 4\"\n ... , \"brojahu 2\"\n ... , \"broja\u0161e 1\"\n ... , \"brojite 1\"\n ... , \"-brijestovu 1\"\n ... , \"brijestovi 1\" #the only one checked with endswith, but all other will be checked with get_freq\n ... , \"-brijestove 1\"\n ... , \"-brijestova 1\"\n ... ]\n\n Lowest quality, but fastest\n >>> wdh = test_it(word_list, level=4) # doctest: +ELLIPSIS\n \" 10/ 183 -> brojati (u'V-XX_-_JATI-je\\\\u0107i-0') 84/broj,34/broji,23/broje,22/broje\\xe6i,7/brojim,5/brojimo,4/broji\\x9a,2/brojahu,1/brojite,1/broja\\x9ae\"\n\nList of stopwords\n-----------------\nIs located in std_words.txt, and you can read it directly from here\n\n http://bitbucket.org/trebor74hr/text-hr/src/tip/text_hr/std_words.txt\n\nThe list can be updated like this::\n \n >>> import text_hr\n >>> text_hr.dump_all_std_words()\n Totaly 2904 word forms dumped to r:\\hg-clones\\python\\text-hr\\text_hr\\std_words.txt in codepage utf8\n\nIteration over all words goes like this::\n\n from text_hr import get_all_std_words\n\n for word_base, l_key, cnt, _suff_id, wform_key, wform in get_all_std_words():\n print word_base, l_key, cnt, _suff_id, wform_key, wform\n\n\nFurther\n-------\nSince there is currently no good documentation, the best source of \nfurther information is by reading tests inside of modules and\ntests in tests directory (dev version). More information in `Running tests`_.\nYou can allways read a source.\n\n\nDOCUMENTATION\n=============\nCurrently there is no documentation. In progress ...\n\n\nSUPPORT\n=======\nSince this project is limited by my free time, support is limited. \n\n\nREPORT BUG OR REQUEST FEATURE\n-----------------------------\nIf you encounter bug, the best is to report it to the bitbucket web page\nhttp://bitbucket.org/trebor74hr/text-hr.\n\nIf there will be an interest for development for other inflection rich\nlanguages, I'd be glad to decouple language specific code and create new\nproject that will be capable to deal with multiple languages.\n\nThe best way to contact me is by mail (find in LICENCE).\n\nTODO list is in readme.txt (dev version).\n\n\nCONTRIBUTION\n============\nSince this project is not currently in the stable API phase, contribution\nshould wait for a while.\n\n\nRUNNING TESTS\n=============\nAll tests are doctests (not unittests). There are three type of tests in the\npackage: \n\n 1. doctests in each module - e.g. in verbs.py\n 2. doctests in tests/test_*.txt - only development version\n 3. tests which are not automatically compared - i.e. in special call mode\n detect.py can produce output file which needs to be compared \n manually with some existing file. Such test(s) are very slow. This needs\n to be changed to be automatic.\n\nRunning each module directly will run 1. and 2. if running from development\nversion. To get development version\nTo use development version (http://bitbucket.org/trebor74hr/text-hr)::\n\n hg clone https://bitbucket.org/trebor74hr/text-hr\n\n\ncreate text_hr.pth in python site-packages directory with path to text-hr e.g.::\n\n r:\\hg-clones\\python\\text-hr\n\nTo run all tests:\n - go to tests directory\n - run tests.py like (with sample output)::\n\n > python tests.py\n testing module __init__\n testing module adjectives\n ...\n testing textfile R:\\hg-clones\\python\\text-hr\\tests\\test_adj.txt\n ...\n testing textfile R:\\hg-clones\\python\\text-hr\\tests\\test_verbs_type.txt\n \nTo run tests for just one module:\n - goto text_hr directory\n - run tests by running module, e.g.::\n\n > py pronouns.py\n __main__: running doctests\n ..\\tests\\test_pronouns.txt: running doctests\n\n - in the case you're not running from dev version, you'll get output like\n this::\n\n > py pronouns.py\n __main__: running doctests\n ..\\tests\\test_pronouns.txt: Not found, skipping\n\nADDITIONAL\n==========\nMaster thesis pdf in Croatian (134 pages) with title::\n\n Lociranje sli\u010dnih logi\u010dkih cjelina u tekstualnim \n dokumentima na hrvatskome jeziku\n\ncan be found at:\n\n http://bitbucket.org/trebor74hr/text-hr/downloads/magistarski-konacni.pdf\n\nTODO\n====\nvarious things, see readme.txt for details.\n\nCHANGES\n=======\n0.18\n----\nulr1 121210 \n - fixed wrong readme on bitbucket homepage\n\n0.17 \n----\nulr1 100617 \n - utf-8 setup \n\n0.16 \n----\nulr1 100617 \n - master thesis pdf added to repository (in Croatian, 134 pages)\n\n0.15 \n----\nulr1 100617 \n - minor changes\n\n0.14 \n----\nulr1 100617 \n - beta release\n - tags: lemmatization, stemming\n\n0.13 \n----\nulr1 100610:\n - text_hr package reorganized (__init__.py with __all__ and imports ...)\n - word_types.py removed\n - std_words.txt\n\n0.12 \n----\nulr1 100608 :\n - README\n - enabled tests from tests.py for all \n - enabled tests from directly from each modules\n\n0.11 \n----\nulr1 100607:\n - recreated repo at bitbucket\n - no .suff_registry.pickle and testing_*.out put in zip\n\n0.10\n----\nulr1 100605:\n - first installable release", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://bitbucket.org/trebor74hr/text-hr/", "keywords": null, "license": "UNKNOWN", "maintainer": null, "maintainer_email": null, "name": "text-hr", "package_url": "https://pypi.org/project/text-hr/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/text-hr/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://bitbucket.org/trebor74hr/text-hr/" }, "release_url": "https://pypi.org/project/text-hr/0.18/", "requires_dist": null, "requires_python": null, "summary": "Morphological/Inflection/Lemmatization Engine for Croatian language, POS tagger, stopwords", "version": "0.18" }, "last_serial": 800502, "releases": { "0.10": [ { "comment_text": "", "digests": { "md5": "4454894d19198511a825cac00f42d333", "sha256": "60a2ec92975c0804493fd2b1919d0e5d74a54bf9d2ee7dbb1f1c19ced6fcf9e5" }, "downloads": -1, "filename": "text-hr-0.10.zip", "has_sig": false, "md5_digest": "4454894d19198511a825cac00f42d333", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1009823, "upload_time": "2010-06-06T00:45:44", "url": "https://files.pythonhosted.org/packages/45/08/b7ae73c261c29c4a9c95d76a648e1fb3473e4b6b7928010253f89df757d6/text-hr-0.10.zip" } ], "0.11": [ { "comment_text": "", "digests": { "md5": "cbba1a151713ee6205605f767abe6a99", "sha256": "d47d34af7b1238dd40124ce464f0e082ece8e07a3ef852d11f72917157b5b24f" }, "downloads": -1, "filename": "text-hr-0.11.zip", "has_sig": false, "md5_digest": "cbba1a151713ee6205605f767abe6a99", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 95695, "upload_time": "2010-06-07T21:30:55", "url": "https://files.pythonhosted.org/packages/85/1e/cd88574ee78d7c99fbc6a799157fb1a50c020bc10318295de0c530116fe0/text-hr-0.11.zip" } ], "0.12": [ { "comment_text": "", "digests": { "md5": "8e1503a42ec900b48bd0534aae0434c8", "sha256": "845a15e8624f2dd50f224b810adfb8cf35489a56b712d0108b0e7bca447bd95e" }, "downloads": -1, "filename": "text-hr-0.12.zip", "has_sig": false, "md5_digest": "8e1503a42ec900b48bd0534aae0434c8", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 101489, "upload_time": "2010-06-08T02:24:20", "url": "https://files.pythonhosted.org/packages/0e/8a/b401d758922aecce2d863873ae49ebbd49cdd77a45743dc7d74c5476efcb/text-hr-0.12.zip" } ], "0.13": [ { "comment_text": "", "digests": { "md5": "4c959a7e955a7dfb77a62e9dc57d42be", "sha256": "e6e99f5744f3408ef0d191d2a74a9508e6d9dc41784322928f9b398c0649bb95" }, "downloads": -1, "filename": "text-hr-0.13.zip", "has_sig": false, "md5_digest": "4c959a7e955a7dfb77a62e9dc57d42be", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 119060, "upload_time": "2010-06-10T01:13:40", "url": "https://files.pythonhosted.org/packages/c6/ae/5d70a2b0af32f353d78c8245774a159ea1914aa5399c02e4dd8620cb43d5/text-hr-0.13.zip" } ], "0.14": [ { "comment_text": "", "digests": { "md5": "6ab862f7d0c0be33e3b657f6f93e1d30", "sha256": "4ed3f131037b67afa4ef6fc2fa8bc90f6e6b1320467a185e455b2f593d71738a" }, "downloads": -1, "filename": "text-hr-0.14.zip", "has_sig": false, "md5_digest": "6ab862f7d0c0be33e3b657f6f93e1d30", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 119182, "upload_time": "2010-06-17T19:03:01", "url": "https://files.pythonhosted.org/packages/5a/20/d9a694c7df5919d68cc7f12ee1c85ee7c2048595b4882b9ec036b1af3542/text-hr-0.14.zip" } ], "0.16": [ { "comment_text": "", "digests": { "md5": "a81b76b04de73f71353d151dbc14bdb3", "sha256": "9daed73cba77d21db51fe792f4c8595465dd49678d923943ec9824cd56ef517b" }, "downloads": -1, "filename": "text-hr-0.16.zip", "has_sig": false, "md5_digest": "a81b76b04de73f71353d151dbc14bdb3", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 119684, "upload_time": "2010-07-09T01:35:42", "url": "https://files.pythonhosted.org/packages/3b/9d/a111f4c289ba785f01e30be55f69e23c527c2a038ca58059628c6a084396/text-hr-0.16.zip" } ], "0.17": [ { "comment_text": "", "digests": { "md5": "c5e00de08d0b465a1624028c17cc29d0", "sha256": "557b65b1c359aa468088e9b682ae133153a847aec39957025a941649e26fb51e" }, "downloads": -1, "filename": "text-hr-0.17.zip", "has_sig": false, "md5_digest": "c5e00de08d0b465a1624028c17cc29d0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 119745, "upload_time": "2010-07-09T01:39:41", "url": "https://files.pythonhosted.org/packages/b0/03/89535ded41e1e2b832e37dd949398f2e00d0befc9017e56c7dd492de89fd/text-hr-0.17.zip" } ], "0.18": [ { "comment_text": "", "digests": { "md5": "e14d9388a18ed36c4949257b4ccbcbf5", "sha256": "3fa7ecf6ec8b45ff2070e4e91c3d19f7dcbd68315a336e4d1b4f130a8c404475" }, "downloads": -1, "filename": "text-hr-0.18.tar.gz", "has_sig": false, "md5_digest": "e14d9388a18ed36c4949257b4ccbcbf5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 110116, "upload_time": "2012-12-10T21:46:15", "url": "https://files.pythonhosted.org/packages/c1/bd/0a079a34e0ccf5dd4650ad5366a67f7530e379b84f667cc13569a91d4c65/text-hr-0.18.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "e14d9388a18ed36c4949257b4ccbcbf5", "sha256": "3fa7ecf6ec8b45ff2070e4e91c3d19f7dcbd68315a336e4d1b4f130a8c404475" }, "downloads": -1, "filename": "text-hr-0.18.tar.gz", "has_sig": false, "md5_digest": "e14d9388a18ed36c4949257b4ccbcbf5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 110116, "upload_time": "2012-12-10T21:46:15", "url": "https://files.pythonhosted.org/packages/c1/bd/0a079a34e0ccf5dd4650ad5366a67f7530e379b84f667cc13569a91d4c65/text-hr-0.18.tar.gz" } ] }