{ "info": { "author": "Metglobal", "author_email": "kadir.pekel@metglobal.com", "bugtrack_url": null, "classifiers": [], "description": "======\nkorpus\n======\n\nKorpus is a tf-idf algorithm implementation , simply helps you create a corpus\nof documents which you can query it to find similiar documents for a given\ninput. So, what is tf-idf?\n\nWiki definition (http://en.wikipedia.org/wiki/Tf-idf):\n\n Tf\u2013idf, term frequency\u2013inverse document frequency, is a numerical statistic\n which reflects how important a word is to a document in a collection or\n corpus.\n \n It is often used as a weighting factor in information retrieval and text\n mining. The tf-idf value increases proportionally to the number of times a\n word appears in the document, but is offset by the frequency of the word in\n the corpus, which helps to control for the fact that some words are\n generally more common than others.\n\nBasically, korpus is your best friend, while you are trying to approach what\nthe actual input meant to be using your pre-indexed document base. This is the\napproach what Lucene (the most popular java full text search engine) uses in\nbackyard (http://lucene.apache.org/core/old_versioned_docs/versions/2_9_0/api/all/org/apache/lucene/search/Similarity.html)\n\nQuickstart\n----------\n\nLet's take a look the example below. Currently a document defined as a \n``(id, content)`` tuple and we are going to create a corpus with a bunch of\nidioms using these document tuples. Once the corpus created, under the hood,\nour idioms are automatically indexed and weighted which is meant to be ready\nfor querying::\n\n >>> from korpus import Corpus\n\n >>> common_idioms = [\n (1, 'Piece of cake'),\n (2, 'Costs an arm and a leg'),\n (3, 'Break a leg'),\n (4, 'Hit the books'),\n (5, 'Let the cat out of the bag'),\n (6, 'Hit the nail on the head'),\n (7, 'When pigs fly'),\n (8, 'You can\u2019t judge a book by its cover'),\n (9, 'Bite off more than you can chew '),\n (10, 'Scratch someone\u2019s back'),\n ]\n\n >>> corpus = Corpus(common_idioms)\n >>> resutls = corpus.query('Hit the nail', min_score=0.2)\n [(6, 0.6134307406647964, 4), (4, 0.2928327297980855, 4)]\n\nWe tried to find similiar idioms by our input ``Hit the nail`` with a minimum\nsimilarity score of ``0.2``. The returned list of objects contains the\ninformation about matched items in corresponding corpus. In this case there two\nmatched items of ids ``6`` and ``4`` with similarity scores\n``0.6134307406647964`` and ``0.2928327297980855`` beside the total match count\nof ``4``\n\nThis means there are 4 matched results. Two of them are above the ``min_score``\nthreshold those are::\n\n * Hit the nail on the head (0.613)\n * Hit the books (0.292)\n\nDocumentation\n-------------\n\nComing soon...\n\n\nContributors\n------------\n\n* Mumin Ozturk (`@mumino `_)\n* Kadir Pekel (`@kadirpekel `_)\n\nLicense\n-------\nCopyright (c) 2013 Metglobal LLC.\n\nPermission is hereby granted, free of charge, to any person obtaining a copy of\nthis software and associated documentation files (the 'Software'), to deal in\nthe Software without restriction, including without limitation the rights to\nuse, copy, modify, merge, publish, distribute, sublicense, and/or sell copies\nof the Software, and to permit persons to whom the Software is furnished to do\nso, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/metglobal/korpus", "keywords": null, "license": "MIT", "maintainer": null, "maintainer_email": null, "name": "korpus", "package_url": "https://pypi.org/project/korpus/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/korpus/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/metglobal/korpus" }, "release_url": "https://pypi.org/project/korpus/0.0.1/", "requires_dist": null, "requires_python": null, "summary": "similarity made easy!", "version": "0.0.1" }, "last_serial": 883162, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "7cfa8cf4430969fbd3285a93ba19f372", "sha256": "23e9d183a05a4d1be17d518ac6579a9037889ca5329a94c4ca31f2a483d28387" }, "downloads": -1, "filename": "korpus-0.0.1.tar.gz", "has_sig": false, "md5_digest": "7cfa8cf4430969fbd3285a93ba19f372", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6060, "upload_time": "2013-10-07T14:28:51", "url": "https://files.pythonhosted.org/packages/b1/b9/59c6fe533bce4b4c7914717c3f2c185e806a685a1af3ad979d666d09bc90/korpus-0.0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "7cfa8cf4430969fbd3285a93ba19f372", "sha256": "23e9d183a05a4d1be17d518ac6579a9037889ca5329a94c4ca31f2a483d28387" }, "downloads": -1, "filename": "korpus-0.0.1.tar.gz", "has_sig": false, "md5_digest": "7cfa8cf4430969fbd3285a93ba19f372", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 6060, "upload_time": "2013-10-07T14:28:51", "url": "https://files.pythonhosted.org/packages/b1/b9/59c6fe533bce4b4c7914717c3f2c185e806a685a1af3ad979d666d09bc90/korpus-0.0.1.tar.gz" } ] }