{ "info": { "author": "Tarek Amr", "author_email": "gr33ndata@yahoo.com", "bugtrack_url": null, "classifiers": [], "description": "Information Retrieval Library \n=============================\n\nI started writing this library as part of my `Information Retrieval and Natural Language Processing (IR and NLP) `_ module in the `University of East Anglia `_. It was mainly meant to detect Review Spam (Machine Learning - Classification). However, It has more functionalities such as `Vector Space Model (VSM) `_, as well as some other IR functions such as tokenizing, n-grams, stemming and PoS (Part of Speech) tagging.\n\nInstallation\n-------------\n\nYou sure need to have Python installed on your computer.\n\nAnother **optional** module might be needed, `NLTK `_ \nThis is only needed in case of stemming and PoS (part of speech) tagging \n\nTo install the package, write::\n\n python setup.py install\n\n\nCode Organization\n-----------------\n\nFirst of all, the code is divided into the following main components:\n\n#. matrix.py\n#. metrics.py\n#. classifier.py\n#. preprocessor.py\n#. configuration.py\n#. superlist.py\n\nmatrix.py: This is where documents (vecotor space) index is implemented.\n\nmetrics.py: Distance measures (cosine and euclidean distance).\n\nclassifier.py: The following classifiers are implemented here:\n\n* Rocchio: Rocchio Classifier \n* KNN: k-NN Classifier\n* Bayes: Naive Bayes Classifier\n \nThe above 3 classes (class as in object oriented programming not machine learning), are inherited from 'Index'\nOne more class here is 'Evaluation', which is used to calculate accuracy during testing.\n\npreprocessor.py is the module where files parsing tokenizing, stemming and PoS tagging are implemented\nMore feature selection algorithms (such as Mutual Information Gain), should be implemented here.\n \nHow to use for Vector Space IR\n-------------------------------\n\nThe main modules to use here are matrix.py and metrics.py,\nhowever you might find the preprocessor.py useful too::\n\n # Load the three modules:\n from irlib.preprocessor import Preprocessor\n from irlib.matrix import Matrix\n from irlib.metrics import Metrics\n \n # Create instances for their classes:\n prep = Preprocessor()\n mx = Matrix()\n metric = Metrics()\n\nFor simplicity, let's assume you have a single text file tweets.txt,\nAnd each line represent a seperate tweet (document).\n\nWe will read the file, line after the other, \nand then use the preprocessor to tokenize the line into words.\nThere are more options for the ngram_tokenizer, such as producing bigrams,\nconverting tokens into lower space, stemming and converting to PoS.\nHowever, we will stick to the default options for now.\nThen we will load the document into our VSM, aka Matrix. \nMore on the frequency and do_padding options later on:: \n\n fd = open('tweets.txt','r')\n for line in fd.readlines(): \n terms = prep.ngram_tokenizer(text=line)\n mx.add_doc( doc_id='some-unique-identifier', \n doc_terms=terms, \n frequency=True, \n do_padding=True)\n \nNow, we are done with loading our documents, let someone search for a tweet::\n\n q = raw_input(\"Enter query to search for a tweet: \")\n\nAgain, we can use the preprocessor to tokenize the query.\nWe also need to align the terms used in the query with ones read from ducuments,\ni.e. we need them both to be put in equal length lists and ignore terms in query\nnot seen before when reading the documents::\n\n terms = prep.ngram_tokenizer(text=q)\n q_vector = query_to_vector(terms, frequency=False)\n\nFinally, to get the best matching document to our query, \nwe can loop on all documents in the Matrix and find the one with least distance.\nWe will just show the looping here, you can easily compare distances, \nand sort documents according to their relevance if you want to:: \n\n for doc in self.mx.docs:\n distance = metric.euclid_vectors(doc['terms'], q_vector)\n\nWe are done for now, but before moving to the next section let me discuss \nthe following:\n\nThe add_doc() method in Matrix has two more options we skipped earlier:\n \n* frequency:: \n\n If True, then we are using a multinomial mode (You usually will need this) \n i.e. if terms occurs n times in document, then it frequency is n.\n If False, then we are using a multivariate (Bernoulli) mode, \n i.e. if terms occurs in document at least one time, then it frequency is 1 \n otherwise its frequency is zero.\n As mentioned above, you will normally need the multinomial mode,\n We just put the Bernoulli mode for the sake of Naive Bayesian Classifier\n \n* do_padding::\n\n Each time we add new document, we also see new terms, \n hence if terms represent columns in our matrix and documents represent rows,\n we might end with new rows of bigger length than order rows.\n So, padding here is to align the length of older rows with newer ones.\n So you either set this to True with each new document, \n or call mx.do_padding() when done.\n\nWait a minute, two more notes: \n\n* We haven't converted our VSM into tf.idf in the previous example, \nhowever, you normally need to do so. So you have to call the follwing method, \nright after loading your documents and before doing searches::\n\n mx.tf_idf()\n\nWe have used the Euclidean distance in our example, yet you may need to use \ncosine distance instead, so, here is the method for that::\n\n metric.cos_vectors() \n\n\nHow to use for classification\n------------------------------\n\nYour code should does the following\n\n* Reading and parseing the configuration file::\n\n\t# You first need to import the configuration class\n\tfrom irlib.configuration import Configuration \n\n\t# Then you load configuration\n\t# Sample configuration file in the root directory: sample.conf\n\tconfig = Configuration(config_file='your_file.conf')\n\tconfig.load_configuration()\n \n* Initiate the preprocessors::\n\n\t# You first need to import the configuration class\n\tfrom irlib.preprocessor import Preprocessor\n\n\t# Then you create a new preprocessor object\n\t# prep = Preprocessor(pattern='\\W+', lower=True, stem=False, pos=False, ngram=2)\n\n\t# However, you normally get the values from the confiuration\n\tconfig_data = config.get_configuration()\n\tprep = Preprocessor(pattern='\\W+', lower=config_data['lower'], stem=config_data['stem'], \n\t\t\t\t\t\tpos=config_data['pos'], config_data['ngram'])\n\n* Initiate your evaluation module::\n\t\n\t# You first need to import this\n\tfrom irlib.classifier import Evaluation \n\n\t# You give it the configuration class object, see above\n\t# For now, you can safely set fold = 0\n\n\t# This is used for cross-testing, \n\t# to keep track which part of dataset is being used for testing at the moment\n\tev = Evaluation(config=config, fold=myfold)\n\n* Select and initiate the desired Classifier [Rocchio, KNN or Naive Bayes]::\n\n\t# Once more don't forget to import your desired classe(s)\n\t#from irlib.classifier import Rocchio \n\t#from irlib.classifier import KNN \n\tfrom irlib.classifier import NaiveBayes \n\t\n\t# VERBOSE = True or False\n\t# Fold, see above for more details\n\t# Objects for both configuration and evaluation classes\n\tml = NaiveBayes(verbose=VERBOSE, fold=myfold, config=config, ev=ev)\n\n* Training::\n\n\t# Read your files one after the other (example fd = open('file1.txt', 'r'))\n\t# You may use our preprocessor for tokenizing data got from fd.read()\n\tterms = prep.ngram_tokenizer(text=file_data)\n\n\t# Then add document to your classifier\n\t# doc_id: you can call it anything you want\n\t# You should tell the classifier which class the doc belongs to\n\t# This should be the same as ones mentioned in configuration file\n\tml.add_doc(doc_id = doc_id, doc_class=class_name, doc_terms=terms)\n\n* Some house keeping::\n\n\t# You shall first call do padding to align and tide read data\n\tml.do_padding()\n\n\t# Then you should do the actual learning from training data\n\tml.calculate_training_data()\n\n\t# This one is optional, in case you need more details to be printed \n\tml.diagnose()\n\n* Testing::\n\n\t# Just as in training, you can use the preprocessor\n\tterms = prep.ngram_tokenizer(text=file_data)\n\n\t# Then add the document, we call them queries this time, notice function name\n\tml.add_query(query_id = doc_id, query_class=class_name, query_terms=terms)\t\n\n* Get Evaluation results::\n\n\t# Remember the evaluation class we created earlier\n\t# Now we can call it to tell us some nice results\n\tresults = ev.calculate(review_spam=True, k=k)\n\n* If we are doing cross checking here, the previous 4 steps are repeated for all folds \n\nContacts\n--------\n \n+ Name: Tarek Amr \n+ Twitter: @gr33ndata", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/gr33ndata/irlib", "keywords": null, "license": "LICENSE.txt", "maintainer": null, "maintainer_email": null, "name": "irlib", "package_url": "https://pypi.org/project/irlib/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/irlib/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/gr33ndata/irlib" }, "release_url": "https://pypi.org/project/irlib/0.1.1/", "requires_dist": null, "requires_python": null, "summary": "Inforamtion Retrieval Library", "version": "0.1.1" }, "last_serial": 793437, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "165540a86de8eb80adbbf5bf13bf3f55", "sha256": "14d9033a99a2bde3eba80261843444504db74a8edb6227f75c5eac893526604e" }, "downloads": -1, "filename": "irlib-0.1.1.tar.gz", "has_sig": false, "md5_digest": "165540a86de8eb80adbbf5bf13bf3f55", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 20677, "upload_time": "2013-01-29T01:06:22", "url": "https://files.pythonhosted.org/packages/d1/7d/2eea94e208ec00fc4d4f60a8f6264a8c93f80644e372ce865af9d2d3a7a2/irlib-0.1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "165540a86de8eb80adbbf5bf13bf3f55", "sha256": "14d9033a99a2bde3eba80261843444504db74a8edb6227f75c5eac893526604e" }, "downloads": -1, "filename": "irlib-0.1.1.tar.gz", "has_sig": false, "md5_digest": "165540a86de8eb80adbbf5bf13bf3f55", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 20677, "upload_time": "2013-01-29T01:06:22", "url": "https://files.pythonhosted.org/packages/d1/7d/2eea94e208ec00fc4d4f60a8f6264a8c93f80644e372ce865af9d2d3a7a2/irlib-0.1.1.tar.gz" } ] }