{ "info": { "author": "Bulent Ozel", "author_email": "bulent.ozel@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: Apache Software License", "Programming Language :: Python", "Programming Language :: Python :: 3.6", "Topic :: Scientific/Engineering :: Information Analysis", "Topic :: Text Processing :: General" ], "description": "\n.. role:: math(raw)\n :format: html latex\n..\n\nDiscovery and Representation of Open Making Related Terms\n=========================================================\n\n+-------------+\n| Bulent |\n| Ozel, UZH |\n+-------------+\n| ``bulent.oz |\n| el@gmail.co |\n| m`` |\n+-------------+\n\nSupport for this work is partly covered by the OpenMaker Project:\nhttp://openmaker.eu/\n\nCollaborator(s): \\* Hamza Zeytinoglu\n\n--------------\n\nThe first objective of this module is to provide a customizable and\nstandardized text preprocessing prior to further analyses where more\nadvanced machine learning and or statistical techniques can be applied\nand compared with each other. In that sense, it provides a pipelined set\nof functionalities (i) to be able to inspect, organize, prune and merge\ntexts around one or very few specific theme(s) or topic(s), (ii) remove\nunwanted terms or literals from the texts, (iii) tokenize the texts,\n(iv) count the terms in texts, and (v) when desired stem the tokenized\nterms.\n\nThe second objective of this module is to be able compare or score a\nforeground corpus or a specific corpus against a background corpus or\nreference corpus. 
Example use cases could be, for instance, exploring\nthe language of a sub-culture, a community, or a movement, looking at the\nextent to which the group's specific use of language differentiates\nitself from the common language.\n\nIn cases where there are more than a few themes or topics, and\nwhere each topic is represented by a set of documents large enough to\njustify the use of standardized matrix decomposition based\nmethodologies, the scoring option of this module can be skipped\nentirely. More specifically, in use cases where the objective is to\nclassify and differentiate a number of topics or issues from\neach other and where there is sufficient data to fulfil the\nunderlying assumptions of NMF, LDA or LSI based approaches, tools\nfrom, for instance, Python\u2019s\n`sklearn.decomposition `__\npackage are suggested.\n\nNevertheless, the outputs of this module, such as its normalized term\nfrequencies or the specificity scores it assigns to them with respect to\na reference background corpus, can be used as input to other matrix\ndecomposition techniques.\n\nInstall\n-------\n\nA. Via Python's standard distribution channel PyPI\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: bash\n\n pip install omterms\n\nB. From its GitHub source\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n.. code:: bash\n\n git clone https://github.com/bulentozel/omterms.git\n\n.. code:: bash\n\n cd omterms\n\n.. code:: bash\n\n pip install .\n\nA quick use\n-----------\n\n.. code:: python\n\n >>> from omterms.interface import *\n >>> extract_terms(\"Some input X text to process less then 3 seconds.\").head()\n Configuring the text cleaner ...\n A single text is provided.\n Extracting the terms ...\n Tokenizing the input text ..\n Done. 
Number of terms: 10\n Cleaning process: Initial size of tokens = 10\n Reduction due to punctuations and stopwords = 3.\n Reduction due to all numeral terms = 1\n Reduction due to short terms = 1\n Reduction due to rare terms = 0\n Reduction due to partially numeral terms = 0\n Reduction due to terms with not allowed symbols = 0\n The total term count reduction during this cleaning process = 5\n Percentage = 50%\n COMPLETED.\n TF Term wTF\n 0 1 input 0.2\n 1 1 text 0.2\n 2 1 process 0.2\n 3 1 less 0.2\n 4 1 seconds 0.2\n >>> \n\nMore on usage\n-------------\n\n`Please see the\ntutorial. `__\n\n--------------\n\nRoadmap on Keyword and Keyphrase Extraction\n===========================================\n\nThe method outlined here aims to set up a baseline for future\nimprovements.\n\n- It uses a statistical approach combined with standardized procedures\n that are widely applied in standard NLP workflows.\n- In this baseline, it aims to present a workflow that can be applied\n to\n\n - different languages\n - different problem domains\n - analysis on a single theme with a limited training set\n\n1. Overall work flow\n--------------------\n\nIn short, the workflow presented in this notebook is the second stage of\na larger workflow whose objective is to measure the relevance of a\ngiven external input to a specific theme, issue or topic. The steps of\nthe workflow are as follows.\n\n1. Forming a specific corpus that consists of a set of\n documents around a topic. The corpus could be\n\n - a set of blog articles around an issue, say, green finance\n - or a set of Wikipedia articles around the same subject\n - or a collection of news articles around green finance\n - or a collection of tweets around the same issue.\n\n At the moment we have another module, a crawler that, given a set of seed\n Wikipedia articles around an issue, scrapes textual data\n from the articles. For the details of the module please `see the scraper\n module. 
`__.\n The output of that module is a set of input texts stored in a\n collection in JSON format.\n\n2. Given an input set of texts on a theme, a concept or a topic, identify\n the set of terms that are more likely or less likely to occur within a\n discussion on the topic. This module presents one of the\n simple methods for this purpose.\n\n3. Given a list of weighted terms which are more likely to occur in or\n represent a theme, concept or topic, and an input query text, measure the\n relevance of the input text to the topic/theme/concept. `The notebook\n in this\n link `__\n demonstrates one way of scoring a given text against the\n curated set of terms of this particular module.\n\n2. Suggested future work\n------------------------\n\n- Comparing and combining this comparison based scoring with matrix\n decomposition based topic modelling approaches such as NMF, LDA, LSI.\n\n- Using language-specific term frequency counts of Wikipedia itself for\n comparisons. In NLP terminology, the *foreground* corpus around a\n topic needs to be compared and contrasted to a *background* corpus.\n\n- Improving the semantic crawler of the previous stage to be able to\n increase the quality of the specific corpora\n\nMethodological Improvements\n~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n- Instead of tokenizing all terms, examine possibilities of key-phrase\n extraction combined with *tf-idf*, and\n\n - experiment with extracting noun phrases and words; for this use\n NLTK's regular expression module for POS (part of speech)\n analysis.\n - extract n-grams where n=1,2,3\n\n3. Definitions and assumptions\n------------------------------\n\nAssumptions\n~~~~~~~~~~~\n\n- At the comparison stage, it is assumed that a document's terms tend\n to be relatively frequent within the document as compared to an\n external reference corpus. However, it should be noted that this\n assumption is contested in the field. 
See the paper by Chuang et al.\n\n- Considering the fact that the crawler is used to aggregate\n a semantically related set of documents into a single document, *tf x\n idf* is equivalent to *tf*. As can be seen below, we use a normalized\n version of *tf*: *ntS / NS*.\n\n- A smaller but relatively more relevant training set (input corpus)\n is preferred in order to reduce term extraction problems due to the length\n of documents. However, it should be noted that the crawling depth of\n an identified wiki article from stage 1 of this document can be used\n as an additional weight on the relevance/representation of keywords.\n\n- We have limited ourselves to terms instead of n-grams and phrases or\n use of POS to be able to develop a base model that can work on\n different languages.\n\nTerm\n~~~~\n\nGiven, for instance, a set of texts around the open source software movement, a\nterm that is identified can be a word such as *openness*, a person such\nas *Stallman*, a license type such as *GNU*, an acronym for an\norganization such as *FSF* (the Free Software Foundation), or a technology\nsuch as *Emacs*.\n\nLikelihood ratio\n~~~~~~~~~~~~~~~~\n\nIt is a simple measure computed by comparing the frequency count of a term in a\nspecific corpus with its frequency count in the reference\ncorpus. Here the assumption is that the reference corpus is a large enough\nsample of the language for observing the occurrence of a term. 
Then a\nhigher or lower observed frequency of a term in the specific corpus is\na proxy indicator of term choice in a debate on the\ntopic.\n\nThe likelihood ratio for a term :math:`P_t` is calculated as:\n\n:math:`P_t = log ( (ntS/NS) / (ntR/NR) )`\n\nwhere\n\n- *ntS* is the raw frequency count of the term in the entire specific\n corpus\n- *ntR* is the raw frequency count of the term in the reference corpus\n- *NS* is the total number of terms in the specific corpus\n- *NR* is the total number of terms in the reference corpus\n\nIt should be noted that frequency counts are calculated after having\napplied the same tokenization and post processing, such as excluding\nstop-words, punctuation, rare terms, etc., to both the reference corpus\nand the specific corpus.\n\n4. Some thoughts on a conceptual approach at using the extracted keywords or phrases to predict topical relevance of a new text\n-------------------------------------------------------------------------------------------------------------------------------\n\nUsing the outcome of this technique to score arbitrary input texts\nagainst a single issue such as financial sustainability or against a set\nof issues such as the 10 basic human values requires\nnormalization of the raw scores and their rescaling/transformation.\n\nThe factors that need to be considered are:\n\n- **Differing document lengths:** The likelihood of repetition of a key\n phrase increases as the size of the input text gets larger. In more\n concrete terms, a scoring that simply sums up detections of\n weighted keyphrases or words within a given input text would be very\n sensitive to the document length. 
For instance, an executive\n summary of an article would very likely get a considerably lower score than\n the full article on any issue.\n\n *Among other methods, this can simply be resolved by computing per\n word scores, where the word set to be considered is the tokenized and\n cleaned set of words that represent the input text.*\n\n- **Topical relevance:** This factor would be important when the\n subject matter of the input texts varies. In other\n words, this factor would matter greatly, let's\n say, when one wants to compare perceptions of individuals on the role\n of privacy in democracies and when this question is not asked of them in\n a uniform manner, that is, under the same social, cultural,\n environmental and physical conditions.\n\n Let\u2019s assume that the issue under investigation is again privacy in\n democracies. It is possible that the same individual, as a blogger who\n has a strong pro-privacy opinion, (i) may not touch on the issue while\n talking about data science, (ii) would touch on the issue only slightly while\n talking about preferences in mobile devices, and (iii) would dive into the\n subject using all keywords and phrases when talking about the impact of\n privacy on democratic life. In brief, it is necessary to offset the\n variability of the topical relevance of an input text to the issue\n under investigation when arbitrary text samples are used for scoring.\n\n *An offsetting scheme can be devised when the opinion or perception of an\n actor is to be measured with respect to more than one factor that\n defines the issue under investigation. 
For instance, when we want to\n measure the position of a political leader on individual liberties vs\n social security or when we want to profile the discourse of the political\n leader in terms of a number of basic human values, we could employ some\n simple statistical methods in order to offset the topical relevance\n of the discourses or the speeches of the political figure to what we\n would like to measure.*\n\n *A simple method could be rescaling the scores on each sub factor,\n such as the scores of liberty and security that are measured from the\n same speech, into a range of -1 to 1. This can simply be done by\n taking the mean of the two, then deducting the mean from each\n score and scaling them onto a scale of -1 to 1. This way it may be\n possible to use multiple speeches of the same political figure on\n different topics to evaluate his or her position on the liberty vs\n security matter.*\n\nIn statistical terms this problem corresponds to adjusting or\nnormalizing ratings or scores measured on different scales to a\nnotionally common scale. Given the fact that in most cases a normal\ndistribution for the underlying factors may not be assumed, the\nquantile-normalization technique is suggested. Quantile\nnormalization sorts and ranks the variables with non-negative\namplitudes. Then these rankings can be scaled, for instance, to a 0-1\ninterval.\n\n- **Level of subjectivity**. This is variability in terms of the relative\n importance attributed to each issue out of a given set of issues. For\n instance, it is possible that a great many individuals or political\n leaders would attach a higher importance to individual liberties than\n security or the other way around. But the question might be rather to\n understand to what extent one attaches more importance to one issue\n than to the others. 
So when the objective of the scoring is not\n simply to establish an order of importance, comparative importance\n with respect to overall observations needs to be tackled.\n\n *Observed variances in each query text can be considered. That is,\n simple statistical methods can be used, for instance, to\n compare two or more query texts with respect to each other. A\n suggested method would be to (1) estimate the coefficient of variation for\n each input text using per-word scores and (2) rescale the\n quantile-normalized scores suggested above using the\n estimated coefficient of variation in each case.*\n\n *When this rescaling is applied to, for instance, liberty vs security,\n the coefficient of variation would act as a polarization measure.*\n\nScoring a group of variables\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nWhen one attempts to use scores generated by this package, using\nspecific vs reference corpus comparisons, on a group of variables, then\nboth the ranking of the scores and the relative importance of each\nscore from a number of texts from the same source should be taken into\nconsideration.\n\n5. State of the art\n-------------------\n\n- Survey Paper: Kazi Saidul Hasan and Vincent Ng, 2014. \u201cAutomatic\n Keyphrase Extraction: A Survey of the State of the Art\u201d Proceedings\n of the 52nd Annual Meeting of the Association for Computational\n Linguistics, pages 1262\u20131273.\n\n- Survey Paper: Sifatullah Siddiqi and Aditi Sharan. Keyword\n and Keyphrase Extraction Techniques: A Literature Review.\n International Journal of Computer Applications 109(2):18-23, January\n 2015\n\n- Survey Paper: Z. A. Merrouni, B. Frikh, and B. Ouhbi. Automatic\n keyphrase extraction: An overview of the state of the art. In 2016\n 4th IEEE Colloquium on Information Science and Technology (CiSt),\n pages 306\u2013313, Oct 2016\n\n- PageRank - Topical: Zhiyuan Liu, Wenyi Huang, Yabin Zheng and Maosong\n Sun, 2010. 
\u201cAutomatic Keyphrase Extraction via Topic Decomposition\u201d.\n Proceedings of the 2010 Conference on Empirical\n Methods in Natural Language Processing (EMNLP '10), pages 366-376\n\n- RAKE (Rapid Automatic Keyword Extraction): Stuart Rose, Dave Engel,\n Nick Cramer, and Wendy Cowley. Automatic keyword extraction from\n individual documents. Text Mining, pages 1\u201320, 2010.\n\n- TextRank - Graph Based: Rada Mihalcea and Paul Tarau. Textrank:\n Bringing order into texts. Association for Computational Linguistics,\n 2004.\n\n- STOPWORDS: S. Popova, L. Kovriguina, D. Mouromtsev, and I. Khodyrev.\n Stopwords in keyphrase extraction problem. In 14th Conference\n\n- Corpus Similarity - Keyword frequency based: Adam Kilgarriff. Using\n word frequency lists to measure corpus homogeneity and similarity\n between corpora. In Proceedings of ACLSIGDAT Workshop on very large\n corpora, pages 231\u2013245, 1997.\n\n- Recommendation - Keyphrase Based: F. Ferrara, N. Pudota and C. Tasso.\n A keyphrase-based paper recommender system. In: Digital Libraries and\n Archives. Springer Berlin Heidelberg, 2011. p. 14-25.\n\n- Jason Chuang, Christopher D. Manning, Jeffrey Heer, 2012. \"Without\n the Clutter of Unimportant Words\": Descriptive Keyphrases for Text\n Visualization. ACM Trans. 
on Computer-Human Interaction, 19(3), 1\u201329.\n\n+--------------------------------------------------------------+\n| Learn more about the OpenMaker project: http://openmaker.eu/ |\n+--------------------------------------------------------------+\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://github.com/bulentozel/omterms", "keywords": "", "license": "Apache2", "maintainer": "", "maintainer_email": "", "name": "omterms", "package_url": "https://pypi.org/project/omterms/", "platform": "", "project_url": "https://pypi.org/project/omterms/", "project_urls": { "Homepage": "http://github.com/bulentozel/omterms" }, "release_url": "https://pypi.org/project/omterms/0.1.4/", "requires_dist": [ "scipy", "matplotlib", "nltk", "numpy", "pandas", "scikit-learn" ], "requires_python": ">=3.6.0", "summary": "A customizable keyword extraction package.", "version": "0.1.4" }, "last_serial": 3857534, "releases": { "0.1.2": [ { "comment_text": "", "digests": { "md5": "5ba0d3a9e83754d9a198de4d5c7a95fd", "sha256": "df21b345b83e8275b8ed1d50267f91809366f2f9719015faf246e13908d94c1e" }, "downloads": -1, "filename": "omterms-0.1.2-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "5ba0d3a9e83754d9a198de4d5c7a95fd", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6.0", "size": 36472, "upload_time": "2018-05-08T11:43:17", "url": "https://files.pythonhosted.org/packages/3b/57/ce0f524219d772bd8c2f9f9b0c8c51ae22a82e86b39fede3f92629d5cbfd/omterms-0.1.2-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "fb6ba967cc887a450b012fb5fd68ef59", "sha256": "081e5b556f671aec8c4b49cdd4821be10b12b460f28e1e5265b1e3a650bab5a0" }, "downloads": -1, "filename": "omterms-0.1.2.tar.gz", "has_sig": false, "md5_digest": "fb6ba967cc887a450b012fb5fd68ef59", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 
3614061, "upload_time": "2018-05-08T11:43:24", "url": "https://files.pythonhosted.org/packages/a4/e3/2d4d0bb421c791a16465b9889de09ea9bf816b46e4fc01a5a6ac4d0d98eb/omterms-0.1.2.tar.gz" } ], "0.1.3": [ { "comment_text": "", "digests": { "md5": "332d9df893881c01cb73cbbf5b2b7ba6", "sha256": "191b34981d91828c239468d8656efcf758729357c5fec292423d52a0f5ba3d6a" }, "downloads": -1, "filename": "omterms-0.1.3-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "332d9df893881c01cb73cbbf5b2b7ba6", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6.0", "size": 36801, "upload_time": "2018-05-08T17:56:52", "url": "https://files.pythonhosted.org/packages/b3/e4/cffc46b9c6a9dd782e25c598c7ed1b4269fe070a172c440e83411f865714/omterms-0.1.3-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "b56c1bac83f1c9511882df15e44be12c", "sha256": "db6e8fcf3954b32bfc251edbb4a08766a1bf285f595095733fe790cd71f515f6" }, "downloads": -1, "filename": "omterms-0.1.3.tar.gz", "has_sig": false, "md5_digest": "b56c1bac83f1c9511882df15e44be12c", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 3614358, "upload_time": "2018-05-08T17:56:58", "url": "https://files.pythonhosted.org/packages/43/c1/716d47aa24fb47fc27942bcb68deabc8356529d57f10275a33aed98b41fb/omterms-0.1.3.tar.gz" } ], "0.1.4": [ { "comment_text": "", "digests": { "md5": "f1936d64f27df20f2a43ce55a80cadbf", "sha256": "471c3a33bac5a2cbfac84e301262c320fc10f2c8b5968670aa8e968bd63bf7a4" }, "downloads": -1, "filename": "omterms-0.1.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "f1936d64f27df20f2a43ce55a80cadbf", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6.0", "size": 36869, "upload_time": "2018-05-12T21:59:06", "url": "https://files.pythonhosted.org/packages/8e/b6/fefc4617e1785c504bd1449e535b119de715b6f3fddc8be7f3d712b07c65/omterms-0.1.4-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": 
"19d1f8f22f932e93e099581ad93d0029", "sha256": "0f894e8a2cee527eeff21bdde62bfca9f8b3c7e6cc429df1a5d91e281ea9e760" }, "downloads": -1, "filename": "omterms-0.1.4.tar.gz", "has_sig": false, "md5_digest": "19d1f8f22f932e93e099581ad93d0029", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 936854, "upload_time": "2018-05-12T21:59:13", "url": "https://files.pythonhosted.org/packages/03/e0/d857e5919b783fe775e5268823fc7f14582eef02a7715d7857727ba64c1a/omterms-0.1.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "f1936d64f27df20f2a43ce55a80cadbf", "sha256": "471c3a33bac5a2cbfac84e301262c320fc10f2c8b5968670aa8e968bd63bf7a4" }, "downloads": -1, "filename": "omterms-0.1.4-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "f1936d64f27df20f2a43ce55a80cadbf", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.6.0", "size": 36869, "upload_time": "2018-05-12T21:59:06", "url": "https://files.pythonhosted.org/packages/8e/b6/fefc4617e1785c504bd1449e535b119de715b6f3fddc8be7f3d712b07c65/omterms-0.1.4-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "19d1f8f22f932e93e099581ad93d0029", "sha256": "0f894e8a2cee527eeff21bdde62bfca9f8b3c7e6cc429df1a5d91e281ea9e760" }, "downloads": -1, "filename": "omterms-0.1.4.tar.gz", "has_sig": false, "md5_digest": "19d1f8f22f932e93e099581ad93d0029", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.6.0", "size": 936854, "upload_time": "2018-05-12T21:59:13", "url": "https://files.pythonhosted.org/packages/03/e0/d857e5919b783fe775e5268823fc7f14582eef02a7715d7857727ba64c1a/omterms-0.1.4.tar.gz" } ] }