{ "info": { "author": "Ganggao Zhu", "author_email": "gzhu@dit.upm.es", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Science/Research", "Operating System :: OS Independent", "Programming Language :: Python :: 2.7", "Topic :: Software Development :: Libraries" ], "description": "![logo](docs/sources/img/logo.png)\n\n------------------\n\n# Introduction\n\nSematch is an integrated framework for the development, evaluation, and application of semantic similarity for Knowledge Graphs (KGs). It is easy to use Sematch to compute semantic similarity scores of concepts, words and entities. Sematch focuses on specific knowledge-based semantic similarity metrics that rely on structural knowledge in taxonomy (e.g. depth, path length, least common subsumer), and statistical information contents (corpus-IC and graph-IC). Knowledge-based approaches differ from their counterpart corpus-based approaches relying on co-occurrence (e.g. Pointwise Mutual Information) or distributional similarity (Latent Semantic Analysis, Word2Vec, GLOVE and etc). Knowledge-based approaches are usually used for structural KGs, while corpus-based approaches are normally applied in textual corpora.\n\nIn text analysis applications, a common pipeline is adopted in using semantic similarity from concept level, to word and sentence level. For example, word similarity is first computed based on similarity scores of WordNet concepts, and sentence similarity is computed by composing word similarity scores. Finally, document similarity could be computed by identifying important sentences, e.g. TextRank.\n\n![logo](docs/sources/img/sematch-motivation.jpg)\n\n\n\nKG based applications also meet similar pipeline in using semantic similarity, from concept similarity (e.g. `http://dbpedia.org/class/yago/Actor109765278`) to entity similarity (e.g. `http://dbpedia.org/resource/Madrid`). Furthermore, in computing document similarity, entities are extracted and document similarity is computed by composing entity similarity scores. \n\n![kg](docs/sources/img/kg.png)\n\nIn KGs, concepts usually denote ontology classes while entities refer to ontology instances. Moreover, those concepts are usually constructed into hierarchical taxonomies, such as DBpedia ontology class, thus quantifying concept similarity in KG relies on similar semantic information (e.g. path length, depth, least common subsumer, information content) and semantic similarity metrics (e.g. Path, Wu & Palmer,Li, Resnik, Lin, Jiang & Conrad and WPath). In consequence, Sematch provides an integrated framework to develop and evaluate semantic similarity metrics for concepts, words, entities and their applications. \n\n------------------\n\n\n\n## Getting started: 20 minutes to Sematch\n\n### Install Sematch\n\nYou need to install scientific computing libraries **numpy** and **scipy** first. An example of installing them with pip is shown below.\n\n```\npip install numpy scipy\n```\n\nDepending on different OS, you can use different ways to install them. After sucessful installation of **numpy** and **scipy**, you can install sematch with following commands.\n\n```\npip install sematch\npython -m sematch.download\n```\n\nAlternatively, you can clone and install Sematch with setuptools. We recommend you to update your pip and setuptools.\n\n```\ngit clone https://github.com/gsi-upm/sematch.git\ncd sematch\npython setup.py install\n```\n\nWe also provide a [Sematch-Demo Server](https://github.com/gsi-upm/sematch-demo). You can use it for experimenting with main functionalities or take it as an example for using Sematch to develop applications. Please check our [Documentation](http://gsi-upm.github.io/sematch/) for more details.\n\n### Computing Word Similarity\n\nThe core module of Sematch is measuring semantic similarity between concepts that are represented as concept taxonomies. Word similarity is computed based on the maximum semantic similarity of WordNet concepts. You can use Sematch to compute multi-lingual word similarity based on WordNet with various of semantic similarity metrics.\n\n```python\nfrom sematch.semantic.similarity import WordNetSimilarity\nwns = WordNetSimilarity()\n\n# Computing English word similarity using Li method\nwns.word_similarity('dog', 'cat', 'li') # 0.449327301063\n# Computing Spanish word similarity using Lin method\nwns.monol_word_similarity('perro', 'gato', 'spa', 'lin') #0.876800984373\n# Computing Chinese word similarity using Wu & Palmer method\nwns.monol_word_similarity('\u72d7', '\u732b', 'cmn', 'wup') # 0.857142857143\n# Computing Spanish and English word similarity using Resnik method\nwns.crossl_word_similarity('perro', 'cat', 'spa', 'eng', 'res') #7.91166650904\n# Computing Spanish and Chinese word similarity using Jiang & Conrad method\nwns.crossl_word_similarity('perro', '\u732b', 'spa', 'cmn', 'jcn') #0.31023804699\n# Computing Chinese and English word similarity using WPath method\nwns.crossl_word_similarity('\u72d7', 'cat', 'cmn', 'eng', 'wpath')#0.593666388463\n```\n\n### Computing semantic similarity of YAGO concepts.\n\n```python\nfrom sematch.semantic.similarity import YagoTypeSimilarity\nsim = YagoTypeSimilarity()\n\n#Measuring YAGO concept similarity through WordNet taxonomy and corpus based information content\nsim.yago_similarity('http://dbpedia.org/class/yago/Dancer109989502','http://dbpedia.org/class/yago/Actor109765278', 'wpath') #0.642\nsim.yago_similarity('http://dbpedia.org/class/yago/Dancer109989502','http://dbpedia.org/class/yago/Singer110599806', 'wpath') #0.544\n#Measuring YAGO concept similarity based on graph-based IC\nsim.yago_similarity('http://dbpedia.org/class/yago/Dancer109989502','http://dbpedia.org/class/yago/Actor109765278', 'wpath_graph') #0.423\nsim.yago_similarity('http://dbpedia.org/class/yago/Dancer109989502','http://dbpedia.org/class/yago/Singer110599806', 'wpath_graph') #0.328\n```\n\n### Computing semantic similarity of DBpedia concepts.\n\n```python\nfrom sematch.semantic.graph import DBpediaDataTransform, Taxonomy\nfrom sematch.semantic.similarity import ConceptSimilarity\nconcept = ConceptSimilarity(Taxonomy(DBpediaDataTransform()),'models/dbpedia_type_ic.txt')\nconcept.name2concept('actor')\nconcept.similarity('http://dbpedia.org/ontology/Actor','http://dbpedia.org/ontology/Film', 'path')\nconcept.similarity('http://dbpedia.org/ontology/Actor','http://dbpedia.org/ontology/Film', 'wup')\nconcept.similarity('http://dbpedia.org/ontology/Actor','http://dbpedia.org/ontology/Film', 'li')\nconcept.similarity('http://dbpedia.org/ontology/Actor','http://dbpedia.org/ontology/Film', 'res')\nconcept.similarity('http://dbpedia.org/ontology/Actor','http://dbpedia.org/ontology/Film', 'lin')\nconcept.similarity('http://dbpedia.org/ontology/Actor','http://dbpedia.org/ontology/Film', 'jcn')\nconcept.similarity('http://dbpedia.org/ontology/Actor','http://dbpedia.org/ontology/Film', 'wpath')\n```\n\n### Computing semantic similarity of DBpedia entities.\n\n```python\nfrom sematch.semantic.similarity import EntitySimilarity\nsim = EntitySimilarity()\nsim.similarity('http://dbpedia.org/resource/Madrid','http://dbpedia.org/resource/Barcelona') #0.409923677282\nsim.similarity('http://dbpedia.org/resource/Apple_Inc.','http://dbpedia.org/resource/Steve_Jobs')#0.0904545454545\nsim.relatedness('http://dbpedia.org/resource/Madrid','http://dbpedia.org/resource/Barcelona')#0.457984139871\nsim.relatedness('http://dbpedia.org/resource/Apple_Inc.','http://dbpedia.org/resource/Steve_Jobs')#0.465991132787\n```\n\n### Evaluate semantic similarity metrics with word similarity datasets\n\n```python\nfrom sematch.evaluation import WordSimEvaluation\nfrom sematch.semantic.similarity import WordNetSimilarity\nevaluation = WordSimEvaluation()\nevaluation.dataset_names()\nwns = WordNetSimilarity()\n# define similarity metrics\nwpath = lambda x, y: wns.word_similarity_wpath(x, y, 0.8)\n# evaluate similarity metrics with SimLex dataset\nevaluation.evaluate_metric('wpath', wpath, 'noun_simlex')\n# performa Steiger's Z significance Test\nevaluation.statistical_test('wpath', 'path', 'noun_simlex')\n# define similarity metrics for Spanish words\nwpath_es = lambda x, y: wns.monol_word_similarity(x, y, 'spa', 'path')\n# define cross-lingual similarity metrics for English-Spanish\nwpath_en_es = lambda x, y: wns.crossl_word_similarity(x, y, 'eng', 'spa', 'wpath')\n# evaluate metrics in multilingual word similarity datasets\nevaluation.evaluate_metric('wpath_es', wpath_es, 'rg65_spanish')\nevaluation.evaluate_metric('wpath_en_es', wpath_en_es, 'rg65_EN-ES')\n```\n\n### Evaluate semantic similarity metrics with category classification\n\nAlthough the word similarity correlation measure is the standard way to evaluate the semantic similarity metrics, it relies on human judgements over word pairs which may not have same performance in real applications. Therefore, apart from word similarity evaluation, the Sematch evaluation framework also includes a simple aspect category classification. The task classifies noun concepts such as pasta, noodle, steak, tea into their ontological parent concept FOOD, DRINKS.\n\n```python\nfrom sematch.evaluation import AspectEvaluation\nfrom sematch.application import SimClassifier, SimSVMClassifier\nfrom sematch.semantic.similarity import WordNetSimilarity\n\n# create aspect classification evaluation\nevaluation = AspectEvaluation()\n# load the dataset\nX, y = evaluation.load_dataset()\n# define word similarity function\nwns = WordNetSimilarity()\nword_sim = lambda x, y: wns.word_similarity(x, y)\n# Train and evaluate metrics with unsupervised classification model\nsimclassifier = SimClassifier.train(zip(X,y), word_sim)\nevaluation.evaluate(X,y, simclassifier)\n\nmacro averge: (0.65319812882333839, 0.7101245049198579, 0.66317566364913016, None)\nmicro average: (0.79210167952791644, 0.79210167952791644, 0.79210167952791644, None)\nweighted average: (0.80842645056024054, 0.79210167952791644, 0.79639496616636352, None)\naccuracy: 0.792101679528\n precision recall f1-score support\n\n SERVICE 0.50 0.43 0.46 519\n RESTAURANT 0.81 0.66 0.73 228\n FOOD 0.95 0.87 0.91 2256\n LOCATION 0.26 0.67 0.37 54\n AMBIENCE 0.60 0.70 0.65 597\n DRINKS 0.81 0.93 0.87 752\n\navg / total 0.81 0.79 0.80 4406\n```\n\n### Matching Entities with type using SPARQL queries\n\nYou can use Sematch to download a list of entities having a specific type using different languages. Sematch will generate SPARQL queries and execute them in [DBpedia Sparql Endpoint](http://dbpedia.org/sparql).\n\n```python\nfrom sematch.application import Matcher\nmatcher = Matcher()\n# matching scientist entities from DBpedia\nmatcher.match_type('scientist')\nmatcher.match_type('cient\u00edfico', 'spa')\nmatcher.match_type('\u79d1\u5b66\u5bb6', 'cmn')\nmatcher.match_entity_type('movies with Tom Cruise')\n```\n\nExample of automatically generated SPARQL query.\n\n```sql\nSELECT DISTINCT ?s, ?label, ?abstract WHERE {\n { \n ?s . }\n UNION { \n ?s . }\n UNION { \n ?s . }\n UNION { \n ?s . }\n UNION { \n ?s . } \n ?s . \n ?s ?label . \n FILTER( lang(?label) = \"en\") . \n ?s ?abstract . \n FILTER( lang(?abstract) = \"en\") .\n} LIMIT 5000\n```\n\n### Entity feature extraction with Similarity Graph\n\nApart from semantic matching of entities from DBpedia, you can also use Sematch to extract features of entities and apply semantic similarity analysis using graph-based ranking algorithms. Given a list of objects (concepts, words, entities), Sematch compute their pairwise semantic similarity and generate similarity graph where nodes denote objects and edges denote similarity scores. An example of using similarity graph for extracting important words from an entity description.\n\n```python\nfrom sematch.semantic.graph import SimGraph\nfrom sematch.semantic.similarity import WordNetSimilarity\nfrom sematch.nlp import Extraction, word_process\nfrom sematch.semantic.sparql import EntityFeatures\nfrom collections import Counter\ntom = EntityFeatures().features('http://dbpedia.org/resource/Tom_Cruise')\nwords = Extraction().extract_nouns(tom['abstract'])\nwords = word_process(words)\nwns = WordNetSimilarity()\nword_graph = SimGraph(words, wns.word_similarity)\nword_scores = word_graph.page_rank()\nwords, scores =zip(*Counter(word_scores).most_common(10))\nprint words\n(u'picture', u'action', u'number', u'film', u'post', u'sport', \nu'program', u'men', u'performance', u'motion')\n```\n\n------------------\n\n## Publications\n\n- Ganggao Zhu, and Carlos A. Iglesias. [\"Computing Semantic Similarity of Concepts in Knowledge Graphs.\"](http://ieeexplore.ieee.org/document/7572993/) IEEE Transactions on Knowledge and Data Engineering 29.1 (2017): 72-85.\n\n- Oscar Araque, Ganggao Zhu, Manuel Garcia-Amado and Carlos A. Iglesias [Mining the Opinionated Web: Classification and Detection of Aspect Contexts for Aspect Based Sentiment Analysis](http://sentic.net/sentire2016araque.pdf), ICDM sentire, 2016.\n\n- Ganggao Zhu, and Carlos Angel Iglesias. \"Sematch: Semantic Entity Search from Knowledge Graph.\" SumPre-HSWI@ ESWC. 2015.\n\n\n------------------\n\n## Support\n\nYou can post bug reports and feature requests in [Github issues](https://github.com/gsi-upm/sematch/issues).\nMake sure to read our guidelines first.\nThis project is still under active development approaching to its goals. The project is mainly maintained by Ganggao Zhu. You can contact him via gzhu [at] dit.upm.es\n\n------------------\n\n## Why this name, Sematch and Logo?\n\nThe name of Sematch is composed based on Spanish \"se\" and English \"match\". It is also the abbreviation of semantic matching because semantic similarity metrics helps to determine semantic distance of concepts, words, entities, instead of exact matching.\n\nThe logo of Sematch is based on Chinese [Yin and Yang](http://en.wikipedia.org/wiki/Yin_and_yang) which is written in [I Ching](http://en.wikipedia.org/wiki/I_Ching). Somehow, it correlates to 0 and 1 in computer science.\n\n\n\n\n![GSI Logo](http://vps161.cesvima.upm.es/images/stories/logos/gsi.png)", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/gsi-upm/sematch", "keywords": "semantic similarity,taxonomy,knowledge graph,semantic analysis,knowledge base,WordNet,DBpedia,YAGO,ontology", "license": "Apache 2.0", "maintainer": "", "maintainer_email": "", "name": "sematch", "package_url": "https://pypi.org/project/sematch/", "platform": "", "project_url": "https://pypi.org/project/sematch/", "project_urls": { "Homepage": "https://github.com/gsi-upm/sematch" }, "release_url": "https://pypi.org/project/sematch/1.0.4/", "requires_dist": null, "requires_python": "", "summary": "Semantic similarity framework for knowledge graphs", "version": "1.0.4" }, "last_serial": 5296420, "releases": { "1.0.0": [ { "comment_text": "", "digests": { "md5": "cdbf29d6a8327e5472e6856cacccdf2a", "sha256": "d08942557836c3f8485a956015ddbd2874b198ff8d2ceeb0970badb858e5535d" }, "downloads": -1, "filename": "sematch-1.0.0.tar.gz", "has_sig": false, "md5_digest": "cdbf29d6a8327e5472e6856cacccdf2a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1747367, "upload_time": "2017-02-03T19:02:03", "url": "https://files.pythonhosted.org/packages/8a/7c/488331f463b28eff351efbbd115d1590ac2e859c1478766c9964d4c10415/sematch-1.0.0.tar.gz" } ], "1.0.1": [ { "comment_text": "", "digests": { "md5": "881f728b4bfcd368d5895d1679be4cb5", "sha256": "30e0327046b9ffcd628a4f1764fc92a2788b69290a315f78e8947f0996746300" }, "downloads": -1, "filename": "sematch-1.0.1.tar.gz", "has_sig": false, "md5_digest": "881f728b4bfcd368d5895d1679be4cb5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1748560, "upload_time": "2017-02-03T20:03:28", "url": "https://files.pythonhosted.org/packages/b6/b7/50a2696a62ae9d723463b048d2a02da2e801f7479b9b9f25f5defe58f23e/sematch-1.0.1.tar.gz" } ], "1.0.2": [ { "comment_text": "", "digests": { "md5": "afe7046ae367d1a5d9a1371535fd9ebb", "sha256": "3f77b4679f212513e84a185b747128bfb19e2146db6ae697a3b4b7e29437719c" }, "downloads": -1, "filename": "sematch-1.0.2.tar.gz", "has_sig": false, "md5_digest": "afe7046ae367d1a5d9a1371535fd9ebb", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1748534, "upload_time": "2017-02-07T10:50:24", "url": "https://files.pythonhosted.org/packages/a4/07/636c91916f65b04a9aba9e4437855f85abe60432868049ce3f6b60328d88/sematch-1.0.2.tar.gz" } ], "1.0.3": [ { "comment_text": "", "digests": { "md5": "acadaaf1022e4f2f974931b3e8de0989", "sha256": "c41b3c239282df78878b7a17514f345b838f7cba0c6eef83312eb96c7b4cee08" }, "downloads": -1, "filename": "sematch-1.0.3.tar.gz", "has_sig": false, "md5_digest": "acadaaf1022e4f2f974931b3e8de0989", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 1756735, "upload_time": "2017-02-07T16:35:07", "url": "https://files.pythonhosted.org/packages/88/78/e57179aa93a348b76254dc85e739bde1e5ec863d09351162d70794795444/sematch-1.0.3.tar.gz" } ], "1.0.4": [ { "comment_text": "", "digests": { "md5": "31f3756eb40399de72b41136aa7d899d", "sha256": "5bb69c77d91f2fb960245ce5b8df26aabf32f31d0b25384881906f69a44ff44d" }, "downloads": -1, "filename": "sematch-1.0.4.tar.gz", "has_sig": false, "md5_digest": "31f3756eb40399de72b41136aa7d899d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9017983, "upload_time": "2017-04-17T10:56:52", "url": "https://files.pythonhosted.org/packages/f4/1a/09377bdde1fcf4ede770c631e50199511a07921cf11dc66d3a83f2514277/sematch-1.0.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "31f3756eb40399de72b41136aa7d899d", "sha256": "5bb69c77d91f2fb960245ce5b8df26aabf32f31d0b25384881906f69a44ff44d" }, "downloads": -1, "filename": "sematch-1.0.4.tar.gz", "has_sig": false, "md5_digest": "31f3756eb40399de72b41136aa7d899d", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 9017983, "upload_time": "2017-04-17T10:56:52", "url": "https://files.pythonhosted.org/packages/f4/1a/09377bdde1fcf4ede770c631e50199511a07921cf11dc66d3a83f2514277/sematch-1.0.4.tar.gz" } ] }