{ "info": { "author": "Tahir Sousa", "author_email": "tahirsousa@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: Public Domain", "Natural Language :: English", "Programming Language :: Python :: 2.7", "Topic :: Scientific/Engineering" ], "description": "# SimilarityCalculator\n\n*Uses parallel processing and efficient C implementations to quickly compute similarities between a list of items; and \nprints the top neighbors (most similar items) of each item to file.* \n\n##### Travis Continuous Integration Testing status \n\n[![Build Status](https://travis-ci.org/SMLtahir/SimilarityCalculator.svg?branch=master)](https://travis-ci.org/SMLtahir/SimilarityCalculator)\n\nFor maximum assurance against bugs, Travis CI testing is linked to the source Github repository. Travis runs several \nunit tests on the latest code pushed to Github and displays a badge indicating **Build_Passing** or \n**Build_Failing**. Since this project is currently under continuous development, it is suggested however to use only \nstable builds released as code versions.\n\n##### Compatible with Python versions\n\nSimilarityCalculator currently runs on Python versions: *2.6*, *2.7*, *pypy*\n\n#### How to run:\n\nFirst, set **PYTHONPATH** environment variable to your Python bin directory\n\n**Runnable modules**:\n\n- load_neighbors.py\n- test/unit_tests_load_neighbors.py\n\n###### Run from the command line\n\nLOAD_NEIGHBORS:\n\n```\n\tpython load_neighbors.py\n```\n\nUNIT_TESTS:\n\n```\n python test/unit_tests_load_neighbors.py \n```\n\nTo get information about any runnable module:\n\n* python -h\n* python --help\n\nThis will tell you how to run the program along with its available command line parameters.\n\n###### Run from external module (as a function)\n**SimilarityCalculator** can also be run as a function from an external module\nThe sample file **runAsFunction.py** located in SimilarityCalculator/data/samples/ should be followed as a model for \nrunning the tool from an external module. \n\n##### Parallel execution\n\nThe SimilarityCalculator has been designed to run on maximum available CPUs and using several C implementations \nover native Python. Due to this, it is extremely fast and efficient. In case the user wants less CPUs to be used during \nsimilarity and nearest neighbor computation, the **NUMBER_OF_CPU** configurational parameter should be set appropriately.\nIts default value is **MAX**.\n\n \n##### Similarity measures currently supported\n\nCurrently, Similarity between items is calculated using the Cosine Similarity Measure. We are currently working to \nextend the code to include more measures. \n\n##### Configuration Parameters\n\nThere are 3 json files in the SimilarityCalculator/config/ directory: \n1. **config.json** - This is the most general config file. Store parameters whose values you do not plan to change \nvery often, here. These usually include input/output file paths, etc.\n2. **config.test.json** - This file should be used for test parameters or parameters whose values you plan to change \nvery frequently. This can include number of CPUs, neighborhood size, etc. \n3. **config.local.json** - This file is not written to the Github repository and will have to be created when \nthe code is first used. This is used to store local and private parameters and their values. **DO NOT** store these \nparameters in any of the other config files.\n\nIn case the same configuration parameter is set in two or more files than the following is the order of assignment.\nconfig.local.json > config.test.json > config.json\nThis means that if a parameter is set in both, local as well as test config files, then the value of the \nparameter in config.test will be overwritten by config.local and so on. \n\nThe following parameters are to be modified as desired in these files.\n\n1. **NUMBER_OF_CPU**\n\n *Example entry*:\n \n \"NUMBER_OF_CPU\": 2\n \n *Default entry*: \n \n \"NUMBER_OF_CPU\": \"MAX\"\n \n \n This is used to indicate the number of CPUs to be used in parallel by the program. If not changed by the user, \n the default is \"MAX\" which means that maximum available CPUs will be used for computation. Use this option for \n large datasets and when speedy computation is required. Minimum value for this parameter is 1.\n \n2. **FILE_RELEVANCE_PREDICTIONS**\n\n *Example entry*:\n \n \"FILE_RELEVANCE_PREDICTIONS\": \"correct/path/inserted/here/sample_file.csv\"\n \n *Default entry*: \n \n \"FILE_RELEVANCE_PREDICTIONS\": \"data/relevance_predictions.txt\"\n\n This is the path to the main input file. It is suggested to store all data files in the SimilarityCalculator/data/ \n directory. The input file can be a .txt, .csv (comma separated) or .tsv (tab separated) file. \n It contains 3 values: \n \n item1_id,item2_id,relevance_score\n \n Please reserve the first line of the file as the Header line for better readability.\n The comma (,) can be replaced by any **INPUT_FIELD_SEPARATOR** as long as it is declared in the \n **INPUT_FIELD_SEPARATOR** configuration parameter (see below). \n \n Item1 is an entity of a set on which you need to perform the nearest-neighbor analysis. Item2 (henceforth called \n **tag**) is an entity that is used as a metric to compare two entities of the item1 set. \n \n * Example 1:\n \n MovieId,Genre,RelevanceScore\n 1,\"Action\",0.50\n 1,\"Comedy\",0.75\n 2,\"Action\",0.67\n 2,\"Comedy\",0.30\n 3,\"Action\",0.98\n 3,\"Comedy\",0.10\n \n The example above can be used to check how similar two movies are based on how highly they can be classified to the \n different genres. Normalization of scores is done within the program and so this should not be a worry at the time \n of preparing the needed input file.\n \n * Example 2:\n \n UserId,MovieId,Rating\n 1,96,4.5\n 1,54,2.0\n 2,96,5.0\n 2,54,1.0\n 3,96,3.0\n 3,54,4.0\n \n The example above can be used to find how closely matched the tastes of different users are, on the basis of the \n ratings they assign to movies.\n \n Fully working examples (input + output files) can be found in the SimilarityCalculator/data/samples/ directory.\n \n This code has been tested successfully on data files containing up to 2.5 million lines. \n\n3. **INPUT_FIELD_SEPARATOR**\n\n *Example entry* (tab-space):\n \n \"INPUT_FIELD_SEPARATOR\": \"\\t\"\n \n *Default entry* (comma): \n \n \"INPUT_FIELD_SEPARATOR\": \",\"\n \n This is the token (space, tab-space, comma, colon, semi-colon, etc.) that is used to separate two fields of the \n input files. Examples can be seen above. \n \n4. **TAG_WEIGHTED**\n \n *Valid entries*:\n \n \"TAG_WEIGHTED\": \"T\"\n \"TAG_WEIGHTED\": \"F\"\n \n *Default entry*: \n \n \"TAG_WEIGHTED\": \"F\"\n \n This configuration parameter can take only two values - \"T\" (True) and \"F\" (False). By default its value is F which \n means that tags will be all equally weighted (= 1.0). Assign a value of \"T\" when some tags (as described above) need \n to be weighted differently than others. This could happen if you want some tags to play a bigger role in determining \n the nearest neighbors than others. \n \n5. **FILE_TAG_WEIGHTS**\n \n *Example entry*:\n \n \"FILE_TAG_WEIGHTS\": \"correct/path/inserted/here/sample_tag_file.csv\"\n \n *Default entry*: \n \n \"FILE_TAG_WEIGHTS\": \"data/tagWeights.txt\" \n\n This is an optional input file. The value of this parameter will be considered only when the parameter \n **TAG_WEIGHTED** is set to \"T\" (True). The format of this file is as follows:\n \n tag,tag_weight\n \n Please reserve the first line of the file as the Header line for better readability.\n The comma (,) can be replaced by any **INPUT_FIELD_SEPARATOR** as long as it is declared in the \n **INPUT_FIELD_SEPARATOR** configuration parameter (see above). The below is a tab-delimited example. \n\n *Example*:\n \n tag weight\n \"tagA\" 2.3\n \"tagB\" 1.0\n \"tagC\" -2.0\n \n **Please note**: If this file is included, **ALL** tags must be assigned weights.\n\n6. **FILE_NEIGHBORS**\n\n *Example entry*:\n \n \"FILE_NEIGHBORS\": \"correct/path/inserted/here/sample_file.txt\"\n \n *Default entry*: \n \n \"FILE_NEIGHBORS\": \"data/neighbors.txt\"\n \n This configuration parameter tells the program where to store the final output file. It is advised to store this in\n the SimilarityCalculator/data/ directory as is done by default. The format of the file produced will be as below:\n \n itemId,neighbor1,Similarity_Score\n \n *Example*:\n \n 1 1 4.7657\n 1 2 4.6790\n 1 4 4.4423\n 1 5 4.4208\n 1 3 2.8345\n 2 2 4.7657\n 2 1 4.6790\n 2 4 4.5840\n 2 5 4.5061\n 2 3 2.9758\n \n As can be seen in the example above, the neighbors of a particular itemId are printed in decreasing order of \n similarity score. This means that the nearest neighbors will be displayed on top. \n \n7. **NEIGHBORHOOD_SIZE** \n\n *Example entry*:\n \n \"NEIGHBORHOOD_SIZE\": 50\n \n *Default entry*: \n \n \"NEIGHBORHOOD_SIZE\": 250\n \n This determines the number of top neighbors that you would want the program to calculate per item. If this value is \n set higher than the total number of items, all items will be printed along with their similarity scores as \n neighbors for every item. \n \n8. **LOG_NAME**\n\n *Example entry*:\n \n \"LOG_NAME\": \"correct/path/inserted/here/sample_file.txt\"\n \n *Default entry*: \n \n \"LOG_NAME\": \"logs/load_neighbors.txt\" \n \n This tells the program where to store the runtime logs of the program. It is advised to store this in the \n SimilarityCalculator/logs/ directory as is done by default.\n \n9. **ITEM1_COLUMN_NO**\n\n *Example entry*:\n \n \"ITEM1_COLUMN_NO\": 4\n \n *Default entry*: \n \n \"ITEM1_COLUMN_NO\": 0\n \n This parameter tells the program in which column of the .csv/ .tsv *RELEVANCE_PREDICTIONS* file it can find \n entities belonging to the itemId1 set (as defined earlier). It is particularly useful when the input file is a \n multi-column one not prepared specially as input for the SimilarityCalculator program.\n \n *Example*:\n \n MovieId,Budget,BoxOfficeSales,Genre,relevanceScore(GenreToMovieId)\n 1,87,76,\"Action\",0.50\n 1,87,76,\"Comedy\",0.75\n 2,45,30,\"Action\",0.67\n 2,45,30,\"Comedy\",0.30\n 3,95,110,\"Action\",0.98\n 3,95,110,\"Comedy\",0.10\n \n In this case, if we want to find the top-nearest neighbors of the different movies in our list, we would set\n \n \"ITEM1_COLUMN_NO\": 0\n \"ITEM2_COLUMN_NO\": 3\n \"RELEVANCE_SCORE_COLUMN_NO\": 4\n \n **Note**: Column numbering here starts from 0, and not from 1.\n For the description of the two other parameters shown in the example above, see their respective sections below.\n \n10. **ITEM2_COLUMN_NO**\n\n *Example entry*:\n \n \"ITEM2_COLUMN_NO\": 4\n \n *Default entry*: \n \n \"ITEM2_COLUMN_NO\": 1\n \n This parameter tells the program in which column of the .csv/ .tsv *RELEVANCE_PREDICTIONS* file it can find \n entities belonging to the itemId2 set (as defined earlier). It is particularly useful when the input file is a \n multi-column one not prepared specially as input for the SimilarityCalculator program.\n \n Please refer to the example above to see the usage of this parameter.\n \n11. **RELEVANCE_SCORE_COLUMN_NO**\n\n *Example entry*:\n \n \"RELEVANCE_SCORE_COLUMN_NO\": 1\n \n *Default entry*: \n \n \"RELEVANCE_SCORE_COLUMN_NO\": 2\n \n This parameter tells the program in which column of the .csv/ .tsv *RELEVANCE_PREDICTIONS* file it can find \n entities belonging to the relevance_score set (as defined earlier). It is particularly useful when the input file \n is a multi-column one not prepared specially as input for the SimilarityCalculator program.\n \n Please refer to the example above to see the usage of this parameter.", "description_content_type": null, "docs_url": null, "download_url": "https://github.com/SMLtahir/SimilarityCalculator/tarball/0.1", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/SMLtahir/SimilarityCalculator", "keywords": "similarity,neighbor,neighborhood", "license": "GNU General Public License", "maintainer": null, "maintainer_email": null, "name": "SimilarityCalculator", "package_url": "https://pypi.org/project/SimilarityCalculator/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/SimilarityCalculator/", "project_urls": { "Download": "https://github.com/SMLtahir/SimilarityCalculator/tarball/0.1", "Homepage": "https://github.com/SMLtahir/SimilarityCalculator" }, "release_url": "https://pypi.org/project/SimilarityCalculator/0.1/", "requires_dist": null, "requires_python": null, "summary": "Python package that computes similarity between two items based on their tag relevance scores and writes to file the list of items along with their corresponding top item neighbors", "version": "0.1" }, "last_serial": 1739209, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "5f6979e7ba1b88e0639f186cde57ff52", "sha256": "e6747633f332bc68b4b1720078155909a19e204d9877d18d6805cd1feb62d785" }, "downloads": -1, "filename": "SimilarityCalculator-0.1.tar.gz", "has_sig": false, "md5_digest": "5f6979e7ba1b88e0639f186cde57ff52", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7434, "upload_time": "2015-09-25T22:10:08", "url": "https://files.pythonhosted.org/packages/ef/d4/843c79b15ff6c9b504a7ef23f38ea0dd5f5e871854ee0f18ab4b35c0467a/SimilarityCalculator-0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "5f6979e7ba1b88e0639f186cde57ff52", "sha256": "e6747633f332bc68b4b1720078155909a19e204d9877d18d6805cd1feb62d785" }, "downloads": -1, "filename": "SimilarityCalculator-0.1.tar.gz", "has_sig": false, "md5_digest": "5f6979e7ba1b88e0639f186cde57ff52", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7434, "upload_time": "2015-09-25T22:10:08", "url": "https://files.pythonhosted.org/packages/ef/d4/843c79b15ff6c9b504a7ef23f38ea0dd5f5e871854ee0f18ab4b35c0467a/SimilarityCalculator-0.1.tar.gz" } ] }