{ "info": { "author": "Chaitanya Sharma", "author_email": "chaitanya@3bandar.org", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Environment :: Console", "Intended Audience :: Developers", "Programming Language :: Python", "Programming Language :: Python :: 2", "Topic :: Software Development :: Libraries" ], "description": "Welcome to the Google all pairs similarity search package!\n\nThis package provides a bare-bones implementation of the\n\"All-Pairs-Binary\" algorithm described in the following paper:\n\nR. J. Bayardo, Yiming Ma, Ramakrishnan Srikant. Scaling Up All-Pairs\nSimilarity Search. In Proc. of the 16th Int'l Conf. on World Wide Web,\n131-140, 2007. (download from: http://www.bayardo.org/ps/www2007.pdf)\n\n===============================================================================\nBUILDING\n\n*** For Linux & similar platforms ***\n\nSimply typing \"make\" in the directory containing this file will build\nthe executable. This code depends on the google-sparsehash package\n(http://code.google.com/p/google-sparsehash/). If the build fails, you\nmay have to update the CFLAGS within the Makefile to specify the\nlocation of the sparsehash header files.\n\nIf you'd rather not use the google sparsehash library, it is\nstraightforward to modify the algorithm to use STL hash_map instead.\n\n*** For Win32 / Microsoft VC++ v9 & a command shell ***\n\nNOTE: The build under windows does not exploit (and hence it does not\nrequire) the Google sparsehash library. Instead it uses the regular\nSTL hash_map, which isn't quite as fast but seems to work well\nenough. To build:\n\nC:\\google-all-pairs-similarity-search>\"%VS90COMNTOOLS%vsvars32.bat\"\n\nC:\\google-all-pairs-similarity-search>nmake -f Makefile.w32\n\n===============================================================================\nDATASET FORMAT\n\nThe dataset format expected by the algorithm is \"apriori binary.\" In\nan apriori binary encoded dataset, each vector has the following\nformat where each component is encoded as a raw 4-byte integer:\n\n ... \n\n(Endianness of the integers should match that of your platform,\ne.g. little-endian for Intel x86 architectures.)\n\nRecord ids can be arbitrary integers. Feature ids should be assigned\nsuch that feature id \"i\" corresponds to the \"ith\" least frequently\noccuring feature in the dataset. For example, the feature with id\nwhose integer value is \"1\" should be the least frequently occurring\nvalue in the dataset. Feature ids within a vector should then appear\nin increasing order of their id.\n\nRecords in the dataset should appear in order of increasing vector\nsize (a vector's size is its number of features.)\n\nIt is easy to extend the algorithm to read CSV formatted data if\ndesired.\n\nYou can download the apriori-binary little-endian encoded dblp dataset\nfrom here: http://www.bayardo.org/bin/dblp_le.bin.gz\n\n===============================================================================\nRUNNING THE ALGORITHM\n\nTo run the algorithm under Linux/Unix:\n\n./ap \n\nUnder Windows:\n\nap \n\nFor example, to mine all pairs of vectors with .9 or higher cosine\nsimilarity from the dblp dataset on a Linux-type system, you might\ntype:\n\n[~/google-all-pairs-similarity-search]: /ap .9 dblp_le.bin > some_file.txt\n; User specified similarity threshold: 0.9\n; Found 35022 similar pairs.\n; Candidates considered: 9274991\n; Vector intersections performed: 2063790\n; Total running time: 5 seconds\n[~/google-all-pairs-similarity-search]:\n\nThe output of the algorithm is text format. Each line will contain a\npair of vector ids that were found to be similar, followed by the\nactual cosine similarity score.\n\nNOTE: The algorithm is currently configured to use no more than ~1GB\nof main memory. The algorithm will enter an \"out of core\" mode should\nthe dataset require more than this amount of RAM to process in a\nsingle pass. The constant bounding RAM usage can be changed in main.cc\nas desired.\n", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/phoenix24/google-all-pairs-similarity-search", "keywords": null, "license": "Apache License 2.0", "maintainer": null, "maintainer_email": null, "name": "pyallpairs", "package_url": "https://pypi.org/project/pyallpairs/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/pyallpairs/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/phoenix24/google-all-pairs-similarity-search" }, "release_url": "https://pypi.org/project/pyallpairs/0.1.0/", "requires_dist": null, "requires_python": null, "summary": "Python bindings for the Google All Pairs Similarity Search", "version": "0.1.0" }, "last_serial": 1439094, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "8038fe01ae92168954100e87ddbc04eb", "sha256": "c13599f2a115cee2d5b43ad14eee2bf17bebe2daf5f10a5f936c0f1b4d47169d" }, "downloads": -1, "filename": "pyallpairs-0.1.0.tar.gz", "has_sig": false, "md5_digest": "8038fe01ae92168954100e87ddbc04eb", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11166, "upload_time": "2015-02-26T12:48:24", "url": "https://files.pythonhosted.org/packages/ce/4e/65925066021747aa13846629c8e6053e7dbb0d608786864686a9d8a35d14/pyallpairs-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "8038fe01ae92168954100e87ddbc04eb", "sha256": "c13599f2a115cee2d5b43ad14eee2bf17bebe2daf5f10a5f936c0f1b4d47169d" }, "downloads": -1, "filename": "pyallpairs-0.1.0.tar.gz", "has_sig": false, "md5_digest": "8038fe01ae92168954100e87ddbc04eb", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 11166, "upload_time": "2015-02-26T12:48:24", "url": "https://files.pythonhosted.org/packages/ce/4e/65925066021747aa13846629c8e6053e7dbb0d608786864686a9d8a35d14/pyallpairs-0.1.0.tar.gz" } ] }