{ "info": { "author": "Andrew Dalke", "author_email": "dalke@dalkescientific.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Science/Research", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: C", "Programming Language :: Python :: 2 :: Only", "Topic :: Scientific/Engineering :: Chemistry", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "SMILEZ - compression for SMILES strings\n=======================================\n\nSMILEZ home page: https://bitbucket.org/dalke/smilez\n\nSmilez is a simple compression library for SMILES strings. It's\nclosely based on the SMAZ compression library for short strings,\navailable from https://github.com/antirez/smaz .\n\nSmilez is not a general purpose compression algorithm. It can compress\nSMILES strings by about 50-60%, including many SMILES strings which\nare only a few bytes long. In the best case it can compress up to 75%,\nbut if you give it non-SMILES data the result might be 50% larger.\n\nYou can use SMILEZ through the C API or the Python API.\n\n\nCOMPRESSION EXAMPLES\n====================\n\nThe SMILEZ code dictionaries were trained on a mixture of PubChem and\nChEMBL data. Compression performance will depend on how closely the\nSMILES strings match that training set. 
In general it does best with\n\"normal\" SMILES, and poorly with uncommon features like reaction maps\nand isotopes.\n\n\n from   to\n size size SMILES\n ---- ---- ------------------------------------------\n    1    1 C\n    2    1 CC\n    3    2 CCC\n    4    1 CCCC\n    7    4 F/C=C/F\n    6    8 [NH4+]\n    9    4 c1ccccc1O\n   26   13 Cn1cnc2c1c(=O)n(c(=O)n2C)C\n   28   10 CN1C=NC2=C1C(=O)N(C(=O)N2C)C\n   42   17 COc1ccc2nc([nH]c2c1)S(=O)Cc1ncc(C)c(OC)c1C\n   43   46 [CH2:1]=[CH:2][CH2:1][CH2:3][C:4](C)[CH2:3]\n   60   30 C[N+]1(C2CC(CC1C3C2O3)OC(=O)C(C4=CC=CS4)(C5=CC=CS5)O)C.[Br-]\n\n\nChEMBL Example\n==============\n\nI have a SMILES data set derived from ChEMBL 16 with 1,292,344\nrecords.\n\n uncompressed: 66,896,248 bytes\n   compressed: 25,455,759 bytes\n\n => 60% smaller\n\nI timed the process using the Python API. It took about 2.6 seconds to\ncompress the strings and 1.4 seconds to decompress them.\n\nBy comparison, it takes zlib about 51 seconds to compress the same\ndata set. The result takes up 61,406,214 bytes, which is only 8%\nsmaller than the original. (This is why you don't use zlib on small\nstrings.)\n\n\nI double-checked using the 24,323 SMILES from the NCI data set. The\ncompressed SMILES are 58% smaller than the original SMILES (1.21 MB to\n0.51 MB).\n\nC API\n=====\n\n int smilez_compress(const char *in, int inlen, char *out, int outlen,\n                     int dictionary)\n\nCompress the SMILES string in 'in' of length 'inlen' and put the\ncompressed data into 'out' of maximum length 'outlen' bytes. If the\noutput buffer is too short to hold the whole compressed string,\noutlen+1 is returned. Otherwise, the length of the compressed string\n(less than or equal to outlen) is returned.\n\nThe 'dictionary' option specifies the encoding dictionary. If\nSMILEZ_BYTE_DICTIONARY (=0) then the compressed string may use the\nbytes 0-255. If SMILEZ_WHITESPACE_DICTIONARY (=1) then the compressed\nstrings will not use the bytes \"\\n\", \"\\r\", \"\\t\", or \" \". 
This lets you\nuse a SMILEZ string as a field in a tab or space separated file, or in\nan SD tag, for a very small size penalty.\n\nThe dictionary only affects the encoder. The decoder handles both\ndictionaries automatically. If the dictionary value is out of range then\nsmilez_compress returns 0.\n\n\n int smilez_decompress(char *in, int inlen, char *out, int outlen);\n\nDecompress the buffer 'in' of length 'inlen' and put the decompressed data into\n'out' of max length 'outlen' bytes. If the output buffer is too short to hold\nthe whole decompressed string, outlen+1 is returned. Otherwise the length of the\ndecompressed string (less than or equal to outlen) is returned. This function will\nnot automatically put a nul terminator at the end of the string if the original\ncompressed string didn't include one.\n\n\nThe following macros are nearly self-explanatory:\n\n #define SMILEZ_VERSION \"1.0\"\n #define SMILEZ_MAJOR_VERSION 1\n #define SMILEZ_MINOR_VERSION 0\n\n #define SMILEZ_COMPRESSION_VERSION 1\n\n #define SMILEZ_BYTE_DICTIONARY 0\n #define SMILEZ_WHITESPACE_DICTIONARY 1\n #define SMILEZ_NUM_DICTIONARIES 2\n\nThe SMILEZ_COMPRESSION_VERSION will change only when the compression\nformat changes.\n\nThe SMILEZ_VERSION and SMILEZ_COMPRESSION_VERSION values are also\navailable through function calls:\n\n const char *smilez_get_version(void);\n int smilez_get_compression_version(void);\n\n\nPython API\n==========\n\nUse \"python setup.py install\" to build and install the Python module\n\"smilez\".\n\nThe module defines a few constants, based on smilez_get_version() and\nsmilez_get_compression_version():\n\n >>> import smilez\n >>> smilez.__version__\n '1.0'\n >>> smilez.compression_version\n 1\n\nas well as constants for the two dictionaries:\n\n >>> smilez.BYTE_DICTIONARY\n 0\n >>> smilez.WHITESPACE_DICTIONARY\n 1\n\nHere's an example of how to compress and decompress phenol:\n\n >>> smilez.compress(\"c1ccccc1O\")\n 'I\\xda\\xc9\\xb9'\n\n >>> 
smilez.decompress('I\\xda\\xc9\\xb9')\n 'c1ccccc1O'\n\nand an example of compressing with the whitespace dictionary:\n\n >>> smilez.compress(\"P\", smilez.WHITESPACE_DICTIONARY)\n '\\xfeP'\n >>> smilez.compress(\"P\")\n ' '\n\nCommand-line programs\n=====================\n\nRunning \"make\" will make three demonstration or test programs:\n\nsmilezip\n--------\n\nConvert a SMILES file into a SMILEZ file.\n\nA SMILES file contains one record per line. Each line contains space\nor tab delimited fields where the SMILES is the first field. A SMILEZ\nfile is the same format, after compressing the first field.\n\n % ./smilezip nci_09425001_09450000.smi > nci.smiz\n % ls -l nci_09425001_09450000.smi nci.smiz\n -rw-r--r-- 1 dalke staff 1433190 Jun 8 2008 nci_09425001_09450000.smi\n -rw-r--r-- 1 dalke staff 724140 Jul 2 02:52 nci.smiz\n\nNOTE! There is no real reason to have a SMILEZ file. The gzip'ed\nversion of the same file is only 187483 bytes, or almost 1/4th the\nsize of the SMILEZ file. This is a demonstration of how one might use\nthe whitespace dictionary.\n\n\nsmilezcat\n---------\n\nConvert a SMILEZ file into a SMILES file:\n\n % ./smilezcat nci.smiz > nci.smi\n % cmp nci_09425001_09450000.smi nci.smi\n\n\n\nsmilez_test\n-----------\n\nThis runs a set of self-tests on the C API.\n\n\n\n\nCREDITS\n=======\n\nSmaz was written by Salvatore Sanfilippo and is released under the BSD\nlicense. 
Check the COPYING file for more information.\n\nSmilez was written by Andrew Dalke and is released under the same BSD\nlicense.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://bitbucket.org/dalke/smilez", "keywords": "cheminformatics SMILES compression", "license": "BSD", "maintainer": null, "maintainer_email": null, "name": "smilez", "package_url": "https://pypi.org/project/smilez/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/smilez/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://bitbucket.org/dalke/smilez" }, "release_url": "https://pypi.org/project/smilez/1.0/", "requires_dist": null, "requires_python": null, "summary": "A compressor and decompressor library for cheminformatics SMILES strings", "version": "1.0" }, "last_serial": 1144003, "releases": { "1.0": [] }, "urls": [] }