{
    "info": {
        "author": "Chitta Ranjan",
        "author_email": "cran2367@gmail.com",
        "bugtrack_url": null,
        "classifiers": [
            "License :: OSI Approved :: MIT License",
            "Operating System :: OS Independent",
            "Programming Language :: Python :: 3"
        ],
        "description": "# Sequence Graph Transform (SGT)\n\n#### Maintained by: Chitta Ranjan, PhD (cran2367@gmail.com).\n\n\nThis is open source code repository for SGT. Sequence Graph Transform extracts the short- and long-term sequence features and embeds them in a finite-dimensional feature space. Importantly, SGT has low computation and can extract any amount of short- to long-term patterns without any increase in the computation. These properties are proved theoretically and demonstrated on real data in this paper: https://arxiv.org/abs/1608.03533.\n\nIf using this code or dataset, please cite the following:\n\n[1] Ranjan, Chitta, Samaneh Ebrahimi, and Kamran Paynabar. \"Sequence Graph Transform (SGT): A Feature Extraction Function for Sequence Data Mining.\" arXiv preprint arXiv:1608.03533 (2016).\n\n@article{ranjan2016sequence,\n  title={Sequence Graph Transform (SGT): A Feature Extraction Function for Sequence Data Mining},\n  author={Ranjan, Chitta and Ebrahimi, Samaneh and Paynabar, Kamran},\n  journal={arXiv preprint arXiv:1608.03533},\n  year={2016}\n}\n\n## Quick validation of your code\nApply the algorithm on a sequence `BBACACAABA`. The parts of SGT, W<sup>(0)</sup> and W<sup>(\\kappa)</sup>, in Algorithm 1 & 2 in [1], and the resulting SGT estimate will be (line-by-line execution of `main.R`):\n\n```\nalphabet_set <- c(\"A\", \"B\", \"C\")  # Denoted by variable V in [1]\nseq          <- \"BBACACAABA\"\n\nkappa <- 5\n###### Algorithm 1 ######\nsgt_parts_alg1 <- f_sgt_parts(sequence = seq, kappa = kappa, alphabet_set_size = length(alphabet_set))\nprint(sgt_parts_alg1)\n```\n\n*Result*\n```\n$W0\n   A B C\nA 10 4 3\nB 11 3 4\nC  7 2 1\n\n$W_kappa\n            A            B            C\nA 0.006874761 6.783349e-03 1.347620e-02\nB 0.013521602 6.737947e-03 4.570791e-05\nC 0.013521604 3.059162e-07 4.539993e-05\n```\n\n```\nsgt <- f_SGT(W_kappa = sgt_parts_alg1$W_kappa, W0 = sgt_parts_alg1$W0, \n             Len = sgt_parts_alg1$Len, kappa = kappa)  # Set Len = NULL for length-sensitive SGT.\nprint(sgt)\n```\n\n*Result*\n```\n          A          B         C\nA 0.3693614 0.44246287 0.5376371\nB 0.4148844 0.46803816 0.1627745\nC 0.4541361 0.06869332 0.2144920\n\n```\n\nSimilarly, the execution for Algorithm-2 is shown in `main.R`.\n\n## Illustration and use of the code\nOpen file `main.R` and execute line-by-line to understand the process. In this sample execution, we present SGT estimation from either of the two algorithms presented in [1]. The first part is for understanding the SGT computation process.\n\nIn the next part we demonstrate sequence clustering using SGT on a synthesized sample dataset. The sequence lengths in the dataset ranges between (45, 711) with a uniform distribution (hence, average length is ~365). Similar sequences in the dataset has some similar patterns, in turn common substrings. These common substrings can be of any length. Also, the order of the instances of these substrings is arbitrary and random in different sequences. For example, the following two sequences have common patterns. One common subtring in both is `QZTA` which is present arbitrarily in both sequences. The two sequences have other common substrings as well. Other than these commonlities there are significant amount of noise present in the sequences. On average, about 40% of the letters in all sequences in the dataset are noise.\n\n```\nAKQZTAEEYTDZUXXIRZSTAYFUIXCPDZUXMCSMEMVDVGMTDRDDEJWNDGDPSVPKJHKQBRKMXHHNLUBXBMHISQ\nWEHGXGDDCADPVKESYQXGRLRZSTAYFUOQZTAWTBRKMXHHNWYRYBRKMXHHNPRNRBRKMXHHNPBMHIUSVXBMHI\nWXQRZSTAYFUCWRZSTAYFUJEJDZUXPUEMVDVGMTOHUDZUXLOQSKESYQXGRCTLBRKMXHHNNJZDZUXTFWZKES\nYQXGRUATSNDGDPWEBNIQZMBNIQKESYQXGRSZTTPTZWRMEMVDVGMTAPBNIRPSADZUXJTEDESOKPTLJEMZTD\nLUIPSMZTDLUIWYDELISBRKMXHHNMADEDXKESYQXGRWEFRZSTAYFUDNDGDPKYEKPTSXMKNDGDPUTIQJHKSD\nZUXVMZTDLUINFNDGDPMQZTAPPKBMHIUQIUBMHIEKKJHK\n```\n\n```\nSDBRKMXHHNRATBMHIYDZUXMTRMZTDLUIEKDEIBQZTAZOAMZTDLUILHGXGDDCAZEXJHKTDOOHGXGDDCAKZH\nNEMVDVGMTIHZXDEROEQDEGZPPTDBCLBMHIJMMKESYQXGRGDPTNBRKMXHHNGCBYNDGDPKMWKBMHIDQDZUXI\nHKVBMHINQZTAHBRKMXHHNIRBRKMXHHNDISDZUXWBOYEMVDVGMTNTAQZTA\n```\n\nIdentifying similar sequences with good accuracy, and also low false positives (calling sequences similar when they are not) is difficult in such situations due to, \n\n1. _Different lengths of the sequences_: due to different lengths figuring out that two sequences have same inherent pattern is not straightforward. Normalizing the pattern features by the sequence length is a non-trivial problem.\n\n2. _Commonalities are not in order_: as shown in the above example sequences, the common substrings are anywhere. This makes methods such as alignment-based approaches infeasible.\n\n3. _Significant amount of noise_: a good amount of noise is a nemesis to most sequence similarity algorithms. It often results into high false positives.\n\n### SGT Clustering\n\nThe dataset here is a good example for the above challenges. We run clustering on the dataset in `main.R`. The sequences in the dataset are from 5 (=K) clusters. We use this ground truth about the number of clusters as input to our execution below. Although, in reality, the true number of clusters is unknown for a data, here we are demonstrating the SGT implementation. Regardless, using the _random search procedure_ discussed in Sec.SGT-ALGORITHM in [1], we could find the number of clusters as equal to 5. For simplicity it has been kept out of this demonstration.\n\n> Other state-of-the-art sequence clustering methods had significantly poorer performance even with the number of true clusters (K=5). HMM had good performance but significantly higher computation time.\n\n\n```\n## The dataset contains all roman letters, A-Z.\ndataset <- read.csv(\"dataset.csv\", header = T, stringsAsFactors = F)\n\nsgt_parts_sequences_in_dataset <- f_SGT_for_each_sequence_in_dataset(sequence_dataset = dataset, \n                                                                     kappa = 5, alphabet_set = LETTERS, \n                                                                     spp = NULL, sgt_using_alphabet_positions = T)\n\n\ninput_data <- f_create_input_kmeans(all_seq_sgt_parts = sgt_parts_sequences_in_dataset, \n                                    length_normalize = T, \n                                    alphabet_set_size = 26, \n                                    kappa = 5, trace = TRUE, \n                                    inv.powered = T)\nK = 5\nclustering_output <- f_kmeans(input_data = input_data, K = K, alphabet_set_size = 26, trace = T)\n\ncc <- f_clustering_accuracy(actual = c(strtoi(dataset[,1])), pred = c(clustering_output$class), K = K, type = \"f1\")\nprint(cc)\n```\n*Result*\n```\n$cc\nConfusion Matrix and Statistics\n\n          Reference\nPrediction  a  b  c  d  e\n         a 50  0  0  0  0\n         b  0 66  0  0  0\n         c  0  0 60  0  0\n         d  0  0  0 55  0\n         e  0  0  0  0 68\n\nOverall Statistics\n\n               Accuracy : 1          \n                 95% CI : (0.9877, 1)\n    No Information Rate : 0.2274     \n    P-Value [Acc > NIR] : < 2.2e-16  \n\n                  Kappa : 1          \n Mcnemar's Test P-Value : NA         \n\nStatistics by Class:\n\n                     Class: a Class: b Class: c Class: d Class: e\nSensitivity            1.0000   1.0000   1.0000   1.0000   1.0000\nSpecificity            1.0000   1.0000   1.0000   1.0000   1.0000\nPos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000\nNeg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000\nPrevalence             0.1672   0.2207   0.2007   0.1839   0.2274\nDetection Rate         0.1672   0.2207   0.2007   0.1839   0.2274\nDetection Prevalence   0.1672   0.2207   0.2007   0.1839   0.2274\nBalanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000\n\n$F1\nF1 \n 1 \n```\n\nAs we can see the clustering result is accurate with no false-positives. The f1-score is 1.0.\n\n> Note: Do not run function `f_clustering_accuracy` when `K` is larger (> 7), because it does a permutation operation which will become expensive.\n\n### PCA on SGT & Clustering\n\nFor demonstrating PCA on SGT for dimension reduction and then performing clustering, we added another code snippet. PCA becomes more important on datasets where SGT's are sparse. A sparse SGT is present when the alphabet set is large but the observed sequences contain only a few of those alphabets. For example, the alphabet set for sequence dataset of music listening history will have thousands to millions of songs, but a single sequence will have only a few of them\n\n```\n######## Clustering on Principal Components of SGT features ########\nnum_pcs <- 5  # Number of principal components we want\ninput_data_pcs <- f_pcs(input_data = input_data, PCs = num_pcs)$input_data_pcs\n\nclustering_output_pcs <- f_kmeans(input_data = input_data_pcs, K = K, alphabet_set_size = sqrt(num_pcs), trace = F)\n\ncc <- f_clustering_accuracy(actual = c(strtoi(dataset[,1])), pred = c(clustering_output_pcs$class), K = K, type = \"f1\")  \nprint(cc)\n```\n\n*Result*\n```\n$cc\nConfusion Matrix and Statistics\n\n          Reference\nPrediction  a  b  c  d  e\n         a 50  0  0  0  0\n         b  0 66  0  0  0\n         c  0  0 60  0  0\n         d  0  0  0 55  0\n         e  0  0  0  0 68\n\nOverall Statistics\n\n               Accuracy : 1          \n                 95% CI : (0.9877, 1)\n    No Information Rate : 0.2274     \n    P-Value [Acc > NIR] : < 2.2e-16  \n\n                  Kappa : 1          \n Mcnemar's Test P-Value : NA         \n\nStatistics by Class:\n\n                     Class: a Class: b Class: c Class: d Class: e\nSensitivity            1.0000   1.0000   1.0000   1.0000   1.0000\nSpecificity            1.0000   1.0000   1.0000   1.0000   1.0000\nPos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000\nNeg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000\nPrevalence             0.1672   0.2207   0.2007   0.1839   0.2274\nDetection Rate         0.1672   0.2207   0.2007   0.1839   0.2274\nDetection Prevalence   0.1672   0.2207   0.2007   0.1839   0.2274\nBalanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000\n\n$F1\nF1 \n 1 \n ```\n\nThe clustering result remains accurate upon clustering the PCs on the SGT of sequences.\n\n\n-----------------------\n#### Comments:\n1. Simplicity: SGT's is simple to implement. There is no numerical optimization or other solution search algorithm required to estimate SGT. This makes it deterministic and powerful.\n2. Length sensitive: The length sensitive version of SGT can be easily tried by changing the marked arguments in `main.R`.\n\n#### Note:\n1. Small alphabet set: If the alphabet set is small (< 4), SGT's performance may not be good. This is because the feature space becomes too small.\n2. Faster implementation: The provided code is a research level code, not optimized for the best of speed. Significant speed improvements can be made, e.g. multithreading the SGT estimation for sequences in a dataset.\n\n#### Additional resource:\nPython implementation: Please refer to \n\nhttps://github.com/datashinobi/Sequence-Graph-transform\n\nThanks to Yassine for providing the Python implementation.\n\n",
        "description_content_type": "text/markdown",
        "docs_url": null,
        "download_url": "",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/pypa/sampleproject",
        "keywords": "",
        "license": "",
        "maintainer": "",
        "maintainer_email": "",
        "name": "sgt",
        "package_url": "https://pypi.org/project/sgt/",
        "platform": "",
        "project_url": "https://pypi.org/project/sgt/",
        "project_urls": {
            "Homepage": "https://github.com/pypa/sampleproject"
        },
        "release_url": "https://pypi.org/project/sgt/0.0.1/",
        "requires_dist": null,
        "requires_python": "",
        "summary": "Sequence Graph Transform (SGT) is a sequence embedding function.                          SGT extracts the short- and long-term sequence features and embeds them                   in a finite-dimensional feature space. With SGT you can tune the amount                   of short- to long-term patterns extracted in the embeddings without any                   increase in the computation.",
        "version": "0.0.1"
    },
    "last_serial": 5121398,
    "releases": {
        "0.0.1": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "f53777eee0d460b052ce54d83ea06494",
                    "sha256": "6e1ea8b73f1411fa5c539b470ca6ad21c6f08fa5e166a606c6a9995eda53e312"
                },
                "downloads": -1,
                "filename": "sgt-0.0.1-py3-none-any.whl",
                "has_sig": false,
                "md5_digest": "f53777eee0d460b052ce54d83ea06494",
                "packagetype": "bdist_wheel",
                "python_version": "py3",
                "requires_python": null,
                "size": 8133,
                "upload_time": "2019-04-10T00:04:19",
                "url": "https://files.pythonhosted.org/packages/98/d2/767366e06578190c708067985adee9ed2243d96fe235dfb32c5a1d55ef10/sgt-0.0.1-py3-none-any.whl"
            },
            {
                "comment_text": "",
                "digests": {
                    "md5": "62188b0e6381c972ddb2b7c55deb1de6",
                    "sha256": "784f51bbeecec5b14c0b3183052f611a2fc85297829eca5da65d51277057c8da"
                },
                "downloads": -1,
                "filename": "sgt-0.0.1.tar.gz",
                "has_sig": false,
                "md5_digest": "62188b0e6381c972ddb2b7c55deb1de6",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 7657,
                "upload_time": "2019-04-10T00:04:22",
                "url": "https://files.pythonhosted.org/packages/f2/10/3fa724e1334eab24be451ea3f6471406ff90f1470a380724e3001d7d8ade/sgt-0.0.1.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "f53777eee0d460b052ce54d83ea06494",
                "sha256": "6e1ea8b73f1411fa5c539b470ca6ad21c6f08fa5e166a606c6a9995eda53e312"
            },
            "downloads": -1,
            "filename": "sgt-0.0.1-py3-none-any.whl",
            "has_sig": false,
            "md5_digest": "f53777eee0d460b052ce54d83ea06494",
            "packagetype": "bdist_wheel",
            "python_version": "py3",
            "requires_python": null,
            "size": 8133,
            "upload_time": "2019-04-10T00:04:19",
            "url": "https://files.pythonhosted.org/packages/98/d2/767366e06578190c708067985adee9ed2243d96fe235dfb32c5a1d55ef10/sgt-0.0.1-py3-none-any.whl"
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "62188b0e6381c972ddb2b7c55deb1de6",
                "sha256": "784f51bbeecec5b14c0b3183052f611a2fc85297829eca5da65d51277057c8da"
            },
            "downloads": -1,
            "filename": "sgt-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "62188b0e6381c972ddb2b7c55deb1de6",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 7657,
            "upload_time": "2019-04-10T00:04:22",
            "url": "https://files.pythonhosted.org/packages/f2/10/3fa724e1334eab24be451ea3f6471406ff90f1470a380724e3001d7d8ade/sgt-0.0.1.tar.gz"
        }
    ]
}