{ "info": { "author": "Thomas Christie, James Ryan, Kyle Marek-Spartz and Serguei Pakhomov", "author_email": "tchristie@umn.edu", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "Intended Audience :: Healthcare Industry", "Intended Audience :: Science/Research", "License :: OSI Approved :: Apache Software License", "Programming Language :: C", "Programming Language :: Python :: 2.7", "Topic :: Multimedia :: Sound/Audio :: Speech", "Topic :: Scientific/Engineering :: Bio-Informatics", "Topic :: Scientific/Engineering :: Medical Science Apps.", "Topic :: Software Development :: Libraries", "Topic :: Text Processing :: Linguistic" ], "description": "VFClust\r\n=======\r\n\r\nThis package is designed to generate clustering analyses for\r\ntranscriptions of semantic and phonemic verbal fluency test responses.\r\nIn a verbal fluency test, the subject is given a set amount of time\r\n(usually 60 seconds) to name as many words as he or she can that\r\ncorrespond to a given specification. For a phonemic test, subjects are\r\nasked to name words that begin with a specific letter. For a semantic\r\nfluency test, subjects are asked to provide words of a certain category,\r\ne.g. animals. VFClust groups words in responses based on phonemic or\r\nsemantic similarity, as described below. It then calculates metrics\r\nderived from the discovered groups and returns them as a CSV file or\r\nPython dict object. For a detailed explanation of the reasoning underlying\r\nthe computation of these measures, please see:\r\n\r\nRyan et al., Computerized Analysis of a Verbal Fluency Test\r\nhttp://rxinformatics.umn.edu/downloads/VFCLUST/ryan\\_acl2013.pdf\r\n\r\nVerbal fluency tests are often used in test batteries used to study\r\ncognitive impairment arising from e.g. Alzheimer's disease, Parkinson's\r\ndisease, and certain medications. 
The following reference provides\r\nan introduction to the use of clustering in cognitive evaluation.\r\n\r\nMayr, U. (2002). On the dissociation between clustering and switching in\r\nverbal fluency: Comment on Troyer, Moscovitch, Winocur, Alexander and\r\nStuss. Neuropsychologia, 40(5), 562-566.\r\n\r\n\r\nClustering in VFClust\r\n---------------------\r\n\r\nVFClust finds adjacent subsets of words of the following types:\r\n\r\n- **clusters**: every entry in a cluster is sufficiently similar to every\r\n other entry\r\n\r\n- **chains**: every entry in a chain is sufficiently similar to adjacent\r\n entries\r\n\r\nwhere \"entry\" corresponds to a word, compound word, or multiple adjacent\r\nwords with the same stem.\r\n\r\nSimilarity scores between words are thresholded and binarized using\r\nempirically-derived thresholds. Overlap of clusters is allowed (a word\r\ncan be part of multiple clusters), but overlapping chains are not\r\npossible, as any two adjacent words with a lower-than-threshold\r\nsimilarity break the chain. Clusters subsumed by other clusters are not\r\ncounted.\r\n\r\nThe similarity measures used are the following:\r\n\r\n- **PHONETIC/\"phone\"**: the phonetic similarity score (PSS) is calculated\r\n between the phonetic representations of the input units. It is equal\r\n to 1 minus the Levenshtein distance between two strings, normalized\r\n to the length of the longer string. The strings should be compact\r\n phonetic representations of the two words. (This method is a\r\n modification of a Levenshtein distance function available at\r\n http://hetland.org/coding/python/levenshtein.py.)\r\n\r\n- **PHONETIC/\"biphone\"**: the binary common-biphone score (CBS) depends on\r\n whether two words share their initial and/or final biphone (i.e., set\r\n of two phonemes). 
A score of 1 indicates that two words have the same\r\n initial and/or final biphone; a score of 0 indicates that two words\r\n have neither the same initial nor final biphone. This is also\r\n calculated using the phonetic representation of the two words.\r\n\r\n- **SEMANTIC/\"lsa\"**: a semantic relatedness score (SRS) is calculated as\r\n the cosine of the respective term vectors for the first and second\r\n word in an LSA space of the specified clustering\\_parameter. Unlike\r\n the PHONETIC methods, this method uses the .text property of the\r\n input Unit objects.\r\n\r\n- **SEMANTIC/\"custom\"**: the user can specify a custom file of word similarities,\r\n in which each pair of words is given a custom similarity score.\r\n\r\nOutput\r\n~~~~~~\r\n\r\nAfter chains/clusters are discovered using the methods relevant for the\r\ntype of fluency test performed, metrics are derived from the clusters\r\nand output to screen and a .csv file (if run as a script) or to a Python\r\ndict object (if run as a package). The following metrics are calculated:\r\n\r\nCounts of different token types in the raw input. Each of these is\r\nprefaced by ''COUNT\\_'' in the output.\r\n\r\n- **total\\_words**: count of words (i.e. utterances with semantic content)\r\n spoken by the subject. Filled pauses, silences, coughs, breaths,\r\n words by the interviewer, etc. are all excluded from this count.\r\n\r\n- **permissible\\_words**: Number of words spoken by the subject that\r\n qualify as a valid response according to the clustering criteria.\r\n Compound words are counted as a single word in SEMANTIC clustering,\r\n but as two words in PHONETIC clustering.\r\n\r\n- **exact\\_repetitions**: Number of words which repeat words spoken earlier\r\n in the response. 
Responses in SEMANTIC clustering are lemmatized\r\n before this function is called, so slight variations (dog, dogs) may\r\n be counted as exact repetitions.\r\n\r\n- **stem\\_repetitions**: Number of words whose stems are identical to those of words uttered\r\n earlier in the response, according to the Porter Stemmer. For\r\n example, 'sled' and 'sledding' have the same stem ('sled'), and\r\n 'sledding' would be counted as a stem repetition.\r\n\r\n- **examiner\\_words**: Number of words uttered by the examiner. These start\r\n with \"E\\_\" in .TextGrid files.\r\n\r\n- **filled\\_pauses**: Number of filled pauses uttered by the subject. These\r\n begin with \"FILLEDPAUSE\\_\" in the .TextGrid file.\r\n\r\n- **word\\_fragments**: Number of word fragments uttered by the subject.\r\n These end with \"-\" in the .TextGrid file.\r\n\r\n- **asides**: Words spoken by the subject that do not adhere to the test\r\n criteria are counted as asides, i.e. words that do not start with the\r\n appropriate letter or that do not represent an animal.\r\n\r\n- **unique\\_permissible\\_words**: Number of words spoken by the subject,\r\n less asides, stem repetitions and exact repetitions.\r\n\r\nMeasures derived from clusters/chains in the response. Each of these is\r\nprefaced by ''COLLECTION\\_'', along with the similarity measure used and\r\nthe collection type the measure was calculated over.\r\n\r\n- **pairwise\\_similarity\\_score\\_mean**: mean of pairwise similarity\r\n scores. The pairwise similarity is calculated as the sum of\r\n similarity scores for all word pairs in a response -- except\r\n any pair composed of a word and itself -- divided by the total number\r\n of words in an attempt. 
I.e., the mean similarity over all\r\n word pairs.\r\n\r\n- **count**: number of collections\r\n\r\n- **size\\_mean**: mean size of collections\r\n\r\n- **size\\_max**: size of largest collection\r\n\r\n- **switch\\_count**: number of changes between clusters\r\n\r\nMeasures derived from timing information in the response, along with\r\nclusters/chains. Each of these is prefaced by ''TIMING\\_'' along with\r\nthe similarity measure used and the collection type the\r\nmeasure was calculated over.\r\n\r\n- **response\\_vowel\\_duration\\_mean**: average vowel duration of all vowels\r\n in the response.\r\n\r\n- **response\\_continuant\\_duration\\_mean**: average duration of all\r\n continuants in the response.\r\n\r\n- **between\\_collection\\_interval\\_duration\\_mean**: average interval\r\n duration separating clusters. Negative intervals (for overlapping\r\n clusters) are counted as 0 seconds. Intervals are calculated as being\r\n the difference between the ending time of the last word in a\r\n collection and the start time of the first word in the subsequent\r\n collection. Note that these intervals are not necessarily silences,\r\n and may include asides, filled pauses, words from the examiner, etc.\r\n\r\n- **within\\_collection\\_interval\\_duration\\_mean**: the mean time between\r\n the end of each word in the collection and the beginning of the next\r\n word. Note that these times do not necessarily reflect pauses, as\r\n collection members could be separated by asides or other noises.\r\n\r\n- **within\\_collection\\_vowel\\_duration\\_mean**: average duration of vowels\r\n that occur within a collection.\r\n\r\n- **within\\_collection\\_continuant\\_duration\\_mean**: average duration of\r\n continuants that occur within a collection.\r\n\r\n\r\n\r\nDependencies\r\n~~~~~~~~~~~~\r\n\r\nThis package has been tested on Mac OS X (Mavericks). 
In order to run\r\nthe package you must have the following installed on your machine:\r\n\r\n0. Python 2.7\r\n\r\n1. **pip**: pip should be installed along with Python 2.7. If for some reason pip\r\n is not installed, go to your terminal or command line of choice and\r\n enter the command below:\r\n\r\n::\r\n\r\n easy_install pip\r\n\r\n2. **NLTK**: VFClust requires the Natural Language Toolkit (NLTK), as it\r\n uses the NLTK lemmatizer and stemmer in parsing subject responses.\r\n Check http://www.nltk.org for more information on how to install\r\n NLTK.\r\n\r\n::\r\n\r\n pip install nltk\r\n\r\n3. **numpy**: Some of the data files are stored as numpy arrays. This will\r\n change in future releases, but for now numpy is required.\r\n\r\n::\r\n\r\n pip install numpy\r\n\r\n4. **gcc**: On Mac OS X, you will need to install the latest version of\r\n Xcode compatible with your version of OS X, along with the Command Line Tools\r\n package (https://developer.apple.com/xcode/). Keep in mind that you\r\n may need to enable command-line tools in Xcode in order to be able to\r\n use the gcc compiler. If you can't run gcc from the command line after\r\n installing Xcode, go to the Xcode Preferences/Downloads tab and\r\n select the \"Install\" button, next to \"Command Line Tools.\"\r\n\r\n\r\nInstallation\r\n-------------\r\n\r\nThere are two ways to install the package. VFClust is registered at\r\nhttp://pypi.python.org/, so you can install it using:\r\n\r\n::\r\n\r\n $ sudo pip install vfclust\r\n\r\nThe ``sudo`` is included because the setup process\r\nincludes compiling a file (t2p.c) and placing it in the install directory.\r\n\r\nTo install the package manually, download the .zip file from github.com or the\r\n.tar.gz file from pypi.python.org. 
Extract the file, navigate to the new\r\ndirectory in the terminal, and type\r\n\r\n::\r\n\r\n $ sudo python setup.py install\r\n\r\nYou will need to have the gcc compiler installed on your system.\r\nInstalling also includes compiling a C executable for the\r\ngrapheme-to-phoneme conversion (t2p) that the phonetic clustering\r\npackage uses. If everything went okay, you should see the following\r\noutput in the console:\r\n\r\n::\r\n\r\n success S AH0 K S EH1 S\r\n\r\nalong with other output from the install process.\r\n\r\nThere are three ways to run VFClust, and therefore three tests to make sure\r\nit's running properly. If you installed using pip, you can test the program\r\nusing some of the included example files. You should be able to type:\r\n\r\n::\r\n\r\n $ vfclust test\r\n\r\nIf you simply downloaded the package, you can navigate to the \"vfclust\" directory\r\nand type\r\n\r\n::\r\n\r\n $ python vfclust.py test\r\n\r\nIf you are using vfclust within Python, type:\r\n\r\n::\r\n\r\n >>> import vfclust\r\n >>> vfclust.test_script()\r\n\r\nThe results should be the same in each case.\r\n\r\n\r\nDeploying\r\n---------\r\n\r\n*Input*\r\n~~~~~~~\r\n\r\nVFClust operations are performed on transcriptions of verbal fluency tests. These can be recorded as either CSV files or TextGrid files. For a CSV file, the first field should be the subject ID number, and each remaining field should contain a response. For example:\r\n\r\n::\r\n\r\n 12345,fort,friend,fry,fetch,follow,um,i,don't,know,fall,felt\r\n\r\nFor a .TextGrid file, at this point the program expects two tiers, where the first includes the word strings and the second includes the phone strings. Here are the first few lines of an example file:\r\n\r\n::\r\n\r\n File type = \"ooTextFile\"\r\n Object class = \"TextGrid\"\r\n\r\n xmin = 0\r\n xmax = 59.72\r\n tiers? 
\r\n size = 2\r\n item []:\r\n item [1]:\r\n class = \"IntervalTier\"\r\n name = \"word\"\r\n xmin = 0\r\n xmax = 59.72\r\n intervals: size = 65\r\n intervals [1]:\r\n xmin = 0.00\r\n xmax = 1.31\r\n text = \"!SIL\"\r\n intervals [2]:\r\n xmin = 1.31\r\n xmax = 1.83\r\n text = \"CAT\"\r\n intervals [3]:\r\n xmin = 1.83\r\n xmax = 2.22\r\n text = \"!SIL\"\r\n intervals [4]:\r\n xmin = 2.22\r\n xmax = 2.72\r\n\r\nIn both .TextGrid and .csv files, non-word noises and responses can be annotated using the following:\r\n\r\n- !SIL = silence\r\n- starts with E\\_ = examiner word\r\n- FILLEDPAUSE\\_um or FILLEDPAUSE\\_ah = filled pause\r\n- T\\_NOISE = noise\r\n- T\\_COUGH = cough\r\n- T\\_LIPSMACK = lipsmack\r\n- T\\_BREATH = breath\r\n\r\nThese special tags will be used to generate a list of counts for\r\neach token type. Any entry that is not one of these and does not fit into the specified clustering category will be labeled as an aside.\r\n\r\n\r\n*As a script*\r\n~~~~~~~~~~~~~\r\n\r\nAfter installation, you should be able to use vfclust from the command line simply by typing:\r\n\r\n::\r\n\r\n vfclust [-h] [-s SEMANTIC] [-p PHONEMIC] [-o OUTPUT_PATH] [-q]\r\n [--similarity-file SIMILARITY_FILE] [--threshold THRESHOLD]\r\n source_file_path\r\n\r\nwith the relevant parameters.\r\n\r\nIf for some reason this doesn't work, you can navigate to the directory containing the vfclust.py file (it should be\r\nin the vfclust/ subdirectory of the installed package) and type:\r\n\r\n::\r\n\r\n python vfclust.py [-h] [-s SEMANTIC] [-p PHONEMIC] [-o OUTPUT_PATH] [-q]\r\n [--similarity-file SIMILARITY_FILE] [--threshold THRESHOLD]\r\n source_file_path\r\n\r\nBracketed arguments are optional, but either -s (semantic) or -p\r\n(phonemic) must be selected. 
The arguments are as follows:\r\n\r\n::\r\n\r\n positional arguments:\r\n source_file_path Full path of textgrid or csv file to parse\r\n\r\n optional arguments:\r\n -h, --help show this help message and exit\r\n -s SEMANTIC Usage: -s animals If included, calculates measures for\r\n the given category for the semantic fluency test, i.e.\r\n animals, fruits, etc.\r\n -p PHONEMIC Usage: -p f If included, calculates measures for the\r\n given category for the phonemic fluency test, i.e. a,\r\n f, s, etc.\r\n -o OUTPUT_PATH Where to put output - default is the same directory as\r\n the input file working directory.\r\n -q Use to eliminate output (default is print everything\r\n to stdout).\r\n --similarity-file SIMILARITY_FILE\r\n Usage: --similarity-file /path/to/similarity/file\r\n Location of custom word similarity file. Each line\r\n must contain two words separated by a space, followed\r\n by a comma and the similarity number. For example,\r\n \"horse dog,1344.3969\" is a valid line. If used, the\r\n default \"LSA\" option is overridden. You must also\r\n include a threshold number with --threshold X.\r\n --threshold THRESHOLD\r\n Usage: --threshold X, where X is a number. A custom\r\n threshold is required when including a custom\r\n similarity file. A custom threshold can also be set\r\n when using semantic or phonemic clustering. 
In this\r\n case, it would override the default threshold\r\n implemented in the program.\r\n\r\nFor example, to run clustering on a phonetic verbal fluency test using the letter \"f\",\r\nwhere the response was saved as a .csv file, type:\r\n\r\n::\r\n\r\n vfclust -p f /path/to/response/response.csv\r\n\r\nSimilarly, to run clustering on a semantic verbal fluency test using the category\r\n\"animals\", where the response is recorded as a .TextGrid file, type\r\n\r\n::\r\n\r\n vfclust -s animals /path/to/response/response.TextGrid\r\n\r\n\r\nTo use a custom similarity file, type something like the following:\r\n\r\n::\r\n\r\n python vfclust.py --similarity-file path/to/similarity/file.txt --threshold 0.5 /path/to/response/response.TextGrid\r\n\r\nBy default, the results are printed to screen and a .csv file is created in the same\r\ndirectory as the response.csv file. You can output the results to a different directory\r\nby using the -o flag.\r\n\r\n\r\n\r\n*As a Python package*\r\n~~~~~~~~~~~~~~~~~~~~~\r\n\r\nThe functionality in the ``vfclust`` script is accessed using the ``vfclust.get_duration_measures``\r\nmethod. The method inputs are as follows:\r\n\r\n::\r\n\r\n :param source_file_path: Required. Location of the .csv or .TextGrid file to be\r\n analyzed.\r\n :param output_path: Path to which to write the resultant csv file. If left None,\r\n path will be set to the source_file_path. If set to False, no file will be\r\n written.\r\n :param phonemic: The letter used for phonetic clustering. Note: should be False if\r\n semantic clustering is being used.\r\n :param semantic: The word category used for semantic clustering. 
Note: should be\r\n False if phonetic clustering is being used.\r\n :param quiet: Set to True if you want to suppress output to the screen during processing.\r\n :param similarity_file (optional): When doing semantic processing, this is the path of\r\n a file containing custom term similarity scores that will be used for clustering.\r\n If a custom file is used, the default LSA-based clustering will not be performed.\r\n :param threshold (optional): When doing semantic processing, this threshold is used\r\n in conjunction with a custom similarity file. The value is used as a semantic\r\n similarity cutoff in clustering. This argument is required if a custom similarity\r\n file is specified. This argument can also be used to override the built-in\r\n cluster/chain thresholds.\r\n\r\n :return data: A dictionary of measures derived by clustering the input response.\r\n\r\nand can be called by typing\r\n\r\n::\r\n\r\n >>> import vfclust\r\n >>> results = vfclust.get_duration_measures(source_file_path = '/path/to/response/response.TextGrid',\r\n ... output_path = '/output/directory/',\r\n ... phonemic = 'a')\r\n\r\nIf you enter invalid arguments or both the \"phonemic\" and \"semantic\" arguments, an exception\r\nwill be raised.\r\n\r\n*Using a custom similarity file*\r\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r\n\r\nYou can also specify word similarities using a separate file. If this is done, words in the\r\nresponse will be counted as \"permissible\" and as legitimate members of clusters only if they\r\nappear somewhere in this file. VFClust will also assume all words in the file are already\r\ntokenized, i.e. 
\"polar bear\" should be written as \"polar_bear\".\r\n\r\nEach line of the file must be formatted with two words separated by a space, followed by a comma and\r\na number:\r\n\r\n::\r\n\r\n elk bison,114.9277\r\n guinea_pig mouse,113.2803\r\n panther puma,112.4150\r\n cat skunk,112.2775\r\n cardinal finch,111.5717\r\n squirrel elephant,111.2780\r\n\r\nWhen using a custom similarity file, you must also explicitly specify a custom threshold using the\r\n--threshold argument.\r\n\r\n\r\n\r\nACKNOWLEDGEMENTS\r\n----------------\r\n\r\nThis package uses a grapheme-to-phoneme conversion (t2p) implementation\r\nby the MBRDICO Project (http://tcts.fpms.ac.be/synthesis/mbrdico/).\r\n\r\nThe English Open Word List is used as a basic dictionary of English\r\nwords. http://dreamsteep.com/projects/the-english-open-word-list.html\r\n\r\nThe NLTK (http://www.nltk.org) WordNet 3.0 Corpus is used for lemmatizing words.\r\n\r\nLicense\r\n-------\r\n\r\nAll files which are included as a part of the VFClust Phonetic\r\nClustering Module are provided under an Apache license, excluding:\r\n\r\n- t2p.c in the data/t2p directory, which is provided under a GPL license\r\n\r\n- the NLTK WordNet 3.0 corpus, which is Copyright 2006 by Princeton University.\r\n The full text of the license is available in the corpus files.\r\n\r\n- english\\_words.txt in the data/EOWL directory, which is a\r\n modification of the UK Advanced Cryptics Dictionary and is released\r\n with the following licensing:\r\n\r\nCopyright J Ross Beresford 1993-1999. All Rights Reserved. 
The\r\nfollowing restriction is placed on the use of this publication: if the\r\nUK Advanced Cryptics Dictionary is used in a software package or\r\nredistributed in any form, the copyright notice must be prominently\r\ndisplayed and the text of this document must be included verbatim.", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/speechinformaticslab/vfclust", "keywords": "bioinformatics speech linguistics", "license": "Apache License, Version 2.0", "maintainer": "", "maintainer_email": "", "name": "VFClust", "package_url": "https://pypi.org/project/VFClust/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/VFClust/", "project_urls": { "Homepage": "https://github.com/speechinformaticslab/vfclust" }, "release_url": "https://pypi.org/project/VFClust/0.1.1/", "requires_dist": null, "requires_python": null, "summary": "Clustering of Verbal Fluency responses.", "version": "0.1.1" }, "last_serial": 3139322, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "a3b73f5a144fec461ef926dc047a4fdd", "sha256": "e4a0037c16dce41df9874fa64a87f75c1fac9e41f49d79c25ce93afbaec46687" }, "downloads": -1, "filename": "VFClust-0.1.0.tar.gz", "has_sig": false, "md5_digest": "a3b73f5a144fec461ef926dc047a4fdd", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 18868365, "upload_time": "2014-06-26T02:33:31", "url": "https://files.pythonhosted.org/packages/c6/c2/123887383ccf294442ae3e0007045858827e3d24fa0f6fca617a74407f48/VFClust-0.1.0.tar.gz" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "cbfec10f183e493dc1687b5573adc656", "sha256": "aa74a32946852917830e8ec48677e8e3978d31a795d7105db4b5bbf7705d8bf4" }, "downloads": -1, "filename": "VFClust-0.1.1.tar.gz", "has_sig": false, "md5_digest": "cbfec10f183e493dc1687b5573adc656", "packagetype": "sdist", "python_version": "source", 
"requires_python": null, "size": 41049684, "upload_time": "2014-07-08T19:48:10", "url": "https://files.pythonhosted.org/packages/01/7e/c4adb93a8f769c5b1ad013b9389d0f3ffebf6f1b33665bbd922f65ed7f9c/VFClust-0.1.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "cbfec10f183e493dc1687b5573adc656", "sha256": "aa74a32946852917830e8ec48677e8e3978d31a795d7105db4b5bbf7705d8bf4" }, "downloads": -1, "filename": "VFClust-0.1.1.tar.gz", "has_sig": false, "md5_digest": "cbfec10f183e493dc1687b5573adc656", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 41049684, "upload_time": "2014-07-08T19:48:10", "url": "https://files.pythonhosted.org/packages/01/7e/c4adb93a8f769c5b1ad013b9389d0f3ffebf6f1b33665bbd922f65ed7f9c/VFClust-0.1.1.tar.gz" } ] }