{ "info": { "author": "Nick Seidenman", "author_email": "seidenman@wehi.edu.au", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Console", "License :: OSI Approved :: BSD License", "Topic :: Scientific/Engineering :: Bio-Informatics", "Topic :: Scientific/Engineering :: Information Analysis" ], "description": "===================================================================\nGSE -- Python module for processing Geo Series Expression Datasets\n===================================================================\n\nGEO Datasets\n============\n\nThe National Center for Biotechnology Information makes microarray datasets available for free download\nthat are used by researchers world-wide. This module was written to facilite processing of these\ndatasets from within applications written in python.\n\n1. File Structure\n=================\n\nFiles containing data for this dataset are organized into TAB-separated\ncolumns of data. All files contain a certain amount of metadata encoded in the \nbeginning lines of the file. Metadata records begin with a descriptive record label \nthat begins with \"!\", ``!Series_title``, for example.\n\nThe actual expression data may be found in this same file, or in separeate files, one per sample, \nthe names of which can be found in the associated metadata. For purposes of simplicity,\nit is assumed that the expression data follows the metadata in this same file, between\nthe descriptor labels::\n\n !series_matrix_table_begin \n\nand::\n\n !series_matrix_table_end\n\n1.1 gse (script)\n----------------\n\nA command-line script, called *gse* is provided that uses the classes defined in this module\nto render data, both to the console and into files, depending on the switches used in the \ncommand line. The output tacitly includes a file with the same name as the input, but with the\nextension changed to a '.P' to denote python *pickled* contents. This will contain the pickled\n*GEOSeries* object, which can be unpickled using the *cPickle.load* function, later.\n\n*gse* will try to interpret the input file as a pickled *GEOSeries* instance. Failing that, it will\nthen try to create a new instance from what will be assumed to be a ``GSE_series_matrix.txt`` file. \nThe upshot is that if you've already have a pickled instance, you can use this for subsequent operations\n(show-levels, for instance) without having to process the original input all over again, thereby \nsaving a bit of time.\n\n2. Metadata\n===========\n\nThere are two kinds of metadata in the GSE series matrix: *series* metadata and *sample* metadata. \nSeries metadata generally have two fields or columns, separated by a tab. The first column is the\nmetadata descriptor and always begins with ``!Series_``, and the second column is the associated value. \nFor instance::\n\n !Series_title \"Reconstruction of the dynamic regulatory ...\"\n\nNote that sometimes the value, which is of type *string* may be enclosed in quotation marks. \nThis isn't entirely consistent, but seems to be the case more often than not.\n\nSample metadata is of rougly similar format, with the first column being the descriptor,\nwhich always begins with ``!Sample_``. There will be as many columns after this first one as there\nare samples in the dataset, and are supposed to appear in the same order as the expression data\ncolumns (i.e. samples) in the dataset proper. However, to be certain that the metadata are correctly\nassociated with the corresponding sample, one of the sample metadata rows contains the sample ID as found in\nthe dataset proper, so an association of all other sample metadata with corresponding sample\nshould be done, indirectly, through this *sample_id* metadata row.\n\n2.1 Displaying Metadata\n-----------------------\n\nThe ``--show-metadata`` switch will cause series and sample metadata to be emitted to *stdout*. There\nare three formats: *pretty*, *json*, and *html*, with *pretty* being the default format. Format is selected\nwith the ``--metadata-format=`` switch.\n\n3. Dataset Output\n=================\n\nIf no output file is specified, no expression data will be emitted at all. Usin the ``--output=`` or ``-o`` \nswitch to specify output destination. If you want output to go to *stdout*, use ``--output=-`` or ``-i -``\n(i.e., use a hyphen for the filename.)\n\n3.1 Raw vs Log Expressioin values\n---------------------------------\n\nSome datasets contain \"raw\" data (or read counts.) Typically, we want expression values to be given\nas *log2* values. The ``--log2`` flag will cause expression data to be thusly converted.\n\n\n3.1 Grouping Sample Output\n--------------------------\n\nIf there are multiple *levels* of metadata, these can be used to group the samples, aggregating them\nby taking the arithmetic mean of the column values for samples that are in the same group. Say, for instance\nthat you have ten samples and that are actually two groups of five replicates each. There will be\n*sample metadata* that defines these groups. The putput will then be two columns (plus the index column,\nwhich is typically is typically the probe ID for each row) the values of which being the mean of the group of\nvalues in each of the two groups.\n\nThe available sample metadata levels can be displayed using the ``--list-levels`` switch, which prints out\nan enumerated list starting at 0. The zeroth level is just the individual samples, ungrouped. \n\nGrouping the samples is requested with the ``--group-by=``*level* or ``-g`` *level* switch. If not specified, this\nobviously defaults to zero. The level may be specified either with a non-negative integer or the metadata descriptor.\nIf using the descriptor, remember to enclose it in quotes if there are embedded spaces in the descriptor label.\n\n4. GSE Classes\n==============\n\nThere are three classes defined in this module, two of which act as containsers for the others.\n\n4.1 GSESeries\n-------------\n\nThis is the top-level class that contains both the data and metadata for specified dataset. It is passed\na file-like object from which it reads and parses the (expected) GSE series matrix. The resulting instance\noffers several methods for displaying the metadata or emiting TSV files containing the dataset as a table\nin which the columns are, perhaps, grouped according to column index metadata.\n\n4.2 GSESeriesMetadata\n---------------------\n\nThe metadata for the series matrix is accessed throught the ``metadata`` attribute of the *GSESeries* instance. Attributes\ncan be listed using the ``attribute`` property. These can and will, of course, vary with each particular dataset.\n\n4.3 GSESampleMetadata\n---------------------\n\nThe metadata for each sample in the series matrix can be accessed through the ``samples`` attribute of the *GSESeries*\ninstance. This is actually a property that returns a generator that can be used to iterate through the samples in\n\"sample order\", that is, the order in which they appear in the matrix. To obtain a specific sample from its\nindex, use the generator to create a list, then index that list. For example::\n\n fifth_sample = list(series_instance.samples)[4]\n\n5. MAGMA2\n=========\n\nThe older web applications, called *Guide* (see below) used an in-house-designed SQL database schema called *MAGMA*.\n(It was an acronym, but, just think of it as the molten aglomeration of a bunch of stuff swirling around\nin a great maelstrom, throwing off lots of heat and causing tremors now and then.) MAGMA was completely re-designed\nfor *Guide*'s successor, *HaemoSphere*, and is called *MAGMA2*.\n\n5.1 gse-magma (script)\n----------------------\n\nThe command-line script *gse-magma* takes the pickled `GEOSeries` object produced by `gse` and\nemits the DDL that will enter the datasets metadata into MAGMA2.\n\nThe output filename will also include a version that can be set using the ``--version`` switch (default: 1.0).\n\nThe file containing the DDL for rows to add to the *MAGMA2* dataset metadata will be found in::\n \n ._DDL.sql\n \n\n5.2 gse-magma (advanced configuration)\n--------------------------------------\n\nCreating the DDL is particularly tricky since not all GSE files will contain the same kinds of metadata,\nnor will we always want to use the same metadata for any given dataset. It is therefore possible to \nspecify configuration options, encoded as python objects, using the ``--magma2-config=`` switch.\n\nThis file can contain customised settings for callable objects that will take a GEOSeries instance as an argument\nand return a string. For example, the ``dataset_handle`` object might look like this::\n\n dataset_handle = lambda gseObj: gseObj.accession\n\nThere are also ``dataset_version`` and ``dataset_description`` objects that can be defined as well.\n\nSample metadata are referecned somewhat the same -- as callable python objects -- but the argument passed \nis the GSESampleMetadata instance. This will be called in a loop that iterates through each of the samples\nso metadata pertaining to each is available to these callable objects. For instance::\n\n sample_metadata_description = lambda samp_inst: samp_inst.title\n\nreturns the descriptive text for the given sample ``samp_inst``.\n\nThese callable objects can be full-blown functions, not just anonymous *lambda* functions. Other, scaffolding\nor supporting code can also be included in this configuration file. Care should be taken in naming \nvariables that should NOT be treated as configuration variables: their names should always begin with an\nunderscore (_). See the documentation for the *cfgparse* module for further information.\n\nA template configuration file can be generated by using the ``--template`` switch. This simply\nprints out the default configuration, in which all valus are set to empty strings or zeros.\n\n6. GUIDE\n========\n\nWEHI had an internally-developed web application called *Guide* that was a sort of genome browser\nmarried to a collection of datasets commonly used by our scientists. This module was written \nfirst and foremost to support and facilitate the addition of new datasets to this *Guide* collection.\n\n*Guide* has now been superceded by *HaemoSphere* which uses an updated version of *MAGMA* called, appropriately\nenough, *MAGMA2*. This section is included ONLY for historical purposes. As of 1/1/2014, Guide is no longer\nsupported in GSE.\n\n6.1 gse-guide (script)\n----------------------\n\nThe command-line script *gse-guide* takes the pickled `GEOSeries` object produced by `gse` and\nemits three files that will then be incorporated into the Guide application's database.\nGuide expects to see two files containing a picked object called a *matricks* which is a bit like a *pandas* `DataFrame`.\n(Newer versions of guide will deprecate *matricks* in favor of *pandas*.) \n\nUsing the ``--handle=`` switch\nwill cause these two files to be created. Originally, these contained the raw (i.e. unaggregated) samples and\nthe samples aggregated according to the celltype from which they were extracted. Here, \"celltype\" may be a misnomer\nbut it is still used for historical reasons. The celltype grouping is specified by the ``--group-by=`` switch, defaulting to the\nsecond (``--group-by=1``) metadata level value.\n\nThe output filenames will also include a version that can be set using the ``--version`` switch (default: 1.0).\n\nThe result will be two files named::\n\n SampleSignalProfiles...pickled\n\nand::\n\n CelltypeSignalProfiles...pickled\n\nAlso, a file containing the DDL for rows to add to the *guide* databaset tables will be found in::\n\n ._DDL.sql\n\n\n6.2 gse-guide (advanced configuration)\n--------------------------------------\n\nCreating the DDL is particularly tricky since not all GSE files will contain the same kinds of metadata,\nnor will we always want to use the same metadata for any given dataset. It is therefore possible to \nspecify configuration options, encoded as python objects, using the ``--guide-config=`` switch.\n\nThis file can contain customised settings for callable objects that will take a GEOSeries instance as an argument\nand return a string. For example, the ``dataset_handle`` object might look like this::\n\n dataset_handle = lambda gseObj: gseObj.accession\n\nThere are also ``dataset_version`` and ``dataset_description`` objects that can be defined as well.\n\nSample metadata are referecned somewhat the same -- as callable python objects -- but the argument passed \nis the GSESampleMetadata instance. This will be called in a loop that iterates through each of the samples\nso metadata pertaining to each is available to these callable objects. For instance::\n\n sample_description = lambda samp_inst: samp_inst.title\n\nreturns the descriptive text for the given sample ``samp_inst``.\n\nThese callable objects can be full-blown functions, not just anonymous *lambda* functions. Other, scaffolding\nor supporting code can also be included in this configuration file. Care should be taken in naming \nvariables that should NOT be treated as configuration variables: their names should always begin with an\nunderscore (_). See the documentation for the *cfgparse* module for further information.\n\nA template configuration file can be generated by using the ``--template`` switch. This simply\nprints out the default configuration, in which all valus are set to empty strings or zeros.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "http://pypi.python.org/pypi/gse/", "keywords": "dataset extraction metadata GSE bioinformatics", "license": "UNKNOWN", "maintainer": null, "maintainer_email": null, "name": "gse", "package_url": "https://pypi.org/project/gse/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/gse/", "project_urls": { "Download": "UNKNOWN", "Homepage": "http://pypi.python.org/pypi/gse/" }, "release_url": "https://pypi.org/project/gse/0.1.9/", "requires_dist": null, "requires_python": null, "summary": "extract metadata and dataset from GEO Series Matrix format data", "version": "0.1.9" }, "last_serial": 1083814, "releases": { "0.1.7": [ { "comment_text": "", "digests": { "md5": "bce8ba39732286fb0ae3aae722a94f62", "sha256": "a38d8df3f530ea8908c3251f1f63786c81192c0f3c0e13ba1091bdf50a742fbf" }, "downloads": -1, "filename": "gse-0.1.7-py2.7.egg", "has_sig": false, "md5_digest": "bce8ba39732286fb0ae3aae722a94f62", "packagetype": "bdist_egg", "python_version": "2.7", "requires_python": null, "size": 26716, "upload_time": "2013-09-09T12:32:06", "url": "https://files.pythonhosted.org/packages/bd/8b/4d4aff41d854ab9a1d773599a7c96a9612761515ca2116965d94bbabb541/gse-0.1.7-py2.7.egg" } ], "0.1.8": [ { "comment_text": "", "digests": { "md5": "2a0083c945f9c6b74d383f5ac9ac7750", "sha256": "e6ddedd94eea1d722cfb2d77abafec2e5da1a17ca3a5f239611746cd0da40eb4" }, "downloads": -1, "filename": "gse-0.1.8-py2.7.egg", "has_sig": false, "md5_digest": "2a0083c945f9c6b74d383f5ac9ac7750", "packagetype": "bdist_egg", "python_version": "2.7", "requires_python": null, "size": 34584, "upload_time": "2014-01-31T05:01:58", "url": "https://files.pythonhosted.org/packages/6d/4b/efc9f447d73029d5e60effeacbd6476d61238faa1f806312a9821a414199/gse-0.1.8-py2.7.egg" } ], "0.1.9": [ { "comment_text": "", "digests": { "md5": "34ef2ded9687e95bb114ada14f8a7dda", "sha256": "a66adcddd6346001dec070ce9f618753dfae358ae2e752b78ae354fcd18d70f9" }, "downloads": -1, "filename": "gse-0.1.9-py2.7.egg", "has_sig": false, "md5_digest": "34ef2ded9687e95bb114ada14f8a7dda", "packagetype": "bdist_egg", "python_version": "2.7", "requires_python": null, "size": 35435, "upload_time": "2014-02-10T22:43:32", "url": "https://files.pythonhosted.org/packages/92/1a/9b936e8c2d9ef4467b2ad62cb428720c3bcae5105e955401fc700054998f/gse-0.1.9-py2.7.egg" }, { "comment_text": "", "digests": { "md5": "ed06e6ee572343e7491c53aab2c0c7b9", "sha256": "723b326ddb7ce86fb4d4728b61f4853e64cab67fd9023ac41b7e94cd419cab0f" }, "downloads": -1, "filename": "gse-0.1.9.tar.gz", "has_sig": false, "md5_digest": "ed06e6ee572343e7491c53aab2c0c7b9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15890, "upload_time": "2014-05-07T08:36:28", "url": "https://files.pythonhosted.org/packages/df/0d/fca5ce1dad47f1610adc2438a5638cbe5243801976349ea6068aee63f468/gse-0.1.9.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "34ef2ded9687e95bb114ada14f8a7dda", "sha256": "a66adcddd6346001dec070ce9f618753dfae358ae2e752b78ae354fcd18d70f9" }, "downloads": -1, "filename": "gse-0.1.9-py2.7.egg", "has_sig": false, "md5_digest": "34ef2ded9687e95bb114ada14f8a7dda", "packagetype": "bdist_egg", "python_version": "2.7", "requires_python": null, "size": 35435, "upload_time": "2014-02-10T22:43:32", "url": "https://files.pythonhosted.org/packages/92/1a/9b936e8c2d9ef4467b2ad62cb428720c3bcae5105e955401fc700054998f/gse-0.1.9-py2.7.egg" }, { "comment_text": "", "digests": { "md5": "ed06e6ee572343e7491c53aab2c0c7b9", "sha256": "723b326ddb7ce86fb4d4728b61f4853e64cab67fd9023ac41b7e94cd419cab0f" }, "downloads": -1, "filename": "gse-0.1.9.tar.gz", "has_sig": false, "md5_digest": "ed06e6ee572343e7491c53aab2c0c7b9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 15890, "upload_time": "2014-05-07T08:36:28", "url": "https://files.pythonhosted.org/packages/df/0d/fca5ce1dad47f1610adc2438a5638cbe5243801976349ea6068aee63f468/gse-0.1.9.tar.gz" } ] }