{ "info": { "author": "Luke Hodkinson", "author_email": "furious.luke@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: BSD License", "Natural Language :: English", "Operating System :: OS Independent", "Programming Language :: Python" ], "description": "# lizards-are-awesome\n\nA Docker based workflow for performing a Plink/fastStructure analysis\non DArTseq SNP data, inferred from an Excel file.\n\n\n## Overview\n\nThis software seeks to reduce the manual labour involved in preparing DArTseq SNP\ndata in 1 row format for analysis with Plink and fastStructure. LAA is designed specifically for\nSNP data sets generated by DArTseq in 1 row format. As such, input\ndata use the following encoding provided by DArTseq: \"0\" =\nreference allele homozygote, \"1\" = SNP allele homozygote, \"2\" =\nheterozygote, and \"-\" = double null/null allele homozygote (absence\nof fragment with SNP in genomic representation). LAA first converts\nthese data into ped and map files for Plink analysis.\n\nApart from the external packages mentioned,\nmost of the work is done by a Python script. The primary operations\nperformed by the script are:\n\n 1. Duplicating the input data.\n 2. Performing a substitution on certain characters in both\n sets of data, in order to create Plink compatible characters (e.g. \"-\" to \"0\").\n 3. Independently indexing both sets of data.\n 4. Combining both sets of data.\n 5. Sorting on the combined index.\n 6. Transposing the combined data.\n 7. 
Outputting to Plink compatible `ped` and `map` formats.\n\nWhereas before these steps would have been carried out manually using various software\npackages, they are now performed automatically.\n\nIn addition to the conversion operation, LAA automatically runs\nPlink on the generated ped and map files, and the\nresulting bed, bim and fam files are then passed on to and analysed\nwith fastStructure. The user can choose the maximum value of K (the number of\npopulations) to be analysed by fastStructure. Output files include\nthe meanQ value for each individual, giving the mean probability\nof belonging to each of the populations K1 to Kx.\n\n\n## Design Decisions\n\n### Why Docker?\n\nPlink is written for Linux based operating systems. As such, on a Linux system\nall operations could be performed directly, without the need for any kind of\nvirtualisation layer. However, in order to support researchers using Windows based\noperating systems, the decision was made to leverage Docker virtualisation.\n\nDocker provides a light-weight virtualisation layer enabling Linux software to\nrun on Windows with (relative) ease. It also has the added benefit of providing\na cloud based mechanism for disseminating software \"images\" to users. The advantages\nof Docker over other systems, like VirtualBox or VMWare, are:\n\n * cloud based distribution of prebuilt images,\n * future releases will allow native Docker containers, and\n * easy to replicate virtual image creation.\n\n### Why Python?\n\nPython is a powerful and expressive scripting language. 
It comes with many\ndiverse packages, and has excellent support from developers (for example,\nfastStructure is written in Python).\n\n\n## Dependencies\n\nWhen installing on any platform there are a number of requisite dependencies:\n\n * Python\n * Docker\n\nIf you happen to be installing on Windows, then there are a couple of extra requirements:\n\n * Visual Studio Python compiler\n * MsysGit\n\n\n## Important\n\nWe've found that Docker has issues when running on Windows, resulting in faulty data\ntransformation. While you may be able to install LAA on a Windows system, the accuracy of\nresults is likely to be compromised.\n\nTo install on Windows, we recommend using a virtual machine running an Ubuntu\ninstallation, e.g. VMWare. All steps detailed below under Installation will have to be\nperformed through the virtual machine, including installing Docker.\n\n\n## Installation\n\nBegin by installing all of the dependencies for your operating system as\nlisted above.\n\nOnce complete, open a system terminal (please see the subsection on system terminals\nbelow, under `Usage`).\n\nFrom an open system terminal, install the LAA Python interface with:\n\n```bash\npip install lizards-are-awesome\n```\n\nNext, from a system terminal, download and prepare the `laa` docker image. This\nimage contains `plink`, `fastStructure`, and the conversion scripts, all built\ninto a light-weight Alpine Linux image:\n\n```bash\nlaa init\n```\n\n## Usage\n\n### Terminals\n\nLAA is currently used directly from your operating system terminal. In Linux-like\noperating systems (including Mac OS X) use the system terminal emulator. In\nWindows operating systems use the Docker quick start terminal.\n\n### Input Format\n\nLAA accepts XLSX Excel formats and CSV. Unfortunately, XLSX is extremely slow\nto parse using open source utilities. 
As such we recommend converting your Excel\ndata to CSV before use with LAA (simply open the file and save it as CSV using\nMicrosoft Office or open source spreadsheet tools, like LibreOffice).\n\nThe data sheet should contain only columns with DArTseq SNP data\n(i.e. 0, 1, 2 and -); all other columns have to be removed.\nThe first row should contain the name of the population each\nindividual belongs to (e.g. species), and the second row should contain\nthe ID of each individual. All following rows contain the SNP data.\n\nA short, fictitious example:\n\n
| Pminima | Pminima | Pminor | Pminima | Pminor | Pminima |\n
| --- | --- | --- | --- | --- | --- |\n
| lizard1 | lizard2 | lizard15 | lizard39 | lizard40 | lizard44 |\n
| 0 | 1 | 1 | 2 | 1 | 1 |\n
| 0 | 0 | 0 | 1 | 0 | 0 |\n
| 1 | - | 1 | 0 | 1 | 1 |\n
| 0 | 0 | 1 | 0 | - | 0 |\n
| 2 | 2 | 1 | 1 | 1 | 2 |\n
| 2 | 2 | 1 | 2 | 1 | 0 |\n
| 1 | 1 | 2 | 1 | 2 | 1 |\n
| 1 | 1 | 1 | 2 | 0 | 1 |\n
| 0 | 0 | 0 | 0 | 0 | 0 |\n
| - | 1 | 2 | 1 | 1 | 1 |\n
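
To make the sheet layout above concrete, here is a minimal Python sketch that reads a CSV in that layout (population row, ID row, then one row per SNP) and builds one ped-style row per individual. It is an illustration only, not LAA's actual conversion script: the function name `sheet_to_ped_rows`, the `ALLELE_PAIRS` table and the choice of Plink's 1/2 allele coding are all assumptions made for this example, and no map file is written.

```python
import csv
import io

# Assumed mapping from a DArTseq 1 row code to a pair of Plink alleles.
# '1' and '2' stand for the reference and SNP alleles; '0' means missing.
ALLELE_PAIRS = {
    '0': ('1', '1'),  # reference allele homozygote
    '1': ('2', '2'),  # SNP allele homozygote
    '2': ('1', '2'),  # heterozygote
    '-': ('0', '0'),  # double null, written as missing
}


def sheet_to_ped_rows(csv_text):
    '''Turn CSV text in the sheet layout into ped-style rows (hypothetical helper).'''
    rows = [[cell.strip() for cell in row]
            for row in csv.reader(io.StringIO(csv_text)) if row]
    populations, ids, snps = rows[0], rows[1], rows[2:]
    ped_rows = []
    for col, (pop, ind) in enumerate(zip(populations, ids)):
        # Family ID, individual ID, then unknown father, mother, sex, phenotype.
        fields = [pop, ind, '0', '0', '0', '-9']
        for snp in snps:
            # Each SNP contributes two allele columns for this individual.
            fields.extend(ALLELE_PAIRS[snp[col]])
        ped_rows.append(fields)
    return ped_rows


# First three individuals and first three SNP rows of the fictitious example.
example = '''Pminima,Pminima,Pminor
lizard1,lizard2,lizard15
0,1,1
0,0,0
1,-,1
'''

for fields in sheet_to_ped_rows(example):
    print(' '.join(fields))
```

Each printed line holds the six leading ped columns followed by two allele columns per SNP, so the SNP rows of the sheet end up transposed into per-individual rows.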