{ "info": { "author": "Nervana Systems", "author_email": "info@nervanasys.com", "bugtrack_url": null, "classifiers": [], "description": "NervanaGPU library\r\n==================\r\n\r\nIntroduction\r\n------------\r\n\r\n**nervanagpu** is a Python module for deep learning. It includes,\r\n\r\n- matrix-multiply (GEMM), convolution, and pooling kernels optimized\r\n using a custom\r\n `assembler `__,\r\n- element-wise and broadcast operations that automatically compound\r\n into efficient kernels,\r\n- a simple but powerful array class that leverages, and partially\r\n borrows code from, pycuda\\ `1 <#refs>`__\\ ,\r\n- layer classes for building networks for benchmarking,\r\n- full assembler source to encourage contributions from the community.\r\n\r\nDesign goals\r\n^^^^^^^^^^^^\r\n\r\n**nervanagpu** grew out of a tool Nervana uses for internal hardware\r\nefforts. It has been repackaged for use by the community. The goals for\r\n**nervanagpu** are to provide,\r\n\r\n- near **theoretical peak performance**,\r\n- numpy functionality for **ease-of-use**,\r\n- convolution kernel features and arguments identical to\r\n cuDNN\\ `2 <#refs>`__\\ ,\r\n- integration into `neon `__,\r\n Nervana's full-featured deep learning library,\r\n- a tool for algorithmic explorations using alternative numerical\r\n formats,\r\n- a seamless transition path to Nervana hardware,\r\n- ease of integration into other deep learning frameworks.\r\n\r\nOnly NVIDIA Maxwell and future architectures are supported. Older\r\narchitectures are not well suited to the assembler-level optimizations\r\nused here.\r\n\r\nNumerical formats\r\n^^^^^^^^^^^^^^^^^\r\n\r\nSupported numerical formats currently include,\r\n\r\n- **fp32**: standard 32-bit floating point,\r\n- **fp16**: 16-bit floating point memory format with underlying\r\n operations in 32 bits,\r\n- **int8** and **uint8**: in element-wise operations and as input to the\r\n first convolutional layer,\r\n\r\nwith more to come (e.g. 
like\r\n`this `__).\r\n\r\nExtra features\r\n^^^^^^^^^^^^^^\r\n\r\nOur kernels have some additional useful features:\r\n\r\n- 3D convolutions and 4D pooling (including output feature map dim),\r\n- optional ReLU built into GEMM and convolution operations,\r\n- stochastic rounding support for **fp16**\\ \\ `3 <#refs>`__\\ ,\r\n- instrumented to return statistics useful for avoiding numerical\r\n issues (coming soon),\r\n- support for matrix sizes common in deep learning, significantly\r\n outperforming cuBLAS.\r\n\r\nSmall optimizations like these can result in significant performance\r\nimprovements.\r\n\r\nUsage\r\n-----\r\n\r\n**nervanagpu** includes a factory class ``NervanaGPU`` and a numpy-like\r\narray class ``GPUTensor``. Memory layout for tensors and GEMM ops is\r\n**row-ordered**. Below are examples of how they are used.\r\n\r\nMatrix multiplication example\r\n^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\r\n\r\nHere is a full example of a basic GEMM operation using 16-bit floats:\r\n\r\n.. code:: python\r\n\r\n import numpy as np\r\n import pycuda.autoinit\r\n from nervanagpu import NervanaGPU\r\n\r\n # initialize factory class\r\n ng = NervanaGPU(stochastic_round=False)\r\n\r\n m, n, k = 10, 20, 10\r\n dtype = np.float16\r\n\r\n # create matrices on host\r\n cpuA = np.random.randn(k,m)\r\n cpuB = np.random.randn(k,n)\r\n\r\n # transfer to device\r\n devA = ng.array(cpuA, dtype=dtype)\r\n devB = ng.array(cpuB, dtype=dtype)\r\n devC = ng.empty((m,n), dtype=dtype)\r\n\r\n # do GEMM operation\r\n ng.dot(devA.T, devB, devC, relu=False)\r\n\r\n # get from device\r\n cpuC = devC.get()\r\n\r\nElement-wise operations\r\n~~~~~~~~~~~~~~~~~~~~~~~\r\n\r\n**nervanagpu** compiles tensor arithmetic expressions into efficient\r\nCUDA kernels that are lazily evaluated upon assignment. For example,\r\ncomputing variance along an axis consists of a set of element-wise,\r\nreduction and broadcast operations that compile to a single kernel:\r\n\r\n.. 
code:: python\r\n\r\n # import and initialize NervanaGPU, transfer matrix from cpu to dev as above\r\n\r\n devC[:] = ng.mean(ng.square(devA - ng.mean(devA, axis=1)), axis=1)\r\n\r\nBatch normalization can be done by computing mean and variance across\r\nthe batch (n) dimension and automatically taking advantage of\r\nbroadcasting to subtract and divide the original data.\r\n\r\n.. code:: python\r\n\r\n # import and initialize NervanaGPU as above\r\n\r\n eps = .001\r\n A = ng.empty((128, 32), dtype=np.float16)\r\n A[:] = ng.rand() # generate uniform random on device between 0 and 1\r\n\r\n # normalize batch data by batch mean and variance\r\n A[:] = ng.reciprocal(ng.sqrt(ng.var(A, axis=1) + eps)) * (A - ng.mean(A, axis=1))\r\n\r\nThe last expression above is automatically collapsed into a single GPU\r\nkernel. The two ``mean(A, axis=1)`` expressions are automatically\r\nsimplified into one. To be able to broadcast a reduction to a subsequent\r\noperation, the reduction op must appear prior to the broadcast op in the\r\n`*postfix* `__\r\nversion of the expression. This is why the example multiplies by the\r\nreciprocal instead of dividing.\r\n\r\nBuilding\r\n--------\r\n\r\n**nervanagpu** comes with full assembler code for kernels. To build the\r\nkernels, install\r\n`**maxas** `__, Nervana's\r\nassembler for NVIDIA Maxwell. The module can then be built by running:\r\n\r\n::\r\n\r\n make kernels # build the kernels\r\n make python # build python bindings\r\n make test # run nosetests\r\n make doc # build sphinx docs\r\n\r\nA simple ``make`` will perform the first two steps.\r\n\r\nDocumentation and tests are currently sparse. Please contribute.\r\n\r\nPerformance\r\n-----------\r\n\r\n**nervanagpu** comes with a set of benchmark scripts under\r\n``nervanagpu/benchmarks``. 
Also included are scripts to validate kernel\r\nresults against cuBLAS and cuDNN.\r\n\r\nHere is a sample run of ``benchmarks/convnet-benchmarks.py`` using the\r\nnetworks listed on Soumith Chintala's `benchmarking\r\npage `__. Run on a single\r\nTitanX with default clocks and power limit:\r\n\r\n::\r\n\r\n ---------------------------------------------\r\n Alexnet (dtype=float16, N=128) Results:\r\n ---------------------------------------------\r\n Avg(10) fprop: 29.498 msecs 6043.698 gflops\r\n Avg(10) bprop: 66.562 msecs 5356.689 gflops\r\n Avg(10) total: 96.059 msecs 5567.654 gflops\r\n ---------------------------------------------\r\n Alexnet (dtype=float32, N=128) Results:\r\n ---------------------------------------------\r\n Avg(10) fprop: 31.251 msecs 5704.698 gflops\r\n Avg(10) bprop: 77.567 msecs 4596.660 gflops\r\n Avg(10) total: 108.818 msecs 4914.869 gflops\r\n\r\n ---------------------------------------------\r\n Overfeat (dtype=float16, N=128) Results:\r\n ---------------------------------------------\r\n Avg(10) fprop: 116.723 msecs 6134.994 gflops\r\n Avg(10) bprop: 242.084 msecs 5916.054 gflops\r\n Avg(10) total: 358.807 msecs 5987.277 gflops\r\n ---------------------------------------------\r\n Overfeat (dtype=float32, N=128) Results:\r\n ---------------------------------------------\r\n Avg(10) fprop: 124.569 msecs 5748.559 gflops\r\n Avg(10) bprop: 278.408 msecs 5144.192 gflops\r\n Avg(10) total: 402.977 msecs 5331.015 gflops\r\n\r\n ---------------------------------------------\r\n VGG (dtype=float16, N=64) Results:\r\n ---------------------------------------------\r\n Avg(10) fprop: 162.186 msecs 5978.348 gflops\r\n Avg(10) bprop: 357.850 msecs 5419.051 gflops\r\n Avg(10) total: 520.036 msecs 5593.481 gflops\r\n ---------------------------------------------\r\n VGG (dtype=float32, N=64) Results:\r\n ---------------------------------------------\r\n Avg(10) fprop: 170.822 msecs 5676.112 gflops\r\n Avg(10) bprop: 438.031 msecs 4427.108 
gflops\r\n Avg(10) total: 608.853 msecs 4777.533 gflops\r\n\r\nAcknowledgements\r\n^^^^^^^^^^^^^^^^\r\n\r\nThanks to Erich Elsen and Bryan Catanzaro of Baidu, Matthieu Courbariaux\r\nand Fr\u00e9d\u00e9ric Bastien of the Bengio lab, Vincent Vanhoucke of Google, and\r\nSoumith Chintala of Facebook for feedback on early versions of this\r\nlibrary. We'd also like to thank NVIDIA for generously providing us with\r\nseveral TitanXs for benchmarking.\r\n\r\nReferences \r\n^^^^^^^^^^^\r\n\r\n1. Andreas Kl\u00f6ckner, Nicolas Pinto, Yunsup Lee, Bryan Catanzaro, Paul\r\n Ivanov, Ahmed Fasih. `*PyCUDA and PyOpenCL: A scripting-based\r\n approach to GPU run-time code\r\n generation* `__ Parallel Computing,\r\n Volume 38, Issue 3, March 2012, Pages 157-174.\r\n\r\n2. Chetlur, Sharan, Cliff Woolley, Philippe Vandermersch, Jonathan\r\n Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. `*cuDNN:\r\n Efficient primitives for deep\r\n learning.* `__ arXiv preprint\r\n arXiv:1410.0759 (2014).\r\n\r\n3. Gupta, Suyog, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish\r\n Narayanan. 
`*Deep Learning with Limited Numerical\r\n Precision.* `__ arXiv preprint\r\n arXiv:1502.02551 (2015).", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/nervanasys/nervanagpu", "keywords": "", "license": "(see LICENSE document)", "maintainer": "", "maintainer_email": "", "name": "nervanagpu", "package_url": "https://pypi.org/project/nervanagpu/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/nervanagpu/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/nervanasys/nervanagpu" }, "release_url": "https://pypi.org/project/nervanagpu/0.3.1/", "requires_dist": null, "requires_python": null, "summary": "Python bindings for Nervana GPU kernels", "version": "0.3.1" }, "last_serial": 1534832, "releases": { "0.3.1": [ { "comment_text": "", "digests": { "md5": "769ab14b1a2f2c94d95f68b1770e48a9", "sha256": "7329813f1bf44a00a4b028dfb345386224ea5ba48b057e4a615b501db25a155b" }, "downloads": -1, "filename": "nervanagpu-0.3.1.tar.gz", "has_sig": false, "md5_digest": "769ab14b1a2f2c94d95f68b1770e48a9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 103690, "upload_time": "2015-05-04T22:59:56", "url": "https://files.pythonhosted.org/packages/73/1b/6d170c5c42733cc94db34e0e4bae6c51cf4f49510b512eccaa4d0ad53f85/nervanagpu-0.3.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "769ab14b1a2f2c94d95f68b1770e48a9", "sha256": "7329813f1bf44a00a4b028dfb345386224ea5ba48b057e4a615b501db25a155b" }, "downloads": -1, "filename": "nervanagpu-0.3.1.tar.gz", "has_sig": false, "md5_digest": "769ab14b1a2f2c94d95f68b1770e48a9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 103690, "upload_time": "2015-05-04T22:59:56", "url": 
"https://files.pythonhosted.org/packages/73/1b/6d170c5c42733cc94db34e0e4bae6c51cf4f49510b512eccaa4d0ad53f85/nervanagpu-0.3.1.tar.gz" } ] }