{ "info": { "author": "Robin Kahlow (Toraxxx)", "author_email": "xtremegosugaming@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3 :: Only" ], "description": "pytocl\r\n======\r\n\r\nA python library to seamlessly convert python functions to functions making use of OpenCL\r\n\r\nSetup\r\n=====\r\n\r\n``python setup.py install``\r\n\r\nDependencies\r\n============\r\n\r\n- Python 3\r\n- Python libraries\r\n- numpy\r\n- pyopencl\r\n- OpenCL compatible hardware (version 1 is enough)\r\n- An OpenCL runtime (eg. AMDAPPSDK for AMD CPUs/GPUs)\r\n\r\nUsage\r\n=====\r\n\r\nThe complete example can be found in examples/usage.py. For more examples check tests/cltest.py and examples/benchmarks.py.\r\n\r\n1. Creating a function to be converted\r\n--------------------------------------\r\n\r\nCreate a function (or convert an existing one) to be parallelized. In this example we calculate ``output = a + b`` for vectors, so we will be using one dimension. You can get the current index / global work id in the first dimension with get\\_global\\_id(0).\r\n\r\n.. code:: python\r\n\r\n from pytocl import *\r\n\r\n def parallel_add(a, b, output):\r\n i = get_global_id(0)\r\n output[i] = a[i] + b[i]\r\n\r\n2. Converting the function\r\n--------------------------\r\n\r\nFirst we need to create the information for each argument of our function excluding the dimension parameters. For this ``CLArgDesc`` is used which holds information about the type of the argument as well as the ``array_size`` (0 for scalars) and whether the argument is used as an output (but not necessarily copied back to the host).\r\n\r\n.. code:: python\r\n\r\n # Our vectors will have 16 elements\r\n array_size = 16\r\n\r\n dim_shape = (array_size,)\r\n\r\n # Create the descriptors for the arguments of the function, excluding the dimension\r\n arg_desc_a = CLArgDesc(CLArgType.float32_array, array_size=array_size) # a\r\n arg_desc_b = CLArgDesc(CLArgType.float32_array, array_size=array_size) # b\r\n arg_desc_output = CLArgDesc(CLArgType.float32_array, array_size=array_size) # output\r\n\r\nNext we need to create the descriptor for the function itself which has the gloabl id / dimension shape information and more information about its arguments (ie. whether they're copied from host to device before execution and whether theyre copied from device to host after execution)\r\n\r\n.. code:: python\r\n\r\n \"\"\"\r\n Create the function descriptor with the global id / dimension shape information.\r\n\r\n Arguments can be added by chaining .arg() calls (the argument order has to match the original\r\n function's argument order (ie. arg_desc_a -> a, arg_desc_b -> b, arg_desc_output -> output).\r\n is_readonly has to be set to False for arguments that are assigned to in the function.\r\n\r\n copy_in() or copy_out() can be called to copy the last added argument from host to \r\n device before execution or from device to host after execution.\r\n \"\"\"\r\n\r\n func_desc = (CLFuncDesc(parallel_add, global_size)\r\n .arg(arg_desc_a).copy_in()\r\n .arg(arg_desc_b).copy_in()\r\n .arg(arg_desc_output, is_readonly=False).copy_out())\r\n\r\nNow we can compile the function which gives us a normal python function we can call\r\n\r\n.. code:: python\r\n\r\n # Compile the actual function, you can also pass an argument to compile with a CL context.\r\n # By default it uses cl.create_some_context()\r\n parallel_add_cl = CLFunc(func_desc).compile()\r\n\r\n3. Calling the function\r\n-----------------------\r\n\r\nYou can call the function like a normal python function now, all the copying will be done for you. The passed scalars can be normal python types but all arrays have to be numpy ndarrays.\r\n\r\n.. code:: python\r\n\r\n import numpy as np\r\n\r\n [...]\r\n\r\n # Create the host buffers / vectors, the dtype needs to match the arg type of the arg desc\r\n a = np.array([ 1 ] * array_size, dtype=np.float32)\r\n b = np.array([ 2 ] * array_size, dtype=np.float32)\r\n output = np.array([ 0 ] * array_size, dtype=np.float32)\r\n\r\n # Now we can execute the compiled function, we need to provide buffers for all output copies.\r\n # For input copies we could also pass None to not copy them\r\n parallel_add_cl(a, b, output)\r\n\r\n # output should now be a + b\r\n\r\n print(\"A:\", a)\r\n print(\"B:\", b)\r\n print(\"Output:\", output)\r\n\r\nYou can also pass None for arguments only used as copy-inputs meaning no data will be copied and the current content will be retained.\r\n\r\nLimitations / Todo list\r\n=======================\r\n\r\n- Only simple python functions are supported.\r\n- All mathematical and logical expressions\r\n- All literals except for strings\r\n- If statements and IfElse constructs\r\n- While loops\r\n- For loops currently only support range()\r\n- Function calls get converted to use the same name in the kernel, but the called functions themselves aren't converted if they aren't available yet (eg. currently you can ``from math import exp`` and use exp(x))\r\n- Type inference for local variables is currently limited. Defaults to ``float``, variables named ``i``, ``j``, ``k`` or starting with ``i_`` become ``int``, variables starting with ``b_`` become ``bool``\r\n- ``return value`` is not supported, outputs have to be passed as an argument\r\n- Array slices are not supported\r\n- List comprehensions and other python-specific constructs are not supported\r\n- The source code of the function has to be available for conversion (which is often the case)\r\n- CUDA support?\r\n- Clean up the converter code\r\n\r\nBenchmarks\r\n==========\r\n\r\nin ``examples/benchmarks.py``\r\n\r\nTest hardware was an AMD Phenom II 1090T and an AMD 6970 The original functions and OpenCL with CPU are orders of magnitude slower (not shown here, you can uncomment the line in ``benchmarks.py`` though). Numpy versions are compared to the clified GPU versions.\r\n\r\nMatrix Multiply 100 times for matrices of same size\r\n---------------------------------------------------\r\n\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n| Matrix size | Runtime Numpy | Runtime OpenCL GPU | Relative Numpy | Relative OpenCL GPU |\r\n+================+=================+======================+==================+=======================+\r\n| (128, 128) | 0.02s | 0.08s | 100.00% | 372.00% |\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n| (256, 256) | 0.08s | 0.16s | 100.00% | 200.18% |\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n| (512, 512) | 0.40s | 0.60s | 100.00% | 148.81% |\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n| (1024, 1024) | 2.64s | 3.64s | 100.00% | 137.71% |\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n| (2048, 2048) | 17.76s | 28.07s | 100.00% | 158.03% |\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n\r\nNeural Network sigmoid layer forward pass 100 times for input and weight matrices of same size\r\n----------------------------------------------------------------------------------------------\r\n\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n| Matrix size | Runtime Numpy | Runtime OpenCL GPU | Relative Numpy | Relative OpenCL GPU |\r\n+================+=================+======================+==================+=======================+\r\n| (128, 128) | 0.05s | 0.05s | 100.00% | 114.13% |\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n| (256, 256) | 0.19s | 0.13s | 100.00% | 70.13% |\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n| (512, 512) | 2.62s | 0.54s | 100.00% | 20.62% |\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n| (1024, 1024) | 11.52s | 3.65s | 100.00% | 31.66% |\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n| (2048, 2048) | 60.01s | 27.40s | 100.00% | 45.66% |\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n\r\n2 layer MLP forward pass with 128 batch size and input vectors / weight matrices of same size 100 times\r\n-------------------------------------------------------------------------------------------------------\r\n\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n| Matrix size | Runtime Numpy | Runtime OpenCL GPU | Relative Numpy | Relative OpenCL GPU |\r\n+================+=================+======================+==================+=======================+\r\n| (128, 128) | 0.09s | 0.07s | 100.00% | 77.65% |\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n| (256, 256) | 0.43s | 0.11s | 100.00% | 26.72% |\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n| (512, 512) | 1.35s | 0.32s | 100.00% | 23.80% |\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n| (1024, 1024) | 3.08s | 1.08s | 100.00% | 35.09% |\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n| (2048, 2048) | 7.71s | 3.56s | 100.00% | 46.17% |\r\n+----------------+-----------------+----------------------+------------------+-----------------------+\r\n\r\nContributors\r\n============\r\n\r\n- Toraxxx (Developer)", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/ToraxXx/pytocl", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "pytocl", "package_url": "https://pypi.org/project/pytocl/", "platform": "any", "project_url": "https://pypi.org/project/pytocl/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/ToraxXx/pytocl" }, "release_url": "https://pypi.org/project/pytocl/0.1.1/", "requires_dist": null, "requires_python": null, "summary": "Converts Python functions to OpenCL functions", "version": "0.1.1" }, "last_serial": 2223165, "releases": { "0.1.1": [ { "comment_text": "", "digests": { "md5": "0eeb007d85449a7825895db847355265", "sha256": "6bf74e75f86014a13b9b2302501fdc1a783f8fdc572fa75777cbe37e32ec9cfe" }, "downloads": -1, "filename": "pytocl-0.1.1.zip", "has_sig": false, "md5_digest": "0eeb007d85449a7825895db847355265", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8549, "upload_time": "2016-07-15T13:31:04", "url": "https://files.pythonhosted.org/packages/ae/a8/f5a0630b490799f3792372d76f84c45dc09b28bc1204cac27eb458708ba1/pytocl-0.1.1.zip" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "0eeb007d85449a7825895db847355265", "sha256": "6bf74e75f86014a13b9b2302501fdc1a783f8fdc572fa75777cbe37e32ec9cfe" }, "downloads": -1, "filename": "pytocl-0.1.1.zip", "has_sig": false, "md5_digest": "0eeb007d85449a7825895db847355265", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 8549, "upload_time": "2016-07-15T13:31:04", "url": "https://files.pythonhosted.org/packages/ae/a8/f5a0630b490799f3792372d76f84c45dc09b28bc1204cac27eb458708ba1/pytocl-0.1.1.zip" } ] }