{ "info": { "author": "Alan Kang", "author_email": "jania902@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Environment :: Console", "Intended Audience :: Developers", "Intended Audience :: Information Technology", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.6", "Topic :: Text Editors :: Text Processing" ], "description": "## csvsample\n\n``csvsample`` extracts some rows from CSV file to create randomly sampled CSV.\n\n### Features\n\n* The size of original file does not need to be specified beforehand.\n It means that the sampling process can be applied to data stream with\n unknown size such as system logs, no matter how large the amount of data\n is.\n* All methods accepts optional ``seed`` value. The same data set with the\n same sampling rate always yields exactly the same result, which is good\n for reproducibility.\n\n\n### Install\n\nYou can install ``csvsample`` via ``pip``:\n\n pip install csvsample\n\n\n### API\n\n``csvsample.sample()`` is the main API:\n\n csvsample.sample(lines, sampling_method, **kwargs)\n\n``lines`` can be any ``iterable`` containing valid CSV rows including header\nrow.\n\n``sampling_method`` should be one of followings:\n\n* ``random``\n* ``hash``\n* ``reservoir``\n\n``random`` sampling method performs random sampling using pseudo random number\ngenerator:\n\n import csvsample\n\n with open('input.csv', 'r') as i:\n with open('output.csv', 'w') as o:\n o.writelines(csvsample.sample(i, 'random', sample_rate=0.1))\n\n``hash`` sampling method performs hash-based sampling using extremely-fast hash\nfunction.\n\nLet's say that instead of saving all users' log, you want to randomly select\n10% of users and only save logs of those selected users. Simple random sampling\nwon't work. You can use hash-based sampling. \"Consistent\" nature of the\nalgorithm guarantees that any user ID selected once will always be selected\nagain:\n\n sampled = csvsample.sample(lines, 'hash', sample_size=0.1, col='user_id')\n\n``reservoir`` sampling method performs reservoir sampling. Let's say that you\nhave an URL of 100GB csv file. Since you don't have enough disk space, you just\nwant to save small portion of sample which is representative and unbiased.\n\n[reservoir sampling](https://en.wikipedia.org/wiki/Reservoir_sampling) method\nallows you to acquire random sample without saving entire data first:\n\n sampled = csvsample.sample(lines, 'reservoir', sample_size=1000)\n\nNow ``sampled`` variable contains exactly 1,000 randomly selected lines.\n\n### Helpers\n\nThere are some convenience helpers:\n\n* ``csvsample.sample_url(url, sampling_method, **kwargs)`` read CSV from given\n ``url``. You can specify character set encoding via ``encoding`` keyword\n argument. (default: ``utf-8``)\n\n``csvsample.sample()`` and other helpers return a generator containing sampled\nCSV rows including header row. The generator contains special function\n``to_buf()`` which converts itself into ``io.StringIO`` instance so that you can\npass the sampled CSV to other libraries such as Pandas:\n\n import csvsample\n import pandas as pd\n\n sampled = csvsample.sample_url(url, 'random', sample_rate=0.1)\n df = pd.read_csv(sampled.to_buf())\n\n\n### Command-line interface\n\n``csvsample`` also provides command-line interface.\n\nFollowing URL contains a CSV file from [DataHub](https://datahub.io/):\n\n > curl -sL https://bit.ly/2ItnHvK | head\n region,year,population\n WORLD,1950,2536274.721\n WORLD,1951,2583816.786\n WORLD,1952,2630584.384\n WORLD,1953,2677230.358\n\nA number of rows including header is 18019:\n\n > curl -sL https://bit.ly/2ItnHvK | wc -l\n 18019\n\nLet's make 10% of random sample:\n\n > curl -sL https://bit.ly/2ItnHvK | csvsample random 0.1 > sample.csv\n\n > wc -l sample.csv\n 1777 sample.csv\n\n > head -n 5 sample.csv\n region,year,population\n WORLD,1952,2630584.384\n WORLD,1972,3851545.181\n WORLD,1977,4229201.257\n WORLD,1988,5148556.956\n\nYou may use reservoir sampling method to obtain exact number of rows:\n\n > curl -sL https://bit.ly/2ItnHvK | csvsample reservoir 100 > sample.csv\n\n > wc -l sample.csv\n 100 sample.csv\n\n\n", "description_content_type": "", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/akngs/csvsample", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "csvsample", "package_url": "https://pypi.org/project/csvsample/", "platform": "", "project_url": "https://pypi.org/project/csvsample/", "project_urls": { "Homepage": "https://github.com/akngs/csvsample" }, "release_url": "https://pypi.org/project/csvsample/0.1.4/", "requires_dist": [ "fire", "xxhash", "check-manifest; extra == 'dev'", "pytest; extra == 'test'" ], "requires_python": "", "summary": "Create random samples from CSV file", "version": "0.1.4" }, "last_serial": 3879188, "releases": { "0.1.4": [ { "comment_text": "", "digests": { "md5": "7a629be1f9e53b51c8e1b104804d91f4", "sha256": "d1b878ff44164da5737996efa4272b2dd88f2c7b4fb51d887c350af53a12485b" }, "downloads": -1, "filename": "csvsample-0.1.4-py3-none-any.whl", "has_sig": false, "md5_digest": "7a629be1f9e53b51c8e1b104804d91f4", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5661, "upload_time": "2018-05-19T10:31:55", "url": "https://files.pythonhosted.org/packages/1d/41/6b1b6ed3e25eaa6ab7d3ba00e828718bbab090041b3af0cb2236fc0922a2/csvsample-0.1.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e4f21712e5e565ba263915e6c03c1a94", "sha256": "8ff739851b822a2842eab162fe574c4dedb1c12bf8f45312490c4e5f1b7a586f" }, "downloads": -1, "filename": "csvsample-0.1.4.tar.gz", "has_sig": false, "md5_digest": "e4f21712e5e565ba263915e6c03c1a94", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5408, "upload_time": "2018-05-19T10:31:57", "url": "https://files.pythonhosted.org/packages/c7/57/74f945a23c869e72171b5941e736e8307fa561fffc009ae78656c4514c9b/csvsample-0.1.4.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "7a629be1f9e53b51c8e1b104804d91f4", "sha256": "d1b878ff44164da5737996efa4272b2dd88f2c7b4fb51d887c350af53a12485b" }, "downloads": -1, "filename": "csvsample-0.1.4-py3-none-any.whl", "has_sig": false, "md5_digest": "7a629be1f9e53b51c8e1b104804d91f4", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 5661, "upload_time": "2018-05-19T10:31:55", "url": "https://files.pythonhosted.org/packages/1d/41/6b1b6ed3e25eaa6ab7d3ba00e828718bbab090041b3af0cb2236fc0922a2/csvsample-0.1.4-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "e4f21712e5e565ba263915e6c03c1a94", "sha256": "8ff739851b822a2842eab162fe574c4dedb1c12bf8f45312490c4e5f1b7a586f" }, "downloads": -1, "filename": "csvsample-0.1.4.tar.gz", "has_sig": false, "md5_digest": "e4f21712e5e565ba263915e6c03c1a94", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5408, "upload_time": "2018-05-19T10:31:57", "url": "https://files.pythonhosted.org/packages/c7/57/74f945a23c869e72171b5941e736e8307fa561fffc009ae78656c4514c9b/csvsample-0.1.4.tar.gz" } ] }