{ "info": { "author": "Paul Butler", "author_email": "paulgb@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "Sample\n======\n\n``sample`` is a command-line tool for sampling data from a large,\nnewline-separated dataset (typically a CSV-like file).\n\nInstallation\n------------\n\n``sample`` is distributed with ``pip``. Once you've installed ``pip``,\nsimply run::\n\n > pip install sample-cli\n\nand sample will be installed into your Python environment.\n\nUsage\n-----\n\n``sample`` requires one argument, the input file. If the input file\nis ``-``, data will be read from standard input (in this case, only\nthe reservoir and approximate algorithms can be used).\n\nSimple Example\n**************\n\nTo take a sample of size 1000 from the file ``big_data.csv``,\nrun ``sample`` as follows::\n\n > sample -n 1000 big_data.csv\n\nThis will print 1000 random lines from the file to the terminal.\n\nFile Redirection\n****************\n\nUsually we want to save the sample to another file instead.\n``sample`` doesn't have file output built-in; instead it relies\non the output redirection features of your terminal. To save\nto ``big_data_sample.csv``, run the following command::\n\n > sample -n 1000 big_data.csv > big_data_sample.csv\n\nHeader Rows\n***********\n\nCSV files often have a header row with the column names. You can pass\nthe ``-r`` flag to ``sample`` to preserve the header row::\n\n > sample -n 1000 big_data.csv -r > big_data_sample.csv\n\nRarely, you may need to sample from a file with a header spanning\nmultiple rows. The ``-r`` argument takes an optional number of\nrows to preserve as a header::\n\n > sample -n 1000 -r 3 data_with_header.csv > sample_with_header.csv\n\nNote that if the ``-r`` argument is directly before the input filename,\nit must have an argument or else it will try to interpret the input\nfilename as the number of header rows and fail. Putting the ``-r`` argument\nafter the input filename will avoid this.\n\nRandom Seed\n***********\n\nThe output of ``sample`` is random and depend on the computer's random\nstate. Sometimes you may want to take a sample in a way that can be\nreproduced. You can pass a random seed to ``sample`` with the ``-s`` flag\nto accomplish this::\n\n > sample -s 45906345 data_file.csv > reproducable_sample.csv\n\nSampling Algorithms\n-------------------\n\nAlgorithm Comparison\n********************\n\n``sample`` implements three sampling algorithms, each with their own strengths\nand weaknesses.\n\n+------------------------+----------------+----------------+------------+\n| | Reservoir | Approximate | Two-pass |\n+========================+================+================+============+\n| Flag | ``--res`` | ``--app`` | ``--tp`` |\n+------------------------+----------------+----------------+------------+\n| ``stdin``-compatible | yes | yes | no |\n+------------------------+----------------+----------------+------------+\n| space complexity | ``O(ss * rs)`` | ``O(1)`` | ``O(ss)`` |\n+------------------------+----------------+----------------+------------+\n| fixed sample size | compatible | not compatible | compatible |\n+------------------------+----------------+----------------+------------+\n| fractional sample size | not compatible | compatible | compatible |\n+------------------------+----------------+----------------+------------+\n\nFor space complexity, `ss` is the number of records in the sample and `rs` is the maximum size of a record.\n\nReservoir Sampling\n******************\n\nReservoir sampling (`Random Sampling with a Reservoir (Vitter 85) `__)\nis a method of sampling from a stream of unknown size where the sample size is\nfixed in advance. It is a one-pass algorithm and uses space proportional to the\namount of data in the sample.\n\nReservoir sampling is the default algorithm used by ``sample``. For consistency,\nit can also be invoked with the argument ``--reservoir``.\n\nIf reservoir sampling, the sample size must be fixed rather than fractional.\n\nExample::\n\n > sample --reservoir -n 1000 big_data.csv > sample_data.csv\n\nApproximate Sampling\n********************\n\nApproximate sampling simply includes each row in the sample with a probability\ngiven as the sample proportion. It is a stateless algorithm with minimal space\nrequirements. Samples will have on average a size of ``fraction * population_size``,\nbut it will vary between each invocation. Because of this, approximate sampling\nis only useful when the sample size does not have to be exact (hence the name).\n\nExample::\n\n > sample --approximate -f 0.15 my_data.csv > my_sample.csv\n\nEquivalently, supply a percentage instead of a fraction by switching the\n``-f`` to a ``-p``::\n\n > sample --approximate -p 15 my_data.csv > my_sample.csv\n\nTwo-Pass Sampling\n*****************\n\nTwo-pass sampling is allowed two passes, first to count the number of records\n(ie. the population size) and second to emit the records which are part of the\nsample. Because of this it is not compatible with ``stdin`` as an input.\n\nAs two-pass sampling knows the population size, it will accept the sample size\nas either a fraction or a fixed number of elements.\n\nExample::\n\n > sample --two-pass -p 15 my_data.csv > my_sample.csv\n\nTwo-pass sampling uses memory proportional to the number of elements in the sample.\n", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "UNKNOWN", "keywords": null, "license": "UNKNOWN", "maintainer": null, "maintainer_email": null, "name": "sample-cli", "package_url": "https://pypi.org/project/sample-cli/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/sample-cli/", "project_urls": { "Download": "UNKNOWN", "Homepage": "UNKNOWN" }, "release_url": "https://pypi.org/project/sample-cli/0.0.1/", "requires_dist": null, "requires_python": null, "summary": "Command-line interface for sampling lines from text files", "version": "0.0.1" }, "last_serial": 846634, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "302f3d714cb489b980c1dfbf0c82faa1", "sha256": "5373b144710c6d797034a6a62cda8e111d3e07fb11299262dcc53b55fb43df80" }, "downloads": -1, "filename": "sample-cli-0.0.1.tar.gz", "has_sig": false, "md5_digest": "302f3d714cb489b980c1dfbf0c82faa1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5080, "upload_time": "2013-08-23T04:42:44", "url": "https://files.pythonhosted.org/packages/93/54/cbf3e75db1fd1e68b39ce9d5357a08dd1e9a1f518186324feaf76de52494/sample-cli-0.0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "302f3d714cb489b980c1dfbf0c82faa1", "sha256": "5373b144710c6d797034a6a62cda8e111d3e07fb11299262dcc53b55fb43df80" }, "downloads": -1, "filename": "sample-cli-0.0.1.tar.gz", "has_sig": false, "md5_digest": "302f3d714cb489b980c1dfbf0c82faa1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 5080, "upload_time": "2013-08-23T04:42:44", "url": "https://files.pythonhosted.org/packages/93/54/cbf3e75db1fd1e68b39ce9d5357a08dd1e9a1f518186324feaf76de52494/sample-cli-0.0.1.tar.gz" } ] }