{ "info": { "author": "Dr. Jan-Philip Gehrcke", "author_email": "jgehrcke@googlemail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: POSIX", "Programming Language :: Python", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7", "Programming Language :: Python :: Implementation :: CPython" ], "description": "# Schniepel\n\nMeasures the resource utilization of a specific process over time.\n\nThis program also measures the utilization / saturation of system-wide resources\nmaking it straightforward to put the process-specific metrics into context.\n\nBuilt for Linux. Windows and Mac OS support might come.\n\n**Highlights**:\n\n- High sampling rate: by default, Schniepel uses a sampling interval of 0.5 seconds\n for making narrow spikes visible.\n- Schniepel is built for monitoring a program subject to process ID changes. This\n is useful for longevity experiments when the monitored process occasionaly\n restarts (for instance as of fail-over scenarios).\n- Schniepel can run unsupervised and infinitely long with predictable disk space\n requirements (it applies an output file rotation and retention policy).\n- Schniepel helps keeping data organized: time series data is written into\n [HDF5](https://en.wikipedia.org/wiki/Hierarchical_Data_Format) files, and\n annotated with relevant metadata such as the program invocation time, system\n hostname, and Schniepel software version.\n- Schniepel comes with a data plotting tool (separate from the data acquisition\n program).\n- Schniepel values measurement correctness very highly. The core sampling loop does\n little work besides the measurement itself: it writes each sample to a queue.\n A separate process consumes this queue and persists the time series data to\n disk, for later inspection. This keeps the sampling rate predictable upon disk\n write latency spikes, or generally upon backpressure. This matters especially\n in cloud environments where we sometimes see fsync latencies of multiple\n seconds.\n\n\n## Motivation\n\nThis was born out of a need for solid tooling. We started with [pidstat from\nsysstat](https://github.com/sysstat/sysstat/blob/master/pidstat.c), launched as\n`pidstat -hud -p $PID 1 1`. We found that it does not properly account for\nmultiple threads running in the same process, and that various issues in that\nregard exist in this program across various versions (see\n[here](https://github.com/sysstat/sysstat/issues/73#issuecomment-349946051),\n[here](https://github.com/sysstat/sysstat/commit/52977c479), and\n[here](https://github.com/sysstat/sysstat/commit/a63e87996)).\n\nThe program [cpustat](https://github.com/uber-common/cpustat) open-sourced by\nUber has a delightful README about the general measurement methodology and\noverall seems to be a great tool. However, it seems to be optimized for\ninteractive usage (whereas we were looking for a robust measurement program\nwhich can be pointed at a process and then be left unattended for a significant\nwhile) and there does not seem to be a decent approach towards persisting the\ncollected time series data on disk for later inspection (it seems to be able to\nwrite a binary file when using `-cpuprofile` but it is a little unclear what\nthis file contains and how to analyze the data).\n\nThe program [psrecord](https://github.com/astrofrog/psrecord) (which effectively\nwraps [psutil](https://psutil.readthedocs.io/en/latest/)) has a similar\nfundamental idea as Schniepel; it however does not have a clear separation of\nconcerns between persisting the data to disk, performing the measurement itself,\nand plotting the data, making it too error-prone and not production-ready.\n\n\n## Usage\n\n### Hints and tricks\n\n#### Convert an HDF5 file to a CSV file\n\nI recommend de-serialize and re-serialize using\n[pandas](https://pandas.pydata.org/). Example one-liner:\n```\npython -c 'import sys; import pandas as pd; df = pd.read_hdf(sys.argv[1], key=\"schniepel_timeseries\"); df.to_csv(sys.argv[2], index=False)' messer_20190718_213115.hdf5.0001 /tmp/hdf5-as-csv.csv\n```\nNote that this significantly inflates the file size (e.g., from 50 MiB to 300\nMiB).\n\n\n## Notes\n\n- Schniepel tries to not asymmetrically hide measurement uncertainty. For example,\n you might see it measure a CPU utilization of a single-threaded process\n slightly larger than 100 %. That's simply the measurement error. In other\n tooling such as `sysstat` it seems to be common practice to _asymmetrically_\n hide measurement uncertainty by capping values when they are known to in\n theory not exceed a certain threshold\n ([example](https://github.com/sysstat/sysstat/commit/52977c479d3de1cb2535f896273d518326c26722)).\n\n- Must be run with `root` privileges.\n\n- The value `-1` has a special meaning for some metrics\n ([NaN](https://en.wikipedia.org/wiki/NaN), which cannot be represented\n properly in HDF5). Example: A disk write latency of `-1 ms` means that no\n write happened in the corresponding time interval.\n\n- The highest meaningful sampling rate is limited by the kernel's timer and\n bookkeeping system.\n\n\n## Measurands (columns, and their units)\n\nThe quantities intended to be measured.\n\n\n#### `proc_cpu_id`\n\nThe ID of the CPU that this process is currently running on.\n\nMomentary state at sampling time.\n\n\n#### `proc_cpu_util_percent_total`\n\nThe CPU utilization of the process in `percent`.\n\nMean over the past sampling interval.\n\nIf the inspected process is known to contain just a single thread then\nthis can still sometimes be larger than 100 % as of measurement errors. If the\nprocess contains more than one thread then this can go far beyond 100 %.\n\nThis is based on the sum of the time spent in user space and in kernel space.\nFor a more fine-grained picture the following two metrics are also available:\n`proc_cpu_util_percent_user`, and `proc_cpu_util_percent_system`.\n\n\n#### `proc_num_threads`\n\nThe number of threads in the process.\n\nMomentary state at sampling time.\n\n\n#### `proc_num_ip_sockets_open`\n\nThe number of sockets currently being open. This includes IPv4 and IPv6 and does\nnot distinguish between TCP and UDP, and the connection state also does not\nmatter.\n\nMomentary state at sampling time.\n\n\n#### `proc_num_fds`\n\nThe number of file descriptors currently opened by this process.\n\nMomentary state at sampling time.\n\n\n#### `proc_disk_read_throughput_mibps` and `proc_disk_write_throughput_mibps`\n\nThe disk I/O throughput of the inspected process, in `MiB/s`.\n\nBased on Linux' `/proc//io` `rchar` and `wchar`. A highly relevant\n[piece of documentation](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/filesystems/proc.txt) (emphasis mine):\n\n> The number of bytes which this task has caused to be read from storage. This\n> is simply the sum of bytes which this process passed to read() and pread().\n> *It includes things like tty IO* and it is unaffected by whether or not actual\n> physical disk IO was required (*the read might have been satisfied from\n> pagecache*)\n\nMean over the past sampling interval.\n\n\n#### `proc_mem_rss_percent`\n\nFraction of process [resident set size](https://stackoverflow.com/a/21049737)\n(RSS) relative to machine's physical memory size in `percent`.\n\nMomentary state at sampling time.\n\n\n#### `proc_ctx_switch_rate_hz`\n\nThe rate of ([voluntary and\ninvoluntary](https://unix.stackexchange.com/a/442991)) context switches in `Hz`.\n\nMean over the past sampling interval.\n\n\n(list incomplete)\n\n\n## Valuable references\n\nExternal references on the subject matter that I found useful during\ndevelopment.\n\nAbout system performance measurement, and kernel time bookkeeping:\n\n- http://www.brendangregg.com/usemethod.html\n- https://www.vividcortex.com/blog/monitoring-and-observability-with-use-and-red\n- https://github.com/uber-common/cpustat/blob/master/README.md\n- https://elinux.org/Kernel_Timer_Systems\n- https://github.com/Leo-G/DevopsWiki/wiki/How-Linux-CPU-Usage-Time-and-Percentage-is-calculated\n\nAbout disk I/O statistics:\n\n- https://www.xaprb.com/blog/2010/01/09/how-linux-iostat-computes-its-results/\n- https://www.kernel.org/doc/Documentation/iostats.txt\n- https://blog.serverfault.com/2010/07/06/777852755/ (interpreting iostat output)\n- https://unix.stackexchange.com/a/462732 (What are merged writes?)\n- https://stackoverflow.com/a/8512978 (what is`%util` in iostat?)\n- https://coderwall.com/p/utc42q/understanding-iostat\n- https://www.percona.com/doc/percona-toolkit/LATEST/pt-diskstats.html\n\nOthers:\n\n- https://serverfault.com/a/85481/121951 (about system memory statistics)\n\nMusings about HDF5:\n\n- https://cyrille.rossant.net/moving-away-hdf5/\n- http://hdf-forum.184993.n3.nabble.com/File-corruption-and-hdf5-design-considerations-td4025305.html\n- https://pytables-users.narkive.com/QH2WlyqN/corrupt-hdf5-files\n- https://www.hdfgroup.org/2015/05/whats-coming-in-the-hdf5-1-10-0-release/\n- https://stackoverflow.com/q/35837243/145400", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/jgehrcke/schniepel", "keywords": "", "license": "", "maintainer": "", "maintainer_email": "", "name": "schniepel", "package_url": "https://pypi.org/project/schniepel/", "platform": "", "project_url": "https://pypi.org/project/schniepel/", "project_urls": { "Homepage": "https://github.com/jgehrcke/schniepel" }, "release_url": "https://pypi.org/project/schniepel/0.1.0/", "requires_dist": null, "requires_python": "", "summary": "Measures the resource utilization of a specific process over time", "version": "0.1.0" }, "last_serial": 5615682, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "337f61ba5e0d6392b7e2c55a67069f0e", "sha256": "2dd45d51342c8648c58023843682e9d03b344e415aeb9570055727ce6371c487" }, "downloads": -1, "filename": "schniepel-0.1.0.tar.gz", "has_sig": false, "md5_digest": "337f61ba5e0d6392b7e2c55a67069f0e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 35034, "upload_time": "2019-07-31T21:21:39", "url": "https://files.pythonhosted.org/packages/00/b4/b85522460a4d7d6ef73d1e1f4c9807b7078b2e691f6a62f457269451a2ea/schniepel-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "337f61ba5e0d6392b7e2c55a67069f0e", "sha256": "2dd45d51342c8648c58023843682e9d03b344e415aeb9570055727ce6371c487" }, "downloads": -1, "filename": "schniepel-0.1.0.tar.gz", "has_sig": false, "md5_digest": "337f61ba5e0d6392b7e2c55a67069f0e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 35034, "upload_time": "2019-07-31T21:21:39", "url": "https://files.pythonhosted.org/packages/00/b4/b85522460a4d7d6ef73d1e1f4c9807b7078b2e691f6a62f457269451a2ea/schniepel-0.1.0.tar.gz" } ] }