{ "info": { "author": "Artur de Sousa Rocha", "author_email": "adsr@poczta.onet.pl", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Programming Language :: Python :: 3", "Topic :: Text Processing" ], "description": "# awking\n\nMake it easier to use Python as an AWK replacement.\n\n## Basic usage\n\n### Extracting groups of lines\n\n```python\nfrom awking import RangeGrouper\n\nlines = '''\ntext 1\ntext 2\ngroup start 1\ntext 3\ngroup end 1\ntext 4\ngroup start 2\ntext 5\ngroup end 2\ntext 6\n'''.splitlines()\n\nfor group in RangeGrouper('start', 'end', lines):\n print(list(group))\n```\n\nThis will output:\n\n```text\n['group start 1', 'text 3', 'group end 1']\n['group start 2', 'text 5', 'group end 2']\n```\n\n### Extracting fixed-width fields\n\n```python\nfrom awking import records\n\nps_aux = '''\nUSER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND\nroot 1 0.0 0.0 51120 2796 ? Ss Dec22 0:09 /usr/lib/systemd/systemd --system --deserialize 22\nroot 2 0.0 0.0 0 0 ? S Dec22 0:00 [kthreadd]\nroot 3 0.0 0.0 0 0 ? S Dec22 0:04 [ksoftirqd/0]\nroot 5 0.0 0.0 0 0 ? S< Dec22 0:00 [kworker/0:0H]\nroot 7 0.0 0.0 0 0 ? S Dec22 0:15 [migration/0]\nroot 8 0.0 0.0 0 0 ? S Dec22 0:00 [rcu_bh]\nroot 9 0.0 0.0 0 0 ? S Dec22 2:47 [rcu_sched]\nsaml 3015 0.0 0.0 117756 596 pts/2 Ss Dec22 0:00 bash\nsaml 3093 0.9 4.1 1539436 330796 ? Sl Dec22 70:16 /usr/lib64/thunderbird/thunderbird\nsaml 3873 0.0 0.1 1482432 8628 ? Sl Dec22 0:02 gvim -f\nroot 5675 0.0 0.0 124096 412 ? Ss Dec22 0:02 /usr/sbin/crond -n\nroot 5777 0.0 0.0 51132 1068 ? Ss Dec22 0:08 /usr/sbin/wpa_supplicant -u -f /var/log/wpa_supplica\nsaml 5987 0.7 1.5 1237740 119876 ? Sl Dec26 14:05 /opt/google/chrome/chrome --type=renderer --lang=en-\nroot 6115 0.0 0.0 0 0 ? 
S Dec27 0:06 [kworker/0:2]\n'''\n\nfor user, _, command in records(ps_aux.splitlines(), widths=[7, 58, ...]):\n    print(user, command)\n```\n\nThis will output:\n\n```text\nUSER COMMAND\nroot /usr/lib/systemd/systemd --system --deserialize 22\nroot [kthreadd]\nroot [ksoftirqd/0]\nroot [kworker/0:0H]\nroot [migration/0]\nroot [rcu_bh]\nroot [rcu_sched]\nsaml bash\nsaml /usr/lib64/thunderbird/thunderbird\nsaml gvim -f\nroot /usr/sbin/crond -n\nroot /usr/sbin/wpa_supplicant -u -f /var/log/wpa_supplica\nsaml /opt/google/chrome/chrome --type=renderer --lang=en-\nroot [kworker/0:2]\n```\n\n## The problem\n\nHave you ever had to scan a log file for XML documents? How hard was it to\nextract a set of multi-line XML documents into separate files?\n\nYou can use `re.findall` or `re.finditer`, but then you need to read the entire log\nfile into a string first. You can also use an AWK script like this one:\n\n```awk\n#!/usr/bin/awk -f\n\n# `count` numbers the output files; `index` is avoided because it is an AWK built-in function.\n/^Payload: <([-_a-zA-Z0-9]+:)?Request/ {\n    ofname = \"request_\" (++count) \".xml\"\n    sub(/^Payload: /, \"\")\n}\n\n/<([-_a-zA-Z0-9]+:)?Request/, /<\\/([-_a-zA-Z0-9]+:)?Request/ {\n    print > ofname\n}\n\n/<\\/([-_a-zA-Z0-9]+:)?Request/ {\n    if (ofname) {\n        close(ofname)\n        ofname = \"\"\n    }\n}\n```\n\nThis works, and quite well. (Despite this being a Python module, I encourage you\nto learn AWK if you don't already know it.)\n\nBut what if you want to build this kind of processing into your Python application?\nWhat if your input is not lines in a file but objects of a different type?\n\n### Python equivalent using `awking`\n\nThe `RangeGrouper` class groups elements from the input iterable based on\npredicates for the start and end elements. This is a bit like Perl's range\noperator or AWK's range pattern, except that your ranges get grouped into\n`START..END` iterables.\n\nAn equivalent of the above AWK script might look like this:\n\n```python\nfrom awking import RangeGrouper\nimport re\nimport sys\n\ng = RangeGrouper(r'^Payload: <([-_a-zA-Z0-9]+:)?Request',\n r'