{ "info": { "author": "V.Anh Tran", "author_email": "tranvietanh1991@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 5 - Production/Stable", "Environment :: Console", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Topic :: Text Processing", "Topic :: Text Processing :: Filters", "Topic :: Text Processing :: General", "Topic :: Utilities" ], "description": "rextract\n========\n\nrextract is a powerful commandline tool to extract and manipulate strings using regular exressions.\n\nThink of it somewhat similar to \"sed\" or \"grep\", except it gives you complete control to rearrange/omit/duplicate\n\nthe output, so far as regular expressions support.\n\nIt supports python extended regular expressions (compatible with perl-style).\n\nUsage\n-----\n\n\tUsage: rextract (Options) [regex pattern] ([output format])\n\n\t\tReads from stdin and applies provided regex pattern line-by-line,\n\n\t\toptionally outputting in a format specified by \"output format.\"\n\n\n\tOptions:\n\n\t\\=\\=\\=\\=\\=\\=\\=\\=\n\n\n\t\t\t\\\\--debug Enable debug mode\n\n\t\t\t\\-\\-version Print version and exit\n\n\n\tUsage:\n\n\t\\=\\=\\=\\=\\=\n\n\n\t\tSome examples of usage could be:\n\n\t\t* Extracting one or more groups from the input\n\n\t\t* Omitting one or more groups from the input\n\n\t\t* Rearranging the input\n\n\t\t* Using some textual input (csv?) into a series of commands to execute\n\n\t\t* Creating reports\n\n\t\t* Many others!\n\n\n\n\tRegex Pattern Format/Tips:\n\n\t\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\n\n\t\trextract supports extended regular expression syntax (python/perl-style), specifically\n\n\t\tthat provided by the standard python \"re\" module.\n\n\t\tSome examples can be found at https://docs.python.org/3/howto/regex.html\n\n\n\n\t\tEach pattern contained within parenthesis counts as a group.\n\n\t\tEach group left-to-right is assigned a numerical value, starting with 1.\n\n\n\t\tYou can assign a name to a group like so: \n\n\n\t\t\t(?P[a-zA-Z][a-zA-Z0-9]+)\n\n\n\t\t\t^ This creates a group which you can reference as:\n\n\n\t\t\t${my_group_name}\n\n\n\t\tA group must start with a letter a-z or underscore ( '_' ),\n\n\t\t and must only contain letters a-z (any case), numbers 0-9, or underscore.\n\n\t\tName a group like: (?P.*)\n\n\n\t\tThe regular expression will match at any point within the line, unless\n\n\t\t'^' (start of line) or '$' (end of line) are present, in which case it\n\n\t\twill have to match with those restrictions.\n\n\n\t\tThus, there is no need to prefix with '.*' to be able to match somewhere\n\n\t\tin the center.\n\n\n If a line fails to match, it will not be provided to for use as output,\n\n it will just be silently skipped. You cannot logically format output\n\n from an unknown input.\n\n\n You can specify a \"NOT\" within your pattern by starting a group with a caret ( '^' )\n\n For example, to match that a line does not start with a semicolon, you can use a\n\n pattern line:\n\n\n ^([^;].*)\n\n\n\tOutput Format\n\n\t\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\\=\n\n\n\t\tThe \"Output Format\" argument can be though of exactly like a bash string -\n\n\t\tcharacters all have their literal value except those preceded with an\n\n\t\tun-escaped dollar ( $ ) sign. These represent groups, either numeric or\n\n\t\tby-name. See the section above for more info on creating groups.\n\n\n\t\tIf the \"Output Format\" argument is omitted, the entire matched region will\n\n\t\tbe printed, line-by-line. This is equivalent to providing an \n\n\t\t\"Output Format\" of '${0}'.\n\n\n\tNumeric Groups\n\n\t\\-\\-\\-\\-\\-\\-\\-\\-\\-\\-\\-\\-\\-\\-\n\n\n\t\tUse ${1} or $1 for the first group, and ${3} for the third.\n\n\t\tNumerical groups are assigned left-to-right as they appear in the regex,\n\n\t\teven if the group is in a conditional, and despite nesting.\n\n\n\t\tFor example:\n\n\t\t\t\"((yellow|white) cheddar)|((whole|part-skim) mozerella)\"\n\n\t\t\t\n\t\t\"yellow|white cheddar\" is the first group,\n\n\t\t\"yellow|white\" is the second group\n\n\t\t\"whole|part-skim mozerella\" is the third group\n\n\t\t\"whole|part-skim\" is the fourth group.\n\n\n\t\tIf a group is not matched, it is assigned the value of an empty string.\n\n\n\t\tUse the special group $0 or ${0]} to print the entire matched section.\n\n\n\tNamed Groups\n\n\t\\-\\-\\-\\-\\-\\-\\-\\-\\-\\-\\-\\-\n\n\n\t\tIf you defined any groups to have a name (see section above on this topic),\n\n\t\ti.e. (?P...) , then you can reference that group in the output\n\n\t\tas $my_group or ${my_group}.\n\n\n\n\tNOTE: Make sure to single-quote the \"Output Format\" argument,\n\n\t\tor escape dollar [$] signs (by using \\$).\n\n\n\trextract version 1.0.0 by Tim Savannah\n\n\nExamples\n--------\n\n\n*passwd file*\n\nExample, extract all the usernames and UIDs from /etc/passwd of folks who use \"/bin/bash\" as their shell, and reformat it.\n\n\tcat /etc/passwd | rextract '^(?P[^:]+)[:][^:]*[:](?P[\\d]+).*/bin/bash$' '${username} [${uid}]'\n\n\nExample output:\n\n\tjoe55 [1000]\n\n\ttjoseph [1009]\n\n\tjames [1011]\n\n\n\nExplained Expression:\n\n* Match starts at first character of line. First group is named \"username\", and contains at least 1 character and all characters that are not ':'. \n* Then comes a colon ':'\n* Then comes 0 or more characters which are not colon ':'\n* Then comes a colon\n* Second group is named \"uid\", and contains one or more digits.\n* Then, match 0 or more characters\n* Then, match the string \"/bin/bash\" at the end of the line ( represented by '$' )\n\nOur output is the username, followed by square brackets enclosing the uid.\n\n\n\n*Logs*\n\nExample, extract a sorted list of all pacman (archlinux) packages updated/installed on a specific date, and versions\n\n\n\trextract '^(\\[2016-11-02).*(upgraded|installed) (?P.*)$' '$what' < /var/log/pacman.log | sort\n\n\nSample Output:\n\n\tasciidoc (8.6.9-3)\n\n\taccerciser (3.14.0-4 -> 3.22.0-1)\n\n\taccountsservice (0.6.42-1 -> 0.6.43-1)\n\n\tadwaita-icon-theme (3.20-2 -> 3.22.0-1)\n\n\taisleriot (3.20.2-1 -> 3.22.0+5+gb3024a2-1)\n\n\takonadi-qt4 (1.13.0-10 -> 1.13.0-11)\n\n\n\nExplanation:\n\nHere's a few sample lines from pacman.log:\n\n\n\t[2016-11-02 16:45] [ALPM] installed asciidoc (8.6.9-3)\n\n\t[2016-11-02 22:42] [ALPM] upgraded tali (3.20.0-2 -> 3.22.0-1)\n\n\t[2016-11-02 22:42] [ALPM] upgraded totem (3.20.1-1 -> 3.22.0+5+ge0bf46e-1)\n\n\t[2016-11-02 22:42] [ALPM] warning: directory permissions differ on /etc/unrealircd/\n\n\tfilesystem: 700 package: 755\n\n\t[2016-11-02 22:42] [ALPM] warning: directory permissions differ on /etc/unrealircd/aliases/\n\n\tfilesystem: 700 package: 755\n\n\t[2016-11-02 22:42] [ALPM] upgraded unrealircd (4.0.6-1 -> 4.0.7-1)\n\n\n\nAs you can see, there's a lot of information here, some relevant, some not.\n\nBasically, this is an example that COULD be done with grep and sed, but\n\nis much more easily accomplished with rextract, and we may actually want to modify\n\nthe form of the output (see \"More Advanced\" below)\n\n\nSo the filter expression says:\n\n* Filter that line starts with the date of interest\n* Filter that 0 or more characters occur between that date and either the word \"upgraded\" or \"installed\"\n* Extract everything after the word \"upgraded\" or \"installed\" (excluding the space after), and place into a group called \"what\"\n\nAnd our output expression just contains the 'what' portion.\n\n\n*Strip Comments*\n\nTo strip comments from a file, where a \"comment\" is defined by the character ';' and everything following on a given line:\n\ncat myFile.ini | rextract '(?P[^;]\\*)' > myFile.ini.stripped\n\n\n*More Advanced:*\n\n\nOkay, so now let's get more advanced. We want to produce a report that lists what software installations happened today,\n\nwhat the final version is, and whether it is new software (installation) or old software (upgrade). And we use the same log file.\n\n\nThis is accomplished with the following:\n\n\trextract '^(\\[2016-11-02).*(?P[ui])(pgraded|nstalled) (?P[^ ]+)[ ][\\(](.+[\\-][>][ ]){0,1}(?P.+)[\\)]$' '$name = $final_version [$action]' < /var/log/pacman.log | sort\n\n\nSample output:\n\n\tasciidoc = 8.6.9-3 [i]\n\n\tvim = 8.0.0055-1 [u]\n\n\tvim-runtime = 8.0.0055-1 [u]\n\n\tyelp = 3.22.0+1+gfabd8eb-1 [u]\n\t\n\tyelp-tools = 3.18.0+1+g193c2bd-1 [u]\n\n\tyelp-xsl = 3.20.1-2 [u]\n\n\n\nHere we show the package name, the final version, and a marker if it was an install or an upgrade ( [i] == install, [u] == upgrade ).\n\n\nFilter Explanation:\n\n* Start with today's date\n* This time, split the first letter of \"upgraded\" and \"installed\" into its own group, \"action\".\n* Ensure that following the \"action\" letter is the remainder of the word. Note, in theory this could match 'ipgraded' or 'unstalled', but with this given data, it won't. However, in other cases, it might. For those cases, we can match with an \"or\" condition, and use two groups (you cannot repeat group names, even in an \"or\" condition):\n\n\t\t./rextract '^(\\[2016-11-02).*\\[ ](((?P[u])pgraded)|((?P[i])nstalled))[ ](?P[^ ]+)[ ][\\(](.+[\\-][>][ ]){0,1}(?P.+)[\\)]$' '$name - $final_version [${a1}${a2}]' < /var/log/pacman.log\n\nSo here, note that we no longer can match \"ipgraded\" or \"unstalled\". When a group is present in the pattern string, but does not appear in a matched group, its value is assigned as an empty string. Thus, where we used \"$action\" in the simpler form, we now use \"${a1}${a2}\", as only one will hold a value ('u' or 'i'), and the other will be blank.\n\nAnyway, I digress. Be sensible, unless lives are on the line, it's OK to take shortcuts (like \"(?P\\[ui])(pgraded|nstalled)\") which are not \"technically\" 100% correct, but are 100% accurate with real-world data.\n\n* After the word \"upgraded/installed\" and the following space, take all non-space characters (\"\\[^ \\]\") and assign to group \"name\". This will be the package name.\n* Next follows a space, and an open parenthesis.\n* Then, is a conditional group. We match that there are at least one character followed by an arrow (\"->\") followed by a space, 0 or 1 times. This may be confusing to some, basically, we are making a group of \"{anything}-> \" and saying you may see that group 0 times, or 1 time. This covers the difference in representations of the \"installed\" and \"upgraded\" packages. 'Upgrades' will have matched that group 1 time (ok), 'installs' would have matched that group 0 times (ok).\n* Now that we've discarded the first part of the parenthesis in the upgrade case, and remain just inside the paren in the install place, what is left between the cursor and the close-parenthesis is the final version. So we match everything from cursor to the final version.\n\n\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/tranvietanh1991/rextract", "keywords": "rextract", "license": "GPLv3", "maintainer": "", "maintainer_email": "", "name": "rextract", "package_url": "https://pypi.org/project/rextract/", "platform": "", "project_url": "https://pypi.org/project/rextract/", "project_urls": { "Homepage": "https://github.com/tranvietanh1991/rextract" }, "release_url": "https://pypi.org/project/rextract/1.1.0/", "requires_dist": null, "requires_python": "", "summary": "Powerful commandline tool to extract and manipulate strings using regular exressions", "version": "1.1.0" }, "last_serial": 4522571, "releases": { "1.1.0": [ { "comment_text": "", "digests": { "md5": "3478a5ac0048ce3127ebe2fa64cff055", "sha256": "b6a916295560b14026cbfc3b8c7b49a5f18397f110d259c3fc058acd553ac537" }, "downloads": -1, "filename": "rextract-1.1.0.tar.gz", "has_sig": false, "md5_digest": "3478a5ac0048ce3127ebe2fa64cff055", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 27142, "upload_time": "2018-03-13T03:34:00", "url": "https://files.pythonhosted.org/packages/9f/1b/f6d2c81933d6699406974c7c6b99dbce7990e409a06405dbc2db997fc1a9/rextract-1.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "3478a5ac0048ce3127ebe2fa64cff055", "sha256": "b6a916295560b14026cbfc3b8c7b49a5f18397f110d259c3fc058acd553ac537" }, "downloads": -1, "filename": "rextract-1.1.0.tar.gz", "has_sig": false, "md5_digest": "3478a5ac0048ce3127ebe2fa64cff055", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 27142, "upload_time": "2018-03-13T03:34:00", "url": "https://files.pythonhosted.org/packages/9f/1b/f6d2c81933d6699406974c7c6b99dbce7990e409a06405dbc2db997fc1a9/rextract-1.1.0.tar.gz" } ] }