{ "info": { "author": "Doru Arfire", "author_email": "doruarfire@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Environment :: Console", "Intended Audience :: Developers", "Intended Audience :: System Administrators", "License :: OSI Approved", "License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)", "Programming Language :: Python", "Topic :: Internet :: WWW/HTTP" ], "description": "======================================\n**screp**, easy command-line scraping\n======================================\n\n\nWhat is screp?\n==============\n\n**screp** is a command-line utility that provides easy and flexible scraping\nof HTML documents. It works by finding a set of *anchors* (specified using a\nCSS selector) and then extracting information relative to those anchors,\noptionally post-processing it using a set of standard operations. For each\nanchor it outputs a record formatted according to one of the supported formats\n(CSV, JSON or general).\n\n\nInvoking screp\n==============\n\n**screp** is invoked using the following syntax::\n\n$ screp [OPTION] FORMAT_SPEC PRIMARY_SELECTOR [FILES]\n\nwhere:\n* FORMAT_SPEC is a format specification, one of:\n - *-c CSV_FORMAT_SPEC*, formats each record as a comma-separated-values row\n - *-j JSON_FORMAT_SPEC*, formats each record as a JSON object and the whole\n output as a list of JSON objects\n - *-f GENERAL_FORMAT_SPEC*, formats each record according to a general format\n where computed values are substituted for their specifications (similar to\n bash parameter substitution)\n* PRIMARY_SELECTOR is a CSS selector that specifies the *primary anchor*, as\n detailed below\n* each FILE can be either a local file or an absolute URL; if no FILEs are\n specified, the standard input is read\n\n\nHow does screp work?\n====================\n\n**screp** tries to automate many of the steps taken when writing your own\nscraper, such as:\n\n* fetching the HTML documents, if 
necessary\n* parsing HTML\n* locating areas of interest in the DOM of the document\n* locating interesting information around those areas\n* simple processing of these pieces of information\n* formatting of the information\n* outputting the information\n\nTo use screp, you need to take a series of steps:\n* tell screp where to get the HTML documents from; it works with multiple\n documents, from sources such as the web, the local file-system or STDIN\n* define the *primary anchor* using a CSS selector: these are elements through\n which you access records of interest in the HTML documents\n* specify the output format; this involves specifying:\n - *terms*, which are strings computed relative to the anchors\n - how these terms are combined to produce a record; currently screp supports\n three methods of specifying formats:\n - CSV\n - JSON\n - general format\n* optionally, you can also define *secondary anchors*, which are elements\n computed relative to the *primary anchor* that can be used to define *terms*\n in a more succinct way\n\nDefining terms\n==============\n\nA *term* has the following format::\n\n anchor.accessor.accessor.accessor|filter|filter|filter\n\nIn other words, a term is an anchor (primary or secondary) followed by zero or\nmore accessors followed by zero or more filters.\n\n*Accessors* and *filters* (also collectively called *actions*) are functions\nthat take the output value of the previous function (or the anchor, if this is\nthe first action) and output another value. In other words, they form a pipeline.\nAccessors act on DOM elements and sets (actually ordered lists) of elements,\nwhereas filters act on strings. Each action has an in_type and an out_type. For\na term to be correctly defined, the out_type of each action needs to match the\nin_type of the following action.\n\nThe supported types are: 'string', 'element', 'element_set'.\n\nActions can have zero or more parameters. 
When an action takes parameters, it\nis specified as a function::\n\n action(parameter1, parameter2, parameter3)\n\nWhen not, only the action name is specified (no parentheses).\n\nFinally, terms have restrictions on the out_type of their last action (also\ncalled the out_type of the term):\n* if a term is used inside a format specification, its out_type must be\n 'string'\n* if a term is used to define a secondary anchor, its out_type must be\n 'element'\n\nExamples of terms\n-----------------\n\nThese are correct term definitions::\n\n '$.parent.parent.attr(title)|upper' outputs 'string'\n '@.desc(\".record\").first' outputs 'element'\n 'anchor.ancestors(\".box\").children(\".price\")' outputs 'element_set'\n\nPredefined anchors and actions\n==============================\n\nThe following anchors are predefined:\n* **$** is the primary anchor defined by the primary anchor selector\n* **@** is the anchor representing the root of the current document\n\nThe following accessors are predefined:\n* **first** [in_type='element_set', out_type='element']: returns the first\n element in an element_set\n* **last** [in_type='element_set', out_type='element']: returns the last\n element in an element_set\n* **nth(n)** [in_type='element_set', out_type='element']: returns the n-th\n element in an element_set; it also supports negative indexes, where -1\n represents the last element, -2 the second-to-last element, and so on\n* **class** [in_type='element', out_type='string']: returns the value of the\n 'class' attribute\n* **id** [in_type='element', out_type='string']: returns the value of the\n 'id' attribute\n* **parent** [in_type='element', out_type='element']: returns the parent of\n the current element\n* **text** [in_type='element', out_type='string']: returns the text enclosed by\n the current element\n* **tag** [in_type='element', out_type='string']: returns the tag of the\n current element\n* **attr(attr_name)** [in_type='element', out_type='string']: returns the 
value\n of the current element's attribute with name 'attr_name'\n* **desc(css_sel)** [in_type='element', out_type='element_set']: returns the\n ordered list of descendants of the current element selected by the CSS\n selector specified by 'css_sel'\n* **fdesc(css_sel)** [in_type='element', out_type='element']: equivalent to\n .desc(css_sel).first\n* **ancestors(css_sel)** [in_type='element', out_type='element_set']: returns\n the list of ancestors of the current element that satisfy the CSS selector\n specified by 'css_sel'\n* **children(css_sel)** [in_type='element', out_type='element_set']: returns\n the list of children of the current element that satisfy the CSS selector\n specified by 'css_sel'\n* **psiblings(css_sel)** [in_type='element', out_type='element_set']: returns\n the list of preceding siblings of the current element that satisfy the CSS\n selector specified by 'css_sel'\n* **fsiblings(css_sel)** [in_type='element', out_type='element_set']: returns\n the list of following siblings of the current element that satisfy the CSS\n selector specified by 'css_sel'\n* **siblings(css_sel)** [in_type='element', out_type='element_set']: returns\n the list of siblings of the current element that satisfy the CSS selector\n specified by 'css_sel'\n* **matching(css_sel)** [in_type='element_set', out_type='element_set']:\n filters an element_set and returns all elements that match the CSS selector\n specified by 'css_sel'\n\nThe following filters are predefined:\n* **upper** [in_type='string', out_type='string']: converts string to uppercase\n* **lower** [in_type='string', out_type='string']: converts string to lowercase\n* **trim** [in_type='string', out_type='string']: removes spaces at the\n beginning and end of the string\n* **strip(chars)** [in_type='string', out_type='string']: removes characters\n specified by 'chars' at the beginning and end of the string\n* **replace(old, new)** [in_type='string', out_type='string']: replaces all\n occurrences of 'old' 
with 'new'\n* **resub(pattern, repl)** [in_type='string', out_type='string']: performs a\n regular expression substitution; *pattern* and *repl* have the formats\n taken by the **re.sub** function from the Python standard library\n\nSpecifying output formats\n=========================\n\nCSV format\n----------\n\nThe CSV output format is specified using the -c option. Optionally, using the\n-H option you can specify a CSV header to output before the records.\n\nExample::\n\n -c '$.attr(title), $.parent.fdesc(\".price\").text | trim' -H 'name, price'\n\n\nJSON format\n-----------\n\nThe JSON output format is specified using the -j option. It formats the output\nas a JSON list of objects, one for each record. The *--indent-json* flag tells\nscrep to indent each object. The format is specified as a comma-separated list\nof *key=value* pairs, where the *key* represents the JSON key in the record\nobject while *value* is a term specification.\n\nExample::\n\n -j 'text=$.text, ptext=$.parent.text | upper, gptext=$.parent.parent.text'\n\n\nGeneral format\n--------------\n\nThe general format is specified using the -f option, as a string containing\nterm specifications. To distinguish them from the surrounding text, each term\nspecification is surrounded by braces. When formatting a record, each term\nspecification is replaced with the computed value for that term.\n\nExample::\n\n -f 'some header {$.parent.text | replace(\"X\", \"Y\")} some middle {$.tag} some\n tail'\n\n\nSpecifying secondary anchors\n============================\n\nSecondary anchors are specified using the -a option. There can be any number of\nsecondary anchor definitions. The definitions have the format\n**name=term**, where *name* is an identifier and *term* is a term definition\nrelative to any of the previously defined anchors (primary or secondary) that\noutputs an element. 
Secondary anchors can be redefined in later -a options\nbut only the last definition is retained.\n\nSecondary anchors examples\n--------------------------\n\nThese are examples of secondary anchors definitions::\n\n -a 'p=$.parent' -a 'gp=p.parent'\n\n -a 'interesting=$.fdesc(\".interesting-class\")' -a\n 'interesting=interesting.parent'", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/darfire/screp", "keywords": null, "license": "LGPL", "maintainer": null, "maintainer_email": null, "name": "screp", "package_url": "https://pypi.org/project/screp/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/screp/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/darfire/screp" }, "release_url": "https://pypi.org/project/screp/0.3.2/", "requires_dist": null, "requires_python": null, "summary": "Command-line utility for easy scraping of HTML documents", "version": "0.3.2" }, "last_serial": 799344, "releases": { "0.3": [ { "comment_text": "", "digests": { "md5": "a580821438b4bc7236902ba3228a737f", "sha256": "bcf41dfad9f5d3ac1c6bd23b4791903a10b8d60d58aec3a6050a7e8cf258e546" }, "downloads": -1, "filename": "screp-0.3.tar.gz", "has_sig": false, "md5_digest": "a580821438b4bc7236902ba3228a737f", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19049, "upload_time": "2013-02-16T19:17:41", "url": "https://files.pythonhosted.org/packages/29/7c/65454dd1a9c4691d59ed8371972a5fb18c8d29dc73489a597787c934d103/screp-0.3.tar.gz" } ], "0.3.1": [ { "comment_text": "", "digests": { "md5": "a44358303767adfbea2de09c38712f83", "sha256": "201896496e432e90ea3f5e7410193f62618126745c88137ec64a117045fba42d" }, "downloads": -1, "filename": "screp-0.3.1.tar.gz", "has_sig": false, "md5_digest": "a44358303767adfbea2de09c38712f83", "packagetype": "sdist", "python_version": "source", "requires_python": 
null, "size": 19322, "upload_time": "2013-02-16T19:53:19", "url": "https://files.pythonhosted.org/packages/03/9a/e7853c8f8099ce46fadb9e75a4800bad9a80af335e3e89012900c0dc60f8/screp-0.3.1.tar.gz" } ], "0.3.2": [ { "comment_text": "", "digests": { "md5": "4bab678283be29372520a7a6c8adc9d9", "sha256": "3ec2e2fcf4292d8f42468c3a7a8d750d0638e398696299748154b291d7c4e643" }, "downloads": -1, "filename": "screp-0.3.2.tar.gz", "has_sig": false, "md5_digest": "4bab678283be29372520a7a6c8adc9d9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19456, "upload_time": "2013-02-17T12:04:24", "url": "https://files.pythonhosted.org/packages/ba/6b/a0a287272610c891e776e1f7836d49703802212f44b6220d8241e96eaba7/screp-0.3.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "4bab678283be29372520a7a6c8adc9d9", "sha256": "3ec2e2fcf4292d8f42468c3a7a8d750d0638e398696299748154b291d7c4e643" }, "downloads": -1, "filename": "screp-0.3.2.tar.gz", "has_sig": false, "md5_digest": "4bab678283be29372520a7a6c8adc9d9", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 19456, "upload_time": "2013-02-17T12:04:24", "url": "https://files.pythonhosted.org/packages/ba/6b/a0a287272610c891e776e1f7836d49703802212f44b6220d8241e96eaba7/screp-0.3.2.tar.gz" } ] }