{
    "info": {
        "author": "Doru Arfire",
        "author_email": "doruarfire@gmail.com",
        "bugtrack_url": null,
        "classifiers": [
            "Development Status :: 4 - Beta",
            "Environment :: Console",
            "Intended Audience :: Developers",
            "Intended Audience :: System Administrators",
            "License :: OSI Approved",
            "License :: OSI Approved :: GNU Library or Lesser General Public License (LGPL)",
            "Programming Language :: Python",
            "Topic :: Internet :: WWW/HTTP"
        ],
        "description": "======================================\n**screp**, easy command-line scrapping\n======================================\n\n\nWhat is screp?\n==============\n\n**screp** is a command line utility that provides easy and flexible scrapping\nof HTML documents. It works by finding a set of *anchors* (specified using a\nCSS selector) and then extracting information relative to those anchors,\noptionally post processing it using a set of standard operations. For each\nanchor it outputs a record formatted according to one of the supported formats\n(CSV, JSON or general).\n\n\nInvoking screp\n==============\n\n**screp** is invoked using the following syntax::\n\n$ screp [OPTION] FORMAT_SPEC PRIMARY_SELECTOR [FILES]\n\nwhere:\n* FORMAT_SPEC is a format specification, one of:\n  - *-c CSV_FORMAT_SPEC*, formats each record as a comma-separated-values row\n  - *-j JSON_FORMAT_SPEC*, formats each record as a JSON object and the whole\n    output as a list of JSON objects\n  - *-f GENERAL_FORMAT_SPEC*, formats each record according to a general format\n    where computed values are substituted to their specifications (similar to\n    bash parameter substitution)\n* PRIMARY_SELECTOR is a CSS selector that specifies the *primary anchor*, as\n  detailed below\n* FILE can be either a local file or an absolute URL; if no FILEs are specified\n  the standard input is read\n\n\nHow does screp work?\n====================\n\n**screp** tries to automate many of the steps taken when writing your own\nscrapper, steps like:\n\n* fetching the HTML documents, if necessary\n* parsing HTML\n* locating areas of interest in the DOM of the document\n* locating interesting information around those areas\n* simple processing of these pieces of information\n* formatting of the information\n* outputting the information\n\nTo use screp, you need to take a series of steps:\n* tell screp where to take the HTML documents; it works with multiple\n  documents, from sources such as the web, the local file-system or STDIN\n* define the *primary anchor* using a CSS selector: these are elements through\n  which you access records of interest in the HTML documents\n* specify the output format; this implies specifying:\n  - *terms*, which are string computed relative to the anchors\n  - how these terms are combined to produce a record; currently screp supports\n    three methods of specifying formats:\n      - CSV\n      - JSON\n      - general format\n* optionally, you can also define *secondary anchors*, which are elements\n  computed relative to the *primary anchor* that can be used to define *terms*\n  in a more succinct way\n\nDefining terms\n==============\n\nA *term* has the following format::\n\n    anchor.accessor.accessor.accessor|filter|filter|filter\n\nIn other words, a term is an anchor(primary or secondary) followed by zero or\nmore accessors followed by zero or more filters.\n\n*Accessors* and *filters* (also collectively called *actions*) are functions\nthat take the output value of the last function (or the anchor, if this is the\nfirst action) and output another value. In other words, they form a pipeline.\nAccessors act on DOM elements and sets (actually ordered lists) of elements,\nwhereas filters act on strings. Each action has an in_type and an out_type. For\na term to be correctly defined the out_type of an action needs to match the\nin_type of the following action.\n\nThe supported types are: 'string', 'element', 'element_set'.\n\nActions can have zero or more parameters. When the action takes parameters it\nis specified as a function::\n\n    action(parameter1, parameter2, parameter3)\n\nWhen not, only the action name is specified (no parentheses).\n\nFinally, terms have restrictions of the out_type of their last action (also\ncalled the out_type of the term):\n* if a term is used inside a format specification, its out_type must be\n  'string'\n* if a term is used to define a secondary anchor, its out_type must be\n  'element'\n\nExamples of terms\n-----------------\n\nThese are correct term definitions::\n\n    '$.parent.parent.attr(title)|upper' outputs 'string'\n    '@.desc(\".record\").first' outputs 'element\n    'anchor.ancestors(\".box\").children(\".price\")' outputs 'element_set'\n\nPredefined anchors and actions\n==============================\n\nThe following anchors are predefined:\n* **$** is the primary anchor defined by the primary anchor selector\n* **@** is the primary anchor representing the root of the current document\n\nThe following accessors are predefined:\n* **first** [in_type='element_set', out_type='element']: returns the first\n  element in an element_set\n* **last** [in_type='element_set', out_type='element']: returns the last\n  element in an element_set\n* **nth(n)** [in_type='element_set', out_type='element']: returns the n-th\n  element in an element_set; it also supports negative indexes, where -1\n  represents the last element, -2 the second-to-last element, and so on\n* **class** [in_type='element', out_type='string']: returns the value of the\n  'class' attribute * **id** [in_type='element', out_type='string']: returns\n  the value of the 'id' attribute * **parent** [in_type='element',\n  out_type='element']: returns the parent of the current element\n* **text** [in_type='element', out_type='string']: returns the text enclosed by\n  the current element\n* **tag** [in_type='element', out_type='string']: returns the tag of the\n  current element\n* **attr(attr_name)** [in_type='element', out_type='string']: returns the value\n  of the current element's attribute with name 'attr_name'\n* **desc(css_sel)** [in_type='element', out_type='element_set']: returns the\n  ordered list of descendants of the current element selected by the CSS\n  selector specified by 'css_sel'\n* **fdesc(css_sel)** [in_type='element', out_type='element']: equivalent to\n  .desc(css_sel).first\n* **ancestors(css_sel)** [in_type='element', out_type='element_set']: returns\n  the list of ancestors of the current element that satisfy the CSS selector\n  specified by 'css_sel'\n* **children(css_sel)** [in_type='element', out_type='element_set']: returns\n  the list of children of the current element that satisfy the CSS selector\n  specified by 'css_sel'\n* **psiblings(css_sel)** [in_type='element', out_type='element_set']: returns\n  the list of preceding siblings of the current element that satisfy the CSS\n  selector specified by 'css_sel'\n* **fsiblings(css_sel)** [in_type='element', out_type='element_set']: returns\n  the list of following siblings of the current element that satisfy the CSS\n  selector specified by 'css_sel'\n* **siblings(css_sel)** [in_type='element', out_type='element_set']: returns\n  the list of siblings of the current element that satisfy the CSS selector\n  specified by 'css_sel'\n* **matching(css_sel)** [in_type='element_set', out_type='element_set']:\n  filters an element_set and returns all elements that match the CSS selector\n  specified by 'css_sel'\n\nThe following filters are predefined:\n* **upper** [in_type='string', out_type='string']: converts string to uppercase\n* **lower** [in_type='string', out_type='string']: converts string to lowercase\n* **trim** [in_type='string', out_type='string']: removes spaces at the\n  beginning and end of the string\n* **strip(chars)** [in_type='string', out_type='string']: removes characters\n  specified by 'chars' at the beginning and end of the string\n* **replace(old, new)** [in_type='string', out_type='string']: replaces all\n  occurrences of 'old' with 'new'\n* **resub(pattern, repl)** [in_type='string', out_type='string']: performs a\n  regular expression substitution; *pattern* and *repl* are have the formats\n  taken by the **re.sub** Python function from the standard Python library;\n\nSpecifying output formats\n=========================\n\nCSV format\n----------\n\nThe CSV output format is specified using the -c option. Optionally, using the\n-H option you can specify a CSV header to output before outputting records.\n\nExample::\n\n    -c '$.attr(title), $.parent.desc(\".price\").text | trim' -H 'name, price'\n\n\nJSON format\n-----------\n\nThe JSON output format is defined using the -j option. It formats the output as\na JSON list of objects, one for each record. The *--indent-json* flat tells\nscrep to indent each object. The format is specified as a comma-separated list\nof *key=value* pairs, where the *key* represents the JSON key in the record\nobject while *value* is a term specification.\n\nExample::\n\n  - j 'text=$.text, ptext=$.parent.text | upper, gptext=$.parent.parent.text'\n\n\nGeneral format\n--------------\n\nThen general format is specified by a general string containing term\nspecifications. To distinguish it from the general format, each term\nspecification is surrounded by braces. When formatting a record each term\nspecification is substituted with the computed value for that term.\n\nExample::\n\n  -f 'some header {$.parent.text | replace(\"X\", \"Y\")} some middle {$.tag} some\n  tail'\n\n\nSpecifying secondary anchors\n============================\n\nSecondary anchors are specified using the -a option. There can be any number of\nsecondary anchors definitions. The definitions have the format\n**<name>=<term>** where <name> is an identifier and <term> is a term definition\nrelative to any of the previously defined anchors (primary or secondary) that\nhas outputs an element. Secondary anchors can be redefined in later -a options\nbut only the last definition is retained.\n\nSecondary anchors examples\n--------------------------\n\nThese are examples of secondary anchors definitions::\n\n    -a 'p=$.parent' -a 'gp=p.parent'\n\n    -a 'interesting=$.fdesc(\".interesting-class\")' -a\n    'interesting=interesting.parent'",
        "description_content_type": null,
        "docs_url": null,
        "download_url": "UNKNOWN",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/darfire/screp",
        "keywords": null,
        "license": "LGPL",
        "maintainer": null,
        "maintainer_email": null,
        "name": "screp",
        "package_url": "https://pypi.org/project/screp/",
        "platform": "UNKNOWN",
        "project_url": "https://pypi.org/project/screp/",
        "project_urls": {
            "Download": "UNKNOWN",
            "Homepage": "https://github.com/darfire/screp"
        },
        "release_url": "https://pypi.org/project/screp/0.3.2/",
        "requires_dist": null,
        "requires_python": null,
        "summary": "Command-line utility for easy scraping of HTML documents",
        "version": "0.3.2"
    },
    "last_serial": 799344,
    "releases": {
        "0.3": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "a580821438b4bc7236902ba3228a737f",
                    "sha256": "bcf41dfad9f5d3ac1c6bd23b4791903a10b8d60d58aec3a6050a7e8cf258e546"
                },
                "downloads": -1,
                "filename": "screp-0.3.tar.gz",
                "has_sig": false,
                "md5_digest": "a580821438b4bc7236902ba3228a737f",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 19049,
                "upload_time": "2013-02-16T19:17:41",
                "url": "https://files.pythonhosted.org/packages/29/7c/65454dd1a9c4691d59ed8371972a5fb18c8d29dc73489a597787c934d103/screp-0.3.tar.gz"
            }
        ],
        "0.3.1": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "a44358303767adfbea2de09c38712f83",
                    "sha256": "201896496e432e90ea3f5e7410193f62618126745c88137ec64a117045fba42d"
                },
                "downloads": -1,
                "filename": "screp-0.3.1.tar.gz",
                "has_sig": false,
                "md5_digest": "a44358303767adfbea2de09c38712f83",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 19322,
                "upload_time": "2013-02-16T19:53:19",
                "url": "https://files.pythonhosted.org/packages/03/9a/e7853c8f8099ce46fadb9e75a4800bad9a80af335e3e89012900c0dc60f8/screp-0.3.1.tar.gz"
            }
        ],
        "0.3.2": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "4bab678283be29372520a7a6c8adc9d9",
                    "sha256": "3ec2e2fcf4292d8f42468c3a7a8d750d0638e398696299748154b291d7c4e643"
                },
                "downloads": -1,
                "filename": "screp-0.3.2.tar.gz",
                "has_sig": false,
                "md5_digest": "4bab678283be29372520a7a6c8adc9d9",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 19456,
                "upload_time": "2013-02-17T12:04:24",
                "url": "https://files.pythonhosted.org/packages/ba/6b/a0a287272610c891e776e1f7836d49703802212f44b6220d8241e96eaba7/screp-0.3.2.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "4bab678283be29372520a7a6c8adc9d9",
                "sha256": "3ec2e2fcf4292d8f42468c3a7a8d750d0638e398696299748154b291d7c4e643"
            },
            "downloads": -1,
            "filename": "screp-0.3.2.tar.gz",
            "has_sig": false,
            "md5_digest": "4bab678283be29372520a7a6c8adc9d9",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 19456,
            "upload_time": "2013-02-17T12:04:24",
            "url": "https://files.pythonhosted.org/packages/ba/6b/a0a287272610c891e776e1f7836d49703802212f44b6220d8241e96eaba7/screp-0.3.2.tar.gz"
        }
    ]
}