{
    "info": {
        "author": "Will Sijp",
        "author_email": "wim.sijp@gmail.com",
        "bugtrack_url": null,
        "classifiers": [
            "License :: OSI Approved :: MIT License",
            "Operating System :: OS Independent",
            "Programming Language :: Python :: 3"
        ],
        "description": "Scrapepath\n----------\n\n[Scrapepath](https://github.com/wsijp/scrapepath) is a templated web scraping syntax. [Scrapepath is pip installable](https://pypi.org/project/scrapepath/) via `pip install scrapepath`.\n\n\nRequirements\n------------\n\nInstall the required Python dependencies using the provided requirements.txt file, by:\n\n```bash\npip install -r requirements.txt\n```\n\n\nUsage\n-----\n\nTo run an example, execute on the command line without arguments:\n\n```bash\n./parser\n```\n\nTo use within Python:\n\n```python\nfrom parser import NodeParser\n\nnp = NodeParser(soup_template, soup, live_url)\nnp.hop_template()\nprint (json.dumps(np.result_dict, indent = 2, default = str))\n```\n\nWhere `soup_template` is a `BeautifulSoup` of the template file, `soup` is a `BeautifulSoup` of the scraped page and `live_url` the url of the scraped page.\n\nTemplates\n---------\n\nHTML pages are scraped using HTML templates, consisting of a mixture of the most important tags, and statements.\n\nTemplates consist of HTML files containing nested tags leading to the scraping element of interest.\n\nThe parser is based on `BeautifulSoup`.\n\nExample 1: Scraping data\n-----------------------\n\nThe following examples are from scraped pages `examples/example1a.html` and template `examples/scraped1.html`. Run the example using:\n\n`./parser.py examples/example1a.html examples/scraped1.html`\n\nThis scrapes the target page `scraped1.html` using the `template example1a.html`. The text item \"Tea\" is scraped from the target page using the `record` attribute in the template page. A path to the target text (\"Tea\") is specified in the template using tags that correspond to the target page. So, to scrape from:\n\n```html\n\n<ul class = \"my_list\">\n  <li class = \"my_item\">Coffee</li>\n  <li class = \"my_item\"><span class = \"cuppa\">Tea</span></li>\n  <li class = \"my_item\">Milk</li>\n</ul>\n\n```\n\nUse template:\n\n```html\n<ul class = \"my_list\">\n  <span class = \"cuppa\" record = \"text as favorite\"></span>\n</ul>\n\n```\n\nThis yields a dictionary containing the scraped data under the key \"favorite\" as specified in the `record` attribute:\n\n```json\n{\n  \"favorite\": \"Tea\"\n}\n```\nThe `text` statement within the record attribute corresponds to a function that obtains text from inside the HTML tag, and `favorite` is the key to record the data against. The `text` function can be replaced with custom Python functions.\n\n Starting from the outer node, `<ul>` , in the template, the parser looks for the first node in the scraped page that matches the template node in type and attributes. In this case, matching a ul with a ul, and class my_list with class my_list. Then, the same search takes place using the template node children, now confined within the children of the scraped node. So nested template nodes represent paths. The `<li>` node is not included in the template, as it would point the search to the first element of the list.\n\n In this case, nesting the template nodes is needlessly specific. There are no other nodes of class \"cuppa\", so we can omit the `<ul>` and `<li>` items, and the following template will record the same data:\n\n ```html\n <span class = \"cuppa\" record = \"text as favorite\"></span>\n\n ```\n\nSo paths along many nested nodes in the scraped page can be summarized by only a few nodes that define a unique path to the scraped data.\n\n\nLoops:\n\nA `for` loop scrapes all items in the list. In this simple example, we record only one variable (item_text) per item:\n\nTemplate:\n\n```html\n    <ul class = \"my_list\">\n      <for items = \"items\" condition = \"i < 5\">\n        <li class =\"my_item\" record = \"text as item_text\">\n        </li>\n      </for>\n    </ul>\n```\n\nThis results in the output:\n\n```json\n{\n  \"items\": [\n    {\n      \"item_text\": \"Coffee\"\n    },\n    {\n      \"item_text\": \"Tea\"\n    },\n    {\n      \"item_text\": \"Milk\"\n    },\n    {\n      \"item_text\": \"Biscuits\"\n    },\n    {\n      \"item_text\": \"Chocolate\"\n    }\n  ]\n}\n```\n\nHere, the parser matches all the children of the `<for>` template node to the children of the `<ul>` node in the scraped page `scraped1.html` . Run the example using: `./parser.py examples/example1b.html examples/scraped1.html`. The `condition` node indicates that only the first 5 items should be recorded, where `i` is the loop counter variable.\n\nExample 2: for loops on mixed nodes\n----------------------------------\n\nIn the following html, a `<for>` template loop node needs to enclose two template nodes, one for each tag (div and p) and class (my_item and milk_class):\n\nTo scrape from:\n\n```html\n<div class = \"my_list\">\n  <div class = \"my_item\">Coffee</div>\n  <div class = \"my_item\"><span class = \"cuppa\">Tea</span></div>\n  <p class = \"milk_class\">Milk</p>\n  <div class = \"my_item\">Biscuits</div>\n  Chocolate\n</div>\n```\n\nUse template:\n\n```html\n<div class = \"my_list\">\n  <for items = \"items\" >\n    <div class =\"my_item\" record = \"text as item_text\"></div>\n    <p class =\"milk_class\" record = \"text as item_text\"></p>\n  </for>\n</div>\n\n```\n\nHowever, the `<for>` template loop node is unable to record the text element \"chocolate\", as the `<for>` only looks for proper nodes among the children of the `<div class = \"my_list\">` node. To do this, a `<forchild>` template loop node is needed, along with a `<str>` template node to record the `NavigableString` element \"chocolate\":\n\nTemplate:\n\n```html\n\n<div class = \"my_list\">\n  <forchild items = \"items_with_string\" >\n    <div class =\"my_item\" record = \"text as item_text\"></div>\n    <p class =\"milk_class\" record = \"text as item_text\"></p>\n    <str record = \"text as item_text\"></div>\n  </forchild>\n</div>\n\n```\n\nIn this case, the parser looks for the first match to the first template node (the child of the `<for>` node), and loops over its sibblings, probing with all template nodes (the children of this for node). Run this example using `examples/example1b.html` and `examples/scraped1.html`.\n\nExample 3: Jumping to linked pages\n---------------------------------\n\nFollow links on pages using the `<jump>` template node:\n\nTo scrape from:\n\n```html\n\n<a href=\"example_linked.html\"></a>\n\n```\n\nUse template:\n\n```html\n    <a record = \"href as my_link\">\n      <jump on = \"my_link\">\n        <ibody>\n          <div class = \"message\" record = \"text as msg_from_link\"></div>\n        </ibody>\n      </jump>\n    <a>\n```\n\nHere, the nodes within the `<jump>` node act on the linked page.\n\nThis example is invoked with:\n\n```bash\n./parser.py examples/example3a.html examples/scraped3.html\n```",
        "description_content_type": "text/markdown",
        "docs_url": null,
        "download_url": "",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/wsijp/scrapepath",
        "keywords": "",
        "license": "",
        "maintainer": "",
        "maintainer_email": "",
        "name": "scrapepath",
        "package_url": "https://pypi.org/project/scrapepath/",
        "platform": "",
        "project_url": "https://pypi.org/project/scrapepath/",
        "project_urls": {
            "Homepage": "https://github.com/wsijp/scrapepath"
        },
        "release_url": "https://pypi.org/project/scrapepath/0.1.1/",
        "requires_dist": null,
        "requires_python": "",
        "summary": "Templated scraping syntax",
        "version": "0.1.1"
    },
    "last_serial": 5394433,
    "releases": {
        "0.1.1": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "f5bde24507caa196a3a8d744c93653a8",
                    "sha256": "0aa647d004eedb7b8a5805ec184d1bb5f2dc4e4184b6d44196a9fdeb42ac6a6e"
                },
                "downloads": -1,
                "filename": "scrapepath-0.1.1.tar.gz",
                "has_sig": false,
                "md5_digest": "f5bde24507caa196a3a8d744c93653a8",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 3703,
                "upload_time": "2019-06-13T06:20:16",
                "url": "https://files.pythonhosted.org/packages/44/10/67c3eee880479002ce7c58b7b92664a71bd50aed17d44acc25ca922fd0bd/scrapepath-0.1.1.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "f5bde24507caa196a3a8d744c93653a8",
                "sha256": "0aa647d004eedb7b8a5805ec184d1bb5f2dc4e4184b6d44196a9fdeb42ac6a6e"
            },
            "downloads": -1,
            "filename": "scrapepath-0.1.1.tar.gz",
            "has_sig": false,
            "md5_digest": "f5bde24507caa196a3a8d744c93653a8",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 3703,
            "upload_time": "2019-06-13T06:20:16",
            "url": "https://files.pythonhosted.org/packages/44/10/67c3eee880479002ce7c58b7b92664a71bd50aed17d44acc25ca922fd0bd/scrapepath-0.1.1.tar.gz"
        }
    ]
}