{ "info": { "author": "Will Sijp", "author_email": "wim.sijp@gmail.com", "bugtrack_url": null, "classifiers": [ "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 3" ], "description": "Scrapepath\n----------\n\n[Scrapepath](https://github.com/wsijp/scrapepath) is a templated web scraping syntax. [Scrapepath is pip installable](https://pypi.org/project/scrapepath/) via `pip install scrapepath`.\n\n\nRequirements\n------------\n\nInstall the required Python dependencies from the provided `requirements.txt` file:\n\n```bash\npip install -r requirements.txt\n```\n\n\nUsage\n-----\n\nTo run an example, execute on the command line without arguments:\n\n```bash\n./parser\n```\n\nTo use within Python:\n\n```python\nimport json\n\nfrom parser import NodeParser\n\n# soup_template, soup and live_url are described below\nnp = NodeParser(soup_template, soup, live_url)\nnp.hop_template()\nprint(json.dumps(np.result_dict, indent=2, default=str))\n```\n\nHere, `soup_template` is a `BeautifulSoup` of the template file, `soup` is a `BeautifulSoup` of the scraped page, and `live_url` is the URL of the scraped page.\n\nTemplates\n---------\n\nHTML pages are scraped using HTML templates, which mix the most important tags of the target page with scraping statements.\n\nTemplates are HTML files containing nested tags that lead to the element to be scraped.\n\nThe parser is based on `BeautifulSoup`.\n\nExample 1: Scraping data\n-----------------------\n\nThe following example uses the target page `examples/scraped1.html` and the template `examples/example1a.html`. Run the example using:\n\n`./parser.py examples/example1a.html examples/scraped1.html`\n\nThis scrapes the target page `scraped1.html` using the template `example1a.html`. The text item \"Tea\" is scraped from the target page using the `record` attribute in the template page. A path to the target text (\"Tea\") is specified in the template using tags that correspond to the target page. 
So, to scrape from:\n\n```html\n\n\n\n```\n\nUse template:\n\n```html\n\n\n```\n\nThis yields a dictionary containing the scraped data under the key \"favorite\" as specified in the `record` attribute:\n\n```json\n{\n \"favorite\": \"Tea\"\n}\n```\nThe `text` statement within the record attribute corresponds to a function that obtains text from inside the HTML tag, and `favorite` is the key to record the data against. The `text` function can be replaced with custom Python functions.\n\n Starting from the outer node, `