{ "info": { "author": "barisumog", "author_email": "barisumog@gmail.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "Intended Audience :: Education", "Intended Audience :: End Users/Desktop", "Intended Audience :: Science/Research", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "Programming Language :: Python :: 3.3", "Topic :: Internet :: WWW/HTTP", "Topic :: Internet :: WWW/HTTP :: Indexing/Search" ], "description": "pyllage\r\n=======\r\n\r\nA web scraping tool in Python 3.\r\n\r\n**pyllage** is a simple and practical tool to extract data\r\nfrom web pages.\r\n\r\nAs opposed to full-fledged scraping frameworks, it provides a\r\nbare-bones approach. The basic API allows quick testing of\r\nideas and easy integration with other tools and scripts.\r\n\r\n\r\nFeatures\r\n--------\r\n\r\n* supports HTTP GET and POST requests\r\n\r\n* allows custom request headers (cookies, user-agents, etc.)\r\n\r\n* adjusts encoding according to *Content-Type* information in either the response headers, or the ``<meta>`` tag
of the html document\r\n\r\n* custom parser built upon the standard HTMLParser class\r\n\r\n* practical selectors for extracting data (no tree traversal or XPath)\r\n\r\n\r\nRequirements\r\n------------\r\n\r\nCurrently, all package functionality is built upon the standard library.\r\nThis may or may not change in the future.\r\n\r\nTests are written for **py.test**, so that's a requirement if you want to\r\nrun the bundled tests.\r\n\r\n\r\nInstalling\r\n----------\r\n\r\n::\r\n\r\n pip install pyllage\r\n\r\n\r\n\r\nQuick Start\r\n-----------\r\n\r\nHere are a few quick examples illustrating *some* of the functions::\r\n\r\n import pyllage\r\n stack = pyllage.get_stack(\"http://somesite.com/etcetera\")\r\n \r\n # get all links, print the href=... parts\r\n \r\n links = pyllage.choose(stack, tag=\"a\")\r\n for key in links:\r\n print(links[key][\"attrs\"])\r\n \r\n # get all text data except scripts and print it\r\n \r\n texts = pyllage.choose(stack, tag=\"script\", select=False)\r\n data = pyllage.rip_data(texts)\r\n print(\"\\n\".join(data))\r\n \r\n # get all spans and divs with class=help (but not class=helpmore)\r\n \r\n helps = pyllage.choose(stack, tag=\"span div\", attrs=\"class=help\", exact=True)\r\n \r\n # get all divs containing the word pyllage in their text part\r\n \r\n pylls = pyllage.choose(stack, tag=\"div\", data=\"pyllage\")\r\n\r\n\r\nHow the parser works & The stack\r\n--------------------------------\r\n\r\nThe parser spits out a dictionary that we call the *stack*.\r\n\r\nIt looks something like this::\r\n\r\n {1: {\"tag\": \"div\", \"attrs\": \"class=main\", \"data\": \"\"},\r\n 2: {\"tag\": \"p\", \"attrs\": \"\", \"data\": \"Hello world\"},\r\n 3: {\"tag\": \"span\", \"attrs\": \"class=red bold\", \"data\": \"Exclaim!\"},\r\n 4: {\"tag\": \"a\", \"attrs\": 'href=\"http://somewhere\"', \"data\": \"click me\"}}\r\n\r\nThe keys of the *stack* are consecutive integers starting from 1.\r\n\r\nWhile parsing an html document, the parser creates a new entry in the *stack* every time\r\nit finds an opening tag. Every entry itself is a dictionary with three items:\r\n\r\n``tag`` is the tag name of the encountered tag\r\n\r\n``attrs`` is everything else inside the opening bracket (class, id, style, href, etc.)\r\n\r\n``data`` is the text that is parsed *after* the opening bracket and before the closing\r\nbracket (or a new opening bracket)\r\n\r\nFor example: ``