{ "info": { "author": "barisumog", "author_email": "barisumog@gmail.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "Intended Audience :: Education", "Intended Audience :: End Users/Desktop", "Intended Audience :: Science/Research", "License :: OSI Approved :: GNU General Public License v3 (GPLv3)", "Programming Language :: Python :: 3.3", "Topic :: Internet :: WWW/HTTP", "Topic :: Internet :: WWW/HTTP :: Indexing/Search" ], "description": "pyllage\r\n=======\r\n\r\nA web scraping tool in Python 3.\r\n\r\n**pyllage** is a simple and practical tool to extract data\r\nfrom web pages.\r\n\r\nAs opposed to full-fledged scraping frameworks, it provides a\r\nbare-bones approach. The basic API allows quick testing of\r\nideas and easy integration with other tools and scripts.\r\n\r\n\r\nFeatures\r\n--------\r\n\r\n* supports HTTP GET and POST requests\r\n\r\n* allows custom request headers (cookies, user-agents, etc.)\r\n\r\n* adjusts encoding according to *Content-Type* information in either the response headers, or the <head> of the html document\r\n\r\n* custom parser built upon the standard HTMLParser class\r\n\r\n* practical selectors for extracting data (no tree traversal or XPath)\r\n\r\n\r\nRequirements\r\n------------\r\n\r\nCurrently, all package functionality is built upon the standard library.\r\nThis may or may not change in the future.\r\n\r\nTests are written for **py.test**, so that's a requirement if you want to\r\nrun the bundled tests.\r\n\r\n\r\nInstalling\r\n----------\r\n\r\n::\r\n\r\n pip install pyllage\r\n\r\n\r\n\r\nQuick Start\r\n-----------\r\n\r\nHere are a few quick examples illustrating *some* of the functions::\r\n\r\n import pyllage\r\n stack = pyllage.get_stack(\"http://somesite.com/etcetera\")\r\n \r\n # get all links, print the href=... 
parts\r\n \r\n links = pyllage.choose(stack, tag=\"a\")\r\n for key in links:\r\n     print(links[key][\"attrs\"])\r\n \r\n # get all text data except scripts and print it\r\n \r\n texts = pyllage.choose(stack, tag=\"script\", select=False)\r\n data = pyllage.rip_data(texts)\r\n print(\"\\n\".join(data))\r\n \r\n # get all spans and divs with class=help (but not class=helpmore)\r\n \r\n helps = pyllage.choose(stack, tag=\"span div\", attrs=\"class=help\", exact=True)\r\n \r\n # get all divs containing the word pyllage in their text part\r\n \r\n pylls = pyllage.choose(stack, tag=\"div\", data=\"pyllage\")\r\n\r\n\r\nHow the parser works & The stack\r\n--------------------------------\r\n\r\nThe parser spits out a dictionary that we call the *stack*.\r\n\r\nIt looks something like this::\r\n\r\n {1: {\"tag\": \"div\", \"attrs\": \"class=main\", \"data\": \"\"},\r\n 2: {\"tag\": \"p\", \"attrs\": \"\", \"data\": \"Hello world\"},\r\n 3: {\"tag\": \"span\", \"attrs\": \"class=red bold\", \"data\": \"Exclaim!\"},\r\n 4: {\"tag\": \"a\", \"attrs\": 'href=\"http://somewhere\"', \"data\": \"click me\"}}\r\n\r\nThe keys of *stack* are consecutive integers starting from 1.\r\n\r\nWhile parsing an html document, the parser creates a new entry in the *stack* every time\r\nit finds an opening tag. Every entry itself is a dictionary with 3 items:\r\n\r\n``tag`` is the tag name of the encountered tag\r\n\r\n``attrs`` is everything else inside the opening bracket (class, id, style, href, etc.)\r\n\r\n``data`` is the text that is parsed *after* the opening bracket, and before the closing\r\nbracket (or a new opening bracket)\r\n\r\nFor example: ``
<div id=main_div>Hello</div>
``\r\n\r\n``tag`` is \"div\"\r\n\r\n``attrs`` is \"id=main_div\"\r\n\r\n``data`` is \"Hello\"\r\n\r\nWhile parsing the html and populating the *stack* this way, the parser prunes *insignificant*\r\nentries (entries with no attrs or data). For example:\r\n\r\n``
<div><p>Hello</p></div>
``\r\n\r\nThe outer div here won't be included in the *stack*, since it doesn't have any useful\r\ninfo (no class or id etc, and no direct data).\r\n\r\nSome tags may have more than 1 attribute. These will be concatenated into a single string\r\nlike this:\r\n\r\n``
<div class=wrapper style=float:left;>`` will parse into:\r\n\r\n``{'tag': 'div', 'attrs': 'class=wrapper | style=float:left;', 'data': ''}``\r\n\r\n\r\nAPI\r\n---\r\n\r\nFunctions for retrieving the stack\r\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r\n\r\n::\r\n\r\n pyllage.get_stack(url, headers={}, method=\"GET\", postdata=None, filename=\"\"):\r\n \"\"\"Wraps http request and parsing. Also allows stack write.\"\"\"\r\n\r\nThis is the main utility function for retrieving the stack from a url. It wraps the\r\nfunctionality of the 3 functions below, providing a shortcut. Normally, you can just\r\nuse ``get_stack`` unless you need to interfere with the **url -> stack** process.\r\n\r\n``headers`` lets you include cookies or user-agent strings.\r\n\r\n``method`` is either \"GET\" or \"POST\".\r\n\r\n``postdata`` is a bytes object containing data to be sent to the server for POST requests.\r\n\r\nIf ``filename`` is given, the returned stack will also be written to disk, so it can be\r\ninspected with a text editor.\r\n\r\n\r\n::\r\n\r\n pyllage.get(url, headers={}, method=\"GET\", postdata=None):\r\n \"\"\"Http request the url, return response, headers, status and codec.\"\"\"\r\n\r\nRaw function that makes the HTTP request.\r\n\r\n``response = pyllage.get(\"http://somesite.com\", {\"Cookie\": \"valid=true;\"})``\r\n\r\n``response = pyllage.get(\"http://othersite.com\", method=\"POST\", postdata=b\"answer=42\")``\r\n\r\nThe function returns a dictionary with the following keys:\r\n\r\n``headers`` contains the received http headers (may include cookies, etc.)\r\n\r\n``status`` is an integer representing the status code returned by the server\r\n(200 = OK, 404 = Not found, etc.)\r\n\r\n``html`` contains the body of the response. Note that this is of *bytes* type.\r\n\r\n``codec`` is a string containing the encoding declared in the http response.\r\n**pyllage** looks at the response headers for a *Content-Type* with charset\r\nvalue. If there's none, it looks at the <head> part of the html body. 
If there's no\r\ncodec information there, it defaults to *utf-8*.\r\n\r\n\r\n::\r\n\r\n pyllage.parse(html):\r\n \"\"\"Instantiate a parser to process html, return the stack.\"\"\"\r\n\r\nPlease note that the html must be decoded into a string before it can be parsed.\r\nThe ``get_stack`` function handles this automatically.\r\n\r\n\r\n::\r\n\r\n pyllage.stack_to_file(filename, stack, codec):\r\n \"\"\"Write a stack to file with formatting.\"\"\"\r\n\r\nWrites the stack to a file on disk. Note that it **overwrites** any existing data in the given file.\r\n\r\n\r\nSelector functions for operating on the stack\r\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r\n\r\n::\r\n\r\n pyllage.choose(stack, tag=None, attrs=None, data=None, select=True, exact=False):\r\n \"\"\"Returns a dictionary of items from stack that fit given criteria.\r\n\r\n If select is True, returns items that fit criteria. If False, returns all others.\r\n If exact is False, compares tag, attrs and data flexibly.\r\n If exact is True, compares tag, attrs and data exactly as given.\"\"\"\r\n\r\nMain selector function. Examples:\r\n\r\n``pyllage.choose(stack, tag=\"a\")``\r\nReturns all <a> entries.\r\n\r\n``pyllage.choose(stack, tag=\"div span a\", select=False)``\r\nReturns all entries with tags other than
<div>, <span>, or <a>.\r\n\r\n``pyllage.choose(stack, tag=\"div\", attrs=\"id=\")``\r\nReturns all <div>
entries with an *id* attribute.\r\n\r\n``pyllage.choose(stack, attrs=\"class=blue\", exact=True)``\r\nReturns all entries with **exactly** the attribute \"class=blue\" (won't select \"class=blue button\" for example)\r\n\r\n``pyllage.choose(stack, data=\"\", exact=True, select=False)``\r\nReturns all entries with non-empty data.\r\n\r\n::\r\n\r\n pyllage.relative(stack, index, offset=1, count=1):\r\n \"\"\"Returns count number of items, starting at offset from index.\r\n\r\n With defaults, it just returns the next item.\r\n Offset can be negative, count must be greater than 1.\"\"\"\r\n\r\n``index`` is the integer key for the base item in stack.\r\nUseful for extracting data from tags with no id or class attribute.\r\n\r\nE.g. something like ``
<div class=wrapper><span>The data you need is</span><span>42</span></div>
``\r\nHere you can select the wrapping div by its class and then, using its index, call\r\n``pyllage.relative(stack, index, 2, 1)``\r\n\r\n``pyllage.relative(stack, index, -5, count=4)``\r\nReturns the 4 entries that come right before the given index.\r\n\r\nNote that this function works as expected on stacks that you have manipulated. That is,\r\nif the indexes in your stack are [3, 5, 88, 101], then ``pyllage.relative(stack, 5)`` will\r\ngive the entry at 88.\r\n\r\n\r\n::\r\n\r\n pyllage.rip_data(stack):\r\n \"\"\"Returns an ordered list of non-blank data values in stack.\"\"\"\r\n\r\nUse this to get the data once you have selected the entries.\r\n\r\n::\r\n\r\n pyllage.rip_index(stack):\r\n \"\"\"Returns an ordered list of the indexes in stack.\"\"\"\r\n\r\nUseful for doing batch operations with ``pyllage.relative``. For example::\r\n\r\n links = pyllage.choose(stack, tag=\"a\") # choose all links\r\n link_inds = pyllage.rip_index(links) # get the indexes\r\n new_stack = {}\r\n for i in link_inds:\r\n     new_stack.update(pyllage.relative(stack, i))\r\n\r\nNow ``new_stack`` contains all the elements directly following an <a>
tag.\r\n\r\n::\r\n\r\n pyllage.between(stack, start, stop):\r\n \"\"\"Returns items between given indexes, inclusive.\"\"\"\r\n\r\nWhen you have a very large document, and you're only interested in a certain\r\npart of it, you can use this to crop the stack.\r\n\r\nAlso works as expected in manipulated stacks. Say your stack indexes are\r\n[3, 5, 88, 101]. ``pyllage.between(stack, 50, 90)`` will return the item at 88.\r\n\r\n\r\n\r\nFeedback\r\n--------\r\n\r\n**pyllage** is currently under development, so more features are on their way.\r\n\r\nIf you have any ideas about features, or would like some new selector functions,\r\nfeel free to open an issue on GitHub.\r\n\r\n\r\nLicense\r\n-------\r\n\r\n**pyllage** is open sourced under GPLv3.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/barisumog/pyllage", "keywords": "web scraper scraping", "license": "GPLv3", "maintainer": "", "maintainer_email": "", "name": "pyllage", "package_url": "https://pypi.org/project/pyllage/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/pyllage/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/barisumog/pyllage" }, "release_url": "https://pypi.org/project/pyllage/0.1.2/", "requires_dist": null, "requires_python": null, "summary": "A web scraping tool in Python 3", "version": "0.1.2" }, "last_serial": 797367, "releases": { "0.1.0": [ { "comment_text": "", "digests": { "md5": "182ebbaac4d813f807fe6e26ee02f46a", "sha256": "160f7a9ac8f6b49d55dc478cfb5ae11e4767e5c1ba00764aa9a3269aec62711f" }, "downloads": -1, "filename": "pyllage-0.1.0.zip", "has_sig": false, "md5_digest": "182ebbaac4d813f807fe6e26ee02f46a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23251, "upload_time": "2013-04-01T09:29:22", "url": 
"https://files.pythonhosted.org/packages/40/f6/521116f1afbf1ba778b7fe083408d1660bf99b5835fb96c3999920b4b6f9/pyllage-0.1.0.zip" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "659cad1df2054a10e23038e8f8e98190", "sha256": "aa45cd8fe9baef1f37058826bbd2f88a7918e598fcc4e71605d3feee35bcdb94" }, "downloads": -1, "filename": "pyllage-0.1.1.zip", "has_sig": false, "md5_digest": "659cad1df2054a10e23038e8f8e98190", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 23296, "upload_time": "2013-04-01T09:32:06", "url": "https://files.pythonhosted.org/packages/0a/8e/21495aa050ad88b162bc17ae754b81475b1faa8a0a173e0e402ca5ac6a88/pyllage-0.1.1.zip" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "d34cb0c6040623b851fe54f6d0b4e4b5", "sha256": "b861e36e4a3b08e2fe95f7191e5344a07f1c7f05c172be49e2347334dff223f5" }, "downloads": -1, "filename": "pyllage-0.1.2.zip", "has_sig": false, "md5_digest": "d34cb0c6040623b851fe54f6d0b4e4b5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 36262, "upload_time": "2013-04-06T08:32:58", "url": "https://files.pythonhosted.org/packages/e0/6c/b1265b51a82ac674b17d58d264381130f258a7a0833854a2534829d88af6/pyllage-0.1.2.zip" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "d34cb0c6040623b851fe54f6d0b4e4b5", "sha256": "b861e36e4a3b08e2fe95f7191e5344a07f1c7f05c172be49e2347334dff223f5" }, "downloads": -1, "filename": "pyllage-0.1.2.zip", "has_sig": false, "md5_digest": "d34cb0c6040623b851fe54f6d0b4e4b5", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 36262, "upload_time": "2013-04-06T08:32:58", "url": "https://files.pythonhosted.org/packages/e0/6c/b1265b51a82ac674b17d58d264381130f258a7a0833854a2534829d88af6/pyllage-0.1.2.zip" } ] }