{ "info": { "author": "Joe Farro", "author_email": "joe@jf.io", "bugtrack_url": null, "classifiers": [ "Development Status :: 3 - Alpha", "Framework :: Scrapy", "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: MIT License", "Operating System :: OS Independent", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Programming Language :: Python :: 3.3", "Programming Language :: Python :: Implementation :: PyPy", "Topic :: Internet :: WWW/HTTP :: Indexing/Search", "Topic :: Software Development :: Libraries :: Python Modules", "Topic :: Text Processing", "Topic :: Text Processing :: Markup :: HTML", "Topic :: Text Processing :: Markup :: XML", "Topic :: Utilities" ], "description": "A DSL for extracting data from a web page. The DSL serves two purposes:\nit finds elements and extracts their text or attribute values. The main\nreason for developing this is to have all the CSS selectors for scraping\na site in one place (I prefer CSS selectors over anything else).\n\nThe DSL wraps `PyQuery`_.\n\nA few links:\n\n* `GitHub repository `_\n\n* `PyPI package `_\n\n* `Discussion group `_\n\nExample\n-------\n\nGiven the following take template:\n\n::\n\n    $ h1 | text\n        save: h1_title\n    $ ul\n        save each: uls\n            $ li\n                | 0 [title]\n                    save: title\n                | 1 text\n                    save: second_li\n    $ p | 1 text\n        save: p_text\n\nAnd the following HTML:\n\n.. code:: html\n\n 
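    <div>\n        <h1>Le Title 1</h1>\n        <p>Some body here</p>\n        <p>The second body here</p>\n        <ul id=\"a\">\n            <li title=\"a less than awesome title\">A first li</li>\n            <li>Second li in list #a</li>\n            <li>A third li</li>\n        </ul>\n        <ul id=\"b\">\n            <li title=\"some awesome title\">B first li</li>\n            <li>Second li in list #b</li>\n            <li>B third li</li>\n        </ul>\n    </div>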
\n\nThe following data will be extracted (presented in JSON format):\n\n.. code:: json\n\n    {\n        \"h1_title\": \"Le Title 1\",\n        \"p_text\": \"The second body here\",\n        \"uls\": [\n            {\n                \"title\": \"a less than awesome title\",\n                \"second_li\": \"Second li in list #a\"\n            },\n            {\n                \"title\": \"some awesome title\",\n                \"second_li\": \"Second li in list #b\"\n            }\n        ]\n    }\n\nTake templates always result in a single Python ``dict``.\n\nThe template can also be written in the following, more concise, syntax:\n\n::\n\n    $ h1 | text ;                   : h1_title\n    $ ul\n        save each                   : uls\n            $ li\n                | 0 [title] ;       : title\n                | 1 text ;          : second_li\n    $ p | 1 text ;                  : p_text\n\nThe example above is formatted with extra whitespace to make the structure\nof the resulting data more apparent.\n\nMore Examples\n^^^^^^^^^^^^^\n\nFor more complex examples:\n\n- Scraping the `reddit home page `_\n\n - `Inline version `_\n\n - `Verbose version `_\n\n- Scraping the latest `web-scraping questions `_ on Stack Overflow:\n\n - `Overview `_\n\n - `questions-listing.take `_\n\n - `question-page.take `_\n\nInstall\n-------\n\n.. code::\n\n    pip install take\n\n\nUsage\n-----\n\nCreating a Take Template\n^^^^^^^^^^^^^^^^^^^^^^^^\n\nA take template can be created from a file via the static method\n``TakeTemplate.from_file()``.\n\n.. code:: python\n\n    from take import TakeTemplate\n    tt = TakeTemplate.from_file('yourfile.take')\n\nThe ``TakeTemplate`` constructor can be used to create a template from either\na ``basestring`` or an ``Iterable``.\n\nTo create a template from a string:\n\n.. code:: python\n\n    from take import TakeTemplate\n    TMPL = \"\"\"\n    $ nav a\n        save each: nav\n            | text\n                save: text\n            | [href]\n                save: link\n    \"\"\"\n    tt = TakeTemplate(TMPL)\n\nAdditionally, a ``base_url`` keyword argument can be specified which\nwill cause relative URLs to be made absolute via the value of the\n``base_url`` parameter for any documents that are processed.\n\n.. 
code:: python\n\n    tt = TakeTemplate.from_file('yourfile.take', base_url='http://www.example.com')\n\n    tt = TakeTemplate(TMPL, base_url='http://www.example.com')\n\nIf a ``base_url`` is provided when the template is used, it will\noverride the ``base_url`` provided when the template was created. The\n``base_url`` parameter must be provided as a keyword argument.\n\nUsing a Take Template\n^^^^^^^^^^^^^^^^^^^^^\n\nTo parse from a URL:\n\n.. code:: python\n\n    data = tt(url='http://www.example.com')\n\nTo parse from an HTML string:\n\n.. code:: python\n\n    data = tt('
hello world
')\n\nTo parse from a file:\n\n.. code:: python\n\n data = tt(filename=path_to_html_file)\n\nAlternatively, the ``take()`` method can be used:\n\n.. code:: python\n\n data = tt.take(url='http://www.example.com')\n\nValid parameters for the template callable or the ``take()`` method are\nthe same as those for the `PyQuery constructor`_.\n\nAdditionally, if the ``'base_url'`` keyword parameter is supplied, all\nrelative URLs will be made absolute via the value of ``'base_url'``.\n\n.. code:: python\n\n data = tt(url='http://www.example.com', base_url='http://www.example.com')\n\nTake Templates\n--------------\n\nTake templates are whitespace sensitive and are comprised of three types\nof statements:\n\n- Comment Lines\n\n - ``# some comment``\n\n- Queries\n\n - ``$ h1``\n\n - ``| text``\n\n - ``$ h1 | 0 text``\n\n- Directives\n\n - ``save: h1_title``\n\n - ``save each: comments``\n\n - ``merge: *``\n\n - ``def: get comments``\n\nComment Lines\n-------------\n\nAny line with a ``#`` as the first non-whitespace character is considered a comment line.\n\n::\n\n # this line is a comment\n # the third line is a CSS selector query\n $ #main-nav a\n\nComment lines are completely ignored. Partially commented lines and multi-line comments are not supported at this time.\n\nQueries\n-------\n\nThere are two main types of queries in take templates:\n\n- CSS selector queries\n\n- Non-CSS selector queries\n\nThe reason they\u2019re divided like this is because CSS selectors always go\nfirst on the line and they can be followed by non-CSS selector queries.\nNon-CSS selector queries can\u2019t be followed by CSS selector queries.\nSeems easier to read this way, but it\u2019s arbitrary and may change.\n\nCSS Selector Queries\n^^^^^^^^^^^^^^^^^^^^\n\nCSS selector queries start with ``$`` and end either at the end of the\nline, the ``|`` character or the ``;`` character. 
The ``|`` character\nis the starting character for non-CSS selector queries, and the ``;``\ncharacter ends the statement and starts an `inline sub-context <#inline-sub-contexts>`_.\n\n- ``$ #siteTable .thing | text``\n- ``$ .domain a``\n\nIn the first example above, the CSS selector query is\n``#siteTable .thing``. The second is ``.domain a``.\n\nThe CSS selectors are passed to `PyQuery`_, so anything PyQuery can\naccept can be used. From what I understand, there are a few `bugs`_ in\nPyQuery (that may be in the underlying libraries `lxml`_ or\n`cssselect`_). Those will come up.\n\nNon-CSS Selector Queries\n^^^^^^^^^^^^^^^^^^^^^^^^\n\nNon-CSS selector queries start with ``|`` and continue until the ``;`` character or the\nline ends. There are five non-CSS selector queries:\n\n- **Element indexes**\n\n - Syntax: an integer\n\n - ``| 0`` will return the first element in the current context\n\n - ``| 1`` will return the second element in the current context\n\n - ``| -1`` will return the last element in the current context\n\n- **Attribute retrieval**\n\n - Syntax: ``[attr]``\n\n - ``| [href]`` will return the value of the ``href`` attribute of the\n first element in the current context\n\n - ``| 1 [href]`` will return the value of the ``href`` attribute of the\n second element in the current context\n\n- **Text retrieval**\n\n - Syntax: ``text``\n\n - ``| text`` will return the text of the current context\n\n - ``| 1 text`` will first get the second element in the current context\n and then return its text\n\n- **Own text retrieval**\n\n - Syntax: ``own_text``\n\n - ``| own_text`` will return the text of the current context without the text\n from its children\n\n - ``| 1 own_text`` will first get the second element in the current context\n and then return its text without the text from its children\n\n- **Field retrieval**\n\n - Syntax: ``.field_name``\n\n - ``| .description`` will do a dictionary lookup on the context and retrieve\n the value of the 
``'description'`` item\n\n - ``| .parent.child`` will do a dictionary lookup on the context and retrieve\n the value of the ``'parent'`` and then it will look up ``'child'`` on that value\n\n**Order matters**: Index queries should precede other queries. Also, only one\nof ``[attr]``, ``text``, ``own_text`` or ``.field_name`` queries can be used.\n\nIndentation\n-----------\n\nThe level of indentation on each line defines the context for the line.\n\nThe root context of a take template is the current document being\nprocessed. Every statement that is not indented is executed against the\ndocument being processed.\n\nEach line that is indented more deeply has a context that is the result\nof the last query in the parent context. For example:\n\n::\n\n    $ #some-id\n        $ li\n        $ div\n\nThe query on the first line is executed against the document being\nprocessed. The query on the second line is executed against the result\nof the first line. So, the second line is synonymous with\n``$ #some-id li``. The query on the third line is also executed against\nthe result of the first line. So, it can be re-written as\n``$ #some-id div``.\n\nAnother example:\n\n::\n\n    $ a\n        | 0\n            | text\n            | [href]\n\nThe third and fourth lines retrieve the text and href attribute,\nrespectively, from the first ``<a>`` in the document being processed.\nThis could be rewritten as:\n\n::\n\n    $ a | 0\n        | text\n        | [href]\n\nInline Sub Contexts\n^^^^^^^^^^^^^^^^^^^\n\nInline sub-contexts allow multiple statements per line. The syntax is:\n\n::\n\n    statement ; sub-context-statement\n\nThe main thing to note is: whatever comes after the semicolon is treated as if it were a line with deeper indentation.\n\nInline sub-contexts are primarily used with directives. For example, the following take template:\n\n::\n\n    $ h1 | 0 text\n        save: document_title\n\nCan be re-written as:\n\n::\n\n    $ h1 | 0 text ; save: document_title\n\nBoth templates save the text in the first ``<h1>`` element into the result ``dict`` with the key ``'document_title'``. More on `save directives <#save-directive>`_ later.\n\nDirectives\n----------\n\nDirectives are commands that are executed against the current context.\nTheir format is a directive name followed by an optional parameter list:\n\n::\n\n    <directive_name> [: <parameter>[ <parameter>]*]?\n\nAn example of a ``save`` directive:\n\n::\n\n    save : some_name\n\nNot all directives require parameters. For example, the ``shrink`` directive,\nwhich collapses whitespace, does not:\n\n::\n\n    shrink\n\nThe following directives are built-in:\n\n- ``save``, alias ``:``\n\n - Saves a value.\n\n- ``save each``\n\n - Creates a list of results.\n\n- ``namespace``, alias ``+``\n\n - Creates a child ``dict`` for saving values into.\n\n- ``shrink``\n\n - Collapses and trims whitespace.\n\n- ``def``\n\n - Defines a new directive. *Currently only new directives defined in the current document are available.*\n\n- ``merge``, alias ``>>``\n\n - Copies a value from a directive's result into the template's result.\n\nSave Directive\n^^^^^^^^^^^^^^\n\n*Alias:* ``:``\n\nSave directives save the context into the result ``dict``. These are\ngenerally only intended to be applied to the result of non-CSS Selector\nqueries.\n\nThe syntax is:\n\n::\n\n    save: <identifier>\n\n``:`` is an alias for ``save:``. So, a save directive can also be written as:\n\n::\n\n    : <identifier>\n\nThe identifier can contain anything except whitespace, a comma (``,``) or a semicolon (``;``).\nAlso, the identifier can contain dots (``.``), which designate sub-\\ ``dict``\\ s for\nsaving.\n\nFor example, the following take template:\n\n::\n\n    $ a | 0\n        | text\n            save: first_a.description\n        | [href]\n            save: first_a.url\n\nAnd the following HTML:\n\n.. code:: html\n\n    <a href=\"http://www.example.com\">fo sho</a>\n\nWill result in the following Python ``dict``:\n\n.. 
code:: python\n\n    {\n        'first_a': {\n            'description': 'fo sho',\n            'url': 'http://www.example.com'\n        }\n    }\n\nUsing the ``:`` alias, the template can be written as:\n\n::\n\n    $ a | 0\n        | text\n            : first_a.description\n        | [href]\n            : first_a.url\n\nOr, more succinctly:\n\n::\n\n    $ a | 0\n        | text ; : first_a.description\n        | [href] ; : first_a.url\n\nSave Each Directive\n^^^^^^^^^^^^^^^^^^^\n\nSave each directives produce a ``dict`` for each element in the context. Generally, these are used for repeating elements on a page. In the `reddit sample `_, a save each directive is used to save each of the reddit entries.\n\nThe syntax is:\n\n::\n\n    save each: <identifier>\n        <sub-context>\n\nThe identifier can contain anything except whitespace, a comma (``,``) or a semicolon (``;``).\nAlso, the identifier can contain dots (``.``), which designate sub-\\ ``dict``\\ s for\nsaving.\n\nSave each directives apply the next sub-context to each of the elements\nof their context value. Put another way, save each directives repeatedly\nprocess each element of their context.\n\nFor example, in the following take template, the ``| text`` and\n``| [href]`` queries (along with saving the results) will be applied to\nevery ``<a>`` in the document.\n\n::\n\n    $ a\n        save each: anchors\n            | text\n                save: description\n            | [href]\n                save: url\n\nApplying the above take template to the following HTML:\n\n.. code:: html\n\n    <a href=\"http://www.example.com\">fo sho</a>\n    <a href=\"http://www.another.com\">psych out</a>\n\nWill result in the following Python ``dict``:\n\n.. 
code:: python\n\n    {\n        'anchors': [\n            {\n                'description': 'fo sho',\n                'url': 'http://www.example.com'\n            },\n            {\n                'description': 'psych out',\n                'url': 'http://www.another.com'\n            }\n        ]\n    }\n\nNamespace Directive\n^^^^^^^^^^^^^^^^^^^\n\n*Alias:* ``+``\n\nNamespace directives create a sub-``dict`` on the current result-value and everything in the\nnext sub-context is saved into the new ``dict``.\n\nThe syntax is:\n\n::\n\n    namespace: <identifier>\n        <sub-context>\n\n``<identifier>`` is the key the sub-``dict`` is saved as.\n\nAn example:\n\n::\n\n    $ a | 0\n        namespace: first_a\n            | text\n                save: description\n            | [href]\n                save: url\n\nApplying the above take template to the following HTML:\n\n.. code:: html\n\n    <a href=\"http://www.example.com\">fo sho</a>\n\nWill result in the following Python ``dict``:\n\n.. code:: python\n\n    {\n        'first_a': {\n            'description': 'fo sho',\n            'url': 'http://www.example.com'\n        }\n    }\n\nThe ``description`` and ``url`` fields are saved in the ``first_a`` namespace. This reduces\nthe need for save directives like: ``first_a.description``.\n\n``+`` is an alias for the ``namespace`` directive. So, the template above can also be written as:\n\n::\n\n    $ a | 0\n        + : first_a\n            | text\n                save: description\n            | [href]\n                save: url\n\nOr, more succinctly, using inline sub-contexts and the ``:`` alias for save:\n\n::\n\n    $ a | 0 ; + : first_a\n        | text ; : description\n        | [href] ; : url\n\n\n\nShrink Directive\n^^^^^^^^^^^^^^^^\n\nThe ``shrink`` directive trims and collapses whitespace from text. It doesn't take any parameters,\nso the usage is just the word ``shrink``:\n\n::\n\n    $ p | text ; : with_spacing\n    $ p | text ; shrink ; : shrink_on_text\n\nIf applied to an element, it will be applied to the element's text.\n\n::\n\n    $ p ; shrink ; : shrink_on_elem\n\nApplying the above statements to the following HTML:\n\n.. code:: html\n\n

    <p>Hello World!</p>

\n\nWill result in the following Python ``dict``:\n\n.. code:: python\n\n    {\n        'with_spacing': 'Hello World!',\n        'shrink_on_text': 'Hello World!',\n        'shrink_on_elem': 'Hello World!'\n    }\n\nDef Directive\n^^^^^^^^^^^^^\n\nThe ``def`` directive saves a sub-context as a custom directive which can be invoked later. This is a\nway to re-use sections of a take template. Directives created in this fashion **always result in a new**\n``dict``.\n\nThe syntax is:\n\n::\n\n    def: <identifier>\n        <sub-context>\n\nFor example:\n\n::\n\n    def: get first link\n        $ a | 0\n            | text ; : description\n            | [href] ; : url\n\nIn the above template, a new directive named ``get first link`` is created. The new directive saves\nthe text and href attribute from the first ``<a>`` element in the context on which it is\ninvoked. The directive will always result in a new ``dict`` containing ``description`` and\n``url`` keys.\n\nThe identifier can contain spaces; runs of spaces are collapsed to a single space,\ne.g. ``def: some   name`` is collapsed to ``def: some name``.\n\nDirectives created by ``def`` are invoked without parameters.\n\nThe example below defines a custom directive and applies it against the first ``