PK!/R HISTORY.rstHistory ======= 1.0.1 (2019-02-07) ------------------ - Accept both .yaml and .yml as valid YAML file extensions. - Documentation fixes. 1.0 (2018-05-25) ---------------- - Bumped version to 1.0. 1.0b7 (2018-03-21) ------------------ - Dropped support for Python 3.3. - Fixes for handling Unicode data in HTML for Python 2. - Added registry for preprocessors. 1.0b6 (2018-01-17) ------------------ - Support for writing specifications in YAML. 1.0b5 (2018-01-16) ------------------ - Added a class-based API for writing specifications. - Added predefined transformation functions. - Removed callables from specification maps. Use the new API instead. - Added support for registering new reducers and transformers. - Added support for defining sections in document. - Refactored XPath evaluation method in order to parse path expressions once. - Preprocessing will be done only once when the tree is built. - Concatenation is now the default reducing operation. 1.0b4 (2018-01-02) ------------------ - Added "--version" option to command line arguments. - Added option to force the use of lxml's HTML builder. - Fixed the error where non-truthy values would be excluded from the result. - Added support for transforming node text during preprocess. - Added separate preprocessing function to API. - Renamed the "join" reducer as "concat". - Renamed the "foreach" keyword for keys as "section". - Removed some low level debug messages to substantially increase speed. 1.0b3 (2017-07-25) ------------------ - Removed the caching feature. 1.0b2 (2017-06-16) ------------------ - Added helper function for getting cache hash keys of URLs. 1.0b1 (2017-04-26) ------------------ - Added optional value transformations. - Added support for custom reducer callables. - Added command-line option for scraping documents from local files. 1.0a2 (2017-04-04) ------------------ - Added support for Python 2.7. - Fixed lxml support. 1.0a1 (2016-08-24) ------------------ - First release on PyPI. PK!׼|aa docs/Makefile# Minimal makefile for Sphinx documentation # # You can set these variables from the command line. SPHINXOPTS = SPHINXBUILD = sphinx-build SPHINXPROJ = piculet SOURCEDIR = source BUILDDIR = build # Put it first so that "make" without argument is like "make help". help: @$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) .PHONY: help Makefile # Catch-all target: route all unknown targets to Sphinx using the new # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS). %: Makefile @$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O) PK!docs/source/_static/custom.cssPK!JDDdocs/source/api.rstAPI === .. automodule:: piculet :members: :show-inheritance: PK!M""docs/source/conf.pyimport sys import os # If extensions (or modules to document with autodoc) are in another directory, # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. sys.path.insert(0, os.path.abspath('..')) # -- General configuration ------------------------------------------------ # If your documentation needs a minimal Sphinx version, state it here. # needs_sphinx = '1.0' # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. extensions = [ 'sphinx.ext.autodoc', 'pygenstub' ] # Add any paths that contain templates here, relative to this directory. 
templates_path = ['_templates'] # The suffix(es) of source filenames. # You can specify multiple suffix as a list of string: # source_suffix = ['.rst', '.md'] source_suffix = '.rst' # The encoding of source files. # source_encoding = 'utf-8-sig' # The master toctree document. master_doc = 'index' # General information about the project. project = 'Piculet' copyright = '2016-2018, H. Turgut Uyar' author = 'H. Turgut Uyar' # The version info for the project you're documenting, acts as replacement for # |version| and |release|, also used in various other places throughout the # built documents. # # The short X.Y version. version = '1.0' # The full version, including alpha/beta/rc tags. release = '1.0' # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. # # This is also used if you do content translation via gettext catalogs. # Usually you set "language" from the command line for these cases. language = None # There are two options for replacing |today|: either, you set today to some # non-false value, then it is used: # today = '' # Else, today_fmt is used as the format for a strftime call. # today_fmt = '%B %d, %Y' # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. exclude_patterns = ['_build'] # The reST default role (used for this markup: `text`) to use for all # documents. # default_role = None # If true, '()' will be appended to :func: etc. cross-reference text. # add_function_parentheses = True # If true, the current module name will be prepended to all description # unit titles (such as .. function::). # add_module_names = True # If true, sectionauthor and moduleauthor directives will be shown in the # output. They are ignored by default. # show_authors = False # The name of the Pygments (syntax highlighting) style to use. pygments_style = 'sphinx' # A list of ignored prefixes for module index sorting. modindex_common_prefix = ['piculet.'] # If true, keep warnings as "system message" paragraphs in the built documents. # keep_warnings = False # If true, `todo` and `todoList` produce output, else they produce nothing. todo_include_todos = False # -- Options for HTML output ---------------------------------------------- # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. html_theme = 'sphinx_rtd_theme' # html_style = 'custom.css' # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. # html_theme_options = {} # Add any paths that contain custom themes here, relative to this directory. # html_theme_path = [] # The name for this set of Sphinx documents. If None, it defaults to # " v documentation". # html_title = None # A shorter title for the navigation bar. Default is the same as html_title. # html_short_title = None # The name of an image file (relative to this directory) to place at the top # of the sidebar. # html_logo = None # The name of an image file (within the static path) to use as favicon of the # docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32 # pixels large. # html_favicon = None # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". 
html_static_path = ['_static'] # Add any extra paths that contain custom files (such as robots.txt or # .htaccess) here, relative to this directory. These files are copied # directly to the root of the documentation. # html_extra_path = [] # If not '', a 'Last updated on:' timestamp is inserted at every page bottom, # using the given strftime format. # html_last_updated_fmt = '%b %d, %Y' # If true, SmartyPants will be used to convert quotes and dashes to # typographically correct entities. # html_use_smartypants = True # Custom sidebar templates, maps document names to template names. # html_sidebars = {} # Additional templates that should be rendered to pages, maps page names to # template names. # html_additional_pages = {} # If false, no module index is generated. # html_domain_indices = True # If false, no index is generated. # html_use_index = True # If true, the index is split into individual pages for each letter. # html_split_index = False # If true, links to the reST sources are added to the pages. # html_show_sourcelink = True # If true, "Created using Sphinx" is shown in the HTML footer. Default is True. # html_show_sphinx = True # If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. # html_show_copyright = True # If true, an OpenSearch description file will be output, and all pages will # contain a tag referring to it. The value of this option must be the # base URL from which the finished HTML is served. # html_use_opensearch = '' # This is the file name suffix for HTML files (e.g. ".xhtml"). # html_file_suffix = None # Language to be used for generating the HTML full-text search index. # Sphinx supports the following languages: # 'da', 'de', 'en', 'es', 'fi', 'fr', 'h', 'it', 'ja' # 'nl', 'no', 'pt', 'ro', 'r', 'sv', 'tr' # html_search_language = 'en' # A dictionary with options for the search language support, empty by default. # Now only 'ja' uses this config value # html_search_options = {'type': 'default'} # The name of a javascript file (relative to the configuration directory) that # implements a search results scorer. If empty, the default will be used. # html_search_scorer = 'scorer.js' # Output file base name for HTML help builder. htmlhelp_basename = 'piculetdoc' # -- Options for LaTeX output --------------------------------------------- latex_elements = { # The paper size ('letterpaper' or 'a4paper'). 'papersize': 'a4paper', # The font size ('10pt', '11pt' or '12pt'). # 'pointsize': '10pt', # Additional stuff for the LaTeX preamble. # 'preamble': '', # Latex figure (float) alignment # 'figure_align': 'htbp', } # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, # author, documentclass [howto, manual, or own class]). latex_documents = [ (master_doc, 'piculet.tex', 'Piculet Documentation', 'H. Turgut Uyar', 'manual'), ] # The name of an image file (relative to this directory) to place at the top of # the title page. # latex_logo = None # For "manual" documents, if this is true, then toplevel headings are parts, # not chapters. # latex_use_parts = False # If true, show page references after internal links. # latex_show_pagerefs = False # If true, show URL addresses after external links. # latex_show_urls = False # Documents to append as an appendix to all manuals. # latex_appendices = [] # If false, no module index is generated. # latex_domain_indices = True # -- Options for manual page output --------------------------------------- # One entry per manual page. 
List of tuples # (source start file, name, description, authors, manual section). man_pages = [ (master_doc, 'piculet', 'Piculet Documentation', [author], 1) ] # If true, show URL addresses after external links. # man_show_urls = False # -- Options for Texinfo output ------------------------------------------- # Grouping the document tree into Texinfo files. List of tuples # (source start file, target name, title, author, # dir menu entry, description, category) texinfo_documents = [ (master_doc, 'Piculet', 'Piculet Documentation', author, 'Piculet', 'XML/HTML scraper using XPath queries.', 'Miscellaneous'), ] # Documents to append as an appendix to all manuals. # texinfo_appendices = [] # If false, no module index is generated. # texinfo_domain_indices = True # How to display URL addresses: 'footnote', 'no', or 'inline'. # texinfo_show_urls = 'footnote' # If true, do not generate a @detailmenu in the "Top" node's menu. # texinfo_no_detailmenu = False PK!m5DDdocs/source/extract.rstData extraction =============== This section explains how to write the specification for extracting data from a document. We'll scrape the following HTML content for the movie "The Shining" in our examples: .. literalinclude:: ../../examples/shining.html :language: html Instead of the :func:`scrape_document ` function that reads the content and the specification from files, we'll use the :func:`scrape ` function that works directly on the content and the specification map: .. code-block:: python >>> from piculet import scrape Assuming the HTML document above is saved as :file:`shining.html`, let's get its content: .. code-block:: python >>> with open("shining.html") as f: ... document = f.read() The :func:`scrape ` function assumes that the document is in XML format. So if any conversion is needed, it has to be done before calling this function. [#xhtml]_ After building the DOM tree, the function will apply the extraction rules to the root element of the tree, and return a mapping where each item is generated by one of the rules. .. note:: Piculet uses the `ElementTree`_ module for building and querying XML trees. However, it will make use of the `lxml`_ package if it's installed. The :func:`scrape ` function takes an optional ``lxml_html`` parameter which will use the HTML builder from the lxml package, thereby building the tree without converting HTML into XML first. The specification mapping contains two keys: the ``pre`` key is for specifying the preprocessing operations (these will be covered in the next section), and the ``items`` key is for specifying the rules that describe how to extract the data: .. code-block:: python spec = {"pre": [...], "items": [...]} The items list contains item mappings, where each item has a ``key`` and a ``value`` description. The key specifies the key for the item in the output mapping and the value specifies how to extract the data to set as the value for that item. Typically, a value specifier consists of a path query and a reducing function. The query is applied to the root and a list of strings is obtained. Then, the reducing function converts this list into a single string. [#reducing]_ For example, to get the title of the movie from the example document, we can write: >>> spec = { ... "items": [ ... { ... "key": "title", ... "value": { ... "path": "//title/text()", ... "reduce": "first" ... } ... } ... ] ... 
} >>> scrape(document, spec) {'title': 'The Shining'} The ``.//title/text()`` path generates the list ``['The Shining']`` and the reducing function ``first`` selects the first element from that list. .. note:: By default, the XPath queries are limited by `what ElementTree supports`_ (plus the ``text()`` and ``@attr`` clauses which are added by Piculet). However, if the `lxml`_ package is installed a `much wider range of XPath constructs`_ can be used. Multiple items can be collected in a single invocation: >>> spec = { ... "items": [ ... { ... "key": "title", ... "value": { ... "path": "//title/text()", ... "reduce": "first" ... } ... }, ... { ... "key": "year", ... "value": { ... "path": '//span[@class="year"]/text()', ... "reduce": "first" ... } ... } ... ] ... } >>> scrape(document, spec) {'title': 'The Shining', 'year': '1980'} If a path doesn't match any element in the tree, the item will be excluded from the output. Note that in the following example, the "foo" key doesn't get included: >>> spec = { ... "items": [ ... { ... "key": "title", ... "value": { ... "path": "//title/text()", ... "reduce": "first" ... } ... }, ... { ... "key": "foo", ... "value": { ... "path": "//foo/text()", ... "reduce": "first" ... } ... } ... ] ... } >>> scrape(document, spec) {'title': 'The Shining'} Reducing -------- Piculet contains a few predefined reducing functions. Other than the ``first`` reducer used in the examples above, a very common reducer is ``concat`` which will concatenate the selected strings: >>> spec = { ... "items": [ ... { ... "key": "full_title", ... "value": { ... "path": "//h1//text()", ... "reduce": "concat" ... } ... } ... ] ... } >>> scrape(document, spec) {'full_title': 'The Shining (1980)'} ``concat`` is the default reducer, i.e. if no reducer is given, the strings will be concatenated: >>> spec = { ... "items": [ ... { ... "key": "full_title", ... "value": { ... "path": "//h1//text()" ... } ... } ... ] ... } >>> scrape(document, spec) {'full_title': 'The Shining (1980)'} If you want to get rid of extra whitespace, you can use the ``clean`` reducer. After concatenating the strings, this will remove leading and trailing whitespace and replace multiple whitespace with a single space: >>> spec = { ... "items": [ ... { ... "key": "review", ... "value": { ... "path": '//div[@class="review"]//text()', ... "reduce": "clean" ... } ... } ... ] ... } >>> scrape(document, spec) {'review': 'Fantastic movie. Definitely recommended.'} In this example, the ``concat`` reducer would have produced the value ``'\n Fantastic movie.\n Definitely recommended.\n '`` As explained above, if a path query doesn't match any element, the item gets automatically excluded. That means, Piculet doesn't try to apply the reducing function on the result of the path query if it's an empty list. Therefore, reducing functions can safely assume that the path result is a non-empty list. If you want to use a custom reducer, you have to register it first. The name for the specifier (the first parameter) has to be a valid Python identifier. .. code-block:: python >>> from piculet import reducers >>> reducers.register("second", lambda x: x[1]) >>> spec = { ... "items": [ ... { ... "key": "year", ... "value": { ... "path": "//h1//text()", ... "reduce": "second" ... } ... } ... ] ... } >>> scrape(document, spec) {'year': '1980'} Transforming ------------ After the reduction operation, you can apply a transformation to the resulting string. 
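The reduce-then-transform pipeline is applied for you whenever a spec is evaluated, but the registered callables can also be invoked directly, which is handy for checking what a particular reducer or transformer does. A minimal sketch (the ``texts`` list below is only an illustration of what a path query might return):

.. code-block:: python

   >>> from piculet import reducers, transformers
   >>> texts = ["The Shining", " (", "1980", ")"]
   >>> reducers.concat(texts)          # the default reducer joins the strings
   'The Shining (1980)'
   >>> reducers.first(texts)           # "first" picks the first string
   'The Shining'
   >>> transformers.int(reducers.first(["1980"]))   # reduce, then transform
   1980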
A transformation function must take a string as parameter and can return any value of any type. Piculet contains several predefined transformers: ``int``, ``float``, ``bool``, ``len``, ``lower``, ``upper``, ``capitalize``. For example, to get the year of the movie as an integer: >>> spec = { ... "items": [ ... { ... "key": "year", ... "value": { ... "path": '//span[@class="year"]/text()', ... "reduce": "first", ... "transform": "int" ... } ... } ... ] ... } >>> scrape(document, spec) {'year': 1980} If you want to use a custom transformer, you have to register it first: .. code-block:: python >>> from piculet import transformers >>> transformers.register("year25", lambda x: int(x) + 25) >>> spec = { ... "items": [ ... { ... "key": "25th_year", ... "value": { ... "path": '//span[@class="year"]/text()', ... "reduce": "first", ... "transform": "year25" ... } ... } ... ] ... } >>> scrape(document, spec) {'25th_year': 2005} Multi-valued items ------------------ Data with multiple values can be created by using a ``foreach`` key in the value specifier. This is a path expression to select elements from the tree. [#multivalued]_ The path and reducing function will be applied *to each selected element* and the obtained values will be the members of the resulting list. For example, to get the genres of the movie, we can write: >>> spec = { ... "items": [ ... { ... "key": "genres", ... "value": { ... "foreach": '//ul[@class="genres"]/li', ... "path": "./text()", ... "reduce": "first" ... } ... } ... ] ... } >>> scrape(document, spec) {'genres': ['Horror', 'Drama']} If the ``foreach`` key doesn't match any element the item will be excluded from the result: >>> spec = { ... "items": [ ... { ... "key": "foos", ... "value": { ... "foreach": '//ul[@class="foos"]/li', ... "path": "./text()", ... "reduce": "first" ... } ... } ... ] ... } >>> scrape(document, spec) {} If a transformation is specified, it will be applied to every element in the resulting list: >>> spec = { ... "items": [ ... { ... "key": "genres", ... "value": { ... "foreach": '//ul[@class="genres"]/li', ... "path": "./text()", ... "reduce": "first", ... "transform": "lower" ... } ... } ... ] ... } >>> scrape(document, spec) {'genres': ['horror', 'drama']} Subrules -------- Nested structures can be created by writing subrules as value specifiers. If the value specifier is a mapping that contains an ``items`` key, then this will be interpreted as a subrule and the generated mapping will be the value for the key. >>> spec = { ... "items": [ ... { ... "key": "director", ... "value": { ... "items": [ ... { ... "key": "name", ... "value": { ... "path": '//div[@class="director"]//a/text()', ... "reduce": "first" ... } ... }, ... { ... "key": "link", ... "value": { ... "path": '//div[@class="director"]//a/@href', ... "reduce": "first" ... } ... } ... ] ... } ... } ... ] ... } >>> scrape(document, spec) {'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}} Subrules can be combined with lists: >>> spec = { ... "items": [ ... { ... "key": "cast", ... "value": { ... "foreach": '//table[@class="cast"]/tr', ... "items": [ ... { ... "key": "name", ... "value": { ... "path": "./td[1]/a/text()", ... "reduce": "first" ... } ... }, ... { ... "key": "link", ... "value": { ... "path": "./td[1]/a/@href", ... "reduce": "first" ... } ... }, ... { ... "key": "character", ... "value": { ... "path": "./td[2]/text()", ... "reduce": "first" ... } ... } ... ] ... } ... } ... ] ... 
} >>> scrape(document, spec) {'cast': [{'character': 'Jack Torrance', 'link': '/people/2', 'name': 'Jack Nicholson'}, {'character': 'Wendy Torrance', 'link': '/people/3', 'name': 'Shelley Duvall'}]} Items generated by subrules can also be transformed. The transformation function is always applied as the last step in a "value" definition. But transformers for subitems take mappings (as opposed to strings) as parameter. >>> transformers.register("stars", lambda x: "%(name)s as %(character)s" % x) >>> spec = { ... "items": [ ... { ... "key": "cast", ... "value": { ... "foreach": '//table[@class="cast"]/tr', ... "items": [ ... { ... "key": "name", ... "value": { ... "path": "./td[1]/a/text()", ... "reduce": "first" ... } ... }, ... { ... "key": "character", ... "value": { ... "path": "./td[2]/text()", ... "reduce": "first" ... } ... } ... ], ... "transform": "stars" ... } ... } ... ] ... } >>> scrape(document, spec) {'cast': ['Jack Nicholson as Jack Torrance', 'Shelley Duvall as Wendy Torrance']} Generating keys from content ---------------------------- You can generate items where the key value also comes from the content. For example, consider how you would get the runtime and the language of the movie. Instead of writing multiple items for each ``h3`` element under an "info" class ``div``, we can write only one item that will select these divs and use the h3 text as the key. These elements can be selected using ``foreach`` specifications in the items. This will cause a new item to be generated for each selected element. To get the key value, we can use paths, reducers -and also transformers- that will be applied to the selected element: >>> spec = { ... "items": [ ... { ... "foreach": '//div[@class="info"]', ... "key": { ... "path": "./h3/text()", ... "reduce": "first" ... }, ... "value": { ... "path": "./p/text()", ... "reduce": "first" ... } ... } ... ] ... } >>> scrape(document, spec) {'Language:': 'English', 'Runtime:': '144 minutes'} The ``normalize`` reducer concatenates the strings, converts it to lowercase, replaces spaces with underscores and strips other non-alphanumeric characters: >>> spec = { ... "items": [ ... { ... "foreach": '//div[@class="info"]', ... "key": { ... "path": "./h3/text()", ... "reduce": "normalize" ... }, ... "value": { ... "path": "./p/text()", ... "reduce": "first" ... } ... } ... ] ... } >>> scrape(document, spec) {'language': 'English', 'runtime': '144 minutes'} You could also give a string instead of a path and reducer for the key. In this case, the elements would still be traversed; only the last one would set the final value for the item. This could be OK if you are sure that there is only one element that matches the ``foreach`` path of the key. Sections -------- The specification also provides the ability to define sections within the document. An element can be selected as the root of a section such that the XPath queries in that section will be relative to that root. This can be used to make XPath expressions shorter and also constrain the search in the tree. For example, the "director" example above can also be written using sections: .. code-block:: python >>> spec = { ... "section": '//div[@class="director"]//a', ... "items": [ ... { ... "key": "director", ... "value": { ... "items": [ ... { ... "key": "name", ... "value": { ... "path": "./text()", ... "reduce": "first" ... } ... }, ... { ... "key": "link", ... "value": { ... "path": "./@href", ... "reduce": "first" ... } ... } ... ] ... } ... } ... ] ... 
} >>> scrape(document, spec) {'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}} .. [#xhtml] Note that the example document is already in XML format. .. [#reducing] This means that the query has to end with either ``text()`` or some attribute value as in ``@attr``. And the reducing function should be implemented so that it takes a list of strings and returns a string. .. [#multivalued] This implies that the ``foreach`` query should **not** end in ``text()`` or ``@attr``. .. _ElementTree: https://docs.python.org/3/library/xml.etree.elementtree.html .. _what ElementTree supports: https://docs.python.org/3/library/xml.etree.elementtree.html#xpath-support .. _lxml: http://lxml.de/ .. _much wider range of XPath constructs: http://lxml.de/xpathxslt.html#xpath PK!Ldocs/source/history.rst.. include:: ../../HISTORY.rst PK!Edocs/source/index.rstPiculet ======= .. include:: ../../README.rst Contents ======== .. toctree:: :maxdepth: 2 overview extract preprocess low-level api history Indices and Tables ================== * :ref:`genindex` * :ref:`search` PK!m docs/source/low-level.rst Lower-level functions ===================== Piculet also provides a lower-level API where you can run the stages separately. For example, if the same document will be scraped multiple times with different rules, calling the ``scrape`` function repeatedly will cause the document to be parsed into a DOM tree repeatedly. Instead, you can create the DOM tree once and run extraction rules against this tree multiple times. Also, this API uses classes to express the specification and therefore development tools can help better in writing the rules by showing error indicators and suggesting autocompletions. Building the tree ----------------- The DOM tree can be created from the document using the :func:`build_tree ` function: .. code-block:: python >>> from piculet import build_tree >>> root = build_tree(document) If the document needs to be converted from HTML to XML, you can use the :func:`html_to_xhtml ` function: .. code-block:: python >>> from piculet import html_to_xhtml >>> converted = html_to_xhtml(document) >>> root = build_tree(converted) If lxml is available, you can use the ``lxml_html`` parameter for building the tree without converting an HTML document into XHTML: .. code-block:: python >>> root = build_tree(document, lxml_html=True) .. note:: Note that if you use the lxml.html builder, there might be differences about how the tree is built compared to the piculet conversion method and the path queries for preprocessing and extraction might need changes. Preprocessing ------------- The tree can be modified using the :func:`preprocess ` function: .. code-block:: python >>> from piculet import preprocess >>> ops = [{"op": "remove", "path": '//div[class="ad"]'}] >>> preprocess(root, ops) Data extraction --------------- The class-based API to data extraction has a one-to-one correspondance with the specification mapping. A :class:`Rule ` object corresponds to a key-value pair in the items list. Its value is produced by an ``extractor``. In the simple case, an extractor is a :class:`Path ` object which is a combination of a path, a reducer, and a transformer. .. code-block:: python >>> from piculet import Path, Rule, reducers, transformers >>> extractor = Path('//span[@class="year"]/text()', ... reduce=reducers.first, ... 
transform=transformers.int) >>> rule = Rule(key="year", extractor=extractor) >>> rule.extract(root) {'year': 1980} An extractor can have a ``foreach`` attribute if it will be multi-valued: .. code-block:: python >>> extractor = Path(foreach='//ul[@class="genres"]/li', ... path="./text()", ... reduce=reducers.first, ... transform=transformers.lower) >>> rule = Rule(key="genres", extractor=extractor) >>> rule.extract(root) {'genres': ['horror', 'drama']} The ``key`` attribute of a rule can be an extractor in which case it can be used to extract the key value from content. A rule can also have a ``foreach`` attribute for generating multiple items in one rule. These features will work as they are described in the data extraction section. A :class:`Rules ` object contains a collection of rule objects and it corresponds to the "items" part in the specification mapping. It acts both as the top level extractor that gets applied to the root of the tree, and also as an extractor for any rule with subrules. .. code-block:: python >>> from piculet import Rules >>> rules = [Rule(key="title", ... extractor=Path("//title/text()")), ... Rule(key="year", ... extractor=Path('//span[@class="year"]/text()', ... transform=transformers.int))] >>> Rules(rules).extract(root) {'title': 'The Shining', 'year': 1980} A more complete example with transformations is below. Again note that, the specification is exactly the same as given in the corresponding mapping example in the data extraction chapter. .. code-block:: python >>> rules = [ ... Rule(key="cast", ... extractor=Rules( ... foreach='//table[@class="cast"]/tr', ... rules=[ ... Rule(key="name", ... extractor=Path("./td[1]/a/text()")), ... Rule(key="character", ... extractor=Path("./td[2]/text()")) ... ], ... transform=lambda x: "%(name)s as %(character)s" % x ... )) ... ] >>> Rules(rules).extract(root) {'cast': ['Jack Nicholson as Jack Torrance', 'Shelley Duvall as Wendy Torrance']} A rules object can have a ``section`` attribute as described in the data extraction chapter: .. code-block:: python >>> rules = [ ... Rule(key="director", ... extractor=Rules( ... section='//div[@class="director"]//a', ... rules=[ ... Rule(key="name", ... extractor=Path("./text()")), ... Rule(key="link", ... extractor=Path("./@href")) ... ])) ... ] >>> Rules(rules).extract(root) {'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}} PK!b2docs/source/overview.rstOverview ======== Scraping a document consists of three stages: #. Building a DOM tree out of the document. This is a straightforward operation for an XML document. For an HTML document, Piculet will first try to convert it into XHTML and then build the tree from that. #. Preprocessing the tree. This is an optional stage. In some cases it might be helpful to do some changes on the tree to simplify the extraction process. #. Extracting data out of the tree. The preprocessing and extraction stages are expressed as part of a scraping specification. The specification is a mapping which can be stored in a file format that can represent a mapping, such as JSON or YAML. Details about the specification are given in later chapters. Command Line Interface ---------------------- Installing Piculet creates a script named ``piculet`` which can be used to invoke the command line interface:: $ piculet -h usage: piculet [-h] [--debug] command ... 
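The same argument parser is also importable, so the command line behaviour can be reproduced from Python if needed. Below is a minimal sketch using the :func:`make_parser <piculet.make_parser>` function; the ``movie.json`` and ``shining.html`` names stand for whatever spec and document files you actually have, and, like the command itself, this prints the extracted data as JSON rather than returning it:

.. code-block:: python

   from piculet import make_parser

   parser = make_parser(prog="piculet")
   args = parser.parse_args(["scrape", "-s", "movie.json", "shining.html"])
   args.func(args)   # dispatches to the handler registered for the chosen subcommand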
The ``scrape`` command extracts data out of a document as described by a specification file:: $ piculet scrape -h usage: piculet scrape [-h] -s SPEC [--html] document The location of the document can be given as a file path or a URL. For example, say you want to extract some data from the file `shining.html`_. An example specification is given in `movie.json`_. Download both of these files and run the command:: $ piculet scrape -s movie.json shining.html This should print the following output:: { "cast": [ { "character": "Jack Torrance", "link": "/people/2", "name": "Jack Nicholson" }, { "character": "Wendy Torrance", "link": "/people/3", "name": "Shelley Duvall" } ], "director": { "link": "/people/1", "name": "Stanley Kubrick" }, "genres": [ "Horror", "Drama" ], "language": "English", "review": "Fantastic movie. Definitely recommended.", "runtime": "144 minutes", "title": "The Shining", "year": 1980 } For HTML documents, the ``--html`` option has to be used. If the document address starts with ``http://`` or ``https://``, the content will be taken from the given URL. For example, to extract some data from the Wikipedia page for `David Bowie`_, download the `wikipedia.json`_ file and run the command:: piculet scrape -s wikipedia.json --html "https://en.wikipedia.org/wiki/David_Bowie" This should print the following output:: { "birthplace": "Brixton, London, England", "born": "1947-01-08", "name": "David Bowie", "occupation": [ "Singer", "songwriter", "actor" ] } In the same command, change the name part of the URL to ``Merlene_Ottey`` and you will get similar data for `Merlene Ottey`_. Note that since the markup used in Wikipedia pages for persons varies, the kinds of data you get with this specification will also vary. Piculet can be used as a simplistic HTML to XHTML convertor by invoking it with the ``h2x`` command. This command takes the file name as input and prints the converted content, as in ``piculet h2x foo.html``. If the input file name is given as ``-`` it will read the content from the standard input and therefore can be used as part of a pipe: ``cat foo.html | piculet h2x -`` Using in programs ----------------- The scraping operation can also be invoked programmatically using the :func:`scrape_document ` function. Note that this function prints its output and doesn't return anything: .. code-block:: python from piculet import scrape_document url = "https://en.wikipedia.org/wiki/David_Bowie" spec = "wikipedia.json" scrape_document(url, spec, content_format="html") YAML support ------------ To use YAML for specification, Piculet has to be installed with YAML support:: pip install piculet[yaml] Note that this will install an external module for parsing YAML files, and therefore will not be contained to the standard library anymore. The YAML version of the configuration example above can be found in `movie.yaml`_. .. _shining.html: https://github.com/uyar/piculet/blob/master/examples/shining.html .. _movie.json: https://github.com/uyar/piculet/blob/master/examples/movie.json .. _movie.yaml: https://github.com/uyar/piculet/blob/master/examples/movie.yaml .. _wikipedia.json: https://github.com/uyar/piculet/blob/master/examples/wikipedia.json .. _David Bowie: https://en.wikipedia.org/wiki/David_Bowie .. _Merlene Ottey: https://en.wikipedia.org/wiki/Merlene_Ottey PK! UQQdocs/source/preprocess.rstPreprocessing ============= Other than extraction rules, specifications can also contain preprocessing operations which allow modifications on the tree before starting data extraction. 
Such operations can be needed to make data extraction simpler or to remove the need for some postprocessing operations on the collected data. The syntax for writing preprocessing operations is as follows: .. code-block:: python rules = { "pre": [ { "op": "...", ... }, { "op": "...", ... } ], "items": [ ... ] } Every preprocessing operation item has a name which is given as the value of the "op" key. The other items in the mapping are specific to the operation. The operations are applied in the order as they are written in the operations list. The predefined preprocessing operations are explained below. Removing elements ----------------- This operation removes from the tree all the elements (and its subtree) that are selected by a given XPath query: .. code-block:: python {"op": "remove", "path": "..."} Setting element attributes -------------------------- This operation selects all elements by a given XPath query and sets an attribute for these elements to a given value: .. code-block:: python {"op": "set_attr", "path": "...", "name": "...", "value": "..."} The attribute "name" can be a literal string or an extractor as described in the data extraction chapter. Similarly, the attribute "value" can be given as a literal string or an extractor. Setting element text -------------------- This operation selects all elements by a given XPath query and sets their texts to a given value: .. code-block:: python {"op": "set_text", "path": "...", "text": "..."} The "text" can be a literal string or an extractor. PK!VN88 piculet.py# Copyright (C) 2014-2019 H. Turgut Uyar # # Piculet is free software: you can redistribute it and/or modify # it under the terms of the GNU Lesser General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # Piculet is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU Lesser General Public License for more details. # # You should have received a copy of the GNU Lesser General Public License # along with Piculet. If not, see . """Piculet is a module for scraping XML and HTML documents using XPath queries. It consists of this single source file with no dependencies other than the standard library, which makes it very easy to integrate into applications. 
For more information, please refer to the documentation: https://piculet.readthedocs.io/ """ from __future__ import absolute_import, division, print_function, unicode_literals import json import logging import os import re import sys from argparse import ArgumentParser from collections import deque from functools import partial from operator import itemgetter from pkgutil import find_loader __version__ = "1.0.1" PY2 = sys.version_info < (3, 0) if PY2: str, bytes = unicode, str if PY2: from cgi import escape as html_escape from HTMLParser import HTMLParser from StringIO import StringIO from htmlentitydefs import name2codepoint from urllib2 import urlopen else: from html import escape as html_escape from html.parser import HTMLParser from io import StringIO from urllib.request import urlopen if PY2: from contextlib import contextmanager @contextmanager def redirect_stdout(new_stdout): """Context manager for temporarily redirecting stdout.""" old_stdout, sys.stdout = sys.stdout, new_stdout try: yield new_stdout finally: sys.stdout = old_stdout else: from contextlib import redirect_stdout _logger = logging.getLogger(__name__) ########################################################### # HTML OPERATIONS ########################################################### # TODO: this is too fragile _CHARSET_TAGS = [ b' str :param content: Content of HTML document to decode. :param charset: Character set of the page. :param fallback_charset: Character set to use if it can't be figured out. :return: Decoded content of the document. """ if charset is None: for tag in _CHARSET_TAGS: start = content.find(tag) if start >= 0: charset_start = start + len(tag) charset_end = content.find(b'"', charset_start) charset = content[charset_start:charset_end].decode("ascii") _logger.debug("charset found in : %s", charset) break else: _logger.debug("charset not found, using fallback: %s", fallback_charset) charset = fallback_charset _logger.debug("decoding for charset: %s", charset) return content.decode(charset) class HTMLNormalizer(HTMLParser): """HTML cleaner and XHTML convertor. DOCTYPE declarations and comments are removed. """ VOID_ELEMENTS = frozenset( { "area", "base", "basefont", "bgsound", "br", "col", "command", "embed", "frame", "hr", "image", "img", "input", "isindex", "keygen", "link", "menuitem", "meta", "nextid", "param", "source", "track", "wbr", } ) """Tags to handle as self-closing.""" def __init__(self, omit_tags=None, omit_attrs=None): """Initialize this normalizer. :sig: (Optional[Iterable[str]], Optional[Iterable[str]]) -> None :param omit_tags: Tags to remove, along with all their content. :param omit_attrs: Attributes to remove. """ if PY2: HTMLParser.__init__(self) else: super().__init__(convert_charrefs=True) self.omit_tags = set(omit_tags) if omit_tags is not None else set() # sig: Set[str] self.omit_attrs = set(omit_attrs) if omit_attrs is not None else set() # sig: Set[str] # stacks used during normalization self._open_tags = deque() self._open_omitted_tags = deque() def handle_starttag(self, tag, attrs): """Process the starting of a new element.""" if tag in self.omit_tags: _logger.debug("omitting starting tag: <%s>", tag) self._open_omitted_tags.append(tag) if not self._open_omitted_tags: # stack empty -> not in omit mode if "@" in tag: # email address in angular brackets print("<%s>" % tag, end="") return if (tag == "li") and (self._open_tags[-1] == "li"): _logger.debug("opened
<li> without closing previous <li>, adding </li>") self.handle_endtag("li") attributes = [] for attr_name, attr_value in attrs: if attr_name in self.omit_attrs: _logger.debug("omitting attribute of <%s>: %s", tag, attr_name) continue if attr_value is None: _logger.debug( "adding empty value for attribute of <%s>: %s", tag, attr_name ) attr_value = "" markup = '%(name)s="%(value)s"' % { "name": attr_name, "value": html_escape(attr_value, quote=True), } attributes.append(markup) line = "<%(tag)s%(attrs)s%(slash)s>" % { "tag": tag, "attrs": (" " + " ".join(attributes)) if len(attributes) > 0 else "", "slash": " /" if tag in self.VOID_ELEMENTS else "", } print(line, end="") if tag not in self.VOID_ELEMENTS: self._open_tags.append(tag) def handle_endtag(self, tag): """Process the ending of an element.""" if not self._open_omitted_tags: # stack empty -> not in omit mode if tag not in self.VOID_ELEMENTS: last = self._open_tags[-1] if (tag == "ul") and (last == "li"): _logger.debug("closing
<ul> without closing last <li>, adding </li>") self.handle_endtag("li") if tag == last: # expected end tag print("</%(tag)s>" % {"tag": tag}, end="") self._open_tags.pop() elif tag not in self._open_tags: _logger.debug("closing tag without opening tag: <%s>", tag) # XXX: for , this case gets invoked after the case below elif tag == self._open_tags[-2]: _logger.debug( "unexpected closing tag <%s> instead of <%s>, closing both", tag, last ) print("</%(tag)s>" % {"tag": last}, end="") print("</%(tag)s>" % {"tag": tag}, end="") self._open_tags.pop() self._open_tags.pop() elif (tag in self.omit_tags) and (tag == self._open_omitted_tags[-1]): # end of expected omitted tag self._open_omitted_tags.pop() def handle_data(self, data): """Process collected character data.""" if not self._open_omitted_tags: # stack empty -> not in omit mode line = html_escape(data) print(line.decode("utf-8") if PY2 and isinstance(line, bytes) else line, end="") def handle_entityref(self, name): """Process an entity reference.""" # XXX: doesn't get called if convert_charrefs=True num = name2codepoint.get(name) # we are sure we're on PY2 here if num is not None: print("&#%(ref)d;" % {"ref": num}, end="") def handle_charref(self, name): """Process a character reference.""" # XXX: doesn't get called if convert_charrefs=True print("&#%(ref)s;" % {"ref": name}, end="") # def feed(self, data): # super().feed(data) # # close all remaining open tags # for tag in reversed(self._open_tags): # print('</%(tag)s>' % {'tag': tag}, end='') def html_to_xhtml(document, omit_tags=None, omit_attrs=None): """Clean HTML and convert to XHTML. :sig: (str, Optional[Iterable[str]], Optional[Iterable[str]]) -> str :param document: HTML document to clean and convert. :param omit_tags: Tags to exclude from the output. :param omit_attrs: Attributes to exclude from the output. :return: Normalized XHTML content. """ out = StringIO() normalizer = HTMLNormalizer(omit_tags=omit_tags, omit_attrs=omit_attrs) with redirect_stdout(out): normalizer.feed(document) return out.getvalue() ########################################################### # DATA EXTRACTION OPERATIONS ########################################################### # sigalias: XPathResult = Union[Sequence[str], Sequence[Element]] _USE_LXML = find_loader("lxml") is not None if _USE_LXML: _logger.info("using lxml") from lxml import etree as ElementTree from lxml.etree import Element XPath = ElementTree.XPath xpath = ElementTree._Element.xpath else: from xml.etree import ElementTree from xml.etree.ElementTree import Element class XPath: """An XPath expression evaluator. This class is mainly needed to compensate for the lack of ``text()`` and ``@attr`` axis queries in ElementTree XPath support. """ def __init__(self, path): """Initialize this evaluator. :sig: (str) -> None :param path: XPath expression to evaluate. """ if path[0] == "/": # ElementTree doesn't support absolute paths # TODO: handle this properly, find root of tree path = "."
+ path def descendant(element): # strip trailing '//text()' return [t for e in element.findall(path[:-8]) for t in e.itertext() if t] def child(element): # strip trailing '/text()' return [ t for e in element.findall(path[:-7]) for t in ([e.text] + [c.tail if c.tail else "" for c in e]) if t ] def attribute(element, subpath, attr): result = [e.attrib.get(attr) for e in element.findall(subpath)] return [r for r in result if r is not None] if path.endswith("//text()"): _apply = descendant elif path.endswith("/text()"): _apply = child else: steps = path.split("/") front, last = steps[:-1], steps[-1] # after dropping PY2: *front, last = path.split('/') if last.startswith("@"): _apply = partial(attribute, subpath="/".join(front), attr=last[1:]) else: _apply = partial(Element.findall, path=path) self._apply = _apply # sig: Callable[[Element], XPathResult] def __call__(self, element): """Apply this evaluator to an element. :sig: (Element) -> XPathResult :param element: Element to apply this expression to. :return: Elements or strings resulting from the query. """ return self._apply(element) xpath = lambda e, p: XPath(p)(e) _EMPTY = {} # sig: Dict # sigalias: Reducer = Callable[[Sequence[str]], str] # sigalias: PathTransformer = Callable[[str], Any] # sigalias: MapTransformer = Callable[[Mapping[str, Any]], Any] # sigalias: Transformer = Union[PathTransformer, MapTransformer] # sigalias: ExtractedItem = Union[str, Mapping[str, Any]] class Extractor: """Abstract base extractor for getting data out of an XML element.""" def __init__(self, transform=None, foreach=None): """Initialize this extractor. :sig: (Optional[Transformer], Optional[str]) -> None :param transform: Function to transform the extracted value. :param foreach: Path to apply for generating a collection of values. """ self.transform = transform # sig: Optional[Transformer] """Function to transform the extracted value.""" self.foreach = XPath(foreach) if foreach is not None else None # sig: Optional[XPath] """Path to apply for generating a collection of values.""" def apply(self, element): """Get the raw data from an element using this extractor. :sig: (Element) -> ExtractedItem :param element: Element to apply this extractor to. :return: Extracted raw data. """ raise NotImplementedError("Concrete extractors must implement this method") def extract(self, element, transform=True): """Get the processed data from an element using this extractor. :sig: (Element, Optional[bool]) -> Any :param element: Element to extract the data from. :param transform: Whether the transformation will be applied or not. :return: Extracted and optionally transformed data. """ value = self.apply(element) if (value is None) or (value is _EMPTY) or (not transform): return value return value if self.transform is None else self.transform(value) @staticmethod def from_map(item): """Generate an extractor from a description map. :sig: (Mapping[str, Any]) -> Extractor :param item: Extractor description. :return: Extractor object. :raise ValueError: When reducer or transformer names are unknown. 
""" transformer = item.get("transform") if transformer is None: transform = None else: transform = transformers.get(transformer) if transform is None: raise ValueError("Unknown transformer") foreach = item.get("foreach") path = item.get("path") if path is not None: reducer = item.get("reduce") if reducer is None: reduce = None else: reduce = reducers.get(reducer) if reduce is None: raise ValueError("Unknown reducer") extractor = Path(path, reduce, transform=transform, foreach=foreach) else: items = item.get("items") # TODO: check for None rules = [Rule.from_map(i) for i in items] extractor = Rules( rules, section=item.get("section"), transform=transform, foreach=foreach ) return extractor class Path(Extractor): """An extractor for getting text out of an XML element.""" def __init__(self, path, reduce=None, transform=None, foreach=None): """Initialize this extractor. :sig: ( str, Optional[Reducer], Optional[PathTransformer], Optional[str] ) -> None :param path: Path to apply to get the data. :param reduce: Function to reduce selected texts into a single string. :param transform: Function to transform extracted value. :param foreach: Path to apply for generating a collection of data. """ if PY2: Extractor.__init__(self, transform=transform, foreach=foreach) else: super().__init__(transform=transform, foreach=foreach) self.path = XPath(path) # sig: XPath """XPath evaluator to apply to get the data.""" if reduce is None: reduce = reducers.concat self.reduce = reduce # sig: Reducer """Function to reduce selected texts into a single string.""" def apply(self, element): """Apply this extractor to an element. :sig: (Element) -> str :param element: Element to apply this extractor to. :return: Extracted text. """ # _logger.debug("applying path on <%s>: %s", element.tag, self.path) selected = self.path(element) if len(selected) == 0: # _logger.debug("no match") value = None else: # _logger.debug("selected elements: %s", selected) value = self.reduce(selected) # _logger.debug("reduced using %s: %s", self.reduce, value) return value class Rules(Extractor): """An extractor for getting data items out of an XML element.""" def __init__(self, rules, section=None, transform=None, foreach=None): """Initialize this extractor. :sig: ( Sequence[Rule], str, Optional[MapTransformer], Optional[str] ) -> None :param rules: Rules for generating the data items. :param section: Path for setting the root of this section. :param transform: Function to transform extracted value. :param foreach: Path for generating multiple items. """ if PY2: Extractor.__init__(self, transform=transform, foreach=foreach) else: super().__init__(transform=transform, foreach=foreach) self.rules = rules # sig: Sequence[Rule] """Rules for generating the data items.""" self.section = XPath(section) if section is not None else None # sig: Optional[XPath] """XPath expression for selecting a subroot for this section.""" def apply(self, element): """Apply this extractor to an element. :sig: (Element) -> Mapping[str, Any] :param element: Element to apply the extractor to. :return: Extracted mapping. 
""" if self.section is None: subroot = element else: subroots = self.section(element) if len(subroots) == 0: _logger.debug("no section root found") return _EMPTY if len(subroots) > 1: raise ValueError("Section path should select exactly one element") subroot = subroots[0] _logger.debug("setting root: <%s>", subroot.tag) data = {} for rule in self.rules: extracted = rule.extract(subroot) data.update(extracted) return data if len(data) > 0 else _EMPTY class Rule: """A rule describing how to get a data item out of an XML element.""" def __init__(self, key, extractor, foreach=None): """Initialize this rule. :sig: (Union[str, Extractor], Extractor, Optional[str]) -> None :param key: Name to distinguish this data item. :param extractor: Extractor that will generate this data item. :param foreach: Path for generating multiple items. """ self.key = key # sig: Union[str, Extractor] """Name to distinguish this data item.""" self.extractor = extractor # sig: Extractor """Extractor that will generate this data item.""" self.foreach = XPath(foreach) if foreach is not None else None # sig: Optional[XPath] """XPath evaluator for generating multiple items.""" @staticmethod def from_map(item): """Generate a rule from a description map. :sig: (Mapping[str, Any]) -> Rule :param item: Item description. :return: Rule object. """ item_key = item["key"] key = item_key if isinstance(item_key, str) else Extractor.from_map(item_key) value = Extractor.from_map(item["value"]) return Rule(key=key, extractor=value, foreach=item.get("foreach")) def extract(self, element): """Extract data out of an element using this rule. :sig: (Element) -> Mapping[str, Any] :param element: Element to extract the data from. :return: Extracted data. """ data = {} subroots = [element] if self.foreach is None else self.foreach(element) for subroot in subroots: # _logger.debug("setting section element: <%s>", subroot.tag) key = self.key if isinstance(self.key, str) else self.key.extract(subroot) if key is None: # _logger.debug("no value generated for key name") continue # _logger.debug("extracting key: %s", key) if self.extractor.foreach is None: value = self.extractor.extract(subroot) if (value is None) or (value is _EMPTY): # _logger.debug("no value generated for key") continue data[key] = value # _logger.debug("extracted value for %s: %s", key, data[key]) else: # don't try to transform list items by default, it might waste a lot of time raw_values = [ self.extractor.extract(r, transform=False) for r in self.extractor.foreach(subroot) ] values = [v for v in raw_values if (v is not None) and (v is not _EMPTY)] if len(values) == 0: # _logger.debug("no items found in list") continue data[key] = ( values if self.extractor.transform is None else list(map(self.extractor.transform, values)) ) # _logger.debug("extracted value for %s: %s", key, data[key]) return data def remove_elements(root, path): """Remove selected elements from the tree. :sig: (Element, str) -> None :param root: Root element of the tree. :param path: XPath to select the elements to remove. 
""" if _USE_LXML: get_parent = ElementTree._Element.getparent else: # ElementTree doesn't support parent queries, so we'll build a map for it get_parent = root.attrib.get("_get_parent") if get_parent is None: get_parent = {e: p for p in root.iter() for e in p}.get root.attrib["_get_parent"] = get_parent elements = XPath(path)(root) _logger.debug("removing %s elements using path: %s", len(elements), path) if len(elements) > 0: for element in elements: _logger.debug("removing element: <%s>", element.tag) # XXX: could this be hazardous? parent removed in earlier iteration? get_parent(element).remove(element) def set_element_attr(root, path, name, value): """Set an attribute for selected elements. :sig: ( Element, str, Union[str, Mapping[str, Any]], Union[str, Mapping[str, Any]] ) -> None :param root: Root element of the tree. :param path: XPath to select the elements to set attributes for. :param name: Description for name generation. :param value: Description for value generation. """ elements = XPath(path)(root) _logger.debug("updating %s elements using path: %s", len(elements), path) for element in elements: attr_name = name if isinstance(name, str) else Extractor.from_map(name).extract(element) if attr_name is None: _logger.debug("no attribute name generated for <%s>:", element.tag) continue attr_value = ( value if isinstance(value, str) else Extractor.from_map(value).extract(element) ) if attr_value is None: _logger.debug("no attribute value generated for <%s>:", element.tag) continue _logger.debug("setting %s attribute of <%s>: %s", attr_name, element.tag, attr_value) element.attrib[attr_name] = attr_value def set_element_text(root, path, text): """Set the text for selected elements. :sig: (Element, str, Union[str, Mapping[str, Any]]) -> None :param root: Root element of the tree. :param path: XPath to select the elements to set attributes for. :param text: Description for text generation. """ elements = XPath(path)(root) _logger.debug("updating %s elements using path: %s", len(elements), path) for element in elements: element_text = ( text if isinstance(text, str) else Extractor.from_map(text).extract(element) ) # note that the text can be None in which case the existing text will be cleared _logger.debug("setting text of <%s>: %s", element.tag, element_text) element.text = element_text def build_tree(document, lxml_html=False): """Build a tree from an XML document. :sig: (str, Optional[bool]) -> Element :param document: XML document to build the tree from. :param lxml_html: Use the lxml.html builder if available. :return: Root element of the XML tree. """ content = document.encode("utf-8") if PY2 else document if _USE_LXML and lxml_html: _logger.info("using lxml html builder") import lxml.html return lxml.html.fromstring(content) return ElementTree.fromstring(content) class Registry: """A simple, attribute-based namespace.""" def __init__(self, entries): """Initialize this registry. :sig: (Mapping[str, Any]) -> None :param entries: Entries to add to this registry. """ self.__dict__.update(entries) def get(self, item): """Get the value of an entry from this registry. :sig: (str) -> Any :param item: Entry to get the value for. :return: Value of entry. """ return self.__dict__.get(item) def register(self, key, value): """Register a new entry in this registry. :sig: (str, Any) -> None :param key: Key to search the entry in this registry. :param value: Value to store for the entry. 
""" self.__dict__[key] = value _PREPROCESSORS = { "remove": remove_elements, "set_attr": set_element_attr, "set_text": set_element_text, } preprocessors = Registry(_PREPROCESSORS) # sig: Registry """Predefined preprocessors.""" _REDUCERS = { "first": itemgetter(0), "concat": partial(str.join, ""), "clean": lambda xs: re.sub(r"\s+", " ", "".join(xs).replace("\xa0", " ")).strip(), "normalize": lambda xs: re.sub(r"[^a-z0-9_]", "", "".join(xs).lower().replace(" ", "_")), } reducers = Registry(_REDUCERS) # sig: Registry """Predefined reducers.""" _TRANSFORMERS = { "int": int, "float": float, "bool": bool, "len": len, "lower": str.lower, "upper": str.upper, "capitalize": str.capitalize, "lstrip": str.lstrip, "rstrip": str.rstrip, "strip": str.strip, } transformers = Registry(_TRANSFORMERS) # sig: Registry """Predefined transformers.""" def preprocess(root, pre): """Process a tree before starting extraction. :sig: (Element, Sequence[Mapping[str, Any]]) -> None :param root: Root of tree to process. :param pre: Descriptions for processing operations. """ for step in pre: op = step["op"] if op == "remove": remove_elements(root, step["path"]) elif op == "set_attr": set_element_attr(root, step["path"], name=step["name"], value=step["value"]) elif op == "set_text": set_element_text(root, step["path"], text=step["text"]) else: raise ValueError("Unknown preprocessing operation") def extract(element, items, section=None): """Extract data from an XML element. :sig: ( Element, Sequence[Mapping[str, Any]], Optional[str] ) -> Mapping[str, Any] :param element: Element to extract the data from. :param items: Descriptions for extracting items. :param section: Path to select the root element for these items. :return: Extracted data. """ rules = Rules([Rule.from_map(item) for item in items], section=section) return rules.extract(element) def scrape(document, spec, lxml_html=False): """Extract data from a document after optionally preprocessing it. :sig: (str, Mapping[str, Any], Optional[bool]) -> Mapping[str, Any] :param document: Document to scrape. :param spec: Extraction specification. :param lxml_html: Use the lxml.html builder if available. :return: Extracted data. """ root = build_tree(document, lxml_html=lxml_html) pre = spec.get("pre") if pre is not None: preprocess(root, pre) data = extract(root, spec.get("items"), section=spec.get("section")) return data ########################################################### # COMMAND-LINE INTERFACE ########################################################### def h2x(source): """Convert an HTML file into XHTML and print. :sig: (str) -> None :param source: Path of HTML file to convert. """ if source == "-": _logger.debug("reading from stdin") content = sys.stdin.read() else: _logger.debug("reading from file: %s", os.path.abspath(source)) with open(source, "rb") as f: content = decode_html(f.read()) print(html_to_xhtml(content), end="") def scrape_document(address, spec, content_format="xml"): """Scrape data from a file path or a URL and print. :sig: (str, str, Optional[str]) -> None :param address: File path or URL of document to scrape. :param spec: Path of spec file. :param content_format: Whether the content is XML or HTML. 
""" _logger.debug("loading spec from file: %s", os.path.abspath(spec)) if os.path.splitext(spec)[-1] in (".yaml", ".yml"): if find_loader("yaml") is None: raise RuntimeError("YAML support not available") import yaml spec_loader = yaml.load else: spec_loader = json.loads with open(spec) as f: spec_map = spec_loader(f.read()) if address.startswith(("http://", "https://")): _logger.debug("loading url: %s", address) with urlopen(address) as f: content = f.read() else: _logger.debug("loading file: %s", os.path.abspath(address)) with open(address, "rb") as f: content = f.read() document = decode_html(content) if content_format == "html": _logger.debug("converting html document to xhtml") document = html_to_xhtml(document) # _logger.debug('=== CONTENT START ===\n%s\n=== CONTENT END===', document) data = scrape(document, spec_map) print(json.dumps(data, indent=2, sort_keys=True)) def make_parser(prog): """Build a parser for command line arguments. :sig: (str) -> ArgumentParser :param prog: Name of program. :return: Parser for arguments. """ parser = ArgumentParser(prog=prog) parser.add_argument("--version", action="version", version="%(prog)s " + __version__) parser.add_argument("--debug", action="store_true", help="enable debug messages") commands = parser.add_subparsers(metavar="command", dest="command") commands.required = True h2x_parser = commands.add_parser("h2x", help="convert HTML to XHTML") h2x_parser.add_argument("file", help="file to convert") h2x_parser.set_defaults(func=lambda a: h2x(a.file)) scrape_parser = commands.add_parser("scrape", help="scrape a document") scrape_parser.add_argument("document", help="file path or URL of document to scrape") scrape_parser.add_argument("-s", "--spec", required=True, help="spec file") scrape_parser.add_argument("--html", action="store_true", help="document is in HTML format") scrape_parser.set_defaults( func=lambda a: scrape_document( a.document, a.spec, content_format="html" if a.html else "xml" ) ) return parser def main(argv=None): """Entry point of the command line utility. :sig: (Optional[List[str]]) -> None :param argv: Command line arguments. 
""" argv = argv if argv is not None else sys.argv parser = make_parser(prog="piculet") arguments = parser.parse_args(argv[1:]) # set debug mode if arguments.debug: logging.basicConfig(level=logging.DEBUG) _logger.debug("running in debug mode") # run the handler for the selected command try: arguments.func(arguments) except Exception as e: print(e, file=sys.stderr) sys.exit(1) if __name__ == "__main__": main() PK!Htests/conftest.pyfrom __future__ import absolute_import, division, print_function, unicode_literals from pytest import fixture import logging import os import sys from hashlib import md5 from io import BytesIO import piculet PY2 = sys.version_info < (3, 0) if PY2: import mock else: from unittest import mock if PY2: from urllib2 import urlopen else: from urllib.request import urlopen logging.raiseExceptions = False cache_dir = os.path.join(os.path.dirname(__file__), ".cache") if not os.path.exists(cache_dir): os.makedirs(cache_dir) def mock_urlopen(url): key = md5(url.encode("utf-8")).hexdigest() cache_file = os.path.join(cache_dir, key) if not os.path.exists(cache_file): content = urlopen(url).read() with open(cache_file, "wb") as f: f.write(content) else: with open(cache_file, "rb") as f: content = f.read() return BytesIO(content) piculet.urlopen = mock.Mock(wraps=mock_urlopen) @fixture(scope="session") def shining_content(): """Contents of the shining.html file.""" file_path = os.path.join(os.path.dirname(__file__), "..", "examples", "shining.html") with open(file_path) as f: content = f.read() return content @fixture def shining(shining_content): """Root element of the XML tree for the movie document "The Shining".""" return piculet.build_tree(shining_content) PK!لtests/test_cli.pyfrom __future__ import absolute_import, division, print_function, unicode_literals from pytest import config, mark, raises import json import logging import os import sys from io import StringIO from pkg_resources import get_distribution import piculet if sys.version_info.major < 3: import mock else: from unittest import mock base_dir = os.path.dirname(__file__) wikipedia_spec = os.path.join(base_dir, "..", "examples", "wikipedia.json") def test_version(): assert get_distribution("piculet").version == piculet.__version__ def test_help_should_print_usage_and_exit(capsys): with raises(SystemExit): piculet.main(argv=["piculet", "--help"]) out, err = capsys.readouterr() assert out.startswith("usage: ") def test_version_should_print_version_number_and_exit(capsys): with raises(SystemExit): piculet.main(argv=["piculet", "--version"]) out, err = capsys.readouterr() assert "piculet " + get_distribution("piculet").version + "\n" in {out, err} def test_no_command_should_print_usage_and_exit(capsys): with raises(SystemExit): piculet.main(argv=["piculet"]) out, err = capsys.readouterr() assert err.startswith("usage: ") assert ("required: command" in err) or ("too few arguments" in err) def test_invalid_command_should_print_usage_and_exit(capsys): with raises(SystemExit): piculet.main(argv=["piculet", "foo"]) out, err = capsys.readouterr() assert err.startswith("usage: ") assert ("invalid choice: 'foo'" in err) or ("invalid choice: u'foo'" in err) def test_unrecognized_arguments_should_print_usage_and_exit(capsys): with raises(SystemExit): piculet.main(argv=["piculet", "--foo", "h2x", ""]) out, err = capsys.readouterr() assert err.startswith("usage: ") assert "unrecognized arguments: --foo" in err def test_debug_mode_should_print_debug_messages(caplog): caplog.set_level(logging.DEBUG) with mock.patch("sys.stdin", 
StringIO("")): piculet.main(argv=["piculet", "--debug", "h2x", "-"]) assert caplog.record_tuples[0][-1] == "running in debug mode" def test_h2x_no_input_should_print_usage_and_exit(capsys): with raises(SystemExit): piculet.main(argv=["piculet", "h2x"]) out, err = capsys.readouterr() assert err.startswith("usage: ") assert ("required: file" in err) or ("too few arguments" in err) @mark.skipif(sys.platform not in {"linux", "linux2"}, reason="/dev/shm only available on linux") def test_h2x_should_read_given_file(capsys): content = "" with open("/dev/shm/test.html", "w") as f: f.write(content) piculet.main(argv=["piculet", "h2x", "/dev/shm/test.html"]) out, err = capsys.readouterr() os.unlink("/dev/shm/test.html") assert out == content def test_h2x_should_read_stdin_when_input_is_dash(capsys): content = "" with mock.patch("sys.stdin", StringIO(content)): piculet.main(argv=["piculet", "h2x", "-"]) out, err = capsys.readouterr() assert out == content def test_scrape_no_url_should_print_usage_and_exit(capsys): with raises(SystemExit): piculet.main(argv=["piculet", "scrape", "-s", wikipedia_spec]) out, err = capsys.readouterr() assert err.startswith("usage: ") assert ("required: document" in err) or ("too few arguments" in err) def test_scrape_no_spec_should_print_usage_and_exit(capsys): with raises(SystemExit): piculet.main(argv=["piculet", "scrape", "http://www.foo.com/"]) out, err = capsys.readouterr() assert err.startswith("usage: ") assert ("required: -s" in err) or ("--spec is required" in err) def test_scrape_missing_spec_file_should_fail_and_exit(capsys): with raises(SystemExit): piculet.main(argv=["piculet", "scrape", "http://www.foo.com/", "-s", "foo.json"]) out, err = capsys.readouterr() assert "No such file or directory: " in err def test_scrape_local_should_scrape_given_file(capsys): dirname = os.path.join(os.path.dirname(__file__), "..", "examples") shining = os.path.join(dirname, "shining.html") spec = os.path.join(dirname, "movie.json") piculet.main(argv=["piculet", "scrape", shining, "-s", spec]) out, err = capsys.readouterr() data = json.loads(out) assert data["title"] == "The Shining" @mark.skipif(not config.getvalue("--cov"), reason="takes unforeseen amount of time") def test_scrape_should_scrape_given_url(capsys): piculet.main( argv=[ "piculet", "scrape", "https://en.wikipedia.org/wiki/David_Bowie", "-s", wikipedia_spec, "--html", ] ) out, err = capsys.readouterr() data = json.loads(out) assert data["name"] == "David Bowie" PK!I[\!\!tests/test_extract.pyfrom __future__ import absolute_import, division, print_function, unicode_literals from pytest import raises from piculet import Path, Rule, Rules, build_tree, reducers, transformers def test_no_rules_should_return_empty_result(shining): data = Rules([]).extract(shining) assert data == {} def test_extracted_value_should_be_reduced(shining): rules = [Rule(key="title", extractor=Path("//title/text()", reduce=reducers.first))] data = Rules(rules).extract(shining) assert data == {"title": "The Shining"} def test_default_reducer_should_be_concat(shining): rules = [Rule(key="full_title", extractor=Path("//h1//text()"))] data = Rules(rules).extract(shining) assert data == {"full_title": "The Shining (1980)"} def test_added_reducer_should_be_usable(shining): reducers.register("second", lambda x: x[1]) rules = [Rule(key="year", extractor=Path("//h1//text()", reduce=reducers.second))] data = Rules(rules).extract(shining) assert data == {"year": "1980"} def test_reduce_by_lambda_should_be_ok(shining): rules = [Rule(key="title", 
extractor=Path("//title/text()", reduce=lambda xs: xs[0]))] data = Rules(rules).extract(shining) assert data == {"title": "The Shining"} def test_reduced_value_should_be_transformable(shining): rules = [Rule(key="year", extractor=Path('//span[@class="year"]/text()', transform=int))] data = Rules(rules).extract(shining) assert data == {"year": 1980} def test_added_transformer_should_be_usable(shining): transformers.register("year25", lambda x: int(x) + 25) rules = [ Rule( key="year", extractor=Path('//span[@class="year"]/text()', transform=transformers.year25), ) ] data = Rules(rules).extract(shining) assert data == {"year": 2005} def test_multiple_rules_should_generate_multiple_items(shining): rules = [ Rule(key="title", extractor=Path("//title/text()")), Rule("year", extractor=Path('//span[@class="year"]/text()', transform=int)), ] data = Rules(rules).extract(shining) assert data == {"title": "The Shining", "year": 1980} def test_item_with_no_data_should_be_excluded(shining): rules = [ Rule(key="title", extractor=Path("//title/text()")), Rule(key="foo", extractor=Path("//foo/text()")), ] data = Rules(rules).extract(shining) assert data == {"title": "The Shining"} def test_item_with_empty_str_value_should_be_included(): content = '' rules = [Rule(key="foo", extractor=Path("//foo/@val"))] data = Rules(rules).extract(build_tree(content)) assert data == {"foo": ""} def test_item_with_zero_value_should_be_included(): content = '' rules = [Rule(key="foo", extractor=Path("//foo/@val", transform=int))] data = Rules(rules).extract(build_tree(content)) assert data == {"foo": 0} def test_item_with_false_value_should_be_included(): content = '' rules = [Rule(key="foo", extractor=Path("//foo/@val", transform=bool))] data = Rules(rules).extract(build_tree(content)) assert data == {"foo": False} def test_multivalued_item_should_be_list(shining): rules = [ Rule(key="genres", extractor=Path(foreach='//ul[@class="genres"]/li', path="./text()")) ] data = Rules(rules).extract(shining) assert data == {"genres": ["Horror", "Drama"]} def test_multivalued_items_should_be_transformable(shining): rules = [ Rule( key="genres", extractor=Path( foreach='//ul[@class="genres"]/li', path="./text()", transform=transformers.lower, ), ) ] data = Rules(rules).extract(shining) assert data == {"genres": ["horror", "drama"]} def test_empty_values_should_be_excluded_from_multivalued_item_list(shining): rules = [ Rule(key="foos", extractor=Path(foreach='//ul[@class="foos"]/li', path="./text()")) ] data = Rules(rules).extract(shining) assert data == {} def test_subrules_should_generate_subitems(shining): rules = [ Rule( key="director", extractor=Rules( rules=[ Rule(key="name", extractor=Path('//div[@class="director"]//a/text()')), Rule(key="link", extractor=Path('//div[@class="director"]//a/@href')), ] ), ) ] data = Rules(rules).extract(shining) assert data == {"director": {"link": "/people/1", "name": "Stanley Kubrick"}} def test_multivalued_subrules_should_generate_list_of_subitems(shining): rules = [ Rule( key="cast", extractor=Rules( foreach='//table[@class="cast"]/tr', rules=[ Rule(key="name", extractor=Path("./td[1]/a/text()")), Rule(key="character", extractor=Path("./td[2]/text()")), ], ), ) ] data = Rules(rules).extract(shining) assert data == { "cast": [ {"character": "Jack Torrance", "name": "Jack Nicholson"}, {"character": "Wendy Torrance", "name": "Shelley Duvall"}, ] } def test_subitems_should_be_transformable(shining): rules = [ Rule( key="cast", extractor=Rules( foreach='//table[@class="cast"]/tr', rules=[ 
Rule(key="name", extractor=Path("./td[1]/a/text()")), Rule(key="character", extractor=Path("./td[2]/text()")), ], transform=lambda x: "%(name)s as %(character)s" % x, ), ) ] data = Rules(rules).extract(shining) assert data == { "cast": ["Jack Nicholson as Jack Torrance", "Shelley Duvall as Wendy Torrance"] } def test_key_should_be_generatable_using_path(shining): rules = [ Rule( foreach='//div[@class="info"]', key=Path("./h3/text()"), extractor=Path("./p/text()"), ) ] data = Rules(rules).extract(shining) assert data == {"Language:": "English", "Runtime:": "144 minutes"} def test_generated_key_should_be_normalizable(shining): rules = [ Rule( foreach='//div[@class="info"]', key=Path("./h3/text()", reduce=reducers.normalize), extractor=Path("./p/text()"), ) ] data = Rules(rules).extract(shining) assert data == {"language": "English", "runtime": "144 minutes"} def test_generated_key_should_be_transformable(shining): rules = [ Rule( foreach='//div[@class="info"]', key=Path("./h3/text()", reduce=reducers.normalize, transform=lambda x: x.upper()), extractor=Path("./p/text()"), ) ] data = Rules(rules).extract(shining) assert data == {"LANGUAGE": "English", "RUNTIME": "144 minutes"} def test_generated_key_none_should_be_excluded(shining): rules = [ Rule( foreach='//div[@class="info"]', key=Path("./foo/text()"), extractor=Path("./p/text()"), ) ] data = Rules(rules).extract(shining) assert data == {} def test_section_should_set_root_for_queries(shining): rules = [ Rule( key="director", extractor=Rules( section='//div[@class="director"]//a', rules=[ Rule(key="name", extractor=Path("./text()")), Rule(key="link", extractor=Path("./@href")), ], ), ) ] data = Rules(rules).extract(shining) assert data == {"director": {"link": "/people/1", "name": "Stanley Kubrick"}} def test_section_no_roots_should_return_empty_result(shining): rules = [ Rule( key="director", extractor=Rules( section="//foo", rules=[Rule(key="name", extractor=Path("./text()"))] ), ) ] data = Rules(rules).extract(shining) assert data == {} def test_section_multiple_roots_should_raise_error(shining): with raises(ValueError): rules = [ Rule( key="director", extractor=Rules( section="//div", rules=[Rule(key="name", extractor=Path("./text()"))] ), ) ] Rules(rules).extract(shining) PK!9pXXtests/test_html.py# -*- coding: utf-8 -*- from __future__ import absolute_import, division, print_function, unicode_literals from pytest import raises from piculet import decode_html, html_to_xhtml TEMPLATE = """ %(meta)s

      ğışĞİŞ

      """ def read_document(actual, reported, tag): if reported == "none": meta = "" else: tag = ( '' if tag == "charset" else '' ) meta = tag % {"charset": reported} content = TEMPLATE % {"meta": meta} return content.encode(actual) def test_decode_content_meta_charset_correct_should_succeed(): content = read_document("utf-8", "utf-8", "charset") assert "ğışĞİŞ" in decode_html(content) def test_decode_content_meta_content_type_correct_should_succeed(): content = read_document("utf-8", "utf-8", "content-type") assert "ğışĞİŞ" in decode_html(content) def test_decode_content_meta_charset_incorrect_should_fail(): content = read_document("utf-8", "iso8859-9", "charset") assert "ğışĞİŞ" not in decode_html(content) def test_decode_content_meta_content_type_incorrect_should_fail(): content = read_document("utf-8", "iso8859-9", "content-type") assert "ğışĞİŞ" not in decode_html(content) def test_decode_content_meta_charset_incompatible_should_raise_unicode_error(): content = read_document("iso8859-9", "utf-8", "charset") with raises(UnicodeDecodeError): decode_html(content) def test_decode_content_meta_content_type_incompatible_should_raise_unicode_error(): content = read_document("iso8859-9", "utf-8", "content-type") with raises(UnicodeDecodeError): decode_html(content) def test_decode_content_requested_charset_correct_should_succeed(): content = read_document("utf-8", "iso8859-9", "charset") assert "ğışĞİŞ" in decode_html(content, charset="utf-8") def test_decode_content_requested_charset_incorrect_should_fail(): content = read_document("utf-8", "utf-8", "charset") assert "ğışĞİŞ" not in decode_html(content, charset="iso8859-9") def test_decode_content_requested_charset_incompatible_should_raise_unicode_error(): content = read_document("iso8859-9", "iso8859-9", "charset") with raises(UnicodeDecodeError): decode_html(content, charset="utf-8") def test_decode_content_fallback_default_correct_should_succeed(): content = read_document("utf-8", "none", "charset") assert "ğışĞİŞ" in decode_html(content) def test_decode_content_fallback_correct_should_succeed(): content = read_document("iso8859-9", "none", "charset") assert "ğışĞİŞ" in decode_html(content, fallback_charset="iso8859-9") def test_decode_content_fallback_incorrect_should_fail(): content = read_document("iso8859-9", "none", "charset") assert "ğışĞİŞ" not in decode_html(content, fallback_charset="iso8859-1") def test_decode_content_fallback_incompatible_should_raise_unicode_error(): content = read_document("iso8859-9", "none", "charset") with raises(UnicodeDecodeError): decode_html(content, fallback_charset="utf-8") def test_html_to_xhtml_well_formed_xml_should_succeed(): content = """""" normalized = html_to_xhtml(content) assert normalized == """""" def test_html_to_xhtml_doctype_should_be_removed(): content = """""" normalized = html_to_xhtml(content) assert normalized == """""" def test_html_to_xhtml_comment_should_be_removed(): content = """""" normalized = html_to_xhtml(content) assert normalized == """""" def test_html_to_xhtml_omitted_tags_should_be_removed(): content = """

      """ normalized = html_to_xhtml(content, omit_tags={"font"}) assert normalized == """

      """ def test_html_to_xhtml_omitted_attributes_should_be_removed(): content = """

      """ normalized = html_to_xhtml(content, omit_attrs={"font"}) assert normalized == """

      """ def test_html_to_xhtml_self_closing_tags_should_have_slash_at_end(): content = """
      """ normalized = html_to_xhtml(content) assert normalized == """
      """ def test_html_to_xhtml_attributes_should_have_values(): content = """""" normalized = html_to_xhtml(content) assert normalized == """""" def test_html_to_xhtml_unclosed_tags_should_be_closed(): content = """

      """ normalized = html_to_xhtml(content) assert normalized == """

      """ def test_html_to_xhtml_unclosed_lis_should_be_closed(): content = """
      """ normalized = html_to_xhtml(content) assert normalized == """
      """ def test_html_to_xhtml_unclosed_last_lis_should_be_closed(): content = """
      """ normalized = html_to_xhtml(content) assert normalized == """
      """ def test_html_to_xhtml_end_tag_without_start_tag_should_be_discarded(): content = """

      """ normalized = html_to_xhtml(content) assert normalized == """

      """ def test_html_to_xhtml_incorrect_nesting_should_be_reordered(): content = """

      """ normalized = html_to_xhtml(content) assert normalized == """

      """ def test_html_to_xhtml_angular_brackets_with_at_symbols_should_be_replaced(): content = """

      """ normalized = html_to_xhtml(content) assert normalized == """

      <uyar@tekir.org>

      """ def test_html_to_xhtml_ampersands_should_be_replaced_in_data(): content = """

      &

      """ normalized = html_to_xhtml(content) assert normalized == """

      &

      """ def test_html_to_xhtml_lts_should_be_replaced_in_data(): content = """

      <

      """ normalized = html_to_xhtml(content) assert normalized == """

      <

      """ def test_html_to_xhtml_gts_should_be_replaced_in_data(): content = """

      >

      """ normalized = html_to_xhtml(content) assert normalized == """

      >

      """ def test_html_to_xhtml_ampersands_should_be_replaced_in_attribute_values(): content = """

      """ normalized = html_to_xhtml(content) assert normalized == """

      """ def test_html_to_xhtml_lts_should_be_replaced_in_attribute_values(): content = """

      """ normalized = html_to_xhtml(content) assert normalized == """

      """ def test_html_to_xhtml_gts_should_be_replaced_in_attribute_values(): content = """

      """ normalized = html_to_xhtml(content) assert normalized == """

      """ def test_html_to_xhtml_quotes_should_be_replaced_in_attribute_values(): content = """

      """ normalized = html_to_xhtml(content) assert normalized == """

      """ def test_html_to_xhtml_unicode_data_should_be_preserved(): content = """

      ğış

      """ normalized = html_to_xhtml(content) assert normalized == """

      ğış

      """ def test_html_to_xhtml_unicode_attribute_value_should_be_preserved(): content = """

      """ normalized = html_to_xhtml(content) assert normalized == """

      """ PK!_ )tests/test_preprocess.pyfrom __future__ import absolute_import, division, print_function, unicode_literals from pytest import raises from piculet import extract, preprocess def test_unknown_preprocessor_should_raise_error(shining): with raises(ValueError): pre = [{"op": "foo", "path": "//tr[1]"}] preprocess(shining, pre) def test_remove_should_remove_selected_element(shining): pre = [{"op": "remove", "path": "//tr[1]"}] items = [ { "key": "cast", "value": { "foreach": '//table[@class="cast"]/tr', "items": [{"key": "name", "value": {"path": "./td[1]/a/text()"}}], }, } ] preprocess(shining, pre) data = extract(shining, items) assert data == {"cast": [{"name": "Shelley Duvall"}]} def test_remove_selected_none_should_not_cause_error(shining): pre = [{"op": "remove", "path": "//tr[50]"}] items = [ { "key": "cast", "value": { "foreach": '//table[@class="cast"]/tr', "items": [{"key": "name", "value": {"path": "./td[1]/a/text()"}}], }, } ] preprocess(shining, pre) data = extract(shining, items) assert data == {"cast": [{"name": "Jack Nicholson"}, {"name": "Shelley Duvall"}]} def test_set_attr_value_from_str_should_set_attribute_for_selected_elements(shining): pre = [ {"op": "set_attr", "path": "//ul[@class='genres']/li", "name": "foo", "value": "bar"} ] items = [{"key": "genres", "value": {"foreach": "//li[@foo='bar']", "path": "./text()"}}] preprocess(shining, pre) data = extract(shining, items) assert data == {"genres": ["Horror", "Drama"]} def test_set_attr_value_from_path_should_set_attribute_for_selected_elements(shining): pre = [ { "op": "set_attr", "path": '//ul[@class="genres"]/li', "name": "foo", "value": {"path": "./text()"}, } ] items = [{"key": "genres", "value": {"foreach": "//li[@foo]", "path": "./@foo"}}] preprocess(shining, pre) data = extract(shining, items) assert data == {"genres": ["Horror", "Drama"]} def test_set_attr_value_from_path_no_value_should_be_ignored(shining): pre = [ { "op": "set_attr", "path": '//ul[@class="genres"]/li', "name": "foo", "value": {"path": "./@bar"}, } ] items = [{"key": "genres", "value": {"foreach": "//li[@foo]", "path": "./@foo"}}] preprocess(shining, pre) data = extract(shining, items) assert data == {} def test_set_attr_name_from_path_should_set_attribute_for_selected_elements(shining): pre = [ { "op": "set_attr", "path": '//ul[@class="genres"]/li', "name": {"path": "./text()"}, "value": "bar", } ] items = [{"key": "genres", "value": {"foreach": "//li[@Horror]", "path": "./@Horror"}}] preprocess(shining, pre) data = extract(shining, items) assert data == {"genres": ["bar"]} def test_set_attr_name_from_path_no_value_should_be_ignored(shining): pre = [ { "op": "set_attr", "path": '//ul[@class="genres"]/li', "name": {"path": "./@bar"}, "value": "bar", } ] items = [{"key": "genres", "value": {"foreach": ".//li[@Horror]", "path": "./@Horror"}}] preprocess(shining, pre) data = extract(shining, items) assert data == {} def test_set_attr_selected_none_should_not_cause_error(shining): pre = [{"op": "set_attr", "path": "//foo", "name": "foo", "value": "bar"}] items = [{"key": "genres", "value": {"foreach": '//li[@foo="bar"]', "path": "./@foo"}}] preprocess(shining, pre) data = extract(shining, items) assert data == {} def test_set_text_value_from_str_should_set_text_for_selected_elements(shining): pre = [{"op": "set_text", "path": '//ul[@class="genres"]/li', "text": "Foo"}] items = [ {"key": "genres", "value": {"foreach": '//ul[@class="genres"]/li', "path": "./text()"}} ] preprocess(shining, pre) data = extract(shining, items) assert data 
== {"genres": ["Foo", "Foo"]} def test_set_text_value_from_path_should_set_text_for_selected_elements(shining): pre = [ { "op": "set_text", "path": '//ul[@class="genres"]/li', "text": {"path": "./text()", "transform": "lower"}, } ] items = [ {"key": "genres", "value": {"foreach": '//ul[@class="genres"]/li', "path": "./text()"}} ] preprocess(shining, pre) data = extract(shining, items) assert data == {"genres": ["horror", "drama"]} def test_set_text_no_value_should_be_ignored(shining): pre = [{"op": "set_text", "path": '//ul[@class="genres"]/li', "text": {"path": "./@foo"}}] items = [ {"key": "genres", "value": {"foreach": '//ul[@class="genres"]/li', "path": "./text()"}} ] preprocess(shining, pre) data = extract(shining, items) assert data == {} PK!1&&tests/test_reducers.pyfrom __future__ import absolute_import, division, print_function, unicode_literals from piculet import reducers def test_reducer_first_should_return_first_item(): assert reducers.first(["a", "b", "c"]) == "a" def test_reducer_concat_should_return_concatenated_items(): assert reducers.concat(["a", "b", "c"]) == "abc" def test_reducer_clean_should_remove_extra_space(): assert reducers.clean([" a ", " b", " c "]) == "a b c" def test_reducer_clean_should_treat_nbsp_as_space(): assert reducers.clean([" a ", " \xa0 b", " c "]) == "a b c" def test_reducer_normalize_should_convert_to_lowercase(): assert reducers.normalize(["A", "B", "C"]) == "abc" def test_reducer_normalize_should_remove_nonalphanumeric_characters(): assert reducers.normalize(["a+", "?b7", "{c}"]) == "ab7c" def test_reducer_normalize_should_keep_underscores(): assert reducers.normalize(["a_", "b", "c"]) == "a_bc" def test_reducer_normalize_should_replace_spaces_with_underscores(): assert reducers.normalize(["a", " b", "c"]) == "a_bc" PK!K -#-#tests/test_scrape.pyfrom __future__ import absolute_import, division, print_function, unicode_literals from pytest import raises from piculet import reducers, scrape, transformers def test_no_rules_should_return_empty_result(shining_content): data = scrape(shining_content, {"items": []}) assert data == {} def test_extracted_value_should_be_reduced(shining_content): items = [{"key": "title", "value": {"path": "//title/text()", "reduce": "first"}}] data = scrape(shining_content, {"items": items}) assert data == {"title": "The Shining"} def test_default_reducer_should_be_concat(shining_content): items = [{"key": "full_title", "value": {"path": "//h1//text()"}}] data = scrape(shining_content, {"items": items}) assert data == {"full_title": "The Shining (1980)"} def test_added_reducer_should_be_usable(shining_content): reducers.register("second", lambda x: x[1]) items = [{"key": "year", "value": {"path": "//h1//text()", "reduce": "second"}}] data = scrape(shining_content, {"items": items}) assert data == {"year": "1980"} def test_unknown_reducer_should_raise_error(shining_content): with raises(ValueError): items = [{"key": "year", "value": {"path": "//h1//text()", "reduce": "foo"}}] scrape(shining_content, {"items": items}) def test_reduced_value_should_be_transformable(shining_content): items = [ {"key": "year", "value": {"path": '//span[@class="year"]/text()', "transform": "int"}} ] data = scrape(shining_content, {"items": items}) assert data == {"year": 1980} def test_added_transformer_should_be_usable(shining_content): transformers.register("year25", lambda x: int(x) + 25) items = [ { "key": "year", "value": {"path": '//span[@class="year"]/text()', "transform": "year25"}, } ] data = scrape(shining_content, {"items": items}) 
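# The spec above can refer to "year25" by name because Registry.register adds the
# callable as an attribute of the shared transformers registry, and extractors look
# up the value of a "transform" key in that registry when building the rule.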
assert data == {"year": 2005} def test_unknown_transformer_should_raise_error(shining_content): with raises(ValueError): items = [ { "key": "year", "value": {"path": '//span[@class="year"]/text()', "transform": "year42"}, } ] scrape(shining_content, {"items": items}) def test_multiple_rules_should_generate_multiple_items(shining_content): items = [ {"key": "title", "value": {"path": "//title/text()"}}, {"key": "year", "value": {"path": '//span[@class="year"]/text()', "transform": "int"}}, ] data = scrape(shining_content, {"items": items}) assert data == {"title": "The Shining", "year": 1980} def test_item_with_no_data_should_be_excluded(shining_content): items = [ {"key": "title", "value": {"path": "//title/text()"}}, {"key": "foo", "value": {"path": "//foo/text()"}}, ] data = scrape(shining_content, {"items": items}) assert data == {"title": "The Shining"} def test_multivalued_item_should_be_list(shining_content): items = [ {"key": "genres", "value": {"foreach": '//ul[@class="genres"]/li', "path": "./text()"}} ] data = scrape(shining_content, {"items": items}) assert data == {"genres": ["Horror", "Drama"]} def test_multivalued_items_should_be_transformable(shining_content): items = [ { "key": "genres", "value": { "foreach": '//ul[@class="genres"]/li', "path": "./text()", "transform": "lower", }, } ] data = scrape(shining_content, {"items": items}) assert data == {"genres": ["horror", "drama"]} def test_empty_values_should_be_excluded_from_multivalued_item_list(shining_content): items = [ {"key": "foos", "value": {"foreach": '//ul[@class="foos"]/li', "path": "./text()"}} ] data = scrape(shining_content, {"items": items}) assert data == {} def test_subrules_should_generate_subitems(shining_content): items = [ { "key": "director", "value": { "items": [ {"key": "name", "value": {"path": '//div[@class="director"]//a/text()'}}, {"key": "link", "value": {"path": '//div[@class="director"]//a/@href'}}, ] }, } ] data = scrape(shining_content, {"items": items}) assert data == {"director": {"link": "/people/1", "name": "Stanley Kubrick"}} def test_multivalued_subrules_should_generate_list_of_subitems(shining_content): items = [ { "key": "cast", "value": { "foreach": '//table[@class="cast"]/tr', "items": [ {"key": "name", "value": {"path": "./td[1]/a/text()"}}, {"key": "character", "value": {"path": "./td[2]/text()"}}, ], }, } ] data = scrape(shining_content, {"items": items}) assert data == { "cast": [ {"character": "Jack Torrance", "name": "Jack Nicholson"}, {"character": "Wendy Torrance", "name": "Shelley Duvall"}, ] } def test_subitems_should_be_transformable(shining_content): transformers.register("stars", lambda x: "%(name)s as %(character)s" % x) items = [ { "key": "cast", "value": { "foreach": '//table[@class="cast"]/tr', "items": [ {"key": "name", "value": {"path": "./td[1]/a/text()"}}, {"key": "character", "value": {"path": "./td[2]/text()"}}, ], "transform": "stars", }, } ] data = scrape(shining_content, {"items": items}) assert data == { "cast": ["Jack Nicholson as Jack Torrance", "Shelley Duvall as Wendy Torrance"] } def test_key_should_be_generatable_using_path(shining_content): items = [ { "foreach": '//div[@class="info"]', "key": {"path": "./h3/text()"}, "value": {"path": "./p/text()"}, } ] data = scrape(shining_content, {"items": items}) assert data == {"Language:": "English", "Runtime:": "144 minutes"} def test_generated_key_should_be_normalizable(shining_content): items = [ { "foreach": '//div[@class="info"]', "key": {"path": "./h3/text()", "reduce": "normalize"}, "value": {"path": 
"./p/text()"}, } ] data = scrape(shining_content, {"items": items}) assert data == {"language": "English", "runtime": "144 minutes"} def test_generated_key_should_be_transformable(shining_content): items = [ { "foreach": '//div[@class="info"]', "key": {"path": "./h3/text()", "reduce": "normalize", "transform": "upper"}, "value": {"path": "./p/text()"}, } ] data = scrape(shining_content, {"items": items}) assert data == {"LANGUAGE": "English", "RUNTIME": "144 minutes"} def test_generated_key_none_should_be_excluded(shining_content): items = [ { "foreach": '//div[@class="info"]', "key": {"path": "./foo/text()"}, "value": {"path": "./p/text()"}, } ] data = scrape(shining_content, {"items": items}) assert data == {} def test_tree_should_be_preprocessable(shining_content): pre = [{"op": "set_text", "path": '//ul[@class="genres"]/li', "text": "Foo"}] items = [ {"key": "genres", "value": {"foreach": '//ul[@class="genres"]/li', "path": "./text()"}} ] data = scrape(shining_content, {"items": items, "pre": pre}) assert data == {"genres": ["Foo", "Foo"]} def test_section_should_set_root_for_queries(shining_content): items = [ { "key": "director", "value": { "section": '//div[@class="director"]//a', "items": [ {"key": "name", "value": {"path": "./text()"}}, {"key": "link", "value": {"path": "./@href"}}, ], }, } ] data = scrape(shining_content, {"items": items}) assert data == {"director": {"link": "/people/1", "name": "Stanley Kubrick"}} def test_section_no_roots_should_return_empty_result(shining_content): items = [ { "key": "director", "value": { "section": "//foo", "items": [{"key": "name", "value": {"path": "./text()"}}], }, } ] data = scrape(shining_content, {"items": items}) assert data == {} def test_section_multiple_roots_should_raise_error(shining_content): with raises(ValueError): items = [ { "key": "director", "value": { "section": "//div", "items": [{"key": "name", "value": {"path": "./text()"}}], }, } ] scrape(shining_content, {"items": items}) PK!*?mmtests/test_xpath.pyfrom __future__ import absolute_import, division, print_function, unicode_literals from piculet import build_tree, xpath content = 'foobar' root = build_tree(content) def test_non_text_queries_should_return_elements(): selected = xpath(root, ".//t1") assert [s.tag for s in selected] == ["t1", "t1"] def test_child_text_queries_should_return_strings(): selected = xpath(root, ".//t1/text()") assert selected == ["foo"] def test_descendant_text_queries_should_return_strings(): selected = xpath(root, ".//t1//text()") assert selected == ["foo", "bar"] def test_attr_queries_should_return_strings(): selected = xpath(root, ".//t1/@a") assert selected == ["v"] def test_non_absolute_queries_should_be_ok(): selected = xpath(root, "//t1") assert [s.tag for s in selected] == ["t1", "t1"] PK!H$((piculet-1.0.1.dist-info/entry_points.txtN+I/N.,()*L.I-Vy\\PK!+#piculet-1.0.1.dist-info/LICENSE.txt GNU LESSER GENERAL PUBLIC LICENSE Version 3, 29 June 2007 Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. This version of the GNU Lesser General Public License incorporates the terms and conditions of version 3 of the GNU General Public License, supplemented by the additional permissions listed below. 0. Additional Definitions. As used herein, "this License" refers to version 3 of the GNU Lesser General Public License, and the "GNU GPL" refers to version 3 of the GNU General Public License. 
"The Library" refers to a covered work governed by this License, other than an Application or a Combined Work as defined below. An "Application" is any work that makes use of an interface provided by the Library, but which is not otherwise based on the Library. Defining a subclass of a class defined by the Library is deemed a mode of using an interface provided by the Library. A "Combined Work" is a work produced by combining or linking an Application with the Library. The particular version of the Library with which the Combined Work was made is also called the "Linked Version". The "Minimal Corresponding Source" for a Combined Work means the Corresponding Source for the Combined Work, excluding any source code for portions of the Combined Work that, considered in isolation, are based on the Application, and not on the Linked Version. The "Corresponding Application Code" for a Combined Work means the object code and/or source code for the Application, including any data and utility programs needed for reproducing the Combined Work from the Application, but excluding the System Libraries of the Combined Work. 1. Exception to Section 3 of the GNU GPL. You may convey a covered work under sections 3 and 4 of this License without being bound by section 3 of the GNU GPL. 2. Conveying Modified Versions. If you modify a copy of the Library, and, in your modifications, a facility refers to a function or data to be supplied by an Application that uses the facility (other than as an argument passed when the facility is invoked), then you may convey a copy of the modified version: a) under this License, provided that you make a good faith effort to ensure that, in the event an Application does not supply the function or data, the facility still operates, and performs whatever part of its purpose remains meaningful, or b) under the GNU GPL, with none of the additional permissions of this License applicable to that copy. 3. Object Code Incorporating Material from Library Header Files. The object code form of an Application may incorporate material from a header file that is part of the Library. You may convey such object code under terms of your choice, provided that, if the incorporated material is not limited to numerical parameters, data structure layouts and accessors, or small macros, inline functions and templates (ten or fewer lines in length), you do both of the following: a) Give prominent notice with each copy of the object code that the Library is used in it and that the Library and its use are covered by this License. b) Accompany the object code with a copy of the GNU GPL and this license document. 4. Combined Works. You may convey a Combined Work under terms of your choice that, taken together, effectively do not restrict modification of the portions of the Library contained in the Combined Work and reverse engineering for debugging such modifications, if you also do each of the following: a) Give prominent notice with each copy of the Combined Work that the Library is used in it and that the Library and its use are covered by this License. b) Accompany the Combined Work with a copy of the GNU GPL and this license document. c) For a Combined Work that displays copyright notices during execution, include the copyright notice for the Library among these notices, as well as a reference directing the user to the copies of the GNU GPL and this license document. 
d) Do one of the following: 0) Convey the Minimal Corresponding Source under the terms of this License, and the Corresponding Application Code in a form suitable for, and under terms that permit, the user to recombine or relink the Application with a modified version of the Linked Version to produce a modified Combined Work, in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source. 1) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (a) uses at run time a copy of the Library already present on the user's computer system, and (b) will operate properly with a modified version of the Library that is interface-compatible with the Linked Version. e) Provide Installation Information, but only if you would otherwise be required to provide such information under section 6 of the GNU GPL, and only to the extent that such information is necessary to install and execute a modified version of the Combined Work produced by recombining or relinking the Application with a modified version of the Linked Version. (If you use option 4d0, the Installation Information must accompany the Minimal Corresponding Source and Corresponding Application Code. If you use option 4d1, you must provide the Installation Information in the manner specified by section 6 of the GNU GPL for conveying Corresponding Source.) 5. Combined Libraries. You may place library facilities that are a work based on the Library side by side in a single library together with other library facilities that are not Applications and are not covered by this License, and convey such a combined library under terms of your choice, if you do both of the following: a) Accompany the combined library with a copy of the same work based on the Library, uncombined with any other library facilities, conveyed under the terms of this License. b) Give prominent notice with the combined library that part of it is a work based on the Library, and explaining where to find the accompanying uncombined form of the same work. 6. Revised Versions of the GNU Lesser General Public License. The Free Software Foundation may publish revised and/or new versions of the GNU Lesser General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Library as you received it specifies that a certain numbered version of the GNU Lesser General Public License "or any later version" applies to it, you have the option of following the terms and conditions either of that published version or of any later version published by the Free Software Foundation. If the Library as you received it does not specify a version number of the GNU Lesser General Public License, you may choose any version of the GNU Lesser General Public License ever published by the Free Software Foundation. If the Library as you received it specifies that a proxy can decide whether future versions of the GNU Lesser General Public License shall apply, that proxy's public statement of acceptance of any version is permanent authorization for you to choose that version for the Library. 
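A minimal usage sketch of the spec-based API, assuming the package is installed and importable as piculet; the sample document, spec contents, and variable names below are illustrative, while the spec keys ("items", "key", "value", "path") and the scrape() call mirror the module and tests above.

import json

import piculet

# An illustrative, well-formed XHTML fragment; real HTML input is normally
# converted with html_to_xhtml() or parsed with lxml_html=True, as in the CLI.
document = (
    '<html><head><title>The Shining</title></head>'
    '<body><h1>The Shining (1980)</h1></body></html>'
)

# Each item maps a key to an extractor description; "path" is an XPath expression,
# and the default reducer concatenates the selected text nodes.
spec = {
    "items": [
        {"key": "title", "value": {"path": "//title/text()"}},
        {"key": "heading", "value": {"path": "//h1//text()"}},
    ]
}

data = piculet.scrape(document, spec)
print(json.dumps(data, indent=2, sort_keys=True))
# prints the extracted mapping: heading -> "The Shining (1980)", title -> "The Shining"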