{ "info": { "author": "Tian L.", "author_email": "bolitt@gmail.com", "bugtrack_url": null, "classifiers": [ "Intended Audience :: Developers", "Intended Audience :: Science/Research", "License :: OSI Approved :: BSD License", "Operating System :: OS Independent", "Programming Language :: Python", "Programming Language :: Python :: 2", "Programming Language :: Python :: 2.6", "Programming Language :: Python :: 2.7", "Topic :: Software Development", "Topic :: Software Development :: Libraries :: Python Modules" ], "description": "======================================\npy-matchtpl: xml/html matching library\n======================================\n\nA Python library that matches and extracts data from xml/html sources using a pre-defined \ntemplate. It provides a convenient, coding-free way to process data, \nespecially web pages.\n\nThe features of ``matchtpl`` are summarized as follows:\n\n* **Easy to use**. The goal is to ease developers' text-data processing work. \n Only basic knowledge of jQuery (mostly *CSSSelector*), a popular JavaScript\n DOM-manipulation library, is assumed. Users only need to provide an XML template that\n describes how to extract information and what the expected output is; ``matchtpl`` then \n does the rest of the work.\n\n* **User-friendly**. The toolkit does not require coding in Python. If you need to\n do very sophisticated work, py-matchtpl can still take over the dirty work, such as \n parsing the html file, extracting useful information, organizing data into the preferred\n data structures, or serializing into *string* (plaintext) / json / yaml / Python builtin structures (the default).\n \n* **Extensibility**. Currently, it supports three basic types of data structures: \n (1) *string*; (2) *array*; (3) *map*. Their combinations meet the requirements\n in most cases. In addition, users can provide a *UDF* (user-defined function) to customize processing in their \n own way. 
\n\nThe fundamental philosophy of ``matchtpl`` is:\n\n* **Neat**: keep it clean and hide the dirty things.\n\n* **Simple**: everything is configurable, declarative and intuitive. (Avoid complex control-flow syntax: ``if``/``for``/``while``.)\n\n* **Extensible**: leave the imagination to the user, so that any idea can be integrated rapidly.\n\nInstallation\n=====================\n\nYou can install the latest package from source (or download and unzip it from GitHub)::\n\n $ git clone https://github.com/bolitt/py-matchtpl.git\n \n $ cd py-matchtpl\n \n $ python setup.py install\n\n\nor use easy_install or pip::\n\n $ easy_install matchtpl\n\n # alternatively install by pip\n\n $ pip install matchtpl\n\n\n\nBasic Data Structures\n=====================\n\n1. **string**: ````. The typical atomic structure; it can be post-processed and\n converted into other types, such as ``int`` or ``float``.\n\n2. **array**: ````. An ordered list of data, also known as a list.\n Items are retrieved by index: *array[0]*.\n\n3. **map**: ````. A key-value structure, also known as a hash or table.\n Values are retrieved by key, *map['name']*, or by property, *map.name*.\n\nWe believe most data fits into these data structures or their combinations.\n\n\nKeywords & Elements\n-------------------------\n\nHere are typical keywords:\n\n* **select**: select target element(s) from the document.\n * selector_string (string): CSS3 selector that chooses the target.\n\n* **get**: get the inner text | html of the target DOM element.\n * type (string): \"text\" | \"html\". \n\n* **eval**: evaluate locally using Python syntax. (Often used to call the jQuery-like API.)\n * script_text (string): script using Python syntax.\n\n* **default**: default value if the result is none.\n * value (string): default value.\n\n* **as**: output format, in a human-readable way.\n * type (string): str | json | yaml. 
If not provided, Python builtin data structures are returned.\n\n* **encoding**: set the decoder for the data source.\n * encode_type (string): such as UTF-8 (default), GBK/GB2312 (some Chinese websites), UTF-16, etc.\n\n(Keywords are not limited to those above.)\n\n\nThe extensible elements are:\n\n* Structure element: ````, ````, ```` (see above).\n\n* Root element: ````. Acts as the serialization class and provides multiple output formats for the result.\n\n* Customized element: ````, where *action* can be any other non-conflicting tag. *action* is a\n customized action provided by the user when calling *parser.parse(..., {'action': some_function})*.\n\n\nQuick Start\n=====================\n\nThis example shows how to extract data from an html source. \nMatchtpl provides an easy way to parse your html file\nand format the output. It is a real case that extracts product\ninformation from a web page of amazon.com.\n\n\nPython Code\n------------------------\n\nIn Python, typical usage looks like this::\n\n    #!/usr/bin/env python\n\n    from matchtpl import MTemplateEnv, MTemplate, MTemplateParser\n\n    if __name__ == '__main__':\n        # initialize environment\n        env = MTemplateEnv(template='tpl_amazon.xml')\n\n        # build template\n        tpl = MTemplate()\n        tpl.build(env)\n\n        # initialize parser and parse\n        parser = MTemplateParser(tpl)\n        results = parser.parse('amazon.html')\n\n\nConfigurable Template\n------------------------\n\nThe pre-defined template is written in xml and acts as a\nconfig file that indicates the meta information of the target \n(usually another html/xml file or stream). 
The\nparser then uses the template to guide its processing and \noutputs the result::\n\n \n \n \n \n\t \n \n \n \n \n \n \n \n \n \n \n \n \n\n\nAfter execution, the output is organized as json::\n\n [\n [\n {\n \"image\": \"http://ec4.images-amazon.com/images/I/516Vhic-I9L._AA160_.jpg\", \n \"info\": \"\u5218\u4e9a\u8389 \u5e7f\u4e1c\u7701\u51fa\u7248\u96c6\u56e2\uff0c\u5e7f\u4e1c\u7ecf\u6d4e\u51fa\u7248\u793e (2011-05) - Kindle\u7535\u5b50\u4e66\", \n \"price\": \"\uffe51.99\", \n \"review\": \"\u5e73\u57474.4 \u661f\", \n \"title\": \"\u603b\u7ecf\u7406\u8d22\u52a1\u4e00\u672c\u901a\"\n }, \n // up to 25 results: map\n ]\n ]\n\n(At present, json, yaml, plaintext and Python builtin structures are supported. More formats will be supported later.)\n\n\nFuture Scenarios\n=================\n\nPossible functionalities:\n\n1. Unix-like pipe: ``|``. Just concatenate output|input step by step.\n\n2. Interactive. Interaction with pages, such as automation/login/testing.\n\n3. Type-casting. Convert a value into int/float, or directly instantiate a class.\n\n4. 
Regex support (``/^abcd/ABCD/g``) and some basic UDFs, such as split/trim/toUpper/toLower.\n\nProject Links\n==============\n**Package Release**: https://pypi.python.org/pypi/matchtpl\n\n**Source Code**: https://github.com/bolitt/py-matchtpl.git \n\n\nContributors\n==============\n\n* v0.1 Tian Lin\n Initialized the project and made the alpha release of the library.\n\n\n*Any contributions are welcome!*\n\n\n\nSee https://pypi.python.org/pypi/matchtpl for the full documentation\n\nNews\n====\n\nv-0.1.0.dev1, 11/8/2013 -- Initial release.\n\nv-0.1.0.dev2, 12/11/2013 -- Minor change to class interfaces.\n\nv-0.1.0.dev3, 12/15/2013 -- Clean up some dependencies and fix a setup bug.\n\n\nv-0.1.2, ?/?/2013 -- Add the encoding keyword for the root element.", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/bolitt/py-matchtpl.git", "keywords": "match template crawler extract data xml html", "license": "BSD license", "maintainer": null, "maintainer_email": null, "name": "matchtpl", "package_url": "https://pypi.org/project/matchtpl/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/matchtpl/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/bolitt/py-matchtpl.git" }, "release_url": "https://pypi.org/project/matchtpl/0.1.2/", "requires_dist": null, "requires_python": null, "summary": "Matching template to extract data from xml or html", "version": "0.1.2" }, "last_serial": 961654, "releases": { "0.1.0.dev1": [ { "comment_text": "", "digests": { "md5": "264768773d71711b1e861ec45559e872", "sha256": "13bf5ad2547fdd4584d55f88d68b096aef75d9708c36e744281af91b617d95f4" }, "downloads": -1, "filename": "matchtpl-0.1.0.dev1.zip", "has_sig": false, "md5_digest": "264768773d71711b1e861ec45559e872", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 16985, "upload_time": "2013-11-21T04:10:58", "url": 
"https://files.pythonhosted.org/packages/be/c7/2ca2cf8f99b0d949e6a3d686afae02db87cca5737cb20444eeb276cb167e/matchtpl-0.1.0.dev1.zip" } ], "0.1.0.dev2": [ { "comment_text": "", "digests": { "md5": "97114a68df0d83ad594ef1920c162524", "sha256": "ad7cc89c4cd34cbcb871982cba66060d701b4fad039c0f0f8033d8524d919546" }, "downloads": -1, "filename": "matchtpl-0.1.0.dev2.win32.exe", "has_sig": false, "md5_digest": "97114a68df0d83ad594ef1920c162524", "packagetype": "bdist_wininst", "python_version": "any", "requires_python": null, "size": 212323, "upload_time": "2013-12-14T12:53:05", "url": "https://files.pythonhosted.org/packages/58/ff/bed3eff95e23358780e79c51ba5d6507aeea125d27ae73d1e56d13bf0c2a/matchtpl-0.1.0.dev2.win32.exe" }, { "comment_text": "", "digests": { "md5": "15612ff5f996fb4df9a898318361cb83", "sha256": "ba66ec35b64dd1b155c7b28df532cd8da1230a61de6f9a2e98154abf0614f368" }, "downloads": -1, "filename": "matchtpl-0.1.0.dev2.zip", "has_sig": false, "md5_digest": "15612ff5f996fb4df9a898318361cb83", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17312, "upload_time": "2013-12-14T12:52:57", "url": "https://files.pythonhosted.org/packages/ec/6a/c17d1c44f3b5210054a99ce4e6f2b3d534cf01a853a4cb8e52d2b94a7e15/matchtpl-0.1.0.dev2.zip" } ], "0.1.0.dev3": [ { "comment_text": "", "digests": { "md5": "1fe94a763a3baa5a675edf64a556dcba", "sha256": "c96b218aabd43752a70793dce22c92c7a5385f80e06220e1d4f231a1454aff00" }, "downloads": -1, "filename": "matchtpl-0.1.0.dev3.win32.exe", "has_sig": false, "md5_digest": "1fe94a763a3baa5a675edf64a556dcba", "packagetype": "bdist_wininst", "python_version": "any", "requires_python": null, "size": 212601, "upload_time": "2013-12-14T17:12:00", "url": "https://files.pythonhosted.org/packages/4f/ff/74ee3bce6010805ac575ea92a4123564eb00ca53095a323d37f7e7872ffa/matchtpl-0.1.0.dev3.win32.exe" }, { "comment_text": "built for Windows-8", "digests": { "md5": "e342ebb7f35f600194e6cae4eceb458c", "sha256": 
"ce71a4ad30caef82db1c4b1659ad91c8dfdb16f2701da2fd93f6aa8a8ef38daf" }, "downloads": -1, "filename": "matchtpl-0.1.0.dev3.win32.zip", "has_sig": false, "md5_digest": "e342ebb7f35f600194e6cae4eceb458c", "packagetype": "bdist_dumb", "python_version": "any", "requires_python": null, "size": 13203, "upload_time": "2013-12-14T17:11:42", "url": "https://files.pythonhosted.org/packages/1a/d9/f71f4454261e1fac52ed7111e0f25631f90fa9ea3256f8b2157f668293e2/matchtpl-0.1.0.dev3.win32.zip" }, { "comment_text": "", "digests": { "md5": "2a671d5a70a60ed852f2bc0e00195bb1", "sha256": "b4132f59bfbefc66380e493d2478a297ac7910782ef721fd446efe0233d36936" }, "downloads": -1, "filename": "matchtpl-0.1.0.dev3.zip", "has_sig": false, "md5_digest": "2a671d5a70a60ed852f2bc0e00195bb1", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 17626, "upload_time": "2013-12-14T17:09:12", "url": "https://files.pythonhosted.org/packages/7b/71/0842db237c53dbf99731e9547047721c30a6936fd5c3a1aa52d44c253ea2/matchtpl-0.1.0.dev3.zip" } ], "0.1.1": [ { "comment_text": "", "digests": { "md5": "17cc4e862fa71d7754e58c2603d9e39f", "sha256": "8a9aed53aaef434b68b0faac8a62390a2252df2f0778ac51f08668c5bf5e7c87" }, "downloads": -1, "filename": "matchtpl-0.1.1.win32.exe", "has_sig": false, "md5_digest": "17cc4e862fa71d7754e58c2603d9e39f", "packagetype": "bdist_wininst", "python_version": "any", "requires_python": null, "size": 212708, "upload_time": "2013-12-30T17:15:18", "url": "https://files.pythonhosted.org/packages/a1/ea/466bde866177c7e7581813b772ea226dfc5cfbf06d7dd5bffed45f8f9237/matchtpl-0.1.1.win32.exe" }, { "comment_text": "built for Windows-8", "digests": { "md5": "2c1448ed7b80c550d1c17affa5771744", "sha256": "51edaa511b8bd7891fa5bf94f942568c97f61d31e53b411e93a1fa67860ea251" }, "downloads": -1, "filename": "matchtpl-0.1.1.win32.tar.gz", "has_sig": false, "md5_digest": "2c1448ed7b80c550d1c17affa5771744", "packagetype": "bdist_dumb", "python_version": "any", "requires_python": null, 
"size": 11231, "upload_time": "2013-12-30T17:15:10", "url": "https://files.pythonhosted.org/packages/a7/f2/708e3d925134318e9d3938669264bddffb81678027b79a461be61900f144/matchtpl-0.1.1.win32.tar.gz" }, { "comment_text": "built for Windows-8", "digests": { "md5": "5e0bc0f2efdd7a1548e70465d591cdf1", "sha256": "57c528c985b17f47934b41768f909c79eba72a21396fe6f61d6ec65a8c264559" }, "downloads": -1, "filename": "matchtpl-0.1.1.win32.zip", "has_sig": false, "md5_digest": "5e0bc0f2efdd7a1548e70465d591cdf1", "packagetype": "bdist_dumb", "python_version": "any", "requires_python": null, "size": 13257, "upload_time": "2013-12-30T17:15:14", "url": "https://files.pythonhosted.org/packages/f2/90/fd01a2af6e4e19a01db56ae4d96a2cd9d74ecaeeae13f71c65ed8fb529c2/matchtpl-0.1.1.win32.zip" }, { "comment_text": "", "digests": { "md5": "19f093b7a30f1b67cbafc69f10783ba6", "sha256": "d6f7dc21d884f1175cbab83a2fb14b24304afa68ee5835ed6564cb5b7bf04495" }, "downloads": -1, "filename": "matchtpl-0.1.1.zip", "has_sig": false, "md5_digest": "19f093b7a30f1b67cbafc69f10783ba6", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 94900, "upload_time": "2013-12-30T17:15:03", "url": "https://files.pythonhosted.org/packages/a1/3d/52b817a3e87ee6d4794c4f21ab9036851124d5d4c788fe65b589f3141fe7/matchtpl-0.1.1.zip" } ], "0.1.2": [ { "comment_text": "", "digests": { "md5": "87c2668d7220a594e0c928249d29f53a", "sha256": "38e74ed0cd5281c631fa26628507fd92b5d2f4de9de87cc16bdb55e5548ff7da" }, "downloads": -1, "filename": "matchtpl-0.1.2.zip", "has_sig": false, "md5_digest": "87c2668d7220a594e0c928249d29f53a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 95178, "upload_time": "2014-01-06T13:02:57", "url": "https://files.pythonhosted.org/packages/32/9f/1f968e3a02db780cd5a2cb27cbfee922fb4c88dd066663682bbe2761e459/matchtpl-0.1.2.zip" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "87c2668d7220a594e0c928249d29f53a", "sha256": 
"38e74ed0cd5281c631fa26628507fd92b5d2f4de9de87cc16bdb55e5548ff7da" }, "downloads": -1, "filename": "matchtpl-0.1.2.zip", "has_sig": false, "md5_digest": "87c2668d7220a594e0c928249d29f53a", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 95178, "upload_time": "2014-01-06T13:02:57", "url": "https://files.pythonhosted.org/packages/32/9f/1f968e3a02db780cd5a2cb27cbfee922fb4c88dd066663682bbe2761e459/matchtpl-0.1.2.zip" } ] }