{ "info": { "author": "Jurismarches", "author_email": "contact@jurismarches.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 4 - Beta", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Programming Language :: Python", "Programming Language :: Python :: 3.4", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6" ], "description": "|logo| boned-html\n##################\n\n|pypi| |travis| |coveralls|\n\nBoned html is a small python library.\n\nIt helps you extract text from an html (in the form of a lxml tree),\nprocess this text, to classify it,\nreinject text in the html with specific css classes.\n\nThe typical use is for anotating an html with classes.\nFor example you are categorizing text,\nand you want the user to visualize those categories on the original html.\n\nThe text will be extracted in a smart way:\nit won't stop at semantic tags (``, ``, etc.)\nbut at other tags (`

`, `

`, etc.).\n\nAs you reinject the text, semantic tags will be added back to text,\nand general html layout will be respected.\n\n.. |logo| image:: ./images/boned-html-64.png\n\n.. |pypi| image:: http://img.shields.io/pypi/v/boned-html.svg?style=flat\n :target: https://pypi.python.org/pypi/boned-html\n.. |travis| image:: http://img.shields.io/travis/jurismarches/boned-html/master.svg?style=flat\n :target: https://travis-ci.org/jurismarches/boned-html\n.. |coveralls| image:: http://img.shields.io/coveralls/jurismarches/boned-html/master.svg?style=flat\n :target: https://coveralls.io/r/jurismarches/boned-html\n\nInstallation\n============\n\n::\n\n pip install boned-html\n\nUsage\n=====\n\nThe fonctionalities are provided by the class `boned_html.Chunker`__ with methods:\n\n* `chunk_tree` to get text chunks from an lxml tree.\n* `unchunk` to put back chunks together providing css classes for pieces of text.\n\n.. __: ./boned_html/chunker.py\n\n\n\nA quick example: imagine we have a function to detect a tel number value in a sentence::\n\n >>> import re\n >>> from itertools import cycle\n >>> def get_tel(text):\n ... splits = re.split(r\"(\\+?(?:\\d\\s*){8,13})\", text)\n ... return list(zip(splits, cycle([None, \"tel\"])))\n >>> get_tel(\"call +33 00 00 00 00\")\n [('call ', None), ('+33 00 00 00 00', 'tel'), ('', None)]\n\nAnd an html::\n\n >>> html = '''\n ... \n ... call +33 00 00 00 00\n ... \n ...

To get an operator call

\n ...

call (country) +33 00 00 00 00

\n ... \n ... \n ... '''\n\nWe chunk::\n\n >>> import lxml.html\n >>> from boned_html import HtmlBoner\n >>> tree = lxml.html.fromstring(html)\n >>> boned = HtmlBoner(tree)\n\nWe evaluate each text and assign \"tel\" class to it if there is a telephone::\n\n >>> for i, text in enumerate(boned):\n ... if text is not None:\n ... boned.set_classes(i, get_tel(text))\n\nWe now rebuild the tree::\n\n >>> boned.tree\n \n >>> print(boned)\n \n call +33 00 00 00 00\n \n

To get an operator call

\n

call (country) +33 00 00 00 00

\n \n \n\nWe have a specific span around our number,\nalso opening and closure of `em` tag was handled,\nand phone number in `head/title` remains the same.\n\n\n", "description_content_type": null, "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/jurismarches/boned-html", "keywords": "html,lxml,scrapping", "license": "", "maintainer": "", "maintainer_email": "", "name": "boned-html", "package_url": "https://pypi.org/project/boned-html/", "platform": "", "project_url": "https://pypi.org/project/boned-html/", "project_urls": { "Homepage": "https://github.com/jurismarches/boned-html" }, "release_url": "https://pypi.org/project/boned-html/0.2/", "requires_dist": [ "lxml", "flake8 (>=2.5.4); extra == 'quality'", "coverage (>=4.0.3); extra == 'test'", "nose (>=1.3.7); extra == 'test'" ], "requires_python": "", "summary": "A small library to smartly extract text from html and eventually rebuild html", "version": "0.2" }, "last_serial": 2964973, "releases": { "0.1": [ { "comment_text": "", "digests": { "md5": "d2b37999fd0b783c9f2950ba0c4a397d", "sha256": "0ca823e798d664bb65c62041a0233b11d72a0d9f2333e253500496c14fa58d6f" }, "downloads": -1, "filename": "boned_html-0.1-py3-none-any.whl", "has_sig": false, "md5_digest": "d2b37999fd0b783c9f2950ba0c4a397d", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 10102, "upload_time": "2017-06-20T18:53:55", "url": "https://files.pythonhosted.org/packages/90/84/a4f6408061019706e72c0e1a3e52469e4e6bdf4cd33ad8686a948d0fdba4/boned_html-0.1-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "5017e4e7061e943c8d93e955e882c825", "sha256": "508d9fe48dd1403f8adf5ce9171677fa10321cb0eeb0135bcaa22fd17bed6fc0" }, "downloads": -1, "filename": "boned-html-0.1.tar.gz", "has_sig": false, "md5_digest": "5017e4e7061e943c8d93e955e882c825", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 7300, "upload_time": "2017-06-20T18:53:57", "url": "https://files.pythonhosted.org/packages/ee/df/7ceb2c7560ea4c9717c40f00cce2b27b2b5b4175a4dfa68b1f2003cee7e1/boned-html-0.1.tar.gz" } ], "0.2": [ { "comment_text": "", "digests": { "md5": "9545412bc910a236fe4d41c13154e809", "sha256": "e75ba1ebfd018660f9ac9dc12f0fe8e3a8ce84b840add48c92cb98748c82de9b" }, "downloads": -1, "filename": "boned_html-0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "9545412bc910a236fe4d41c13154e809", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 13003, "upload_time": "2017-06-21T13:35:50", "url": "https://files.pythonhosted.org/packages/5e/54/3252b547ac81f7933be843b62432e7a8a6bd83ed05652d98860a97cee53b/boned_html-0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "099be9ed384bb245b3f9f24f315ad472", "sha256": "69b1e42a3ef14217b531c47635b355cdeddd716684af1da2d284d4649cf4caff" }, "downloads": -1, "filename": "boned-html-0.2.tar.gz", "has_sig": false, "md5_digest": "099be9ed384bb245b3f9f24f315ad472", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10892, "upload_time": "2017-06-21T13:35:52", "url": "https://files.pythonhosted.org/packages/9a/e2/d54fb05a126bbb6047a7062206aa3fa773b204ce3e982e29afbf46cd8d7d/boned-html-0.2.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "9545412bc910a236fe4d41c13154e809", "sha256": "e75ba1ebfd018660f9ac9dc12f0fe8e3a8ce84b840add48c92cb98748c82de9b" }, "downloads": -1, "filename": "boned_html-0.2-py3-none-any.whl", "has_sig": false, "md5_digest": "9545412bc910a236fe4d41c13154e809", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 13003, "upload_time": "2017-06-21T13:35:50", "url": "https://files.pythonhosted.org/packages/5e/54/3252b547ac81f7933be843b62432e7a8a6bd83ed05652d98860a97cee53b/boned_html-0.2-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "099be9ed384bb245b3f9f24f315ad472", "sha256": "69b1e42a3ef14217b531c47635b355cdeddd716684af1da2d284d4649cf4caff" }, "downloads": -1, "filename": "boned-html-0.2.tar.gz", "has_sig": false, "md5_digest": "099be9ed384bb245b3f9f24f315ad472", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 10892, "upload_time": "2017-06-21T13:35:52", "url": "https://files.pythonhosted.org/packages/9a/e2/d54fb05a126bbb6047a7062206aa3fa773b204ce3e982e29afbf46cd8d7d/boned-html-0.2.tar.gz" } ] }