{ "info": { "author": "Terry Peng", "author_email": "pengtaoo@gmail.com", "bugtrack_url": null, "classifiers": [], "description": "===\r\nMDR\r\n===\r\n\r\n.. image:: https://travis-ci.org/scrapinghub/mdr.svg?branch=master\r\n :target: https://travis-ci.org/scrapinghub/mdr\r\n\r\nMDR is a library to detect and extract listing data from HTML pages. It is implemented based on `Finding and Extracting Data Records from Web Pages `_ but\r\nchanges the similarity measure to the tree alignment proposed by `Web Data Extraction Based on Partial Tree Alignment `_ and `Automatic Wrapper Adaptation by Tree Edit Distance Matching `_.\r\n\r\n\r\nRequires\r\n========\r\n\r\n``numpy`` and ``scipy`` must be installed to build this package.\r\n\r\nUsage\r\n=====\r\n\r\nDetect listing data\r\n~~~~~~~~~~~~~~~~~~~\r\n\r\nMDR assumes the data records are close to the elements that have the most text nodes::\r\n\r\n [1]: import requests\r\n [2]: from mdr.mdr import MDR\r\n [3]: mdr = MDR()\r\n [4]: r = requests.get('http://www.yelp.co.uk/biz/the-ledbury-london')\r\n [5]: candidates, doc = mdr.list_candidates(r.text.encode('utf8'))\r\n ...\r\n\r\n [8]: [doc.getpath(c) for c in candidates[:10]]\r\n ['/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[2]/ul',\r\n '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]',\r\n '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]',\r\n '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[1]/div/div[2]/div[1]/div[1]/div',\r\n '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[2]/div/div[3]',\r\n '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[2]/div/div/ul',\r\n '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]',\r\n '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]/div[1]/table/tbody',\r\n '/html/body/div[2]',\r\n '/html/body/div[2]/div[4]/div/div[1]']\r\n\r\nExtract data record\r\n~~~~~~~~~~~~~~~~~~~\r\n\r\nMDR can find the repetition pattern by using tree matching under a given candidate DOM tree. Then it will 
build a mapping from a so-called `seed element` to a list of matched elements from different DOM trees.\r\n\r\nUsed with annotation (optional)\r\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r\n\r\nYou can annotate the seed record with any tool you like (e.g. scrapely_); MDR will then be able to find the other data in the page.\r\n\r\nFor example, see the demo page here_. The colored data in the first row are annotated manually; the rest are extracted by MDR.\r\n\r\nAuthor\r\n======\r\n\r\nTerry Peng \r\n\r\nLicense\r\n=======\r\n\r\nMIT\r\n\r\n.. _scrapely: https://github.com/scrapy/scrapely\r\n.. _here: http://ibc.scrapinghub.com/tmp/h.html", "description_content_type": null, "docs_url": null, "download_url": "UNKNOWN", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/tpeng/mdr", "keywords": "", "license": "MIT", "maintainer": "", "maintainer_email": "", "name": "mdr", "package_url": "https://pypi.org/project/mdr/", "platform": "UNKNOWN", "project_url": "https://pypi.org/project/mdr/", "project_urls": { "Download": "UNKNOWN", "Homepage": "https://github.com/tpeng/mdr" }, "release_url": "https://pypi.org/project/mdr/0.0.1/", "requires_dist": null, "requires_python": null, "summary": "python library to detect and extract listing data from HTML page", "version": "0.0.1" }, "last_serial": 1225531, "releases": { "0.0.1": [ { "comment_text": "", "digests": { "md5": "8e66378ab5c993bf650acd75b2880ee0", "sha256": "7d7be84742642e82e96ab555de31904a90e109a3ff92f2586b6c16920589bbb3" }, "downloads": -1, "filename": "mdr-0.0.1.tar.gz", "has_sig": false, "md5_digest": "8e66378ab5c993bf650acd75b2880ee0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 49284, "upload_time": "2014-08-14T12:35:43", "url": "https://files.pythonhosted.org/packages/9a/9e/c8330017d8c0aec9a053f29310137da11d5d33633f5d806404c38d3b13e8/mdr-0.0.1.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": 
"8e66378ab5c993bf650acd75b2880ee0", "sha256": "7d7be84742642e82e96ab555de31904a90e109a3ff92f2586b6c16920589bbb3" }, "downloads": -1, "filename": "mdr-0.0.1.tar.gz", "has_sig": false, "md5_digest": "8e66378ab5c993bf650acd75b2880ee0", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 49284, "upload_time": "2014-08-14T12:35:43", "url": "https://files.pythonhosted.org/packages/9a/9e/c8330017d8c0aec9a053f29310137da11d5d33633f5d806404c38d3b13e8/mdr-0.0.1.tar.gz" } ] }