{
    "info": {
        "author": "Terry Peng",
        "author_email": "pengtaoo@gmail.com",
        "bugtrack_url": null,
        "classifiers": [],
        "description": "===\r\nMDR\r\n===\r\n\r\n.. image:: https://travis-ci.org/scrapinghub/mdr.svg?branch=master\r\n    :target: https://travis-ci.org/scrapinghub/mdr\r\n\r\nMDR is a library to detect and extract listing data from HTML pages. It is based on `Finding and Extracting Data Records from Web Pages <http://dl.acm.org/citation.cfm?id=1743635>`_ but\r\nchanges the similarity measure to the tree alignment proposed by `Web Data Extraction Based on Partial Tree Alignment <http://doi.acm.org/10.1145/1060745.1060761>`_ and `Automatic Wrapper Adaptation by Tree Edit Distance Matching <http://arxiv.org/pdf/1103.1252.pdf>`_.\r\n\r\n\r\nRequires\r\n========\r\n\r\n``numpy`` and ``scipy`` must be installed to build this package.\r\n\r\nUsage\r\n=====\r\n\r\nDetect listing data\r\n~~~~~~~~~~~~~~~~~~~\r\n\r\nMDR assumes that data records lie close to the elements with the most text nodes::\r\n\r\n    [1]: import requests\r\n    [2]: from mdr.mdr import MDR\r\n    [3]: mdr = MDR()\r\n    [4]: r = requests.get('http://www.yelp.co.uk/biz/the-ledbury-london')\r\n    [5]: candidates, doc = mdr.list_candidates(r.text.encode('utf8'))\r\n    ...\r\n\r\n    [8]: [doc.getpath(c) for c in candidates[:10]]\r\n     ['/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]/div[2]/ul',\r\n     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]',\r\n     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]',\r\n     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[1]/div/div[2]/div[1]/div[1]/div',\r\n     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[2]/div/div[3]',\r\n     '/html/body/div[2]/div[3]/div[1]/div/div[4]/div[1]/div/div[2]/ul/li[2]/div/div/ul',\r\n     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[1]/div[2]/div[1]',\r\n     '/html/body/div[2]/div[3]/div[2]/div/div[1]/div[2]/div[2]/div[1]/table/tbody',\r\n     '/html/body/div[2]',\r\n     '/html/body/div[2]/div[4]/div/div[1]']\r\n\r\nExtract data record\r\n~~~~~~~~~~~~~~~~~~~\r\n\r\nMDR can find the repetition 
pattern by using tree matching under a given candidate DOM tree. It then builds a mapping from the so-called `seed element` to a list of matched elements from different DOM trees.\r\n\r\nUsed with annotation (optional)\r\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\r\n\r\nYou can annotate the seed record with any tool you like (e.g. scrapely_); MDR will then be able to find the other data in the page.\r\n\r\nFor example, see the demo page here_. The colored data in the first row are annotated manually; the rest are extracted by MDR.\r\n\r\nAuthor\r\n======\r\n\r\nTerry Peng <pengtaoo@gmail.com>\r\n\r\nLicense\r\n=======\r\n\r\nMIT\r\n\r\n.. _scrapely: https://github.com/scrapy/scrapely\r\n.. _here: http://ibc.scrapinghub.com/tmp/h.html",
        "description_content_type": null,
        "docs_url": null,
        "download_url": "UNKNOWN",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "https://github.com/tpeng/mdr",
        "keywords": "",
        "license": "MIT",
        "maintainer": "",
        "maintainer_email": "",
        "name": "mdr",
        "package_url": "https://pypi.org/project/mdr/",
        "platform": "UNKNOWN",
        "project_url": "https://pypi.org/project/mdr/",
        "project_urls": {
            "Download": "UNKNOWN",
            "Homepage": "https://github.com/tpeng/mdr"
        },
        "release_url": "https://pypi.org/project/mdr/0.0.1/",
        "requires_dist": null,
        "requires_python": null,
        "summary": "python library to detect and extract listing data from HTML page",
        "version": "0.0.1"
    },
    "last_serial": 1225531,
    "releases": {
        "0.0.1": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "8e66378ab5c993bf650acd75b2880ee0",
                    "sha256": "7d7be84742642e82e96ab555de31904a90e109a3ff92f2586b6c16920589bbb3"
                },
                "downloads": -1,
                "filename": "mdr-0.0.1.tar.gz",
                "has_sig": false,
                "md5_digest": "8e66378ab5c993bf650acd75b2880ee0",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 49284,
                "upload_time": "2014-08-14T12:35:43",
                "url": "https://files.pythonhosted.org/packages/9a/9e/c8330017d8c0aec9a053f29310137da11d5d33633f5d806404c38d3b13e8/mdr-0.0.1.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "8e66378ab5c993bf650acd75b2880ee0",
                "sha256": "7d7be84742642e82e96ab555de31904a90e109a3ff92f2586b6c16920589bbb3"
            },
            "downloads": -1,
            "filename": "mdr-0.0.1.tar.gz",
            "has_sig": false,
            "md5_digest": "8e66378ab5c993bf650acd75b2880ee0",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 49284,
            "upload_time": "2014-08-14T12:35:43",
            "url": "https://files.pythonhosted.org/packages/9a/9e/c8330017d8c0aec9a053f29310137da11d5d33633f5d806404c38d3b13e8/mdr-0.0.1.tar.gz"
        }
    ]
}