{
    "info": {
        "author": "Tavis Barr",
        "author_email": "tavisbarr@gmail.com",
        "bugtrack_url": null,
        "classifiers": [],
        "description": "BEIJING TOMORROW\n----------------\n\nBy Tavis Barr, tavisbarr@gmail.com, Copyright 2016\nLicensed under the Gnu Public License V. 2.0\nContact me about other licensing arrangements\n\n\nThis program uses satellite data from the NASA MODIS program, and pollution\ndata from the US embassy in Beijing to develop a predictor of the next day's\nchange in pollution based on today's satellite images.\n\nIt is not designed as a first-rate forecaster -- it would need several \nimprovements for that, including addition of weather data and probably better\ntraining of the models, or for that matter, just including the current\npollution level as a feature -- but rather as a demonstration of how to use\nCaffeOnSpark to perform a complete modelling exercise from download to\ntraining to prediction.  The CaffeOnSpark package is not very well documented\nas of the time of this writing, so the code serves to illustrate its usage.\n\n\nSETUP\n-----\n\nThis program is designed to be run on Spark.  I execute this through the\nEclipse IDE using PyDev, but it can also be done via the command line.  To get\nPyDev working with CaffeOnSpark, the following changes need to be made to the\nPython interpreter:\n\n(a) The following need to be added to PYTHONPATH, with ${SPARK_HOME} and\n${CAFFE_ON_SPARK} replaced with the actual absolute paths:\n\n${SPARK_HOME}\n${SPARK_HOME}/python\n${SPARK_HOME}/python/lib/py4j-[your_version]-src.zip\n${SPARK_HOME}/python/lib/pyspark.zip\n${CAFFE_ON_SPARK}/caffe-grid/target/caffeonsparkpythonapi.zip\n\n(b) The following environmental variables need to be added:\n\nSPARK_HOME needs to be set to the root of your Spark installation\nPYSPARK_SUBMIT_ARGS needs to be set to the following (YMMV):\n\n--master local[*]       # OR URL if you are running on YARN \n--queue PyDevSpark1.6.1 # Can be called whatever  \n--files ${CAFFE_ON_SPARK}/caffe-public/python/caffe/_caffe.so,\n        ${CAFFE_ON_SPARK}/caffe-public/distribute/lib/libcaffe.so.1.0.0-rc3  \n--jars \"${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar,\n\t\t${CAFFE_ON_SPARK}/caffe-distri/target/caffe-distri-0.1-SNAPSHOT-jar-with-dependencies.jar\" \n--driver-library-path \"${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar\" \n--driver-class-path \"${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar\" \n--conf \tspark.driver.extraLibraryPath=\"${LD_LIBRARY_PATH}:\n\t\t${CAFFE_ON_SPARK}/caffe-public/distribute/lib:\n\t\t${CAFFE_ON_SPARK}/caffe-distri/distribute/lib\"  \n--conf \tspark.executorEnv.LD_LIBRARY_PATH=\"${LD_LIBRARY_PATH}:\n\t\t${CAFFE_ON_SPARK}/caffe-public/distribute/lib:\n\t\t${CAFFE_ON_SPARK}/caffe-distri/distribute/lib\" \npyspark-shell\n\nThese are roughly the same arguments that will need to be made when\ninvoking pyspark if this program is run from the command line.  Again, \n${CAFFE_ON_SPARK} should be replaced with its actual value.\n\n\n\nRunning the program takes place in four major steps, each of which uses its\nown module.  These are: (1) Downloading the data, (2) Building the LMDB\ndatabase, (3) Training the module, and (4) Prediction on today's data.\nAdditionally, there is a module to train a standard (non-\"deep\") model\nusing the Random Forest library on Apache Spark for comparison.  Each of\nthese modules are described in order.\n\n(1) Downloading the data\n\nThe module that is responsible for downloading the satellite data,\ntilescraper, is available as a separate package, because it likely has uses\naside from training models etc.  The pollution data are much smaller, and\nare pulled in real time using the beijingpollutioncompiler module.  The data\nare pulled from the US Embassy web site in Beijing; the Chinese Ministry of\nEnvironmental Protection also publishes data that are more accurate, but these\nare more difficult to obtain and also do not go back quite as far.\n\n\n(2) Building the database\n\nCaffeOnSpark expects the data in one of a handful of datbase formats; I found\nLMDB easiest to work with, so I transformed the data into this format to use\nthem.  Note that the images require to have the index order of the underlying\ndata changed from the standard image format for Caffe to use them.\n\nUnfortunately, the Python module for Caffe also expects the outcome data to be\nan integer; fixing this is a one-line change in the Caffe source code, but I \ndid not want to use a non-standard version of Caffe, so I just multiplied the \nchanges by 100.\n\nWhen no satellite data is available, MODIS does not throw an exception but\nmerely returns a black image.  The pollution data may be blank for some days.\nAs the CaffeOnSpark trainer is expecting all observations to be valid, days\nwith such data are discarded at the time the database is built.\n\nThe outcome to be predicted is the logarithmic change between today's pollution\nlevel and tomorrow's pollution level.\n\nThe features are expected to have a mean of zero, so the mean of each pixel is\nsubtratcted from every image of the training and test data before it is placed\nin the database.\n\n(3) Training the Model\n\nThe Python interface to CaffeOnSpark is somewhat limited as of this writing;\nin particular, it is necessary to have the configuration of the solver and \nnetwork specified in .prototxt files rather than configured as Python objects\nin the code (the Caffe interfaces to define them in the code will work, but\nthey cannot easily be attached to the CaffeOnSpark configuration).\n\nThe steps to train the model are pretty much the same as in the demonstration\nexamples widely available online; in fact, very little deviation from these\nsteps is supported by the Python interface to CaffeOnSpark.  The key is that\nwe obtain our model configuration from a folder in the resource directory,\nand also deliver the model there.  When it is finished training, the routine\nreports an R-squared of the regression on the test sample.  In any event,\nmost of the work in this section involves tweaking the .prototxt files to \nimprove the model.\n\n(4) Prediction\n\nOnce the model is trained, it can be called from the regular Caffe\ninterface without requiring the overhead of Spark.  After checking that today's\nimage is not blank, we load it and transform it the same way as the training\nimages were transformed -- the axes are swapped and the mean is subtracted.\nFinally, the test image is expected to be part of a batch, in this case a \nbatch of one, so we add an extra dimension to the image array.  \n\nAdditionally, we need a slight alteration to the .prototxt file to predict\nusing this model.  First, the top layers (loss and accuracy) need to be taken\nout of the model, otherwise they will be returned and not the prediction.\nFinally, the batch size for the test model needs to be set to one.\n\nWith that in hand, we are ready to make a prediction.  Obviously, this model\nis quite rudimentary.\n\n(5) Comparison\n\nIt may be interesting to see how well we can do without using \"deep\" learning.\nThe pollutionrflearner downscales the image into a 4x4 grid, and then builds\na random forest over the grid.  The clear challenge in building any model\nof this phenomenon is that we have far more features than observations, so we\nneed to intelligently simplify the features before attempting to train.  In\nany event, the random forest model does slightly worse than the \"deep\" model,\nbut not substantially so.  Frankly, neither model is highly predictive.\n\n",
        "description_content_type": null,
        "docs_url": null,
        "download_url": "UNKNOWN",
        "downloads": {
            "last_day": -1,
            "last_month": -1,
            "last_week": -1
        },
        "home_page": "http://www.github.com/tavisbarr/BeijingTomorrow",
        "keywords": null,
        "license": "UNKNOWN",
        "maintainer": null,
        "maintainer_email": null,
        "name": "BeijingTomorrow",
        "package_url": "https://pypi.org/project/BeijingTomorrow/",
        "platform": "UNKNOWN",
        "project_url": "https://pypi.org/project/BeijingTomorrow/",
        "project_urls": {
            "Download": "UNKNOWN",
            "Homepage": "http://www.github.com/tavisbarr/BeijingTomorrow"
        },
        "release_url": "https://pypi.org/project/BeijingTomorrow/0.1.0/",
        "requires_dist": null,
        "requires_python": null,
        "summary": "This package generates a beijingtomorrow of the next day's pollution level in Beijing using daily MODIS satellite data and CaffeOnSpark.",
        "version": "0.1.0"
    },
    "last_serial": 2222067,
    "releases": {
        "0.1.0": [
            {
                "comment_text": "",
                "digests": {
                    "md5": "ad2a51d76c6a951da7c3837f5cea7603",
                    "sha256": "84f83651e0d8df7eed10fb7825a403a1e6f9d53614e40e5b89b69a5ddd1bd200"
                },
                "downloads": -1,
                "filename": "BeijingTomorrow-0.1.0-py2-none-any.whl",
                "has_sig": false,
                "md5_digest": "ad2a51d76c6a951da7c3837f5cea7603",
                "packagetype": "bdist_wheel",
                "python_version": "2.7",
                "requires_python": null,
                "size": 14868,
                "upload_time": "2016-07-14T22:49:00",
                "url": "https://files.pythonhosted.org/packages/e7/fe/a085338542270fa3a7be90a1a1e181cbac9738dd5b0e01a85fa2632a5568/BeijingTomorrow-0.1.0-py2-none-any.whl"
            },
            {
                "comment_text": "",
                "digests": {
                    "md5": "e472628508ec2ac404f5274ea85873b8",
                    "sha256": "ea5700ad9df226394de49082488c4323c1f04829168197af34701cdf69d37b80"
                },
                "downloads": -1,
                "filename": "BeijingTomorrow-0.1.0.tar.gz",
                "has_sig": false,
                "md5_digest": "e472628508ec2ac404f5274ea85873b8",
                "packagetype": "sdist",
                "python_version": "source",
                "requires_python": null,
                "size": 4094748,
                "upload_time": "2016-07-14T22:48:56",
                "url": "https://files.pythonhosted.org/packages/8c/57/f42c2811783d05abc69e6bf64b42c0979bc156026a308bf6316103854db4/BeijingTomorrow-0.1.0.tar.gz"
            }
        ]
    },
    "urls": [
        {
            "comment_text": "",
            "digests": {
                "md5": "ad2a51d76c6a951da7c3837f5cea7603",
                "sha256": "84f83651e0d8df7eed10fb7825a403a1e6f9d53614e40e5b89b69a5ddd1bd200"
            },
            "downloads": -1,
            "filename": "BeijingTomorrow-0.1.0-py2-none-any.whl",
            "has_sig": false,
            "md5_digest": "ad2a51d76c6a951da7c3837f5cea7603",
            "packagetype": "bdist_wheel",
            "python_version": "2.7",
            "requires_python": null,
            "size": 14868,
            "upload_time": "2016-07-14T22:49:00",
            "url": "https://files.pythonhosted.org/packages/e7/fe/a085338542270fa3a7be90a1a1e181cbac9738dd5b0e01a85fa2632a5568/BeijingTomorrow-0.1.0-py2-none-any.whl"
        },
        {
            "comment_text": "",
            "digests": {
                "md5": "e472628508ec2ac404f5274ea85873b8",
                "sha256": "ea5700ad9df226394de49082488c4323c1f04829168197af34701cdf69d37b80"
            },
            "downloads": -1,
            "filename": "BeijingTomorrow-0.1.0.tar.gz",
            "has_sig": false,
            "md5_digest": "e472628508ec2ac404f5274ea85873b8",
            "packagetype": "sdist",
            "python_version": "source",
            "requires_python": null,
            "size": 4094748,
            "upload_time": "2016-07-14T22:48:56",
            "url": "https://files.pythonhosted.org/packages/8c/57/f42c2811783d05abc69e6bf64b42c0979bc156026a308bf6316103854db4/BeijingTomorrow-0.1.0.tar.gz"
        }
    ]
}