{
"info": {
"author": "Tom Charnock",
"author_email": "charnock@iap.fr",
"bugtrack_url": null,
"classifiers": [
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
"Programming Language :: Python :: 3.6"
],
"description": "# Information maximiser\n\nUsing neural networks, sufficient statistics can be obtained from data by maximising the Fisher information.\n\nThe neural network takes some data ${\\bf d}$ and maps it to a compressed summary $\\mathscr{f}:{\\bf d}\\to{\\bf x}$ where ${\\bf x}$ can have the same size as the dimensionality of the parameter space, rather than the data space.\n\nTo train the neural network a batch of simulations ${\\bf d}_{\\sf sim}^{\\sf fid}$ created at a fiducial parameter value $\\boldsymbol{\\theta}^{\\rm fid}$ are compressed by the neural network to obtain ${\\bf x}_{\\sf sim}^{\\sf fid}$. From this we can calculate the covariance ${\\bf C_\\mathscr{f}}$ of the compressed summaries. We learn about model parameter distributions using the derivative of the simulation. This can be provided analytically or numercially using ${\\bf d}_{\\sf sim}^{\\sf fid+}$ created above the fiducial parameter value $\\boldsymbol{\\theta}^{\\sf fid+}$ and ${\\bf d}_{\\sf sim}^{\\sf fid-}$ created below the fiducial parameter value $\\boldsymbol{\\theta}^{\\sf fid-}$. 
The simulations are compressed using the network and used to find mean of the summaries $\\partial\\boldsymbol{\\mu}_\\mathscr{f}/\\partial\\theta_\\alpha\\equiv\\boldsymbol{\\mu}_\\mathscr{f},_\\alpha$ via the chain rule\n$$\\frac{\\partial\\mu}{\\partial\\theta_\\alpha} = \\frac{1}{n_{\\textrm{sims}}}\\sum_{i=1}^{n_{\\textrm{sims}}}\\frac{\\partial{\\bf x}_i}{\\partial{\\bf d}_i}\\frac{\\partial{\\bf d}_i}{\\partial\\theta_\\alpha}.$$\nWe then use ${\\bf C}_\\mathscr{f}$ and $\\boldsymbol{\\mu}_\\mathscr{f},_\\alpha$ to calculate the Fisher information\n$${\\bf F}_{\\alpha\\beta} = \\boldsymbol{\\mu}_\\mathscr{f},^T_{\\alpha}{\\bf C}^{-1}_\\mathscr{f}\\boldsymbol{\\mu}_\\mathscr{f},_{\\beta}.$$\nWe want to maximise the Fisher information, and we want the summaries to be orthogonal so to train the network we minimise the loss function\n $$\\Lambda = -\\ln|{\\bf F}_{\\alpha\\beta}|+\\lambda||{\\bf C}_\\mathscr{f}-\\mathbb{1}||_2,$$\nwhere $\\lambda$ is some coupling for the square norm of the network covariance.\n\nWhen using this code please cite arXiv:1802.03537.
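The loss above can be illustrated with a short numpy sketch (an illustration only, not the package's TensorFlow implementation; the function name, shapes, and the Frobenius norm for the coupling term are assumptions):

```python
import numpy as np

def imnn_loss(x, dmu_dtheta, lam=10.):
    # x: compressed summaries at the fiducial model, shape (n_sims, n_summaries)
    # dmu_dtheta: derivative of the mean summary, shape (n_params, n_summaries)
    C = np.cov(x, rowvar=False)  # covariance of the summaries, C_f
    Cinv = np.linalg.inv(C)
    # Fisher information F_ab = mu,_a^T C^-1 mu,_b
    F = dmu_dtheta @ Cinv @ dmu_dtheta.T
    # minimise -ln|F| while driving C_f towards the identity
    sign, logdetF = np.linalg.slogdet(F)
    return -logdetF + lam * np.linalg.norm(C - np.eye(C.shape[0]))

x = np.random.normal(0., 1., size=(1000, 2))
dmu_dtheta = np.eye(2)
loss = imnn_loss(x, dmu_dtheta)
```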
\nThe version of the code used in the paper can be downloaded as v1 or v1.1 from the archive kept on zenodo:\n\n[DOI: 10.5281/zenodo.1175196](https://doi.org/10.5281/zenodo.1175196)\n
\n\nThis code is run using
\n> `python-3.6.6`\n\n> `tensorflow-1.12.0`\n\n> `numpy-1.15.0`\n\n> `tqdm-4.29.0`\n\nAlthough these precise versions may not be necessary, I have put them here to avoid possible conflicts.\n\n## Load modules\n\n```python\n%matplotlib inline\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport tensorflow as tf\nimport IMNN.IMNN as IMNN\nimport IMNN.ABC.ABC as ABC\nimport IMNN.ABC.priors as priors\n```\n\n# Summarising the mean and the variance\n\nFor this example we are going to use $n_{\\bf d}=10$ data points of a 1D field of Gaussian noise with unknown mean and variance to see if the network can learn to summarise them.
\n\nThe likelihood is given by\n$$\\mathcal{L} = \\prod_i^{n_{\\bf d}}\\frac{1}{\\sqrt{2\\pi\\Sigma}}\\exp\\left[-\\frac{1}{2}\\frac{(d_i-\\mu)^2}{\\Sigma}\\right]$$\n\nWe can solve this problem analytically, so it is useful to check how well the network does. There is a single sufficient statistic for each of the mean and the variance, which can be found by maximising the likelihood. We find that\n$$\\sum_i^{n_{\\bf d}}d_i = n_{\\bf d}\\mu\\textrm{ and }\\sum_i^{n_{\\bf d}}(d_i-\\mu)^2=n_{\\bf d}\\Sigma.$$\n\nWe can calculate the Fisher information by taking the negative expectation of the second derivative of the log-likelihood, inserting the above relations, and evaluating at some fiducial parameter values\n$${\\bf F}_{\\alpha\\beta} = -\\left.\\left(\\begin{array}{cc}\\displaystyle-\\frac{n_{\\bf d}}{\\Sigma}&0\\\\0&\\displaystyle-\\frac{n_{\\bf d}}{2\\Sigma^2}\\end{array}\\right)\\right|_{\\textrm{fiducial}}.$$\nIf we choose a fiducial mean of $\\mu^{\\textrm{fid}}=0$ and variance of $\\Sigma^{\\textrm{fid}} = 1$ then we obtain a Fisher information matrix of\n\n```python\nexact_fisher = -np.array([[-10. / 1., 0.], [0., -0.5 * 10. / 1.**2.]])\ndeterminant_exact_fisher = np.linalg.det(exact_fisher)\nprint(\"determinant of the Fisher information\", determinant_exact_fisher)\nplt.imshow(np.linalg.inv(exact_fisher))\nplt.title(\"Inverse Fisher matrix\")\nplt.xticks([0, 1], [r\"$\\mu$\", r\"$\\Sigma$\"])\nplt.yticks([0, 1], [r\"$\\mu$\", r\"$\\Sigma$\"])\nplt.colorbar();\n```\n\n    determinant of the Fisher information 50.000000000000014\n\nLet us observe our _real_ data, which happens to have true parameters $\\mu=3$ and $\\Sigma=2$\n\n```python\nreal_data = np.random.normal(3., np.sqrt(2.), size = (1, 10))\n```\n\n```python\nfig, ax = plt.subplots(1, 1, figsize = (10, 6))\nax.plot(real_data[0], label = \"observed data\")\nax.legend(frameon = False)\nax.set_xlim([0, 9])\nax.set_xticks([])\nax.set_ylabel(\"Data amplitude\");\n```\n\nThe posterior distribution for this data (normalised to integrate to 1) is\n\n```python\n\u03bc_array = np.linspace(-10, 10, 1000)\n\u03a3_array = np.linspace(0.001, 10, 1000)\n\nparameter_grid = np.array(np.meshgrid(\u03bc_array, \u03a3_array, indexing = \"ij\"))\ndx = (\u03bc_array[1] - \u03bc_array[0]) * (\u03a3_array[1] - \u03a3_array[0])\n\nanalytic_posterior = np.exp(-0.5 * (np.sum((real_data[0][:, np.newaxis] - parameter_grid[0, :, 0][np.newaxis, :])**2., axis = 0)[:, np.newaxis] / parameter_grid[1, 0, :][np.newaxis, :] + real_data.shape[1] * np.log(2. 
* np.pi * parameter_grid[1, 0, :][np.newaxis, :])))\nanalytic_posterior = analytic_posterior.T / np.sum(analytic_posterior * dx)\n```\n\n```python\nfig, ax = plt.subplots(2, 2, figsize = (16, 10))\nplt.subplots_adjust(wspace = 0, hspace = 0)\nax[0, 0].plot(parameter_grid[0, :, 0], np.sum(analytic_posterior, axis = 0), linewidth = 1.5, color = 'C2', label = \"Analytic marginalised posterior\")\nax[0, 0].legend(frameon = False)\nax[0, 0].set_xlim([-10, 10])\nax[0, 0].set_ylabel('$\\\\mathcal{P}(\\\\mu|{\\\\bf d})$')\nax[0, 0].set_yticks([])\nax[0, 0].set_xticks([])\nax[1, 0].set_xlabel('$\\\\mu$');\nax[1, 0].set_ylim([0, 10])\nax[1, 0].set_ylabel('$\\\\Sigma$')\nax[1, 0].set_xlim([-10, 10])\nax[1, 0].contour(parameter_grid[0, :, 0], parameter_grid[1, 0, :], analytic_posterior, colors = \"C2\")\nax[1, 1].plot(np.sum(analytic_posterior, axis = 1), parameter_grid[1, 0, :], linewidth = 1.5, color = 'C2', label = \"Analytic marginalised posterior\")\nax[1, 1].legend(frameon = False)\nax[1, 1].set_ylim([0, 10])\nax[1, 1].set_xlabel('$\\\\mathcal{P}(\\\\Sigma|{\\\\bf d})$')\nax[1, 1].set_xticks([])\nax[1, 1].set_yticks([])\nax[0, 1].axis(\"off\");\n```\n\nNow let's see how the information maximising neural network can recover this posterior.\n\n## Generate data\n\nWe start by defining a function to generate the data with the correct shape. The shape must be\n```\ndata_shape = None + input shape\n```\n\n```python\ninput_shape = [10]\n```\n\nIt is useful to define the generating function so that it only takes in the value of the parameter as its input, since the function can then be used for ABC later.
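As a quick sanity check of the shape convention (a toy sketch; the variable names here are illustrative), a batch of simulations carries the batch axis first, followed by the input shape:

```python
import numpy as np

input_shape = [10]
n_batch = 5
# the leading axis is the batch ("None") dimension, then the input shape
batch = np.random.normal(0., 1., size=[n_batch] + input_shape)
```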
\nThe data needs to be generated at a fiducial parameter value and at perturbed values just below and above the fiducial parameter for the numerical derivative.\n\n```python\n\u03b8_fid = np.array([0., 1.])\n\u0394\u03b8pm = np.array([0.1, 0.1])\n```\n\nThe data at the perturbed values should have the shape\n```\nperturbed_data_shape = None + number of parameters + input shape\n```\n\nThe generating function is defined so that the fiducial parameter is passed as a list so that many simulations can be made at once. This is very useful for the ABC function later.\n\n```python\ndef simulator(\u03b8, seed, simulator_args):\n    if seed is not None:\n        np.random.seed(seed)\n    if len(\u03b8.shape) > 1:\n        \u03bc = \u03b8[:, 0]\n        \u03a3 = \u03b8[:, 1]\n    else:\n        \u03bc = 0.\n        \u03a3 = \u03b8\n    return np.moveaxis(np.random.normal(\u03bc, np.sqrt(\u03a3), simulator_args[\"input shape\"] + [\u03b8.shape[0]]), -1, 0)\n```\n\nA more general version of the simulator, which falls back to a unit variance when only a single parameter is provided, is\n\n```python\ndef simulator(\u03b8, seed, simulator_args):\n    if seed is not None:\n        np.random.seed(seed)\n    if len(\u03b8.shape) > 1:\n        if \u03b8.shape[1] == 2:\n            \u03bc = \u03b8[:, 0]\n            \u03a3 = \u03b8[:, 1]\n        else:\n            \u03bc = \u03b8[:, 0]\n            \u03a3 = np.ones_like(\u03bc)\n    else:\n        \u03bc = \u03b8\n        \u03a3 = 1.\n    return np.moveaxis(np.random.normal(\u03bc, np.sqrt(\u03a3), simulator_args[\"input shape\"] + [\u03b8.shape[0]]), -1, 0)\n```\n\n### Training data\nEnough data needs to be made to approximate the covariance matrix of the output summaries. The number of simulations needed to approximate the covariance is `n_s`. 
If the data is particularly large then it might not be possible to pass all the data into active memory at once, and so the simulations can be split into several batches.\n\nFor example, if we wanted to make 10000 simulations, but estimate the covariance using 1000 simulations at a time, we would set\n\n```python\nn_s = 1000\nnum_sims = 10 * n_s\nseed = np.random.randint(1e6)\n```\n\nThe training data can now be made\n\n```python\nt = simulator(\u03b8 = np.tile(\u03b8_fid, [num_sims, 1]), seed = seed, simulator_args = {\"input shape\": input_shape})\n```\n\nIdeally we would be able to take the derivative of our simulations with respect to the model parameters. We can indeed do that in this case, but since this is possibly a rare occurrence I will show an example where the derivatives are calculated numerically. By suppressing the sample variance between the simulations created at the lower and upper varied parameter values, far fewer simulations are needed.\n\n```python\nn_p = 1000\nnum_partial_sims = 10 * n_p\n```\n\nThe sample variance is suppressed by choosing the same initial seed when creating the upper and lower simulations.\n\n```python\nt_m = simulator(\u03b8 = np.tile(\u03b8_fid - np.array([0.1, 0.]), [num_partial_sims, 1]), seed = seed, simulator_args = {\"input shape\": input_shape})\nt_p = simulator(\u03b8 = np.tile(\u03b8_fid + np.array([0.1, 0.]), [num_partial_sims, 1]), seed = seed, simulator_args = {\"input shape\": input_shape})\nt_m = np.stack([t_m, simulator(\u03b8 = np.tile(\u03b8_fid - np.array([0., 0.1]), [num_partial_sims, 1]), seed = seed, simulator_args = {\"input shape\": input_shape})], axis = 1)\nt_p = np.stack([t_p, simulator(\u03b8 = np.tile(\u03b8_fid + np.array([0., 0.1]), [num_partial_sims, 1]), seed = seed, simulator_args = {\"input shape\": input_shape})], axis = 1)\nt_d = (t_p - t_m) / (2. 
* \u0394\u03b8pm)[np.newaxis, :, np.newaxis]\n```\n\nThe fiducial simulations and the simulations for the derivative must be collected in a dictionary, to be stored in the TensorFlow graph or passed to the training function.\n\n```python\ndata = {\"data\": t, \"data_d\": t_d}\n```\n\n### Test data\nWe should also make some test data; here we will use only a single validation set. This needs to be added to the dictionary\n\n```python\nnum_validation_sims = n_s\nnum_validation_partial_sims = n_p\nseed = np.random.randint(1e6)\ntt = simulator(\u03b8 = np.tile(\u03b8_fid, [num_validation_sims, 1]), seed = seed, simulator_args = {\"input shape\": input_shape})\ntt_m = simulator(\u03b8 = np.tile(\u03b8_fid - np.array([0.1, 0.]), [num_validation_partial_sims, 1]), seed = seed, simulator_args = {\"input shape\": input_shape})\ntt_p = simulator(\u03b8 = np.tile(\u03b8_fid + np.array([0.1, 0.]), [num_validation_partial_sims, 1]), seed = seed, simulator_args = {\"input shape\": input_shape})\ntt_m = np.stack([tt_m, simulator(\u03b8 = np.tile(\u03b8_fid - np.array([0., 0.1]), [num_validation_partial_sims, 1]), seed = seed, simulator_args = {\"input shape\": input_shape})], axis = 1)\ntt_p = np.stack([tt_p, simulator(\u03b8 = np.tile(\u03b8_fid + np.array([0., 0.1]), [num_validation_partial_sims, 1]), seed = seed, simulator_args = {\"input shape\": input_shape})], axis = 1)\ntt_d = (tt_p - tt_m) / (2. 
* \u0394\u03b8pm)[np.newaxis, :, np.newaxis]\ndata[\"validation_data\"] = tt\ndata[\"validation_data_d\"] = tt_d\n```\n\n### Data visualisation\nWe can plot the data to see what it looks like.\n\n```python\nfig, ax = plt.subplots(1, 1, figsize = (10, 6))\nax.plot(data['data'][np.random.randint(num_sims)], label = \"training data\")\nax.plot(data['validation_data'][np.random.randint(num_validation_sims)], label = \"test data\")\nax.legend(frameon = False)\nax.set_xlim([0, 9])\nax.set_xticks([])\nax.set_ylabel(\"Data amplitude\");\n```\n\nIt is also very useful to plot the upper and lower derivatives to check that the sample variance is actually suppressed, since the network learns extremely slowly if this isn't done properly.\n\n```python\nfig, ax = plt.subplots(2, 2, figsize = (20, 10))\nplt.subplots_adjust(hspace = 0)\ntraining_index = np.random.randint(num_partial_sims)\ntest_index = np.random.randint(num_validation_partial_sims)\n\nax[0, 0].plot(t_m[training_index, 0], label = \"lower training data\", color = \"C0\", linestyle = \"dashed\")\nax[0, 0].plot(t_p[training_index, 0], label = \"upper training data\", color = \"C0\")\nax[0, 0].plot(tt_m[test_index, 0], label = \"lower validation data\", color = \"C1\", linestyle = \"dashed\")\nax[0, 0].plot(tt_p[test_index, 0], label = \"upper validation data\", color = \"C1\")\nax[0, 0].legend(frameon = False)\nax[0, 0].set_xlim([0, 9])\nax[0, 0].set_xticks([])\nax[0, 0].set_ylabel(\"Data amplitude with varied mean\")\nax[1, 0].plot(data[\"data_d\"][training_index, 0], label = \"derivative training data\", color = \"C0\")\nax[1, 0].plot(data[\"validation_data_d\"][test_index, 0], label = \"derivative validation data\", color = \"C1\")\nax[1, 0].set_xlim([0, 9])\nax[1, 0].set_xticks([])\nax[1, 0].legend(frameon = False)\nax[1, 0].set_ylabel(\"Amplitude of the derivative of the data\\nwith respect to the mean\");\n\nax[0, 1].plot(t_m[training_index, 1], label = \"lower training data\", color = \"C0\", linestyle = 
\"dashed\")\nax[0, 1].plot(t_p[training_index, 1], label = \"upper training data\", color = \"C0\")\nax[0, 1].plot(tt_m[test_index, 1], label = \"lower validation data\", color = \"C1\", linestyle = \"dashed\")\nax[0, 1].plot(tt_p[test_index, 1], label = \"upper validation data\", color = \"C1\")\nax[0, 1].legend(frameon = False)\nax[0, 1].set_xlim([0, 9])\nax[0, 1].set_xticks([])\nax[0, 1].set_ylabel(\"Data amplitude with varied covariance\")\nax[1, 1].plot(data[\"data_d\"][training_index, 1], label = \"derivative training data\", color = \"C0\")\nax[1, 1].plot(data[\"data_d\"][test_index, 1], label = \"derivative validation data\", color = \"C1\")\nax[1, 1].set_xlim([0, 9])\nax[1, 1].set_xticks([])\nax[1, 1].legend(frameon = False)\nax[1, 1].set_ylabel(\"Amplitude of the derivative of the data\\nwith respect to covariance\");\n```\n\n\n\n\n\n## Initiliase the neural network\n### Define network parameters\nThe network is initialised with a base set of parameters
\n\n> `\"dtype\"` - `int, optional` - 32 or 64 for 32 or 64 bit floats and integers\n\n> `\"number of simulations\"` - `int` - the number of simulations to use to approximate covariance\n\n> `\"number of derivative simulations\"` - `int` - the number of derivatives of the simulations to calculate derivative of the mean\n\n> `\"fiducial\"` - `list` - fiducial parameter values at which to train the network at\n\n> `\"number of summaries\"` - `int` - number of summaries the network makes from the data\n\n> `\"input shape\"` - `list` - the shape of the input data\n\n> `\"filename\"` - `str, optional` - a filename to save/load the graph with\n\n\n```python\nparameters = {\n \"dtype\": 32,\n \"number of simulations\": n_s,\n \"number of derivative simulations\": n_p,\n \"fiducial\": \u03b8_fid.tolist(),\n \"number of summaries\": 2,\n \"input shape\": input_shape,\n \"filename\": \"data/model\",\n}\n```\n\n\n```python\ntf.reset_default_graph()\nn = IMNN.IMNN(parameters = parameters)\n```\n\n## Self-defined network\n\nThe information maximising neural network must be a provided with a neural network to optimise. In principle, this should be highly specified to pull out the informative features in the data. All weights should be defined in their own variable scope. Additional tensors which control, say, the dropout or the value of a leaky relu negative gradient can be defined and passed to the training and validation phase using a dictionary.\n\nBelow is an example of a network which takes in the data and passes it through a fully connected neural network with 2 hidden layers with 128 neurons in each and outputs two summaries. The activation is leaky relu on the hidden layers and linear on the output.\n\n
\n