QML Tutorial
-----------------
This tutorial is a general introduction to kernel-ridge regression with QML.
Theory
~~~~~~~~~~~
In kernel-ridge regression, we build a model of some property, :math:`y`, for a query system, :math:`\widetilde{\mathbf{X}}` (this could correspond to e.g. the atomization energy of a molecule), as a sum of kernel functions over the training systems:
:math:`y\left(\widetilde{\mathbf{X}} \right) = \sum_i \alpha_i \ K\left( \widetilde{\mathbf{X}}, \mathbf{X}_i\right)`
For example, using a Gaussian kernel function with the Frobenius norm as the distance measure:
:math:`K_{ij} = K\left( \mathbf{X}_i, \mathbf{X}_j\right) = \exp\left( -\frac{\| \mathbf{X}_i - \mathbf{X}_j\|_2^2}{2\sigma^2}\right)`
The regression coefficients :math:`\boldsymbol{\alpha}` are obtained by inverting the regularized kernel matrix and multiplying with the vector of reference labels:
:math:`\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}`
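Before turning to QML itself, it helps to see these three equations in code. The following is a minimal, self-contained NumPy sketch on synthetic 1D data (not the QML API); it implements the Gaussian kernel, the regularized solve, and the prediction sum directly:

.. code:: python

    import numpy as np

    # Synthetic toy problem: learn y = sin(4x) from 20 random points
    rng = np.random.RandomState(0)
    X_train = rng.rand(20, 1)
    y_train = np.sin(4.0 * X_train[:, 0])

    sigma = 0.3     # kernel width
    llambda = 1e-8  # regularization strength

    # Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dists = np.sum((X_train[:, None, :] - X_train[None, :, :])**2, axis=2)
    K = np.exp(-sq_dists / (2 * sigma**2))

    # alpha = (K + lambda*I)^-1 y
    alpha = np.linalg.solve(K + llambda * np.eye(len(K)), y_train)

    # Predict y(X_new) = sum_i alpha_i K(X_new, X_i)
    X_new = np.array([[0.5]])
    k = np.exp(-np.sum((X_new - X_train)**2, axis=1) / (2 * sigma**2))
    print(np.dot(k, alpha))  # should be close to sin(2.0) ~ 0.909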
Tutorial exercises
~~~~~~~~~~~~~~~~~~~~~
Clone the following Git repository to access the necessary scripts and the QM7 dataset, which contains atomization energies and relaxed geometries at the PBE0/def2-TZVP level of theory for ~7k GDB1-7 molecules. [#rupp]_ [#ruddigkeit]_
.. code:: bash
git clone https://github.com/qmlcode/tutorial.git
Additionally, the repository contains Python 3 scripts with the solutions to each exercise.
Exercise 1: Representations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In this exercise we use QML to generate the Coulomb matrix and Bag-of-Bonds (BoB) representations. [#montavon]_
In QML, data can be parsed via the ``Compound`` class, which stores data and generates representations as NumPy ``ndarray`` objects.
If you run the code below, you will read the file ``qm7/0001.xyz`` (a methane molecule) and generate a Coulomb matrix representation (sorted by row norm) as well as a BoB representation.
.. code:: python
import qml
# Create the compound object mol from the file qm7/0001.xyz which happens to be methane
mol = qml.Compound(xyz="qm7/0001.xyz")
# Generate and print a coulomb matrix for compound with 5 atoms
mol.generate_coulomb_matrix(size=5, sorting="row-norm")
print(mol.representation)
# Generate and print BoB bags for compound containing C and H
mol.generate_bob(asize={"C":2, "H":5})
print(mol.representation)
The representations are simply stored as 1D vectors.
Note the keyword ``size``, which must be set to the largest number of atoms in any molecule occurring in the test or training set; smaller molecules are zero-padded to this size (see the quick check after the code below).
Additionally, the Coulomb matrix can take a sorting scheme as a keyword, and the BoB representation requires a specification of how many atoms of each element to make room for in the representation.
Lastly, you can print the following properties, which are read from the XYZ file:
.. code:: python
# Print other properties stored in the object
print(mol.coordinates)
print(mol.atomtypes)
print(mol.nuclear_charges)
print(mol.name)
print(mol.unit_cell)
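As a quick check of the padding behavior (this sketch assumes the sorted Coulomb matrix is stored as its flattened triangular part, so the vector length would be size(size+1)/2 regardless of the actual molecule):

.. code:: python

    # Pad methane's Coulomb matrix up to the largest QM7 molecule (23 atoms)
    mol.generate_coulomb_matrix(size=23, sorting="row-norm")

    # Expect length 23 * 24 / 2 = 276, with zeros where atoms are missing
    print(mol.representation.shape)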
Exercise 2: Kernels
~~~~~~~~~~~~~~~~~~~~~
In this exercise we generate a Gaussian kernel matrix, :math:`\mathbf{K}`, using the representations, :math:`\mathbf{X}`, which are generated similarly to the example in the previous exercise:
:math:`K_{ij} = \exp\left( -\frac{\| \mathbf{X}_i - \mathbf{X}_j\|_2^2}{2\sigma^2}\right)`
QML supplies functions to generate the most basic kernels (e.g. Gaussian, Laplacian). In the exercise below, we calculate a Gaussian kernel for the QM7 dataset.
In order to save time you can import the entire QM7 dataset as ``Compound`` objects from the file ``tutorial_data.py`` found in the tutorial GitHub repository.
.. code:: python
import numpy as np

# Import QM7, already parsed to QML
from tutorial_data import compounds

from qml.kernels import gaussian_kernel
# For every compound generate a coulomb matrix or BoB
for mol in compounds:
mol.generate_coulomb_matrix(size=23, sorting="row-norm")
# mol.generate_bob(size=23, asize={"O":3, "C":7, "N":3, "H":16, "S":1})
# Make a big 2D array with all the representations
X = np.array([mol.representation for mol in compounds])
# Print all representations
print(X)
# Run on only a subset of the first 100 (for speed)
X = X[:100]
# Define the kernel width
sigma = 1000.0
# K is also a Numpy array
K = gaussian_kernel(X, X, sigma)
# Print the kernel
print(K)
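The other basic kernels in QML follow the same call signature, so trying the Laplacian kernel mentioned above is a one-line change (a sketch; note that the optimal kernel width will generally differ from the Gaussian case):

.. code:: python

    from qml.kernels import laplacian_kernel

    # Same shapes and usage as gaussian_kernel, but based on the L1 distance
    K_laplacian = laplacian_kernel(X, X, sigma)
    print(K_laplacian)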
Exercise 3: Regression
~~~~~~~~~~~~~~~~~~~~~~~~
With the kernel matrix and representations sorted out in the previous two exercises, we can now solve for the regression coefficients :math:`\boldsymbol{\alpha}`:
:math:`\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}`
One of the most efficient ways to solve this equation is via a Cholesky decomposition.
QML includes a function named ``cho_solve()`` to do this, available from the math module ``qml.math``.
In this step it is convenient to use only a subset of the full dataset as training data (see below).
The following builds on the code from the previous step.
To save time, you can import the PBE0/def2-TZVP atomization energies for the QM7 dataset from the file ``tutorial_data.py``.
This has been sorted to match the ordering of the representations generated in the previous exercise.
Extend your code from the previous step with the code below:
.. code:: python
from qml.math import cho_solve
from tutorial_data import energy_pbe0
# Assign the first 1000 molecules to the training set
# (remove the earlier X = X[:100] truncation so enough molecules are available)
X_training = X[:1000]
Y_training = energy_pbe0[:1000]
sigma = 4000.0
K = gaussian_kernel(X_training, X_training, sigma)
print(K)
# Add a small lambda to the diagonal of the kernel matrix
K[np.diag_indices_from(K)] += 1e-8
# Use the built-in Cholesky-decomposition to solve
alpha = cho_solve(K, Y_training)
print(alpha)
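As a quick sanity check (reusing the variables above), the machine should reproduce its own training labels almost exactly, since the regularization lambda is tiny:

.. code:: python

    # Predictions on the training set itself; MAE should be close to zero
    Y_train_predicted = np.dot(K, alpha)
    print(np.mean(np.abs(Y_train_predicted - Y_training)))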
Exercise 4: Prediction
~~~~~~~~~~~~~~~~~~~~~~~~
With the :math:`\boldsymbol{\alpha}` regression coefficients from the previous step, we have (successfully) trained the machine, and we are now ready to do predictions for other compounds.
This is done using the following equation:
:math:`y\left(\widetilde{\mathbf{X}} \right) = \sum_i \alpha_i \ K\left( \widetilde{\mathbf{X}}, \mathbf{X}_i\right)`
In this step we further divide the dataset into a training set and a test set. Try using the last 1000 entries as the test set.
.. code:: python
# Assign the last 1000 molecules to the test set
X_test = X[-1000:]
Y_test = energy_pbe0[-1000:]

# Calculate a kernel matrix between test and training data, using the same sigma
Ks = gaussian_kernel(X_test, X_training, sigma)

# Make the predictions
Y_predicted = np.dot(Ks, alpha)

# Calculate the mean absolute error (MAE)
print(np.mean(np.abs(Y_predicted - Y_test)))
Exercise 5: Learning curves
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Repeat the prediction from Exercise 4 with training set sizes of 1000, 2000, and 4000 molecules.
Note the MAE for every training size.
Plot a learning curve of the MAE versus the training set size.
Generate learning curves for both the Gaussian and Laplacian kernels, using both the Coulomb matrix and Bag-of-Bonds representations.
Which combination gives the best learning curve? Note that you will have to adjust the kernel width (sigma) along the way.
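A minimal sketch of such a loop is shown below. It assumes matplotlib is installed, that ``X`` holds representations for the full dataset (i.e. without the 100-molecule truncation used earlier for speed), and it reuses ``X_test``, ``Y_test``, ``energy_pbe0``, ``gaussian_kernel`` and ``cho_solve`` from the previous exercises:

.. code:: python

    import matplotlib.pyplot as plt

    sigma = 4000.0
    training_sizes = [1000, 2000, 4000]
    maes = []

    for n in training_sizes:
        # Train on the first n molecules
        K = gaussian_kernel(X[:n], X[:n], sigma)
        K[np.diag_indices_from(K)] += 1e-8
        alpha = cho_solve(K, energy_pbe0[:n])

        # Predict the held-out test set and record the MAE
        Ks = gaussian_kernel(X_test, X[:n], sigma)
        maes.append(np.mean(np.abs(np.dot(Ks, alpha) - Y_test)))

    # Learning curves are conventionally plotted on log-log axes
    plt.loglog(training_sizes, maes, "o-")
    plt.xlabel("Training set size")
    plt.ylabel("MAE")
    plt.show()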
Exercise 6: Delta learning
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A powerful technique in machine learning is the delta learning approach. Instead of predicting the PBE0/def2-TZVP atomization energies, we shall try to predict the difference between DFTB3 (a semi-empirical quantum method) and PBE0 atomization energies.
Instead of importing the ``energy_pbe0`` data, import ``energy_delta`` and use it as the label vector:
.. code:: python
from tutorial_data import energy_delta
Y_training = energy_delta[:1000]
Y_test = energy_delta[-1000:]
Finally, redraw one of the learning curves from the previous exercise and note how the prediction improves.
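Keep in mind that the machine now predicts a correction to the DFTB3 baseline rather than the total energy. The sketch below only illustrates the bookkeeping; the ``energy_dftb3`` array is hypothetical and not necessarily provided by ``tutorial_data``:

.. code:: python

    # Predicted corrections for the test set (same Ks and alpha machinery as before)
    delta_predicted = np.dot(Ks, alpha)

    # Hypothetical: if baseline DFTB3 energies were available as `energy_dftb3`,
    # the PBE0-level estimate would be baseline plus predicted correction
    # energy_pbe0_estimate = energy_dftb3[-1000:] + delta_predicted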
References
~~~~~~~~~~~~~
.. [#rupp] Rupp et al., Phys. Rev. Lett., 2012.
.. [#ruddigkeit] Ruddigkeit et al., J. Chem. Inf. Model., 2012.
.. [#montavon] Montavon et al., New J. Phys., 2013.