QML Tutorial
------------

This tutorial is a general introduction to kernel ridge regression with QML.

Theory
~~~~~~

We build a regression model of some property, :math:`y`, for some system, :math:`\widetilde{\mathbf{X}}`, which could correspond to, e.g., the atomization energy of a molecule:

:math:`y\left(\widetilde{\mathbf{X}} \right) = \sum_i \alpha_i \ K\left( \widetilde{\mathbf{X}}, \mathbf{X}_i\right)`

Here we use, for example, a Gaussian kernel function with the Frobenius norm:

:math:`K_{ij} = K\left( \mathbf{X}_i, \mathbf{X}_j\right) = \exp\left( -\frac{\| \mathbf{X}_i - \mathbf{X}_j\|_2^2}{2\sigma^2}\right)`

The regression coefficients are obtained through inversion of the regularized kernel matrix and multiplication with the reference labels:

:math:`\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}`

Tutorial exercises
~~~~~~~~~~~~~~~~~~

Clone the following Git repository to access the necessary scripts and the QM7 dataset (atomization energies and relaxed geometries at the PBE0/def2-TZVP level of theory) for ~7k GDB1-7 molecules. [#rupp]_ [#ruddigkeit]_

.. code:: bash

    git clone https://github.com/qmlcode/tutorial.git

Additionally, the repository contains Python 3 scripts with the solutions to each exercise.

Exercise 1: Representations
~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this exercise we use QML to generate the Coulomb matrix and Bag of Bonds (BoB) representations. [#montavon]_ In QML, data can be parsed via the ``Compound`` class, which stores data and generates representations in NumPy's ndarray format. If you run the code below, you will read in the file ``qm7/0001.xyz`` (a methane molecule) and generate a Coulomb matrix representation (sorted by row norm) and a BoB representation.

.. code:: python

    import qml

    # Create the compound object mol from the file qm7/0001.xyz,
    # which happens to be methane
    mol = qml.Compound(xyz="qm7/0001.xyz")

    # Generate and print a Coulomb matrix for a compound with 5 atoms
    mol.generate_coulomb_matrix(size=5, sorting="row-norm")
    print(mol.representation)

    # Generate and print BoB bags for a compound containing C and H
    mol.generate_bob(asize={"C":2, "H":5})
    print(mol.representation)

The representations are simply stored as 1D vectors. Note the keyword ``size``, which is the largest number of atoms in any molecule occurring in the test or training set. Additionally, the Coulomb matrix can take a sorting scheme as a keyword, and the BoB representation requires a specification of how many atoms of each element to make room for in the representation. Lastly, you can print the following properties, which are read from the XYZ file:

.. code:: python

    # Print other properties stored in the object
    print(mol.coordinates)
    print(mol.atomtypes)
    print(mol.nuclear_charges)
    print(mol.name)
    print(mol.unit_cell)
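Because every molecule is padded to the same ``size``, the representation vectors for different molecules have identical lengths, which is what later allows them to be stacked into a single 2D array. The following is a minimal sketch illustrating this; the file ``qm7/0002.xyz`` is an assumed example of a second molecule from the dataset.

.. code:: python

    import qml

    # Two molecules with different numbers of atoms, padded to the same
    # maximum size (qm7/0002.xyz is an assumed second file from the dataset)
    mol1 = qml.Compound(xyz="qm7/0001.xyz")
    mol2 = qml.Compound(xyz="qm7/0002.xyz")

    for m in (mol1, mol2):
        m.generate_coulomb_matrix(size=23, sorting="row-norm")

    # Identical shapes despite the different molecular sizes
    print(mol1.representation.shape)
    print(mol2.representation.shape)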
Exercise 2: Kernels
~~~~~~~~~~~~~~~~~~~

In this exercise we generate a Gaussian kernel matrix, :math:`\mathbf{K}`, using the representations, :math:`\mathbf{X}`, which are generated as in the previous exercise:

:math:`K_{ij} = \exp\left( -\frac{\| \mathbf{X}_i - \mathbf{X}_j\|_2^2}{2\sigma^2}\right)`

QML supplies functions to generate the most basic kernels (e.g. Gaussian and Laplacian). In the exercise below, we calculate a Gaussian kernel for the QM7 dataset. In order to save time you can import the entire QM7 dataset as ``Compound`` objects from the file ``tutorial_data.py`` found in the tutorial GitHub repository.

.. code:: python

    import numpy as np

    # Import QM7, already parsed to QML
    from tutorial_data import compounds

    from qml.kernels import gaussian_kernel

    # For every compound generate a Coulomb matrix or BoB
    for mol in compounds:
        mol.generate_coulomb_matrix(size=23, sorting="row-norm")
        # mol.generate_bob(size=23, asize={"O":3, "C":7, "N":3, "H":16, "S":1})

    # Make a big 2D array with all the representations
    X = np.array([mol.representation for mol in compounds])

    # Print all representations
    print(X)

    # Run on only a subset of the first 100 (for speed)
    X = X[:100]

    # Define the kernel width
    sigma = 1000.0

    # K is also a NumPy array
    K = gaussian_kernel(X, X, sigma)

    # Print the kernel
    print(K)

Exercise 3: Regression
~~~~~~~~~~~~~~~~~~~~~~

With the kernel matrix and representations sorted out in the previous two exercises, we can now solve for the :math:`\boldsymbol{\alpha}` regression coefficients:

:math:`\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{y}`

One of the most efficient ways of solving this equation is via a Cholesky decomposition. QML includes a function named ``cho_solve()`` to do this in the math module ``qml.math``. In this step it is convenient to use only a subset of the full dataset as training data (see below); if you truncated ``X`` to the first 100 molecules in the previous step, remove that line first. To save time, you can import the PBE0/def2-TZVP atomization energies for the QM7 dataset from the file ``tutorial_data.py``. These have been sorted to match the ordering of the representations generated in the previous exercise. Extend your code from the previous step with the code below:

.. code:: python

    from qml.math import cho_solve
    from tutorial_data import energy_pbe0

    # Assign the first 1000 molecules to the training set
    X_training = X[:1000]
    Y_training = energy_pbe0[:1000]

    sigma = 4000.0
    K = gaussian_kernel(X_training, X_training, sigma)
    print(K)

    # Add a small lambda to the diagonal of the kernel matrix
    K[np.diag_indices_from(K)] += 1e-8

    # Use the built-in Cholesky decomposition to solve
    alpha = cho_solve(K, Y_training)
    print(alpha)

Exercise 4: Prediction
~~~~~~~~~~~~~~~~~~~~~~

With the :math:`\boldsymbol{\alpha}` regression coefficients from the previous step, we have (successfully) trained the machine, and we are now ready to make predictions for other compounds. This is done using the following equation:

:math:`y\left(\widetilde{\mathbf{X}} \right) = \sum_i \alpha_i \ K\left( \widetilde{\mathbf{X}}, \mathbf{X}_i\right)`

In this step we further divide the dataset into a training and a test set. Try using the last 1000 entries as the test set.

.. code:: python

    # Assign the last 1000 molecules to the test set
    X_test = X[-1000:]
    Y_test = energy_pbe0[-1000:]

    # Calculate a kernel matrix between test and training data, using the same sigma
    Ks = gaussian_kernel(X_test, X_training, sigma)

    # Make the predictions
    Y_predicted = np.dot(Ks, alpha)

    # Calculate the mean absolute error (MAE)
    print(np.mean(np.abs(Y_predicted - Y_test)))

Exercise 5: Learning curves
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Repeat the prediction from Exercise 4 with training set sizes of 1000, 2000, and 4000 molecules, and note the MAE for every training size. Plot a learning curve of the MAE versus the training set size. Generate learning curves for the Gaussian and Laplacian kernels, using both the Coulomb matrix and Bag of Bonds representations. Which combination gives the best learning curve? Note that you will have to adjust the kernel width (sigma) along the way.
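A minimal sketch of such a learning-curve loop is shown below. It assumes ``matplotlib`` is available for plotting and reuses a fixed ``sigma``; in practice you should re-tune ``sigma`` for each kernel and representation, and for the Laplacian kernel you can swap in ``laplacian_kernel`` from ``qml.kernels``.

.. code:: python

    import numpy as np
    import matplotlib.pyplot as plt  # assumed plotting library, not part of QML

    from qml.kernels import gaussian_kernel
    from qml.math import cho_solve
    from tutorial_data import compounds, energy_pbe0

    # Generate representations as in the previous exercises
    for mol in compounds:
        mol.generate_coulomb_matrix(size=23, sorting="row-norm")
    X = np.array([mol.representation for mol in compounds])

    # Fixed test set: the last 1000 molecules
    X_test = X[-1000:]
    Y_test = energy_pbe0[-1000:]

    sigma = 4000.0  # assumption: re-tune for each kernel/representation
    training_sizes = [1000, 2000, 4000]
    maes = []

    for n in training_sizes:
        # Train on the first n molecules
        K = gaussian_kernel(X[:n], X[:n], sigma)
        K[np.diag_indices_from(K)] += 1e-8
        alpha = cho_solve(K, energy_pbe0[:n])

        # Predict on the test set and record the MAE
        Ks = gaussian_kernel(X_test, X[:n], sigma)
        Y_predicted = np.dot(Ks, alpha)
        maes.append(np.mean(np.abs(Y_predicted - Y_test)))

    # Learning curves are conventionally drawn on log-log axes
    plt.loglog(training_sizes, maes, "o-")
    plt.xlabel("Training set size")
    plt.ylabel("MAE")
    plt.show()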
Exercise 6: Delta learning
~~~~~~~~~~~~~~~~~~~~~~~~~~

A powerful technique in machine learning is the delta-learning approach. Instead of predicting the PBE0/def2-TZVP atomization energies directly, we shall try to predict the difference between the DFTB3 (a semi-empirical quantum method) and PBE0 atomization energies. Instead of importing the ``energy_pbe0`` data, you can import ``energy_delta`` and use this instead:

.. code:: python

    from tutorial_data import energy_delta

    Y_training = energy_delta[:1000]
    Y_test = energy_delta[-1000:]

Finally, redraw one of the learning curves from the previous exercise, and note how the prediction improves.

References
~~~~~~~~~~

.. [#rupp] Rupp et al., Phys. Rev. Lett., 2012.

.. [#ruddigkeit] Ruddigkeit et al., J. Chem. Inf. Model., 2012.

.. [#montavon] Montavon et al., New J. Phys., 2013.