Tutorial

Welcome to the SPARC tutorial!

Ammonia Borate (NH\(_3\)BH\(_3\))

Ammonia Borate structure — Schematic representation of B-N bond dissociation in NH\(_3\)BH\(_3\) .

Introduction 

In this tutorial, we demonstrate how to run the SPARC workflow end-to-end. We will show how to:

Prepare input files (see Input File for details),
Prepare initial ab initio data,
Train interatomic potentials with DeepMD-kit,
Run machine learning molecular dynamics (ML/MD),
Using active learning with Query-by-Committee (QbC) to adaptively improve the model.

Preparation 

First, create a new test directory and download the tutorial data. If you haven’t installed SPARC yet, see SPARC Installation Guide or the Quick Start Guide guide.

The dataset is hosted on Zenodo and can be accessed via the corresponding Zenodo record. You can also download the archive directly from here. After downloading the archive, extract its contents:

$ tar -xvf AmmoniaBorate.tar.gz

$ cd AmmoniaBorate
$ ls
INCAR  POSCAR  input.json  input.yaml  iter_000000  plumed_dp.dat
$ tree iter_000000
   iter_000000
   └── 00.dft
      └── AseMD.traj

POSCAR: Initial structure in VASP format,
INCAR: VASP input file,
plumed_dp.dat: Plumed file,
input.yaml: SPARC workflow configuration file,
input.json: DeepMD training parameters.

Note

All first-principles calculations were performed using VASP package and since it as licenced software, the POTCAR files are not included in the archive. Users must supply the appropriate POTCAR files corresponding to their licensed VASP distribution.

The compressed archive contains all input files required to execute the SPARC MLIP workflow. It also includes the configuration files necessary to fully reproduce the results reported in this study. Additionally, the Result directory located at the root level of the extracted archive contains the output data generated from the execution of the workflow.

The file AseMD.traj inside the folder iter_000000/00.dft contains the initial AIMD data computed with VASP to start the workflow. Now we only need to prepare the input.yaml ( see Input File) file, which defines the entire SPARC workflow.

The input.yaml defines the SPARC workflow, which is organized into four main steps:

Initial Data preparation with ab initio configurations
Run DeepMD-kit for Training ML models interatomic potential models,
Use ASE MD engine together with Plumed to run Machine Learning Molecular Dynamics (ML/MD) sampling,
Iteratively improve ML potential with Active learning by detecting poorly represented regions with new candidates.

Note

Each file is included in this tutorial for reference.

input.yaml

# YAML configuration file for SPARC pacakge
# General setings
general:
structure_file: "POSCAR"           # Input structure

# DFT calculator settings
dft_calculator:
name: "VASP"                       # DFT package name
prec: "Normal"                     # Precision level
kgamma: True                       # Gamma point calculation
incar_file: "INCAR"                # Path to INCAR file
exe_command: "mpirun -np 16 /path/to/VASP/bin/vasp_std"   # VASP executable

# ASE MD simulation settings
md_simulation:
ensemble: "NVT"                    # Ensemble for MD simulation
thermostat: "Nose"                 # Thermostat type (nose-Hoover)
tdamp: 10                          # Damping parameter for thermostat
timestep_fs: 1.0                   # TimeStep for MD simulation
temperature: 350                   # Temperature in Kelvin
steps: 0                           # Number of MD Steps [if = 0 then False]
log_frequency: 1                   # Interval for MD log and save trajectories
restart: False                     # MD Restart
use_dft_plumed: False              # Use PLUMED for AIMD simulation

# Plumed settings
dft_plumed:
input_file: "plumed_dft.dat"       # PLUMED input file
restart: False
kT: 0.02585                        # kT value eV unit

# Deepmd training settings
deepmd_setup:
training: False                    # Enable Deep potential training
data_dir: "Dataset/00.data"        # Path to store training and validation data
input_file: "input.json"           # Input file for DeepMD training
skip_min: 0                        # Minimum frames to skip
skip_max: null                     # Maximum frames to skip
MdSimulation: True                 # run ML/MD simulation
num_models: 2                      # Number of DeePMD models
timestep_fs: 1.0                   # ML/MD timestep
md_steps: 500000                   # ML dynamics steps
multiple_run: 1                    # Run multiple inference of ML/MD with different veolcities
log_frequency: 80                  # log ML/MD
epot_threshold: 2.5                # Allowed +/- eV thresold potentail energy fluctuation durinf ML/MD run
use_plumed: True                   # True, if run ML/MD with PLUMED
plumed_file: 'plumed_dp.dat'       # Plumed input file

# Active Learning
active_learning: True               # QbC to `tag` new configurations
learning_restart: False             # restart Workflow cycle if killed
latest_model: 'iter_000005/01.train/training_1/frozen_model_1.pb'   # Path to latest model when restart
iteration: 5                        # Number of AL iterations
model_dev:                          # min/max atomic force to 'tag' new configurations
f_min_dev: 0.05
f_max_dev: 0.5

# check metrics (optional)
distance_metrics:
- pair: [0, 7]
   min_distance: 1.0  # Minimum allowed distance in Angstroms
   max_distance: 7.0  # Maximum allowed distance in Angstroms
- pair: [1, 7]
   min_distance: 0.8  # Minimum allowed distance in Angstroms
   max_distance: 1.5  # Maximum allowed distance in Angstroms
- pair: [0, 6]
   min_distance: 0.8  # Minimum allowed distance in Angstroms
   max_distance: 1.5  # Maximum allowed distance in Angstroms

# File output settings [defaults]
output:
log_file: "AseMD.log"        # Log file for MD simulation
xyz_file: "AseTraj.xyz"      # XYZ trajectory file for MD run
aimdtraj_file: "AseMD.traj"  # ASE trajectory file for AIMD run
dptraj_file: "dpmd.traj"     # ASE ML/MD trajectory file

input.json

   {
   "_comment": " model parameters",
   "model": {
      "type_map":     ["B", "H", "N"],
      "descriptor" :{
            "type":             "se_e2_a",
            "sel":              "auto",
            "rcut_smth":        0.50,
            "rcut":             6.00,
            "neuron":           [25, 50, 100],
            "resnet_dt":        false,
            "axis_neuron":      16,
            "seed":             12123,
            "_comment":         " that's all"
      },
      "fitting_net" : {
            "neuron":           [120, 120, 120],
            "resnet_dt":        true,
            "seed":             12123,
            "_comment":         " that's all"
      },
      "_comment":     " that's all"
   },

   "learning_rate" :{
      "type":         "exp",
      "decay_steps":  5000,
      "start_lr":     0.001,
      "stop_lr":      3.51e-8,
      "_comment":     "that's all"
   },

   "loss" :{
      "type":         "ener",
      "start_pref_e": 0.02,
      "limit_pref_e": 1,
      "start_pref_f": 1000,
      "limit_pref_f": 1,
      "start_pref_v": 0,
      "limit_pref_v": 0,
      "_comment":     " that's all"
   },

   "training" : {
      "training_data": {
            "systems":          ["training_data"],
            "batch_size":       "auto",
            "_comment":         "that's all"
      },
      "validation_data":{
            "systems":          ["validation_data"],
            "batch_size":       "auto",
            "numb_btch":        1,
            "_comment":         "that's all"
      },
      "numb_steps":   500000,
      "seed":         12123,
      "disp_file":    "lcurve.out",
      "disp_freq":    1000,
      "save_freq":    10000,
      "_comment":     "that's all"
   },

   "_comment":         "that's all"
}

INCAR

# Global Parameters #
SYSTEM = NH3BH3
ISTART = 0
ICHARG = 2
ISPIN  = 1
ENCUT  = 300
LCHARG = .FALSE.
LWAVE  = .FALSE.
# Electronic Relaxation #
ISMEAR = 0
SIGMA  = 0.05
EDIFF  = 1E-05

POSCAR

NH3BH3
0000000000000000
0000000000000000    0.0000000000000000    0.0000000000000000
0000000000000000   12.0000000000000000    0.0000000000000000
0000000000000000    0.0000000000000000   12.0000000000000000
B    H    N
   6     1
Direct
3779049269710626  0.4608129327206285  0.3686019834737309
6453715677372998  0.5804071143184686  0.7242016573374457
5854579728428035  0.5273358295291573  0.6234820843519628
6923831185069531  0.6059197989660134  0.5995049316749146
4105296494597752  0.4845721146109838  0.2757983426625543
3076168814930469  0.3940802010339866  0.3962878597352457
4170584773274015  0.4942723187483011  0.4581363881549123
6647284028152569  0.5421008292213045  0.6505695389414967

Execute SPARC 

sparc -i input.yaml

Warning

Ensure all environment variables are properly set before execution. See the Quick Start Guide section for details.

This launches the first iteration (``iter_000000``):

iter_000000/
├── 00.dft    # Ab initio reference data
├── 01.train  # Training of ML models
└── 02.dpmd   # ML-driven MD simulation

Training ML Models 

SPARC automatically collects ab initio data and prepares training/validation sets. It then launches MLIP training. Depending on user settings, SPARC trains n models and creates separate training_n directories.

The input.json file is native to the DeePMD-kit package. See the DeepMD training documentation for a full list of parameters.

$ cd iter_000000/01.train/training_1
$ less lcurve.out
#  step      rmse_val    rmse_trn    rmse_e_val  rmse_e_trn    rmse_f_val  rmse_f_trn         lr
# If there is no available reference data, rmse_*_{val,trn} will print nan
0           3.99e+01    3.18e+01      9.27e-01    9.28e-01      1.26e+00    1.01e+00    1.0e-03
1000        6.78e+00    6.19e+00      8.65e-01    8.59e-01      2.14e-01    1.95e-01    1.0e-03
2000        4.98e+00    4.83e+00      1.77e-01    1.57e-01      1.57e-01    1.53e-01    1.0e-03
3000        2.00e+00    2.75e+00      8.99e-02    9.15e-02      6.32e-02    8.69e-02    1.0e-03
.
.
497000      1.14e-02    8.39e-03      3.95e-04    2.66e-04      1.12e-02    8.20e-03    3.9e-08
498000      9.45e-03    7.52e-03      1.51e-04    3.71e-04      9.26e-03    7.30e-03    3.9e-08
499000      8.35e-03    9.65e-03      1.84e-04    1.48e-04      8.17e-03    9.46e-03    3.9e-08
500000      1.97e-02    1.04e-02      2.98e-04    9.08e-05      1.94e-02    1.02e-02    3.5e-08

Note

You can also visualize the learning process by plotting the lcurve.out file. See Analysis section.

Running ML/MD 

Once models are trained, SPARC selects one of them to initiate ML/MD simulations:

$ cd iter_000000/02.dpmd
$ ls
AseMolDyn.log  COLVAR  dpmd.traj  HILLS_0  HILLS_1  Iter0_dpmd.log
AseTraj.xyz    dft_candidates   model_dev_0.out   set.000  set.001
type_map.raw   type.raw

SPARC tracks atomic force error across models:

$ less model_dev_0.out
$ # iter_000000/02.dpmd
# step      max_devi_v         min_devi_v         avg_devi_v         max_devi_f         min_devi_f         avg_devi_f             devi_e
   0       4.288987e-02       1.236075e-03       2.209929e-02       8.148417e-01       5.249647e-02       3.124690e-01       3.175687e-02
   1       9.956690e-02       2.644640e-03       4.849840e-02       4.707522e-01       9.577766e-02       2.531237e-01       1.749532e-02
   2       1.359298e-01       5.290284e-03       6.208815e-02       3.054830e-01       6.672351e-02       1.852998e-01       2.692625e-02
   3       8.619494e-02       1.549988e-02       4.847261e-02       2.494638e-01       9.084693e-03       1.613021e-01       2.746094e-02
   4       1.177004e-01       5.339600e-03       5.895431e-02       2.977156e-01       2.450532e-02       1.692072e-01       3.893455e-02

Frames with atomic forces between the thresholds are automatically flagged as candidates for relabeling using ab intio and are stored in iter_000000/02.dpmd/dft_candidates folder. These will appear in the next iteration folder.

$ ls iter_000000/02.dpmd/dft_candidates
0001 0002 0003 0004 0005 0006 0007 0008 0009 0010
$ tree iter_000000/02.dpmd/dft_candidates
   ├── 0001
   │   └── POSCAR
   ├── 0002
   │   └── POSCAR

Active Learning Loop 

The workflow continues automatically:

iter_000001/
├── 00.dft
├── 01.train
└── 02.dpmd

Each AL cycle consists of:

Selecting new candidate structures,
Relabeling with ab initio,
Retraining ML models,
Running ML/MD with updated potentials.

Monitor Workflow 

SPARC provides utilities for monitoring the workflow. For example:

from sparc.src.utils import PlotForceDeviation
PlotForceDeviation("iter_000000/02.dpmd/", iteration_window='all')

Note

See the notebook for analysis functions and figures:

PLUMED Plugin 

SPARC is primarily designed to streamline the generation of training data. By using it with PLUMED plugin, standard molecular dynamics simulations can be extended beyond equilibrium sampling. This enables the ML potential to explore new regions of configurational space by applying appropriate acceleration or biasing strategies.

To enable PLUMED, include the following keys in the deepmd_setup section of input.yaml:

use_plumed: True                   # Set to True to run ML/MD with PLUMED
plumed_file: "plumed_dp.dat"       # PLUMED input file

An example PLUMED input file (plumed_dp.dat) is provided in this tutorial. It defines the collective variables and biasing scheme for the NH\(_3\)BH\(_3\) system. See the PLUMED documentation for details on constructing custom CVs and bias potentials.

plumed_dp.dat

# PLUMED
d1: DISTANCE ATOMS=1,8

UPPER_WALLS ARG=d1 AT=0.55 KAPPA=100000.0 EXP=2 EPS=1 OFFSET=0 LABEL=uwall

# Define species for contact matrix
# n=6, m=12, R0=0.265(nm) [B/N-B/N], and R0=0.222(nm) [C-H]
DENSITY SPECIES=1  LABEL=B
DENSITY SPECIES=2-7 LABEL=H
DENSITY SPECIES=8 LABEL=N

# B-B B-H B-N H-H H-N N-N
contact_mat: CONTACT_MATRIX ATOMS=B,H,N SWITCH11={RATIONAL D_0=0.0 R_0=0.800 NN=6 MM=12} SWITCH12={RATIONAL D_0=0.0 R_0=0.125 NN=6 MM=12}  SWITCH13={RATIONAL D_0=0.0 R_0=0.222 NN=6 MM=12} SWITCH22={RATIONAL D_0=0.0 R_0=0.500 NN=6 MM=12} SWITCH23={RATIONAL D_0=0.0 R_0=0.125 NN=6 MM=12} SWITCH33={RATIONAL D_0=0.0 R_0=0.800 NN=6 MM=12}

# SPRINT coordinate
ss: SPRINT MATRIX=contact_mat

# PbMetaD
pb: PBMETAD ARG=ss.coord-0,ss.coord-7 SIGMA=0.1,0.1 HEIGHT=2.0 PACE=1000 FILE=HILLS_0,HILLS_1

#
PRINT ARG=d1,ss.*,pb.* FILE=COLVAR STRIDE=3

# FLUSH data
FLUSH STRIDE=250

Next Step 

Try replacing the system with your own molecule.