Data Processing

Overview

The data processing module converts an ASE trajectory file into the DeepMD npy format required for training a machine-learned interatomic potential (MLIP). The frames are then randomly split into training and validation sets according to a configurable ratio.

The processed data is written to two subdirectories inside data_dir:

data_dir/
├── training_data/      # DeepMD npy sets for training
└── validation_data/    # DeepMD npy sets for validation
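
As a quick sanity check after processing, the layout above can be verified with a few lines of Python. This is only a minimal sketch: `check_output_layout` is a hypothetical helper, not part of SPARC, and it tests that the two subdirectories exist without inspecting their contents.

```python
from pathlib import Path


def check_output_layout(data_dir):
    """Return the expected subdirectory names that are missing from data_dir.

    Hypothetical helper for illustration; SPARC does not provide it.
    """
    expected = ("training_data", "validation_data")
    root = Path(data_dir)
    # Keep the documented order so the result is easy to read in a log.
    return [name for name in expected if not (root / name).is_dir()]
```

An empty list means both subdirectories are present; for example, `check_output_layout("Training_Data")` could be called once the training step has finished.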

Usage

get_data is called automatically by SPARC during the training step. The relevant options are set under mlip_setup in input.yaml:

mlip_setup:
  data_dir: "Training_Data"   # Output directory for processed data
  skip_min: 0                 # Skip first N frames
  skip_max: null              # Skip frames beyond this index (null = keep all)
  train_ratio: 0.8            # Training fraction — rest goes to validation
  seed: 42                    # Random seed for reproducible split

Train / Validation Split

Given n_frames remaining after applying skip_min / skip_max:

n_train = int(n_frames × train_ratio)
n_val   = n_frames − n_train

n_val frame indices are drawn randomly (seeded by seed); the remaining indices form the training set. Setting seed to a fixed value produces a reproducible split across runs.
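
The split arithmetic above can be sketched as follows. This is an illustration under stated assumptions, not SPARC's implementation: `split_indices` is a hypothetical name, and the standard-library `random.Random` stands in for the NumPy generator that SPARC seeds, so the specific indices drawn will differ from those of an actual run. Only the counts and the reproducibility property carry over.

```python
import random


def split_indices(n_frames, train_ratio=0.8, seed=42):
    """Split frame indices 0..n_frames-1 into (train, val) index lists.

    Hypothetical sketch: uses Python's random module in place of the
    NumPy seed mentioned in the documentation.
    """
    n_train = int(n_frames * train_ratio)  # truncates toward zero
    n_val = n_frames - n_train

    rng = random.Random(seed)  # same seed -> same validation draw
    val = sorted(rng.sample(range(n_frames), n_val))

    # Every index that was not drawn for validation goes to training.
    val_set = set(val)
    train = [i for i in range(n_frames) if i not in val_set]
    return train, val
```

With `n_frames = 10` and `train_ratio = 0.8`, this yields 8 training indices and 2 validation indices, and calling it twice with the same `seed` returns the same two lists.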

| Parameter | Default | Description |
|---|---|---|
| `train_ratio` | `0.8` | Fraction of frames for training. Must be in `(0.0, 1.0)`. |
| `seed` | `42` | NumPy random seed for the validation index draw. |
| `skip_min` | `0` | Number of frames to skip from the start of the trajectory. |
| `skip_max` | `null` | Skip frames from this index onward (`null` keeps all frames). |
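
To make the parameter constraints concrete, the sketch below validates the options and computes which frame indices survive the skip window. It is a hedged illustration: `select_frame_window` is a hypothetical helper, and it assumes Python slice semantics (`skip_min` inclusive, `skip_max` exclusive), which this page does not confirm.

```python
def select_frame_window(n_total, skip_min=0, skip_max=None, train_ratio=0.8):
    """Return the range of frame indices kept after skip_min / skip_max.

    Hypothetical helper. Assumes frames[skip_min:skip_max] semantics,
    i.e. skip_max itself is the first index that is dropped.
    """
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must lie strictly between 0 and 1")
    if skip_min < 0:
        raise ValueError("skip_min must be non-negative")
    if skip_max is not None and skip_max < 0:
        raise ValueError("skip_max must be non-negative or None")

    # None (YAML null) means: keep everything up to the end of the trajectory.
    stop = n_total if skip_max is None else min(skip_max, n_total)
    if stop <= skip_min:
        raise ValueError("no frames remain after applying skip_min / skip_max")
    return range(skip_min, stop)
```

Under these assumptions, a 100-frame trajectory with `skip_min=10` and `skip_max=90` leaves 80 frames, which is the `n_frames` value that enters the split.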

Module Contents

Data processing module for converting ASE trajectories to DeepMD format.

sparc.src.data_processing.get_data(ase_traj='AseMD.traj', dir_name='Dataset', skip_min=0, skip_max=None, seed=42, train_ratio=0.8)

Process an ASE trajectory file and split the data into training and validation datasets.

Parameters:

ase_traj (str) – ASE trajectory file name (default: 'AseMD.traj')
dir_name (str) – Path to the directory for saving training and validation datasets (default: 'Dataset')
skip_min (int) – Number of frames to skip from the start of the trajectory (default: 0)
skip_max (Optional[int]) – Skip frames from this index onward (default: None, which keeps all frames)
seed (int) – Random seed for reproducible train/validation split (default: 42)
train_ratio (float) – Fraction of frames used for training; remainder goes to validation (default: 0.8, i.e. 80 % training / 20 % validation)

Return type:

None

Raises:

FileNotFoundError – If trajectory file does not exist
ValueError – If trajectory is empty or invalid
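
For calling get_data directly rather than through the SPARC workflow, the sketch below translates an mlip_setup mapping into the keyword arguments of the signature above. Two points are assumptions rather than documented behavior: that the `data_dir` option corresponds to the `dir_name` argument, and the helper names `build_get_data_kwargs` and `run_get_data`, which are hypothetical. The import is deferred so the mapping helper works without SPARC installed.

```python
def build_get_data_kwargs(mlip_setup, ase_traj="AseMD.traj"):
    """Translate an mlip_setup mapping into keyword arguments for get_data.

    Hypothetical helper; fallback values mirror the documented defaults.
    """
    return {
        "ase_traj": ase_traj,
        # Assumption: the data_dir option is passed on as dir_name.
        "dir_name": mlip_setup.get("data_dir", "Dataset"),
        "skip_min": mlip_setup.get("skip_min", 0),
        "skip_max": mlip_setup.get("skip_max"),  # YAML null loads as None
        "seed": mlip_setup.get("seed", 42),
        "train_ratio": mlip_setup.get("train_ratio", 0.8),
    }


def run_get_data(mlip_setup, ase_traj="AseMD.traj"):
    """Call get_data and report the two documented failure modes."""
    # Deferred import: only needed when the function is actually called.
    from sparc.src.data_processing import get_data

    try:
        get_data(**build_get_data_kwargs(mlip_setup, ase_traj))
    except FileNotFoundError as err:
        print(f"Trajectory file not found: {err}")
        return False
    except ValueError as err:
        print(f"Trajectory is empty or invalid: {err}")
        return False
    return True
```

A call such as `run_get_data({"data_dir": "Training_Data", "train_ratio": 0.8})` would then write the processed sets and return `True`, or print the reason and return `False` if one of the documented exceptions is raised.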

Reference

See the dpdata documentation for details on the DeepMD npy format.