Data Processing
Overview
The data processing module converts an ASE trajectory file into the
DeepMD .npy format required for MLIP training. Frames are randomly
split into training and validation sets using a configurable ratio.
The processed data is written to two subdirectories inside data_dir:
data_dir/
├── training_data/ # DeepMD npy sets for training
└── validation_data/ # DeepMD npy sets for validation
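After a run, a quick sanity check is to confirm that both subdirectories exist. The helper below is a hypothetical sketch, not part of SPARC: the name missing_subdirs is invented for illustration, and it only checks the two directory names shown above, not their contents.

```python
from pathlib import Path


def missing_subdirs(data_dir):
    """Return the expected subdirectory names that are absent from data_dir."""
    # Directory names taken from the layout shown above.
    expected = ("training_data", "validation_data")
    return [name for name in expected if not (Path(data_dir) / name).is_dir()]


# "Training_Data" is the data_dir value used in the example configuration.
print(missing_subdirs("Training_Data"))
```

An empty list means both output directories are present.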
Usage
get_data is called automatically by SPARC during the training step.
The relevant options are set under mlip_setup in input.yaml:
mlip_setup:
  data_dir: "Training_Data"   # Output directory for processed data
  skip_min: 0                 # Skip first N frames
  skip_max: null              # Skip frames beyond this index (null = keep all)
  train_ratio: 0.8            # Training fraction; the rest goes to validation
  seed: 42                    # Random seed for reproducible split
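The effect of skip_min and skip_max can be pictured as a slice over the list of frames. The sketch below assumes that skip_max acts as an exclusive upper index, as in a Python slice, and that null maps to None. The function name apply_skip_window is hypothetical and the real implementation may differ.

```python
def apply_skip_window(frames, skip_min=0, skip_max=None):
    """Keep frames[skip_min:skip_max]; skip_max=None keeps everything to the end."""
    # Assumption: skip_max is an exclusive upper bound, like a slice stop index.
    return frames[skip_min:skip_max]


frames = list(range(10))  # stand-in for 10 trajectory frames

kept = apply_skip_window(frames, skip_min=2, skip_max=8)
print(kept)  # [2, 3, 4, 5, 6, 7]

# With the defaults (skip_min=0, skip_max=None) no frames are dropped.
print(len(apply_skip_window(frames)))  # 10
```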
Train / Validation Split
Given n_frames remaining after applying skip_min / skip_max:
n_train = int(n_frames × train_ratio)
n_val = n_frames − n_train
n_val frame indices are drawn randomly (seeded by seed); the
remaining indices form the training set. Setting seed to a fixed
value produces a reproducible split across runs.
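The split arithmetic can be sketched in a few lines. This is an illustration under stated assumptions, not the module's code: split_indices is a hypothetical name, and the standard-library random module is used as a stand-in for the NumPy seed, so the actual indices chosen by get_data will differ.

```python
import random


def split_indices(n_frames, train_ratio=0.8, seed=42):
    """Return (train_idx, val_idx) as sorted lists of frame indices."""
    n_train = int(n_frames * train_ratio)  # int() truncates toward zero
    n_val = n_frames - n_train
    # Stand-in RNG: the module is documented as using a NumPy seed instead.
    rng = random.Random(seed)
    val_idx = sorted(rng.sample(range(n_frames), n_val))
    val_set = set(val_idx)
    # Every index not drawn for validation goes to training.
    train_idx = [i for i in range(n_frames) if i not in val_set]
    return train_idx, val_idx


train_idx, val_idx = split_indices(7)
print(len(train_idx), len(val_idx))  # 5 2
```

With 7 frames and train_ratio = 0.8, the product 5.6 is truncated to 5 training frames, so the validation set receives the remaining 2. Calling split_indices again with the same seed returns the same indices.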
| Parameter | Default | Description |
|---|---|---|
| train_ratio | 0.8 | Fraction of frames for training. Must be in … |
| seed | 42 | NumPy random seed for the validation index draw. |
| skip_min | 0 | Number of frames to skip from the start of the trajectory. |
| skip_max | null | Skip frames from this index onward (null = keep all). |
Module Contents
Data processing module for converting ASE trajectories to DeepMD format.
- sparc.src.data_processing.get_data(ase_traj='AseMD.traj', dir_name='Dataset', skip_min=0, skip_max=None, seed=42, train_ratio=0.8)
Process an ASE trajectory file and split the data into training and validation datasets.
- Parameters:
ase_traj (str) – ASE trajectory file name (default: 'AseMD.traj')
dir_name (str) – Path to the directory for saving training and validation datasets
skip_min (int) – Skip the first n frames
skip_max (Optional[int]) – Skip the last n frames (default: None)
seed (int) – Random seed for reproducible train/validation split (default: 42)
train_ratio (float) – Fraction of frames used for training; remainder goes to validation (default: 0.8, i.e. 80% training / 20% validation)
- Return type:
- Raises:
FileNotFoundError – If trajectory file does not exist
ValueError – If trajectory is empty or invalid
Reference
See dpdata for details on the DeepMD npy format.