Data Processing

Overview

The data processing module converts an ASE trajectory file into the DeepMD .npy format required for MLIP training. Frames are randomly split into training and validation sets using a configurable ratio.

The processed data is written to two subdirectories inside data_dir:

data_dir/
├── training_data/      # DeepMD npy sets for training
└── validation_data/    # DeepMD npy sets for validation
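After a run, it can be useful to confirm that both subdirectories were actually written. The helper below is a minimal sketch using only the standard library; the function name missing_subdirs is hypothetical and is not part of SPARC.

```python
from pathlib import Path

# Subdirectory names taken from the layout shown above.
EXPECTED_SUBDIRS = ("training_data", "validation_data")


def missing_subdirs(data_dir):
    """Return the expected subdirectories that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
```

An empty list from missing_subdirs("Training_Data") means both output directories are present. The check says nothing about the contents of the sets themselves.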

Usage

get_data is called automatically by SPARC during the training step. The relevant options are set under mlip_setup in input.yaml:

mlip_setup:
  data_dir: "Training_Data"   # Output directory for processed data
  skip_min: 0                 # Skip first N frames
  skip_max: null              # Skip frames beyond this index (null = keep all)
  train_ratio: 0.8            # Training fraction — rest goes to validation
  seed: 42                    # Random seed for reproducible split
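For a standalone run outside the SPARC workflow, the same options could be passed to get_data directly. The sketch below assumes that the YAML key data_dir maps onto the dir_name argument of the documented signature; the helper names build_get_data_kwargs and run_get_data are hypothetical, and how SPARC wires the options internally is not shown in this page.

```python
def build_get_data_kwargs(mlip_setup):
    """Translate an mlip_setup mapping into keyword arguments for get_data.

    Assumption: the YAML key data_dir corresponds to the dir_name argument.
    Fallback values are the defaults from the documented signature.
    """
    return {
        "dir_name": mlip_setup.get("data_dir", "Dataset"),
        "skip_min": mlip_setup.get("skip_min", 0),
        "skip_max": mlip_setup.get("skip_max"),  # YAML null loads as None
        "seed": mlip_setup.get("seed", 42),
        "train_ratio": mlip_setup.get("train_ratio", 0.8),
    }


def run_get_data(mlip_setup, ase_traj="AseMD.traj"):
    """Call get_data directly; requires SPARC to be installed."""
    # Imported lazily so the helper above works without SPARC.
    from sparc.src.data_processing import get_data

    get_data(ase_traj=ase_traj, **build_get_data_kwargs(mlip_setup))


# Values mirror the input.yaml example above.
mlip_setup = {
    "data_dir": "Training_Data",
    "skip_min": 0,
    "skip_max": None,
    "train_ratio": 0.8,
    "seed": 42,
}
kwargs = build_get_data_kwargs(mlip_setup)
```

Calling run_get_data(mlip_setup) would then process AseMD.traj with these settings, provided the trajectory file exists in the working directory.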

Train / Validation Split

Given n_frames remaining after applying skip_min / skip_max:

n_train = int(n_frames × train_ratio)
n_val   = n_frames − n_train

n_val frame indices are drawn randomly (seeded by seed); the remaining indices form the training set. Setting seed to a fixed value produces a reproducible split across runs.
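The procedure above can be sketched in a few lines. This is an illustration only, not SPARC's implementation: it uses Python's standard random module instead of NumPy, so the indices it selects will differ from those SPARC draws for the same seed. It also assumes that skip_min and skip_max behave like a Python slice [skip_min:skip_max]. The function name split_frame_indices is hypothetical.

```python
import random


def split_frame_indices(n_total, skip_min=0, skip_max=None,
                        train_ratio=0.8, seed=42):
    """Return (train_indices, val_indices) for a trajectory of n_total frames."""
    # Assumption: skip_min / skip_max act like the slice [skip_min:skip_max].
    kept = list(range(n_total))[skip_min:skip_max]
    n_frames = len(kept)

    # Counts follow the formulas given above.
    n_train = int(n_frames * train_ratio)
    n_val = n_frames - n_train

    # Draw the validation indices with a seeded generator (stdlib, not NumPy).
    rng = random.Random(seed)
    val = sorted(rng.sample(kept, n_val))

    # Everything that was not drawn for validation goes to training.
    val_set = set(val)
    train = [i for i in kept if i not in val_set]
    return train, val


train, val = split_frame_indices(100, skip_min=10)
```

With 100 frames and skip_min=10, 90 frames remain, giving 72 training and 18 validation indices. Repeating the call with the same seed returns the same two lists.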

Parameter     Default   Description
-----------   -------   -----------------------------------------------------------
train_ratio   0.8       Fraction of frames for training. Must be in (0.0, 1.0).
seed          42        NumPy random seed for the validation index draw.
skip_min      0         Number of frames to skip from the start of the trajectory.
skip_max      null      Skip frames from this index onward (null keeps all frames).
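Checking the options before a long run can catch configuration mistakes early. The sketch below enforces the documented range for train_ratio; the checks on skip_min and skip_max are assumptions inferred from the parameter descriptions above, not rules confirmed by SPARC. The function name validate_split_options is hypothetical.

```python
def validate_split_options(train_ratio=0.8, skip_min=0, skip_max=None):
    """Raise ValueError if an option violates the constraints assumed here."""
    # Documented constraint: open interval (0.0, 1.0).
    if not 0.0 < train_ratio < 1.0:
        raise ValueError(f"train_ratio must be in (0.0, 1.0), got {train_ratio}")

    # Assumption: a negative frame count makes no sense for skip_min.
    if skip_min < 0:
        raise ValueError(f"skip_min must be >= 0, got {skip_min}")

    # Assumption: skip_max must leave at least one frame after skip_min.
    if skip_max is not None and skip_max <= skip_min:
        raise ValueError(
            f"skip_max ({skip_max}) must be greater than skip_min ({skip_min})"
        )
```

Calling validate_split_options() with the defaults returns without error, while train_ratio=1.0 raises ValueError because the interval is open at both ends.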

Module Contents

Data processing module for converting ASE trajectories to DeepMD format.

sparc.src.data_processing.get_data(ase_traj='AseMD.traj', dir_name='Dataset', skip_min=0, skip_max=None, seed=42, train_ratio=0.8)

Process an ASE trajectory file and split the data into training and validation datasets.

Parameters:
  • ase_traj (str) – ASE trajectory file name (default: 'AseMD.traj')

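  • dir_name (str) – Path to the directory for saving training and validation datasets (default: 'Dataset')

  • skip_min (int) – Number of frames to skip from the start of the trajectory (default: 0)

  • skip_max (Optional[int]) – Skip frames from this index onward; None keeps all frames (default: None)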

  • seed (int) – Random seed for reproducible train/validation split (default: 42)

  • train_ratio (float) – Fraction of frames used for training; remainder goes to validation (default: 0.8, i.e. 80 % training / 20 % validation)

Return type:

None

Raises:

Reference

See dpdata for details on the DeepMD npy format.