Tutorial¶

This notebook is a guided tutorial for running and understanding a full kMCpy workflow.

What you will learn:

Which input artifacts are required and why.
How to run a reproducible NASICON simulation through the Python API.
How to inspect and interpret the resulting transport metrics.
How to build an LCE model and prepare fitting matrices for your own dataset.
How to fit LCE parameters.

0) Prerequisites¶

From the repository root:

uv sync --extra doc
uv sync --extra dev

The notebook is stored in docs and rendered with nbsphinx. By default docs rendering does not execute cells (nbsphinx.execute = never).

1) Understand the required inputs¶

For a production kMC run, kMCpy needs these categories of files:

Structure: a CIF file defining the host lattice.
Event library: events.json describing allowed hops and their dependency graph.
Model file: model.json containing the local cluster expansion structure and fitted coefficients.
Initial state: initial_state.json occupation vector at simulation start.

In this tutorial we use the repository’s validated NASICON regression dataset under tests/files.

Input snapshot (used in this tutorial)¶

Below is the modern Configuration payload used for the bundled NASICON regression run:

{
  "structure_file": "tests/files/EntryWithCollCode15546_Na4Zr2Si3O12_573K.cif",
  "model_file": "tests/files/input/model.json",
  "event_file": "tests/files/input/events.json",
  "initial_state_file": "tests/files/input/initial_state.json",
  "mobile_ion_specie": "Na",
  "temperature": 298,
  "attempt_frequency": 5000000000000.0,
  "equilibration_passes": 1,
  "kmc_passes": 100,
  "supercell_shape": [2, 1, 1],
  "site_mapping": {"Na": ["Na", "X"], "Zr": "Zr", "Si": ["Si", "P"], "O": "O"},
  "convert_to_primitive_cell": true,
  "elementary_hop_distance": 3.47782,
  "random_seed": 12345,
  "name": "NASICON_Regression"
}

If you run the model-building cells first, the simulation cell uses the generated example/output/tutorial/model/model.json; otherwise it falls back to the validated bundled model file above.

Raw input data excerpts¶

These are actual excerpts from the input dataset used in this tutorial.

A) ``tests/files/input/initial_state.json`` (occupation preview)

{
  "occupation_preview_first_24": [
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    1,
    0,
    0,
    0,
    0,
    0,
    0,
    1,
    0,
    0,
    0,
    0,
    0,
    0,
    0,
    0
  ],
  "occupation_length": 84
}

B) ``tests/files/input/events.json`` (first event)

{
  "mobile_ion_indices": [
    0,
    8
  ],
  "local_env_indices": [
    8,
    10,
    14,
    4,
    6,
    12,
    22,
    16,
    20,
    24,
    26,
    18
  ],
  "local_env_indices_site": [
    8,
    10,
    14,
    4,
    6,
    12,
    22,
    16,
    20,
    24,
    26,
    18
  ]
}

C) ``tests/files/input/fitting_results.json`` (latest fitted record preview)

{
  "alpha": 1.5,
  "empty_cluster": 339.1083600548,
  "rmse": 29.1495050791,
  "loocv": 149.8671381416,
  "keci_preview": [
    10.5029085379,
    4.0803809727,
    0.0,
    0.0,
    -11.2639080302,
    0.0,
    -4.0578126572,
    0.0
  ]
}

D) ``NASICON cif file (ICSD-15546)`` (header excerpt)

#(C) 2021 by FIZ Karlsruhe - Leibniz Institute for Information Infrastructure.  All rights reserved.
data_15546-ICSD
_database_code_ICSD 15546
_chemical_formula_structural 'Na4 Zr2 Si3 O12'
_chemical_formula_sum 'Na4 O12 Si3 Zr2'
_diffrn_ambient_temperature 573.

2) Build a local cluster expansion (LCE) model¶

Use this when starting from a new crystal system and you need model topology (lce.json) before fitting.

basis_type='chebyshev' is selected once for the lattice. Active-site occupations are stored as species-state indices (0..q-1), and sites with q allowed species are expanded into q - 1 decorated Chebyshev site functions.

[ ]:

from pathlib import Path
from kmcpy.io.cif import load_labeled_structure_from_cif
from kmcpy.models.local_cluster_expansion import LocalClusterExpansion
from kmcpy.structure.local_lattice_structure import LocalLatticeStructure

repo = Path('.').resolve()
out_dir = repo / 'example/output/tutorial/model'
out_dir.mkdir(parents=True, exist_ok=True)

structure = load_labeled_structure_from_cif(
    str(repo / 'tests' / 'files' / 'EntryWithCollCode15546_Na4Zr2Si3O12_573K.cif'),
    primitive=True,
)

site_mapping = {
    'Na': ['Na', 'X'],
    'Zr': 'Zr',
    'Si': ['Si', 'P'],
    'O': 'O',
}

local_lattice = LocalLatticeStructure(
    template_structure=structure,
    center=0,
    cutoff=4.0,
    site_mapping=site_mapping,
    basis_type='chebyshev',
)

lce = LocalClusterExpansion()
lce.build(local_lattice_structure=local_lattice, cutoff_cluster=[6, 6, 0])
lce.to(str(out_dir / 'lce.json'))

3) Generate correlation matrix¶

For your own training dataset (for example NEB barriers), build a list of structures + target values and convert them into fitting matrices. If you leave the training lists empty, this tutorial uses bundled fitting matrices so the workflow still runs end-to-end.

Fitting data format (what each file contains)¶

When preparing fitting inputs, LocalClusterExpansion.fit(...) delegates to the registered LCE fitter and expects:

correlation_matrix.txt: shape (n_samples, n_features) where multicomponent sites may create multiple decorated features per orbit
e_kra.txt: shape (n_samples,) target barrier values
weight.txt: shape (n_samples,) sample weights

Minimal conceptual example:

correlation_matrix.txt
0.12  0.00 -0.34 ...
0.10  0.05 -0.29 ...
...

e_kra.txt
120.5
140.1
...

weight.txt
1.0
1.0
...

[ ]:

from pathlib import Path
import shutil
import numpy as np

from pymatgen.core import Structure
from kmcpy.io.neb import NEBDataLoader
from kmcpy.models.local_cluster_expansion import LocalClusterExpansion

repo = Path('.').resolve()
model = LocalClusterExpansion.from_file(str(repo / 'example/output/tutorial/model/lce.json'))
fit_dir = repo / 'example/output/tutorial/fit_data'
fit_dir.mkdir(parents=True, exist_ok=True)

training_structures = [
    # Replace with your local-environment structure paths
    # 'path/to/neb_local_env_0001.cif',
    # 'path/to/neb_local_env_0002.cif',
]
target_ekra = [
    # Replace with matching barrier values in meV
    # 120.5,
    # 140.1,
]

if not training_structures and not target_ekra:
    reference_dir = repo / 'tests' / 'files' / 'fitting' / 'local_cluster_expansion'
    for name in ('correlation_matrix.txt', 'e_kra.txt', 'weight.txt'):
        shutil.copy(reference_dir / name, fit_dir / name)
    print('No custom NEB data supplied; copied bundled tutorial fitting data from', reference_dir)
else:
    if len(training_structures) != len(target_ekra):
        raise ValueError('training_structures and target_ekra must have the same length')

    loader = NEBDataLoader(model=model)
    for cif_file, ekra in zip(training_structures, target_ekra):
        structure = Structure.from_file(cif_file)
        loader.add_structure(structure, float(ekra))

    if len(loader) == 0:
        raise ValueError('No samples were added. Provide at least one (structure, e_kra) pair.')

    np.savetxt(fit_dir / 'correlation_matrix.txt', loader.get_correlation_matrix())
    np.savetxt(fit_dir / 'e_kra.txt', loader.get_properties())
    np.savetxt(fit_dir / 'weight.txt', np.ones(len(loader)))
    print('Generated fitting data from custom NEB structures in', fit_dir)

corr = np.atleast_2d(np.loadtxt(fit_dir / 'correlation_matrix.txt'))
e_kra = np.atleast_1d(np.loadtxt(fit_dir / 'e_kra.txt'))
weight = np.atleast_1d(np.loadtxt(fit_dir / 'weight.txt'))
print('correlation_matrix shape:', corr.shape)
print('e_kra shape:', e_kra.shape)
print('weight shape:', weight.shape)

4) Fit LCE parameters¶

Fit using generated training files. This writes keci.txt and fitting_results.json for later model loading:

[ ]:

from pathlib import Path
from kmcpy.models.local_cluster_expansion import LocalClusterExpansion

repo = Path('.').resolve()
fit_dir = repo / 'example/output/tutorial/fit_data'

lce = LocalClusterExpansion()
model_params, y_pred, y_true = lce.fit(
    alpha=1.5,
    max_iter=1_000_000,
    ekra_fname=str(fit_dir / 'e_kra.txt'),
    keci_fname=str(fit_dir / 'keci.txt'),
    weight_fname=str(fit_dir / 'weight.txt'),
    corr_fname=str(fit_dir / 'correlation_matrix.txt'),
    fit_results_fname=str(fit_dir / 'fitting_results.json'),
    lce_params_fname=None,
    lce_params_history_fname=None,
)

print('Non-zero KECI terms:', sum(abs(x) > 1e-8 for x in model_params.keci))
print('RMSE (meV):', round(float(model_params.rmse), 4))
print('Saved fitting outputs in', fit_dir)

5) Construct a `CompositeLCEModel`¶

The simulator consumes a composite model with both barrier (kra_model) and site-energy-difference (site_model) contributions. Load each LocalClusterExpansion, apply its fitted parameters, then merge the prepared models into a CompositeLCEModel.

[ ]:

from pathlib import Path
from kmcpy.models.composite_lce_model import CompositeLCEModel
from kmcpy.models.fitting.fitter import LCEFitter
from kmcpy.models.local_cluster_expansion import LocalClusterExpansion

root = Path('.').resolve()
input_dir = root / 'tests' / 'files' / 'input'

def load_fitted_lce(lce_file, fit_file):
    model = LocalClusterExpansion.from_file(str(lce_file))
    fitter = LCEFitter.from_file(str(fit_file))
    model.set_parameters(fitter.model_parameters)
    metadata = {
        'time_stamp': fitter.model_parameters.time_stamp,
        'time': fitter.model_parameters.time,
    }
    return model, metadata

kra_model, kra_metadata = load_fitted_lce(
    input_dir / 'lce.json',
    input_dir / 'fitting_results.json',
)
site_model, site_metadata = load_fitted_lce(
    input_dir / 'lce_site.json',
    input_dir / 'fitting_results_site.json',
)
compound_lce_model = CompositeLCEModel(
    site_model=site_model,
    kra_model=kra_model,
    kra_fit_metadata=kra_metadata,
    site_fit_metadata=site_metadata,
)
model_file = root / 'example' / 'output' / 'tutorial' / 'model' / 'model.json'
model_file.parent.mkdir(parents=True, exist_ok=True)
compound_lce_model.to(str(model_file))

print(type(compound_lce_model).__name__)
print('Has site model:', compound_lce_model.site_model is not None)
print('Has KRA model:', compound_lce_model.kra_model is not None)
print('Wrote model file:', model_file)

Optional: build a simple `LocalBarrierModel` instead¶

If you do not need an LCE, use LocalBarrierModel for direct barrier logic: constant barriers, local occupation counts, species-count rules, wildcard patterns, or exact catalog-style matches.

[ ]:

from pathlib import Path
from kmcpy.models import LocalBarrierModel

root = Path('.').resolve()
local_barrier_model = LocalBarrierModel(
    default_barrier=300.0,
    site_species={
        1: {0: 'P', 1: 'Si'},
        2: {0: 'Si', 1: 'P'},
        3: {0: 'Si', 1: 'P'},
        4: {0: 'Al', 1: 'Si'},
    },
)
local_barrier_model.add_species_count_rule(
    name='si_rich',
    sites='local_env',
    species='Si',
    min_count=4,
    barrier=420.0,
)
local_barrier_file = root / 'example' / 'output' / 'tutorial' / 'model' / 'local_barrier_model.json'
local_barrier_file.parent.mkdir(parents=True, exist_ok=True)
local_barrier_model.to(str(local_barrier_file))
print('Wrote local barrier model:', local_barrier_file)

6) Run a reproducible NASICON simulation (API path)¶

This section runs the same setup used by the regression test suite. It is a good sanity check that your environment and dependencies are working.

If the previous section wrote example/output/tutorial/model/model.json, that generated model file is used. If not, the code falls back to the bundled validated tests/files/input/model.json.

[ ]:

from pathlib import Path
import os

from kmcpy.simulator.config import Configuration
from kmcpy.simulator.kmc import KMC

root = Path('.').resolve()
file_path = root / 'tests' / 'files'
model_file = root / 'example' / 'output' / 'tutorial' / 'model' / 'model.json'
if not model_file.exists():
    model_file = file_path / 'input' / 'model.json'

run_dir = root / 'example' / 'output' / 'tutorial' / 'run'
run_dir.mkdir(parents=True, exist_ok=True)

config = Configuration(
    structure_file=str(file_path / 'EntryWithCollCode15546_Na4Zr2Si3O12_573K.cif'),
    model_file=str(model_file),
    event_file=str(file_path / 'input' / 'events.json'),
    initial_state_file=str(file_path / 'input' / 'initial_state.json'),
    mobile_ion_specie='Na',
    temperature=298,
    attempt_frequency=5e12,
    equilibration_passes=1,
    kmc_passes=100,
    supercell_shape=(2, 1, 1),
    site_mapping={'Na': ['Na', 'X'], 'Zr': 'Zr', 'Si': ['Si', 'P'], 'O': 'O'},
    convert_to_primitive_cell=True,
    elementary_hop_distance=3.47782,
    random_seed=12345,
    name='NASICON_Regression',
)

kmc = KMC.from_config(config)
original_cwd = Path.cwd()
try:
    os.chdir(run_dir)
    tracker = kmc.run()
finally:
    os.chdir(original_cwd)

tracker.return_current_info()

(1.1193006038758543e-06, np.float64(307.3744449426362), np.float64(1.4630573145769372e-08), np.float64(4.576882562174338e-09), np.float64(1.1823906621661553), np.float64(0.31283002494661705), np.float64(0.21998150220477225))

7) Results (reference + run output)¶

For the input snapshot above, the expected final metrics are:

Metric	Value	Unit
time	1.1193006038758543e-06	s
msd	307.37444494263616	Angstrom^2
jump_diffusivity	1.4630573145769372e-08	cm^2/s
tracer_diffusivity	4.5768825621743376e-09	cm^2/s
conductivity	1.1823906621661553	mS/cm
havens_ratio	0.312830024946617	dimensionless
correlation_factor	0.21998150220477225	dimensionless

These values are intentionally embedded here so the tutorial remains informative even when cells are not executed during docs build.

Metric definitions¶

tracker.return_current_info() returns:

(time, msd, jump_diffusivity, tracer_diffusivity, conductivity, havens_ratio, correlation_factor)

time: simulated physical time (s)
msd: mean squared displacement (Angstrom^2)
jump_diffusivity: jump diffusivity (cm^2/s)
tracer_diffusivity: tracer diffusivity (cm^2/s)
conductivity: ionic conductivity (mS/cm)
havens_ratio: Haven ratio (dimensionless)
f: correlation factor (dimensionless)

tracker.result_units returns the same unit metadata, and Tracker.write_results(...) writes it to a results_units*.json.gz sidecar.

Output files¶

After the run, kMCpy writes compressed CSV artifacts in example/output/tutorial/run:

results_NASICON_Regression.csv.gz
results_units_NASICON_Regression.json.gz
displacement_NASICON_Regression_<pass>.csv.gz
hop_counter_NASICON_Regression_<pass>.csv.gz
current_occ_NASICON_Regression_<pass>.csv.gz

[2]:

from pathlib import Path
import pandas as pd

run_dir = Path('example/output/tutorial/run')
results_file = run_dir / 'results_NASICON_Regression.csv.gz'
df = pd.read_csv(results_file)
print(df.tail(1).to_string(index=False))

    time          jump_diffusivity     tracer_diffusivity  conductivity        f     havens_ratio        msd
0.000001 1.463057e-08 4.576883e-09      1.182391 0.219982 0.31283 307.374445

8) Next steps¶

Swap NASICON regression inputs with your own CIF/event/model files.
Run temperature sweeps by varying temperature in Configuration.
Track convergence by increasing kmc_passes and comparing jump_diffusivity, tracer_diffusivity, and conductivity.