# Fit A Local Cluster Expansion

This page starts after you have built the [`LocalClusterExpansion`](../modules/local_cluster_expansion.rst) in [Choose And Build A Model](models.md) and written fitting files with [`NEBDataLoader`](../modules/neb.rst) in [Local Environments And NEB Data](local_environments_neb.md).

Fitting does not decide the local environment. It only finds coefficients for the fixed feature order already defined by the LCE object.

## Cluster, Orbit, Correlation

A cluster is a set of local sites: point, pair, triplet, or quadruplet. An orbit is a group of symmetry-equivalent clusters. The correlation vector evaluates the basis-decorated orbit functions for one local occupation.

The fitted scalar is

$$
y(\sigma) = E_0 + \sum_j \alpha_j \Phi_j(\sigma)
$$

where:

- `sigma` is the ordered local occupation vector,
- `Phi_j` is one decorated cluster-orbit feature,
- `alpha_j` is the fitted coefficient,
- `E_0` is the empty-cluster term.

For a multicomponent site with `q` allowed states, the Chebyshev basis uses `q - 1` non-constant site functions. Cluster features are products of these site functions, so multicomponent sites add more decorated features.

## Fit Parameters

After writing fitting inputs with `NEBDataLoader.write_fitting_inputs(...)`, fit the coefficients:

```python
fit_files = loader.write_fitting_inputs(output_dir="fit_kra")

params, y_pred, y_true = kra_lce.fit(
    **fit_files,
    alpha=1e-4,
    lce_params_fname="fit_kra/lce_params.json",
)

kra_lce.set_parameters(params)
kra_lce.to("kra_lce.json")
```

The important [`LocalClusterExpansion.fit(...)`](../modules/local_cluster_expansion.rst) arguments are:

- `alpha`: Lasso regularization strength. Larger values usually produce fewer active coefficients.
- `corr_fname`: correlation matrix file from `NEBDataLoader`.
- `ekra_fname`: target-value file. The name is historical; the values can be `E_KRA` or another fitted scalar as long as the model usage is consistent.
- `weight_fname`: sample weights, one per target value.
- `lce_params_fname`: output JSON for fitted coefficients and metadata.
- `max_iter`: maximum Lasso iterations.

`fit(...)` returns:

- `params`: fitted LCE parameters.
- `y_pred`: model predictions for the training rows.
- `y_true`: target values loaded from `ekra_fname`.

Call `set_parameters(params)` before saving or using the LCE in kMC.

For a composite model, fit the KRA LCE and site-energy-difference model separately, then combine them:

```python
from kmcpy.models import CompositeLCEModel

model = CompositeLCEModel(kra_model=kra_lce, site_model=site_lce)
model.to("model.json")
```

If you only have a KRA model, omit `site_model`.

## Underfit And Overfit

NEB data is usually expensive, so the number of training structures is often small compared with the number of possible local environments. Do not chase a perfect training error without checking whether the model is physically useful.

Typical symptoms:

- Underfit: the model has too few active features or too strong regularization; both training RMSE and validation error are large.
- Overfit: training RMSE is very small, but leave-one-out or held-out error is large; the model is fitting noise or sparse sampling artifacts.

Some residual fitting error is normal for sparse NEB datasets. Prefer a stable model with sensible errors and few active coefficients over a model that only reproduces the training set.

## Practical Fitting Checks

Before using an LCE in kMC:

- confirm all training structures map to the expected local occupation length,
- inspect the correlation matrix shape,
- compare `y_true` and `y_pred`,
- inspect RMSE and LOOCV,
- keep the model, fitting parameters, local site order, and training data together.

Next: [Prepare Input And Run kMC](run_kmc.md).
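The RMSE and LOOCV items under Practical Fitting Checks can be sketched with NumPy alone. This is a minimal illustration and does not call kmcpy: the helper names `rmse` and `loocv_rmse` are hypothetical, the correlation matrix and targets are synthetic, and an ordinary least-squares fit stands in for the Lasso fit performed by `LocalClusterExpansion.fit(...)`. The leading column of ones plays the role of the `E_0` term.

```python
import itertools

import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-square error between target and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def loocv_rmse(corr, y):
    """Leave-one-out RMSE with a least-squares stand-in for the real fit."""
    corr = np.asarray(corr, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Column of ones so the intercept acts as the empty-cluster term E_0.
    design = np.hstack([np.ones((n, 1)), corr])
    held_out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # drop row i, refit, predict row i
        coef, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        held_out[i] = design[i] @ coef
    return rmse(y, held_out)


# Synthetic stand-in data (assumption): 8 samples, 3 features valued +/-1.
corr = np.array(list(itertools.product([-1.0, 1.0], repeat=3)))
true_coef = np.array([0.05, -0.02, 0.01])
rng = np.random.default_rng(0)
y_true = 0.3 + corr @ true_coef + rng.normal(0.0, 0.005, size=len(corr))

# Training fit on all rows, then the two error measures.
design = np.hstack([np.ones((len(corr), 1)), corr])
coef, *_ = np.linalg.lstsq(design, y_true, rcond=None)
y_pred = design @ coef

train_rmse = rmse(y_true, y_pred)
cv_rmse = loocv_rmse(corr, y_true)
print("correlation matrix shape:", corr.shape)
print(f"training RMSE: {train_rmse:.5f}")
print(f"LOOCV RMSE:    {cv_rmse:.5f}")
```

In real use, `y_true` and `y_pred` would come from `fit(...)` and the correlation matrix from the file written by `NEBDataLoader`. A LOOCV error much larger than the training RMSE is the overfit symptom described above.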