Source code for aerosandbox.modeling.fitting

import aerosandbox.numpy as np
from aerosandbox.optimization.opti import Opti
from typing import Union, Dict, Callable, List
from aerosandbox.modeling.surrogate_model import SurrogateModel
import copy
import warnings


class FittedModel(SurrogateModel):
    """
    A model that is fitted to data. Maps from R^N -> R^1.

    You can evaluate this model at a given point by calling it just like a function, e.g.:

    >>> my_fitted_model = FittedModel(...)  # See FittedModel.__init__ docstring for syntax
    >>> y = my_fitted_model(x)

    The input to the model (`x` in the example above) is of the type:
        * in the general N-dimensional case, a dictionary where keys are variable names and values are float/array
        * in the case of a 1-dimensional input (R^1 -> R^1), a float/array.

    If you're not sure what the input type of `my_fitted_model` should be, just do:

    >>> print(my_fitted_model)  # Displays the valid input type to the model

    The output of the model (`y` in the example above) is always a float or array.

    See the docstring of the __init__ method of FittedModel for more details on how to instantiate and use FittedModel.

    One might have expected a fitted model to be a literal Python function rather than a Python class - the benefit
    of having FittedModel as a class rather than a function is that you can easily save (pickle) classes, including
    data (e.g. parameters, x_data, y_data), but you can't do that with functions. And, because the FittedModel class
    has a __call__ method, you can still think of it basically as a function.
    """

    def __init__(self,
                 model: Callable[
                     [
                         Union[np.ndarray, Dict[str, np.ndarray]],
                         Dict[str, float]
                     ],
                     np.ndarray
                 ],
                 x_data: Union[np.ndarray, Dict[str, np.ndarray]],
                 y_data: np.ndarray,
                 parameter_guesses: Dict[str, float],
                 parameter_bounds: Dict[str, tuple] = None,
                 residual_norm_type: str = "L2",
                 fit_type: str = "best",
                 weights: np.ndarray = None,
                 put_residuals_in_logspace: bool = False,
                 verbose=True,
                 ):
        """
        Fits an analytical model to n-dimensional unstructured data using an automatic-differentiable optimization approach.

        Args:

            model: The model that you want to fit your dataset to. This is a callable with syntax f(x, p) where:

                * x is a dict of independent variables. Same format as x_data [dict of 1D ndarrays of length n].

                    * If the model is one-dimensional (e.g. f(x1) instead of f(x1, x2, x3...)), you can instead
                      interpret x as a 1D ndarray. (If you do this, just give `x_data` as an array.)

                * p is a dict of parameters. Same format as param_guesses [dict with syntax param_name:param_value].

                Model should return a 1D ndarray of length n.

                Basically, if you've done it right:

                >>> model(x_data, parameter_guesses)

                should evaluate to a 1D ndarray where each entry of x_data is mapped to something analogous to
                y_data. (The fit will likely be bad at this point, because we haven't yet optimized on
                param_guesses - but the types should be happy.)

                Model should use aerosandbox.numpy operators.

                The model is not allowed to make any in-place changes to the input `x`. The most common way this
                manifests itself is if someone writes something to the effect of `x += 3` or similar. Instead,
                write `x = x + 3`.

            x_data: Values of the independent variable(s) in the dataset to be fitted. This is a dictionary; syntax
                is {var_name:var_data}.

                * If the model is one-dimensional (e.g. f(x1) instead of f(x1, x2, x3...)), you can instead supply
                  x_data as a 1D ndarray. (If you do this, just treat `x` as an array in your model, not a dict.)

            y_data: Values of the dependent variable in the dataset to be fitted. [1D ndarray of length n]

            parameter_guesses: a dict of fit parameters. Syntax is {param_name:param_initial_guess}.

                * Parameters will be initialized to the values set here; all parameters need an initial guess.

                * param_initial_guess is a float; note that only scalar parameters are allowed.

            parameter_bounds: Optional: a dict of bounds on fit parameters. Syntax is {"param_name":(min, max)}.

                * May contain only a subset of param_guesses if desired.

                * Use None to represent one-sided constraints (i.e. (None, 5)).

            residual_norm_type: What error norm should we minimize to optimize the fit parameters? Options:

                * "L1": minimize the L1 norm, or sum(abs(error)). Less sensitive to outliers.

                * "L2": minimize the L2 norm, also known as the Euclidean norm, or sqrt(sum(error ** 2)). The default.

                * "Linf": minimize the L-infinity norm, or max(abs(error)). More sensitive to outliers.

            fit_type: Should we find the model of best fit (i.e. the model that minimizes the specified residual
                norm), or should we look for a model that represents an upper/lower bound on the data (useful for
                robust surrogate modeling, so that you can put bounds on modeling error):

                * "best": finds the model of best fit. Usually, this is what you want.

                * "upper bound": finds a model that represents an upper bound on the data (while still trying to
                  minimize the specified residual norm).

                * "lower bound": finds a model that represents a lower bound on the data (while still trying to
                  minimize the specified residual norm).

            weights: Optional: weights for data points. If not supplied, weights are assumed to be uniform.

                * Weights are automatically normalized. [1D ndarray of length n]

            put_residuals_in_logspace: Whether to optimize using the logarithmic error as opposed to the absolute
                error (useful for minimizing percent error).

                Note: If any model outputs or data are negative, this will raise an error!

            verbose: Should the progress of the optimization solve that is part of the fitting be displayed? See
                `aerosandbox.Opti.solve(verbose=)` syntax for more details.

        Returns: A model in the form of a FittedModel object. Some things you can do:
            >>> y = my_fitted_model(x)     # evaluate the FittedModel at new x points
            >>> my_fitted_model.parameters # directly examine the optimal values of the parameters that were found
            >>> my_fitted_model.plot()     # plot the fit

        """
        super().__init__()

        ##### Prepare all inputs, check types/sizes.

        ### Flatten all inputs
        def flatten(input):
            return np.array(input).flatten()

        try:
            x_data = {
                k: flatten(v)
                for k, v in x_data.items()
            }
            x_data_is_dict = True
        except AttributeError:  # If it's not a dict or dict-like, assume it's a 1D ndarray dataset
            x_data = flatten(x_data)
            x_data_is_dict = False
        y_data = flatten(y_data)
        n_datapoints = np.length(y_data)

        ### Handle weighting
        if weights is None:
            weights = np.ones(n_datapoints)
        else:
            weights = flatten(weights)
        sum_weights = np.sum(weights)
        if sum_weights <= 0:
            raise ValueError("The weights must sum to a positive number!")
        if np.any(weights < 0):
            raise ValueError("No entries of the weights vector are allowed to be negative!")
        weights = weights / sum_weights  # Normalize weights so that they sum to 1.

        ### Check format of parameter_bounds input
        if parameter_bounds is None:
            parameter_bounds = {}
        for param_name, v in parameter_bounds.items():
            if param_name not in parameter_guesses.keys():
                raise ValueError(
                    f"A parameter name (key = \"{param_name}\") in parameter_bounds was not found in parameter_guesses.")
            if not np.length(v) == 2:
                raise ValueError(
                    "Every value in parameter_bounds must be a tuple in the format (lower_bound, upper_bound). "
                    "For one-sided bounds, use None for the unbounded side.")

        ### If putting residuals in logspace, check positivity
        if put_residuals_in_logspace:
            if not np.all(y_data > 0):
                raise ValueError("You can't fit a model with residuals in logspace if y_data is not entirely positive!")

        ### Check dimensionality of inputs to fitting algorithm
        relevant_inputs = {
            "y_data" : y_data,
            "weights": weights,
        }
        try:
            relevant_inputs.update(x_data)
        except TypeError:
            relevant_inputs.update({"x_data": x_data})

        for key, value in relevant_inputs.items():
            # Check that the lengths of the inputs are consistent
            series_length = np.length(value)
            if not series_length == n_datapoints:
                raise ValueError(
                    f"The supplied data series \"{key}\" has length {series_length}, but y_data has length {n_datapoints}.")

        ##### Formulate and solve the fitting optimization problem

        ### Initialize an optimization environment
        opti = Opti()

        ### Initialize the parameters as optimization variables
        params = {}
        for param_name, param_initial_guess in parameter_guesses.items():
            if param_name in parameter_bounds:
                params[param_name] = opti.variable(
                    init_guess=param_initial_guess,
                    lower_bound=parameter_bounds[param_name][0],
                    upper_bound=parameter_bounds[param_name][1],
                )
            else:
                params[param_name] = opti.variable(
                    init_guess=param_initial_guess,
                )

        ### Evaluate the model at the data points you're trying to fit
        x_data_original = copy.deepcopy(
            x_data)  # Make a copy of x_data so that you can determine if the model did in-place operations on x and tattle on the user.
        try:
            y_model = model(x_data, params)  # Evaluate the model
        except Exception:
            raise Exception("""
            There was an error when evaluating the model you supplied with the x_data you supplied.
            Likely possible causes:
                * Your model() does not have the call syntax model(x, p), where x is the x_data and p are the parameters.
                * Your model should take in p as a dict of parameters, but it does not.
                * Your model assumes x is an array-like but you provided x_data as a dict, or vice versa.
            See the docstring of FittedModel() if you have other usage questions or would like to see examples.
            """)

        try:  ### If the model did in-place operations on x_data, throw an error
            x_data_is_unchanged = np.all(x_data == x_data_original)
        except ValueError:
            x_data_is_unchanged = np.all([
                x_series == x_series_original
                for x_series, x_series_original in zip(x_data, x_data_original)
            ])
        if not x_data_is_unchanged:
            raise TypeError("model(x_data, parameter_guesses) did in-place operations on x, which is not allowed!")
        if y_model is None:  # Make sure that y_model actually returned something sensible
            raise TypeError("model(x_data, parameter_guesses) returned None, when it should've returned a 1D ndarray.")

        ### Compute how far off you are (error)
        if not put_residuals_in_logspace:
            error = y_model - y_data
        else:
            y_model = np.fmax(y_model, 1e-300)  # Keep y_model always very slightly positive, so that log() doesn't NaN.
            error = np.log(y_model) - np.log(y_data)

        ### Set up the optimization problem to minimize some norm(error), which looks different depending on the norm used:
        if residual_norm_type.lower() == "l1":  # Minimize the L1 norm
            abs_error = opti.variable(init_guess=0,
                                      n_vars=np.length(y_data))  # Make the abs() of each error entry an opt. var.
            opti.subject_to([
                abs_error >= error,
                abs_error >= -error,
            ])
            opti.minimize(np.sum(weights * abs_error))

        elif residual_norm_type.lower() == "l2":  # Minimize the L2 norm
            opti.minimize(np.sum(weights * error ** 2))

        elif residual_norm_type.lower() == "linf":  # Minimize the L-infinity norm
            linf_value = opti.variable(init_guess=0)  # Make the value of the L-infinity norm an optimization variable
            opti.subject_to([
                linf_value >= weights * error,
                linf_value >= -weights * error
            ])
            opti.minimize(linf_value)

        else:
            raise ValueError("Bad input for the 'residual_norm_type' parameter.")

        ### Add in the constraints specified by fit_type, which force the model to stay above / below the data points.
        if fit_type == "best":
            pass
        elif fit_type == "upper bound":
            opti.subject_to(y_model >= y_data)
        elif fit_type == "lower bound":
            opti.subject_to(y_model <= y_data)
        else:
            raise ValueError("Bad input for the 'fit_type' parameter.")

        ### Solve
        sol = opti.solve(verbose=verbose)

        ##### Construct a FittedModel

        ### Create a dict of solved parameters
        params_solved = {}
        for param_name in params:
            try:
                params_solved[param_name] = sol(params[param_name])
            except Exception:
                params_solved[param_name] = np.nan

        ### Store all the data and inputs
        self.model = model
        self.x_data = x_data
        self.y_data = y_data
        self.parameters = params_solved
        self.parameter_guesses = parameter_guesses
        self.parameter_bounds = parameter_bounds
        self.residual_norm_type = residual_norm_type
        self.fit_type = fit_type
        self.weights = weights
        self.put_residuals_in_logspace = put_residuals_in_logspace
    def __call__(self, x):
        super().__call__(x)
        return self.model(x, self.parameters)
    def goodness_of_fit(self,
                        type="R^2"
                        ):
        """
        Returns a metric of the goodness of the fit.

        Args:

            type: Type of metric to use for goodness of fit. One of:

                * "R^2": The coefficient of determination. Strictly speaking, this is only mathematically rigorous
                  to use for linear fits. https://en.wikipedia.org/wiki/Coefficient_of_determination

                * "mean_absolute_error" or "mae" or "L1": The mean absolute error of the fit.

                * "root_mean_squared_error" or "rms" or "L2": The root mean squared error of the fit.

                * "max_absolute_error" or "Linf": The maximum deviation of the fit from any of the data points.

        Returns: The metric of the goodness of the fit.
        """
        if type == "R^2":
            y_mean = np.mean(self.y_data)
            SS_tot = np.sum(
                (self.y_data - y_mean) ** 2
            )
            y_model = self(self.x_data)
            SS_res = np.sum(
                (self.y_data - y_model) ** 2
            )
            R_squared = 1 - SS_res / SS_tot
            return R_squared

        elif type in ("mean_absolute_error", "mae", "L1"):
            return np.mean(np.abs(self.y_data - self(self.x_data)))

        elif type in ("root_mean_squared_error", "rms", "L2"):
            return np.sqrt(np.mean((self.y_data - self(self.x_data)) ** 2))

        elif type in ("max_absolute_error", "Linf"):
            return np.max(np.abs(self.y_data - self(self.x_data)))

        else:
            valid_types = [
                "R^2",
                "mean_absolute_error", "mae", "L1",
                "root_mean_squared_error", "rms", "L2",
                "max_absolute_error", "Linf",
            ]
            valid_types_formatted = [
                f"  * \"{valid_type}\""
                for valid_type in valid_types
            ]
            raise ValueError(
                "Bad value of `type`! Valid values are:\n" + "\n".join(valid_types_formatted)
            )
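FittedModel itself delegates the fit to aerosandbox's Opti (CasADi-backed) environment, so it isn't easy to show in isolation. As a minimal sketch of the underlying math, the following plain-NumPy snippet solves the default case by hand: a weighted L2 fit of an illustrative linear model y = a*x + b (closed-form weighted least squares rather than the iterative solve Opti performs), followed by the same R^2 computation that `goodness_of_fit(type="R^2")` uses. The data values here are made up for illustration.

```python
import numpy as np

# Illustrative dataset, roughly y = 2x + 1 with noise
x_data = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_data = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Uniform weights, normalized to sum to 1 (as FittedModel does)
weights = np.ones_like(y_data) / len(y_data)

# Weighted least squares: minimize sum(w * (a*x + b - y)^2).
# For a linear-in-parameters model this has a closed-form solution.
A = np.stack([x_data, np.ones_like(x_data)], axis=1)  # design matrix [x, 1]
W = np.diag(weights)
a, b = np.linalg.solve(A.T @ W @ A, A.T @ W @ y_data)

# Same R^2 metric as goodness_of_fit(type="R^2")
y_model = a * x_data + b
SS_res = np.sum((y_data - y_model) ** 2)
SS_tot = np.sum((y_data - np.mean(y_data)) ** 2)
R_squared = 1 - SS_res / SS_tot
```

With FittedModel, the same fit would be expressed by passing `model=lambda x, p: p["a"] * x + p["b"]`, `parameter_guesses={"a": 1, "b": 1}`, and the arrays above; the optimizer then minimizes the weighted L2 residual norm numerically instead of in closed form.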