Identifying and Handling Outliers

class zaps.eda._outliers.Olrs(cols=None, mapping=None, method='iqr', distance=None, tail='both', hide_p_bar=False)

Bases: PipeLineMixin, TransformerMixin, BaseEstimator

Identify and handle outlier values. Identification is done by calculating a threshold beyond which an observation is considered an outlier, using one of the following methods:

  • Gaussian approximation:

    Outliers are captured based on distance from the mean. E.g., a data point x is an outlier if x < (mean - 3 * std) or x > (mean + 3 * std), where 3 is the distance from the mean.

  • Inter-quartile range proximity rule (IQR):

    Outliers are identified based on distance from the IQR (Q3 - Q1). E.g., a data point x is an outlier if x < (q1 - 1.5 * iqr) or x > (q3 + 1.5 * iqr), where 1.5 is the distance from the IQR.

  • Median Absolute Deviation from the median (MAD-median rule):

    Uses the same formula as the Gaussian approximation, but replaces the mean with the median and std with MAD, which makes it suitable for skewed data. See the notes below.

  • Quantiles:

    Outliers are identified using specific quantile values. E.g., a data point x is an outlier if x < Q(.05) or x > Q(1 - .05), where Q(.05) and Q(1 - .05) are the 5th and 95th percentiles of the data.
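
The four rules above can be sketched in plain Python. This is a minimal illustration using only the standard library, not the zaps implementation: the helper names quantile and thresholds, the linear-interpolation quantile convention, and the use of the sample standard deviation are all assumptions.

```python
import statistics

# Default distances as listed for the `distance` parameter.
DEFAULTS = {"gaus": 3, "iqr": 1.5, "mad": 1, "q": 0.05}


def quantile(values, p):
    """Quantile by linear interpolation between order statistics (assumed convention)."""
    data = sorted(values)
    pos = p * (len(data) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (pos - lo) * (data[hi] - data[lo])


def thresholds(values, method="iqr", distance=None):
    """Return the (lower, upper) bounds beyond which a value counts as an outlier."""
    d = DEFAULTS[method] if distance is None else distance
    if method == "gaus":
        # Sample standard deviation; the library may use a different estimator.
        center, spread = statistics.mean(values), statistics.stdev(values)
    elif method == "mad":
        center = statistics.median(values)
        spread = statistics.median([abs(v - center) for v in values])
    elif method == "iqr":
        q1, q3 = quantile(values, 0.25), quantile(values, 0.75)
        return q1 - d * (q3 - q1), q3 + d * (q3 - q1)
    elif method == "q":
        return quantile(values, d), quantile(values, 1 - d)
    else:
        raise ValueError(f"unknown method: {method!r}")
    return center - d * spread, center + d * spread


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
for name in DEFAULTS:
    low, high = thresholds(data, name)
    flagged = [x for x in data if x < low or x > high]
    print(name, (low, high), flagged)
```

On this small sample the gaus bounds do not flag 100, because the extreme value itself inflates the standard deviation, while the iqr bounds do.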

Handling is done using Winsorization, i.e. transforming the data by limiting extreme values (outliers) to a certain arbitrary value. The arbitrary values are the thresholds from one of the methods mentioned above, beyond which observations are labeled as outliers.

E.g., if distance = .1 and method = q, then it is an 80% winsorization: all data below the 10th percentile is set to the 10th percentile, and all data above the 90th percentile is set to the 90th percentile, so 20% of the data is reassigned.

Winsorizing is different from trimming because the extreme values are not removed, but are instead replaced by other values.
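
The difference between winsorizing and trimming can be seen in a short standard-library sketch. The helpers quantile and winsorize are hypothetical names used for illustration only, and the linear-interpolation quantile is an assumption rather than the library's documented convention.

```python
def quantile(values, p):
    """Quantile by linear interpolation between order statistics (assumed convention)."""
    data = sorted(values)
    pos = p * (len(data) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (pos - lo) * (data[hi] - data[lo])


def winsorize(values, distance):
    """Cap values below the `distance` quantile and above the `1 - distance` quantile."""
    lower, upper = quantile(values, distance), quantile(values, 1 - distance)
    capped = [min(max(v, lower), upper) for v in values]
    return capped, (lower, upper)


data = list(range(1, 21))  # 1, 2, ..., 20

# 80% winsorization: every observation is kept, extremes are replaced.
capped, (lower, upper) = winsorize(data, 0.1)

# Trimming: observations outside the bounds are dropped instead.
trimmed = [v for v in data if lower <= v <= upper]

changed = sum(c != v for c, v in zip(capped, data))
print(len(capped), len(trimmed), changed)
```

Winsorizing keeps the number of observations unchanged, which matters when the rows must stay aligned with other columns or with a target.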

Notes

  • The default distance under the mad method is not scaled. To use MAD as a robust replacement for the standard deviation of a normal distribution, multiply the distance by 1.4826 before passing it to the distance parameter, e.g. (3 * std) becomes ((3 * 1.4826) * MAD).

  • If the data is normally distributed, the Gaussian approximation method is best suited for identifying outliers; the other methods work on both normal and non-normal data.

  • Data passed to the fit and transform methods is converted to a DataFrame if it is not one already. The default behavior is to bypass the cols and mapping parameters if any of their names is not found in the DataFrame, and to transform all numeric columns instead. Be mindful of the names used in cols and mapping, because column names will be generic in that case.
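
The scaling in the first note can be checked numerically. The sketch below uses only the standard library and simulated normal data; the names MAD_TO_STD, mad, and scaled_distance are illustrative and not part of the library.

```python
import random
import statistics

MAD_TO_STD = 1.4826  # scaling constant quoted in the note above


def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


random.seed(0)
sample = [random.gauss(10, 2) for _ in range(10_000)]

# For roughly normal data, the scaled MAD approximates the standard deviation.
scaled_mad = MAD_TO_STD * mad(sample)
std = statistics.stdev(sample)
print(round(scaled_mad, 2), round(std, 2))

# To mimic "mean +/- 3 * std" with method='mad', the value passed as
# `distance` would be 3 * 1.4826.
scaled_distance = 3 * MAD_TO_STD
```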

Parameters:
  • cols (sequence (lists, tuples, NumPy arrays or Pandas Base Index) or None) – column names of numeric features. If None, cols is ignored and all numeric columns are inferred and transformed. This also applies if any of the cols is not found in the DataFrame or its data type is not numeric.

  • mapping (dict or None) – dictionary for mapping a different outlier labeling method to each column. It must have the following structure: {'column name': (method, distance)} and follows the same logic as the method and distance parameters. Its columns can be independent of the cols parameter and will be merged with it during fit.

  • method (str) – method to label outliers, one of [gaus, iqr, mad, q]:

    • gaus: Gaussian approximation

    • iqr: Inter-quartile range proximity rule (IQR)

    • mad: Median Absolute Deviation from the median

    • q: data quantiles

  • distance (float or None) – override default distance to label outliers, default is: {gaus: 3, iqr: 1.5, mad: 1, q: .05}.

    Note

    when method = q:

    • distance indicates the quantiles. Example: if distance = .05, data will be capped at 5th and 95th percentiles.

    • Outliers will be capped up to a maximum of the 20th percentile at each tail. Thus, distance takes values between 0 and 0.2.

  • tail (str) – direction in which to handle outliers, one of both, right, left:

    • both for outliers at both ends of the distribution

    • right for outliers at the right end of the distribution

    • left for outliers at the left end of the distribution

  • hide_p_bar (bool) – whether to hide the progress bar (tqdm module); default is False.
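
How cols, mapping, method, and distance might combine into one setting per column can be sketched as follows. This is a hypothetical helper (resolve_settings), not the library's code: the precedence of mapping over cols, the fallback to the default distance when a mapping entry gives None, and the exact bounds of the q range check are assumptions.

```python
# Default distances as listed for the `distance` parameter.
DEFAULT_DISTANCE = {"gaus": 3, "iqr": 1.5, "mad": 1, "q": 0.05}


def resolve_settings(cols, mapping=None, method="iqr", distance=None):
    """Return {column: (method, distance)} after merging cols with mapping."""
    if method not in DEFAULT_DISTANCE:
        raise ValueError(f"unknown method: {method!r}")
    base = DEFAULT_DISTANCE[method] if distance is None else distance

    # Columns listed in `cols` share the global method and distance.
    settings = {col: (method, base) for col in cols}

    # Assumption: entries in `mapping` are added to, and override, `cols`.
    for col, (m, d) in (mapping or {}).items():
        if m not in DEFAULT_DISTANCE:
            raise ValueError(f"unknown method for {col!r}: {m!r}")
        settings[col] = (m, DEFAULT_DISTANCE[m] if d is None else d)

    # Quantile distances are documented as lying between 0 and 0.2.
    for col, (m, d) in settings.items():
        if m == "q" and not 0 < d <= 0.2:
            raise ValueError(f"distance for {col!r} must be in (0, 0.2], got {d}")
    return settings


settings = resolve_settings(
    ["age", "income"],
    mapping={"income": ("q", 0.1), "score": ("mad", None)},
    method="gaus",
)
print(settings)
```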

fit(X, y=None, labels=None, disp_res=False)

Calculate thresholds beyond which values are labeled as outliers

Parameters:
  • X (np.ndarray or pd.DataFrame) – data source to use in outlier threshold calculation, usually the training data. Ndarray will be converted to a DataFrame with generic column names.

  • y (None) – There is no need of a target in this transformer, yet the pipeline API requires this parameter.

  • labels (list, Pandas Base Index or None) – labels to use as column names when converting X to a DataFrame. If None, generic names will be generated

  • disp_res (bool) – triggers displaying capping thresholds per each column

Variables:
  • z_inf_out (numpy array) – excluded columns having inf values, if any.

  • z_thrsh_df (Pandas DataFrame) – method used and capping thresholds per feature

  • feature_names_in (numpy array) – Feature names in.

  • n_features_in (numpy array) – Number of features in.
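
The fitting step can be pictured as building a per-column table of thresholds while setting aside columns that contain inf values. The sketch below is an assumption-laden illustration using the IQR rule and plain dictionaries: fit_thresholds is a hypothetical name, and the real z_thrsh_df and z_inf_out attributes may be structured differently.

```python
import math
import statistics


def fit_thresholds(columns, distance=1.5):
    """columns: dict mapping column name -> list of numbers.

    Returns (table, inf_cols): IQR-based capping thresholds per column, and
    the names of columns excluded because they contain inf values.
    """
    inf_cols = [
        name for name, vals in columns.items()
        if any(math.isinf(v) for v in vals)
    ]
    table = {}
    for name, vals in columns.items():
        if name in inf_cols:
            continue  # excluded, in the spirit of z_inf_out
        # 'inclusive' interpolates linearly between the observed data points.
        q1, _, q3 = statistics.quantiles(vals, n=4, method="inclusive")
        iqr = q3 - q1
        table[name] = {
            "method": "iqr",
            "lower": q1 - distance * iqr,
            "upper": q3 + distance * iqr,
        }
    return table, inf_cols


train = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "b": [1.0, float("inf"), 3.0],
}
table, inf_cols = fit_thresholds(train)
print(table, inf_cols)
```

Computing the thresholds once on the training data and reusing them at transform time keeps the capping values consistent between training and new data.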

transform(X, mark=False)

Cap outliers.

Parameters:
  • X (np.ndarray or pd.DataFrame) – data to transform. Ndarray will be converted to a DataFrame with generic column names.

  • mark (bool) – whether to flag the capped outliers. If True, a new binary column is added to the DataFrame flagging outlier observations.

Variables:
  • z_olrs (dict) – outlier data points index per feature

  • z_unique_olrs_idx (dict) – unique outlier index across all features

Returns:

df_clean (Pandas DataFrame) – transformed features after capping outliers
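
The capping step, including the effect of tail and of flagging capped observations, can be sketched on a single list of values. The function cap and the flags list are hypothetical stand-ins for illustration; the thresholds are hard-coded here, whereas the transformer would use those computed during fit.

```python
def cap(values, lower, upper, tail="both"):
    """Cap values at the thresholds and return (capped, outlier_idx)."""
    if tail not in ("both", "right", "left"):
        raise ValueError(f"unknown tail: {tail!r}")
    capped, outlier_idx = [], []
    for i, v in enumerate(values):
        new = v
        if tail in ("both", "left") and v < lower:
            new = lower  # left-end outlier
        elif tail in ("both", "right") and v > upper:
            new = upper  # right-end outlier
        if new != v:
            outlier_idx.append(i)
        capped.append(new)
    return capped, outlier_idx


values = [-50, 2, 3, 4, 120]

both, both_idx = cap(values, lower=0, upper=10, tail="both")
right, right_idx = cap(values, lower=0, upper=10, tail="right")
left, left_idx = cap(values, lower=0, upper=10, tail="left")

# Analogue of mark=True: a binary flag per observation.
flags = [int(i in set(both_idx)) for i in range(len(values))]
print(both, both_idx, flags)
```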