Categorical Analysis

class zaps.eda._cat_analysis.CatAna(df, cols, target, cat_cols=None, rare_thresh=0.05, top_n=25, nans_d=None, frac=None, random_state=45, figsize=None, n_rows=None, n_cols=None, silent=False, hide_p_bar=False)

Bases: Dist

Uses Statsmodels and SciPy to perform univariate and multivariate analysis and visualizations that guide the handling of multi-level categorical features.

Notes

  • Features are treated as categorical irrespective of their dtype (string or numeric). Explicitly specify whether the columns are categorical using the cat_cols parameter, as this affects both grouping and the underlying calculations.

  • If not explicitly specified, it is assumed that if the features in cols are categorical then target must be numeric, and vice versa. This dictates the direction of categorical grouping of numeric data and the method used to calculate the mutual information (MI) score.

  • Categories are preprocessed to highlight rare levels: only frequent levels are displayed as is, while the others are grouped into a single level called rare. This behavior is controlled by the top_n and rare_thresh parameters. For severely imbalanced datasets, ensure that rare_thresh accounts for the minority class when plotting conditional distributions.

  • Missing values (NaN) in categorical features are not removed by default; instead, they are treated as a separate level called missing. If missing values also happen to be rare, they are labeled rare instead of missing.

  • Missing values (NaN) in numeric columns are imputed with the mean of the respective column. To impute differently, use the nans_d parameter or impute before using this class.

  • The Scikit-learn algorithm for the mutual information score treats discrete features differently from continuous features. Any feature that must have a float dtype is therefore treated as non-discrete and flagged as such when calculating the MI score for a continuous target. Specify appropriate dtypes for numeric features before using this class.

  • For ordinal categorical features, it is better to check correlations and regression analysis.

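  • Mind the frac parameter when using the internal preprocessed DataFrame (z_df): when a sample is requested, the analysis is run on that sample rather than on the full data.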
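The rare-level grouping described in the notes above can be illustrated with a minimal pandas sketch. This is not the class's internal implementation: the helper name group_rare_levels is hypothetical, and both the >= comparison and the way top_n is combined with rare_thresh are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def group_rare_levels(s, rare_thresh=0.05, top_n=25):
    """Hypothetical helper: label NaN as 'missing', then fold infrequent levels into 'rare'."""
    # Treat missing values as their own level before counting frequencies.
    s = s.astype("object").where(s.notna(), "missing")
    # Relative frequency of each level, most frequent first.
    freq = s.value_counts(normalize=True)
    # Assumption: a level is kept only if it is among the top_n most frequent
    # levels AND its relative frequency reaches rare_thresh.
    frequent = freq.head(top_n)
    frequent = frequent[frequent >= rare_thresh].index
    return s.where(s.isin(frequent), "rare")


colors = pd.Series(["red"] * 50 + ["blue"] * 44 + ["green"] * 3 + [np.nan] * 3)
grouped = group_rare_levels(colors, rare_thresh=0.05)
print(grouped.value_counts().to_dict())  # {'red': 50, 'blue': 44, 'rare': 6}
```

In this toy series the missing values make up only 3% of the rows, so they end up in rare rather than missing, which mirrors the behavior described in the notes.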

Parameters:
  • df (pandas dataframe) – data source

  • cols (sequence (lists, tuples, NumPy arrays or Pandas Base Index)) – column names of features to analyze. Prefer homogeneous subsets, for example either all categorical or all numeric features; a categorical subset can have multiple dtypes (object or numeric) depending on the nature of the feature

  • target (str) – column name of target variable

  • cat_cols (Bool or None) – indicates whether cols are categorical in nature and, in turn, the direction of categorical grouping of numeric data. For example, with a binary target, numeric cols are grouped by the categorical target, and vice versa. If None, inferred automatically.

  • rare_thresh, top_n (float, int) – frequency threshold and maximum cardinality beyond which levels are grouped and analysed as a single rare level. If rare_thresh = 0 or top_n < 2, all levels are analysed as is.

    Notes

    • the two are independent: for example, treating levels with a frequency of 1% to 5% as rare can still leave a high-cardinality categorical feature (> N levels)

    • missing values will be displayed as rare rather than missing if the new missing category is below rare_thresh

  • nans_d (dict or None) – dictionary where keys are column names and values are the replacements for missing values (NaN). Use it to apply a different imputation to each of several numeric cols.

  • frac (float or None) – fraction of dataframe to use as a sample for analysis:

    • 0 < frac < 1 returns a random sample with size frac.

    • frac = 1 returns shuffled dataframe.

    • frac > 1 up-samples the dataframe, sampling the same row more than once.

  • random_state (int) – for reproducibility, controls the random number generator for frac parameter and when calculating mutual information scores.

  • figsize (tuple) – dimensions of matplotlib figure (width, height)

  • n_rows (int) – number of rows in matplotlib subplot figure

  • n_cols (int) – number of columns in matplotlib subplot figure

  • silent (Bool) – controls whether user input is solicited to continue during iterative plotting. If True, plotting proceeds without user interaction.

  • hide_p_bar (Bool) – hides the progress bar (tqdm module). Default False
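The three frac regimes listed above behave like pandas DataFrame.sample. The sketch below uses that method directly on a toy frame; that the class samples in the same way, including passing replace=True when frac > 1, is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# 0 < frac < 1: random subsample without replacement.
sub = df.sample(frac=0.5, random_state=45)

# frac = 1: the same rows in shuffled order.
shuffled = df.sample(frac=1, random_state=45)

# frac > 1: up-sampling; pandas requires replace=True here, so rows can repeat.
up = df.sample(frac=1.5, replace=True, random_state=45)

print(len(sub), len(shuffled), len(up))  # 5 10 15
```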

Variables:
  • z_inf_out (numpy array) – excluded columns having inf values, if any.

  • z_nans (numpy array) – numeric column names where imputation of nan values took place.

  • z_df (pandas dataframe) – preprocessed dataframe that was used internally

  • z_freq_lvls_map (dict) – keys are column names and values are their frequent levels. Only applicable when cols are categorical

  • xludd_feats (numpy array) – categorical column names excluded from the analysis because they are dominated by rare levels. Only applicable when cols are categorical

ana_owva(alpha=0.05, disp_res=True)

One-way ANOVA & Kruskal-Wallis H (non-parametric equivalent of the One-Way ANOVA) for Numeric vs Categorical Features.

null: the means/medians of all groups are equal

Parameters:
  • alpha (float) – significance level for rejecting the null hypothesis. Reject null if p-value < alpha

  • disp_res (bool) – triggers displaying ANOVA results DataFrame

Variables:

zefct_df (pandas dataframe) – dataframe showing effect of categorical feature on target distribution. Only applicable when cols are categorical

Returns:

anova_df (pandas dataframe) – dataframe of ANOVA and related assumptions results
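As a rough illustration of the two tests named above, the sketch below calls scipy.stats.f_oneway and scipy.stats.kruskal directly on synthetic groups. The data and variable names are hypothetical, and the method itself may run additional assumption checks that are not shown here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(45)
# Three hypothetical groups of a numeric variable; the third has a shifted mean.
groups = [
    rng.normal(loc=0.0, scale=1.0, size=200),
    rng.normal(loc=0.0, scale=1.0, size=200),
    rng.normal(loc=1.0, scale=1.0, size=200),
]

alpha = 0.05
f_stat, p_anova = stats.f_oneway(*groups)   # parametric: compares group means
h_stat, p_kruskal = stats.kruskal(*groups)  # rank-based: no normality assumption

# Reject the null of equal means/medians when the p-value is below alpha.
reject_anova = p_anova < alpha
reject_kruskal = p_kruskal < alpha
print(reject_anova, reject_kruskal)
```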

ana_post(equal_var='levene', alternative='two-sided', alpha=0.05, multi_tst_corrc='bonf', disp_res=True)

Post-hoc analysis (T-test and Mann–Whitney U test) for categorical features having more than two levels.

Parameters:
  • equal_var (str) – method used to check the equal-variance assumption before calculating the T-test. One of levene or fligner. Levene's test checks for equal variances without assuming normally distributed data; the Fligner-Killeen test is distribution free when populations are identical

  • alternative (str) – defines the alternative hypothesis

    • two-sided: distributions underlying the samples are unequal

    • less: the distribution underlying the first sample is less than the distribution underlying the second sample

    • greater: the distribution underlying the first sample is greater than the distribution underlying the second sample.

  • alpha (float) – significance level, before adjustment, for rejecting the null hypothesis; it is also used in the multiple comparison corrections. Reject null if p-value < alpha

  • multi_tst_corrc (str) – method used for testing and adjusting p-values, as in statsmodels multipletests

    • bonferroni (the default, given as bonf in the signature): one-step correction

    • sidak: one-step correction

    • holm-sidak: step down method using Sidak adjustments

    • holm: step-down method using Bonferroni adjustments

    • simes-hochberg: step-up method (independent)

    • hommel: closed method based on Simes tests (non-negative)

    • fdr_bh: Benjamini/Hochberg (non-negative)

    • fdr_by: Benjamini/Yekutieli (negative)

    • fdr_tsbh: two stage fdr correction (non-negative)

    • fdr_tsbky: two stage fdr correction (non-negative)

  • disp_res (bool) – triggers displaying results DataFrame, only for features having no significant results

Variables:

xludd_phoc_feats (numpy array) – column names excluded from the analysis because they have only two frequent levels. Only applicable when cols are categorical

Returns:

post_hoc_df (pandas dataframe) – dataframe of Post-Hoc analysis results
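The post-hoc workflow can be pictured as pairwise tests followed by a multiple-comparison correction. The sketch below is an illustration under stated assumptions, not the method's actual code: the data are synthetic, using Levene's p-value to switch between Student's and Welch's t-test is an assumption, and the Bonferroni adjustment is written by hand instead of calling statsmodels multipletests.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(45)
# Hypothetical levels of a categorical feature mapped to numeric target samples.
samples = {
    "a": rng.normal(0.0, 1.0, 200),
    "b": rng.normal(0.0, 1.0, 200),
    "c": rng.normal(1.5, 1.0, 200),
}

alpha = 0.05
pairs = list(combinations(samples, 2))
t_pvals, u_pvals = [], []
for left, right in pairs:
    x, y = samples[left], samples[right]
    # Assumption: if Levene's test does not reject equal variances,
    # use Student's t-test; otherwise fall back to Welch's t-test.
    equal_var = stats.levene(x, y).pvalue >= alpha
    t_pvals.append(
        stats.ttest_ind(x, y, equal_var=equal_var, alternative="two-sided").pvalue
    )
    # Non-parametric counterpart on the same pair.
    u_pvals.append(stats.mannwhitneyu(x, y, alternative="two-sided").pvalue)

# Bonferroni: multiply each p-value by the number of comparisons, cap at 1.
adj_t = np.minimum(np.array(t_pvals) * len(pairs), 1.0)
adj_u = np.minimum(np.array(u_pvals) * len(pairs), 1.0)
significant = {pair: bool(p < alpha) for pair, p in zip(pairs, adj_t)}
print(significant)
```

Step-down methods such as holm are less conservative than the one-step Bonferroni correction shown here, which is one reason to compare several of the listed options.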

ana_chi2(alpha=0.05, disp_res=True)

Chi2 test of independence between two categorical variables, best suited for nominal data.

null: categorical variables are independent

Notes

  • An often quoted guideline for the validity of this calculation is that the test should be used only if the observed and expected frequencies in each cell are at least 5.

  • This is a test for the independence of different categories of a population. The test is only meaningful when the dimension of observed is two or more. Applying the test to a one-dimensional table will always result in expected equal to observed and a chi-square statistic equal to 0.

  • This function does not handle masked arrays, because the calculation does not make sense with missing values.

Parameters:
  • alpha (float) – significance level for rejecting the null hypothesis. Reject null if p-value < alpha

  • disp_res (bool) – triggers displaying results DataFrame

Variables:
  • zefct_df (pandas dataframe) – dataframe showing effect of categorical feature on target distribution. Only applicable when cols are categorical

  • z_crss_tabs (dict) – where keys are analyzed columns and values are their corresponding cross_tabs. Only applicable when cols are categorical

Returns:

chi2_df (pandas dataframe) – dataframe of chi2 analysis results
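A minimal sketch of the underlying test, using pandas.crosstab and scipy.stats.chi2_contingency on hypothetical data. The column names plan and churn are made up for illustration, and how the method builds z_crss_tabs internally is not shown in this reference.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: a categorical feature and a binary categorical target.
df = pd.DataFrame({
    "plan": ["basic"] * 100 + ["pro"] * 100,
    "churn": ["yes"] * 70 + ["no"] * 30 + ["yes"] * 30 + ["no"] * 70,
})

# Contingency table of observed counts (rows: plan, columns: churn).
table = pd.crosstab(df["plan"], df["churn"])
chi2, p_value, dof, expected = stats.chi2_contingency(table.to_numpy())

alpha = 0.05
reject_null = p_value < alpha  # True means independence is rejected
# Rule-of-thumb validity check: every expected count should be at least 5.
valid = bool((expected >= 5).all())
print(dof, reject_null, valid)
```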