Categorical Analysis
- class zaps.eda._cat_analysis.CatAna(df, cols, target, cat_cols=None, rare_thresh=0.05, top_n=25, nans_d=None, frac=None, random_state=45, figsize=None, n_rows=None, n_cols=None, silent=False, hide_p_bar=False)
Bases: Dist
Utilizes both Statsmodels and SciPy to perform univariate and multivariate analysis and visualizations to guide the handling of multi-level categorical features.
Notes
Categorical features are considered irrespective of their dtype, whether string or numeric. Users of this class are advised to explicitly specify whether the columns are categorical using the cat_cols parameter, as this affects both grouping and the underlying calculations.
If not explicitly specified, it is assumed that if the features in cols are categorical then target must be numeric, and vice versa. This dictates the direction of categorical grouping of numeric data and the method of calculating the mutual information (MI) score.
Categories are preprocessed to highlight rare levels; only frequent levels are displayed as is, while the others are grouped into a single level called rare. This behavior can be controlled by the top_n and rare_thresh parameters. For severely imbalanced datasets, ensure that the rare_thresh parameter accounts for the minority class when plotting conditional distributions.
Missing values (NaN) in categorical features are not removed by default; instead, they are treated as a separate level called missing. If missing values also happen to be rare, they are labeled rare instead of missing.
Missing values (NaN) in numeric columns are imputed with the mean of the respective column. If a different behavior is desired, use the nans_d parameter or impute before using this class.
The Scikit-learn algorithm for the mutual information score treats discrete features differently from continuous features. Thus, anything that must have a float dtype is not discrete and will be flagged as such when calculating the MI score when target is continuous. It is advised to specify appropriate dtypes for numeric features before using this class.
For ordinal categorical features, it is better to check correlations and regression analysis.
Mind the frac parameter when using the internal preprocessed DataFrame.
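The rare-level preprocessing described in the notes can be illustrated with a minimal pandas sketch. This is not the class's actual implementation: the helper name group_rare_levels, the use of >= for the threshold comparison, and the way top_n caps the kept levels are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def group_rare_levels(s, rare_thresh=0.05, top_n=25):
    """Hypothetical helper: label NaN as 'missing' and infrequent levels as 'rare'."""
    # Treat NaN as its own level before counting frequencies.
    filled = s.astype(object).where(s.notna(), "missing")
    # Per the parameter description, these settings leave all levels as is.
    if rare_thresh == 0 or top_n < 2:
        return filled
    # Relative frequency of every level, most frequent first.
    freq = filled.value_counts(normalize=True)
    # Assumption: keep at most top_n levels, each at or above rare_thresh.
    frequent = freq[freq >= rare_thresh].index[:top_n]
    # Everything else, including a rare 'missing' level, becomes 'rare'.
    return filled.where(filled.isin(frequent), "rare")


colors = pd.Series(["red"] * 10 + ["blue"] * 8 + ["green"] + [np.nan])
grouped = group_rare_levels(colors, rare_thresh=0.1)
print(grouped.value_counts())
```

Here both green and the single missing value fall below the 10% threshold, so they end up in the same rare level, which mirrors the note about rare missing values.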
- Parameters:
df (pandas dataframe) – data source
cols (sequence (lists, tuples, NumPy arrays or Pandas Base Index)) – column names of features to analyze. Prefer homogeneous subsets, for example either all categorical or all numeric features; a categorical subset can have multiple dtypes (object or numeric) depending on the nature of the feature
target (str) – column name of target variable
cat_cols (Bool or None) – indicates whether cols are categorical in nature and, in turn, the direction of categorical grouping of numeric data. For example, for a binary target, numeric cols are grouped by the categorical target, and vice versa. If None, inferred automatically.
rare_thresh, top_n (float, int) – frequency threshold and maximum cardinality beyond which levels are grouped and analysed as a single level. If rare_thresh = 0 or top_n < 2 then all levels are analysed as is; otherwise these levels are grouped as a single rare level.
Notes
Both are independent: for example, considering a categorical level to be rare (1% - 5%) can still result in a high-cardinality categorical feature (> N levels).
Missing values will be displayed as rare rather than missing if the new missing category is below rare_thresh.
nans_d (dict or None) – dictionary where keys are column names and values are the replacements for missing (NaN) values. Use it to impute several numeric cols, each with its own value.
frac (float or None) – fraction of the dataframe to use as a sample for analysis:
0 < frac < 1 returns a random sample of size frac.
frac = 1 returns the shuffled dataframe.
frac > 1 up-samples the dataframe, sampling the same row more than once.
random_state (int) – for reproducibility; controls the random number generator for the frac parameter and when calculating mutual information scores.
figsize (tuple) – dimensions of the matplotlib figure (width, height)
n_rows (int) – number of rows in matplotlib subplot figure
n_cols (int) – number of columns in matplotlib subplot figure
silent (Bool) – controls whether user input is solicited to continue during iterative plotting. If True, plotting proceeds without user interaction.
hide_p_bar (Bool) – triggers hiding progress bar (tqdm module); Default False
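The three frac regimes listed above match the behaviour of pandas DataFrame.sample. The sketch below assumes that sampling is done this way; the helper sample_frame is hypothetical and is not part of the class.

```python
import pandas as pd


def sample_frame(df, frac=None, random_state=45):
    """Hypothetical helper reproducing the documented frac semantics."""
    if frac is None:
        return df
    # Sampling with replacement is required once frac exceeds 1.
    return df.sample(frac=frac, replace=frac > 1, random_state=random_state)


df = pd.DataFrame({"x": range(10)})
half = sample_frame(df, 0.5)       # random subset
shuffled = sample_frame(df, 1)     # same rows, new order
upsampled = sample_frame(df, 1.5)  # some rows repeated
print(len(half), len(shuffled), len(upsampled))
```

Because the sample changes the rows (and possibly their multiplicity), any statistic read off the internal preprocessed DataFrame reflects the sample rather than the full data.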
- Variables:
z_inf_out (numpy array) – excluded columns having inf values, if any.
z_nans (numpy array) – numeric column names where imputation of nan values took place.
z_df (pandas dataframe) – preprocessed dataframe that was used internally
z_freq_lvls_map (dict) – where keys are column names and values are frequent levels. Only applicable when cols are categorical
xludd_feats (numpy array) – categorical column names excluded from the analysis for being dominated by rare levels. Only applicable when cols are categorical
- ana_owva(alpha=0.05, disp_res=True)
One-way ANOVA & Kruskal-Wallis H (non-parametric equivalent of the one-way ANOVA) for numeric vs categorical features.
Null hypothesis: the means/medians of all groups are equal
- Parameters:
alpha (float) – significance level for rejecting the null hypothesis. Reject null if p-value < alpha
disp_res (bool) – triggers displaying the ANOVA results DataFrame
- Variables:
zefct_df (pandas dataframe) – dataframe showing the effect of the categorical feature on the target distribution. Only applicable when cols are categorical
- Returns:
anova_df (pandas dataframe) – dataframe of ANOVA and related assumptions results
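The two tests named above are available in SciPy as scipy.stats.f_oneway and scipy.stats.kruskal. The following is a minimal sketch of applying them to a numeric column grouped by a categorical one; the synthetic data and column names are assumptions, and the sketch does not reproduce the assumption checks or the layout of anova_df.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(45)
# Synthetic data: level "c" is shifted away from "a" and "b".
df = pd.DataFrame({
    "level": np.repeat(["a", "b", "c"], 50),
    "value": np.concatenate([
        rng.normal(0.0, 1.0, 50),
        rng.normal(0.0, 1.0, 50),
        rng.normal(3.0, 1.0, 50),
    ]),
})

# One array of numeric values per categorical level.
groups = [g.to_numpy() for _, g in df.groupby("level")["value"]]

alpha = 0.05
f_stat, p_anova = stats.f_oneway(*groups)   # parametric: equal means
h_stat, p_kruskal = stats.kruskal(*groups)  # non-parametric: equal medians

print(f"ANOVA p={p_anova:.3g}, reject null: {p_anova < alpha}")
print(f"Kruskal-Wallis p={p_kruskal:.3g}, reject null: {p_kruskal < alpha}")
```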
- ana_post(equal_var='levene', alternative='two-sided', alpha=0.05, multi_tst_corrc='bonf', disp_res=True)
Post-hoc analysis (t-test and Mann–Whitney U test) for categorical features having more than two levels.
- Parameters:
equal_var (str) – method used to check the equal-variance assumption before calculating the t-test. One of levene or fligner. Levene's test checks for equal variances without assuming normally distributed data; the Fligner-Killeen test is distribution free when populations are identical
alternative (str) – defines the alternative hypothesis
two-sided: distributions underlying the samples are unequal
less: the distribution underlying the first sample is less than the distribution underlying the second sample
greater: the distribution underlying the first sample is greater than the distribution underlying the second sample.
alpha (float) – pre-adjusted alpha (significance level) for rejecting the null hypothesis; also used in multiple comparison corrections. Reject null if p-value < alpha
multi_tst_corrc (str) – method used for testing and adjusting p-values, from statsmodels multipletests
bonferroni: one-step correction
sidak: one-step correction
holm-sidak: step down method using Sidak adjustments
holm: step-down method using Bonferroni adjustments
simes-hochberg: step-up method (independent)
hommel: closed method based on Simes tests (non-negative)
fdr_bh: Benjamini/Hochberg (non-negative)
fdr_by: Benjamini/Yekutieli (negative)
fdr_tsbh: two stage fdr correction (non-negative)
fdr_tsbky: two stage fdr correction (non-negative)
disp_res (bool) – triggers displaying results DataFrame, only for features having no significant results
- Variables:
xludd_phoc_feats (numpy array) – column names excluded from the analysis for having only two frequent levels. Only applicable when cols are categorical
- Returns:
post_hoc_df (pandas dataframe) – dataframe of Post-Hoc analysis results
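A pairwise post-hoc pass of this kind can be sketched with SciPy alone. The example below is an illustration under stated assumptions, not the method's implementation: the data are synthetic, the rule for turning the Levene result into the equal_var flag is assumed, and the Bonferroni adjustment is done by hand instead of through statsmodels multipletests.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(45)
# Synthetic samples for three levels; "c" is shifted.
samples = {
    "a": rng.normal(0.0, 1.0, 60),
    "b": rng.normal(0.0, 1.0, 60),
    "c": rng.normal(3.0, 1.0, 60),
}

alpha = 0.05
pairs = list(combinations(samples, 2))
n_tests = len(pairs)

results = {}
for left, right in pairs:
    x, y = samples[left], samples[right]
    # Assumed rule: if Levene's test rejects equal variances, use Welch's t-test.
    equal_var = bool(stats.levene(x, y).pvalue >= alpha)
    t_p = stats.ttest_ind(x, y, equal_var=equal_var, alternative="two-sided").pvalue
    u_p = stats.mannwhitneyu(x, y, alternative="two-sided").pvalue
    # Manual Bonferroni: multiply by the number of comparisons, cap at 1.
    results[(left, right)] = (min(t_p * n_tests, 1.0), min(u_p * n_tests, 1.0))

for pair, (t_adj, u_adj) in results.items():
    print(pair, f"t-test p_adj={t_adj:.3g}", f"Mann-Whitney p_adj={u_adj:.3g}")
```

Comparing each adjusted p-value against the original alpha is equivalent to comparing the raw p-value against alpha divided by the number of comparisons.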
- ana_chi2(alpha=0.05, disp_res=True)
Chi2 test of independence between two categorical variables, best suited for nominal data.
Null hypothesis: the categorical variables are independent
Notes
An often quoted guideline for the validity of this calculation is that the test should be used only if the observed and expected frequencies in each cell are at least 5.
This is a test for the independence of different categories of a population. The test is only meaningful when the dimension of observed is two or more. Applying the test to a one-dimensional table will always result in expected equal to observed and a chi-square statistic equal to 0.
This function does not handle masked arrays, because the calculation does not make sense with missing values.
- Parameters:
alpha (float) – significance level for rejecting the null hypothesis. Reject null if p-value < alpha
disp_res (bool) – triggers displaying the results DataFrame
- Variables:
zefct_df (pandas dataframe) – dataframe showing the effect of the categorical feature on the target distribution. Only applicable when cols are categorical
z_crss_tabs (dict) – where keys are analyzed columns and values are their corresponding cross_tabs. Only applicable when cols are categorical
- Returns:
chi2_df (pandas dataframe) – dataframe of chi2 analysis results
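The test itself can be sketched with pandas.crosstab and scipy.stats.chi2_contingency. The toy data and column names below are assumptions for illustration, and the output does not mirror the structure of chi2_df or z_crss_tabs.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: churn is more common on the basic plan.
df = pd.DataFrame({
    "plan": ["basic"] * 40 + ["premium"] * 40,
    "churn": ["yes"] * 30 + ["no"] * 10 + ["yes"] * 10 + ["no"] * 30,
})

# Observed frequencies for each (plan, churn) combination.
cross_tab = pd.crosstab(df["plan"], df["churn"])

alpha = 0.05
chi2, p_value, dof, expected = chi2_contingency(cross_tab)

# Guideline from the notes: expected frequencies should be at least 5.
valid = expected.min() >= 5

print(cross_tab)
print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.3g}, reject null: {p_value < alpha}")
```

Note that chi2_contingency applies Yates' continuity correction by default for 2x2 tables, so the statistic differs slightly from the uncorrected formula.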