Univariate Analysis

class zaps.eda._uni_analysis.UniStat(df, col_drop=None, card_thresh=10, rare_thresh=0.05, skw_thresh=1, figsize=None, n_rows=None, n_cols=None, silent=False, hide_p_bar=False, theme='darkorange', color='lightblue')

Bases: PlotMixin

Calculate and visualize univariate statistics for all features, identifying feature problems (e.g., missing values, skew, rare categories).

Parameters:
  • df (pandas dataframe) – data source

  • col_drop (sequence (lists, tuples, NumPy arrays or Pandas Base Index) or None) – column name(s) to exclude when analysing duplicates

  • card_thresh (int) – threshold for considering a categorical feature to be of high cardinality

  • rare_thresh (float) – proportion below which categorical levels are considered rare (e.g., 0.01 for 1% or 0.05 for 5%). If 0, all levels are considered during the analysis.

  • skw_thresh (int) – threshold for highlighting and plotting skewed numeric distributions, assuming a normal theoretical distribution. A feature is considered skewed when abs(skew score) > skw_thresh.

  • figsize (tuple or None) – dimensions of matplotlib figure (width, height)

  • n_rows (int or None) – number of rows in matplotlib subplot figure

  • n_cols (int or None) – number of columns in matplotlib subplot figure

  • silent (Bool) – controls whether the user is prompted to continue during iterative plotting. If True, plotting proceeds without user interaction; if False (default), user input is solicited before continuing.

  • hide_p_bar (Bool) – if True, hides the progress bar (tqdm module). Default False.

  • theme (str) – color applied to axes and titles

  • color (str) – color of the Plotly bars

Note

Data types are inferred automatically. However, to get the best separation of categorical and numeric columns, it is better to ensure that the correct data types are applied before using this class.
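As an illustration of the note above, the following minimal sketch casts columns to their intended types before any analysis. The toy data and column names are hypothetical, and the use of select_dtypes to split columns is an assumption made for illustration, not a statement about how this class infers types internally.

```python
import pandas as pd

# Hypothetical toy data: all column names are illustrative only.
df = pd.DataFrame({
    "zip_code": [10001, 10002, 10001, 94105],        # numeric-looking, but really a label
    "income": ["52000", "61000", "48000", "75000"],  # numbers stored as strings
    "segment": ["a", "b", "a", "c"],
})

# Cast explicitly so that dtype-based inference sees the intended types.
df = df.astype({"zip_code": "category", "segment": "category"})
df["income"] = pd.to_numeric(df["income"])

# Assumed splitting rule, for illustration: numeric dtypes vs. everything else.
num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns
```

Without the explicit casts, zip_code would be treated as numeric and income as non-numeric under this splitting rule.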

peek(disp_res=True)

Calculate univariate statistics for numeric and categorical features while identifying the following:

  • Proportion of missing data

  • Highly skewed features, assuming normal distribution

  • High cardinality categorical features

  • Proportion of rare categorical levels

Parameters:

disp_res (Bool) – if True, displays the summary results

Variables:
  • z_summary (Pandas DataFrame) – info about the dataframe

  • z_miss_data (Pandas Series) – Proportion of missing data

  • z_hc_data (Pandas Series) – Count of categories/levels of high cardinality categorical columns

  • z_rare_cat (Pandas DataFrame) – Count and proportion of rare categories/levels

  • z_univ_stat_df (Pandas DataFrame) – univariate statistics

Returns:

  • num_cols (Pandas Index) – Numeric column names

  • cat_cols (Pandas Index) – Categorical column names

  • dup_df (Pandas DataFrame) – duplicate rows
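The checks listed for peek can be approximated with plain pandas. The sketch below is not the library's implementation: the toy data and variable names are hypothetical, and treating "high cardinality" as strictly more distinct levels than card_thresh is an assumption. Only the skew rule, abs(skew score) > skw_thresh, is taken from the parameter description.

```python
import pandas as pd

# Hypothetical toy data; card_thresh is lowered from its default to suit the tiny example.
df = pd.DataFrame({
    "city": ["a"] * 50 + ["b"] * 46 + ["c"] * 3 + [None],
    "amount": [1.0] * 95 + [50.0, 80.0, 120.0, 200.0, 500.0],
})
card_thresh, rare_thresh, skw_thresh = 2, 0.05, 1

# Proportion of missing values per column.
miss = df.isna().mean()

# Levels whose relative frequency (among non-missing values) falls below rare_thresh.
freq = df["city"].value_counts(normalize=True)
rare_levels = freq[freq < rare_thresh]

# Assumption: "high cardinality" means more distinct levels than card_thresh.
high_card = df["city"].nunique() > card_thresh

# Skew rule quoted in the parameter list: abs(skew score) > skw_thresh.
skewed = abs(df["amount"].skew()) > skw_thresh
```

Here level "c" accounts for 3 of 99 non-missing values, so it falls below the 0.05 threshold, and the long right tail of amount pushes its skew well above 1.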

stats_plot(width=None, height=None)

Interactive plots visualizing:

  • Proportion of missing data

  • High cardinality categorical features

  • Proportion of rare categorical levels

Parameters:
  • width (int, default None) – The figure width in pixels

  • height (int, default None) – The figure height in pixels

skew_plot(dist='norm', cols=None)

Generate probability plots for highly skewed features against a specified distribution (default: normal).

Probability plots compare the unscaled, ordered feature values (Y-axis) against the scaled theoretical expected quantiles of the chosen distribution (X-axis). For the default normal distribution, these quantiles are z-scores of the standard normal distribution, dist.ppf(p), where p is the order statistics of the uniform distribution and ppf is the inverse CDF. In other words, the x values are generated from uniform probabilities p using the distribution's parameters (mu and sigma for the normal).

If the generated (x) and actual (y) values both come from the same distribution, all points (blue dots) should form a straight line in the plot and lie on the red line.

Note that the red line is the OLS best-fit line of the ordered values (y) on the theoretical quantiles (x): OLS(dist.ppf(p), sort(x)), where sort(x) denotes the sorted feature values. This line provides insight into whether the feature can be characterized by the distribution. If the two distributions are linearly related but not similar, the blue dots will lie approximately on a line, but not necessarily on the line y = x.

The degree of similarity between the two distributions can be assessed this way, which helps guide the choice of further data preprocessing.
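The construction described above can be sketched with the standard library alone. This is an illustrative sketch, not the class's implementation: the function name prob_plot_points is hypothetical, the theoretical distribution is fixed to the standard normal, and Filliben's estimate is assumed as the choice of uniform order statistics p. scipy.stats.probplot computes the same kind of quantities for arbitrary distributions.

```python
from math import sqrt
from statistics import NormalDist, fmean

def prob_plot_points(values):
    """Return theoretical quantiles x, ordered values y, and the OLS fit of y on x."""
    y = sorted(values)                           # ordered feature values (Y-axis)
    n = len(y)
    # Assumed choice of p: Filliben's estimate of the uniform order-statistic medians.
    p = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    p[-1] = 0.5 ** (1 / n)
    p[0] = 1 - p[-1]
    x = [NormalDist().inv_cdf(q) for q in p]     # dist.ppf(p) for the standard normal
    # Ordinary least squares of y on x: this is the "red line".
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    r = sxy / sqrt(sxx * syy)                    # closeness of the points to a straight line
    return x, y, slope, intercept, r

# Hypothetical samples: one close to normal (mean 10, sd 2), one strongly right-skewed.
symmetric = [NormalDist(10, 2).inv_cdf(k / 101) for k in range(1, 101)]
right_skewed = [k ** 3 for k in range(1, 101)]

*_, r_sym = prob_plot_points(symmetric)
*_, r_skw = prob_plot_points(right_skewed)
```

For the near-normal sample the points fall almost exactly on the fitted line, with slope and intercept close to the sample's scale and location, so r_sym is close to 1. The skewed sample bends away from the line and yields a lower r_skw.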

Parameters:
  • dist (str or stats.distributions instance) – Distribution or distribution function name. The default is norm for a normal probability plot. Objects that look enough like a stats.distributions instance (i.e. they have a ppf method) are also accepted.

  • cols (sequence (lists, tuples, NumPy arrays or Pandas Base Index) or None) – column name(s) to fit the distribution to. If None, the peek method is invoked and cols are those having abs(skew score) > skew threshold.