Uni-variate Analysis
- class zaps.eda._uni_analysis.UniStat(df, col_drop=None, card_thresh=10, rare_thresh=0.05, skw_thresh=1, figsize=None, n_rows=None, n_cols=None, silent=False, hide_p_bar=False, theme='darkorange', color='lightblue')[source]
Bases: PlotMixin
Calculate and visualize univariate statistics for all features, identifying feature problems (e.g. missing values, skew, rare categories, …)
- Parameters:
df (pandas dataframe) – data source
col_drop (sequence (lists, tuples, NumPy arrays or Pandas Base Index) or None) – name(s) of the column(s) to exclude when analysing duplicates
card_thresh (int) – threshold for considering categorical feature to be of high cardinality
rare_thresh (float) – threshold below which categorical levels are considered to be rare (e.g.: 1% or 5%). If 0 then all levels are considered during the analysis.
skw_thresh (int) – threshold for highlighting and plotting skewed numeric distributions, assuming a normal theoretical distribution. A feature is considered skewed when abs(skew score) > skw_thresh.
figsize (tuple or None) – dimensions of matplotlib figure (width, height)
n_rows (int) – number of rows in matplotlib subplot figure
n_cols (int) – number of columns in matplotlib subplot figure
silent (Bool) – controls whether user input is solicited to continue during iterative plotting. If True, plotting proceeds without user interaction.
hide_p_bar (Bool) – triggers hiding progress bar (tqdm module); Default False
theme (str) – adjust axis and title colors as desired
color (str) – adjust color of Plotly Bar as desired
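To make the skw_thresh rule concrete, the following minimal sketch flags a numeric feature when abs(skew score) > skw_thresh. It is plain Python, not the library's implementation: the helper names sample_skewness and is_skewed are hypothetical, and the moment-based skewness estimator is an assumption, since the estimator used by UniStat is not stated here.

```python
def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (assumed estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


def is_skewed(values, skw_thresh=1):
    """Apply the documented rule: abs(skew score) > skw_thresh."""
    return abs(sample_skewness(values)) > skw_thresh


symmetric = [1, 2, 3, 4, 5]               # skewness is exactly 0
right_tailed = [1, 1, 1, 1, 2, 2, 3, 50]  # one large value stretches the right tail

print(is_skewed(symmetric))     # False
print(is_skewed(right_tailed))  # True
```

Raising skw_thresh makes the rule stricter, so fewer features are highlighted and plotted.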
Note
Data types are inferred automatically; however, to get optimal separation of categorical and numeric cols, it is better to ensure that the correct data types are applied before using this class.
- peek(disp_res=True)[source]
Calculate univariate statistics for Numeric and Categorical features while identifying the following:
Proportion of missing data
Highly skewed features, assuming normal distribution
High cardinality categorical features
Proportion of rare categorical levels
- Parameters:
disp_res (Bool) – triggers displaying summary results
- Variables:
z_summary (Pandas DataFrame) – info about the dataframe
z_miss_data (Pandas Series) – Proportion of missing data
z_hc_data (Pandas Series) – Count of categories/levels of high cardinality categorical columns
z_rare_cat (Pandas DataFrame) – Count and proportion of rare categories/levels
z_univ_stat_df (Pandas DataFrame) – univariate statistics
- Returns:
num_cols (Pandas Index) – Numeric column names
cat_cols (Pandas Index) – Categorical column names
dup_df (Pandas DataFrame) – duplicate rows
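The missing-data, cardinality, and rare-level checks listed for peek can be illustrated with a small dependency-free sketch. This is not the UniStat implementation: the helper names are hypothetical, the data is a toy example, and two details are assumptions (high cardinality means strictly more levels than card_thresh, and level proportions are computed over non-missing values). Only the rule that a level is rare when its proportion falls below rare_thresh comes from the parameter description.

```python
from collections import Counter

# Toy column-oriented data; None marks a missing value.
data = {
    "city": ["A"] * 45 + ["B"] * 3 + ["C"] * 1 + [None],
    "user_id": [f"u{i}" for i in range(50)],
}


def missing_proportion(values):
    """Share of entries that are missing."""
    return sum(v is None for v in values) / len(values)


def level_counts(values):
    """Count each categorical level, ignoring missing entries."""
    return Counter(v for v in values if v is not None)


def is_high_cardinality(values, card_thresh=10):
    # Assumption: "high" means strictly more distinct levels than card_thresh.
    return len(level_counts(values)) > card_thresh


def rare_levels(values, rare_thresh=0.05):
    """Levels whose proportion is below rare_thresh, with their proportions."""
    counts = level_counts(values)
    total = sum(counts.values())  # assumption: proportions over non-missing values
    return {lvl: c / total for lvl, c in counts.items() if c / total < rare_thresh}


for name, values in data.items():
    print(name, missing_proportion(values), is_high_cardinality(values),
          sorted(rare_levels(values)))
```

In this toy data, city has one rare level ("C") while user_id is flagged as high cardinality, which is the typical signature of an identifier column.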
- stats_plot(width=None, height=None)[source]
Interactive plots visualizing:
Proportion of missing data
High cardinality categorical features
Proportion of rare categorical levels
- Parameters:
width (int, default None) – The figure width in pixels
height (int, default None) – The figure height in pixels
- skew_plot(dist='norm', cols=None)[source]
Generate Probability Plots for highly skewed features given a specific distribution (default Normal).
Probability plots compare the unscaled, ordered feature values (Y-axis) with the scaled theoretical expected quantiles of the chosen distribution (X-axis). For the default normal distribution, these quantiles are Z-scores of the standard normal distribution, computed as dist.ppf(p), where p is the order statistics of the uniform distribution and ppf is the inverse CDF. In other words, the x values are generated from the uniform p and the normal mu and sigma.
If the generated (x) and actual (y) values both come from the same distribution, all points (blue dots) should form a straight line in the plot and lie on the red line.
Note that the red line is an ordinary least squares (OLS) fit of the ordered values (y) on the theoretical quantiles (x): OLS(dist.ppf(p), sort(x)), where sort(x) denotes the sorted feature values. This best-fit line provides insight into whether the feature can be characterized by the distribution: if the two distributions are linearly related but not similar, the blue dots will lie approximately on a straight line, but not necessarily on the line y = x.
The degree of similarity between the two distributions can be assessed this way, which guides the choice of methods for further data preprocessing.
- Parameters:
dist (str or stats.distributions instance) – Distribution or distribution function name. The default is norm for a normal probability plot. Objects that look enough like a stats.distributions instance (i.e. they have a ppf method) are also accepted.
cols (sequence (lists, tuples, NumPy arrays or Pandas Base Index) or None) – column(s) name(s) to fit distribution. If None, the peek method is invoked and cols are those having abs(skew score) > skw_thresh.
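The probability-plot construction described for skew_plot can be sketched without plotting, using only the standard library. This is an illustration under stated assumptions, not the method's actual code: the function names are hypothetical, the plotting positions p = (i + 0.5) / n are a simple stand-in for the uniform order statistics (the library may use a different estimate), and the correlation coefficient r is added here only as a numeric summary of how close the points are to a line.

```python
import math
from statistics import NormalDist


def probability_plot_points(values):
    """Return (theoretical normal quantiles, ordered sample values)."""
    y = sorted(values)
    n = len(y)
    p = [(i + 0.5) / n for i in range(n)]        # assumed plotting positions
    x = [NormalDist().inv_cdf(pi) for pi in p]   # standard normal ppf(p)
    return x, y


def ols_line(x, y):
    """Least-squares fit of y on x: slope, intercept, and correlation r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)


positions = [(i + 0.5) / 50 for i in range(50)]
# Normal-shaped sample with mu=10, sigma=2: linearly related to the standard normal.
normal_like = [NormalDist(10, 2).inv_cdf(p) for p in positions]
# Exponential-shaped sample: strongly right-skewed.
skewed = [-math.log(1 - p) for p in positions]

slope_n, intercept_n, r_n = ols_line(*probability_plot_points(normal_like))
slope_s, intercept_s, r_s = ols_line(*probability_plot_points(skewed))
print(slope_n, intercept_n, r_n)  # fitted slope tracks sigma, intercept tracks mu
print(r_s)                        # noticeably lower: points bend away from the line
```

The normal-shaped sample lies on a straight line whose slope and intercept recover sigma and mu rather than y = x, while the skewed sample curves away from its fitted line, which is the visual cue the plot is meant to expose.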