Numeric Analysis

class zaps.eda._num_analysis.NumAna(df, cols, target, degree=1, fit=None, method='newton', lowess_frac=0.6666666666666666, it=3, delta=0.0, nans_d=None, frac=None, random_state=45, figsize=None, n_rows=None, n_cols=None, silent=False, hide_p_bar=False, theme='darkorange', **kwargs)[source]

Bases: PlotMixin

Collection of numeric feature analyses, including:

  • Regression

  • Correlation

  • Visualizations

Parameters:
  • df (pandas dataframe) – data source

  • cols (sequence (lists, tuples, NumPy arrays or Pandas Base Index)) – column names of numeric features

  • target (str) – target column name; a categorical target will be encoded as integers.

  • degree (int) – If degree is greater than 1, fit is ignored and polynomial regression is applied to the nth degree.

  • fit (str or None) – type of regression to fit. One of ols, logit or lws. If ols, Ordinary Least Squares regression is applied. If logit, logistic regression is fitted. If lws, Locally Weighted Scatterplot Smoothing (non-parametric) regression is applied. If None, either Ordinary Least Squares or logistic regression is chosen based on the type of target.

  • method (str) – Only applicable when fit = logit. The following solvers from scipy.optimize are accepted:

    • newton for Newton-Raphson

    • nm for Nelder-Mead

    • bfgs for Broyden-Fletcher-Goldfarb-Shanno (BFGS)

    • lbfgs for limited-memory BFGS with optional box constraints

    • powell for modified Powell’s method

    • cg for conjugate gradient

    • ncg for Newton-conjugate gradient

    • basinhopping for global basin-hopping solver

    • minimize for generic wrapper of scipy minimize (BFGS by default)

    Note

    Each solver has several optional unique arguments. See **kwargs parameter below (or scipy.optimize) for the available arguments that each solver supports.

  • lowess_frac (float) – Between 0 and 1. The fraction of the data used when estimating each y-value for lowess fit.

  • it (int) – The number of residual-based reweightings to perform for lowess fit.

  • delta (float) – Distance within which to use linear-interpolation instead of weighted regression for lowess fit. ‘delta’ can be used to save computations.

    For each x_i, regressions are skipped for points closer than delta. The next regression is fit for the farthest point within delta of x_i and all points in between are estimated by linearly interpolating between the two regression fits.

    Judicious choice of delta can cut computation time considerably for large data (N > 5000). A good choice is delta = 0.01 * range(x).

  • nans_d (dict or None) – dictionary where keys are column names and values are the replacements for missing (NaN) values. Use it to impute several numeric columns at once, each with its own value.

  • frac (float or None) – fraction of dataframe to use as a sample for analysis:

    • 0 < frac < 1 returns a random sample with size frac.

    • frac = 1 returns shuffled dataframe.

    • frac > 1 up-samples the dataframe, sampling the same row more than once.

  • random_state (int) – for reproducibility, controls the random number generator for frac parameter.

  • figsize (tuple or None) – dimensions of matplotlib figure (width, height)

  • n_rows (int) – number of rows in matplotlib subplot figure

  • n_cols (int) – number of columns in matplotlib subplot figure

  • silent (Bool) – controls whether user input is solicited for continuation during iterative plotting. If True, plotting proceeds without user interaction.

  • hide_p_bar (Bool) – triggers hiding progress bar (tqdm module); Default ‘False’

  • theme (str) – adjust axis and title colors as desired
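The rule of thumb quoted for the delta parameter above (delta = 0.01 * range(x)) can be computed before the class is constructed. The following is a minimal plain-Python sketch, not part of zaps; the helper name suggest_delta is hypothetical.

```python
def suggest_delta(x, fraction=0.01):
    """Return fraction * (max(x) - min(x)), the rule of thumb quoted above.

    `suggest_delta` is a hypothetical helper, not part of zaps.
    """
    values = [float(v) for v in x]
    if not values:
        raise ValueError("x must contain at least one value")
    return fraction * (max(values) - min(values))


# Feature values spanning 0..200 give delta = 0.01 * 200
delta = suggest_delta([0, 25, 120, 200])
```

The resulting value could then be passed as delta together with fit='lws' for large data.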

Keyword Arguments:

warn_convergence (bool, optional) –

If True, checks the model for the converged flag. If the converged flag is False, a ConvergenceWarning is issued. All other kwargs are passed to the chosen solver.

newton
tol: float

Relative error in params acceptable for convergence.

nm – Nelder Mead
xtol: float

Relative error in params acceptable for convergence

ftol: float

Relative error in loglike(params) acceptable for convergence

maxfun: int

Maximum number of function evaluations to make.

bfgs
gtol: float

Stop when norm of gradient is less than gtol.

norm: float

Order of norm (np.inf is max, -np.inf is min)

epsilon

If fprime is approximated, use this value for the step size. Only relevant if LikelihoodModel.score is None.

lbfgs
m: int

This many terms are used for the Hessian approximation.

factr: float

A stop condition that is a variant of relative error.

pgtol: float

A stop condition that uses the projected gradient.

epsilon

If fprime is approximated, use this value for the step size. Only relevant if LikelihoodModel.score is None.

maxfun: int

Maximum number of function evaluations to make.

bounds: sequence

(min, max) pairs for each element in x, defining the bounds on that parameter. Use None for one of min or max when there is no bound in that direction.

cg
gtol: float

Stop when norm of gradient is less than gtol.

norm: float

Order of norm (np.inf is max, -np.inf is min)

epsilon: float

If fprime is approximated, use this value for the step size. Can be scalar or vector. Only relevant if LikelihoodModel.score is None.

ncg
fhess_p: callable f’(x,*args)

Function which computes the Hessian of f times an arbitrary vector, p. Should only be supplied if LikelihoodModel.hessian is None.

avextol: float

Stop when the average relative error in the minimizer falls below this amount.

epsilon: float or ndarray

If fhess is approximated, use this value for the step size. Only relevant if LikelihoodModel.hessian is None.

powell
xtol: float

Line-search error tolerance

ftol: float

Relative error in loglike(params) acceptable for convergence.

maxfun: int

Maximum number of function evaluations to make.

start_direc: ndarray

Initial direction set.

basinhopping
niter: int

The number of basin hopping iterations.

niter_success: int

Stop the run if the global minimum candidate remains the same for this number of iterations.

T: float

The “temperature” parameter for the accept or reject criterion. Higher “temperatures” mean that larger jumps in function value will be accepted. For best results T should be comparable to the separation (in function value) between local minima.

stepsize: float

Initial step size for use in the random displacement.

interval: int

The interval for how often to update the stepsize.

minimizer: dict

Extra keyword arguments to be passed to the minimizer scipy.optimize.minimize(), for example ‘method’ - the minimization method (e.g. ‘L-BFGS-B’), or ‘tol’ - the tolerance for termination. Other arguments are mapped from explicit argument of fit:

  • args <- fargs

  • jac <- score

  • hess <- hess

minimize
min_method: str, optional

Name of minimization method to use. Any method specific arguments can be passed directly. For a list of methods and their arguments, see documentation of scipy.optimize.minimize. If no method is specified, then BFGS is used.
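Because each solver accepts a different set of keyword arguments, it can help to check a set of kwargs against the lists above before passing them on. The sketch below only restates those lists as a lookup table; SOLVER_KWARGS and check_solver_kwargs are hypothetical names and are not part of zaps, statsmodels or scipy.

```python
# Solver-specific keyword arguments as listed in this section.
# SOLVER_KWARGS and check_solver_kwargs are hypothetical names used for
# illustration only.
SOLVER_KWARGS = {
    "newton": {"tol"},
    "nm": {"xtol", "ftol", "maxfun"},
    "bfgs": {"gtol", "norm", "epsilon"},
    "lbfgs": {"m", "factr", "pgtol", "epsilon", "maxfun", "bounds"},
    "cg": {"gtol", "norm", "epsilon"},
    "ncg": {"fhess_p", "avextol", "epsilon"},
    "powell": {"xtol", "ftol", "maxfun", "start_direc"},
    "basinhopping": {"niter", "niter_success", "T", "stepsize",
                     "interval", "minimizer"},
    "minimize": {"min_method"},
}


def check_solver_kwargs(method, **kwargs):
    """Return the sorted kwargs names that are not listed for `method`."""
    if method not in SOLVER_KWARGS:
        raise ValueError(f"unknown solver: {method!r}")
    if method == "minimize":
        # Method-specific arguments can be passed directly, so nothing
        # can be ruled out from the table alone.
        return []
    allowed = SOLVER_KWARGS[method] | {"warn_convergence"}
    return sorted(set(kwargs) - allowed)


# xtol is listed for nm and powell, not for lbfgs
unknown = check_solver_kwargs("lbfgs", m=12, pgtol=1e-6, xtol=1e-4)
```

A non-empty result only means the name does not appear in the lists above; how the underlying solver reacts to it is not covered here.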

Variables:
  • z_inf_out (numpy array) – excluded columns having inf values, if any.

  • z_nans (numpy array) – numeric column names where imputation of nan values took place.

  • z_df (pandas dataframe) – preprocessed dataframe that was used internally

corr(disp_corr='pearson', quant=0.75, thresh=None, alpha=None, plot=False)[source]

Calculate Pearson (linear) and Spearman (monotonic) correlation and generate heat-map visualizations.

Visualizations (Plotly module):
  • v1: Feature cols vs Target correlation either overall or for significant results only (P-value <= 0.05).

  • v2: Feature correlation for a selection of highly correlated features with target.

Note

To check correlation for categorical features, encode categories as integers first (e.g. LabelEncoder, OrdinalEncoder, …).

Parameters:
  • disp_corr (str) – One of [‘pearson’, ‘spearman’], correlation method to be used in:

    • calculating correlation between features

    • sorting v1

  • quant (float) – proportion of features to be used in calculating feature correlation and display in v2; default is top 25% (> q3) of features that are highly correlated with target

  • thresh (float or None) – minimum correlation strength between features to display in v2; if not None, only display correlation >= thresh

  • alpha (float or None) – Significance level for rejecting the null hypothesis (e.g. 0.05). If not None, v1 displays only features with significant results (correlation coefficient p-value <= alpha).

  • plot (Bool) – whether to run visualizations or not

Returns:

  • corr_df (pandas dataframe) – correlation coefficient and p-value for each feature vs target

  • feat_corr_df (pandas dataframe) – correlation coefficient of highly correlated features, only quant features are included
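The difference between the two coefficients (Pearson measures linear association, Spearman measures monotonic association) can be seen on a small example. The following is a plain-Python sketch written for illustration; it is not how zaps computes correlation, it does not compute p-values, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]  # monotonic in x, but not linear
r_pearson = pearson(x, y)    # below 1: the relation is not linear
r_spearman = spearman(x, y)  # approximately 1: the relation is monotonic
```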

fit_models()[source]

Univariate model fitting:

  • Polynomial regression

  • Ordinary Least Squares regression

  • Locally Weighted Scatterplot Smoothing non-parametric regression

  • Logistic regression

Variables:
  • z_fit_results (dict) – where keys are cols and values are fitted regression model(s).

  • z_fit_out (numpy array) – excluded cols, if any, causing regression fit errors.
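Which of these models is fitted follows from the degree and fit parameters of the class. The selection rules stated there can be sketched as below. The function choose_model is hypothetical, and the mapping "categorical target gives logistic regression" for fit=None is an assumption; the parameter description only says the choice depends on the type of target.

```python
def choose_model(degree=1, fit=None, target_is_categorical=False):
    """Sketch of the documented selection rules (hypothetical helper)."""
    if degree > 1:
        # Documented: fit is ignored when degree is greater than 1
        return "polynomial"
    if fit is not None:
        if fit not in ("ols", "logit", "lws"):
            raise ValueError("fit must be one of 'ols', 'logit', 'lws' or None")
        return fit
    # Assumption: a categorical target selects logistic regression
    return "logit" if target_is_categorical else "ols"


selected = choose_model(degree=3, fit="lws")  # fit is ignored here
```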

vis_fit(olrs_idx=None, olrs_mapping=None, x_jitter=None, y_jitter=None, scatter_kws={'alpha': 0.3}, tc_color='orange', olrs_color='red', nbins='auto', axis='x', tight=None, x_ax_rotation=None)[source]

Scatter plot visualization of univariate regression fits (Seaborn module).

Parameters:
  • olrs_idx (pandas index, list or None) – Index of outlier data points

  • olrs_mapping (dict or None) – column names as keys and outlier data points indices as values (pandas index or list) to highlight during plotting. Outliers from each column are plotted against their respective plot

  • {x, y}_jitter (float or None) – adds random noise to the observations on the {x, y} axis. Applicable to the main scatter plot of x and y.

  • scatter_kws (dict or None) – Additional keyword arguments passed to plt.scatter and plt.plot. Applied to the main scatter plot of x and y.

  • tc_color (str) – Color of the OLS trendline or sigmoid/LOWESS curve

  • olrs_color (str) – Color of outlier data points

  • nbins (int or ‘auto’) – For plot decoration, maximum number of axis intervals; one less than the maximum number of ticks. If the string ‘auto’, the number of bins will be automatically determined based on the length of the axis.

  • axis (str) – For plot decoration, one of [‘both’, ‘x’, ‘y’], axis on which to apply nbins.

  • tight (bool or None) – For plot decoration, controls expansion of axis limits, if True axis limits are only expanded using the margins; This does not set the margins to zero. If False, further expand the axis limits using the axis major locator.

  • x_ax_rotation (int or None) – For plot decoration, set degree of x_ticks rotation.

vis_ols_fit()[source]

Histograms and Scatter plots for Assessing OLS residuals’ normality and homoscedasticity assumptions

vis_multi(col, olrs_idx=None, color=None, size=None, size_max=15, symbol=None, symbol_sequence=None, symbol_map=None, hover_name=None, hover_data=None, custom_data=None, text=None, facet_row=None, facet_col=None, facet_col_wrap=0, facet_row_spacing=None, facet_col_spacing=None, error_x=None, error_x_minus=None, error_y=None, error_y_minus=None, labels=None, color_discrete_sequence=None, color_continuous_scale=None, opacity=None, marginal_x=None, marginal_y=None, category_orders=None, trendline=None, trendline_options=None, trendline_color_override=None, trendline_scope='trace', log_x=False, log_y=False, range_x=None, range_y=None, title=None, template=None, width=None, height=None, theme='darkorange')[source]

Interactive multivariate scatter plot visualization and trend analysis (Plotly module).

Parameters:
  • col (str) – Name of column that goes to x axis

  • olrs_idx (pandas index, list or None) – Index of outlier data points

  • color (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign color to marks.

  • size (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign mark sizes.

  • size_max (int (default 15)) – Set the maximum mark size when using size.

  • symbol (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign symbols to marks.

  • symbol_sequence (list of str) – Strings should define valid plotly.js symbols. When symbol is set, values in that column are assigned symbols by cycling through symbol_sequence in the order described in category_orders, unless the value of symbol is a key in symbol_map.

  • symbol_map (dict with str keys and str values (default {})) – String values should define plotly.js symbols. Used to override symbol_sequence to assign specific symbols to marks corresponding with specific values. Keys in symbol_map should be values in the column denoted by symbol. Alternatively, if the values of symbol are valid symbol names, the string ‘identity’ may be passed to cause them to be used directly.

  • hover_name (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like appear in bold in the hover tooltip.

  • hover_data (str, or list of str or int, or Series or array-like, or dict) – Either a name or list of names of columns in data_frame, or pandas Series, or array_like objects, or a dict with column names as keys and values that are True (for default formatting), False (to remove this column from hover information), a formatting string (for example ‘:.3f’ or ‘|%a’), list-like data to appear in the hover tooltip, or tuples with a bool or formatting string as first element and list-like data to appear in hover as second element. Values from these columns appear as extra data in the hover tooltip.

  • custom_data (str, or list of str or int, or Series or array-like) – Either name or list of names of columns in data_frame, or pandas Series, or array_like objects. Values from these columns are extra data, to be used in widgets or Dash callbacks for example. This data is not user-visible but is included in events emitted by the figure (lasso selection etc.).

  • text (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like appear in the figure as text labels.

  • facet_row (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to facetted subplots in the vertical direction.

  • facet_col (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to facetted subplots in the horizontal direction.

  • facet_col_wrap (int) – Maximum number of facet columns. Wraps the column variable at this width, so that the column facets span multiple rows. Ignored if 0, and forced to 0 if facet_row or a marginal is set.

  • facet_row_spacing (float between 0 and 1) – Spacing between facet rows, in paper units. Default is 0.03 or 0.07 when facet_col_wrap is used.

  • facet_col_spacing (float between 0 and 1) – Spacing between facet columns, in paper units. Default is 0.02.

  • error_x (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size x-axis error bars. If error_x_minus is None, error bars will be symmetrical, otherwise error_x is used for the positive direction only.

  • error_x_minus (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size x-axis error bars in the negative direction. Ignored if error_x is None.

  • error_y (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size y-axis error bars. If error_y_minus is None, error bars will be symmetrical, otherwise error_y is used for the positive direction only.

  • error_y_minus (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size y-axis error bars in the negative direction. Ignored if error_y is None.

  • labels (dict with str keys and str values (default {})) – By default, column names are used in the figure for axis titles, legend entries and hovers. This parameter allows this to be overridden. The keys of this dict should correspond to column names, and the values should correspond to the desired label to be displayed.

  • color_discrete_sequence (list of str) – Strings should define valid CSS-colors. When color is set and the values in the corresponding column are not numeric, values in that column are assigned colors by cycling through color_discrete_sequence in the order described in category_orders, unless the value of color is a key in color_discrete_map. Various useful color sequences are available in the plotly.express.colors submodules, specifically plotly.express.colors.qualitative.

  • color_continuous_scale (list of str) – Strings should define valid CSS-colors. This list is used to build a continuous color scale when the column denoted by color contains numeric data. Various useful color scales are available in the plotly.express.colors submodules, specifically plotly.express.colors.sequential, plotly.express.colors.diverging and plotly.express.colors.cyclical.

  • opacity (float) – Value between 0 and 1. Sets the opacity for markers.

  • marginal_x (str) – One of ‘rug’, ‘box’, ‘violin’, or ‘histogram’. If set, a horizontal subplot is drawn above the main plot, visualizing the x-distribution.

  • marginal_y (str) – One of ‘rug’, ‘box’, ‘violin’, or ‘histogram’. If set, a vertical subplot is drawn to the right of the main plot, visualizing the y-distribution.

  • category_orders (dict with str keys and list of str values (default {})) – By default, in Python 3.6+, the order of categorical values in axes, legends and facets depends on the order in which these values are first encountered in data_frame (and no order is guaranteed by default in Python below 3.6). This parameter is used to force a specific ordering of values per column. The keys of this dict should correspond to column names, and the values should be lists of strings corresponding to the specific display order desired.

  • trendline (str or None) – One of ‘ols’, ‘lowess’, ‘rolling’, ‘expanding’ or ‘ewm’. If ‘ols’, an Ordinary Least Squares regression line will be drawn for each discrete-color/symbol group. If ‘lowess’, a Locally Weighted Scatterplot Smoothing line will be drawn for each discrete-color/symbol group. If ‘rolling’, a Rolling (e.g. rolling average, rolling median) line will be drawn for each discrete-color/symbol group. If ‘expanding’, an Expanding (e.g. expanding average, expanding sum) line will be drawn for each discrete-color/symbol group. If ‘ewm’, an Exponentially Weighted Moment (e.g. exponentially-weighted moving average) line will be drawn for each discrete-color/symbol group. See the docstrings for the functions in plotly.express.trendline_functions for more details on these functions and how to configure them with the trendline_options argument.

  • trendline_options (dict or None) – Options passed as the first argument to the function from plotly.express.trendline_functions named in the trendline argument. Valid keys for the trendline_options dict are as follows:

    ols
    add_constant: bool, default ‘True’

    if False, the trendline passes through the origin but if True a y-intercept is fitted.

    log_x and log_y: bool, default ‘False’

    if True the OLS is computed with respect to the base 10 logarithm of the input. Note that this means no zeros can be present in the input.

    lowess
    frac: float, default ‘0.6666666’

    Between 0 and 1. The fraction of the data used when estimating each y-value.

    rolling
    function: function, str, list or dict, default ‘mean’

    Function to use for aggregating the data. If a function, must either work when passed a Series/Dataframe or when passed to Series/Dataframe.apply. Accepted combinations are:

    • function

    • string function name

    • list of functions and/or function names, e.g. [np.sum, ‘mean’]

    • dict of axis labels -> functions, function names or list of such.

    function_args: dict

    function arguments. For examples please refer to ‘win_type’ argument documentation below.

    window: int, timedelta, str, offset, or BaseIndexer subclass
    • Size of the moving window.

    • If an integer, the fixed number of observations used for each window.

    • If a timedelta, str, or offset, the time period of each window. Each window will be variable-sized, based on the observations included in the time period. This is only valid for datetimelike indexes.

    • If a BaseIndexer subclass, the window boundaries based on the defined get_window_bounds method. Additional rolling keyword arguments, namely min_periods, center, closed and step will be passed to get_window_bounds.

    min_periods: int, default None

    Minimum number of observations in window required to have a value, otherwise, result is np.nan.

    center: bool, default False
    • If False, set the window labels as the right edge of the window index.

    • If True, set the window labels as the center of the window index.

    win_type: str, default None
    • If None, all points are evenly weighted.

    • If a string, it must be a valid scipy.signal window function.

    • e.g.: [barthann, bartlett, blackman, blackmanharris, bohman, boxcar, chebwin, cosine, exponential, flattop, gaussian, general_gaussian, hamming, hann, kaiser, nuttall, parzen, triang, tukey]

    • Certain Scipy window types require additional parameters to be passed in the aggregation function. The additional parameters must match the keywords specified in the Scipy window type method signature.

    • window and rolling are pandas subclasses utilizing window functions from scipy module.

    • If win_type is not None a window subclass is returned, otherwise a rolling subclass is returned. This affects the way function argument behaves, see examples below.

    on: str, optional
    • For a DataFrame, a column label or Index level on which to calculate the rolling window, rather than the DataFrame’s index.

    • Provided integer column is ignored and excluded from result since an integer index is not used to calculate the rolling window.

    closed: str, default None
    • If 'right', the first point in the window is excluded from calculations.

    • If 'left', the last point in the window is excluded from calculations.

    • If 'both', no points in the window are excluded from calculations.

    • If 'neither', the first and last points in the window are excluded from calculations.

    • Default None ('right').

    step: int, default None

    Evaluate the window at every step result, equivalent to slicing as [::step]. window must be an integer. Using a step argument other than None or 1 will produce a result with a different shape than the input.

    expanding
    function and function_args

    same as in rolling

    min_periods: int, default 1

    Minimum number of observations in window required to have a value; otherwise, result is np.nan.

    ewm
    function and function_args

    same as in rolling

    com: float, optional

    Specify decay in terms of center of mass

    span: float, optional

    Specify decay in terms of span

    halflife: float, str, timedelta, optional
    • Specify decay in terms of half-life

    • If times is specified, a timedelta convertible unit over which an observation decays to half its value. Only applicable to mean(), and halflife value will not apply to the other functions.

    alpha: float, optional

    Specify smoothing factor

    min_periods: int, default 0

    Minimum number of observations in window required to have a value; otherwise, result is np.nan.

    adjust: bool, default True

    Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings (viewing EWMA as a moving average).

    ignore_na: bool, default False

    Ignore missing values when calculating weights.

    times: np.ndarray, Series, default None
    • Only applicable to mean().

    • Times corresponding to the observations. Must be monotonically increasing and datetime64[ns] dtype.

    • If 1-D array like, a sequence with the same shape as the observations.

  • trendline_color_override (str or None) – Valid CSS color. If provided, and if trendline is set, all trendlines will be drawn in this color rather than in the same color as the traces from which they draw their inputs.

  • trendline_scope (str (one of ‘trace’ or ‘overall’, default ‘trace’)) – If ‘trace’, then one trendline is drawn per trace (i.e. per color, symbol, facet, animation frame etc) and if ‘overall’ then one trendline is computed for the entire dataset, and replicated across all facets.

  • log_x (boolean (default False)) – If True, the x-axis is log-scaled in cartesian coordinates.

  • log_y (boolean (default False)) – If True, the y-axis is log-scaled in cartesian coordinates.

  • range_x (list of two numbers) – If provided, overrides auto-scaling on the x-axis in cartesian coordinates.

  • range_y (list of two numbers) – If provided, overrides auto-scaling on the y-axis in cartesian coordinates.

  • title (str) – The figure title.

  • template (str or dict or plotly.graph_objects.layout.Template instance) – The figure template name (must be a key in plotly.io.templates) or definition.

  • width (int (default None)) – The figure width in pixels.

  • height (int (default None)) – The figure height in pixels.

  • theme (str) – adjust axis and title colors as desired

Variables:
  • z_plotly_ols_fit (pandas dataframe) – fitted Ordinary Least Squares model(s)

  • z_plotly_fit (pandas dataframe) – fitted Logistic or Polynomial model(s)

  • z_plotly_fit_out (pandas dataframe) – groups where model fitting fails; only applicable to Logistic or Polynomial fits when a facet is assigned
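The closed option of the rolling trendline, described in the parameter list above, decides which end points of each window take part in the calculation. For an integer window this can be pictured with the plain-Python sketch below. The helper window_positions is hypothetical and only mimics the listed rules; it is not pandas or zaps code.

```python
def window_positions(i, window, closed="right"):
    """Row positions used for the window that ends at position i.

    Hypothetical helper mimicking the `closed` rules for an integer
    window; the window spans positions i - window .. i.
    """
    if closed not in ("right", "left", "both", "neither"):
        raise ValueError("closed must be 'right', 'left', 'both' or 'neither'")
    first, last = i - window, i
    if closed in ("right", "neither"):
        first += 1  # first point excluded
    if closed in ("left", "neither"):
        last -= 1   # last point excluded
    # Positions before the start of the data do not exist
    return [j for j in range(first, last + 1) if j >= 0]


right = window_positions(4, 2, "right")      # [3, 4]
left = window_positions(4, 2, "left")        # [2, 3]
both = window_positions(4, 2, "both")        # [2, 3, 4]
neither = window_positions(4, 2, "neither")  # [3]
```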

Rolling Examples

>>> import numpy as np
>>> # Custom function
>>> # pandas equivalent: series.rolling(window, win_type=None).aggregate(**opts)
>>> # trendline_options - the lambda computes the Euclidean norm of each window
>>> tl_opts = dict(
...     function='aggregate',
...     function_args=dict(func=lambda x: np.sqrt(x.dot(x))),
...     win_type=None)
>>> # Rolling object
>>> # pandas equivalent: series.rolling(window, win_type=None).sum(**opts)
>>> tl_opts = dict(
...     function='sum',
...     function_args=None,
...     win_type=None)
>>> # Window object
>>> # pandas equivalent: series.rolling(window, win_type='gaussian').sum(**opts)
>>> # trendline_options - 'std' is a parameter required by the
>>> # 'gaussian' window function, not by the aggregation function 'sum'
>>> tl_opts = dict(
...     function='sum',
...     function_args=dict(std=2),
...     win_type='gaussian')

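For the ewm trendline options, com, span, halflife and alpha are alternative ways of specifying the same decay. The conversions below follow the formulas given in the pandas ewm documentation; the helper ewm_alpha is a hypothetical name used for illustration, and it handles numeric halflife values only.

```python
import math


def ewm_alpha(com=None, span=None, halflife=None, alpha=None):
    """Convert one decay specification to the smoothing factor alpha.

    Hypothetical helper; exactly one argument must be given.
    """
    given = [v for v in (com, span, halflife, alpha) if v is not None]
    if len(given) != 1:
        raise ValueError("specify exactly one of com, span, halflife, alpha")
    if com is not None:
        return 1.0 / (1.0 + com)         # center of mass, com >= 0
    if span is not None:
        return 2.0 / (span + 1.0)        # span >= 1
    if halflife is not None:
        return 1.0 - math.exp(-math.log(2.0) / halflife)  # halflife > 0
    return float(alpha)                  # 0 < alpha <= 1


a_span = ewm_alpha(span=9)          # same decay as com=4
a_com = ewm_alpha(com=4)
a_halflife = ewm_alpha(halflife=1)  # weight halves after one observation
```

Under these formulas, trendline_options such as dict(span=9) and dict(com=4) describe the same smoothing.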
vis_multi_d(x, y, z=None, olrs_idx=None, color=None, symbol=None, symbol_sequence=None, symbol_map=None, size=None, size_max=20, text=None, hover_name=None, hover_data=None, custom_data=None, error_x=None, error_x_minus=None, error_y=None, error_y_minus=None, error_z=None, error_z_minus=None, animation_frame=None, animation_group=None, category_orders=None, labels=None, color_discrete_sequence=None, color_continuous_scale=None, opacity=None, log_x=False, log_y=False, log_z=False, range_x=None, range_y=None, range_z=None, title=None, template=None, width=None, height=None, theme='darkorange')[source]

Interactive 3D multivariate scatter plot visualization (Plotly module).

Parameters:
  • x (str) – Name of column that goes to x axis

  • y (str) – Name of column that goes to y axis

  • z (str or None) – Name of column that goes to z axis. If None, z-axis is the target variable

  • olrs_idx (pandas index, list or None) – Index of outlier data points

  • color (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign color to marks.

  • symbol (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign symbols to marks.

  • symbol_sequence (list of str) – Strings should define valid plotly.js symbols. When symbol is set, values in that column are assigned symbols by cycling through symbol_sequence in the order described in category_orders, unless the value of symbol is a key in symbol_map.

  • symbol_map (dict with str keys and str values (default {})) – String values should define plotly.js symbols. Used to override symbol_sequence to assign specific symbols to marks corresponding with specific values. Keys in symbol_map should be values in the column denoted by symbol. Alternatively, if the values of symbol are valid symbol names, the string ‘identity’ may be passed to cause them to be used directly.

  • size (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign mark sizes.

  • size_max (int (default 20)) – Set the maximum mark size when using size.

  • text (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like appear in the figure as text labels.

  • hover_name (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like appear in bold in the hover tooltip.

  • hover_data (str, or list of str or int, or Series or array-like, or dict) – Either a name or list of names of columns in data_frame, or pandas Series, or array_like objects, or a dict with column names as keys and values that are True (for default formatting), False (to remove this column from hover information), a formatting string (for example ‘:.3f’ or ‘|%a’), list-like data to appear in the hover tooltip, or tuples with a bool or formatting string as first element and list-like data to appear in hover as second element. Values from these columns appear as extra data in the hover tooltip.

  • custom_data (str, or list of str or int, or Series or array-like) – Either name or list of names of columns in data_frame, or pandas Series, or array_like objects. Values from these columns are extra data, to be used in widgets or Dash callbacks for example. This data is not user-visible but is included in events emitted by the figure (lasso selection etc.).

  • error_x (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size x-axis error bars. If error_x_minus is None, error bars will be symmetrical, otherwise error_x is used for the positive direction only.

  • error_x_minus (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size x-axis error bars in the negative direction. Ignored if error_x is None.

  • error_y (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size y-axis error bars. If error_y_minus is None, error bars will be symmetrical, otherwise error_y is used for the positive direction only.

  • error_y_minus (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size y-axis error bars in the negative direction. Ignored if error_y is None.

  • error_z (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size z-axis error bars. If error_z_minus is None, error bars will be symmetrical, otherwise error_z is used for the positive direction only.

  • error_z_minus (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to size z-axis error bars in the negative direction. Ignored if error_z is None.

  • animation_frame (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to assign marks to animation frames.

  • animation_group (str or int or Series or array-like) – Either a name of a column in data_frame, or a pandas Series or array_like object. Values from this column or array_like are used to provide object-constancy across animation frames: rows with matching animation_group will be treated as if they describe the same object in each frame.

  • category_orders (dict with str keys and list of str values (default {})) – By default, in Python 3.6+, the order of categorical values in axes, legends and facets depends on the order in which these values are first encountered in data_frame (and no order is guaranteed by default in Python below 3.6). This parameter is used to force a specific ordering of values per column. The keys of this dict should correspond to column names, and the values should be lists of strings corresponding to the specific display order desired.

  • labels (dict with str keys and str values (default {})) – By default, column names are used in the figure for axis titles, legend entries and hovers. This parameter allows this to be overridden. The keys of this dict should correspond to column names, and the values should correspond to the desired label to be displayed.

  • color_discrete_sequence (list of str) – Strings should define valid CSS-colors. When color is set and the values in the corresponding column are not numeric, values in that column are assigned colors by cycling through color_discrete_sequence in the order described in category_orders, unless the value of color is a key in color_discrete_map. Various useful color sequences are available in the plotly.express.colors submodules, specifically plotly.express.colors.qualitative.

  • color_continuous_scale (list of str) – Strings should define valid CSS-colors. This list is used to build a continuous color scale when the column denoted by color contains numeric data. Various useful color scales are available in the plotly.express.colors submodules, specifically plotly.express.colors.sequential, plotly.express.colors.diverging and plotly.express.colors.cyclical.

  • opacity (float or None) – Value between 0 and 1. Sets the opacity for markers.

  • log_x (boolean (default False)) – If True, the x-axis is log-scaled in cartesian coordinates.

  • log_y (boolean (default False)) – If True, the y-axis is log-scaled in cartesian coordinates.

  • log_z (boolean (default False)) – If True, the z-axis is log-scaled in cartesian coordinates.

  • range_x (list of two numbers) – If provided, overrides auto-scaling on the x-axis in cartesian coordinates.

  • range_y (list of two numbers) – If provided, overrides auto-scaling on the y-axis in cartesian coordinates.

  • range_z (list of two numbers) – If provided, overrides auto-scaling on the z-axis in cartesian coordinates.

  • title (str) – The figure title.

  • template (str or dict or plotly.graph_objects.layout.Template instance) – The figure template name (must be a key in plotly.io.templates) or definition.

  • width (int (default None)) – The figure width in pixels.

  • height (int (default None)) – The figure height in pixels.

  • theme (str) – Color applied to the axes and titles, e.g. ‘darkorange’.
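Several of the parameters above take structured values (dicts, lists, column names), so a concrete set of arguments may help. The following is a minimal sketch, not taken from the library itself: all column names are hypothetical, and the helper make_figure assumes these keyword arguments end up in plotly.express.scatter_3d, which accepts every parameter listed here.

```python
# All column names below are hypothetical; substitute your own.
plot_kwargs = {
    # Hover tooltip: a bold title plus per-column formatting rules.
    "hover_name": "sample_id",
    "hover_data": {
        "height": ":.3f",  # formatting string
        "weight": True,    # default formatting
        "batch": False,    # removed from the tooltip
    },
    # Asymmetric x error bars: error_x sizes the positive direction,
    # error_x_minus the negative one.
    "error_x": "err_hi",
    "error_x_minus": "err_lo",
    # Symmetric y error bars: error_y_minus is simply left unset.
    "error_y": "err_y",
    # Force a display order for a categorical column.
    "category_orders": {"grade": ["low", "mid", "high"]},
    # Replace raw column names in axis titles, legends and hovers.
    "labels": {"height": "Height (cm)", "weight": "Weight (kg)"},
    # Colors are cycled in the order given by category_orders.
    "color_discrete_sequence": ["#1f77b4", "#ff7f0e", "#2ca02c"],
    "opacity": 0.8,
    "log_x": False,
    "range_z": [0, 100],
    "title": "Example figure",
    "width": 800,
    "height": 600,
}


def make_figure(df, x, y, z, **kwargs):
    """Hypothetical helper: forward kwargs to plotly.express.scatter_3d.

    plotly is imported lazily so the dict above can be built and
    inspected without plotly installed.
    """
    import plotly.express as px

    return px.scatter_3d(df, x=x, y=y, z=z, **kwargs)
```

With a dataframe that contains the columns named above, make_figure(df, "height", "weight", "age", color="grade", **plot_kwargs) would return a plotly figure. Whether the method documented here forwards its keyword arguments in exactly this way is an assumption.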