dashi.unsupervised_characterization.data_source_map package

Submodules

dashi.unsupervised_characterization.data_source_map.data_source_map module

Data Source Map main module.

class DataSourceMap(probability_map=None, counts_map=None, sources=None, support=None, variable_name=None, variable_type=None)[source]

Bases: object

A class that contains the statistical distributions of data estimated at a specific time period. Both relative and absolute frequencies are included.

probability_map

Numerical matrix representing the probability distribution temporal map (relative frequency).

Type:

Union[List[List[float]], None]

counts_map

Numerical matrix representing the counts temporal map (absolute frequency).

Type:

Union[List[List[int]], None]

sources

List of sources (character) from which the data was obtained.

Type:

Union[List[str], None]

support

Numerical or character matrix representing the support (the value at each bin) of probability_map and counts_map.

Type:

Union[List[str], None]

variable_name

Name of the variable (character).

Type:

Union[str, None]

variable_type

Type of the variable (character).

Type:

Union[str, None]

period

Batching period among ‘week’, ‘month’ and ‘year’.

Type:

Union[str, None]
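The relationship between counts_map (absolute frequency) and probability_map (relative frequency) can be sketched as follows. This is a minimal illustration rather than the library's code: it assumes one row per source and one column per support bin, and the helper name counts_to_probabilities and the toy values are hypothetical.

```python
from typing import List


def counts_to_probabilities(counts_map: List[List[int]]) -> List[List[float]]:
    """Turn absolute frequencies into relative ones, row by row.

    Assumption: each row holds the counts of one source over a shared support.
    """
    probability_map = []
    for row in counts_map:
        total = sum(row)
        # A source without observations has no distribution; keep zeros
        # instead of dividing by zero.
        probability_map.append([c / total if total else 0.0 for c in row])
    return probability_map


# Hypothetical toy data: two sources, three support bins.
sources = ["site_A", "site_B"]
support = ["low", "mid", "high"]
counts_map = [[2, 6, 2], [5, 0, 15]]

probability_map = counts_to_probabilities(counts_map)
for name, row in zip(sources, probability_map):
    print(name, dict(zip(support, row)))
```

Under this assumed layout, every non-empty row of probability_map sums to 1, which is what makes sources of different sizes comparable.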

check()[source]

Validates the consistency of the DataSourceMap attributes. This method checks for various potential issues, such as mismatched dimensions, invalid periods, or unsupported variable types.

Returns:

Returns a list of error messages if any validation fails, otherwise returns True indicating the object is valid.

Return type:

Union[List[str], bool]
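Because check() returns either True or a list of error messages, a plain truthiness test is misleading: a non-empty list of errors is also truthy. The sketch below shows one way to handle that contract. The function check_map is a hypothetical stand-in that only mirrors the documented return type; its two rules are illustrative and are not the library's actual validation logic.

```python
from typing import List, Union


def check_map(probability_map, counts_map, sources) -> Union[List[str], bool]:
    """Hypothetical stand-in mirroring the documented return contract of check()."""
    errors = []
    if len(probability_map) != len(counts_map):
        errors.append("probability_map and counts_map differ in number of rows")
    if len(sources) != len(counts_map):
        errors.append("sources does not match the number of rows in counts_map")
    return errors if errors else True


def is_valid(result: Union[List[str], bool]) -> bool:
    # A non-empty list of errors is truthy, so compare against True explicitly.
    return result is True


ok = check_map([[0.5, 0.5]], [[1, 1]], ["site_A"])
bad = check_map([[0.5, 0.5]], [[1, 1]], ["site_A", "site_B"])
print(is_valid(ok), is_valid(bad), bad)
```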

counts_map: Optional[List[List[int]]] = None
probability_map: Optional[List[List[float]]] = None
sources: Optional[List[str]] = None
support: Optional[List[str]] = None
variable_name: Optional[str] = None
variable_type: Optional[str] = None

class MultiVariateDataSourceMap(probability_map=None, counts_map=None, sources=None, support=None, variable_name=None, variable_type=None, multivariate_probability_map=None, multivariate_counts_map=None, multivariate_support=None)[source]

Bases: BaseMultiVariateMap, DataSourceMap

A subclass of DataSourceMap representing a multi-variate multi-source data map. In addition to the attributes inherited from the DataSourceMap class, this class includes additional properties specific to multivariate multi-source data.

multivariate_probability_map

List of matrices representing the multi-variate probability distribution temporal map (relative frequency) for each timestamp.

Type:

Optional[np.ndarray]

multivariate_counts_map

List of matrices representing the multi-variate counts temporal map (absolute frequency) for each timestamp.

Type:

Optional[np.ndarray]

multivariate_support

List of matrices representing the support (the value at each bin) of the dimensions of multivariate_probability_map and multivariate_counts_map.

Type:

Optional[np.ndarray]

check()[source]

Validates the consistency of the DataSourceMap attributes. This method checks for various potential issues, such as mismatched dimensions, invalid periods, or unsupported variable types.

Returns:

Returns a list of error messages if any validation fails, otherwise returns True indicating the object is valid.

Return type:

Union[List[str], bool]

estimate_conditional_data_source_map(data, source_column_name, label_column_name, kde_resolution=10, dimensions=2, dim_reduction='PCA', scale=True, scatter_plot=False, verbose=False)[source]

Estimates a MultiVariateDataSourceMap object for the data corresponding to each label of a DataFrame containing multiple variables (in columns) over different sources, using dimensionality reduction techniques (e.g., PCA) to handle high dimensional data.

Parameters:
  • data (pd.DataFrame) – A DataFrame where each row represents an individual or data point, and each column represents a variable. One column should represent the analysis sources and another the labels.

  • source_column_name (str) – A string indicating the name of the column in the DataFrame that contains the source of the data.

  • label_column_name (str) – The name of the column that contains the labels or class/category for each observation (used for conditional analysis).

  • kde_resolution (int) – The resolution of the grid used for Kernel Density Estimation (KDE). This determines the granularity of the KDE grid and how fine or coarse the estimated density maps will be. Default is 10.

  • dimensions (int) – The number of dimensions to keep after applying dimensionality reduction (e.g., PCA). Default is 2, meaning the data will be projected into a 2D space. The maximum number of dimensions is 3. For single-variable datasets, dimensions can be set to 1.

  • dim_reduction (str) – A dimensionality reduction technique to be used on the data. Default is ‘PCA’ (Principal Component Analysis) for numerical data. Other options can include ‘MCA’ (Multiple Correspondence Analysis) for categorical data or ‘FAMD’ (Factor Analysis of Mixed Data) for mixed data. Note: in case of using ‘FAMD’, numerical variables must be in float type. Otherwise they will be treated as categorical.

  • scale (bool) – Applicable only when using PCA dimensionality reduction. If True, scales the input data using z-score normalization. Defaults to True.

  • scatter_plot (bool) – Whether to generate a scatter plot of the first two principal components of the dimensionality reduction. Defaults to False.

  • verbose (bool) – Whether to display additional information during the process. Defaults to False.

Returns:

A dictionary where the keys are the labels in the dataset, and the values are MultiVariateDataSourceMap objects representing the multi-source maps generated for each label.

Return type:

Dict[str, MultiVariateDataSourceMap]
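The shape of the returned dictionary (one entry per label, each covering all sources seen for that label) can be sketched in plain Python. This is only an illustration of the conditional split: the rows and helper names are hypothetical, and each value here is a simple per-source row count rather than a MultiVariateDataSourceMap.

```python
from collections import defaultdict

# Hypothetical rows: (source, label, value).
rows = [
    ("site_A", "case", 1.2),
    ("site_A", "control", 0.4),
    ("site_B", "case", 2.1),
    ("site_B", "case", 1.9),
    ("site_B", "control", 0.7),
]


def split_by_label(rows):
    """Group rows by label so that each label can be mapped independently."""
    groups = defaultdict(list)
    for source, label, value in rows:
        groups[label].append((source, value))
    return dict(groups)


def rows_per_source(group):
    """Count how many rows each source contributes within one label group."""
    counts = defaultdict(int)
    for source, _value in group:
        counts[source] += 1
    return dict(counts)


by_label = split_by_label(rows)
summary = {label: rows_per_source(group) for label, group in by_label.items()}
print(summary)
```

A summary like this is a quick way to spot labels for which some source contributes very few rows before estimating a map for them.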

estimate_multivariate_data_source_map(data, source_column_name, kde_resolution=10, dimensions=2, dim_reduction='PCA', scale=True, scatter_plot=False, verbose=False)[source]

Estimates a MultiVariateDataSourceMap object from a DataFrame containing multiple variables (in columns) with a source column indicating the source of the data. The function performs dimensionality reduction on the data (e.g. PCA, MCA, FAMD) to handle high-dimensional data and estimates the probability distributions for each source.

Parameters:
  • data (pd.DataFrame) – A DataFrame where each row represents an individual or data point, and each column represents a variable. One column should represent the analysis sources.

  • source_column_name (str) – A string indicating the name of the column in the DataFrame that contains the source of the data.

  • kde_resolution (int) – The resolution of the grid used for Kernel Density Estimation (KDE). This determines the granularity of the KDE grid and how fine or coarse the estimated density maps will be. Default is 10.

  • dimensions (int) – The number of dimensions to keep after applying dimensionality reduction (e.g., PCA). Default is 2, meaning the data will be projected into a 2D space. The maximum number of dimensions is 3.

  • dim_reduction (str) – A dimensionality reduction technique to be used on the data. Default is ‘PCA’ (Principal Component Analysis) for numerical data. Other options can include ‘MCA’ (Multiple Correspondence Analysis) for categorical data or ‘FAMD’ (Factor Analysis of Mixed Data) for mixed data. Note: when using ‘FAMD’, numerical variables must be of float type; otherwise they will be treated as categorical.

  • scale (bool) – Applicable only when using PCA dimensionality reduction. If True, scales the input data using z-score normalization. Defaults to True.

  • scatter_plot (bool) – Whether to generate a scatter plot of the first two principal components of the dimensionality reduction. Defaults to False.

  • verbose (bool) – Whether to display additional information during the process. Defaults to False.

Returns:

A MultiVariateDataSourceMap object containing the estimated probability distributions for each source, along with the multivariate probability maps, counts maps, and supports for each dimension.

Return type:

MultiVariateDataSourceMap
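The z-score normalization controlled by the scale parameter can be sketched as below. PCA is driven by variance, so without scaling a variable measured in large units tends to dominate the components. This is a minimal sketch, not the library's implementation: it assumes the population standard deviation, and the function name z_score and the sample values are hypothetical.

```python
from statistics import fmean, pstdev


def z_score(column):
    """Standardize one numeric column to zero mean and unit variance.

    Assumption: population standard deviation is used; the library may rely
    on a different estimator.
    """
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no variance; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = z_score(heights_cm)
print([round(v, 3) for v in scaled])
```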

estimate_univariate_data_source_map(data, source_column, supports=None, numeric_smoothing=True, numeric_variables_bins=100, verbose=False)[source]

Estimates a DataSourceMap object from a DataFrame containing individuals in rows and variables in columns, where one of the columns is the analysis source.

Parameters:
  • data (pd.DataFrame) – DataFrame containing the data to be analyzed. Each row represents an individual and each column a variable.

  • source_column (str) – Name of the column in the DataFrame that contains the source of the data. This column will be used to group the data and estimate the distributions.

  • supports (Union[Dict, None], optional) – A dictionary with structure {variable_name: variable_type_name} containing the support of the data distributions for each variable. If not provided, it is automatically estimated from the data.

  • numeric_smoothing (bool, optional) – Whether to smooth numerical variables with Kernel Density Estimation (Gaussian kernel, default bandwidth) instead of using a traditional histogram. Default is True.

  • numeric_variables_bins (int) – The number of bins used to define the frequency/density histogram for numerical variables when their support is not provided. Default is 100.

  • verbose (bool, optional) – If True, prints additional information about the estimation process. Default is False.

Returns:

The DataSourceMap object or a dictionary of DataSourceMap objects depending on the number of analysis variables.

Return type:

DataSourceMap
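When no support is given for a numerical variable, a histogram over a fixed number of bins is the non-smoothed alternative described above. The sketch below shows the idea with equal-width bins shared by all sources, so that the rows of the resulting counts matrix are directly comparable. It is an assumption-laden illustration (equal-width bins, left bin edges as support, hypothetical function and variable names), not the library's binning code.

```python
def histogram_by_source(values, sources, bins):
    """Count values per source over one shared set of equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values are equal
    support = [lo + i * width for i in range(bins)]  # left edge of each bin
    names = sorted(set(sources))
    counts = {name: [0] * bins for name in names}
    for value, name in zip(values, sources):
        # Clamp the maximum value into the last bin.
        index = min(int((value - lo) / width), bins - 1)
        counts[name][index] += 1
    return support, names, [counts[name] for name in names]


# Hypothetical toy data: one numeric variable observed at two sources.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 9.0]
origin = ["A", "A", "A", "B", "B", "B"]

support, names, counts_map = histogram_by_source(values, origin, bins=4)
print(support)
for name, row in zip(names, counts_map):
    print(name, row)
```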

dashi.unsupervised_characterization.data_source_map.data_source_map_plotter module

Data Source Map plotting main functions and classes.

plot_conditional_data_source_map(data_source_map_dict, absolute=False)[source]

Plots a Figure for each dimension selected in the data_source_map_dict. Each Figure represents the Data Source heatmap of each label in that dimension.

Parameters:
  • data_source_map_dict (Dict[str, MultiVariateDataSourceMap]) – A dictionary where keys are labels (strings), and values are the corresponding MultiVariateDataSourceMap objects obtained from the ‘estimate_conditional_data_source_map’ function.

  • absolute (bool, optional) – If True, plot absolute values; otherwise, relative probabilities are plotted. Default is False.

Return type:

None

plot_multivariate_data_source_map(data_source_map, absolute=False)[source]

Plots a multivariate Data Source heatmap from a MultiVariateDataSourceMap object.

Parameters:
  • data_source_map (MultiVariateDataSourceMap) – The MultiVariateDataSourceMap object that contains multivariate data to be plotted.

  • absolute (bool, optional) – If True, plot absolute values; otherwise, the relative probabilities are plotted. Default is False.

Returns:

The Plotly figure object representing the multivariate heatmap.

Return type:

Figure

plot_univariate_data_source_map(data_source_map, absolute=False, log_transform=False, start_value=0, end_value=None, sorting_method='alphabetical', title=None)[source]

Plots a Data Source heatmap or series from a DataSourceMap object.

Parameters:
  • data_source_map (DataSourceMap) – The DataSourceMap object that contains data to be plotted.

  • absolute (bool) – If True, plot absolute values; otherwise, the relative probabilities are plotted. Default is False.

  • log_transform (bool) – If True, applies a log transformation to the data for better visibility of small values. Default is False.

  • start_value (int, optional) – The value at which to start the plot. Default is 0.

  • end_value (int, optional) – The value at which to end the plot. If None, the plot extends to the last value. Default is None.

  • sorting_method (str, optional) – The method by which the data will be sorted for display (e.g., ‘frequency’, ‘alphabetical’). Default is ‘alphabetical’.

  • title (str, optional) – The title of the plot. If None, a default title is used. Default is None.

Returns:

The Plotly figure object representing the plot.

Return type:

Figure
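Putting estimation and plotting together might look like the sketch below. The import paths and call signatures follow the module and function names documented on this page. The toy records, the helper names, and the guard that returns None when pandas or dashi is not installed are assumptions added for illustration.

```python
import importlib.util

# Hypothetical toy records: one numeric variable plus a "site" source column.
records = {
    "age": [34.0, 51.0, 47.0, 29.0, 62.0, 58.0],
    "site": ["A", "A", "A", "B", "B", "B"],
}


def dependencies_available() -> bool:
    """Return True only if both pandas and dashi can be imported."""
    return all(
        importlib.util.find_spec(name) is not None for name in ("pandas", "dashi")
    )


def plot_first_variable(records: dict, source_column: str):
    """Estimate univariate maps and plot the first one.

    Returns None when the required packages are missing.
    """
    if not dependencies_available():
        return None
    import pandas as pd
    from dashi.unsupervised_characterization.data_source_map.data_source_map import (
        estimate_univariate_data_source_map,
    )
    from dashi.unsupervised_characterization.data_source_map.data_source_map_plotter import (
        plot_univariate_data_source_map,
    )

    maps = estimate_univariate_data_source_map(pd.DataFrame(records), source_column)
    # The documented return is a single DataSourceMap or a dictionary of them.
    first = next(iter(maps.values())) if isinstance(maps, dict) else maps
    return plot_univariate_data_source_map(
        first, absolute=True, sorting_method="alphabetical"
    )
```

Calling plot_first_variable(records, "site") in an environment with both packages installed should yield the figure described above, which can then be displayed or saved with the usual Plotly methods.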

Module contents