dashi.unsupervised_characterization.data_temporal_map package
Submodules
dashi.unsupervised_characterization.data_temporal_map.data_temporal_map module
Data Temporal Map main functions and classes
- class DataTemporalMap(probability_map=None, counts_map=None, dates=None, support=None, variable_name=None, variable_type=None, period=None)[source]
Bases:
object
A class that contains the statistical distributions of data estimated at a specific time period. Both relative and absolute frequencies are included.
- probability_map
Numerical matrix representing the probability distribution temporal map (relative frequency).
- Type:
Union[List[List[float]], None]
- counts_map
Numerical matrix representing the counts temporal map (absolute frequency).
- Type:
Union[List[List[int]], None]
- dates
Array of the temporal batches.
- Type:
Union[List[datetime], None]
- support
Numerical or character matrix representing the support (the value at each bin) of probability_map and counts_map.
- Type:
Union[List[str], None]
- variable_name
Name of the variable (character).
- Type:
Union[str, None]
- variable_type
Type of the variable (character).
- Type:
Union[str, None]
- period
Batching period among ‘week’, ‘month’ and ‘year’.
- Type:
Union[str, None]
- check()[source]
Validates the consistency of the DataTemporalMap attributes. This method checks for various potential issues, such as mismatched dimensions, invalid periods, or unsupported variable types.
- Returns:
Returns a list of error messages if any validation fails, otherwise returns True indicating the object is valid.
- Return type:
Union[List[str], bool]
- counts_map: Optional[List[List[int]]] = None
- dates: Optional[List[datetime]] = None
- period: Optional[str] = None
- probability_map: Optional[List[List[float]]] = None
- support: Optional[List[str]] = None
- variable_name: Optional[str] = None
- variable_type: Optional[str] = None
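The kind of validation that check() describes can be illustrated with a minimal sketch. The class below is a hypothetical stand-in, not the library's DataTemporalMap, and it assumes that each row of a map corresponds to one entry of dates and each column to one entry of support; the actual checks performed by the library may differ.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional, Union

VALID_PERIODS = {"week", "month", "year"}


@dataclass
class TemporalMapSketch:
    """Hypothetical stand-in for DataTemporalMap (illustration only)."""

    probability_map: Optional[List[List[float]]] = None
    counts_map: Optional[List[List[int]]] = None
    dates: Optional[List[datetime]] = None
    support: Optional[List[str]] = None
    period: Optional[str] = None

    def check(self) -> Union[List[str], bool]:
        """Return a list of error messages, or True if nothing is wrong."""
        errors: List[str] = []
        if self.period not in VALID_PERIODS:
            errors.append(f"invalid period: {self.period!r}")
        for name in ("probability_map", "counts_map"):
            matrix = getattr(self, name)
            if matrix is None:
                continue
            # Assumption: one row per temporal batch.
            if self.dates is not None and len(matrix) != len(self.dates):
                errors.append(
                    f"{name} has {len(matrix)} rows but there are "
                    f"{len(self.dates)} dates"
                )
            # Assumption: one column per support value.
            if self.support is not None and any(
                len(row) != len(self.support) for row in matrix
            ):
                errors.append(f"{name} row width differs from support length")
        return errors if errors else True
```

Because the return value is either True or a non-empty list, callers should compare with `is True` rather than rely on truthiness.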
- class MultiVariateDataTemporalMap(probability_map=None, counts_map=None, dates=None, support=None, variable_name=None, variable_type=None, period=None, multivariate_probability_map=None, multivariate_counts_map=None, multivariate_support=None)[source]
Bases:
DataTemporalMap
A subclass of DataTemporalMap representing a multi-variate time series data map. In addition to the attributes inherited from the DataTemporalMap class, this class includes additional properties specific to multivariate time series data.
- multivariate_probability_map
List of matrices representing the multi-variate probability distribution temporal map (relative frequency) for each timestamp.
- Type:
Optional[List[List[float]]]
- multivariate_counts_map
List of matrices representing the multi-variate counts temporal map (absolute) for each timestamp.
- Type:
Optional[List[List[float]]]
- multivariate_support
List of matrices representing the support (the value at each bin) of the dimensions of multivariate_probability_map and multivariate_counts_map.
- Type:
Optional[List[float]]
- check()[source]
Validates the consistency of the MultiVariateDataTemporalMap attributes, ensuring that the multivariate probability map, counts map, and support dimensions are consistent, along with inherited checks from the parent class DataTemporalMap.
- Returns:
Returns a list of error messages if any validation fails, otherwise returns True indicating the object is valid.
- Return type:
Union[List[str], bool]
- multivariate_counts_map: Optional[List[List[float]]] = None
- multivariate_probability_map: Optional[List[List[float]]] = None
- multivariate_support: Optional[List[str]] = None
- estimate_conditional_data_temporal_map(data, date_column_name, label_column_name, kde_resolution=10, dimensions=2, period='month', start_date=None, end_date=None, dim_reduction='PCA', scale=True, scatter_plot=False, verbose=False)[source]
Estimates a MultiVariateDataTemporalMap object for the data corresponding to each label of a DataFrame containing multiple variables (in columns) over time, using dimensionality reduction techniques (e.g., PCA) to handle high-dimensional data.
- Parameters:
data (pd.DataFrame) – A DataFrame where each row represents an individual or data point, and each column represents a variable. One column should represent the analysis date (typically the acquisition date).
date_column_name (str) – A string indicating the name of the column in data containing the analysis date variable.
label_column_name (str) – The name of the column that contains the labels or class/category for each observation (used for concept shift analysis).
kde_resolution (int) – The resolution of the grid used for Kernel Density Estimation (KDE). This determines the granularity of the KDE grid and how fine or coarse the estimated density maps will be. Default is 10.
dimensions (int) – The number of dimensions to keep after applying dimensionality reduction (e.g., PCA). Default is 2, meaning the data will be projected into a 2D space. The maximum number of dimensions is 3. For single-variable datasets, dimensions can be set to 1.
period (str) – The period to batch the data for analysis. Options are: - ‘week’ (weekly analysis) - ‘month’ (monthly analysis, default) - ‘year’ (annual analysis)
start_date (pd.Timestamp) – A date object indicating the date at which to start the analysis, if different from the first chronological date in the date column.
end_date (pd.Timestamp) – A date object indicating the date at which to end the analysis, in case of being different from the last chronological date in the date column.
dim_reduction (str) – A dimensionality reduction technique to be used on the data. Default is ‘PCA’ (Principal Component Analysis) for numerical data. Other options can include ‘MCA’ (Multiple Correspondence Analysis) for categorical data or ‘FAMD’ (Factor Analysis of Mixed Data) for mixed data. Note: in case of using ‘FAMD’, numerical variables must be in float type. Otherwise they will be treated as categorical.
scale (bool) – Applies only when using PCA dimensionality reduction. If True, scales the input data using z-score normalization. Defaults to True.
scatter_plot (bool) – Whether to generate a scatter plot of the first two principal components of the dimensionality reduction.
verbose (bool) – Whether to display additional information during the process. Defaults to False.
- Returns:
A dictionary where the keys are the labels in the dataset, and the values are MultiVariateDataTemporalMap objects representing the temporal maps generated for each label.
- Return type:
Dict[str, MultiVariateDataTemporalMap]
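The shape of the returned dictionary (one entry per label) can be illustrated with a short sketch. The function name, the record layout (a list of dicts) and the `estimator` callback are hypothetical; the sketch only shows the split-by-label step, not the dimensionality reduction or density estimation carried out by the library.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List


def estimate_per_label(
    records: List[Dict[str, Any]],
    label_key: str,
    estimator: Callable[[List[Dict[str, Any]]], Any],
) -> Dict[str, Any]:
    """Group records by label and apply `estimator` to each group."""
    groups: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for record in records:
        groups[record[label_key]].append(record)
    # Keys are the label values (as strings); values are per-label results.
    return {str(label): estimator(rows) for label, rows in groups.items()}
```

For example, passing `len` as the estimator returns the number of observations per label.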
- estimate_multivariate_data_temporal_map(data, date_column_name, kde_resolution=10, dimensions=2, period='month', start_date=None, end_date=None, dim_reduction='PCA', scale=True, scatter_plot=False, verbose=False)[source]
Estimates a MultiVariateDataTemporalMap object from a DataFrame containing multiple variables (in columns) over time, using dimensionality reduction techniques (e.g., PCA) to handle high-dimensional data.
- Parameters:
data (pd.DataFrame) – A DataFrame where each row represents an individual or data point, and each column represents a variable. One column should represent the analysis date (typically the acquisition date).
date_column_name (str) – A string indicating the name of the column in data containing the analysis date variable.
kde_resolution (int) – The resolution of the grid used for Kernel Density Estimation (KDE). This determines the granularity of the KDE grid and how fine or coarse the estimated density maps will be. Default is 10.
dimensions (int) – The number of dimensions to keep after applying dimensionality reduction (e.g., PCA). Default is 2, meaning the data will be projected into a 2D space. The maximum number of dimensions is 3.
period (str) – The period to batch the data for analysis. Options are: - ‘week’ (weekly analysis) - ‘month’ (monthly analysis, default) - ‘year’ (annual analysis)
start_date (pd.Timestamp) – A date object indicating the date at which to start the analysis, if different from the first chronological date in the date column.
end_date (pd.Timestamp) – A date object indicating the date at which to end the analysis, in case of being different from the last chronological date in the date column.
dim_reduction (str) – A dimensionality reduction technique to be used on the data. Default is ‘PCA’ (Principal Component Analysis) for numerical data. Other options can include ‘MCA’ (Multiple Correspondence Analysis) for categorical data or ‘FAMD’ (Factor Analysis of Mixed Data) for mixed data. Note: in case of using ‘FAMD’, numerical variables must be in float type. Otherwise they will be treated as categorical.
scale (bool) – Applies only when using PCA dimensionality reduction. If True, scales the input data using z-score normalization. Defaults to True.
scatter_plot (bool) – Whether to generate a scatter plot of the first two principal components of the dimensionality reduction.
verbose (bool) – Whether to display additional information during the process. Defaults to False.
- Returns:
The MultiVariateDataTemporalMap object of the data.
- Return type:
MultiVariateDataTemporalMap
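The z-score normalization controlled by the scale parameter can be sketched as follows. This is an illustration only: the helper name is hypothetical, and the use of the population standard deviation and the handling of constant columns are assumptions, not confirmed library behaviour.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def zscore_columns(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Standardize each column to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Assumption: population standard deviation (divides by n).
    stds = [pstdev(col) for col in columns]
    return [
        # Constant columns (std == 0) are mapped to 0.0 to avoid dividing by zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]
```

Scaling before PCA keeps variables measured on large numeric ranges from dominating the principal components.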
- estimate_univariate_data_temporal_map(data, date_column_name, period='month', start_date=None, end_date=None, supports=None, numeric_variables_bins=100, numeric_smoothing=True, date_gaps_smoothing=False, verbose=False)[source]
Estimates a DataTemporalMap object from a DataFrame containing individuals in rows and variables in columns, where one of the columns is the analysis date (typically the acquisition date).
- Parameters:
data (pd.DataFrame) – A DataFrame containing as many rows as individuals, and as many columns as the analysis variables plus the individual acquisition date.
date_column_name (str) – A string indicating the name of the column in data containing the analysis date variable.
period (str) – The period to batch the data for analysis. Options are: - ‘week’ (weekly analysis) - ‘month’ (monthly analysis, default) - ‘year’ (annual analysis)
start_date (pd.Timestamp) – A date object indicating the date at which to start the analysis, if different from the first chronological date in the date column.
end_date (pd.Timestamp) – A date object indicating the date at which to end the analysis, in case of being different from the last chronological date in the date column.
supports (Union[Dict, None]) – A dictionary with structure {variable_name: variable_type_name} containing the support of the data distributions for each variable. If not provided, it is automatically estimated from the data.
numeric_variables_bins (int) – The number of bins used to define the frequency/density histogram for numerical variables when their support is not provided. Default is 100.
numeric_smoothing (bool) – Logical value indicating whether Kernel Density Estimation smoothing (Gaussian kernel, default bandwidth) is applied to numerical variables; if False, a traditional histogram is used instead.
date_gaps_smoothing (bool) – Logical value indicating whether a linear smoothing is applied to those time batches without data. By default, gaps are filled with NAs.
verbose (bool) – Whether to display additional information during the process. Defaults to False.
- Returns:
The DataTemporalMap object or a dictionary of DataTemporalMap objects depending on the number of analysis variables.
- Return type:
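The idea behind a univariate temporal map for a categorical variable can be sketched with the standard library alone. The function below is hypothetical and only covers monthly batching with a fixed support; it does not fill empty months and does not reproduce the library's handling of numerical variables, smoothing or date gaps.

```python
from collections import Counter, defaultdict
from datetime import date
from typing import Iterable, List, Sequence, Tuple


def monthly_frequency_maps(
    rows: Iterable[Tuple[date, str]], support: Sequence[str]
) -> Tuple[List[date], List[List[int]], List[List[float]]]:
    """Return (batches, counts_map, probability_map) for (date, value) rows."""
    per_month = defaultdict(Counter)
    for day, value in rows:
        # Each observation is assigned to the first day of its month.
        per_month[date(day.year, day.month, 1)][value] += 1
    batches = sorted(per_month)
    # Absolute frequencies: one row per batch, one column per support value.
    counts_map = [[per_month[b][v] for v in support] for b in batches]
    # Relative frequencies: each row is divided by its total.
    probability_map = [
        [c / sum(row) if sum(row) else 0.0 for c in row] for row in counts_map
    ]
    return batches, counts_map, probability_map
```

The two matrices play the roles of counts_map and probability_map, and `batches` plays the role of dates.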
- trim_data_temporal_map(data_temporal_map, start_date=None, end_date=None)[source]
Trims the data in the DataTemporalMap object to the specified date range.
- Parameters:
data_temporal_map (DataTemporalMap) – The DataTemporalMap object to be trimmed.
start_date (Optional[datetime]) – The start date of the range to trim the data from. If None, the earliest date in data_temporal_map.dates will be used.
end_date (Optional[datetime]) – The end date of the range to trim the data to. If None, the latest date in data_temporal_map.dates will be used.
- Returns:
The input DataTemporalMap object with trimmed data.
- Return type:
DataTemporalMap
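Trimming to a date range amounts to keeping only the batches whose date falls inside it. The sketch below uses a hypothetical helper that works on a plain list of dates and a row-per-date matrix; treating both bounds as inclusive is an assumption, not documented behaviour.

```python
from datetime import datetime
from typing import List, Optional, Sequence, Tuple


def trim_rows(
    dates: Sequence[datetime],
    matrix: Sequence[Sequence[float]],
    start_date: Optional[datetime] = None,
    end_date: Optional[datetime] = None,
) -> Tuple[List[datetime], List[Sequence[float]]]:
    """Keep only the dates (and matching rows) within [start_date, end_date]."""
    # Missing bounds default to the earliest and latest available dates.
    start = min(dates) if start_date is None else start_date
    end = max(dates) if end_date is None else end_date
    keep = [i for i, d in enumerate(dates) if start <= d <= end]
    return [dates[i] for i in keep], [matrix[i] for i in keep]
```

The same index selection would be applied to both the probability and the counts matrix so that they stay aligned with the dates.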
dashi.unsupervised_characterization.data_temporal_map.data_temporal_map_plotter module
Data Temporal Map plotting main functions and classes
- plot_conditional_data_temporal_map(data_temporal_map_dict, absolute=False)[source]
Plots a Figure for each dimension selected in the data_temporal_map_dict. Each Figure represents the Data Temporal heatmap of each label in that dimension.
- Parameters:
data_temporal_map_dict (Dict[str, MultiVariateDataTemporalMap]) – A dictionary where keys are labels (strings), and values are the corresponding MultiVariateDataTemporalMap objects obtained from the ‘estimate_conditional_data_temporal_map’ function.
absolute (bool, optional) – If True, plot absolute values; otherwise, relative probabilities are plotted. Default is False.
- Return type:
None
- plot_multivariate_data_temporal_map(data_temporal_map, absolute=False, plot_title=None)[source]
Plots a Data Temporal heatmap from a MultiVariateDataTemporalMap object.
- Parameters:
data_temporal_map (MultiVariateDataTemporalMap) – The MultiVariateDataTemporalMap object that contains the temporal data to be plotted.
absolute (bool, optional) – If True, plot absolute values; otherwise, the relative probabilities are plotted. Default is False.
plot_title (str, optional) – The title of the plot. If None, a default title is used. Default is None.
- Returns:
The Plotly figure object representing the plot.
- Return type:
Figure
- plot_univariate_data_temporal_map(data_temporal_map, absolute=False, log_transform=False, start_value=0, end_value=None, start_date=None, end_date=None, sorting_method='frequency', color_palette='Spectral', mode='heatmap', plot_title=None)[source]
Plots a Data Temporal heatmap or series from a DataTemporalMap object.
- Parameters:
data_temporal_map (DataTemporalMap) – The DataTemporalMap object that contains the temporal data to be plotted.
absolute (bool) – If True, plot absolute values; otherwise, the relative probabilities are plotted. Default is False.
log_transform (bool) – If True, applies a log transformation to the data for better visibility of small values. Default is False.
start_value (int, optional) – The value at which to start the plot. Default is 0.
end_value (int, optional) – The value at which to end the plot. If None, the plot extends to the last value. Default is None.
start_date (datetime, optional) – The starting date for the plot (filters the data). If None, uses the first date in the data. Default is None.
end_date (datetime, optional) – The ending date for the plot (filters the data). If None, uses the last date in the data. Default is None.
sorting_method (str, optional) – The method by which the data will be sorted for display (e.g., ‘frequency’, ‘alphabetical’). Default is ‘frequency’.
color_palette (str, optional) – The color palette to be used for the plot (e.g., ‘Spectral’, ‘viridis’, ‘viridis_r’, ‘magma’, ‘magma_r’). Default is ‘Spectral’.
mode (str, optional) – The mode of visualization (e.g., ‘heatmap’, ‘series’). Default is ‘heatmap’.
plot_title (str, optional) – The title of the plot. If None, a default title is used. Default is None.
- Returns:
The Plotly figure object representing the plot.
- Return type:
Figure
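One plausible reading of the ‘frequency’ sorting method is that support values are ordered by their total count across all batches, most frequent first. The helper below is a hypothetical sketch of that reading and is not taken from the library; the actual ordering rule may differ.

```python
from typing import List, Sequence, Tuple


def sort_columns_by_frequency(
    counts_map: Sequence[Sequence[int]], support: Sequence[str]
) -> Tuple[List[str], List[List[int]]]:
    """Reorder support values and matrix columns by descending total count."""
    # Column totals over all temporal batches.
    totals = [sum(col) for col in zip(*counts_map)]
    # sorted() is stable, so ties keep their original relative order.
    order = sorted(range(len(support)), key=lambda j: totals[j], reverse=True)
    sorted_support = [support[j] for j in order]
    sorted_counts = [[row[j] for j in order] for row in counts_map]
    return sorted_support, sorted_counts
```

Ordering columns this way places the dominant categories next to each other in a heatmap, which makes shifts among them easier to see.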