Quick Start Guide

This guide walks you through the main features of dashi.

1. Data Formatting

Before any analysis, format your DataFrame so that dates and types are correct:

import pandas as pd
import dashi as ds

df = pd.read_csv('my_data.csv')

df = ds.format_data(
    df,
    date_column_name='date',
    date_format='%Y/%m/%d',
    numerical_column_names=['age', 'weight'],
    categorical_column_names=['gender', 'diagnosis']
)
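
For readers who want to see what "correct dates and types" means in practice, the sketch below does a comparable preparation with plain pandas. It is not dashi's implementation of format_data; the inline sample data and the 'not_recorded' placeholder value are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample standing in for 'my_data.csv'.
csv_text = """date,age,weight,gender,diagnosis
2023/01/15,34,70.5,F,A
2023/02/03,51,82.0,M,B
2023/02/20,47,not_recorded,F,A
"""
raw = pd.read_csv(io.StringIO(csv_text))

formatted = raw.copy()
# Parse dates with the same format string used in the call above.
formatted['date'] = pd.to_datetime(formatted['date'], format='%Y/%m/%d')
# Coerce numerical columns; entries that cannot be parsed become NaN.
for col in ['age', 'weight']:
    formatted[col] = pd.to_numeric(formatted[col], errors='coerce')
# Store categorical columns with the pandas category dtype.
for col in ['gender', 'diagnosis']:
    formatted[col] = formatted[col].astype('category')

print(formatted.dtypes)
```

Checking the dtypes this way before running any analysis makes it easy to spot a column that was read as text by mistake.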

2. Unsupervised Temporal Analysis

Estimate how variable distributions change over time:

# Univariate analysis
dtm = ds.estimate_univariate_data_temporal_map(
    data=df,
    date_column_name='date',
    period='month'
)

# Plot heatmap
plot = ds.plot_univariate_data_temporal_map(dtm['weight'])

# Multivariate analysis with dimensionality reduction
mv_dtm = ds.estimate_multivariate_data_temporal_map(
    data=df,
    date_column_name='date',
    period='month',
    dim_reduction='PCA',
    dimensions=2
)

# Plot heatmap
plot = ds.plot_multivariate_data_temporal_map(mv_dtm)
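
To make the idea of a temporal map concrete, here is a minimal conceptual sketch using only NumPy and pandas: one normalized histogram of a variable per month, stacked into a table. It illustrates the concept under simplifying assumptions (fixed bins, synthetic data) and is not how dashi computes its maps; the names demo, edges, and monthly_distribution are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: 'weight' drifts upward from January to March.
dates = pd.to_datetime(
    ['2023-01-10'] * 200 + ['2023-02-10'] * 200 + ['2023-03-10'] * 200
)
weights = np.concatenate([
    rng.normal(70, 5, 200),
    rng.normal(72, 5, 200),
    rng.normal(80, 5, 200),
])
demo = pd.DataFrame({'date': dates, 'weight': weights})

# Shared bin edges so every month is described on the same support.
edges = np.linspace(demo['weight'].min(), demo['weight'].max(), 21)


def monthly_distribution(values):
    """Return the relative frequency of values in each shared bin."""
    counts, _ = np.histogram(values, bins=edges)
    return counts / counts.sum()


# One row per month, one column per bin; each row sums to 1.
rows = {
    str(month): monthly_distribution(group.to_numpy())
    for month, group in demo.groupby(demo['date'].dt.to_period('M'))['weight']
}
temporal_map = pd.DataFrame.from_dict(rows, orient='index')

print(temporal_map.round(3))
```

Read as a heatmap with months on one axis and bins on the other, a drift in the variable shows up as probability mass moving across columns over time.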

3. Unsupervised Multi-Source Analysis

Compare distributions across different data sources:

dsm = ds.estimate_univariate_data_source_map(
    data=df,
    source_column='hospital'
)

plot = ds.plot_univariate_data_source_map(dsm['weight'])
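
Comparing sources comes down to comparing their distributions. The sketch below builds one histogram per hospital and measures how far apart two of them are with the Jensen-Shannon distance, written out in NumPy. The choice of distance, the synthetic data, and the helper name jensen_shannon_distance are assumptions made for illustration; this is not dashi's internal code.

```python
import numpy as np
import pandas as pd


def jensen_shannon_distance(p, q):
    """Jensen-Shannon distance (log base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        # Kullback-Leibler divergence, skipping bins where a is zero.
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return float(np.sqrt(max(divergence, 0.0)))


rng = np.random.default_rng(1)

# Synthetic data: hospitals A and B are alike, hospital C differs.
demo = pd.DataFrame({
    'hospital': ['A'] * 300 + ['B'] * 300 + ['C'] * 300,
    'weight': np.concatenate([
        rng.normal(70, 5, 300),
        rng.normal(70, 5, 300),
        rng.normal(85, 5, 300),
    ]),
})

# Shared bins so the per-source histograms are comparable.
edges = np.linspace(demo['weight'].min(), demo['weight'].max(), 21)
source_map = {
    name: np.histogram(group, bins=edges)[0] / len(group)
    for name, group in demo.groupby('hospital')['weight']
}

d_ab = jensen_shannon_distance(source_map['A'], source_map['B'])
d_ac = jensen_shannon_distance(source_map['A'], source_map['C'])
print(f"A vs B: {d_ab:.3f}, A vs C: {d_ac:.3f}")
```

With base-2 logarithms the distance is bounded between 0 (identical distributions) and 1 (no overlap), which makes values comparable across variables.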

4. Variability Metrics (IGT & MSV)

Quantify temporal or source variability:

# Information Geometric Temporal (IGT) projection
igt = ds.estimate_igt_projection(dtm, embedding_type='classicalmds')
plot = ds.plot_IGT_projection(igt)

# Multi-Source Variability (MSV) metrics
msv = ds.estimate_MSV_metrics(dsm)
plot = ds.plot_MSV(msv)
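
The embedding_type='classicalmds' option suggests that the projection places batches in a low-dimensional space so that distances between points approximate distances between distributions. The following NumPy sketch of classical multidimensional scaling shows the general technique under that assumption; the distance matrix is invented for the example, and the function classical_mds is a hypothetical helper, not part of dashi.

```python
import numpy as np


def classical_mds(distances, dimensions=2):
    """Embed n items in `dimensions` coordinates from an n x n distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    # Double-center the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering
    # Keep the eigenvectors with the largest eigenvalues.
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    order = np.argsort(eigenvalues)[::-1][:dimensions]
    scale = np.sqrt(np.clip(eigenvalues[order], 0.0, None))
    return eigenvectors[:, order] * scale


# Hypothetical pairwise distances between four monthly batches.
distances = np.array([
    [0.0, 0.1, 0.2, 0.6],
    [0.1, 0.0, 0.1, 0.5],
    [0.2, 0.1, 0.0, 0.4],
    [0.6, 0.5, 0.4, 0.0],
])

embedding = classical_mds(distances, dimensions=2)
print(embedding.round(3))
```

In such a projection, batches that sit close together have similar distributions, while a batch that lies far from the rest (the fourth one here) points to a change worth investigating.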

5. Supervised Characterization

Evaluate model performance across temporal or source batches:

metrics = ds.estimate_multibatch_models(
    data=df,
    inputs_numerical_column_names=['age', 'weight'],
    inputs_categorical_column_names=['gender'],
    output_classification_column_name='diagnosis',
    date_column_name='date',
    period='month',
    learning_strategy='from_scratch',
    model_type='histogram_gradient_boosting'
)

performance_df = ds.arrange_performance_metrics(
    metrics=metrics,
    metric_name='AUC_MACRO'
)

plot = ds.plot_performance(
    performance_df,
    metric_name='ROC-AUC_MACRO'
)
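
The idea behind a multi-batch evaluation can be sketched without dashi: train one model per batch, test it on every batch, and arrange the scores in a train-by-test table. The example below deliberately swaps in a nearest-centroid classifier and plain accuracy so that it runs with NumPy and pandas alone; it does not use histogram gradient boosting or AUC, and all names in it (make_batch, fit_centroids, accuracy) are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)


def make_batch(month, shift, n=200):
    """Synthetic batch: the class-1 mean moves by `shift` to mimic drift."""
    y = rng.integers(0, 2, n)
    x = rng.normal(0.0, 1.0, n) + y * (3.0 + shift)
    return pd.DataFrame({'month': month, 'x': x, 'y': y})


# January and February look alike; March reverses the class relationship.
demo = pd.concat(
    [make_batch('2023-01', 0.0), make_batch('2023-02', 0.0), make_batch('2023-03', -6.0)],
    ignore_index=True,
)


def fit_centroids(batch):
    """Stand-in model: the mean of x for each class."""
    return batch.groupby('y')['x'].mean()


def accuracy(centroids, batch):
    """Assign each row to the nearest class centroid and score the result."""
    x = batch['x'].to_numpy()
    gaps = np.abs(x[:, None] - centroids.to_numpy()[None, :])
    predicted = centroids.index.to_numpy()[gaps.argmin(axis=1)]
    return float((predicted == batch['y'].to_numpy()).mean())


batches = dict(tuple(demo.groupby('month')))
# Rows: training batch. Columns: test batch.
performance = pd.DataFrame({
    test: {train: accuracy(fit_centroids(batches[train]), batches[test]) for train in batches}
    for test in batches
})
performance.index.name = 'train_batch'
performance.columns.name = 'test_batch'

print(performance.round(2))
```

A model that scores well on its own batch but poorly on a later one, as the January model does on March here, is the kind of pattern this analysis is meant to surface.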