Note
Go to the end to download the full example code.
Deep Domain Aware Datasets
This example illustrate some uses of DeepDADatasets.
# Author: Maxence Barneche
#
# License: BSD 3-Clause
# sphinx_gallery_thumbnail_number = 4
import numpy as np
import pandas as pd
import torch
from skada.datasets import make_shifted_datasets
from skada.deep.base import DeepDADataset
Creation
Deep domain aware datasets are a unified representation of data for deep methods in skada.
- In those datasets, a data sample has four (optionally, five) attributes:
the data point
X
the label
y
the domain
sample_domain
optionally, the weight
sample_weight
- the sample index
sample_idx
(automatically generated), which is the index of the sample in the dataset, relative to its domain.
- the sample index
Note that the data is not shuffled, so the order of the samples is preserved.
Warning
In a dataset, either all data samples have a weight, or none of them.
On the other hand, it is possible that a sample has no associated label or domain.
In that case, it will be associated to label -1
and domain 0
.
DeepDADatasets can be created from numpy arrays, torch tensors, lists, tuples, or dictionary of one of the former.
If a dictionary is provided, it must contain the keys X
, :code:`y`(optional),
:code:`sample_domain`(optional) and :code:`sample_weight`(optional).
If both dictionary and positional arguments are provided, the dictionary arguments will take precedence over the positional ones.
# practice dataset as numpy arrays
raw_data = make_shifted_datasets(20, 20, random_state=42)
X, y, sample_domain = raw_data
# though these are not technically weights, they will act as such throughout the guide.
weights = np.ones_like(y)
dict_raw_data = {"X": X, "sample_domain": sample_domain, "y": y}
weighted_dict_raw_data = {
"X": X,
"sample_domain": sample_domain,
"y": y,
"sample_weight": weights,
}
dataset = DeepDADataset(X, y, sample_domain)
dataset_from_dict = DeepDADataset(dict_raw_data)
# it is possible to add weights to the dataset, either at creation or later
dataset_with_weights = DeepDADataset(X, y, sample_domain, weights)
dataset_with_weights_from_dict = DeepDADataset(weighted_dict_raw_data)
# these methods change the dataset in place and return the dataset itself
dataset = dataset.add_weights(weights)
dataset = dataset.remove_weights()
It is also possible to create a DeepDADataset from lists, tuples, tensors, pandas dataframes or any combination of those.
Note
Just like for the dictionary, if a pandas dataframe is provided it must
contain the keys X
, y
(optional), sample_domain`(optional)
and :code:`sample_weight
(optional).
Also, the data in the dataframe will take precedence over the positional arguments.
# from lists
dataset_from_list = DeepDADataset(X.tolist(), y.tolist(), sample_domain.tolist())
# from tuples
dataset_from_tuple = DeepDADataset(
tuple(X.tolist()), tuple(y.tolist()), tuple(sample_domain.tolist())
)
# from torch tensors
dataset_from_tensor = DeepDADataset(
torch.tensor(X), torch.tensor(y), torch.tensor(sample_domain)
)
# from pandas dataframe of same structure as the dictionary
df = pd.DataFrame(
{
"X": list(X),
"y": y,
"sample_domain": sample_domain,
"sample_weight": weights,
}
)
dataset_from_df = DeepDADataset(df)
It is also possible to merge two datasets, which will concatenate the data samples, the labels and the domains.
dataset2 = dataset.merge(dataset)
Accessing data
The data can be accessed with the same indexing methods as for a torch tensor.
The returned data is a tuple with a dictionary with the keys X
,
sample_domain
, sample_idx
, and optionally sample_weight
as first element and the corresponding label y
as second element.
- ..note::
The data is stored in torch tensors, with dimension 1 for
sample_domain
,y
andsample_weight
.
It is also possible to access the data through the various selection methods, all of which return DeepDADatasets instances.
# indexing methods return a tuple with the data as dict and the label
first_sample = dataset[0] # first sample
first_five_samples = dataset[0:5] # first five samples
# selecting methods return a DeepDADataset with the selected samples
domain_1_samples = dataset.select_domain(1) # all samples from domain 1
label_1_samples = dataset.select(
lambda label: label == 1, on="y"
) # all samples with label 1
Total running time of the script: (0 minutes 0.012 seconds)