.. DO NOT EDIT.
.. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY.
.. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE:
.. "auto_examples/plot_how_to_use_skada.py"
.. LINE NUMBERS ARE GIVEN BELOW.

.. only:: html

    .. note::
        :class: sphx-glr-download-link-note

        :ref:`Go to the end <sphx_glr_download_auto_examples_plot_how_to_use_skada.py>`
        to download the full example code.

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_how_to_use_skada.py:


How to use SKADA
====================================================

This is a short example to get started with SKADA and perform domain adaptation
on a simple dataset. It illustrates the API choice specific to DA.

.. GENERATED FROM PYTHON SOURCE LINES 8-14

.. code-block:: Python

    # Author: Remi Flamary
    #
    # License: BSD 3-Clause
    # sphinx_gallery_thumbnail_number = 1

.. GENERATED FROM PYTHON SOURCE LINES 15-38

.. code-block:: Python

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    from skada import (
        CORAL,
        CORALAdapter,
        GaussianReweightAdapter,
        PerDomain,
        SelectSource,
        SelectSourceTarget,
        make_da_pipeline,
        source_target_split,
    )
    from skada.datasets import make_shifted_datasets
    from skada.metrics import PredictionEntropyScorer
    from skada.model_selection import SourceTargetShuffleSplit

.. GENERATED FROM PYTHON SOURCE LINES 39-50

DA dataset
----------

We generate a simple 2D DA dataset. Note that DA datasets provided by SKADA
are organized as follows:

* :code:`X` is the input data, including the source and the target samples
* :code:`y` is the output data to be predicted (labels on target samples are
  not used when fitting the DA estimator)
* :code:`sample_domain` encodes the domain of each sample (integer >= 0 for
  source and < 0 for target)

.. GENERATED FROM PYTHON SOURCE LINES 50-74

.. code-block:: Python

    # Get DA dataset
    X, y, sample_domain = make_shifted_datasets(
        20, 20, shift="concept_drift", random_state=42
    )

    # split source and target for visualization
    Xs, Xt, ys, yt = source_target_split(X, y, sample_domain=sample_domain)
    sample_domain_s = np.ones(Xs.shape[0])
    sample_domain_t = -np.ones(Xt.shape[0]) * 2

    # plot data
    plt.figure(1, (10, 5))

    plt.subplot(1, 2, 1)
    plt.scatter(Xs[:, 0], Xs[:, 1], c=ys, cmap="tab10", vmax=9, label="Source")
    plt.title("Source data")
    ax = plt.axis()

    plt.subplot(1, 2, 2)
    plt.scatter(Xt[:, 0], Xt[:, 1], c=yt, cmap="tab10", vmax=9, label="Target")
    plt.axis(ax)
    plt.title("Target data")

.. image-sg:: /auto_examples/images/sphx_glr_plot_how_to_use_skada_001.png
   :alt: Source data, Target data
   :srcset: /auto_examples/images/sphx_glr_plot_how_to_use_skada_001.png
   :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Text(0.5, 1.0, 'Target data')
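As a quick sanity check on the :code:`sample_domain` convention (this sketch is
not part of the generated example), the snippet below only uses :code:`numpy`
and the variables defined above to count how many samples carry each domain
label and to confirm they match the split returned by
:code:`source_target_split`.

.. code-block:: Python

    # Sanity check on the sample_domain convention (not part of the original
    # example): labels >= 0 mark source samples, labels < 0 mark target samples.
    domains, counts = np.unique(sample_domain, return_counts=True)
    print("Domain labels:", domains)
    print("Samples per domain:", counts)

    # The counts are consistent with the source/target split used for plotting
    print("Source samples:", int((sample_domain >= 0).sum()), "vs Xs:", Xs.shape[0])
    print("Target samples:", int((sample_domain < 0).sum()), "vs Xt:", Xt.shape[0])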
.. GENERATED FROM PYTHON SOURCE LINES 75-81

DA Classifier estimator
-----------------------

SKADA estimators are used like scikit-learn estimators. The only difference
is that the :code:`sample_domain` array must be passed by name when fitting
the estimator.

.. GENERATED FROM PYTHON SOURCE LINES 81-96

.. code-block:: Python

    # create a DA estimator
    clf = CORAL()

    # train on all data
    clf.fit(X, y, sample_domain=sample_domain)

    # estimator is designed to predict on target by default
    yt_pred = clf.predict(Xt)

    # accuracy on source and target
    print("Accuracy on source:", clf.score(Xs, ys))
    print("Accuracy on target:", clf.score(Xt, yt))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Accuracy on source: 0.84375
    Accuracy on target: 1.0

.. GENERATED FROM PYTHON SOURCE LINES 97-100

DA estimator in a pipeline
--------------------------

SKADA estimators can be used as the final estimator of a scikit-learn
pipeline. Again, the only difference is that the :code:`sample_domain` array
must be passed by name when calling :code:`fit`.

.. GENERATED FROM PYTHON SOURCE LINES 100-112

.. code-block:: Python

    # create a DA pipeline
    pipe = make_pipeline(StandardScaler(), CORAL(base_estimator=SVC()))
    pipe.fit(X, y, sample_domain=sample_domain)

    print("Accuracy on target:", pipe.score(Xt, yt))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Accuracy on target: 1.0

.. GENERATED FROM PYTHON SOURCE LINES 113-124

DA Adapter pipeline
-------------------

Several SKADA estimators include a data adapter that transforms the input
data so that a scikit-learn estimator can be used. For those methods, SKADA
provides an :code:`Adapter` class that can be used in a DA pipeline built
with :code:`make_da_pipeline`.

Here is an example with the CORAL and GaussianReweight adapters.

.. warning::
   As illustrated below for reweighting adapters, one needs a subsequent
   estimator that takes :code:`sample_weight` as an input parameter. This can
   be done with the :code:`set_fit_request` method of the estimator, by
   calling :code:`.set_fit_request(sample_weight=True)`. If the estimator
   (whether used in a pipeline or as a DA estimator) does not require sample
   weights, the DA pipeline will raise an error.

.. GENERATED FROM PYTHON SOURCE LINES 124-150

.. code-block:: Python

    # create a DA pipeline with CORAL adapter
    pipe = make_da_pipeline(StandardScaler(), CORALAdapter(), SVC())
    pipe.fit(X, y, sample_domain=sample_domain)

    print("Accuracy on target:", pipe.score(Xt, yt))

    # create a DA pipeline with GaussianReweight adapter (does not work well on
    # concept drift).
    pipe = make_da_pipeline(
        StandardScaler(),
        GaussianReweightAdapter(),
        LogisticRegression().set_fit_request(sample_weight=True),
    )
    pipe.fit(X, y, sample_domain=sample_domain)

    print("Accuracy on target:", pipe.score(Xt, yt))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Accuracy on target: 1.0
    Accuracy on target: 0.5

.. GENERATED FROM PYTHON SOURCE LINES 151-157

DA estimators with score cross-validation
-----------------------------------------

DA estimators are compatible with scikit-learn cross-validation functions.
Note that the :code:`sample_domain` array must be passed in the
:code:`params` dictionary of the :code:`cross_val_score` function.

.. GENERATED FROM PYTHON SOURCE LINES 157-174

.. code-block:: Python

    # splitter for cross-validation of score
    cv = SourceTargetShuffleSplit(random_state=0)

    # DA scorer not using target labels (not available in DA)
    scorer = PredictionEntropyScorer()

    clf = CORAL(SVC(probability=True))  # needs probability for entropy score

    # cross-validation
    scores = cross_val_score(
        clf, X, y, params={"sample_domain": sample_domain}, cv=cv, scoring=scorer
    )

    print(f"Entropy score: {scores.mean():1.2f} (+-{scores.std():1.2f})")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Entropy score: -0.02 (+-0.01)
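The same splitter and unsupervised scorer can also be used to compare several
DA pipelines. The sketch below is not part of the generated example; it simply
reuses :code:`cross_val_score` with the :code:`cv` and :code:`scorer` objects
defined above on the two adapter pipelines from the previous section.

.. code-block:: Python

    # Sketch (not in the original example): compare two DA pipelines with the
    # same splitter and the unsupervised entropy scorer defined above.
    candidates = {
        "CORAL + SVC": make_da_pipeline(
            StandardScaler(), CORALAdapter(), SVC(probability=True)
        ),
        "GaussianReweight + LogReg": make_da_pipeline(
            StandardScaler(),
            GaussianReweightAdapter(),
            LogisticRegression().set_fit_request(sample_weight=True),
        ),
    }

    for name, est in candidates.items():
        scores = cross_val_score(
            est, X, y, params={"sample_domain": sample_domain}, cv=cv, scoring=scorer
        )
        print(f"{name}: {scores.mean():1.2f} (+-{scores.std():1.2f})")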
.. GENERATED FROM PYTHON SOURCE LINES 175-181

DA estimator with grid search
-----------------------------

DA estimators are also compatible with scikit-learn grid search functions.
Note that the :code:`sample_domain` array must be passed to the :code:`fit`
method of the grid search.

.. GENERATED FROM PYTHON SOURCE LINES 181-201

.. code-block:: Python

    reg_coral = [0.1, 0.5, 1, "auto"]
    clf = make_da_pipeline(StandardScaler(), CORALAdapter(), SVC(probability=True))

    # grid search
    grid_search = GridSearchCV(
        estimator=clf,
        param_grid={"coraladapter__reg": reg_coral},
        cv=SourceTargetShuffleSplit(random_state=0),
        scoring=PredictionEntropyScorer(),
    )

    grid_search.fit(X, y, sample_domain=sample_domain)

    print("Best regularization parameter:", grid_search.best_params_["coraladapter__reg"])
    print("Accuracy on target:", np.mean(grid_search.predict(Xt) == yt))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Best regularization parameter: 0.1
    Accuracy on target: 1.0

.. GENERATED FROM PYTHON SOURCE LINES 202-214

Advanced DA pipeline
--------------------

The DA pipeline can be used with any estimator and any adapter. More
importantly, all estimators in the pipeline are automatically wrapped in what
SKADA calls a `Selector`. The selector is a wrapper that lets you choose
which data each estimator sees during fit and predict/transform.

In the following example, one StandardScaler is fitted per domain, while a
single SVC is trained on source data only. When predicting on target data,
the pipeline automatically uses the StandardScaler fitted on target and the
SVC trained on source.

.. GENERATED FROM PYTHON SOURCE LINES 214-228

.. code-block:: Python

    # create a DA pipeline with SelectSourceTarget estimators
    pipe = make_da_pipeline(
        SelectSourceTarget(StandardScaler()),
        SelectSource(SVC()),
    )

    pipe.fit(X, y, sample_domain=sample_domain)

    print("Accuracy on source:", pipe.score(Xs, ys, sample_domain=sample_domain_s))
    print("Accuracy on target:", pipe.score(Xt, yt))  # target by default

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Accuracy on source: 1.0
    Accuracy on target: 1.0

.. GENERATED FROM PYTHON SOURCE LINES 229-232

Similarly, one can use the PerDomain selector to train a different estimator
per domain, which makes it possible to handle multiple source and target
domains. In this case, :code:`sample_domain` must be provided to fit and
predict/transform.

.. GENERATED FROM PYTHON SOURCE LINES 232-242

.. code-block:: Python

    pipe = make_da_pipeline(
        PerDomain(StandardScaler()),
        SelectSource(SVC()),
    )

    pipe.fit(X, y, sample_domain=sample_domain)

    print("Accuracy on all data:", pipe.score(X, y, sample_domain=sample_domain))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Accuracy on all data: 1.0

.. GENERATED FROM PYTHON SOURCE LINES 243-245

One can also set a default selector for the whole pipeline, which allows, for
instance, training the whole pipeline on the source data only, as follows:

.. GENERATED FROM PYTHON SOURCE LINES 245-256

.. code-block:: Python

    pipe_train_on_source = make_da_pipeline(
        StandardScaler(),
        SVC(),
        default_selector=SelectSource,
    )

    pipe_train_on_source.fit(X, y, sample_domain=sample_domain)

    print("Accuracy on source:", pipe_train_on_source.score(Xs, ys))
    print("Accuracy on target:", pipe_train_on_source.score(Xt, yt))

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Accuracy on source: 1.0
    Accuracy on target: 0.5
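To make the effect of :code:`default_selector=SelectSource` concrete, the
sketch below (not part of the generated example) fits a plain scikit-learn
pipeline on the source samples only and compares its target predictions with
those of :code:`pipe_train_on_source`. Under the assumption that both SVCs see
exactly the same source data, the predictions should agree.

.. code-block:: Python

    # Sketch (not in the original example): with default_selector=SelectSource
    # every step of the DA pipeline is fitted on the source samples only, so it
    # should behave like a plain scikit-learn pipeline trained on (Xs, ys).
    plain_pipe = make_pipeline(StandardScaler(), SVC())
    plain_pipe.fit(Xs, ys)

    agreement = np.mean(plain_pipe.predict(Xt) == pipe_train_on_source.predict(Xt))
    print("Fraction of identical target predictions:", agreement)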
.. GENERATED FROM PYTHON SOURCE LINES 257-260

One can also use a default selector for the whole pipeline but override it for
the last estimator. In the example below, a :code:`StandardScaler` and a
:code:`PCA` are fitted per domain, but the final SVC is trained on source data
only.

.. GENERATED FROM PYTHON SOURCE LINES 260-275

.. code-block:: Python

    pipe_perdomain = make_da_pipeline(
        StandardScaler(),
        PCA(n_components=2),
        SelectSource(SVC()),
        default_selector=SelectSourceTarget,
    )

    pipe_perdomain.fit(X, y, sample_domain=sample_domain)

    print(
        "Accuracy on source:", pipe_perdomain.score(Xs, ys, sample_domain=sample_domain_s)
    )
    print(
        "Accuracy on target:", pipe_perdomain.score(Xt, yt, sample_domain=sample_domain_t)
    )

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Accuracy on source: 1.0
    Accuracy on target: 1.0


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** (0 minutes 1.697 seconds)


.. _sphx_glr_download_auto_examples_plot_how_to_use_skada.py:

.. only:: html

  .. container:: sphx-glr-footer sphx-glr-footer-example

    .. container:: sphx-glr-download sphx-glr-download-jupyter

      :download:`Download Jupyter notebook: plot_how_to_use_skada.ipynb <plot_how_to_use_skada.ipynb>`

    .. container:: sphx-glr-download sphx-glr-download-python

      :download:`Download Python source code: plot_how_to_use_skada.py <plot_how_to_use_skada.py>`

    .. container:: sphx-glr-download sphx-glr-download-zip

      :download:`Download zipped: plot_how_to_use_skada.zip <plot_how_to_use_skada.zip>`

.. only:: html

  .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_