You need to set the standard dictionary in the config so that FreqAI can return proper dataframe shapes. These values will likely be overridden by the prediction model, but in the case where the model has yet to set them, or needs a default initial value, the pre-set values are what will be returned.
## Feature normalization
FreqAI is strict when it comes to data normalization. The train features, $X^{train}$, are always normalized to [-1, 1] using a shifted min-max normalization:
$$X^{train}_{norm} = 2 * \frac{X^{train} - X^{train}.min()}{X^{train}.max() - X^{train}.min()} - 1$$
All other data (test data and unseen prediction data in dry/live/backtest) is always automatically normalized to the training feature space according to industry standards. FreqAI stores all the metadata required to ensure that test and prediction features will be properly normalized and that predictions are properly denormalized. For this reason, it is not recommended to eschew industry standards and modify FreqAI internals; however, advanced users can do so by inheriting `train()` in their custom `IFreqaiModel` and using their own normalization functions.
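
Denormalization simply inverts this mapping: rearranging the formula above, a prediction $X_{norm}$ in the normalized space maps back to the original scale as

$$X = \frac{(X_{norm} + 1)(X^{train}.max() - X^{train}.min())}{2} + X^{train}.min()$$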
## Data dimensionality reduction with Principal Component Analysis
You can reduce the dimensionality of your features by activating `principal_component_analysis` in the config:
```json
    "freqai": {
        "feature_parameters" : {
            "principal_component_analysis": true
        }
    }
```
This will perform PCA on the features and reduce their dimensionality so that the explained variance of the data set is >= 0.999. Reducing data dimensionality makes training the model faster and hence allows for more up-to-date models.
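
For intuition, the equivalent operation in plain scikit-learn would be a sketch like the following (FreqAI fits and stores the PCA internally; `train_features` is assumed to be an `(n_samples, n_features)` array):

```python
from sklearn.decomposition import PCA

# Keep just enough components to explain >= 99.9% of the variance
pca = PCA(n_components=0.999, svd_solver="full")
train_features_reduced = pca.fit_transform(train_features)
```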
## Inlier metric
The `inlier_metric` is a metric aimed at quantifying how similar the features of a data point are to the most recent historical data points.
You define the lookback window by setting `inlier_metric_window` and FreqAI computes the distance between the present time point and each of the previous `inlier_metric_window` lookback points. A Weibull function is fit to each of the lookback distributions and its cumulative distribution function (CDF) is used to produce a quantile for each lookback point. The `inlier_metric` is then computed for each time point as the average of the corresponding lookback quantiles. The figure below explains the concept for an `inlier_metric_window` of 5.

FreqAI adds the `inlier_metric` to the training features and hence gives the model access to a novel type of temporal information.
This function does **not** remove outliers from the data set.
## Weighting features for temporal importance
FreqAI allows you to set a `weight_factor` to weight recent data more strongly than past data via an exponential function:
$$ W_i = \exp\left(-\frac{i}{\alpha n}\right) $$

where $W_i$ is the weight of data point $i$ in a total set of $n$ data points and $\alpha$ is the user-set `weight_factor`. Below is a figure showing the effect of different weight factors on the data points in a feature set.

## Building the data pipeline
By default, FreqAI builds a dynamic pipeline based on user configuration settings. The default settings are robust and designed to work with a variety of methods. The two default steps are a `MinMaxScaler(-1,1)` and a `VarianceThreshold`, which removes any column with zero variance. Users can activate additional steps with more configuration parameters. For example, adding `use_SVM_to_remove_outliers: true` to the `freqai` config will automatically add the [`SVMOutlierExtractor`](#identifying-outliers-using-a-support-vector-machine-svm) to the pipeline. Likewise, adding `principal_component_analysis: true` to the `freqai` config will activate PCA. The [DissimilarityIndex](#identifying-outliers-with-the-dissimilarity-index-di) is activated with `DI_threshold: 1`, noise can be added to the data with `noise_standard_deviation: 0.1`, and [DBSCAN](#identifying-outliers-with-dbscan) outlier removal is activated with `use_DBSCAN_to_remove_outliers: true`.
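
For example, a config activating several of these optional steps at once might look like the following (the specific values are illustrative):

```json
    "freqai": {
        "feature_parameters" : {
            "principal_component_analysis": true,
            "use_SVM_to_remove_outliers": true,
            "DI_threshold": 1,
            "noise_standard_deviation": 0.1,
            "use_DBSCAN_to_remove_outliers": true
        }
    }
```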
!!! note "More information available"
    Please review the [parameter table](freqai-parameter-table.md) for more information on these parameters.
### Customizing the pipeline
Users are encouraged to customize the data pipeline to their needs by building their own. This can be done by setting `dk.feature_pipeline` to the desired `Pipeline` object inside the `IFreqaiModel` `train()` function, or, for those who prefer not to touch the `train()` function, by overriding `define_data_pipeline`/`define_label_pipeline` in their `IFreqaiModel`:
!!! note "More information available"
    FreqAI uses the [`DataSieve`](https://github.com/emergentmethods/datasieve) pipeline, which follows the SKlearn pipeline API but adds, among other features, coherent point removal across the X, y, and sample_weight vectors, feature removal, and feature name tracking.
```python
from typing import Any, Dict

from datasieve.transforms import SKLearnWrapper, DissimilarityIndex
from datasieve.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer, StandardScaler
from freqai.base_models import BaseRegressionModel
from freqtrade.freqai.data_kitchen import FreqaiDataKitchen


class MyFreqaiModel(BaseRegressionModel):
    """
    Some cool custom model
    """
    def fit(self, data_dictionary: Dict, dk: FreqaiDataKitchen, **kwargs) -> Any:
        """
        My custom fit function
        """
        model = cool_model.fit()
        return model

    def define_data_pipeline(self) -> Pipeline:
        """
        User defines their custom feature pipeline here (if they wish)
        """
        feature_pipeline = Pipeline([
            ('qt', SKLearnWrapper(QuantileTransformer(output_distribution='normal'))),
            ('di', DissimilarityIndex(di_threshold=1))
        ])

        return feature_pipeline

    def define_label_pipeline(self) -> Pipeline:
        """
        User defines their custom label pipeline here (if they wish)
        """
        label_pipeline = Pipeline([
            ('scaler', SKLearnWrapper(StandardScaler())),
        ])

        return label_pipeline
```
Here, you are defining the exact pipeline that will be used for your feature set during training and prediction. You can use *most* SKLearn transformation steps by wrapping them in the `SKLearnWrapper` class as shown above. In addition, you can use any of the transformations available in the [`DataSieve` library](https://github.com/emergentmethods/datasieve).
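
Alternatively, if you prefer to set the pipeline directly inside `train()` rather than overriding the define functions, a minimal sketch (same imports as above) could look like:

```python
    # inside your IFreqaiModel subclass
    def train(self, unfiltered_df, pair, dk, **kwargs):
        # Assign a custom pipeline directly instead of overriding define_data_pipeline()
        dk.feature_pipeline = Pipeline([
            ('qt', SKLearnWrapper(QuantileTransformer(output_distribution='normal'))),
        ])
        # ... continue with the usual training logic
```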
You can easily add your own transformation by creating a class that inherits from the datasieve `BaseTransform` and implementing your `fit()`, `transform()` and `inverse_transform()` methods:
```python
from datasieve.transforms.base_transform import BaseTransform
# import whatever else you need

class MyCoolTransform(BaseTransform):
    def __init__(self, **kwargs):
        self.param1 = kwargs.get('param1', 1)

    def fit(self, X, y=None, sample_weight=None, feature_list=None, **kwargs):
        # do something with X, y, sample_weight, and/or feature_list
        return X, y, sample_weight, feature_list

    def transform(self, X, y=None, sample_weight=None,
                  feature_list=None, outlier_check=False, **kwargs):
        # do something with X, y, sample_weight, and/or feature_list
        return X, y, sample_weight, feature_list

    def inverse_transform(self, X, y=None, sample_weight=None, feature_list=None, **kwargs):
        # optionally invert the transform on X, y, sample_weight, and/or feature_list
        return X, y, sample_weight, feature_list
```
!!! note "Hint"
    You can define this custom class in the same file as your `IFreqaiModel`.
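
For example, the custom step could then be slotted into the feature pipeline defined earlier (the step name `my_cool_step` and the `param1` value are purely illustrative):

```python
feature_pipeline = Pipeline([
    ('qt', SKLearnWrapper(QuantileTransformer(output_distribution='normal'))),
    ('my_cool_step', MyCoolTransform(param1=2)),
    ('di', DissimilarityIndex(di_threshold=1))
])
```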
### Migrating a custom `IFreqaiModel` to the new Pipeline
If you have created your own custom `IFreqaiModel` with a custom `train()`/`predict()` function, *and* you still rely on `data_cleaning_train/predict()`, then you will need to migrate to the new pipeline. If your model does *not* rely on `data_cleaning_train/predict()`, then you do not need to worry about this migration.
More details about the migration can be found [here](strategy_migration.md#freqai---new-data-pipeline).
## Outlier detection
Equity and crypto markets suffer from a high level of non-patterned noise in the form of outlier data points. FreqAI implements a variety of methods to identify such outliers and hence mitigate risk.
### Identifying outliers with the Dissimilarity Index (DI)
The Dissimilarity Index (DI) aims to quantify the uncertainty associated with each prediction made by the model.
You can tell FreqAI to remove outlier data points from the training/test data sets using the DI by including the following statement in the config:

```json
    "freqai": {
        "feature_parameters" : {
            "DI_threshold": 1
        }
    }
```
This will add the `DissimilarityIndex` step to your `feature_pipeline` and set the threshold to 1. The DI allows predictions which are outliers (not existent in the model feature space) to be thrown out due to low levels of certainty. To do so, FreqAI measures the distance between each training data point (feature vector), $X_{a}$, and all other training data points:

$$ d_{ab} = \sqrt{\sum_{j=1}^p(X_{a,j}-X_{b,j})^2} $$
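
These pairwise distances characterize the spread of the training feature space. In brief (summarizing the rest of the DI derivation), the average training distance $\overline{d}$ estimated from all $d_{ab}$ serves as a yardstick: each incoming prediction point $k$, at distance $d_k$ from the training data, is scored as

$$ DI_k = \frac{d_k}{\overline{d}} $$

and predictions whose DI exceeds the configured `DI_threshold` are treated as outliers and excluded via `do_predict`.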
### Identifying outliers using a Support Vector Machine (SVM)

You can tell FreqAI to remove outlier data points from the training/test data sets using a Support Vector Machine (SVM) by including the following statement in the config:

```json
    "freqai": {
        "feature_parameters" : {
            "use_SVM_to_remove_outliers": true
        }
    }
```
This will add the `SVMOutlierExtractor` step to your `feature_pipeline`. The SVM will be trained on the training data, and any data point that the SVM deems to be beyond the feature space will be removed.

FreqAI uses `sklearn.linear_model.SGDOneClassSVM` (details are available on scikit-learn's webpage [here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDOneClassSVM.html) (external website)). You can elect to provide additional parameters for the SVM, such as `shuffle` and `nu`, via the `feature_parameters.svm_params` dictionary in the config.

The parameter `shuffle` is by default set to `False` to ensure consistent results. If it is set to `True`, running the SVM multiple times on the same data set might result in different outcomes due to `max_iter` being too low for the algorithm to reach the demanded `tol`. Increasing `max_iter` solves this issue, but causes the procedure to take longer.
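
For example, a hypothetical `svm_params` block might look like the following (the values shown are illustrative, not recommendations):

```json
    "freqai": {
        "feature_parameters" : {
            "use_SVM_to_remove_outliers": true,
            "svm_params": {
                "shuffle": false,
                "nu": 0.1
            }
        }
    }
```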
### Identifying outliers with DBSCAN

You can configure FreqAI to use DBSCAN to cluster and remove outliers from the training/test data set or incoming outliers from predictions by activating `use_DBSCAN_to_remove_outliers` in the config:

```json
    "freqai": {
        "feature_parameters" : {
            "use_DBSCAN_to_remove_outliers": true
        }
    }
```
This will add the `DataSieveDBSCAN` step to your `feature_pipeline`. DBSCAN is an unsupervised machine learning algorithm that clusters data without needing to know how many clusters there should be.

Given a number of data points $N$, and a distance $\varepsilon$, DBSCAN clusters the data set by setting all data points that have $N-1$ other data points within a distance of $\varepsilon$ as *core points*. A data point that is within a distance of $\varepsilon$ from a *core point* but that does not have $N-1$ other data points within a distance of $\varepsilon$ from itself is considered an *edge point*. A cluster is then the collection of *core points* and *edge points*. Data points that have no other data points at a distance $<\varepsilon$ are considered outliers. The figure below shows a cluster with $N = 3$.
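
The underlying idea can be illustrated with scikit-learn's `DBSCAN` directly (a sketch only; FreqAI's `DataSieveDBSCAN` step handles parameter selection and keeps X, y, and sample weights consistent):

```python
from sklearn.cluster import DBSCAN

# train_features: (n_samples, n_features) array of training features (assumed)
clustering = DBSCAN(eps=0.5, min_samples=3).fit(train_features)
outlier_mask = clustering.labels_ == -1  # DBSCAN labels outliers as -1
train_features_clean = train_features[~outlier_mask]
```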
### FreqAI - New data Pipeline
If you have created your own custom `IFreqaiModel` with a custom `train()`/`predict()` function, *and* you still rely on `data_cleaning_train/predict()`, then you will need to migrate to the new pipeline. If your model does *not* rely on `data_cleaning_train/predict()`, then you do not need to worry about this migration. That means that this migration guide is relevant for a very small percentage of power-users. If you stumbled upon this guide by mistake, feel free to inquire in depth about your problem in the Freqtrade discord server.
The conversion involves first removing `data_cleaning_train/predict()` and then adding `define_data_pipeline()` and `define_label_pipeline()` functions to your `IFreqaiModel` class:
```python linenums="1" hl_lines="11-14 47-49 55-57"
class MyCoolFreqaiModel(BaseRegressionModel):
    """
    Some cool custom IFreqaiModel you made before Freqtrade version 2023.6
    """
    def train(
        self, unfiltered_df: DataFrame, pair: str, dk: FreqaiDataKitchen, **kwargs
    ) -> Any:

        # ... your custom stuff

        # Remove these lines
        # data_dictionary = dk.make_train_test_datasets(features_filtered, labels_filtered)
        # self.data_cleaning_train(dk)
        # data_dictionary = dk.normalize_data(data_dictionary)
        # (1)

        # Add these lines. Now we control the pipeline fit/transform ourselves
        dd = dk.make_train_test_datasets(features_filtered, labels_filtered)
        dk.feature_pipeline = self.define_data_pipeline(threads=dk.thread_count)
        dk.label_pipeline = self.define_label_pipeline(threads=dk.thread_count)

        (dd["train_features"],
         dd["train_labels"],
         dd["train_weights"]) = dk.feature_pipeline.fit_transform(dd["train_features"],
                                                                  dd["train_labels"],
                                                                  dd["train_weights"])

        (dd["test_features"],
         dd["test_labels"],
         dd["test_weights"]) = dk.feature_pipeline.transform(dd["test_features"],
                                                             dd["test_labels"],
                                                             dd["test_weights"])

        dd["train_labels"], _, _ = dk.label_pipeline.fit_transform(dd["train_labels"])
        dd["test_labels"], _, _ = dk.label_pipeline.transform(dd["test_labels"])

        # ... your custom code

        return model

    def predict(
        self, unfiltered_df: DataFrame, dk: FreqaiDataKitchen, **kwargs
    ) -> Tuple[DataFrame, npt.NDArray[np.int_]]:

        # ... your custom stuff

        # Remove these lines:
        # self.data_cleaning_predict(dk)
        # (2)

        # Add these lines:
        dk.data_dictionary["prediction_features"], outliers, _ = dk.feature_pipeline.transform(
            dk.data_dictionary["prediction_features"], outlier_check=True)

        # Remove this line
        # pred_df = dk.denormalize_labels_from_metadata(pred_df)
        # (3)

        # Replace with these lines
        pred_df, _, _ = dk.label_pipeline.inverse_transform(pred_df)
        if self.freqai_info.get("DI_threshold", 0) > 0:
            dk.DI_values = dk.feature_pipeline["di"].di_values
        else:
            dk.DI_values = np.zeros(len(outliers.index))
        dk.do_predict = outliers.to_numpy()

        # ... your custom code
        return (pred_df, dk.do_predict)
```
1. Data normalization and cleaning is now homogenized with the new pipeline definition. This is created in the new `define_data_pipeline()` and `define_label_pipeline()` functions. The `data_cleaning_train()` and `data_cleaning_predict()` functions are no longer used. You can override `define_data_pipeline()` to create your own custom pipeline if you wish.
2. Data normalization and cleaning is now homogenized with the new pipeline definition. This is created in the new `define_data_pipeline()` and `define_label_pipeline()` functions. The `data_cleaning_train()` and `data_cleaning_predict()` functions are no longer used. You can override `define_data_pipeline()` to create your own custom pipeline if you wish.
3. Data denormalization is done with the new pipeline. Replace this with the lines below.