Explain data frame analytics config Added in 7.3.0

POST /_ml/data_frame/analytics/_explain

This API provides explanations for a data frame analytics config that either exists already or one that has not been created yet. The following explanations are provided:

  • which fields are included or not in the analysis and why,
  • how much memory is estimated to be required. The estimate can be used when deciding the appropriate value for model_memory_limit setting later on. If you have object fields or fields that are excluded via source filtering, they are not included in the explanation.
application/json

Body

  • source object
    Hide source attributes Show source attributes object
    • index string | array[string] Required
    • runtime_mappings object
      Hide runtime_mappings attribute Show runtime_mappings attribute object
      • * object Additional properties
        Hide * attributes Show * attributes object
        • fields object

          For type composite

          Hide fields attribute Show fields attribute object
          • * object Additional properties
            Hide * attribute Show * attribute object
            • type string Required

              Values are boolean, composite, date, double, geo_point, geo_shape, ip, keyword, long, or lookup.

        • fetch_fields array[object]

          For type lookup

          Hide fetch_fields attributes Show fetch_fields attributes object
          • field string Required

            Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.

          • format string
        • format string

          A custom format for date type runtime fields.

        • input_field string

          Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.

        • target_field string

          Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.

        • target_index string
        • script object
          Hide script attributes Show script attributes object
          • source string | object

            One of:
          • id string
          • params object

            Specifies any named parameters that are passed into the script as variables. Use parameters instead of hard-coded values to decrease compile time.

            Hide params attribute Show params attribute object
            • * object Additional properties
          • lang string

            Any of:

            Values are painless, expression, mustache, or java.

          • options object
            Hide options attribute Show options attribute object
            • * string Additional properties
        • type string Required

          Values are boolean, composite, date, double, geo_point, geo_shape, ip, keyword, long, or lookup.

    • _source object
      Hide _source attributes Show _source attributes object
      • includes array[string]

        An array of strings that defines the fields that will be excluded from the analysis. You do not need to add fields with unsupported data types to excludes, these fields are excluded from the analysis automatically.

      • excludes array[string]

        An array of strings that defines the fields that will be included in the analysis.

    • query object

      The Elasticsearch query domain-specific language (DSL). This value corresponds to the query object in an Elasticsearch search POST body. All the options that are supported by Elasticsearch can be used, as this object is passed verbatim to Elasticsearch. By default, this property has the following value: {"match_all": {}}.

      Query DSL
  • dest object
    Hide dest attributes Show dest attributes object
    • index string Required
    • results_field string

      Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.

  • analysis object
    Hide analysis attributes Show analysis attributes object
    • classification object
      Hide classification attributes Show classification attributes object
      • alpha number

        Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This parameter affects loss calculations by acting as a multiplier of the tree depth. Higher alpha values result in shallower trees and faster training times. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to zero.

      • dependent_variable string Required

        Defines which field of the document is to be predicted. It must match one of the fields in the index being used to train. If this field is missing from a document, then that document will not be used for training, but a prediction with the trained model will be generated for it. It is also known as continuous target variable. For classification analysis, the data type of the field must be numeric (integer, short, long, byte), categorical (ip or keyword), or boolean. There must be no more than 30 different values in this field. For regression analysis, the data type of the field must be numeric.

      • downsample_factor number

        Advanced configuration option. Controls the fraction of data that is used to compute the derivatives of the loss function for tree training. A small value results in the use of a small fraction of the data. If this value is set to be less than 1, accuracy typically improves. However, too small a value may result in poor convergence for the ensemble and so require more trees. By default, this value is calculated during hyperparameter optimization. It must be greater than zero and less than or equal to 1.

      • early_stopping_enabled boolean

        Advanced configuration option. Specifies whether the training process should finish if it is not finding any better performing models. If disabled, the training process can take significantly longer and the chance of finding a better performing model is unremarkable.

      • eta number

        Advanced configuration option. The shrinkage applied to the weights. Smaller values result in larger forests which have a better generalization error. However, larger forests cause slower training. By default, this value is calculated during hyperparameter optimization. It must be a value between 0.001 and 1.

      • eta_growth_rate_per_tree number

        Advanced configuration option. Specifies the rate at which eta increases for each new tree that is added to the forest. For example, a rate of 1.05 increases eta by 5% for each extra tree. By default, this value is calculated during hyperparameter optimization. It must be between 0.5 and 2.

      • feature_bag_fraction number

        Advanced configuration option. Defines the fraction of features that will be used when selecting a random bag for each candidate split. By default, this value is calculated during hyperparameter optimization.

      • feature_processors array[object]

        Advanced configuration option. A collection of feature preprocessors that modify one or more included fields. The analysis uses the resulting one or more features instead of the original document field. However, these features are ephemeral; they are not stored in the destination index. Multiple feature_processors entries can refer to the same document fields. Automatic categorical feature encoding still occurs for the fields that are unprocessed by a custom processor or that have categorical values. Use this property only if you want to override the automatic feature encoding of the specified fields.

        Hide feature_processors attributes Show feature_processors attributes object
        • frequency_encoding object
          Hide frequency_encoding attributes Show frequency_encoding attributes object
          • feature_name string Required
          • field string Required

            Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.

          • frequency_map object Required

            The resulting frequency map for the field value. If the field value is missing from the frequency_map, the resulting value is 0.

        • multi_encoding object
          Hide multi_encoding attribute Show multi_encoding attribute object
          • processors array[number] Required

            The ordered array of custom processors to execute. Must be more than 1.

        • n_gram_encoding object
          Hide n_gram_encoding attributes Show n_gram_encoding attributes object
          • feature_prefix string

            The feature name prefix. Defaults to ngram__.

          • field string Required

            Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.

          • length number

            Specifies the length of the n-gram substring. Defaults to 50. Must be greater than 0.

          • n_grams array[number] Required

            Specifies which n-grams to gather. It’s an array of integer values where the minimum value is 1, and a maximum value is 5.

          • start number

            Specifies the zero-indexed start of the n-gram substring. Negative values are allowed for encoding n-grams of string suffixes. Defaults to 0.

          • custom boolean
        • one_hot_encoding object
          Hide one_hot_encoding attributes Show one_hot_encoding attributes object
          • field string Required

            Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.

          • hot_map string Required

            The one hot map mapping the field value with the column name.

        • target_mean_encoding object
          Hide target_mean_encoding attributes Show target_mean_encoding attributes object
          • default_value number Required

            The default value if field value is not found in the target_map.

          • feature_name string Required
          • field string Required

            Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.

          • target_map object Required

            The field value to target mean transition map.

      • gamma number

        Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies a linear penalty associated with the size of individual trees in the forest. A high gamma value causes training to prefer small trees. A small gamma value results in larger individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.

      • lambda number

        Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies an L2 regularization term which applies to leaf weights of the individual trees in the forest. A high lambda value causes training to favor small leaf weights. This behavior makes the prediction function smoother at the expense of potentially not being able to capture relevant relationships between the features and the dependent variable. A small lambda value results in large individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.

      • max_optimization_rounds_per_hyperparameter number

        Advanced configuration option. A multiplier responsible for determining the maximum number of hyperparameter optimization steps in the Bayesian optimization procedure. The maximum number of steps is determined based on the number of undefined hyperparameters times the maximum optimization rounds per hyperparameter. By default, this value is calculated during hyperparameter optimization.

      • max_trees number

        Advanced configuration option. Defines the maximum number of decision trees in the forest. The maximum value is 2000. By default, this value is calculated during hyperparameter optimization.

      • num_top_feature_importance_values number

        Advanced configuration option. Specifies the maximum number of feature importance values per document to return. By default, no feature importance calculation occurs.

      • prediction_field_name string

        Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.

      • randomize_seed number

        Defines the seed for the random generator that is used to pick training data. By default, it is randomly generated. Set it to a specific value to use the same training data each time you start a job (assuming other related parameters such as source and analyzed_fields are the same).

      • soft_tree_depth_limit number

        Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This soft limit combines with the soft_tree_depth_tolerance to penalize trees that exceed the specified depth; the regularized loss increases quickly beyond this depth. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.

      • soft_tree_depth_tolerance number

        Advanced configuration option. This option controls how quickly the regularized loss increases when the tree depth exceeds soft_tree_depth_limit. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.01.

      • training_percent string | number

      • class_assignment_objective string
      • num_top_classes number

        Defines the number of categories for which the predicted probabilities are reported. It must be non-negative or -1. If it is -1 or greater than the total number of categories, probabilities are reported for all categories; if you have a large number of categories, there could be a significant effect on the size of your destination index. NOTE: To use the AUC ROC evaluation method, num_top_classes must be set to -1 or a value greater than or equal to the total number of categories.

    • outlier_detection object
      Hide outlier_detection attributes Show outlier_detection attributes object
      • compute_feature_influence boolean

        Specifies whether the feature influence calculation is enabled.

      • feature_influence_threshold number

        The minimum outlier score that a document needs to have in order to calculate its feature influence score. Value range: 0-1.

      • method string

        The method that outlier detection uses. Available methods are lof, ldof, distance_kth_nn, distance_knn, and ensemble. The default value is ensemble, which means that outlier detection uses an ensemble of different methods and normalises and combines their individual outlier scores to obtain the overall outlier score.

      • n_neighbors number

        Defines the value for how many nearest neighbors each method of outlier detection uses to calculate its outlier score. When the value is not set, different values are used for different ensemble members. This default behavior helps improve the diversity in the ensemble; only override it if you are confident that the value you choose is appropriate for the data set.

      • outlier_fraction number

        The proportion of the data set that is assumed to be outlying prior to outlier detection. For example, 0.05 means it is assumed that 5% of values are real outliers and 95% are inliers.

      • standardization_enabled boolean

        If true, the following operation is performed on the columns before computing outlier scores: (x_i - mean(x_i)) / sd(x_i).

    • regression object
      Hide regression attributes Show regression attributes object
      • alpha number

        Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This parameter affects loss calculations by acting as a multiplier of the tree depth. Higher alpha values result in shallower trees and faster training times. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to zero.

      • dependent_variable string Required

        Defines which field of the document is to be predicted. It must match one of the fields in the index being used to train. If this field is missing from a document, then that document will not be used for training, but a prediction with the trained model will be generated for it. It is also known as continuous target variable. For classification analysis, the data type of the field must be numeric (integer, short, long, byte), categorical (ip or keyword), or boolean. There must be no more than 30 different values in this field. For regression analysis, the data type of the field must be numeric.

      • downsample_factor number

        Advanced configuration option. Controls the fraction of data that is used to compute the derivatives of the loss function for tree training. A small value results in the use of a small fraction of the data. If this value is set to be less than 1, accuracy typically improves. However, too small a value may result in poor convergence for the ensemble and so require more trees. By default, this value is calculated during hyperparameter optimization. It must be greater than zero and less than or equal to 1.

      • early_stopping_enabled boolean

        Advanced configuration option. Specifies whether the training process should finish if it is not finding any better performing models. If disabled, the training process can take significantly longer and the chance of finding a better performing model is unremarkable.

      • eta number

        Advanced configuration option. The shrinkage applied to the weights. Smaller values result in larger forests which have a better generalization error. However, larger forests cause slower training. By default, this value is calculated during hyperparameter optimization. It must be a value between 0.001 and 1.

      • eta_growth_rate_per_tree number

        Advanced configuration option. Specifies the rate at which eta increases for each new tree that is added to the forest. For example, a rate of 1.05 increases eta by 5% for each extra tree. By default, this value is calculated during hyperparameter optimization. It must be between 0.5 and 2.

      • feature_bag_fraction number

        Advanced configuration option. Defines the fraction of features that will be used when selecting a random bag for each candidate split. By default, this value is calculated during hyperparameter optimization.

      • feature_processors array[object]

        Advanced configuration option. A collection of feature preprocessors that modify one or more included fields. The analysis uses the resulting one or more features instead of the original document field. However, these features are ephemeral; they are not stored in the destination index. Multiple feature_processors entries can refer to the same document fields. Automatic categorical feature encoding still occurs for the fields that are unprocessed by a custom processor or that have categorical values. Use this property only if you want to override the automatic feature encoding of the specified fields.

        Hide feature_processors attributes Show feature_processors attributes object
        • frequency_encoding object
          Hide frequency_encoding attributes Show frequency_encoding attributes object
          • feature_name string Required
          • field string Required

            Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.

          • frequency_map object Required

            The resulting frequency map for the field value. If the field value is missing from the frequency_map, the resulting value is 0.

        • multi_encoding object
          Hide multi_encoding attribute Show multi_encoding attribute object
          • processors array[number] Required

            The ordered array of custom processors to execute. Must be more than 1.

        • n_gram_encoding object
          Hide n_gram_encoding attributes Show n_gram_encoding attributes object
          • feature_prefix string

            The feature name prefix. Defaults to ngram__.

          • field string Required

            Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.

          • length number

            Specifies the length of the n-gram substring. Defaults to 50. Must be greater than 0.

          • n_grams array[number] Required

            Specifies which n-grams to gather. It’s an array of integer values where the minimum value is 1, and a maximum value is 5.

          • start number

            Specifies the zero-indexed start of the n-gram substring. Negative values are allowed for encoding n-grams of string suffixes. Defaults to 0.

          • custom boolean
        • one_hot_encoding object
          Hide one_hot_encoding attributes Show one_hot_encoding attributes object
          • field string Required

            Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.

          • hot_map string Required

            The one hot map mapping the field value with the column name.

        • target_mean_encoding object
          Hide target_mean_encoding attributes Show target_mean_encoding attributes object
          • default_value number Required

            The default value if field value is not found in the target_map.

          • feature_name string Required
          • field string Required

            Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.

          • target_map object Required

            The field value to target mean transition map.

      • gamma number

        Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies a linear penalty associated with the size of individual trees in the forest. A high gamma value causes training to prefer small trees. A small gamma value results in larger individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.

      • lambda number

        Advanced configuration option. Regularization parameter to prevent overfitting on the training data set. Multiplies an L2 regularization term which applies to leaf weights of the individual trees in the forest. A high lambda value causes training to favor small leaf weights. This behavior makes the prediction function smoother at the expense of potentially not being able to capture relevant relationships between the features and the dependent variable. A small lambda value results in large individual trees and slower training. By default, this value is calculated during hyperparameter optimization. It must be a nonnegative value.

      • max_optimization_rounds_per_hyperparameter number

        Advanced configuration option. A multiplier responsible for determining the maximum number of hyperparameter optimization steps in the Bayesian optimization procedure. The maximum number of steps is determined based on the number of undefined hyperparameters times the maximum optimization rounds per hyperparameter. By default, this value is calculated during hyperparameter optimization.

      • max_trees number

        Advanced configuration option. Defines the maximum number of decision trees in the forest. The maximum value is 2000. By default, this value is calculated during hyperparameter optimization.

      • num_top_feature_importance_values number

        Advanced configuration option. Specifies the maximum number of feature importance values per document to return. By default, no feature importance calculation occurs.

      • prediction_field_name string

        Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.

      • randomize_seed number

        Defines the seed for the random generator that is used to pick training data. By default, it is randomly generated. Set it to a specific value to use the same training data each time you start a job (assuming other related parameters such as source and analyzed_fields are the same).

      • soft_tree_depth_limit number

        Advanced configuration option. Machine learning uses loss guided tree growing, which means that the decision trees grow where the regularized loss decreases most quickly. This soft limit combines with the soft_tree_depth_tolerance to penalize trees that exceed the specified depth; the regularized loss increases quickly beyond this depth. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.

      • soft_tree_depth_tolerance number

        Advanced configuration option. This option controls how quickly the regularized loss increases when the tree depth exceeds soft_tree_depth_limit. By default, this value is calculated during hyperparameter optimization. It must be greater than or equal to 0.01.

      • training_percent string | number

      • loss_function string

        The loss function used during regression. Available options are mse (mean squared error), msle (mean squared logarithmic error), huber (Pseudo-Huber loss).

      • loss_function_parameter number

        A positive number that is used as a parameter to the loss_function.

  • description string

    A description of the job.

  • model_memory_limit string

    The approximate maximum amount of memory resources that are permitted for analytical processing. If your elasticsearch.yml file contains an xpack.ml.max_model_memory_limit setting, an error occurs when you try to create data frame analytics jobs that have model_memory_limit values greater than that setting.

  • max_num_threads number

    The maximum number of threads to be used by the analysis. Using more threads may decrease the time necessary to complete the analysis at the cost of using more CPU. Note that the process may use additional threads for operational functionality other than the analysis itself.

  • analyzed_fields object
    Hide analyzed_fields attributes Show analyzed_fields attributes object
    • includes array[string]

      An array of strings that defines the fields that will be excluded from the analysis. You do not need to add fields with unsupported data types to excludes, these fields are excluded from the analysis automatically.

    • excludes array[string]

      An array of strings that defines the fields that will be included in the analysis.

  • allow_lazy_start boolean

    Specifies whether this job can start when there is insufficient machine learning node capacity for it to be immediately assigned to a node.

Responses

  • 200 application/json
    Hide response attributes Show response attributes object
    • field_selection array[object] Required

      An array of objects that explain selection for each field, sorted by the field names.

      Hide field_selection attributes Show field_selection attributes object
      • is_included boolean Required

        Whether the field is selected to be included in the analysis.

      • is_required boolean Required

        Whether the field is required.

      • feature_type string

        The feature type of this field for the analysis. May be categorical or numerical.

      • mapping_types array[string] Required

        The mapping types of the field.

      • name string Required

        Path to field or array of paths. Some API's support wildcards in the path to select multiple fields.

      • reason string

        The reason a field is not selected to be included in the analysis.

    • memory_estimation object Required
      Hide memory_estimation attributes Show memory_estimation attributes object
      • expected_memory_with_disk string Required

        Estimated memory usage under the assumption that overflowing to disk is allowed during data frame analytics. expected_memory_with_disk is usually smaller than expected_memory_without_disk as using disk allows to limit the main memory needed to perform data frame analytics.

      • expected_memory_without_disk string Required

        Estimated memory usage under the assumption that the whole data frame analytics should happen in memory (i.e. without overflowing to disk).

POST /_ml/data_frame/analytics/_explain
POST _ml/data_frame/analytics/_explain
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price"
    }
  }
}
curl \
 --request POST 'http://api.example.com/_ml/data_frame/analytics/_explain' \
 --header "Authorization: $API_KEY" \
 --header "Content-Type: application/json" \
 --data '"{\n  \"source\": {\n    \"index\": \"houses_sold_last_10_yrs\"\n  },\n  \"analysis\": {\n    \"regression\": {\n      \"dependent_variable\": \"price\"\n    }\n  }\n}"'
Request example
Run `POST _ml/data_frame/analytics/_explain` to explain a data frame analytics job configuration.
{
  "source": {
    "index": "houses_sold_last_10_yrs"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "price"
    }
  }
}
Response examples (200)
A succesful response for explaining a data frame analytics job configuration.
{
  "field_selection": [
    {
      "field": "number_of_bedrooms",
      "mappings_types": [
        "integer"
      ],
      "is_included": true,
      "is_required": false,
      "feature_type": "numerical"
    },
    {
      "field": "postcode",
      "mappings_types": [
        "text"
      ],
      "is_included": false,
      "is_required": false,
      "reason": "[postcode.keyword] is preferred because it is aggregatable"
    },
    {
      "field": "postcode.keyword",
      "mappings_types": [
        "keyword"
      ],
      "is_included": true,
      "is_required": false,
      "feature_type": "categorical"
    },
    {
      "field": "price",
      "mappings_types": [
        "float"
      ],
      "is_included": true,
      "is_required": true,
      "feature_type": "numerical"
    }
  ],
  "memory_estimation": {
    "expected_memory_without_disk": "128MB",
    "expected_memory_with_disk": "32MB"
  }
}