com.amazonaws.services.sagemaker.sparksdk.algorithms
The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
The SageMaker TrainingJob Instance Type to use.
The number of instances of instanceType to run a SageMaker Training Job with.
The SageMaker Endpoint Config instance type.
The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage.
Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
Deserializes an Endpoint response into a series of Rows.
An S3 location to upload SageMaker Training Job input data to.
An S3 location for SageMaker to store Training Job output data to.
The EBS volume size, in gigabytes, of each instance.
The columns to project from the Dataset being fit before training. If an Optional.empty is passed then no specific projection will occur and all columns will be serialized.
The SageMaker Channel name to which serialized Dataset fit input is uploaded.
The MIME type of the training data.
The SageMaker Training Job S3 data distribution scheme.
The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
The Spark Data Format Options used during serialization of the Dataset being fit.
The SageMaker Training Job Channel input mode.
The type of compression to use when serializing the Dataset being fit for input to SageMaker.
A SageMaker Training Job Termination Condition MaxRuntimeInHours.
A KMS key ID for the Output Data Source.
The environment variables that SageMaker will set on the model container during execution.
Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
The region in which to run the algorithm. If not specified, gets the region from the DefaultAwsRegionProviderChain.
AmazonS3. Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
AmazonSTS. Used to resolve the account number when creating staging input / output buckets.
Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
Whether to remove the training data from S3 after the training job completes or fails.
The NamePolicyFactory to use when naming SageMaker entities created during fit.
The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
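The parameters above are supplied when the estimator is constructed. A minimal construction sketch in Scala, assuming defaults for everything not shown; the role ARN and instance settings are placeholders, not recommendations:

    import com.amazonaws.services.sagemaker.sparksdk.IAMRole
    import com.amazonaws.services.sagemaker.sparksdk.algorithms.XGBoostSageMakerEstimator

    // Placeholder role ARN; substitute a role that can access S3 and ECR.
    val estimator = new XGBoostSageMakerEstimator(
      sagemakerRole = IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"),
      trainingInstanceType = "ml.m4.xlarge",
      trainingInstanceCount = 1,
      endpointInstanceType = "ml.m4.xlarge",
      endpointInitialInstanceCount = 1)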
L1 regularization term on weights. Increasing this value makes the model more conservative. Default = 0
The initial prediction score of all instances, global bias. Default = 0.5
Which booster to use. Can be gbtree, gblinear or dart. The gbtree and dart values use a tree-based model, while gblinear uses a linear function. Default = gbtree
Subsample ratio of columns for each split, in each level. Must be in (0, 1]. Default = 1
Subsample ratio of columns when constructing each tree. Must be in (0, 1]. Default = 1
Whether to remove the training data from S3 after the training job completes or fails.
Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage.
The SageMaker Endpoint Config instance type.
Step size shrinkage used in update to prevent overfitting. After each boosting step, we can directly get the weights of new features, and eta shrinks the feature weights to make the boosting process more conservative. Must be in [0, 1]. Default = 0.3
Evaluation metrics for validation data. A default metric is assigned according to the objective: rmse for regression, error for classification, and map for ranking.
Fits a SageMakerModel on dataSet by running a SageMaker training job.
Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm will be. Must be >= 0. Default = 0
Controls the way that new nodes are added to the tree. Can be "depthwise" or "lossguide". Currently supported only if tree_method is set to hist. Default = "depthwise"
A map from hyperParameter names to their respective values for training.
L2 regularization term on weights. Increasing this value makes the model more conservative. Default = 1
L2 regularization term on bias. Must be in [0, 1]. Default = 0.0
Maximum number of discrete bins to bucket continuous features. Used only if tree_method=hist. Default = 256
Maximum delta step allowed for each tree's weight estimation. A positive integer makes the update step more conservative; it is most useful in logistic regression, where setting it to a value from 1 to 10 can help control the update. Must be >= 0. Default = 0
Maximum depth of a tree. Increasing this value makes the model more complex (and more likely to overfit). 0 indicates no limit; a limit is required when grow_policy=depthwise. Must be >= 0. Default = 6
Maximum number of nodes to be added. Relevant only if grow_policy = lossguide. Must be >= 0. Default = 0
Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with a sum of instance weight less than min_child_weight, the building process gives up further partitioning. In linear regression mode, this simply corresponds to the minimum number of instances needed in each node. The larger the value, the more conservative the algorithm will be. Must be >= 0. Default = 1
The environment variables that SageMaker will set on the model container during execution.
A SageMaker Model hosting Docker image URI.
Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
Number of parallel threads used to run xgboost. Must be >= 1. Defaults to the maximum number of threads available.
The NamePolicyFactory to use when naming SageMaker entities created during fit.
Type of normalization algorithm. Can be "tree" or "forest". Default = "tree"
The number of classes. Used for softmax multiclass classification; no default.
Number of rounds for gradient boosting. Must be >= 1. Required (see the setter sketch following this list).
Specifies the learning task and the corresponding learning objective. Default: "reg:linear"
Whether to drop at least one tree during the dropout. Default = 0
The type of boosting process to run. Can be default or update. Default = "default"
Dropout rate (a fraction of previous trees to drop during the dropout). Must be in [0, 1]. Default = 0.0
A parameter of the 'refresh' updater plugin. When set to 1, tree leaves as well as tree node stats are updated; when set to 0, only tree node stats are updated. Default = 1
The region in which to run the algorithm. If not specified, gets the region from the DefaultAwsRegionProviderChain.
Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
Deserializes an Endpoint response into a series of Rows.
AmazonS3. Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
Type of sampling algorithm. Can be "uniform" or "weighted". Default = "uniform"
Controls the balance of positive and negative weights. It's useful for unbalanced classes. A typical value to consider: sum(negative cases) / sum(positive cases). Default = 1
Random number seed. Default = 0
Whether in silent mode. Can be 0 or 1; 0 means printing running messages, while 1 means silent mode. Default = 0
Used only for the approximate greedy algorithm. Translates into O(1 / sketch_eps) number of bins (for example, sketch_eps = 0.03 yields roughly 33 bins). Compared to directly selecting the number of bins, this comes with a theoretical guarantee of sketch accuracy. Must be in (0, 1). Default = 0.03
Probability of skipping the dropout procedure during a boosting iteration. Must be in [0, 1]. Default: 0
AmazonSTS. Used to resolve the account number when creating staging input / output buckets.
Subsample ratio of the training instances. Setting it to 0.5 means that XGBoost randomly samples half of the data instances to grow trees, which helps prevent overfitting. Must be in (0, 1]. Default = 1
The SageMaker Channel name to which serialized Dataset fit input is uploaded.
The type of compression to use when serializing the Dataset being fit for input to SageMaker.
The MIME type of the training data.
A SageMaker Training Job Algorithm Specification Training Image Docker image URI.
The SageMaker Training Job Channel input mode.
An S3 location to upload SageMaker Training Job input data to.
The number of instances of instanceType to run a SageMaker Training Job with.
The SageMaker TrainingJob Instance Type to use.
The EBS volume size, in gigabytes, of each instance.
A KMS key ID for the Output Data Source.
A SageMaker Training Job Termination Condition MaxRuntimeInHours.
An S3 location for SageMaker to store Training Job output data to.
The columns to project from the Dataset being fit before training. If an Optional.empty is passed then no specific projection will occur and all columns will be serialized.
The SageMaker Training Job S3 data distribution scheme.
The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
The Spark Data Format Options used during serialization of the Dataset being fit.
The tree construction algorithm used in XGBoost. Can be auto, exact, approx, hist. Default = "auto"
Parameter that controls the variance of the Tweedie distribution. Must be in (1, 2). Default = 1.5
The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
A comma-separated string that defines the sequence of tree updaters to run. This provides a modular way to construct and modify the trees. Default = "grow_colmaker,prune"
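The hyperparameters above are exposed on the estimator as Spark ML Params with corresponding setters. A brief sketch, reusing the estimator constructed earlier; the setter names follow the library's convention, and the values are purely illustrative:

    estimator.setNumRound(100)         // required: number of boosting rounds
    estimator.setObjective("reg:linear")
    estimator.setEta(0.2)              // step size shrinkage, in [0, 1]
    estimator.setMaxDepth(6)           // deeper trees -> more complex model
    estimator.setSubsample(0.8)        // row subsample ratio, in (0, 1]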
A SageMakerEstimator that runs an XGBoost training job in SageMaker and returns a SageMakerModel that can be used to transform a DataFrame using the hosted XGBoost model. XGBoost is an open-source distributed gradient boosting library that has been adapted to run on Amazon SageMaker.
XGBoost trains and infers on LibSVM-formatted data. XGBoostSageMakerEstimator uses Spark's LibSVMFileFormat to write the training DataFrame to S3 and serializes Rows to LibSVM for inference, selecting by default the column named "features", which is expected to contain a Vector of Doubles.
Inferences made against an Endpoint hosting an XGBoost model contain a "prediction" field appended to the input DataFrame as a column of Doubles, containing the prediction corresponding to the given Vector of features.
See https://github.com/dmlc/xgboost for more on XGBoost.
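A short end-to-end sketch, reusing the estimator from the earlier snippets; the S3 paths and feature count are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.getOrCreate()

    // LibSVM-formatted training data, as the estimator expects.
    val trainingData = spark.read.format("libsvm")
      .option("numFeatures", "784")
      .load("s3://my-bucket/train/")

    // Runs a SageMaker training job, then hosts the resulting model.
    val model = estimator.fit(trainingData)

    // Transforming invokes the hosted endpoint and appends a "prediction"
    // column of Doubles to the input rows.
    val testData = spark.read.format("libsvm")
      .option("numFeatures", "784")
      .load("s3://my-bucket/test/")
    model.transform(testData).select("features", "prediction").show()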