com.amazonaws.services.sagemaker.sparksdk.algorithms
The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
The SageMaker TrainingJob Instance Type to use.
The number of instances of instanceType to run a SageMaker Training Job with.
The SageMaker Endpoint Config instance type.
The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage.
Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
Deserializes an Endpoint response into a series of Rows.
An S3 location to upload SageMaker Training Job input data to.
An S3 location where SageMaker stores Training Job output data.
The EBS volume size in gigabytes of each instance.
The columns to project from the Dataset being fit before training. If Optional.empty is passed, then no specific projection will occur and all columns will be serialized.
The SageMaker Channel name to which the serialized Dataset being fit is input.
The MIME type of the training data.
The SageMaker Training Job S3 data distribution scheme.
The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
The Spark Data Format Options used during serialization of the Dataset being fit.
The SageMaker Training Job Channel input mode.
The type of compression to use when serializing the Dataset being fit for input to SageMaker.
A SageMaker Training Job Termination Condition MaxRuntimeInHours.
A KMS key ID for the Output Data Source.
The environment variables that SageMaker will set on the model container during execution.
Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
The region in which to run the algorithm. If not specified, gets the region from the DefaultAwsRegionProviderChain.
AmazonS3. Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
AmazonSTS. Used to resolve the account number when creating staging input / output buckets.
Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by concatenating the input Row with the corresponding Row produced by SageMaker Endpoint invocation, as deserialized by responseRowDeserializer. If false, each output Row is taken directly from responseRowDeserializer.
Whether to remove the training data on S3 after training completes or fails.
The NamePolicyFactory to use when naming SageMaker entities created during fit.
The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
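As an illustrative sketch of these constructor parameters (the role ARN and instance types below are placeholders to substitute for your own account; IAMRole comes from the parent sparksdk package):

  import com.amazonaws.services.sagemaker.sparksdk.IAMRole
  import com.amazonaws.services.sagemaker.sparksdk.algorithms.KMeansSageMakerEstimator

  // Placeholder role ARN and instance types; substitute values for your account.
  val estimator = new KMeansSageMakerEstimator(
    sagemakerRole = IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"),
    trainingInstanceType = "ml.m4.xlarge",
    trainingInstanceCount = 1,
    endpointInstanceType = "ml.m4.xlarge",
    endpointInitialInstanceCount = 1)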
The factor of extra centroids to create. The number of initial centroids equals centerFactor * k. Must be > 0 or "auto". Default: "auto".
Whether to remove the training data on S3 after training completes or fails.
Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage.
The SageMaker Endpoint Config instance type.
The number of passes done over the training data. Must be > 0. Default: 1.
Metric to be used for scoring the model. A comma-separated string of metrics. Supported metrics are "msd" (mean squared distance) and "ssd" (sum of squared distances). Default: "msd".
The dimension of the input vectors. Must be > 0. Required.
Fits a SageMakerModel on dataSet by running a SageMaker training job.
The weight decaying rate of each point. 0 means no decay at all. Must be >= 0. Default: 0.
A map from hyperParameter names to their respective values for training.
The initialization algorithm to choose centroids. Must be "random" or "kmeans++". Default: "random".
The number of clusters to create (k). Must be > 1.
The local initialization algorithm to choose centroids. Must be "random" or "kmeans++". Default: "kmeans++".
Maximum iterations for Lloyd's EM procedure in the local kmeans used in the finalizing stage. Must be > 0. Default: 300.
The number of examples in a mini-batch. Must be > 0. Required.
The environment variables that SageMaker will set on the model container during execution.
A SageMaker Model hosting Docker image URI.
Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by concatenating the input Row with the corresponding Row produced by SageMaker Endpoint invocation, as deserialized by responseRowDeserializer. If false, each output Row is taken directly from responseRowDeserializer.
The NamePolicyFactory to use when naming SageMaker entities created during fit.
The region in which to run the algorithm. If not specified, gets the region from the DefaultAwsRegionProviderChain.
Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
Deserializes an Endpoint response into a series of Rows.
AmazonS3. Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
The SageMaker TrainingJob and Hosting IAM Role. Used by SageMaker to access S3 and ECR resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
AmazonSTS. Used to resolve the account number when creating staging input / output buckets.
Tolerance for change in ssd for early stopping in local kmeans. Must be in range [0, 1]. Default: 0.0001.
The SageMaker Channel name to which the serialized Dataset being fit is input.
The type of compression to use when serializing the Dataset being fit for input to SageMaker.
The MIME type of the training data.
A SageMaker Training Job Algorithm Specification Training Image Docker image URI.
The SageMaker Training Job Channel input mode.
An S3 location to upload SageMaker Training Job input data to.
The number of instances of instanceType to run a SageMaker Training Job with.
The SageMaker TrainingJob Instance Type to use.
The EBS volume size in gigabytes of each instance.
A KMS key ID for the Output Data Source.
A SageMaker Training Job Termination Condition MaxRuntimeInHours.
An S3 location where SageMaker stores Training Job output data.
The columns to project from the Dataset being fit before training. If Optional.empty is passed, then no specific projection will occur and all columns will be serialized.
The SageMaker Training Job S3 data distribution scheme.
The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
The Spark Data Format Options used during serialization of the Dataset being fit.
The number of trials of the local kmeans algorithm. The output with the best loss will be chosen. Must be > 0 or "auto". Default: "auto".
The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
A SageMakerEstimator that runs a K-Means Clustering training job on Amazon SageMaker upon a call to fit() on a DataFrame and returns a SageMakerModel that can be used to transform a DataFrame using the hosted K-Means model. K-Means Clustering is useful for grouping similar examples in your dataset.
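A minimal sketch of this flow, reusing the estimator constructed earlier and assuming a DataFrame trainingData whose "features" column holds Vectors (k and featureDim are the required K-Means hyperparameters):

  // setK and setFeatureDim set the required hyperparameters before fit().
  val model = estimator
    .setK(10)            // number of clusters to create
    .setFeatureDim(784)  // dimension of the input vectors
    .fit(trainingData)   // runs a SageMaker training job, returns a SageMakerModel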
Amazon SageMaker K-Means clustering trains on RecordIO-encoded Amazon Record protobuf data. SageMaker Spark writes a DataFrame to S3 by selecting a column of Vectors named "features" and, if present, a column of Doubles named "label". These names are configurable by passing a map with entries in trainingSparkDataFormatOptions with key "labelColumnName" or "featuresColumnName", with values corresponding to the desired label and features columns.
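For example, a sketch of overriding the default column names through trainingSparkDataFormatOptions ("myFeatures" and "myLabel" are hypothetical column names; the other parameters are as in the earlier sketch):

  // Train on non-default feature and label columns.
  val estimatorWithNames = new KMeansSageMakerEstimator(
    sagemakerRole = IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"),
    trainingInstanceType = "ml.m4.xlarge",
    trainingInstanceCount = 1,
    endpointInstanceType = "ml.m4.xlarge",
    endpointInitialInstanceCount = 1,
    trainingSparkDataFormatOptions = Map(
      "featuresColumnName" -> "myFeatures",
      "labelColumnName" -> "myLabel"))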
For inference, the SageMakerModel returned by KMeansSageMakerEstimator.fit() uses ProtobufRequestRowSerializer to serialize Rows into RecordIO-encoded Amazon Record protobuf messages, by default selecting the column named "features", which is expected to contain a Vector of Doubles.
Inferences made against an Endpoint hosting a K-Means model contain a "closest_cluster" field and a "distance_to_cluster" field, both appended to the input DataFrame as columns of Double.
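A brief usage sketch, assuming a DataFrame testData with the same "features" column as the training data:

  // transform() invokes the hosted Endpoint and appends the prediction columns.
  val transformed = model.transform(testData)
  transformed.select("closest_cluster", "distance_to_cluster").show(5)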