SageMakerEstimator

Adapts a SageMaker learning algorithm to a Spark Estimator. Fits a SageMakerModel by running a SageMaker Training Job on a Spark Dataset. Each call to fit submits a new SageMaker Training Job, creates a new SageMaker Model, and creates a new SageMaker Endpoint Config. A new Endpoint is either created immediately, or the returned SageMakerModel is configured to create an Endpoint on the first call to SageMakerModel transform.

On fit, the input dataset is serialized with the specified trainingSparkDataFormat using the specified trainingSparkDataFormatOptions and uploaded to an S3 location specified by trainingInputS3DataPath. The serialized Dataset is compressed with trainingCompressionCodec, if not None.

trainingProjectedColumns can be used to control which columns on the input Dataset are transmitted to SageMaker. If not None, then only those column names will be serialized as input to the SageMaker Training Job.
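The projection rule above can be illustrated with a minimal sketch in plain Scala. It does not use Spark or the SDK: `ProjectionSketch`, `Row`, and `project` are hypothetical names introduced only for illustration, and a row is modelled as a map from column name to value.

```scala
object ProjectionSketch {
  // Hypothetical row model: column name -> value (the real Estimator works on Spark Dataset rows).
  type Row = Map[String, Any]

  // None keeps every column; Some(names) keeps only the named columns.
  def project(rows: Seq[Row], projectedColumns: Option[List[String]]): Seq[Row] =
    projectedColumns match {
      case None        => rows
      case Some(names) => rows.map(row => row.filter { case (name, _) => names.contains(name) })
    }
}
```

With `Some(List("label", "features"))`, a row that also carries an `id` column would be serialized without it; with `None`, the row is passed through unchanged.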

A Training Job is created with the uploaded Dataset as input to the specified trainingChannelName, using the specified trainingInputMode. The algorithm is specified by trainingImage, a Docker image URI reference. The Training Job is created with trainingInstanceCount instances of type trainingInstanceType. The Training Job will time out after trainingMaxRuntimeInSeconds, if not None.

SageMaker Training Job hyperparameters are built from the org.apache.spark.ml.param.Params set on this Estimator. Param objects set on this Estimator are retrieved during fit and converted to a SageMaker Training Job hyperparameter Map. Param objects are iterated over by invoking params on this Estimator. Param objects with neither a default value nor a set value are ignored. If a Param is not set but has a default value, the default value will be used. Param values are converted to SageMaker hyperparameter String values by invoking toString on the Param value.
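The conversion rules above (a set value wins, a default is used otherwise, a Param with neither is ignored, and values become strings via toString) can be sketched in plain Scala. This is not the SDK's implementation: `HyperParameterSketch` and `ParamState` are hypothetical stand-ins for a Spark ML Param and its state on the Estimator.

```scala
object HyperParameterSketch {
  // Hypothetical stand-in for a Param together with its default and explicitly set value.
  final case class ParamState(name: String, default: Option[Any], setValue: Option[Any])

  // Builds the String-to-String hyperparameter map described above.
  def toHyperParameters(params: Seq[ParamState]): Map[String, String] =
    params.flatMap { p =>
      // A set value takes precedence over the default; a Param with neither is skipped.
      p.setValue.orElse(p.default).map(value => p.name -> value.toString)
    }.toMap
}
```

Under these assumptions, a Param `k` set to 10 and a Param `epochs` with a default of 1 both appear in the map as strings, while a Param with no value at all is left out.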

SageMaker uses the IAM Role with ARN sagemakerRole to access the input and output S3 buckets and trainingImage if the image is hosted in ECR. SageMaker Training Job output is stored in a Training Job specific sub-prefix of trainingOutputS3DataPath. This contains the SageMaker Training Job output file as well as the SageMaker Training Job model file.

After the Training Job is created, this Estimator will poll for success. Upon success, a SageMakerModel is created and returned from fit. The SageMakerModel is created with a modelImage Docker image URI, which defines the SageMaker model primary container, and with the modelEnvironmentVariables environment variables. Each SageMakerModel has a corresponding SageMaker hosting Endpoint. This Endpoint runs on at least endpointInitialInstanceCount instances of type endpointInstanceType. The Endpoint is created either during construction of the SageMakerModel or on the first call to transform, controlled by endpointCreationPolicy. Each Endpoint instance runs with the sagemakerRole IAM Role.
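The two Endpoint creation timings described above can be sketched without any AWS calls. Everything in this block is hypothetical: `CreateOnConstruct`, `CreateOnTransform`, and `ModelSketch` are illustrative names, not the SDK's EndpointCreationPolicy values, and the `createEndpoint` function stands in for the real CreateEndpoint request.

```scala
object EndpointPolicySketch {
  // Hypothetical policy values modelling "create during construction" vs "create on first transform".
  sealed trait CreationPolicy
  case object CreateOnConstruct extends CreationPolicy
  case object CreateOnTransform extends CreationPolicy

  // createEndpoint stands in for the real endpoint creation call and returns an endpoint name.
  final class ModelSketch(policy: CreationPolicy, createEndpoint: () => String) {
    // Eager policy: the endpoint exists as soon as the model is constructed.
    private var endpoint: Option[String] =
      if (policy == CreateOnConstruct) Some(createEndpoint()) else None

    def endpointName: Option[String] = endpoint

    def transform(rows: Seq[String]): Seq[String] = {
      // Lazy policy: the endpoint is created on the first transform and reused afterwards.
      if (endpoint.isEmpty) endpoint = Some(createEndpoint())
      rows.map(row => s"${endpoint.get}:$row")
    }
  }
}
```

Deferring creation to the first transform avoids paying for hosting instances when a fitted model is never used for inference, at the cost of a slower first call.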

The transform method on SageMakerModel uses requestRowSerializer to serialize Rows from the Dataset undergoing transformation, to requests on the hosted SageMaker Endpoint. The responseRowDeserializer is used to convert the response from the Endpoint to a series of Rows, forming the transformed Dataset. If modelPrependInputRowsToTransformationRows is true, then each transformed Row is also prepended with its corresponding input Row.
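The effect of modelPrependInputRowsToTransformationRows can be sketched in plain Scala. This is a simplified model, not the SDK's code: `PrependRowsSketch` is a hypothetical name, a row is a plain `Seq[Any]`, `invokeEndpoint` stands in for serialization, the Endpoint call, and deserialization combined, and each input row is assumed to produce exactly one output row.

```scala
object PrependRowsSketch {
  // Hypothetical row model: an ordered sequence of column values.
  type Row = Seq[Any]

  // invokeEndpoint stands in for requestRowSerializer + Endpoint call + responseRowDeserializer.
  def transform(inputs: Seq[Row], invokeEndpoint: Row => Row, prependInputRows: Boolean): Seq[Row] =
    inputs.map { input =>
      val output = invokeEndpoint(input)
      // When enabled, the input columns come first, followed by the Endpoint's output columns.
      if (prependInputRows) input ++ output else output
    }
}
```

Keeping the input columns is convenient when downstream stages need to join predictions back to features; dropping them keeps the transformed Dataset small.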

Linear Supertypes

Estimator[SageMakerModel], PipelineStage, Logging, Params, Serializable, Serializable, Identifiable, AnyRef, Any

Known Subclasses

KMeansSageMakerEstimator, PCASageMakerEstimator, XGBoostSageMakerEstimator

Instance Constructors

new SageMakerEstimator(trainingImage: String, modelImage: String, sagemakerRole: IAMRoleResource = IAMRoleFromConfig(), trainingInstanceType: String, trainingInstanceCount: Int, endpointInstanceType: String, endpointInitialInstanceCount: Int, requestRowSerializer: RequestRowSerializer, responseRowDeserializer: ResponseRowDeserializer, trainingInputS3DataPath: S3Resource = S3AutoCreatePath(), trainingOutputS3DataPath: S3Resource = S3AutoCreatePath(), trainingInstanceVolumeSizeInGB: Int = 1024, trainingProjectedColumns: Option[List[String]] = None, trainingChannelName: String = "train", trainingContentType: Option[String] = None, trainingS3DataDistribution: String = ..., trainingSparkDataFormat: String = "sagemaker", trainingSparkDataFormatOptions: Map[String, String] = Map(), trainingInputMode: String = TrainingInputMode.File.toString, trainingCompressionCodec: Option[String] = None, trainingMaxRuntimeInSeconds: Int = 24 * 60 * 60, trainingKmsKeyId: Option[String] = None, modelEnvironmentVariables: Map[String, String] = Map(), endpointCreationPolicy: EndpointCreationPolicy = ..., sagemakerClient: AmazonSageMaker = ..., s3Client: AmazonS3 = ..., stsClient: AWSSecurityTokenService = ..., modelPrependInputRowsToTransformationRows: Boolean = true, deleteStagingDataAfterTraining: Boolean = true, namePolicyFactory: NamePolicyFactory = new RandomNamePolicyFactory(), uid: String = Identifiable.randomUID("sagemaker"), hyperParameters: Map[String, String] = Map())

trainingImage
A SageMaker Training Job Algorithm Specification Training Image Docker image URI.
modelImage
A SageMaker Model hosting Docker image URI.
sagemakerRole
The SageMaker Training Job and Hosting IAM Role. Used by SageMaker to access S3 and ECR resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
trainingInstanceType
The SageMaker Training Job instance type to use.
trainingInstanceCount
The number of instances of trainingInstanceType to run a SageMaker Training Job with.
endpointInstanceType
The SageMaker Endpoint Config instance type.
endpointInitialInstanceCount
The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage
requestRowSerializer
Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
responseRowDeserializer
Deserializes an Endpoint response into a series of Rows.
trainingInputS3DataPath
An S3 location to upload SageMaker Training Job input data to.
trainingOutputS3DataPath
An S3 location for SageMaker to store Training Job output data to.
trainingInstanceVolumeSizeInGB
The EBS volume size in gigabytes of each instance
trainingProjectedColumns
The columns to project from the Dataset being fit before training. If None is passed, then no specific projection will occur and all columns will be serialized.
trainingChannelName
The name of the SageMaker Channel that receives the serialized Dataset passed to fit.
trainingContentType
The MIME type of the training data.
trainingS3DataDistribution
The SageMaker Training Job S3 data distribution scheme.
trainingSparkDataFormat
The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
trainingSparkDataFormatOptions
The Spark Data Format Options used during serialization of the Dataset being fit.
trainingInputMode
The SageMaker Training Job Channel input mode.
trainingCompressionCodec
The type of compression to use when serializing the Dataset being fit for input to SageMaker.
trainingMaxRuntimeInSeconds
A SageMaker Training Job Termination Condition MaxRuntimeInSeconds.
trainingKmsKeyId
A KMS key ID for the Output Data Source
modelEnvironmentVariables
The environment variables that SageMaker will set on the model container during execution.
endpointCreationPolicy
Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
sagemakerClient
Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
s3Client
AmazonS3. Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
stsClient
AmazonSTS. Used to resolve the account number when creating staging input / output buckets.
modelPrependInputRowsToTransformationRows
Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
deleteStagingDataAfterTraining
Whether to remove the training data on S3 after training completes or fails.
namePolicyFactory
The NamePolicyFactory to use when naming SageMaker entities created during fit
uid
The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.
hyperParameters
A map from hyperParameter names to their respective values for training.

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def $[T](param: Param[T]): T

Attributes
protected
Definition Classes
Params
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def asInstanceOf[T0]: T0

Definition Classes
Any
final def clear(param: Param[_]): SageMakerEstimator.this.type

Definition Classes
Params
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
def copy(extra: ParamMap): SageMakerEstimator

Definition Classes
SageMakerEstimator → Estimator → PipelineStage → Params
def copyValues[T <: Params](to: T, extra: ParamMap): T

Attributes
protected
Definition Classes
Params
final def defaultCopy[T <: Params](extra: ParamMap): T

Attributes
protected
Definition Classes
Params
val deleteStagingDataAfterTraining: Boolean

Whether to remove the training data on S3 after training completes or fails.
val endpointCreationPolicy: EndpointCreationPolicy

Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.
val endpointInitialInstanceCount: Int

The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage
val endpointInstanceType: String

The SageMaker Endpoint Config instance type.
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
def explainParam(param: Param[_]): String

Definition Classes
Params
def explainParams(): String

Definition Classes
Params
final def extractParamMap(): ParamMap

Definition Classes
Params
final def extractParamMap(extra: ParamMap): ParamMap

Definition Classes
Params
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
def fit(dataSet: Dataset[_]): SageMakerModel

Fits a SageMakerModel on dataSet by running a SageMaker training job.

Definition Classes
SageMakerEstimator → Estimator
def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[SageMakerModel]

Definition Classes
Estimator
Annotations
@Since( "2.0.0" )
def fit(dataset: Dataset[_], paramMap: ParamMap): SageMakerModel

Definition Classes
Estimator
Annotations
@Since( "2.0.0" )
def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): SageMakerModel

Definition Classes
Estimator
Annotations
@Since( "2.0.0" ) @varargs()
final def get[T](param: Param[T]): Option[T]

Definition Classes
Params
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
final def getDefault[T](param: Param[T]): Option[T]

Definition Classes
Params
final def getOrDefault[T](param: Param[T]): T

Definition Classes
Params
def getParam(paramName: String): Param[Any]

Definition Classes
Params
final def hasDefault[T](param: Param[T]): Boolean

Definition Classes
Params
def hasParam(paramName: String): Boolean

Definition Classes
Params
def hashCode(): Int

Definition Classes
AnyRef → Any
val hyperParameters: Map[String, String]

A map from hyperParameter names to their respective values for training.
def initializeLogIfNecessary(isInterpreter: Boolean): Unit

Attributes
protected
Definition Classes
Logging
final def isDefined(param: Param[_]): Boolean

Definition Classes
Params
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
final def isSet(param: Param[_]): Boolean

Definition Classes
Params
def isTraceEnabled(): Boolean

Attributes
protected
Definition Classes
Logging
def log: Logger

Attributes
protected
Definition Classes
Logging
def logDebug(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logDebug(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logError(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logError(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logInfo(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logInfo(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logName: String

Attributes
protected
Definition Classes
Logging
def logTrace(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logTrace(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
def logWarning(msg: ⇒ String, throwable: Throwable): Unit

Attributes
protected
Definition Classes
Logging
def logWarning(msg: ⇒ String): Unit

Attributes
protected
Definition Classes
Logging
val modelEnvironmentVariables: Map[String, String]

The environment variables that SageMaker will set on the model container during execution.
val modelImage: String

A SageMaker Model hosting Docker image URI.
val modelPrependInputRowsToTransformationRows: Boolean

Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.
val namePolicyFactory: NamePolicyFactory

The NamePolicyFactory to use when naming SageMaker entities created during fit
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
lazy val params: Array[Param[_]]

Definition Classes
Params
val requestRowSerializer: RequestRowSerializer

Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.
val responseRowDeserializer: ResponseRowDeserializer

Deserializes an Endpoint response into a series of Rows.
val s3Client: AmazonS3

AmazonS3. Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.
val sagemakerClient: AmazonSageMaker

Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.
val sagemakerRole: IAMRoleResource

The SageMaker Training Job and Hosting IAM Role. Used by SageMaker to access S3 and ECR resources. SageMaker hosted Endpoint instances launched by this Estimator run with this role.
final def set(paramPair: ParamPair[_]): SageMakerEstimator.this.type

Attributes
protected
Definition Classes
Params
final def set(param: String, value: Any): SageMakerEstimator.this.type

Attributes
protected
Definition Classes
Params
final def set[T](param: Param[T], value: T): SageMakerEstimator.this.type

Definition Classes
Params
final def setDefault(paramPairs: ParamPair[_]*): SageMakerEstimator.this.type

Attributes
protected
Definition Classes
Params
final def setDefault[T](param: Param[T], value: T): SageMakerEstimator.this.type

Attributes
protected
Definition Classes
Params
val stsClient: AWSSecurityTokenService

AmazonSTS. Used to resolve the account number when creating staging input / output buckets.
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def toString(): String

Definition Classes
Identifiable → AnyRef → Any
val trainingChannelName: String

The name of the SageMaker Channel that receives the serialized Dataset passed to fit.
val trainingCompressionCodec: Option[String]

The type of compression to use when serializing the Dataset being fit for input to SageMaker.
val trainingContentType: Option[String]

The MIME type of the training data.
val trainingImage: String

A SageMaker Training Job Algorithm Specification Training Image Docker image URI.
val trainingInputMode: String

The SageMaker Training Job Channel input mode.
val trainingInputS3DataPath: S3Resource

An S3 location to upload SageMaker Training Job input data to.
val trainingInstanceCount: Int

The number of instances of trainingInstanceType to run a SageMaker Training Job with.
val trainingInstanceType: String

The SageMaker Training Job instance type to use.
val trainingInstanceVolumeSizeInGB: Int

The EBS volume size in gigabytes of each instance
val trainingKmsKeyId: Option[String]

A KMS key ID for the Output Data Source
val trainingMaxRuntimeInSeconds: Int

A SageMaker Training Job Termination Condition MaxRuntimeInSeconds.
val trainingOutputS3DataPath: S3Resource

An S3 location for SageMaker to store Training Job output data to.
val trainingProjectedColumns: Option[List[String]]

The columns to project from the Dataset being fit before training. If None is passed, then no specific projection will occur and all columns will be serialized.
val trainingS3DataDistribution: String

The SageMaker Training Job S3 data distribution scheme.
val trainingSparkDataFormat: String

The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.
val trainingSparkDataFormatOptions: Map[String, String]

The Spark Data Format Options used during serialization of the Dataset being fit.
def transformSchema(schema: StructType): StructType

Definition Classes
SageMakerEstimator → PipelineStage
Annotations
@DeveloperApi()
def transformSchema(schema: StructType, logging: Boolean): StructType

Attributes
protected
Definition Classes
PipelineStage
Annotations
@DeveloperApi()
val uid: String

The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.

Definition Classes
SageMakerEstimator → Identifiable
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )

class SageMakerEstimator extends Estimator[SageMakerModel]
