Class/Object

com.amazonaws.services.sagemaker.sparksdk

SageMakerEstimator

Related Docs: object SageMakerEstimator | package sparksdk

Permalink

class SageMakerEstimator extends Estimator[SageMakerModel]

Adapts a SageMaker learning Algorithm to a Spark Estimator. Fits a SageMakerModel by running a SageMaker Training Job on a Spark Dataset. Each call to fit submits a new SageMaker Training Job, creates a new SageMaker Model, and creates a new SageMaker Endpoint Config. A new Endpoint is either created by or the returned SageMakerModel is configured to generate an Endpoint on SageMakerModel transform.

On fit, the input dataset is serialized with the specified trainingSparkDataFormat using the specified trainingSparkDataFormatOptions and uploaded to an S3 location specified by trainingInputS3DataPath. The serialized Dataset is compressed with trainingCompressionCodec, if not None.

trainingProjectedColumns can be used to control which columns on the input Dataset are transmitted to SageMaker. If not None, then only those column names will be serialized as input to the SageMaker Training Job.

A Training Job is created with the uploaded Dataset being input to the specified trainingChannelName, with the specified trainingInputMode. The algorithm is specified trainingImage, a Docker image URI reference. The Training Job is created with trainingInstanceCount instances of type trainingInstanceType. The Training Job will time-out after trainingMaxRuntimeInSeconds, if not None.

SageMaker Training Job hyperparameters are built from the org.apache.spark.ml.param.Params set on this Estimator. Param objects set on this Estimator are retrieved during fit and converted to a SageMaker Training Job hyperparameter Map. Param objects are iterated over by invoking params on this Estimator. Param objects with neither a default value nor a set value are ignored. If a Param is not set but has a default value, the default value will be used. Param values are converted to SageMaker hyperparameter String values by invoking toString on the Param value.

SageMaker uses the IAM Role with ARN sagemakerRole to access the input and output S3 buckets and trainingImage if the image is hosted in ECR. SageMaker Training Job output is stored in a Training Job specific sub-prefix of trainingOutputS3DataPath. This contains the SageMaker Training Job output file as well as the SageMaker Training Job model file.

After the Training Job is created, this Estimator will poll for success. Upon success an SageMakerModel is created and returned from fit. The SageMakerModel is created with a modelImage Docker image URI, defining the SageMaker model primary container and with modelEnvironmentVariables environment variables. Each SageMakerModel has a corresponding SageMaker hosting Endpoint. This Endpoint runs on at least endpointInitialInstanceCount instances of type endpointInstanceType. The Endpoint is created either during construction of the SageMakerModel or on the first call to transform, controlled by endpointCreationPolicy. Each Endpoint instance runs with sagemakerRole IAMRole.

The transform method on SageMakerModel uses requestRowSerializer to serialize Rows from the Dataset undergoing transformation, to requests on the hosted SageMaker Endpoint. The responseRowDeserializer is used to convert the response from the Endpoint to a series of Rows, forming the transformed Dataset. If modelPrependInputRowsToTransformationRows is true, then each transformed Row is also prepended with its corresponding input Row.

Linear Supertypes
Estimator[SageMakerModel], PipelineStage, Logging, Params, Serializable, Serializable, Identifiable, AnyRef, Any
Known Subclasses
Ordering
  1. Alphabetic
  2. By inheritance
Inherited
  1. SageMakerEstimator
  2. Estimator
  3. PipelineStage
  4. Logging
  5. Params
  6. Serializable
  7. Serializable
  8. Identifiable
  9. AnyRef
  10. Any
  1. Hide All
  2. Show all
Visibility
  1. Public
  2. All

Instance Constructors

  1. new SageMakerEstimator(trainingImage: String, modelImage: String, sagemakerRole: IAMRoleResource = IAMRoleFromConfig(), trainingInstanceType: String, trainingInstanceCount: Int, endpointInstanceType: String, endpointInitialInstanceCount: Int, requestRowSerializer: RequestRowSerializer, responseRowDeserializer: ResponseRowDeserializer, trainingInputS3DataPath: S3Resource = S3AutoCreatePath(), trainingOutputS3DataPath: S3Resource = S3AutoCreatePath(), trainingInstanceVolumeSizeInGB: Int = 1024, trainingProjectedColumns: Option[List[String]] = None, trainingChannelName: String = "train", trainingContentType: Option[String] = None, trainingS3DataDistribution: String = ..., trainingSparkDataFormat: String = "sagemaker", trainingSparkDataFormatOptions: Map[String, String] = Map(), trainingInputMode: String = TrainingInputMode.File.toString, trainingCompressionCodec: Option[String] = None, trainingMaxRuntimeInSeconds: Int = 24 * 60 * 60, trainingKmsKeyId: Option[String] = None, modelEnvironmentVariables: Map[String, String] = Map(), endpointCreationPolicy: EndpointCreationPolicy = ..., sagemakerClient: AmazonSageMaker = ..., s3Client: AmazonS3 = ..., stsClient: AWSSecurityTokenService = ..., modelPrependInputRowsToTransformationRows: Boolean = true, deleteStagingDataAfterTraining: Boolean = true, namePolicyFactory: NamePolicyFactory = new RandomNamePolicyFactory(), uid: String = Identifiable.randomUID("sagemaker"), hyperParameters: Map[String, String] = Map())

    Permalink

    trainingImage

    A SageMaker Training Job Algorithm Specification Training Image Docker image URI.

    modelImage

    A SageMaker Model hosting Docker image URI.

    sagemakerRole

    The SageMaker TrainingJob and Hosting IAM Role. Used by a SageMaker to access S3 and ECR resources. SageMaker hosted Endpoints instances launched by this Estimator run with this role.

    trainingInstanceType

    The SageMaker TrainingJob Instance Type to use

    trainingInstanceCount

    The number of instances of instanceType to run an SageMaker Training Job with

    endpointInstanceType

    The SageMaker Endpoint Confing instance type

    endpointInitialInstanceCount

    The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage

    requestRowSerializer

    Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.

    responseRowDeserializer

    Deserializes an Endpoint response into a series of Rows.

    trainingInputS3DataPath

    An S3 location to upload SageMaker Training Job input data to.

    trainingOutputS3DataPath

    An S3 location for SageMaker to store Training Job output data to.

    trainingInstanceVolumeSizeInGB

    The EBS volume size in gigabytes of each instance

    trainingProjectedColumns

    The columns to project from the Dataset being fit before training. If an Optional.empty is passed then no specific projection will occur and all columns will be serialized.

    trainingChannelName

    The SageMaker Channel name to input serialized Dataset fit input to

    trainingContentType

    The MIME type of the training data.

    trainingS3DataDistribution

    The SageMaker Training Job S3 data distribution scheme.

    trainingSparkDataFormat

    The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.

    trainingSparkDataFormatOptions

    The Spark Data Format Options used during serialization of the Dataset being fit.

    trainingInputMode

    The SageMaker Training Job Channel input mode.

    trainingCompressionCodec

    The type of compression to use when serializing the Dataset being fit for input to SageMaker.

    trainingMaxRuntimeInSeconds

    A SageMaker Training Job Termination Condition MaxRuntimeInHours.

    trainingKmsKeyId

    A KMS key ID for the Output Data Source

    modelEnvironmentVariables

    The environment variables that SageMaker will set on the model container during execution.

    endpointCreationPolicy

    Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.

    sagemakerClient

    Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.

    s3Client

    AmazonS3. Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.

    stsClient

    AmazonSTS. Used to resolve the account number when creating staging input / output buckets.

    modelPrependInputRowsToTransformationRows

    Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.

    deleteStagingDataAfterTraining

    Whether to remove the training data on s3 after training is complete or failed.

    namePolicyFactory

    The NamePolicyFactory to use when naming SageMaker entities created during fit

    uid

    The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.

    hyperParameters

    A map from hyperParameter names to their respective values for training.

Value Members

  1. final def !=(arg0: Any): Boolean

    Permalink
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Permalink
    Definition Classes
    AnyRef → Any
  3. final def $[T](param: Param[T]): T

    Permalink
    Attributes
    protected
    Definition Classes
    Params
  4. final def ==(arg0: Any): Boolean

    Permalink
    Definition Classes
    AnyRef → Any
  5. final def asInstanceOf[T0]: T0

    Permalink
    Definition Classes
    Any
  6. final def clear(param: Param[_]): SageMakerEstimator.this.type

    Permalink
    Definition Classes
    Params
  7. def clone(): AnyRef

    Permalink
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  8. def copy(extra: ParamMap): SageMakerEstimator

    Permalink
    Definition Classes
    SageMakerEstimator → Estimator → PipelineStage → Params
  9. def copyValues[T <: Params](to: T, extra: ParamMap): T

    Permalink
    Attributes
    protected
    Definition Classes
    Params
  10. final def defaultCopy[T <: Params](extra: ParamMap): T

    Permalink
    Attributes
    protected
    Definition Classes
    Params
  11. val deleteStagingDataAfterTraining: Boolean

    Permalink

    Whether to remove the training data on s3 after training is complete or failed.

  12. val endpointCreationPolicy: EndpointCreationPolicy

    Permalink

    Defines how a SageMaker Endpoint referenced by a SageMakerModel is created.

  13. val endpointInitialInstanceCount: Int

    Permalink

    The SageMaker Endpoint Config minimum number of instances that can be used to host modelImage

  14. val endpointInstanceType: String

    Permalink

    The SageMaker Endpoint Confing instance type

  15. final def eq(arg0: AnyRef): Boolean

    Permalink
    Definition Classes
    AnyRef
  16. def equals(arg0: Any): Boolean

    Permalink
    Definition Classes
    AnyRef → Any
  17. def explainParam(param: Param[_]): String

    Permalink
    Definition Classes
    Params
  18. def explainParams(): String

    Permalink
    Definition Classes
    Params
  19. final def extractParamMap(): ParamMap

    Permalink
    Definition Classes
    Params
  20. final def extractParamMap(extra: ParamMap): ParamMap

    Permalink
    Definition Classes
    Params
  21. def finalize(): Unit

    Permalink
    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  22. def fit(dataSet: Dataset[_]): SageMakerModel

    Permalink

    Fits a SageMakerModel on dataSet by running a SageMaker training job.

    Fits a SageMakerModel on dataSet by running a SageMaker training job.

    Definition Classes
    SageMakerEstimator → Estimator
  23. def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[SageMakerModel]

    Permalink
    Definition Classes
    Estimator
    Annotations
    @Since( "2.0.0" )
  24. def fit(dataset: Dataset[_], paramMap: ParamMap): SageMakerModel

    Permalink
    Definition Classes
    Estimator
    Annotations
    @Since( "2.0.0" )
  25. def fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): SageMakerModel

    Permalink
    Definition Classes
    Estimator
    Annotations
    @Since( "2.0.0" ) @varargs()
  26. final def get[T](param: Param[T]): Option[T]

    Permalink
    Definition Classes
    Params
  27. final def getClass(): Class[_]

    Permalink
    Definition Classes
    AnyRef → Any
  28. final def getDefault[T](param: Param[T]): Option[T]

    Permalink
    Definition Classes
    Params
  29. final def getOrDefault[T](param: Param[T]): T

    Permalink
    Definition Classes
    Params
  30. def getParam(paramName: String): Param[Any]

    Permalink
    Definition Classes
    Params
  31. final def hasDefault[T](param: Param[T]): Boolean

    Permalink
    Definition Classes
    Params
  32. def hasParam(paramName: String): Boolean

    Permalink
    Definition Classes
    Params
  33. def hashCode(): Int

    Permalink
    Definition Classes
    AnyRef → Any
  34. val hyperParameters: Map[String, String]

    Permalink

    A map from hyperParameter names to their respective values for training.

  35. def initializeLogIfNecessary(isInterpreter: Boolean): Unit

    Permalink
    Attributes
    protected
    Definition Classes
    Logging
  36. final def isDefined(param: Param[_]): Boolean

    Permalink
    Definition Classes
    Params
  37. final def isInstanceOf[T0]: Boolean

    Permalink
    Definition Classes
    Any
  38. final def isSet(param: Param[_]): Boolean

    Permalink
    Definition Classes
    Params
  39. def isTraceEnabled(): Boolean

    Permalink
    Attributes
    protected
    Definition Classes
    Logging
  40. def log: Logger

    Permalink
    Attributes
    protected
    Definition Classes
    Logging
  41. def logDebug(msg: ⇒ String, throwable: Throwable): Unit

    Permalink
    Attributes
    protected
    Definition Classes
    Logging
  42. def logDebug(msg: ⇒ String): Unit

    Permalink
    Attributes
    protected
    Definition Classes
    Logging
  43. def logError(msg: ⇒ String, throwable: Throwable): Unit

    Permalink
    Attributes
    protected
    Definition Classes
    Logging
  44. def logError(msg: ⇒ String): Unit

    Permalink
    Attributes
    protected
    Definition Classes
    Logging
  45. def logInfo(msg: ⇒ String, throwable: Throwable): Unit

    Permalink
    Attributes
    protected
    Definition Classes
    Logging
  46. def logInfo(msg: ⇒ String): Unit

    Permalink
    Attributes
    protected
    Definition Classes
    Logging
  47. def logName: String

    Permalink
    Attributes
    protected
    Definition Classes
    Logging
  48. def logTrace(msg: ⇒ String, throwable: Throwable): Unit

    Permalink
    Attributes
    protected
    Definition Classes
    Logging
  49. def logTrace(msg: ⇒ String): Unit

    Permalink
    Attributes
    protected
    Definition Classes
    Logging
  50. def logWarning(msg: ⇒ String, throwable: Throwable): Unit

    Permalink
    Attributes
    protected
    Definition Classes
    Logging
  51. def logWarning(msg: ⇒ String): Unit

    Permalink
    Attributes
    protected
    Definition Classes
    Logging
  52. val modelEnvironmentVariables: Map[String, String]

    Permalink

    The environment variables that SageMaker will set on the model container during execution.

  53. val modelImage: String

    Permalink

    A SageMaker Model hosting Docker image URI.

  54. val modelPrependInputRowsToTransformationRows: Boolean

    Permalink

    Whether the transformation result on Models built by this Estimator should also include the input Rows.

    Whether the transformation result on Models built by this Estimator should also include the input Rows. If true, each output Row is formed by a concatenation of the input Row with the corresponding Row produced by SageMaker Endpoint invocation, produced by responseRowDeserializer. If false, each output Row is just taken from responseRowDeserializer.

  55. val namePolicyFactory: NamePolicyFactory

    Permalink

    The NamePolicyFactory to use when naming SageMaker entities created during fit

  56. final def ne(arg0: AnyRef): Boolean

    Permalink
    Definition Classes
    AnyRef
  57. final def notify(): Unit

    Permalink
    Definition Classes
    AnyRef
  58. final def notifyAll(): Unit

    Permalink
    Definition Classes
    AnyRef
  59. lazy val params: Array[Param[_]]

    Permalink
    Definition Classes
    Params
  60. val requestRowSerializer: RequestRowSerializer

    Permalink

    Serializes Spark DataFrame Rows for transformation by Models built from this Estimator.

  61. val responseRowDeserializer: ResponseRowDeserializer

    Permalink

    Deserializes an Endpoint response into a series of Rows.

  62. val s3Client: AmazonS3

    Permalink

    AmazonS3.

    AmazonS3. Used to create a bucket for staging SageMaker Training Job input and/or output if either are set to S3AutoCreatePath.

  63. val sagemakerClient: AmazonSageMaker

    Permalink

    Amazon SageMaker client.

    Amazon SageMaker client. Used to send CreateTrainingJob, CreateModel, and CreateEndpoint requests.

  64. val sagemakerRole: IAMRoleResource

    Permalink

    The SageMaker TrainingJob and Hosting IAM Role.

    The SageMaker TrainingJob and Hosting IAM Role. Used by a SageMaker to access S3 and ECR resources. SageMaker hosted Endpoints instances launched by this Estimator run with this role.

  65. final def set(paramPair: ParamPair[_]): SageMakerEstimator.this.type

    Permalink
    Attributes
    protected
    Definition Classes
    Params
  66. final def set(param: String, value: Any): SageMakerEstimator.this.type

    Permalink
    Attributes
    protected
    Definition Classes
    Params
  67. final def set[T](param: Param[T], value: T): SageMakerEstimator.this.type

    Permalink
    Definition Classes
    Params
  68. final def setDefault(paramPairs: ParamPair[_]*): SageMakerEstimator.this.type

    Permalink
    Attributes
    protected
    Definition Classes
    Params
  69. final def setDefault[T](param: Param[T], value: T): SageMakerEstimator.this.type

    Permalink
    Attributes
    protected
    Definition Classes
    Params
  70. val stsClient: AWSSecurityTokenService

    Permalink

    AmazonSTS.

    AmazonSTS. Used to resolve the account number when creating staging input / output buckets.

  71. final def synchronized[T0](arg0: ⇒ T0): T0

    Permalink
    Definition Classes
    AnyRef
  72. def toString(): String

    Permalink
    Definition Classes
    Identifiable → AnyRef → Any
  73. val trainingChannelName: String

    Permalink

    The SageMaker Channel name to input serialized Dataset fit input to

  74. val trainingCompressionCodec: Option[String]

    Permalink

    The type of compression to use when serializing the Dataset being fit for input to SageMaker.

  75. val trainingContentType: Option[String]

    Permalink

    The MIME type of the training data.

  76. val trainingImage: String

    Permalink

    A SageMaker Training Job Algorithm Specification Training Image Docker image URI.

  77. val trainingInputMode: String

    Permalink

    The SageMaker Training Job Channel input mode.

  78. val trainingInputS3DataPath: S3Resource

    Permalink

    An S3 location to upload SageMaker Training Job input data to.

  79. val trainingInstanceCount: Int

    Permalink

    The number of instances of instanceType to run an SageMaker Training Job with

  80. val trainingInstanceType: String

    Permalink

    The SageMaker TrainingJob Instance Type to use

  81. val trainingInstanceVolumeSizeInGB: Int

    Permalink

    The EBS volume size in gigabytes of each instance

  82. val trainingKmsKeyId: Option[String]

    Permalink

    A KMS key ID for the Output Data Source

  83. val trainingMaxRuntimeInSeconds: Int

    Permalink

    A SageMaker Training Job Termination Condition MaxRuntimeInHours.

  84. val trainingOutputS3DataPath: S3Resource

    Permalink

    An S3 location for SageMaker to store Training Job output data to.

  85. val trainingProjectedColumns: Option[List[String]]

    Permalink

    The columns to project from the Dataset being fit before training.

    The columns to project from the Dataset being fit before training. If an Optional.empty is passed then no specific projection will occur and all columns will be serialized.

  86. val trainingS3DataDistribution: String

    Permalink

    The SageMaker Training Job S3 data distribution scheme.

  87. val trainingSparkDataFormat: String

    Permalink

    The Spark Data Format name used to serialize the Dataset being fit for input to SageMaker.

  88. val trainingSparkDataFormatOptions: Map[String, String]

    Permalink

    The Spark Data Format Options used during serialization of the Dataset being fit.

  89. def transformSchema(schema: StructType): StructType

    Permalink
    Definition Classes
    SageMakerEstimator → PipelineStage
    Annotations
    @DeveloperApi()
  90. def transformSchema(schema: StructType, logging: Boolean): StructType

    Permalink
    Attributes
    protected
    Definition Classes
    PipelineStage
    Annotations
    @DeveloperApi()
  91. val uid: String

    Permalink

    The unique identifier of this Estimator.

    The unique identifier of this Estimator. Used to represent this stage in Spark ML pipelines.

    Definition Classes
    SageMakerEstimator → Identifiable
  92. final def wait(): Unit

    Permalink
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  93. final def wait(arg0: Long, arg1: Int): Unit

    Permalink
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  94. final def wait(arg0: Long): Unit

    Permalink
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )

Inherited from Estimator[SageMakerModel]

Inherited from PipelineStage

Inherited from Logging

Inherited from Params

Inherited from Serializable

Inherited from Serializable

Inherited from Identifiable

Inherited from AnyRef

Inherited from Any

Ungrouped