Skip to content

Amazon SageMaker AI Deployment

Deploy Text Embeddings Inference (TEI) on Amazon SageMaker AI endpoints. The container accepts model configuration via environment variables and serves on port 8080.

Specifying the Model

Set HF_MODEL_ID to a Hugging Face Hub model ID (e.g. BAAI/bge-base-en-v1.5). The container exits at startup if it is not set. For gated or private models, also pass HF_TOKEN.

To serve model artifacts from S3 instead, provide them via ModelDataUrl/ModelDataSource — SageMaker mounts them at /opt/ml/model — and set HF_MODEL_ID=/opt/ml/model.

SageMaker Python SDK v2

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

model = HuggingFaceModel(
    # Resolves the TEI GPU image for your session's region.
    # For CPU instances, use the "huggingface-tei-cpu" backend instead.
    image_uri=get_huggingface_llm_image_uri("huggingface-tei", version="1.8.2"),
    role="arn:aws:iam::<account_id>:role/<role_name>",
    env={"HF_MODEL_ID": "BAAI/bge-base-en-v1.5"},
)

predictor = model.deploy(
    instance_type="ml.g5.xlarge",
    initial_instance_count=1,
)

response = predictor.predict({"inputs": "What is deep learning?"})
print(response)  # [[0.0123, -0.0456, ...]]

# Cleanup
predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

SageMaker Python SDK v3

import json

import boto3
from sagemaker.core import image_uris
from sagemaker.core.resources import Endpoint, EndpointConfig, Model
from sagemaker.core.shapes import ContainerDefinition, ProductionVariant

# Requires a configured AWS region (always set inside SageMaker environments).
region = boto3.session.Session().region_name

# Resolves the TEI GPU image for `region`. For CPU instances, use the
# "huggingface-tei-cpu" framework instead.
image = image_uris.retrieve(
    framework="huggingface-tei",
    region=region,
    version="1.8.2",
    image_scope="inference",
    instance_type="ml.g5.xlarge",
)

model = Model.create(
    model_name="tei-model",
    primary_container=ContainerDefinition(
        image=image,
        environment={"HF_MODEL_ID": "BAAI/bge-base-en-v1.5"},
    ),
    execution_role_arn="arn:aws:iam::<account_id>:role/<role_name>",
    region=region,
)

ep_cfg = EndpointConfig.create(
    endpoint_config_name="tei-config",
    production_variants=[
        ProductionVariant(
            variant_name="default",
            model_name="tei-model",
            instance_type="ml.g5.xlarge",
            initial_instance_count=1,
        ),
    ],
    region=region,
)

endpoint = Endpoint.create(
    endpoint_name="tei-endpoint",
    endpoint_config_name="tei-config",
    region=region,
)
endpoint.wait_for_status("InService")

smrt = boto3.client("sagemaker-runtime", region_name=region)
resp = smrt.invoke_endpoint(
    EndpointName="tei-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": "What is deep learning?"}),
)
print(json.loads(resp["Body"].read()))  # [[0.0123, -0.0456, ...]]

# Cleanup
endpoint.delete()
ep_cfg.delete()
model.delete()

Boto3

import json

import boto3

# TEI 1.8.2 GPU image in us-east-1. The image URI is region-specific (both the
# account ID and the region segment) — for other regions, replace `region` and
# `image` together using the region table: ../index.md#images
region = "us-east-1"
image = "683313688378.dkr.ecr.us-east-1.amazonaws.com/tei:2.0.1-tei1.8.2-gpu-py310-cu122-ubuntu22.04"

# Clients are pinned to the same region as the image so the model, endpoint,
# and container stay co-located.
sm = boto3.client("sagemaker", region_name=region)
smrt = boto3.client("sagemaker-runtime", region_name=region)

sm.create_model(
    ModelName="tei-model-boto3",
    PrimaryContainer={
        "Image": image,
        "Environment": {"HF_MODEL_ID": "BAAI/bge-base-en-v1.5"},
    },
    ExecutionRoleArn="arn:aws:iam::<account_id>:role/<role_name>",
)

sm.create_endpoint_config(
    EndpointConfigName="tei-config-boto3",
    ProductionVariants=[{
        "VariantName": "default",
        "ModelName": "tei-model-boto3",
        "InstanceType": "ml.g5.xlarge",
        "InitialInstanceCount": 1,
    }],
)

sm.create_endpoint(
    EndpointName="tei-endpoint-boto3",
    EndpointConfigName="tei-config-boto3",
)
sm.get_waiter("endpoint_in_service").wait(EndpointName="tei-endpoint-boto3")

resp = smrt.invoke_endpoint(
    EndpointName="tei-endpoint-boto3",
    ContentType="application/json",
    Body=json.dumps({"inputs": "What is deep learning?"}),
)
print(json.loads(resp["Body"].read()))  # [[0.0123, -0.0456, ...]]

# Cleanup
sm.delete_endpoint(EndpointName="tei-endpoint-boto3")
sm.delete_endpoint_config(EndpointConfigName="tei-config-boto3")
sm.delete_model(ModelName="tei-model-boto3")

Configuration

TEI is configured through environment variables that mirror its CLI arguments. Commonly used options:

Environment Variable Purpose
HF_MODEL_ID Hugging Face Hub model ID to serve
HF_TOKEN Hub token for gated/private models
MAX_CONCURRENT_REQUESTS Maximum number of concurrent requests
MAX_BATCH_TOKENS Token budget per batch (controls throughput vs. memory)
MAX_CLIENT_BATCH_SIZE Maximum number of inputs in a single request
MAX_BATCH_REQUESTS Optional cap on requests per batch
DTYPE Activation dtype (float16, float32)
POOLING Override pooling method (cls, mean, splade, last-token)
AUTO_TRUNCATE Truncate inputs longer than the model's maximum sequence length instead of rejecting them

Notes

  • Choose the GPU image (tei) for GPU instances (e.g. ml.g5, ml.g6) and the CPU image (tei-cpu) for CPU instances (e.g. ml.c6i, ml.c7i, ml.m6i). For the CPU SDK v2 helper, use get_huggingface_llm_image_uri("huggingface-tei-cpu", version="1.8.2"). Embedding models are small relative to LLMs — single-GPU and CPU instances are usually sufficient.
  • Rerankers and classifiers deploy with the same images and code — only HF_MODEL_ID and the request payload change (e.g. BAAI/bge-reranker-v2-m3 with {"query": "...", "texts": ["...", "..."]}). See API Endpoints for the payload per model type.
  • Inputs longer than the model's maximum sequence length are rejected unless you request truncation, either per request ({"inputs": "...", "truncate": true}) or globally (AUTO_TRUNCATE=true).