If you are running Pulsar in a bare metal cluster, make sure that the offloaders tarball is unzipped in every broker's Pulsar directory.
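For example, a minimal sketch of the bare metal steps, assuming a downloaded offloaders tarball; the file name, version, and paths are placeholders, so adjust them to your installation:

```bash
# Extract the offloaders tarball (file name and version are placeholders)
tar xvfz apache-pulsar-offloaders-2.x.x-bin.tar.gz

# Copy the extracted offloaders directory into each broker's Pulsar directory
cp -r apache-pulsar-offloaders-2.x.x/offloaders /path/to/pulsar/offloaders
```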
If you are running Pulsar in Docker or deploying Pulsar using a Docker image (such as K8s and DCOS), you can use the apachepulsar/pulsar-all image instead of the apachepulsar/pulsar image. The apachepulsar/pulsar-all image already bundles the tiered storage offloaders.
You can configure the AWS S3 offloader driver in the configuration file broker.conf or standalone.conf.
The required configurations are listed below.

| Required configuration | Description | Example value |
|---|---|---|
| managedLedgerOffloadDriver | Offloader driver name, which is case-insensitive. Note: there is a third driver type, S3, which is identical to AWS S3, except that S3 requires you to specify an endpoint URL using s3ManagedLedgerOffloadServiceEndpoint. This is useful when using an S3-compatible data store other than AWS S3. | aws-s3 |
| offloadersDirectory | Offloader directory | offloaders |
| s3ManagedLedgerOffloadBucket | Bucket | pulsar-topic-offload |
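As an illustration, the required settings above could be combined in broker.conf (or standalone.conf) like this; the bucket name is only the example value from the table:

```conf
# Offloader driver name (case-insensitive)
managedLedgerOffloadDriver=aws-s3

# Directory that contains the offloader drivers
offloadersDirectory=offloaders

# Bucket that receives the offloaded data (example value; replace with your own bucket)
s3ManagedLedgerOffloadBucket=pulsar-topic-offload
```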
The optional configurations are listed below.

| Optional configuration | Description | Example value |
|---|---|---|
| s3ManagedLedgerOffloadRegion | Bucket region. Note: before specifying a value for this parameter, you need to set additional configurations; otherwise, you might get an error. | |
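For illustration, the region setting in broker.conf could look like the following; the region value is only an assumption, so replace it with your bucket's actual region:

```conf
# Region of the offload bucket (illustrative value)
s3ManagedLedgerOffloadRegion=eu-west-3
```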
A bucket is a basic container that holds your data. Everything you store in AWS S3 must be contained in a bucket. You can use a bucket to organize your data and control access to your data, but unlike directories and folders, buckets cannot be nested.
To be able to access AWS S3, you need to authenticate with AWS S3. Pulsar does not provide any direct methods of configuring authentication for AWS S3, but relies on the mechanisms supported by the DefaultAWSCredentialsProviderChain. Once you have created a set of credentials in the AWS IAM console, you can configure credentials using one of the following methods:
- Use EC2 instance metadata credentials. If you are on an AWS instance with an instance profile that provides credentials, Pulsar uses these credentials if no other mechanism is provided.
- Set the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY in conf/pulsar_env.sh (see the sketch after this list). "export" is important so that the variables are made available in the environment of spawned processes.
You can configure the size of a request sent to or read from AWS S3 in the configuration file broker.conf or standalone.conf.
| Configuration | Description | Default value |
|---|---|---|
| s3ManagedLedgerOffloadReadBufferSizeInBytes | Block size for each individual read when reading back data from AWS S3. | 1 MB |
| s3ManagedLedgerOffloadMaxBlockSizeInBytes | Maximum size of a "part" sent during a multipart upload to AWS S3. It cannot be smaller than 5 MB. | 64 MB |
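For example, the defaults above correspond to the following broker.conf entries; adjust the byte values to suit your workload:

```conf
# Block size for each individual read from AWS S3 (1 MB)
s3ManagedLedgerOffloadReadBufferSizeInBytes=1048576

# Maximum size of a multipart upload "part" sent to AWS S3 (64 MB, must be at least 5 MB)
s3ManagedLedgerOffloadMaxBlockSizeInBytes=67108864
```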
Configure AWS S3 offloader to run automatically
A namespace policy can be configured to offload data automatically once a threshold is reached. The threshold is based on the size of the data that a topic has stored on a Pulsar cluster. Once the topic reaches the threshold, an offloading operation is triggered automatically.
| Threshold value | Action |
|---|---|
| > 0 | It triggers the offloading operation if the topic storage reaches its threshold. |
| = 0 | It causes a broker to offload data as soon as possible. |
| < 0 | It disables the automatic offloading operation. |
Automatic offloading runs when a new segment is added to a topic log. If you set the threshold on a namespace but few messages are being produced to the topic, the offloader does not work until the current segment is full.
You can configure the threshold size using CLI tools, such as pulsar-admin.
The offload configurations in broker.conf and standalone.conf are used for namespaces that do not have namespace-level offload policies. Each namespace can have its own offload policy. If you want to set the offload policy for a namespace, use the pulsar-admin namespaces set-offload-policies options command.
For more information about the pulsar-admin namespaces set-offload-threshold options command, including flags, descriptions, and default values, see here.
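For example, a sketch of setting a 10 MB offload threshold on a namespace with pulsar-admin; the tenant and namespace names are placeholders:

```bash
# Offload data to AWS S3 once topics in this namespace store more than 10 MB on the cluster
bin/pulsar-admin namespaces set-offload-threshold --size 10M my-tenant/my-namespace
```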
For individual topics, you can trigger the AWS S3 offloader manually using one of the following methods:
- Use the REST endpoint.
- Use CLI tools (such as pulsar-admin).
To trigger it via CLI tools, you need to specify the maximum amount of data (threshold) that should be retained on a Pulsar cluster for a topic. If the size of the topic data on the Pulsar cluster exceeds this threshold, segments from the topic are moved to AWS S3 until the threshold is no longer exceeded. Older segments are moved first.
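For example, a sketch of triggering offloading manually with pulsar-admin, keeping at most 10 MB of the topic's data on the cluster; the topic name is a placeholder:

```bash
# Move older segments to AWS S3 until at most 10 MB of this topic remains on the cluster
bin/pulsar-admin topics offload --size-threshold 10M persistent://my-tenant/my-namespace/my-topic
```

You can then check the progress of the operation with the pulsar-admin topics offload-status command.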