
Add CE3 based Kinesis sink #338

Closed

wants to merge 2 commits

Conversation

pondzix
Contributor

@pondzix pondzix commented Aug 23, 2023

No description provided.

Contributor

@istreeter istreeter left a comment


This is looking good.

However, the old collector had quite an intricate interplay between the kinesis and sqs queues. Here you have chosen to base your sink on the Enrich implementation, not the old collector implementation. That might be a good decision, but I don't know yet! We might find it's difficult to patch the SQS buffer into this implementation.

Time will tell....

Dependencies.Libraries.kinesis,
Dependencies.Libraries.sts,
Dependencies.Libraries.sqs,
// Dependencies.Libraries.sts,

We'll need sts as a runtime dependency, because that's how authorization works when it runs on EKS.

import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
import com.amazonaws.services.kinesis.model.{PutRecordsRequest, PutRecordsRequestEntry, PutRecordsResult}

Here you're using v1 of the aws sdk, i.e. classes that start with com.amazonaws. I know that's what enrich does, and I know that's what the old collector does. But I think it would be great to update to v2 of the aws sdk (classes that start with software.amazon).

The newer kinesis sdk has an async client which returns a CompletableFuture when you make a request. You could adapt the CompletableFuture to an F using Async. This would be slightly nicer than using the old blocking client and wrapping the request in Async.blocking(...).
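A minimal sketch of that adaptation, assuming cats-effect 3 and the v2 SDK. The helper name `putRecordsF` and its parameters are illustrative, not the PR's actual API:

```scala
import cats.effect.Async
import software.amazon.awssdk.services.kinesis.KinesisAsyncClient
import software.amazon.awssdk.services.kinesis.model.{PutRecordsRequest, PutRecordsRequestEntry, PutRecordsResponse}
import scala.jdk.CollectionConverters._

// Hypothetical helper: wraps the v2 async client's CompletableFuture in F.
def putRecordsF[F[_]: Async](
  client: KinesisAsyncClient,
  streamName: String,
  entries: List[PutRecordsRequestEntry]
): F[PutRecordsResponse] = {
  val request = PutRecordsRequest
    .builder()
    .streamName(streamName)
    .records(entries.asJava)
    .build()
  // fromCompletableFuture suspends creation of the future and completes
  // the F when the future completes, with no thread blocked in between.
  Async[F].fromCompletableFuture(Async[F].delay(client.putRecords(request)))
}
```

The same `fromCompletableFuture` combinator works for any `CompletableFuture`-returning call, so the whole v2 async surface can be adapted this way.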

Comment on lines +45 to +70
private def getProvider(awsConfig: KinesisSinkConfig.AWSConfig): AWSCredentialsProvider = {
def isDefault(key: String): Boolean = key == "default"

def isIam(key: String): Boolean = key == "iam"

def isEnv(key: String): Boolean = key == "env"

((awsConfig.accessKey, awsConfig.secretKey) match {
case (a, s) if isDefault(a) && isDefault(s) =>
new DefaultAWSCredentialsProviderChain()
case (a, s) if isDefault(a) || isDefault(s) =>
throw new IllegalArgumentException("accessKey and secretKey must both be set to 'default' or neither")
case (a, s) if isIam(a) && isIam(s) =>
InstanceProfileCredentialsProvider.getInstance()
case (a, s) if isIam(a) || isIam(s) =>
throw new IllegalArgumentException("accessKey and secretKey must both be set to 'iam' or neither")
case (a, s) if isEnv(a) && isEnv(s) =>
new EnvironmentVariableCredentialsProvider()
case (a, s) if isEnv(a) || isEnv(s) =>
throw new IllegalArgumentException("accessKey and secretKey must both be set to 'env' or neither")
case _ =>
new AWSStaticCredentialsProvider(
new BasicAWSCredentials(awsConfig.accessKey, awsConfig.secretKey)
)
})
}

I would vote to not do any of this. These days we always use the default credentials chain. I don't think anyone benefits from having extra credentials methods available. The default credentials chain is well documented by AWS, and it allows several different methods of authentication.
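Dropping the custom branching, the whole helper collapses to the default chain. A sketch with the v1 SDK the PR currently uses:

```scala
import com.amazonaws.auth.{AWSCredentialsProvider, DefaultAWSCredentialsProviderChain}

// The default chain already tries env vars, system properties, profile
// files, and container/instance credentials in turn, so no config switch
// over accessKey/secretKey is needed.
def getProvider: AWSCredentialsProvider =
  DefaultAWSCredentialsProviderChain.getInstance()
```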

Comment on lines +26 to +27
good <- KinesisSink.create[IO](config.sink, config.buffer, config.good)
bad <- KinesisSink.create[IO](config.sink, config.buffer, config.bad)

By doing this you create two kinesis clients, one for each sink. That's probably OK. Is it worth trying to share a single kinesis client across both sinks?
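One way to share a client is to acquire it once in the surrounding Resource scope and hand the same instance to both sinks. A generic sketch of the pattern, with stand-in types rather than the PR's actual API:

```scala
import cats.effect.{IO, Resource}

// Illustrative stand-ins for the real client and sink types.
final case class Client(id: Int)
final case class Sink(client: Client, name: String)

// The client Resource is bound once in the for-comprehension, so it is
// acquired exactly once; both sinks are built from that same instance.
def sharedSinks(mkClient: Resource[IO, Client]): Resource[IO, (Sink, Sink)] =
  for {
    client <- mkClient
  } yield (Sink(client, "good"), Sink(client, "bad"))
```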

Comment on lines +23 to +24
case cn @ "cn-north-1" => s"https://kinesis.$cn.amazonaws.com.cn"
case cn @ "cn-northwest-1" => s"https://kinesis.$cn.amazonaws.com.cn"

I know this is copied from some other Snowplow code..... but I don't like it. It feels like we are re-inventing what the aws sdk already does automatically.

I understand that we need an optional customEndpoint to enable localstack. But I would make it so that if customEndpoint is not set, then we simply do not call .withEndpointConfiguration on the client builder.
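A sketch of that conditional with the v1 builder; treating `customEndpoint` as an `Option[String]` is an assumption about the config model:

```scala
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration
import com.amazonaws.services.kinesis.{AmazonKinesis, AmazonKinesisClientBuilder}

def mkClient(region: String, customEndpoint: Option[String]): AmazonKinesis = {
  val builder = AmazonKinesisClientBuilder.standard()
  customEndpoint
    .fold(builder.withRegion(region)) { endpoint =>
      // Only override the endpoint for localstack-style setups; otherwise
      // let the SDK resolve the regional endpoint itself.
      builder.withEndpointConfiguration(new EndpointConfiguration(endpoint, region))
    }
    .build()
}
```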

Comment on lines +37 to +38
.chunks
.parEvalMapUnbounded(events => writeToKinesis(events, config, streamName, kinesis, bufferConfig))

I think this is not quite right, but the reason is hard to explain!! I will have a go at explaining.....

If you call .parEvalMapUnbounded then it executes the IO as soon as anything is available upstream. And .chunks on a queue-backed stream emits whatever is currently in the queue, which might be only 1 event.

So I think writeToKinesis is being called here with very small chunks of events. Even if the collector is receiving a lot of events, the queue never fills up significantly, because writeToKinesis is called as soon as the queue has any events in it.

This is a problem, because writing to kinesis is more efficient if we can build up big batches before writing.
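One way to build bigger batches is fs2's groupWithin, which emits a chunk once it holds n elements or once a timeout fires, whichever comes first. A generic sketch of the idea (500 matches the PutRecords record limit; the 100ms timeout is illustrative, not the PR's config):

```scala
import scala.concurrent.duration._
import cats.effect.IO
import fs2.{Chunk, Stream}

// Accumulate up to 500 events or 100ms worth, then write each batch.
// A busy stream yields full batches; a quiet one still flushes promptly.
def batched[A](events: Stream[IO, A], write: Chunk[A] => IO[Unit]): Stream[IO, Unit] =
  events
    .groupWithin(500, 100.millis)
    .parEvalMapUnbounded(write)
```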

Comment on lines +19 to +24
for {
eventsBuffer <- Resource.eval(Queue.unbounded[F, Option[Event]])
kinesisClient <- KinesisClient.create(config, streamName)
kinesisWriteOutcome <- WritingToKinesisTask.run[F](config, buffer, streamName, eventsBuffer, kinesisClient)
sink <- Resource.make(createSink(config, eventsBuffer))(stopSink(eventsBuffer, kinesisWriteOutcome))
} yield sink

👍

@pondzix pondzix changed the title Add Kinesis sink Add CE3 based Kinesis sink Aug 25, 2023
@spenes spenes mentioned this pull request Aug 25, 2023
@pondzix
Copy link
Contributor Author

pondzix commented Sep 8, 2023

Closing in favour of #339

@pondzix pondzix closed this Sep 8, 2023