
Support checkpointing on interval only, not batch completion #30

Open
mbrancato opened this issue Feb 9, 2019 · 8 comments
Labels: enhancement (New feature or request), int-shortlist

Comments

@mbrancato

Please add support for making the checkpoint interval the only trigger for writing a checkpoint to blob storage. Currently, a checkpoint is also written whenever any batch completes. The problem is that some outputs are constrained by batch size, which can lead to smaller batches and therefore a large number of write operations. I've seen this become very expensive even in smaller environments. Even at the default batch size of 50, checkpointing generates heavy read/write traffic.

That said, our use case would be fine with a purely time-based checkpoint rather than a batch-based one. With a checkpoint interval of 30 seconds or so, significant cost savings could be realized.

@jsvd jsvd added the enhancement New feature or request label Feb 12, 2019
@choovick

Also interested in any feature that can reduce storage account costs. I'm surprised there is no way to use the local filesystem, or a centralized cache/document store like Redis or MongoDB, to save checkpoints...

@choovick

@mbrancato Thinking about it again: can't we achieve this by setting a very large max_batch_size and setting checkpoint_interval to the desired delay? We would have to watch memory usage, though...
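For illustration, the workaround above would look roughly like this in a pipeline configuration (option names follow the logstash-input-azure_event_hubs plugin; the connection strings and values are placeholders, not tested settings):

```
input {
  azure_event_hubs {
    event_hub_connections => ["Endpoint=sb://example-ns.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...;EntityPath=example-hub"]
    storage_connection    => "DefaultEndpointsProtocol=https;AccountName=example;AccountKey=...;EndpointSuffix=core.windows.net"
    max_batch_size        => 500   # large batches => fewer batch-completion checkpoints
    checkpoint_interval   => 30    # seconds between interval-driven checkpoints
  }
}
```

The trade-off, as noted above, is memory: a larger max_batch_size means more events held in the pipeline between checkpoints.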

@mbrancato
Author

I had other limitations on batch sizes. Also, the batch size setting doesn't make Logstash wait for the batch queue to fill up. If you are using Azure Storage, be sure to use V1 storage accounts, since their transaction costs are about 90% lower than V2.

@choovick

@mbrancato I see, thanks! It doesn't look like Microsoft is planning to end-of-life V1, so I'm definitely going to try it out.

@SpencerLN

+1 for any feature that can reduce the storage costs associated with check-pointing.

@shauryagarg2006

Also, I think the checkpointing interval currently serves no purpose: every event is checkpointed regardless of the setting.

I made the following change in my fork to fix it.

shauryagarg2006@60292a1
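As a rough illustration (a hypothetical sketch, not the code from the commit above), interval-only checkpointing boils down to persisting a checkpoint only when enough time has elapsed since the last write, instead of on every batch:

```ruby
# Hypothetical sketch: a batch handler calls checkpoint_due? after each batch,
# but a checkpoint is actually persisted only when `interval` seconds have
# passed since the last persisted checkpoint.
class IntervalCheckpointer
  # `clock` is injectable for testing; defaults to wall-clock seconds.
  def initialize(interval_seconds, clock: -> { Time.now.to_f })
    @interval = interval_seconds
    @clock = clock
    @last_checkpoint = @clock.call
  end

  # Returns true (and resets the timer) only when the interval has elapsed.
  def checkpoint_due?
    now = @clock.call
    return false if now - @last_checkpoint < @interval
    @last_checkpoint = now
    true
  end
end
```

With this gate in place, batch completion merely updates in-memory progress; blob storage is touched at most once per interval, which is what the original request asks for.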

@ghost

ghost commented Nov 18, 2020

Hello,
I had the same issue with the cost associated with Azure Storage. I decided to switch to the Kafka input, since Azure Event Hubs supports it. The costs are quite low now!
The Microsoft documentation states that Azure Event Hubs uses Azure Storage internally when using the Kafka interface; I hope they won't change their mind about providing this storage without additional costs.
Some resources you might find useful:
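The Kafka-input approach described above would look roughly like this (a sketch using the logstash-input-kafka plugin against the Event Hubs Kafka endpoint; the namespace, hub name, and connection string are placeholders):

```
input {
  kafka {
    bootstrap_servers => "example-ns.servicebus.windows.net:9093"
    topics            => ["example-hub"]   # an Event Hub appears as a Kafka topic
    group_id          => "logstash"
    security_protocol => "SASL_SSL"
    sasl_mechanism    => "PLAIN"
    # Event Hubs accepts the literal username "$ConnectionString" with the
    # namespace connection string as the password.
    sasl_jaas_config  => 'org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="Endpoint=sb://example-ns.servicebus.windows.net/;SharedAccessKeyName=...;SharedAccessKey=...";'
  }
}
```

With this setup, offset tracking is handled by the Kafka consumer-group machinery on the Event Hubs side rather than by per-batch writes to your own storage account.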

@lucianaparaschivei
Copy link

Hello,
We are facing the same issue with high costs on Azure Storage. The plugin is making far too many storage transactions: it looks like one checkpoint per 3-4 messages, which is a lot. Can you please address this?


8 participants