Streams are not restored, consumers after failure in Docker Swarm [v2.10.20] #5876

Open
KIT-IT opened this issue Sep 11, 2024 · 7 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@KIT-IT

KIT-IT commented Sep 11, 2024

Observed behavior

Hello! Docker image version: 2.10.18-alpine3.20
Client: 2.3.2
NATS JetStream runs as a RAFT cluster of 3 services in Docker Swarm (separate services, not replicas of one service), with a gateway enabled to another stack that also has 3 services.
Storage is mounted via a volume.

    volumes:
      - /etc/localtime:/etc/localtime:ro
      - {"type": "tmpfs", "target": "/tmp"}
      - {"type": "tmpfs", "target": "/run"}
      - /var/nats/natsjs_1/:/data/jetstream/

Each service has its own storage. All services run on one node. Streams and consumers have three replicas.

    nats str add Stream \
      --subjects "stream.in, stream.out" \
      --ack \
      --max-msgs=-1 \
      --max-bytes=-1 \
      --max-age=3d \
      --storage file \
      --retention work \
      --max-msg-size=-1 \
      --discard old \
      --dupe-window="2h" \
      --no-deny-purge \
      --deny-delete \
      --no-allow-rollup \
      --max-msgs-per-subject=-1 \
      --replicas 3

The power was recently cut, and after it was restored NATS JetStream no longer sees the streams, consumers, or messages, although they are still present in storage. Why did this happen, and how can we get the data back? The files are physically present.

An example log is attached. It is the same on all 3 nodes, and it contains no hint about recovery or anything similar.
natsjs_1.log
Configs:
natsbug.zip

Expected behavior

All data is restored after the restart.

Server and client version

Server:
Docker image 2.10.18-alpine3.20
Client:

Host environment

No response

Steps to reproduce

No response

@KIT-IT added the defect label Sep 11, 2024
@derekcollison
Member

We would want to check the meta layer storage. By default this is synced to disk every 2m or so, but the interval is configurable.

If the system does not have a meta assignment on restart it will declare the asset as an orphan and remove the actual data for the stream.

Can you confirm that after restart the data for the stream still seems present?

@wallyqs changed the title from "Nats jetstream. Streams are not restored, consumers after failure." to "Streams are not restored, consumers after failure in Docker Swarm [v2.10.29]" Sep 11, 2024
@wallyqs changed the title from "Streams are not restored, consumers after failure in Docker Swarm [v2.10.29]" to "Streams are not restored, consumers after failure in Docker Swarm [v2.10.20]" Sep 11, 2024
@KIT-IT
Author

KIT-IT commented Sep 12, 2024

@wallyqs
Here is the data from storage. The "backup" directory contains the lost data.
storage.zip

"We would want to check the meta layer storage. By default this is synched every 2m or so but is settable."
Could you tell me how to configure it?

"If the system does not have a meta assignment on restart it will declare the asset as an orphan and remove the actual data for the stream."
Is this a specific file in the storage? Is it kept in the store directory on the mounted storage disk, or inside the container?

"Can you confirm that after restart the data for the stream still seems present?"
Yes, the data is present. See the "backup" directory in the zip.
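
One way to confirm what is physically present is to list the stream directories on disk and compare them with what the server reports. The sketch below is only illustrative: the function name `list_streams_on_disk` is hypothetical, and it assumes the file store keeps streams under `STORE_DIR/jetstream/<account>/streams/<stream>/`, which should be checked against the actual volume.

```shell
#!/bin/sh
# list_streams_on_disk STORE_DIR
# Prints "<account>/<stream>" for every stream directory found on disk.
# ASSUMPTION: streams live under STORE_DIR/jetstream/<account>/streams/<stream>/.
list_streams_on_disk() {
    store_dir=$1
    for dir in "$store_dir"/jetstream/*/streams/*/; do
        # If the glob matched nothing it stays literal, so skip non-directories.
        [ -d "$dir" ] || continue
        stream=$(basename "$dir")
        account=$(basename "$(dirname "$(dirname "$dir")")")
        printf '%s/%s\n' "$account" "$stream"
    done
}
```

Running `list_streams_on_disk /var/nats/natsjs_1` on the host (adjusting the path to wherever the store directory actually sits inside the mounted volume) and comparing the result with `nats stream ls` shows which streams exist on disk but are unknown to the server.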

@derekcollison
Member

Under the top-level jetstream block in the server config, the setting is sync_interval. It takes either a time duration or always, which means every write is synced, but this will dramatically affect performance.

e.g.

	jetstream: {
		store_dir: "/var/data"
		sync_interval: 1s
	}

The meta layer information is stored on all servers and can be found under the store directory.

e.g.

/var/data/jetstream/$SYS/_js_/_meta_
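
A quick check of that directory on each server can be sketched as below. The function name `meta_dir_status` is hypothetical, and the path is built from the example above (`store_dir` plus `jetstream/$SYS/_js_/_meta_`); it only reports whether the directory exists and holds any files, not whether its contents are valid.

```shell
#!/bin/sh
# meta_dir_status STORE_DIR
# Prints "missing", "empty", or "present" for the meta layer directory.
# ASSUMPTION: the directory is STORE_DIR/jetstream/$SYS/_js_/_meta_,
# following the example path in the comment above.
meta_dir_status() {
    # \$SYS is escaped so the literal directory name "$SYS" is used.
    meta_dir="$1/jetstream/\$SYS/_js_/_meta_"
    if [ ! -d "$meta_dir" ]; then
        echo "missing"
    elif [ -z "$(ls -A "$meta_dir")" ]; then
        echo "empty"
    else
        echo "present"
    fi
}
```

For example, `meta_dir_status /var/data` would be run once per server, against each server's own store directory.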

@KIT-IT
Author

KIT-IT commented Sep 13, 2024

@wallyqs @derekcollison
We did not set an interval, so the default of about 2 minutes applied. At the time of the restart, the assignment was present in storage: if you look in the attached data archive, you will see data directories under $SYS (ACATXWZXTVRQQGGJ7WFUWYZ7YWMALUH3ZIHVNCV6VLQPL3KALSPTAZNZ). If it is not too much trouble, please recommend how to avoid similar situations. We will launch into production soon, and it is very scary to go live with such a critical problem.

@derekcollison
Member

If I may ask, are you a Synadia customer?

@KIT-IT
Author

KIT-IT commented Sep 17, 2024

"If I may ask, are you a Synadia customer?"

No.

@derekcollison
Member

OK, no worries. We always do our best to help folks in the ecosystem properly configure their systems for production use cases. Obviously we need to prioritize Synadia customers.
