Support for Team BYOB On-Prem stopped working (SUPPORTED_FILE_STORES) #272

vijay-wandb opened this issue Sep 3, 2024 · 2 comments
@vijay-wandb

@flamarion @al I was incorrect when I said that scenario 2 worked without a code change. I could not connect from the local server instance to the minio bucket on my localhost without making code changes. I again had a false positive where the UI said the connection worked, and at the end of the day Thursday I forgot to check that the backend was actually able to connect.

I still do not know if the code is in a working state, and if it is, for what scenarios it works. I am going to note my observations here below, but keep in mind that they could be due to issues with my local setup.


I spin up the local server instance in one docker container and the minio bucket on my localhost in another docker container, so I still needed to use ngrok to connect them. (I haven't yet had a chance to test spinning up the minio BYOB bucket in the same docker container.) So, for scenario 2 I needed to make the same code change that removes the AWS GetBucketRegion SDK call in order for the local gorilla app to connect to the bucket and upload objects to it. With that change, the app can connect and create a team.
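To make the idea behind that code change concrete, here is a minimal Python sketch of a guard that only attempts a region lookup for AWS-hosted endpoints. This is not gorilla's code (gorilla is a Go service) and not the actual patch: `is_aws_s3_endpoint`, `resolve_region`, and the `lookup` callback are hypothetical names, and `lookup` merely stands in for a GetBucketRegion-style call.

```python
from typing import Callable, Optional


def is_aws_s3_endpoint(host: str) -> bool:
    """Return True for hosts under amazonaws.com, False for minio, Ceph, ngrok, etc."""
    # Drop an optional port and a trailing dot before comparing suffixes.
    name = host.lower().split(":")[0].rstrip(".")
    return name.endswith(".amazonaws.com") or name.endswith(".amazonaws.com.cn")


def resolve_region(
    host: str,
    configured_region: Optional[str],
    lookup: Callable[[], str],
) -> Optional[str]:
    """Pick a region without calling AWS for non-AWS endpoints.

    `lookup` is a hypothetical stand-in for a GetBucketRegion-style call.
    """
    if configured_region:
        return configured_region  # an explicit setting always wins
    if is_aws_s3_endpoint(host):
        return lookup()  # only ask AWS about buckets that live on AWS
    return None  # S3-compatible stores: leave the region unset
```

With a guard like this, a minio endpoint reached through ngrok or `localhost:9000` would never trigger the lookup, while a host such as `s3.eu-central-1.amazonaws.com` still would.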

However, the SDK fails to upload. For example,

>>> run = wandb.init(project="pOP", entity='onpremb')
wandb: Currently logged in as: andrew-levin (onpremb). Use `wandb login --relogin` to force relogin
wandb: ERROR wandb version 0.16.4.dev1 has been retired!  Please upgrade.
wandb: wandb version 0.17.0 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.16.4.dev1
wandb: Run data is saved locally in /Users/andrew/repos/core/services/gorilla/wandb/run-20240523_001524-d2v2js48
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run kind-water-1
wandb: ⭐️ View project at <https://app.wandb.test/onpremb/pOP>
wandb: 🚀 View run at <https://app.wandb.test/onpremb/pOP/runs/d2v2js48>
>>> wandb: ERROR Error uploading "wandb-metadata.json": CommError, <Response [404]>
wandb: ERROR Error uploading "upstream_diff_a85b9ecffdd97c42324aee25dd46526a2dbd450e.patch": CommError, <Response [404]>
wandb: ERROR Error uploading "diff.patch": CommError, <Response [404]>

and then the logs show

2024-05-23 0026,769 INFO    SenderThread:50951 [sender.py1403] saving file wandb-metadata.json with policy now
2024-05-23 0026,822 ERROR   wandb-upload_0:50951 [internal_api.py2765] upload_file exception https://app.wandb.test/privb/onpremb/pOP/d2v2js48/wandb-metadata.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=uwIvWSLpkzpzywH5Z94q%2F20240523%2Fwandb-local%2Fs3%2Faws4_request&X-Amz-Date=20240523T071526Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-User=andrew-levin&X-Amz-Signature=82bc2a4957c67265f2795faba2e130eaf42c1a9a8d721d4234f416988e7d9143: 404 Client Error: Not Found for url:...
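
When a presigned upload URL returns 404, it can help to split the URL into its parts and check which host, path, and credential scope the client was actually told to use. The sketch below uses only the Python standard library; `describe_presigned_url` is a hypothetical helper name, and the access key in the usage example is a placeholder rather than a real one.

```python
from urllib.parse import parse_qs, urlsplit


def describe_presigned_url(url: str) -> dict:
    """Split a SigV4 presigned URL into the parts that matter for debugging."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # parse_qs percent-decodes the values
    # X-Amz-Credential is <access-key>/<date>/<region>/<service>/aws4_request
    credential = query.get("X-Amz-Credential", [""])[0]
    fields = credential.rsplit("/", 4)
    scope = fields[1:4] if len(fields) == 5 else [None, None, None]
    expires = query.get("X-Amz-Expires", [None])[0]
    return {
        "host": parts.hostname,
        "path": parts.path,
        "date": scope[0],
        "region": scope[1],
        "service": scope[2],
        "expires_seconds": int(expires) if expires is not None else None,
    }


if __name__ == "__main__":
    # Same shape as the URL in the log above; EXAMPLEKEY is a placeholder.
    url = (
        "https://app.wandb.test/privb/onpremb/pOP/d2v2js48/wandb-metadata.json"
        "?X-Amz-Algorithm=AWS4-HMAC-SHA256"
        "&X-Amz-Credential=EXAMPLEKEY%2F20240523%2Fwandb-local%2Fs3%2Faws4_request"
        "&X-Amz-Date=20240523T071526Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host"
    )
    for key, value in describe_presigned_url(url).items():
        print(f"{key}: {value}")
```

For the URL in the log, the host is `app.wandb.test`, the path starts with `/privb/`, and the region in the credential scope reads `wandb-local`. Whether those values are the ones the team bucket should be using is the thing to compare against the bucket configuration.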

Issue created in Slack from a [message](https://weightsandbiases.slack.com/archives/C0123GDE0NM/p1716451617039009?thread_ts=1715198459.714839&cid=C0123GDE0NM).




https://github.com/user-attachments/assets/66231f7e-9332-4b2a-abff-a620ccbf7937

@vijay-wandb

From Flamarion Jorge

Here is the repro

I tested Team BYOB using SUPPORTED_FILE_STORES against 3 different kinds of object storage (Ceph (RadosGW), Minio, and now AWS S3), and it doesn’t work consistently.
This has worked in the past; I tested it myself with Minio, and that worked just fine. The configuration is exactly the same. There’s no difference, because the env var requires the content to be the same (although it doesn’t complain if you don’t set it correctly).

Here is the test using AWS S3 (you can replicate to any other S3)

$ export AWS_SECRET_ACCESS_KEY="FZlP2t6F1dEI4syfmba0YDiVGOYQYV9rUlrVWWya"
$ export AWS_ACCESS_KEY_ID="AKIA3G72DHZ4SFDSM5NK"
$ aws s3 ls flamarion-team-byob-test

There was no content in the bucket, so I uploaded a random file and listed it again

$ aws s3 ls flamarion-team-byob-test
2024-06-03 15:20:57        370 list_buckets.py

This means the access key and secret key are valid (you can use them to test).
Then I configured my local deployment

 $ kubectl exec -ti wandb-app-85555d5c5b-tjqfw -- env | grep SUPPORTED
Defaulted container "app" out of: app, init-db (init)
SUPPORTED_FILE_STORES=s3://AKIA3G72DHZ4SFDSM5NK:FZlP2t6F1dEI4syfmba0YDiVGOYQYV9rUlrVWWya@s3.eu-central-1.amazonaws.com/flamarion-team-byob-test
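
Since the env var reportedly doesn’t complain when it is set incorrectly, a quick client-side check of the value can rule out typos before testing. The sketch below assumes the `s3://KEY:SECRET@host/bucket` shape shown above and uses only the Python standard library. It is not how gorilla parses the value; `parse_store_url` and `redact` are hypothetical helper names, and the credentials in the usage example are placeholders.

```python
from urllib.parse import unquote, urlsplit


def parse_store_url(url: str) -> dict:
    """Split an s3://KEY:SECRET@host/bucket URL into its components."""
    parts = urlsplit(url)
    if parts.scheme != "s3":
        raise ValueError(f"expected an s3:// URL, got scheme {parts.scheme!r}")
    bucket = parts.path.lstrip("/")
    if not (parts.username and parts.password and parts.hostname and bucket):
        raise ValueError("URL must look like s3://KEY:SECRET@host/bucket")
    endpoint = parts.hostname
    if parts.port is not None:
        endpoint = f"{endpoint}:{parts.port}"
    return {
        "access_key": unquote(parts.username),
        "secret_key": unquote(parts.password),
        "endpoint": endpoint,
        "bucket": bucket,
    }


def redact(url: str) -> str:
    """Return the URL with the secret masked, safe to paste into an issue."""
    cfg = parse_store_url(url)
    return f"s3://{cfg['access_key']}:***@{cfg['endpoint']}/{cfg['bucket']}"


if __name__ == "__main__":
    # Placeholder credentials, same shape as the value above.
    value = "s3://EXAMPLEKEY:examplesecret@s3.eu-central-1.amazonaws.com/flamarion-team-byob-test"
    print(parse_store_url(value)["endpoint"])
    print(redact(value))
```

In this sketch a secret containing `/` has to be percent-encoded, because an unencoded slash ends the host part of the URL. `redact` is also a convenient way to share the configured value without exposing the secret key.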

In theory, I should be able to create a Team and assign the bucket flamarion-team-byob-test, but that doesn’t happen. There’s a video, and this is the log.

{"level":"INFO","time":"2024-06-03T13:14:32.447014485Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/relay.go:148","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","userID":8,"operationName":"availableTeam","authUser":"flamarion","defaultEntityID":6,"operationName":"availableTeam","authUser":"flamarion","variables":{"teamName":"team-a"},"appPath":"/home"},"message":"Graphql operation availableTeam for user flamarion with variables map[teamName:team-a] from app path /home","dd.trace_id":"13034541662550076139"}
{"level":"INFO","time":"2024-06-03T13:14:32.448078753Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/relay.go:243","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","userID":8,"operationName":"availableTeam","authUser":"flamarion","defaultEntityID":6,"latencyNs":1066695,"statusCode":200,"operationName":"availableTeam","authUser":"flamarion","variables":{"teamName":"team-a"},"latencyStr":"1.066695ms"},"message":"Graphql operation availableTeam for user flamarion with variables map[teamName:team-a] finished in 1.066695ms","dd.trace_id":"13034541662550076139"}
{"level":"INFO","time":"2024-06-03T13:14:32.448653588Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/logging.go:62","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","requestID":"13034541662550076139","duration":"9.059942ms","statusCode":200,"path":"/graphql"},"message":"Finished request 13034541662550076139 in 9.059942ms with status %!s(int=200) on /graphql","dd.trace_id":"13034541662550076139"}
{"level":"INFO","time":"2024-06-03T13:14:36.013749232Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/logging.go:57","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","requestID":"13318444034683680619","path":"/graphql"},"message":"Starting request 13318444034683680619 on /graphql","dd.trace_id":"13318444034683680619"}
{"level":"INFO","time":"2024-06-03T13:14:36.022794835Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/relay.go:148","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","userID":8,"operationName":"testBucketStoreConnection","authUser":"flamarion","defaultEntityID":6,"operationName":"testBucketStoreConnection","authUser":"flamarion","variables":{"input":{"name":"flamarion-team-byob-test","provider":"AWS","organizationID":null}},"appPath":"/home"},"message":"Graphql operation testBucketStoreConnection for user flamarion with variables map[input:map[name:flamarion-team-byob-test organizationID:<nil> provider:AWS]] from app path /home","dd.trace_id":"13318444034683680619"}
{"level":"INFO","time":"2024-06-03T13:14:36.03393013Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/relay.go:243","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","userID":8,"operationName":"testBucketStoreConnection","authUser":"flamarion","defaultEntityID":6,"latencyNs":11142999,"statusCode":200,"operationName":"testBucketStoreConnection","authUser":"flamarion","variables":{"input":{"name":"flamarion-team-byob-test","provider":"AWS","organizationID":null}},"latencyStr":"11.142999ms"},"message":"Graphql operation testBucketStoreConnection for user flamarion with variables map[input:map[name:flamarion-team-byob-test organizationID:<nil> provider:AWS]] finished in 11.142999ms","dd.trace_id":"13318444034683680619"}
{"level":"INFO","time":"2024-06-03T13:14:36.034779852Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/logging.go:62","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","requestID":"13318444034683680619","duration":"21.027626ms","statusCode":200,"path":"/graphql"},"message":"Finished request 13318444034683680619 in 21.027626ms with status %!s(int=200) on /graphql","dd.trace_id":"13318444034683680619"}
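
To pull the relevant operations out of a longer log capture, a small filter over the JSON records can help. The sketch below assumes one complete JSON record per line (wrapped or truncated lines are skipped) and relies only on the `data.operationName` and `data.statusCode` fields visible in the log above; `summarize_graphql_ops` is a hypothetical helper name.

```python
import json
from typing import List, Tuple


def summarize_graphql_ops(log_text: str) -> List[Tuple[str, int]]:
    """Return (operationName, statusCode) for each finished GraphQL operation."""
    results = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip wrapped or truncated lines
        data = record.get("data", {})
        # Only the "finished" records carry a statusCode next to operationName.
        if "operationName" in data and "statusCode" in data:
            results.append((data["operationName"], data["statusCode"]))
    return results
```

In the log above, testBucketStoreConnection finishes with statusCode 200 in about 11 ms. GraphQL servers commonly return HTTP 200 even when the response body carries errors, so a 200 here does not by itself show that the backend reached the bucket; the response body or a direct upload test would be needed to confirm that.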

To ensure the test is not biased or tied to a non-standard S3 API, I used AWS S3 in a sandbox account, region eu-central-1, while the main bucket of my installation is on my local Ceph RadosGW storage:

$ kubectl exec -ti wandb-app-85555d5c5b-tjqfw -- env | grep BUCKET
Defaulted container "app" out of: app, init-db (init)
BUCKET=s3://3IAXODZ870OCD6TCFIAD:[email protected]/wandb
OVERFLOW_BUCKET_ADDR=s3://3IAXODZ870OCD6TCFIAD:[email protected]/wandb
BUCKET_QUEUE=internal://

Please let me know if there are any other tests I can run.

I have Minio too; if necessary, I can run the tests using Minio (the same setup that worked in the past and that I reported here: https://weightsandbiases.slack.com/archives/C0123GDE0NM/p1705056205042069?thread_ts=1704995296.871159&cid=C0123GDE0NM).

@abhinavg6

@flamarion @levinandrew - Is this a current issue? Sorry, I hadn't heard of this until now. It should not be in the consultant queue anyway.
