
[Python] Arrow doesn't pick up endpoint_url from ~/.aws/config #44119

Open

hutch3232 (Contributor) opened this issue Sep 14, 2024 · 0 comments

Describe the bug, including details regarding any error messages, version, and platform.

A while back I opened this issue: pandas-dev/pandas#57449, mistakenly thinking the pandas parquet reader was at fault because of the odd error thrown. Recent testing has shown it is probably an issue with pyarrow.

Testing was done on Ubuntu 20.04 with pyarrow 17.0.0. I am using an on-prem S3-compatible storage provider. This means AWS_REGION is irrelevant, but AWS_ENDPOINT_URL is essential.

I have defined endpoint_url in ~/.aws/config under the correct profile per the specification here:
https://aws.amazon.com/blogs/developer/new-improved-flexibility-when-configuring-endpoint-urls-with-the-aws-sdks-and-tools/
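For reference, a profile-level entry of the kind that blog post describes looks like the following (the names and URL are placeholders matching the rest of this report; the spec also allows per-service endpoints under a services section, which I omit here):

# ~/.aws/config -- illustrative values
[profile my-bucket-role]
endpoint_url = https://my-endpoint.com

With that in place, this minimal reproduction fails: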

import os
import pyarrow.parquet as pq

# select the profile whose ~/.aws/config entry defines endpoint_url
os.environ["AWS_PROFILE"] = "my-bucket-role"

tbl = pq.read_table("s3://my-bucket/my-parquet")

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[1], line 14
---> 14 tbl = pq.read_table("s3://my-bucket/my-parquet")

File /opt/conda/lib/python3.9/site-packages/pyarrow/parquet/core.py:1793, in read_table(source, columns, use_threads, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, page_checksum_verification)
   1787     warnings.warn(
   1788         "Passing 'use_legacy_dataset' is deprecated as of pyarrow 15.0.0 "
   1789         "and will be removed in a future version.",
   1790         FutureWarning, stacklevel=2)
   1792 try:
-> 1793     dataset = ParquetDataset(
   1794         source,
   1795         schema=schema,
   1796         filesystem=filesystem,
   1797         partitioning=partitioning,
   1798         memory_map=memory_map,
   1799         read_dictionary=read_dictionary,
   1800         buffer_size=buffer_size,
   1801         filters=filters,
   1802         ignore_prefixes=ignore_prefixes,
   1803         pre_buffer=pre_buffer,
   1804         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
   1805         decryption_properties=decryption_properties,
   1806         thrift_string_size_limit=thrift_string_size_limit,
   1807         thrift_container_size_limit=thrift_container_size_limit,
   1808         page_checksum_verification=page_checksum_verification,
   1809     )
   1810 except ImportError:
   1811     # fall back on ParquetFile for simple cases when pyarrow.dataset
   1812     # module is not available
   1813     if filters is not None:

File /opt/conda/lib/python3.9/site-packages/pyarrow/parquet/core.py:1344, in ParquetDataset.__init__(self, path_or_paths, filesystem, schema, filters, read_dictionary, memory_map, buffer_size, partitioning, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, page_checksum_verification, use_legacy_dataset)
   1341 if filesystem is None:
   1342     # path might be a URI describing the FileSystem as well
   1343     try:
-> 1344         filesystem, path_or_paths = FileSystem.from_uri(
   1345             path_or_paths)
   1346     except ValueError:
   1347         filesystem = LocalFileSystem(use_mmap=memory_map)

File /opt/conda/lib/python3.9/site-packages/pyarrow/_fs.pyx:477, in pyarrow._fs.FileSystem.from_uri()

File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

OSError: When resolving region for bucket 'my-bucket': AWS Error NETWORK_CONNECTION during HeadBucket operation: curlCode: 28, Timeout was reached

If I also set AWS_ENDPOINT_URL as an environment variable, it does work:

os.environ["AWS_ENDPOINT_URL"] = "https://my-endpoint.com"
tbl = pq.read_table("s3://my-bucket/my-parquet") # no error
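A workaround that avoids environment variables entirely is to build the filesystem explicitly and pass the endpoint through pyarrow's own endpoint_override parameter (bucket, path, and endpoint below are placeholders; the region value is arbitrary since the provider ignores it, but setting it skips pyarrow's region resolution):

from pyarrow import fs
import pyarrow.parquet as pq

# endpoint_override points pyarrow at the S3-compatible provider directly,
# bypassing the HeadBucket region lookup that fails above
s3 = fs.S3FileSystem(endpoint_override="https://my-endpoint.com",
                     region="us-east-1")

# note: no "s3://" scheme when an explicit filesystem is passed
tbl = pq.read_table("my-bucket/my-parquet", filesystem=s3)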

I think pyarrow should read endpoint_url from ~/.aws/config when it exists and AWS_ENDPOINT_URL is not set, per https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-endpoints.html. That would avoid the confusing error about 'region' and the need to set an extra environment variable.
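As a sketch of the lookup pyarrow could perform (or that a caller can do by hand today), the following reads the profile-level endpoint_url key from the AWS config file; it deliberately ignores the per-service services sections the spec also allows, so treat it as a simplified illustration rather than a full implementation:

import configparser
import os

def endpoint_url_from_aws_config(profile=None):
    """Return the profile-level endpoint_url from the AWS config file, or None."""
    profile = profile or os.environ.get("AWS_PROFILE", "default")
    config_path = os.environ.get("AWS_CONFIG_FILE",
                                 os.path.expanduser("~/.aws/config"))
    parser = configparser.ConfigParser()
    parser.read(config_path)
    # the CLI config file prefixes non-default profiles with "profile "
    section = "default" if profile == "default" else f"profile {profile}"
    return parser.get(section, "endpoint_url", fallback=None)

The returned value could then be fed to S3FileSystem(endpoint_override=...) as in the snippet above, mirroring what AWS_ENDPOINT_URL does via the environment.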

Component(s)

Parquet, Python
