
[Python] Arrow doesn't pick up endpoint_url from ~/.aws/config #44119

Open

hutch3232 (Contributor) opened this issue Sep 14, 2024 · 0 comments

Describe the bug, including details regarding any error messages, version, and platform.

A while back I opened this issue: pandas-dev/pandas#57449, mistakenly thinking the pandas parquet reader was at fault because of the odd error thrown. Recent testing has shown it is probably an issue with pyarrow.

Testing was done on Ubuntu 20.04 with pyarrow 17.0.0. I am using an on-prem S3-compatible storage provider. This means AWS_REGION is irrelevant, but AWS_ENDPOINT_URL is essential.

I have defined endpoint_url in ~/.aws/config under the correct profile per the specification here:
https://aws.amazon.com/blogs/developer/new-improved-flexibility-when-configuring-endpoint-urls-with-the-aws-sdks-and-tools/
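For reference, a profile-level entry of the kind that blog post describes looks like the following (the names and URL are placeholders matching the rest of this report; the spec also allows per-service endpoints under a services section, which I omit here):

# ~/.aws/config -- illustrative values
[profile my-bucket-role]
endpoint_url = https://my-endpoint.com

With that in place, this minimal reproduction fails: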

import os
import pyarrow.parquet as pq

# select the profile whose ~/.aws/config entry defines endpoint_url
os.environ["AWS_PROFILE"] = "my-bucket-role"

tbl = pq.read_table("s3://my-bucket/my-parquet")

---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[1], line 14
---> 14 tbl = pq.read_table("s3://my-bucket/my-parquet")

File /opt/conda/lib/python3.9/site-packages/pyarrow/parquet/core.py:1793, in read_table(source, columns, use_threads, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, page_checksum_verification)
   1787     warnings.warn(
   1788         "Passing 'use_legacy_dataset' is deprecated as of pyarrow 15.0.0 "
   1789         "and will be removed in a future version.",
   1790         FutureWarning, stacklevel=2)
   1792 try:
-> 1793     dataset = ParquetDataset(
   1794         source,
   1795         schema=schema,
   1796         filesystem=filesystem,
   1797         partitioning=partitioning,
   1798         memory_map=memory_map,
   1799         read_dictionary=read_dictionary,
   1800         buffer_size=buffer_size,
   1801         filters=filters,
   1802         ignore_prefixes=ignore_prefixes,
   1803         pre_buffer=pre_buffer,
   1804         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
   1805         decryption_properties=decryption_properties,
   1806         thrift_string_size_limit=thrift_string_size_limit,
   1807         thrift_container_size_limit=thrift_container_size_limit,
   1808         page_checksum_verification=page_checksum_verification,
   1809     )
   1810 except ImportError:
   1811     # fall back on ParquetFile for simple cases when pyarrow.dataset
   1812     # module is not available
   1813     if filters is not None:

File /opt/conda/lib/python3.9/site-packages/pyarrow/parquet/core.py:1344, in ParquetDataset.__init__(self, path_or_paths, filesystem, schema, filters, read_dictionary, memory_map, buffer_size, partitioning, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, page_checksum_verification, use_legacy_dataset)
   1341 if filesystem is None:
   1342     # path might be a URI describing the FileSystem as well
   1343     try:
-> 1344         filesystem, path_or_paths = FileSystem.from_uri(
   1345             path_or_paths)
   1346     except ValueError:
   1347         filesystem = LocalFileSystem(use_mmap=memory_map)

File /opt/conda/lib/python3.9/site-packages/pyarrow/_fs.pyx:477, in pyarrow._fs.FileSystem.from_uri()

File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

OSError: When resolving region for bucket 'my-bucket': AWS Error NETWORK_CONNECTION during HeadBucket operation: curlCode: 28, Timeout was reached

If I also set AWS_ENDPOINT_URL as an environment variable, it does work:

os.environ["AWS_ENDPOINT_URL"] = "https://my-endpoint.com"
tbl = pq.read_table("s3://my-bucket/my-parquet") # no error
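A workaround that avoids environment variables entirely is to build the filesystem explicitly and pass the endpoint through pyarrow's own endpoint_override parameter (bucket, path, and endpoint below are placeholders; the region value is arbitrary since the provider ignores it, but setting it skips pyarrow's region resolution):

from pyarrow import fs
import pyarrow.parquet as pq

# endpoint_override points pyarrow at the S3-compatible provider directly,
# bypassing the HeadBucket region lookup that fails above
s3 = fs.S3FileSystem(endpoint_override="https://my-endpoint.com",
                     region="us-east-1")

# note: no "s3://" scheme when an explicit filesystem is passed
tbl = pq.read_table("my-bucket/my-parquet", filesystem=s3)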

I think pyarrow should read endpoint_url from ~/.aws/config when it exists and AWS_ENDPOINT_URL is not set, per https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-endpoints.html. That would avoid the confusing error about 'region' and the need to set an extra environment variable.
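As a sketch of the lookup pyarrow could perform (or that a caller can do by hand today), the following reads the profile-level endpoint_url key from the AWS config file; it deliberately ignores the per-service services sections the spec also allows, so treat it as a simplified illustration rather than a full implementation:

import configparser
import os

def endpoint_url_from_aws_config(profile=None):
    """Return the profile-level endpoint_url from the AWS config file, or None."""
    profile = profile or os.environ.get("AWS_PROFILE", "default")
    config_path = os.environ.get("AWS_CONFIG_FILE",
                                 os.path.expanduser("~/.aws/config"))
    parser = configparser.ConfigParser()
    parser.read(config_path)
    # the CLI config file prefixes non-default profiles with "profile "
    section = "default" if profile == "default" else f"profile {profile}"
    return parser.get(section, "endpoint_url", fallback=None)

The returned value could then be fed to S3FileSystem(endpoint_override=...) as in the snippet above, mirroring what AWS_ENDPOINT_URL does via the environment.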

Component(s)

Parquet, Python
