### Describe the bug, including details regarding any error messages, version, and platform.
A while back I opened pandas-dev/pandas#57449, mistakenly thinking the pandas parquet reader wasn't working quite right due to the odd error thrown. Recent testing has shown it is probably an issue with pyarrow.
Testing was done on Ubuntu 20.04 with pyarrow 17.0.0. I am using an on-prem S3-compatible storage provider. This means AWS_REGION is irrelevant, but AWS_ENDPOINT_URL is important. I have defined endpoint_url in ~/.aws/config under the correct profile per the specification here: https://aws.amazon.com/blogs/developer/new-improved-flexibility-when-configuring-endpoint-urls-with-the-aws-sdks-and-tools/
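For reference, a sketch of what that config looks like, following the service-specific endpoint spec linked above (the profile, services-section name, and endpoint URL here are hypothetical):

```ini
# ~/.aws/config (hypothetical names)
[profile my-bucket-role]
region = us-east-1
services = my-storage

[services my-storage]
s3 =
  endpoint_url = https://my-endpoint.com
```

Despite this, reading fails: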
```python
import os
import pyarrow.parquet as pq

os.environ["AWS_PROFILE"] = "my-bucket-role"
tbl = pq.read_table("s3://my-bucket/my-parquet")
```
```
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[1], line 14
---> 14 tbl = pq.read_table("s3://my-bucket/my-parquet")

File /opt/conda/lib/python3.9/site-packages/pyarrow/parquet/core.py:1793, in read_table(source, columns, use_threads, schema, use_pandas_metadata, read_dictionary, memory_map, buffer_size, partitioning, filesystem, filters, use_legacy_dataset, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, page_checksum_verification)
   1787     warnings.warn(
   1788         "Passing 'use_legacy_dataset' is deprecated as of pyarrow 15.0.0 "
   1789         "and will be removed in a future version.",
   1790         FutureWarning, stacklevel=2)
   1792 try:
-> 1793     dataset = ParquetDataset(
   1794         source,
   1795         schema=schema,
   1796         filesystem=filesystem,
   1797         partitioning=partitioning,
   1798         memory_map=memory_map,
   1799         read_dictionary=read_dictionary,
   1800         buffer_size=buffer_size,
   1801         filters=filters,
   1802         ignore_prefixes=ignore_prefixes,
   1803         pre_buffer=pre_buffer,
   1804         coerce_int96_timestamp_unit=coerce_int96_timestamp_unit,
   1805         decryption_properties=decryption_properties,
   1806         thrift_string_size_limit=thrift_string_size_limit,
   1807         thrift_container_size_limit=thrift_container_size_limit,
   1808         page_checksum_verification=page_checksum_verification,
   1809     )
   1810 except ImportError:
   1811     # fall back on ParquetFile for simple cases when pyarrow.dataset
   1812     # module is not available
   1813     if filters is not None:

File /opt/conda/lib/python3.9/site-packages/pyarrow/parquet/core.py:1344, in ParquetDataset.__init__(self, path_or_paths, filesystem, schema, filters, read_dictionary, memory_map, buffer_size, partitioning, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, page_checksum_verification, use_legacy_dataset)
   1341 if filesystem is None:
   1342     # path might be a URI describing the FileSystem as well
   1343     try:
-> 1344         filesystem, path_or_paths = FileSystem.from_uri(
   1345             path_or_paths)
   1346     except ValueError:
   1347         filesystem = LocalFileSystem(use_mmap=memory_map)

File /opt/conda/lib/python3.9/site-packages/pyarrow/_fs.pyx:477, in pyarrow._fs.FileSystem.from_uri()

File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/lib/python3.9/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

OSError: When resolving region for bucket 'my-bucket': AWS Error NETWORK_CONNECTION during HeadBucket operation: curlCode: 28, Timeout was reached
```
If I also set AWS_ENDPOINT_URL as an environment variable, it does work:
os.environ["AWS_ENDPOINT_URL"] ="https://my-endpoint.com"tbl=pq.read_table("s3://my-bucket/my-parquet") # no error
I think that pyarrow should read endpoint_url from ~/.aws/config if it exists and AWS_ENDPOINT_URL is not set, per: https://docs.aws.amazon.com/cli/v1/userguide/cli-configure-endpoints.html. That would avoid the confusing error about 'region' and would also avoid having to specify an additional environment variable.
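Until something like that lands, here is a minimal user-side sketch of the proposed resolution order (env var first, then the profile's services section), assuming botocore is available to parse the config file and that its raw_config_parse handles the nested endpoint_url values; the helper name and profile are hypothetical:

```python
import os
from botocore import configloader  # ships with boto3 / the AWS CLI
from pyarrow import fs


def s3_from_profile(profile: str) -> fs.S3FileSystem:
    """Resolve the S3 endpoint the way the AWS CLI does:
    AWS_ENDPOINT_URL first, then ~/.aws/config for the profile."""
    endpoint = os.environ.get("AWS_ENDPOINT_URL")
    if endpoint is None:
        config = configloader.raw_config_parse(
            os.path.expanduser("~/.aws/config"))
        section = config.get(f"profile {profile}", {})
        services = section.get("services")
        if services:
            s3_cfg = config.get(f"services {services}", {}).get("s3", {})
            endpoint = s3_cfg.get("endpoint_url")
    # endpoint_override=None falls back to the default AWS endpoint
    return fs.S3FileSystem(endpoint_override=endpoint)


s3 = s3_from_profile("my-bucket-role")
```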
### Component(s)
Parquet, Python