
FileSource not accepting S3 endpoints as path #4993

Open
ShaktidharK1997 opened this issue Jan 31, 2025 · 1 comment

@ShaktidharK1997
Contributor

Expected Behavior

A FileSource connected to the S3 endpoint should be returned.

Current Behavior

  File "C:\Users\shakt\anaconda3\envs\feast_test_env\Lib\site-packages\feast\inference.py", line 180, in update_feature_views_with_inferred_features_and_entities
    _infer_features_and_entities(
  File "C:\Users\shakt\anaconda3\envs\feast_test_env\Lib\site-packages\feast\inference.py", line 230, in _infer_features_and_entities
    provider.get_table_column_names_and_types_from_data_source(
  File "C:\Users\shakt\anaconda3\envs\feast_test_env\Lib\site-packages\feast\infra\passthrough_provider.py", line 526, in get_table_column_names_and_types_from_data_source
    return self.offline_store.get_table_column_names_and_types_from_data_source(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\shakt\anaconda3\envs\feast_test_env\Lib\site-packages\feast\infra\offline_stores\offline_store.py", line 390, in get_table_column_names_and_types_from_data_source
    return data_source.get_table_column_names_and_types(config=config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\shakt\anaconda3\envs\feast_test_env\Lib\site-packages\feast\infra\offline_stores\file_source.py", line 181, in get_table_column_names_and_types
    schema = ParquetDataset(path, **kwargs).schema
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\shakt\anaconda3\envs\feast_test_env\Lib\site-packages\pyarrow\parquet\core.py", line 1348, in __init__
    finfo = filesystem.get_file_info(path_or_paths)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow\\_fs.pyx", line 590, in pyarrow._fs.FileSystem.get_file_info
  File "pyarrow\\error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow\\error.pxi", line 92, in pyarrow.lib.check_status
OSError: [WinError 123] Failed querying information for path 'C:/Users/shakt/Documents/GIT/feast-artifact/feature_repo/s3:/bucket/flights.parquet'. 
Detail: [Windows error 123] The filename, directory name, or volume label syntax is incorrect.
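The mangled path in that error can be reproduced with the standard library alone: an s3:// URI is not an absolute path as far as pathlib is concerned, so the dask offline store joins it onto the feature repo directory, and path normalization then collapses the double slash. A minimal sketch (the repo directory is taken from the traceback above; posixpath is used for a platform-independent demo):

```python
import posixpath
from pathlib import PurePosixPath, PureWindowsPath

uri = "s3://bucket/flights.parquet"

# An S3 URI is not "absolute" to pathlib on either platform...
assert not PureWindowsPath(uri).is_absolute()
assert not PurePosixPath(uri).is_absolute()

# ...so the offline store treats it as relative to the repo directory,
# and normpath collapses "//" into "/", producing the broken path seen
# in the traceback.
repo_dir = "C:/Users/shakt/Documents/GIT/feast-artifact/feature_repo"
mangled = posixpath.normpath(posixpath.join(repo_dir, uri))
print(mangled)
# → C:/Users/shakt/Documents/GIT/feast-artifact/feature_repo/s3:/bucket/flights.parquet
```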

Steps to reproduce

  1. Create a MinIO object store as a Docker container exposing port 9000.
  2. Run the code below to connect to that MinIO container as a FileSource:
from feast import FileSource
from feast.data_format import ParquetFormat

bucket_name = "bucket"
file_name = "flights.parquet"
s3_endpoint = "http://localhost:9000"  # http since the MinIO container runs with use_ssl=False

# Define the data source for flight data
flight_stats_source = FileSource(
    path=f"s3://{bucket_name}/{file_name}",
    timestamp_field="FlightDate",
    file_format=ParquetFormat(),
    s3_endpoint_override=s3_endpoint,
)

Specifications

  • Version: 0.43.0
  • Platform: Windows
  • Subsystem:

Possible Solution

As mentioned in #4753, one possible fix is to revert to the previous code.

@Leumastai

@ShaktidharK1997 a temporary fix is to edit Feast's dask.py file directly, e.g. /home/alijoe/anaconda3/lib/python3.11/site-packages/feast/infra/offline_stores/dask.py

Look for the read_datasource function and change this line:

 if not Path(data_source.path).is_absolute():

to

if not data_source.path.startswith("s3://"):

This allows it to accept S3 (MinIO) data. The problem is that Path(data_source.path).is_absolute() expects a real file path, not an S3 URL. If you have any other solutions, let me know.
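A hard-coded startswith("s3://") check fixes this case but misses other remote schemes (gs://, abfs://, http://). A slightly more general sketch using the standard library; this is an illustration, not Feast's actual implementation, and the single-character-scheme rule is an assumption to keep Windows drive letters local:

```python
from urllib.parse import urlparse

def is_local_path(path: str) -> bool:
    """Heuristic: anything without a URI scheme is a local path.
    Single-character 'schemes' like C: are Windows drive letters."""
    scheme = urlparse(path).scheme
    return len(scheme) <= 1

assert not is_local_path("s3://bucket/flights.parquet")
assert not is_local_path("gs://bucket/flights.parquet")
assert is_local_path("C:/Users/shakt/data/flights.parquet")
assert is_local_path("/home/user/data/flights.parquet")
assert is_local_path("relative/flights.parquet")
```

In read_datasource, the repo-directory join would then run only when is_local_path(data_source.path) is true and the path is relative, leaving remote URIs untouched.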
