This repository has been archived by the owner on Mar 1, 2024. It is now read-only.

Added handling of filename_as_id and file_extractor to SharePointReader #934

Merged

11 changes: 10 additions & 1 deletion llama_hub/microsoft_sharepoint/base.py
@@ -28,6 +28,8 @@ def __init__(
client_id: str,
client_secret: str,
tenant_id: str,
+filename_as_id: bool = False,
+file_extractor: Optional[Dict[str, Union[str, BaseReader]]] = None
Collaborator

I think there's a small mistake here. The annotation should be:
file_extractor: Optional[Dict[str, BaseReader]] = None,

For ref: https://github.com/run-llama/llama_index/blob/0393b081f3aed854e0a628f49b8e51f8da7906ef/llama_index/readers/file/base.py#L118

Contributor Author

Hopefully my second commit fixes the issue that made the first run fail. I had simply forgotten to add the appropriate imports from typing (Optional and Union).

Collaborator

You don't need Union for BaseReader here. The value will always be a BaseReader.

Replace the file_extractor line with the one below:

file_extractor: Optional[Dict[str, BaseReader]] = None,

Contributor Author

I wrote it that way because I was shamelessly copying line 25 of llama_hub/minio/minio-client/base.py

file_extractor: Optional[Dict[str, Union[str, BaseReader]]] = None,

However, I'll rewrite it as you suggest.
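
For illustration, here is a minimal sketch of the signature the thread settles on. BaseReader and SharePointReaderSketch are stand-ins defined only so the snippet runs on its own; they are assumptions, not the llama_hub classes. With this annotation, only Optional and Dict are needed from typing.

```python
from typing import Dict, Optional


class BaseReader:
    """Stand-in for the real BaseReader (assumption, for illustration only)."""

    def load_data(self, *args, **kwargs):
        raise NotImplementedError


class SharePointReaderSketch:
    """Hypothetical, trimmed-down class showing only the constructor arguments."""

    def __init__(
        self,
        client_id: str,
        client_secret: str,
        tenant_id: str,
        filename_as_id: bool = False,
        # Maps a file extension such as ".pdf" to a reader instance.
        file_extractor: Optional[Dict[str, BaseReader]] = None,
    ) -> None:
        self.client_id = client_id
        self.client_secret = client_secret
        self.tenant_id = tenant_id
        self.filename_as_id = filename_as_id
        self.file_extractor = file_extractor


# Both new arguments are optional, so existing call sites keep working.
sketch = SharePointReaderSketch("id", "secret", "tenant")
```

Because both new arguments have defaults, callers that pass only the three credentials are unaffected.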

) -> None:
"""
Initializes an instance of SharePoint reader.
@@ -37,11 +39,16 @@ def __init__(
The application must also be configured with MS Graph permissions "Files.ReadAll", "Sites.ReadAll" and BrowserSiteLists.Read.All.
client_secret: The application secret for the app registered in Azure.
tenant_id: Unique identifier of the Azure Active Directory Instance.
+file_extractor (Optional[Dict[str, BaseReader]]): A mapping of file
+extension to a BaseReader class that specifies how to convert that file
+to text. See `SimpleDirectoryReader` for more details.
"""
self.client_id = (client_id,)
self.client_secret = (client_secret,)
self.tenant_id = tenant_id
self._authorization_headers = None
+self.file_extractor = file_extractor
+self.filename_as_id = filename_as_id

def _get_access_token(self) -> str:
"""
@@ -343,7 +350,9 @@ def get_metadata(filename: str) -> Any:
simple_directory_reader = download_loader("SimpleDirectoryReader")

simple_loader = simple_directory_reader(
-download_dir, file_metadata=get_metadata, recursive=recursive
+download_dir, file_metadata=get_metadata, recursive=recursive,
+filename_as_id=self.filename_as_id,
+file_extractor=self.file_extractor
)
documents = simple_loader.load_data()
return documents
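
The change above stores the two options on the instance and forwards them when the directory reader is built. The sketch below shows that pattern in isolation. Every class in it (BaseReader, UpperCaseReader, FakeDirectoryReader, SharePointReaderSketch) and the build_loader method are hypothetical stand-ins, not the llama_hub or llama_index implementations; the fake directory reader only records the keyword arguments it receives.

```python
from typing import Any, Dict, List, Optional


class BaseReader:
    """Stand-in for the real BaseReader (assumption, for illustration only)."""

    def load_data(self, path: str) -> List[str]:
        raise NotImplementedError


class UpperCaseReader(BaseReader):
    """Hypothetical extractor: returns the given path in upper case."""

    def load_data(self, path: str) -> List[str]:
        return [path.upper()]


class FakeDirectoryReader:
    """Stand-in for SimpleDirectoryReader that just records its keyword arguments."""

    def __init__(self, input_dir: str, **kwargs: Any) -> None:
        self.input_dir = input_dir
        self.kwargs = kwargs


class SharePointReaderSketch:
    """Hypothetical reader keeping only the two options added in this PR."""

    def __init__(
        self,
        filename_as_id: bool = False,
        file_extractor: Optional[Dict[str, BaseReader]] = None,
    ) -> None:
        self.filename_as_id = filename_as_id
        self.file_extractor = file_extractor

    def build_loader(self, download_dir: str, recursive: bool = False) -> FakeDirectoryReader:
        # Forward the stored options unchanged, mirroring the diff above.
        return FakeDirectoryReader(
            download_dir,
            recursive=recursive,
            filename_as_id=self.filename_as_id,
            file_extractor=self.file_extractor,
        )


reader = SharePointReaderSketch(
    filename_as_id=True,
    file_extractor={".txt": UpperCaseReader()},
)
loader = reader.build_loader("/tmp/sharepoint")
```

Passing the options straight through keeps the SharePoint loader's behaviour aligned with whatever the underlying directory reader does with them, so there is no second place to document or maintain them.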