add: structured json dataloader #151

Merged
merged 1 commit into from
Oct 22, 2024

iwilltry42
Collaborator

Example format:

{
  "metadata": {
    "source": "https://example.com",
    "filename": "foo.pdf"
  },
  "documents": [
    {
      "metadata": {
        "page": 1
      },
      "content": "This is the first page of the document."
    },
    {
      "metadata": {
        "page": 2
      },
      "content": "This is the second page of the document."
    }
  ]
}
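
For consumers of this format, a minimal Python sketch of reading it could look like the following. It is not part of this PR: the `Document` class and `parse_structured` function are hypothetical names, and merging the top-level metadata into each document (with document-level keys winning on conflict) is an assumption about how one might use the file, not a description of what the loader does internally.

```python
import json
from dataclasses import dataclass, field
from typing import Any

# The example file from the PR description, embedded so the sketch is self-contained.
EXAMPLE = """
{
  "metadata": {"source": "https://example.com", "filename": "foo.pdf"},
  "documents": [
    {"metadata": {"page": 1}, "content": "This is the first page of the document."},
    {"metadata": {"page": 2}, "content": "This is the second page of the document."}
  ]
}
"""


@dataclass
class Document:
    """Hypothetical in-memory form of one entry in "documents"."""
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def parse_structured(text: str) -> list[Document]:
    """Parse the structured format and attach file-level metadata to each document."""
    data = json.loads(text)
    shared = data.get("metadata", {})
    docs = []
    for entry in data.get("documents", []):
        # Assumption: document-level keys override file-level keys on conflict.
        merged = {**shared, **entry.get("metadata", {})}
        docs.append(Document(content=entry["content"], metadata=merged))
    return docs


if __name__ == "__main__":
    for doc in parse_structured(EXAMPLE):
        print(doc.metadata["page"], doc.content)
```

Keeping the shared metadata at the top level avoids repeating `source` and `filename` on every page; a reader like this one can flatten it again when per-document records are needed.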

knowledge load now supports this format by default: knowledge load mydoc.pdf - prints the loaded content of mydoc.pdf in the format above.
You can also verify your own structured output with diff <(cat examples/structured-ingestion/example.json) <(knowledge load examples/structured-ingestion/example.json --loader "structured" -). The two sides may differ in formatting even when the content matches.
Note: knowledge load sets the global source metadata to the source file that you give it.
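
Because a textual diff can flag whitespace or key-order differences, and because the top-level source value is overwritten on load, a semantic comparison may be easier to read. The following is a sketch only: `normalize` and `same_structure` are hypothetical helpers, and dropping the top-level `source` key before comparing is an assumption based on the note above.

```python
import json


def normalize(text: str, ignore_source: bool = True) -> dict:
    """Parse structured JSON, optionally dropping the top-level "source" key."""
    data = json.loads(text)
    if ignore_source:
        # Copy the metadata so the parsed input is not mutated in place.
        meta = dict(data.get("metadata", {}))
        meta.pop("source", None)
        data = {**data, "metadata": meta}
    return data


def same_structure(expected: str, actual: str) -> bool:
    """Compare two structured files by parsed content, not by formatting."""
    return normalize(expected) == normalize(actual)
```

To use it, redirect the output of the knowledge load command shown above into a file, read both files as text, and pass them to `same_structure`.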

@iwilltry42 iwilltry42 merged commit eff3400 into main Oct 22, 2024
1 check passed
@iwilltry42 iwilltry42 deleted the feat/structure-ingestion branch October 22, 2024 11:51
iwilltry42 added a commit that referenced this pull request Oct 22, 2024