script to download URLs from S3
bhlieberman committed Jul 31, 2024
1 parent 84e6c90 commit bdbd00d
Showing 1 changed file with 16 additions and 0 deletions.
16 changes: 16 additions & 0 deletions stages/load_urls.sh
@@ -0,0 +1,16 @@
#! /usr/bin/bash
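# Extract the first location's pdf_url from each OpenAlex works snapshot file on S3,
# writing one CSV per snapshot object under raw/.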

rawpath="raw"

mkdir -p "$rawpath"

files=$(aws-cli.aws s3 ls --recursive --no-sign-request "s3://openalex/data/works/" | awk '{print $4}')

for file in $files
do
filename="s3://openalex/$file"
outfile="$(basename "$file" .gz).csv"
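# query the gzipped JSON directly from S3 with DuckDB, keeping only non-null URLs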
duckdb -c "copy (select locations->'\$[0].pdf_url' as url
from read_json('$filename', ignore_errors=true) where url is not null)
to '$rawpath/$outfile' (HEADER false)"
done
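
For reference, a minimal sketch of running the script: it assumes the snap-installed AWS CLI (invoked above as aws-cli.aws) is on the PATH, that DuckDB can load its httpfs extension for the s3:// reads, and that anonymous access to the public openalex bucket succeeds; the sample output file name below is hypothetical.

bash stages/load_urls.sh

# each CSV under raw/ holds one extracted pdf_url per line (hypothetical file name)
head -n 3 raw/part_000.csv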
