CSV for Seattle library checkouts - #103 #105

Draft
wants to merge 3 commits into
base: master

beatrizmilz (Collaborator) commented Nov 28, 2023

@scopinho
I started translating this dataset. Since it will not be stored in the package (we will share it in an S3 bucket), I added the code to data-raw/.
So far I have worked only with the head of the data (10k rows).

If you want to start reviewing:

  1. What do you think of the column names?
  2. Categories in MaterialType: there are some categories I still need to research before translating. This list is not final.
  3. Categories in CheckoutType: I have no idea how to translate these. They are names of services, so I guess it would be better to keep them in English.
  4. Should we translate the values in the Subjects column? There are SO MANY of them. I can imagine a few scenarios: 1) leave them in English; 2) translate the most frequent subjects; 3) translate them all (👀)

beatrizmilz (Collaborator, Author) commented

4- I think it's best not to translate the content of this column. It would take a long time that we could spend on other translation tasks.

scopinho (Contributor) commented Dec 4, 2023

Hi @beatrizmilz,

1-) I looked at the dataset description and came up with the names below. Please take a look and let me know your thoughts.
classe_uso
sistema_retirada
tipo_retirada
retirada_ano
retirada_mes
num_retiradas
titulo
isbn
autoria
assunto
editora
publicacao_ano

2-) I'll download the file and try to improve the list. Instead of using vroom on the first 10k rows, perhaps we can change the script to use arrow, so that we can look at the entire dataset. I'll try that in the next few days and keep you posted.

3-) Based on the content and the column description, the best I could come up with was "sistema_retirada".

4-) I agree. For now we could leave the content in English. If a "good soul" gives us credits for the OpenAI API or similar, we could use AI to translate. I ran a proof of concept and it worked very well, but my API credits are gone now and the number of tokens we need is not small. :-(
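
To illustrate item 2 (looking at every category instead of only the first 10k rows), here is a minimal sketch in Python. It is not the arrow approach mentioned above: as a stand-in, it streams the CSV one row at a time with the standard library, so the full file never has to fit in memory. The function name `count_categories` and the sample rows are hypothetical; only the `MaterialType` column name comes from this thread.

```python
import csv
import io
from collections import Counter


def count_categories(lines, column="MaterialType"):
    """Count how often each value of `column` occurs, reading one row at a time."""
    reader = csv.DictReader(lines)
    counts = Counter()
    for row in reader:
        counts[row[column]] += 1
    return counts


# Tiny in-memory sample standing in for the real CSV (made-up rows).
sample = io.StringIO(
    "MaterialType,Checkouts\n"
    "BOOK,3\n"
    "EBOOK,1\n"
    '"ER, SOUNDDISC",2\n'
    "BOOK,5\n"
)
counts = count_categories(sample)
for value, n in counts.most_common():
    print(value, n)
```

For the real file, the same function could be called on an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the downloaded CSV lives.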

scopinho (Contributor) commented Dec 4, 2023

For the 71 descriptions in MaterialType, I also put together this list to help, but I did not add it to the code:

| # | English | Portuguese |
|---|---------|------------|
| 1 | BOOK | LIVRO |
| 2 | EBOOK | EBOOK |
| 3 | SOUNDDISC | DISCO DE ÁUDIO |
| 4 | AUDIOBOOK | AUDIOLIVRO |
| 5 | VIDEODISC | DISCO DE VÍDEO |
| 6 | SONG | MÚSICA |
| 7 | MUSIC | MÚSICA |
| 8 | SOUNDREC | GRAVAÇÃO DE SOM |
| 9 | MOVIE | FILME |
| 10 | TELEVISION | TELEVISÃO |
| 11 | MAP | MAPA |
| 12 | REGPRINT | IMPRESSO REGULAR |
| 13 | MIXED | MISTO |
| 14 | MAGAZINE | REVISTA |
| 15 | VISUAL | VISUAL |
| 16 | SOUNDDISC, VIDEODISC | DISCO DE ÁUDIO, DISCO DE VÍDEO |
| 17 | CR | CD-ROM |
| 18 | VIDEO | VÍDEO |
| 19 | ER, VIDEODISC | REGISTRO ELETRÔNICO, DISCO DE VÍDEO |
| 20 | VIDEOCART | CARTÃO DE VÍDEO |
| 21 | ER, SOUNDDISC | REGISTRO ELETRÔNICO, DISCO DE ÁUDIO |
| 22 | ER | REGISTRO ELETRÔNICO |
| 23 | ATLAS | ATLAS |
| 24 | SOUNDCASS | FITA DE ÁUDIO |
| 25 | VIDEOCASS | FITA DE VÍDEO |
| 26 | LARGEPRINT | LIVRO EM LETRA GRANDE |
| 27 | MUSICSNDREC | GRAVAÇÃO DE SOM MUSICAL |
| 28 | VIDEOREC | GRAVAÇÃO DE VÍDEO |
| 29 | REGPRINT, SOUNDDISC | IMPRESSO REGULAR, DISCO DE ÁUDIO |
| 30 | SOUNDDISC, SOUNDREC | DISCO DE ÁUDIO, GRAVAÇÃO DE SOM |
| 31 | GLOBE | GLOBO |
| 32 | SOUNDCASS, SOUNDDISC, VIDEOCASS, VIDEODISC | FITA DE ÁUDIO, DISCO DE ÁUDIO, FITA DE VÍDEO, DISCO DE VÍDEO |
| 33 | ER, VIDEOREC | REGISTRO ELETRÔNICO, GRAVAÇÃO DE VÍDEO |
| 34 | COMIC | QUADRINHO |
| 35 | FLASHCARD, SOUNDDISC | CARTÃO DIDÁTICO, DISCO DE ÁUDIO |
| 36 | VIDEOCASS, VIDEODISC | FITA DE VÍDEO, DISCO DE VÍDEO |
| 37 | KIT | KIT |
| 38 | NOTATEDMUSIC | PARTITURA |
| 39 | MICROFORM | MICROFORMA |
| 40 | ER, PRINT | REGISTRO ELETRÔNICO, IMPRESSO |
| 41 | SLIDE, SOUNDCASS, VIDEOCASS | SLIDE, FITA DE ÁUDIO, FITA DE VÍDEO |
| 42 | ER, NONPROJGRAPH | REGISTRO ELETRÔNICO, GRÁFICO NÃO PROJETADO |
| 43 | SOUNDDISC, VIDEOCASS | DISCO DE ÁUDIO, FITA DE VÍDEO |
| 44 | REGPRINT, VIDEOREC | IMPRESSO REGULAR, GRAVAÇÃO DE VÍDEO |
| 45 | ER, REGPRINT | REGISTRO ELETRÔNICO, IMPRESSO REGULAR |
| 46 | UNSPECIFIED | NÃO ESPECIFICADO |
| 47 | REMOTESEN | SISTEMA REMOTO |
| 48 | PICTURE | FIGURA |
| 49 | PRINT | IMPRESSO |
| 50 | FLASHCARD | CARTÃO DIDÁTICO |
| 51 | SOUNDCASS, SOUNDDISC | FITA DE ÁUDIO, DISCO DE ÁUDIO |
| 52 | ER, MAP | REGISTRO ELETRÔNICO, MAPA |
| 53 | ER, SOUNDREC | REGISTRO ELETRÔNICO, GRAVAÇÃO DE SOM |
| 54 | MAP, VIEW | MAPA, VISUALIZAÇÃO |
| 55 | SLIDE | SLIDE |
| 56 | SLIDE, VIDEOCASS | SLIDE, FITA DE VÍDEO |
| 57 | SLIDE, SOUNDCASS | SLIDE, FITA DE ÁUDIO |
| 58 | SOUNDCASS, VIDEOCASS | FITA DE ÁUDIO, FITA DE VÍDEO |
| 59 | COMPFILE | ARQUIVO DE COMPUTADOR |
| 60 | ER, SOUNDDISC, VIDEODISC | REGISTRO ELETRÔNICO, DISCO DE ÁUDIO, DISCO DE VÍDEO |
| 61 | PICTURE, VIDEODISC | FIGURA, DISCO DE VÍDEO |
| 62 | ER, PICTURE | REGISTRO ELETRÔNICO, FIGURA |
| 63 | SECTION | SEÇÃO |
| 64 | NONPROJGRAPH | GRÁFICO NÃO PROJETADO |
| 65 | BOOK, ER | LIVRO, REGISTRO ELETRÔNICO |
| 66 | ER, SOUNDDISC, SOUNDREC | REGISTRO ELETRÔNICO, DISCO DE ÁUDIO, GRAVAÇÃO DE SOM |
| 67 | CHART | GRÁFICO |
| 68 | ER, VIDEOCASS | REGISTRO ELETRÔNICO, FITA DE VÍDEO |
| 69 | ATLAS, ER | ATLAS, REGISTRO ELETRÔNICO |
| 70 | SOUNDCASS, SOUNDDISC, SOUNDREC | FITA DE ÁUDIO, DISCO DE ÁUDIO, GRAVAÇÃO DE SOM |
| 71 | PHOTO | FOTO |
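
Many entries in the list above are comma-separated combinations of single terms (for example "ER, SOUNDDISC"). One possible way to keep them consistent is to store only the single-term translations and build the combined ones from them. The sketch below is a rough Python illustration of that idea, not the code in this PR. The names `BASE_TERMS` and `translate_material_type` are hypothetical, and the dictionary holds only a subset of the single terms from the list.

```python
# Single-term translations copied from the list above (subset, for illustration).
BASE_TERMS = {
    "BOOK": "LIVRO",
    "ER": "REGISTRO ELETRÔNICO",
    "SOUNDDISC": "DISCO DE ÁUDIO",
    "VIDEODISC": "DISCO DE VÍDEO",
    "SOUNDCASS": "FITA DE ÁUDIO",
    "VIDEOCASS": "FITA DE VÍDEO",
    "SOUNDREC": "GRAVAÇÃO DE SOM",
    "REGPRINT": "IMPRESSO REGULAR",
    "PRINT": "IMPRESSO",
    "MAP": "MAPA",
}


def translate_material_type(value, base_terms=BASE_TERMS):
    """Translate each comma-separated term; unknown terms stay in English."""
    tokens = [token.strip() for token in value.split(",")]
    return ", ".join(base_terms.get(token, token) for token in tokens)


print(translate_material_type("ER, SOUNDDISC"))
print(translate_material_type("SOUNDCASS, SOUNDDISC, VIDEOCASS, VIDEODISC"))
```

With this design each single term is translated in exactly one place, so a combined category cannot drift from the translation of its parts. A term missing from the dictionary is passed through unchanged, which makes untranslated categories easy to spot.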
