Preprocess text: first word in custom stopwords list is ignored #1028

wvdvegte · 2023-12-06T14:35:03Z

Describe the bug
In custom .txt (UTF-8) stopwords files, the first word is ignored as a stopword by Preprocess Text, i.e., it is not filtered out.

To Reproduce
Create a custom stopwords .txt file in UTF-8 encoding (I used MS Word) with one word per line, and load it in Preprocess Text. The first word is not filtered out, but the rest are. Leaving the first line empty works around the problem, but that is not an obvious thing to do.

Expected behavior
All custom stopwords should be filtered out.

Orange version:
3.36.2 (I don't know if it's the native Silicon version or the Intel version)

Text add-on version:
1.15.0

Operating system:
Mac OS 14.1.2 (23B92)

ajdapretnar · 2024-07-09T13:28:03Z

This is an editor issue. When I use Sublime text, the file contains word1\nword2. When I use TextEdit (OSX), the file contains '{\\rtf1\\ansi\\ansicpg1252\\cocoartf2636\n\\cocoatextscaling0\\cocoaplatform0{\\fonttbl\\f0\\fswiss\\fcharset0 Helvetica;}\n{\\colortbl;\\red255\\green255\\blue255;}\n{\\*\\expandedcolortbl;;}\n\\paperw11900\\paperh16840\\margl1440\\margr1440\\vieww11520\\viewh8400\\viewkind0\n\\pard\\tx566\\tx1133\\tx1700\\tx2267\\tx2834\\tx3401\\tx3968\\tx4535\\tx5102\\tx5669\\tx6236\\tx6803\\pardirnatural\\partightenfactor0\n\n\\f0\\fs24 \\cf0 of\\\nsystem}'.
I think MS Word does the same. You could test with:

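# Read the file and show its raw contents, including hidden characters.
with open("path/to/file.txt", encoding="utf-8") as f:
    text = f.read()
print(repr(text))  # repr() shows escapes such as \n or \ufeff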

See what you get.
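If decoding gets in the way, a variant of the same check is to look at the first raw bytes of the file. The following is only a sketch: `describe_file_start` is a hypothetical helper, not part of Orange or the Text add-on, and the demo file is made up for illustration.

```python
import codecs
import os
import tempfile


def describe_file_start(path):
    """Return a short label for what the first bytes of the file look like.

    Hypothetical helper for illustration only.
    """
    # Binary mode: no decoding is applied, so nothing is hidden or translated.
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(b"{\\rtf"):
        return "rtf"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8 with BOM"
    return "no known marker"


# Demo on a throwaway file that starts like an RTF document.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "stopwords.txt")
    with open(demo_path, "wb") as f:
        f.write(b"{\\rtf1\\ansi of\\\nsystem}")
    print(describe_file_start(demo_path))
```

A plain-text file saved without any marker should fall through to the last branch.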

ajdapretnar · 2024-07-09T13:28:21Z

@markotoplak Is there a way we could sanitize this internally?

janezd · 2024-07-09T14:04:27Z

@ajdapretnar, I guess you are saving text as rich text format (rtf), not plain text.

@wvdvegte probably has a different problem.

ajdapretnar · 2024-07-10T07:05:14Z

I thought the first row was not being filtered because, in RTF, the formatting parameters get treated as text. So instead of a plain "orange" one would get "{fancyparam:15}orange", and the word would not be filtered.

wvdvegte · 2024-07-10T07:39:56Z

I was indeed referring to the use of plain text (TXT), not RTF.

ajdapretnar · 2024-07-10T07:41:47Z

@wvdvegte Could you perhaps send the stopword list? I cannot replicate the issue, so perhaps there's something about the file that is the problem. Thanks!

wvdvegte · 2024-07-10T11:49:13Z

I didn't manage to dig up what I was working on when I reported this in December 2023, but when I try to reproduce the problem now, none of the custom stopwords are filtered out:
stopword filtering.zip

ajdapretnar · 2024-07-10T12:11:26Z

Thank you! Now I've finally managed to reproduce the issue.
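As I suspected, it's the editor. The string reads: '\ufeffpig\ncow\nchicken\nhorse\n'. The first character, \ufeff, is a byte order mark (BOM), which is apparently typical for Windows. Because of it, the first stopword is read as '\ufeffpig' rather than 'pig', so it never matches a token. We can solve this by reading the file with encoding='utf-8-sig', which drops a leading BOM.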
Will prepare and test the fix.
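
To illustrate the difference, here is a minimal sketch that writes the string above to a temporary file and reads it back with both encodings. `load_stopwords` is a hypothetical helper written for this example. It is not the add-on's actual loader, and the fix merged for this issue may look different.

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stopword per line (hypothetical helper, for illustration).

    The 'utf-8-sig' codec decodes UTF-8 and drops a leading BOM if present.
    """
    with open(path, encoding="utf-8-sig") as f:
        return {line.strip() for line in f if line.strip()}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stopwords.txt")
    # Same content as the string reported above, BOM included.
    with open(path, "wb") as f:
        f.write("\ufeffpig\ncow\nchicken\nhorse\n".encode("utf-8"))

    # Plain 'utf-8' keeps the BOM glued to the first word.
    with open(path, encoding="utf-8") as f:
        naive = set(f.read().split())

    fixed = load_stopwords(path)

print("pig" in naive)  # False: the entry is '\ufeffpig'
print("pig" in fixed)  # True
```

Files without a BOM are read the same way by both codecs, so switching to 'utf-8-sig' should not affect stopword lists that already work.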

wvdvegte · 2024-07-10T12:20:55Z

Typical for Microsoft, perhaps? I created the text file using Word for Mac ...

ajdapretnar mentioned this issue Jul 10, 2024

Sanitize stopwords with BOM #1072

Merged

3 tasks

markotoplak closed this as completed in #1072 Aug 29, 2024
