Table Cells in Row: Tab Separator #242
Comments
This was first requested in #98. Am I right that what you are essentially looking for is HTML to CSV/TSV conversion? If so, the right approach would be a separate formatter rather than an option for the default one shipped with html-to-text. I'll see whether I can include it in version 9. |
Yes, this is correct: I am using html-to-text to convert HTML to a JSON array (row-column) for a JavaScript spreadsheet, which is essentially CSV/TSV. I don't know Node.js, so I cannot customize formatter.js to add a \t before each non-first th/td in a row. Even a \t before every th/td in a row would be fine for my needs. If you would include it, I would be pleased to sponsor a relatively small amount of USD 150 for this feature. Something like the following would be great:
const {convert} = require('html-to-text');
const vs1 = "<p>Heading</p><table><tr><th>Month</th><th>Savings</th></tr><tr><td>January</td><td>$100</td></tr><tr><td>February</td><td>$80</td></tr></table>";
console.log(convert(vs1, {
selectors: [ { selector: 'table', format: 'dataTableRowCellSeparator', rowCellSeparator: '\\t' } ]
}));
I have tried to look at custom formatters and may not have fully understood your comments above. As an aside, I note that
const {convert} = require('html-to-text');
const vs1 = "<p>Heading</p><table><tr><th>Month</th><th>Savings</th></tr><tr><td>January</td><td>$100</td></tr><tr><td>February</td><td>$80</td></tr></table>"
console.log(convert(vs1, {
selectors: [ { selector: 'table', format: 'table' } ]
}));
console.log(convert(vs1, {
}));
seems to output all the words run together: "Heading MonthSavingsJanuary$100February$80" |
This is a legacy format that comes together with
This is because {
wordwrap: false,
whitespaceCharacters: ' \r\n\f\u200b', // excluded tab character
formatters: {
'cellFormatter': function (elem, walk, builder, formatOptions) {
builder.addInline('\t');
walk(elem.children, builder);
}
},
selectors: [
{ selector: 'table', format: 'block', options: { leadingLineBreaks: 2, trailingLineBreaks: 2 } },
{ selector: 'tr', format: 'block', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
{ selector: 'th + th', format: 'cellFormatter' },
{ selector: 'th + td', format: 'cellFormatter' },
{ selector: 'td + td', format: 'cellFormatter' }
]
}
This should do the job if there is no complex content inside cells. When I get to a dedicated (and more robust) formatter implementation - I think I'll call it
With the workaround figured out, I think I won't try to make 8.2.0 for this. And version 9 is still a few months away; there are a couple of big issues to address. |
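To illustrate why the workaround above puts a tab only between cells, here is a minimal sketch that calls the same `cellFormatter` directly. The `fakeBuilder` and `fakeWalk` objects are hypothetical stand-ins written for this example, not html-to-text internals; the real builder also collapses whitespace, which is presumably why the options above remove the tab from `whitespaceCharacters`.

```javascript
// The formatter from the options above, pulled out so it can be called directly.
function cellFormatter(elem, walk, builder, formatOptions) {
  builder.addInline('\t');       // separator goes in front of the cell
  walk(elem.children, builder);  // then the cell's own content
}

// Hypothetical stand-ins for html-to-text internals, for illustration only.
// The real builder and walk function do much more (whitespace handling, wrapping).
const fakeBuilder = {
  out: '',
  addInline(str) { this.out += str; },
};
function fakeWalk(nodes, builder) {
  for (const node of nodes) {
    if (node.type === 'text') builder.addInline(node.data);
    else if (node.children) fakeWalk(node.children, builder);
  }
}

// The first cell in a row is not matched by 'th + th', 'th + td' or 'td + td'
// (adjacent-sibling selectors), so it is emitted without a leading tab.
// Every following cell matches one of them and gets the tab.
const row = [
  { type: 'tag', name: 'td', children: [{ type: 'text', data: 'January' }] },
  { type: 'tag', name: 'td', children: [{ type: 'text', data: '$100' }] },
];
fakeWalk([row[0]], fakeBuilder);                   // first cell: default handling
cellFormatter(row[1], fakeWalk, fakeBuilder, {});  // second cell: tab + content
console.log(JSON.stringify(fakeBuilder.out));      // "January\t$100"
```

Because the selectors only match a cell that directly follows another cell, no row starts with a stray tab.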
I haven't included {
wordwrap: false,
formatters: {
'cellFormatter': function (elem, walk, builder, formatOptions) {
builder.addLiteral('\t');
walk(elem.children, builder);
}
},
selectors: [
{ selector: 'table', format: 'block', options: { leadingLineBreaks: 2, trailingLineBreaks: 2 } },
{ selector: 'tr', format: 'block', options: { leadingLineBreaks: 1, trailingLineBreaks: 1 } },
{ selector: 'th + th', format: 'cellFormatter' },
{ selector: 'th + td', format: 'cellFormatter' },
{ selector: 'td + td', format: 'cellFormatter' }
]
} |
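Since the stated goal in this thread is a row-column JSON array for a spreadsheet, the tab-separated text still has to be split afterwards. A minimal sketch of that step follows; `tsvToRows` is a hypothetical helper (not part of html-to-text), and the `sample` string is only an assumed shape of the converted text, not verified library output.

```javascript
// Hypothetical helper (not part of html-to-text): turn tab-separated text
// into a row-column array suitable for a spreadsheet component.
function tsvToRows(text) {
  return text
    .split(/\r?\n/)                     // one row per line
    .filter((line) => line.length > 0)  // skip blank separator lines
    .map((line) => line.split('\t'));   // one cell per tab-separated field
}

// Assumed shape of the converted text for the sample table in this thread.
// In real use it would come from convert(html, options) with the options above.
const sample = 'Heading\n\nMonth\tSavings\nJanuary\t$100\nFebruary\t$80';
const rows = tsvToRows(sample);
console.log(JSON.stringify(rows));
```

Lines without a tab, such as the heading paragraph, become single-cell rows, so the caller may want to drop or handle them separately. As noted above, this only holds while cells contain no tabs or line breaks of their own.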
I am trying to use html-to-text as part of a spreadsheet IMPORTHTML function (webix sheets library).
It works really well using browserify.
With tables it would be wonderful if the cells could be separated by a tab character.
Possibly an option could be used such as selectors: [ { selector: 'table', rowCellSeparator: '\\t' } ]
Many thanks for this great project.