Best practice for an md -> docx -> md round trip that helps diff collaborators' revisions? #9780

jiucenglou · 2024-05-21T07:54:39Z

I write in Markdown and send pandoc-generated docx files to my collaborators. They revise the docx and send it back to me. Ideally, I would like to identify their revisions easily and precisely, merge them into the original Markdown, and commit the result to git.

My problem with this md -> docx -> md round trip is that the revisions are hard to pin down, for two reasons: (1) In the original Markdown I write one sentence per line, but the docx -> md conversion hard-wraps lines. (2) The docx -> md conversion hard-codes the numbers of figure references and citations.

Could you suggest a best practice for an md -> docx -> md round trip that makes it easy to diff collaborators' revisions? Many thanks!

PS: A similar Stack Overflow post mentions the --track-changes option.

alerque · 2024-05-21T08:25:17Z

Using Markdown as our canonical source format at a publishing house, I've run into this problem a lot. It is so much easier to work with the folks who are willing to use a Markdown editor ... any Markdown editor.

But for the cases when life just doesn't work out that way and can't be coerced, the only way I know of is to first round-trip the Markdown back into Markdown, and only then take it out to docx (or whatever). When the edited file comes back, you can at least compare its Markdown import with the already round-tripped Markdown. If you use the same formatting arguments, such as wrapping and header types, for the Markdown→Markdown round trip as you do for the Docx→Markdown import, you'll get something pretty similar.

That will at least get you something sensible to diff for changes.
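To illustrate why normalizing both sides the same way matters, here is a minimal sketch in plain Python. It does not call pandoc: the `normalize` helper is a hypothetical stand-in for running both files through pandoc with identical options, and it only unwraps each paragraph onto one line. The sample texts are made up for the example.

```python
import difflib
import re


def normalize(markdown: str) -> list[str]:
    """Put every paragraph on a single line so line wrapping cannot cause diffs."""
    paragraphs = re.split(r"\n\s*\n", markdown.strip())
    return [" ".join(p.split()) for p in paragraphs]


# The first paragraph is only re-wrapped; the second one is actually edited.
original = (
    "The first paragraph is written\none sentence per line.\n"
    "\n"
    "The second paragraph will be edited.\n"
)
returned = (
    "The first paragraph is written one sentence\nper line.\n"
    "\n"
    "The second paragraph has\nbeen edited.\n"
)

# Keep only added/removed lines, dropping the "---"/"+++" file headers.
changes = [
    line
    for line in difflib.unified_diff(
        normalize(original), normalize(returned),
        fromfile="original", tofile="returned", lineterm="",
    )
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print("\n".join(changes))
```

Only the second paragraph shows up in the diff; the re-wrapped first paragraph does not, because both sides went through the same normalization.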

Re-applying your preferred source formatting like using sentence-per-line is something I don't have a magic bullet for. I'm working in CaSILE to create diffing tools for prose that work across different source formatting, and also to re-apply formatting such as sentence-per-line, but while some parts of the system function quite well for production work, those two aspects are still pretty rough.

jiucenglou (Author) · 2024-05-21T12:53:37Z

Thanks to @alerque's helpful suggestion, the following steps convert the edited docx to sentence-per-line Markdown, which seems sensible enough to diff for changes, especially if the original Markdown or the not-yet-revised docx is normalized the same way for reference.

step 1:

# --wrap=none from https://stackoverflow.com/questions/62967265/word-to-markdown-via-pandoc-prevent-line-breaks-in-paragraphs
./pandoc --wrap=none --extract-media ./ draft.docx -s -o draft.md

step 2:

# --wrap=preserve from https://tarleb.com/posts/semantic-line-breaks/
./pandoc draft.md -L break_lines.lua --wrap=preserve -s -o draft_lines.md

break_lines.lua is:

-- Put each sentence on its own line by turning the space after a
-- sentence-ending period into a soft line break.
function Inlines (inlines)
  -- Go from end to start. The first and last elements are skipped,
  -- since a sentence break needs an element on either side of the space.
  for i = #inlines - 1, 2, -1 do
    local prev = inlines[i - 1]
    if inlines[i].t == 'Space'
        and prev.t == 'Str' and prev.text:match('%.$') then
      -- Replace the space instead of inserting before it, so the break
      -- is not followed by a leftover Space element.
      inlines[i] = pandoc.SoftBreak()
    end
  end
  return inlines
end

jgm (Maintainer) · 2024-05-21T15:53:15Z

I wonder if we should add an option --wrap=sentence?
The problem is that it is sometimes difficult to determine where a sentence ends. But we do try to do this for man output, where it matters.
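A small sketch of why sentence ends are hard to detect, written in plain Python rather than as a pandoc filter. The `break_after_periods` function is a hypothetical helper that applies the same kind of rule as the `%.$` test in break_lines.lua above: break after any token that ends in a period. The sample sentences are invented for illustration.

```python
def break_after_periods(text: str) -> list[str]:
    """Naive rule: a sentence ends after any whitespace-separated token ending in '.'."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith("."):
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without a final period
        sentences.append(" ".join(current))
    return sentences


# Works when every period really ends a sentence.
print(break_after_periods("We use pandoc. It converts docx to Markdown."))

# Abbreviations end in a period too, so the rule breaks too often.
print(break_after_periods("See e.g. the manual by Dr. Smith. It helps."))

# Other terminators are not periods, so the rule breaks too rarely.
print(break_after_periods("Is this a sentence? Yes!"))
```

The second call splits after "e.g." and "Dr.", and the third call returns the whole text as a single sentence, so a usable rule needs more than a check for a trailing period.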

5 replies

jiucenglou May 21, 2024
Author

That sounds awesome! Would it also work for breaking CJK sentences? :D
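For context on why CJK needs separate handling: CJK sentences usually end in a full-width mark such as "。" and are not separated by spaces, so a rule that looks for an ASCII period followed by a space never fires. The sketch below is plain Python, not pandoc, and says nothing about how a --wrap=sentence option would behave. The terminator set and the `split_cjk_sentences` helper are assumptions made for the example.

```python
import re

# Assumption: a CJK sentence ends at one of these full-width marks.
CJK_SENTENCE_END = "。！？"


def split_cjk_sentences(text: str) -> list[str]:
    """Split after each full-width terminator, keeping it with its sentence."""
    # The zero-width lookbehind splits right after a terminator (Python 3.7+).
    pieces = re.split(f"(?<=[{CJK_SENTENCE_END}])", text)
    return [piece for piece in pieces if piece]


for sentence in split_cjk_sentences("今天天气很好。我们去公园吧！你来吗？"):
    print(sentence)
```

Quotation marks that follow a terminator, and mixed CJK and Latin text, are not handled here.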

tarleb May 21, 2024
Collaborator

That'd be quite useful. For the time being: here is a Lua filter for that.

jiucenglou May 22, 2024
Author

@tarleb Could you explain the difference between breaking sentences in function Inlines (inlines) and in both Plain and Para, as you suggest? Many thanks!

jiucenglou May 22, 2024
Author

@jgm Would this --wrap=sentence option also help to figure out the revisions in an md -> latex -> md round trip? :D

tarleb May 22, 2024
Collaborator

@jiucenglou Sorry, I had missed your comment above, and now see that you had already referenced my post.

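Using function Inlines would also add line breaks inside emphasized text, e.g. in Here's _This. And **that**._ That is because an Inlines function is applied to every list of inlines, including those nested inside emphasis, whereas Para and Plain functions only see the top-level inlines of a block. That might or might not be what you want.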
