Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CRediT roles to JATS #10153

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open

Add CRediT roles to JATS #10153

wants to merge 11 commits into from

Conversation

cthoyt
Copy link

@cthoyt cthoyt commented Sep 4, 2024

Closes #10152

This PR adds support for annotating author roles using the Contribution Role Taxonomy (CRediT).

I'm motivated to add this to Pandoc since I want the Journal of Open Source Software (JOSS), which is built on top of Pandoc, to be able to create compliant JATS. We're already adding support for encoding this information in article metadata in parallel in openjournals/inara#75.

On my first attempts to implement this, I tried to inject some clever logic inside the Haskell code to automatically look up the primary labels, but this had the two issues of being programatically complicated, and too rigid towards internationalization. Now, you can specify the roles in the metadata like this:

title: CRediT Test
author:
  - name: Max Mustermann
    roles:
      - credit-id: software
        credit-name: Software
        degree: Lead
        name: Programas

The tests (with narrative documentation) can be found in https://github.com/cthoyt/pandoc/blob/patch-1/test/command/10152.md. Run these tests specifically with cabal test --test-options="-p 10152.md"

@jgm
Copy link
Owner

jgm commented Sep 5, 2024

You should add a "command test" to test the proposed feature, and to illustrate how it works.

See test/command/*.md for examples.

@cthoyt
Copy link
Author

cthoyt commented Sep 24, 2024

Hi @jgm , I think I figured out making a test file. I have a few questions I bet you could answer very quickly but would take me a long time to figure out:

  1. Is there a specific naming convention I should use for my test command file?
  2. I created a Credit.hs that contains a dictionary that I want to inject into the JATS output process. Where should I look to do that? you can see in the diff what I am currently trying in the JATS writer

@jgm
Copy link
Owner

jgm commented Sep 24, 2024

Is there a specific naming convention I should use for my test command file?

Usually we try to name them after the associated issue. So, if there's an open issue that this fixes you could use that number; otherwise you could use the PR's number, 10153.md. That makes it easy to figure out later why the test was added.

I created a Credit.hs that contains a dictionary that I want to inject into the JATS output process. Where should I look to do that? you can see in the diff what I am currently trying in the JATS writer

Instead of creating a separate module, I'd prefer if you just included this map in the JATS writer module itself.

@cthoyt
Copy link
Author

cthoyt commented Sep 24, 2024

@jgm thanks for that feedback, I addressed both of your points. Now, the only thing left is how to make this dictionary (renamed to creditNames) available inside the template article.jats_publishing.

My first try was $ defField "creditNames" creditNames because this group is also where some other things that appear in the template are referenced, but I am unfortunately not familiar with Haskell and this is a very complex codebase! Any help appreciated :)

https://github.com/cthoyt/pandoc/blob/71d53c6e2635ab23929287a94087bb50ee56b79b/src/Text/Pandoc/Writers/JATS.hs#L183

@jgm
Copy link
Owner

jgm commented Sep 24, 2024

My first try was $ defField "creditNames" creditNames

what happens when you try that?

By the way, our usual convention for variable names would favor credit-names.

@cthoyt
Copy link
Author

cthoyt commented Sep 24, 2024

It seems the issue was the definition of creditNames :: M.Map String String - it didn't seem to have a way to handle strings and gave this error:

src/Text/Pandoc/Writers/JATS.hs:182:17: error: [GHC-39999]
    • Could not deduce ‘Text.DocTemplates.Internal.ToContext
                          Text (M.Map String String)’
        arising from a use of ‘defField’
      from the context: PandocMonad m
        bound by the type signature for:
                   docToJATS :: forall (m :: * -> *).
                                PandocMonad m =>
                                WriterOptions -> Pandoc -> JATS m Text
        at src/Text/Pandoc/Writers/JATS.hs:124:1-68

Switching the signature to use Text instead of String in creditNames :: M.Map Text Text and finally got it to compile, but it seems like the Text values in the dictionary don't get displayed in the template and instead only evaluate to true. Is there either:

  1. A way to add handling for regular dictionaries
  2. Unpack the Pandoc Text to be a regular string inside the template?

@jgm
Copy link
Owner

jgm commented Sep 24, 2024

I don't understand why you'd be getting this behavior. Nested Maps seem to work fine in ToContext: here I explore using cabal repl in jgm/doctemplates:

ghci> toContext $ fromList [("a"::Text, fromList [("b"::Text, fromList [("c"::Text,"d"::Text)])])] :: Context Text
Context {unContext = fromList [("a",MapVal (Context {unContext = fromList [("b",MapVal (Context {unContext = fromList [("c",SimpleVal (Text 1 "d"))]}))]}))]}
ghci> Right (tpl :: Template Text) <- compileTemplate "." "$a.b.c$" 
ghci> renderTemplate tpl context
Text 1 "d"

@jgm
Copy link
Owner

jgm commented Sep 24, 2024

Btw, you can see what instances are defined for ToContext here:
https://hackage.haskell.org/package/doctemplates-0.11.0.1/docs/Text-DocTemplates.html

As you can see, it's only defined for Map with Text keys.

ToContext typeclass is relevant because of the type of defField:

defField   :: ToContext a b => Text -> b -> Context a -> Context a

@jgm
Copy link
Owner

jgm commented Sep 24, 2024

Are you saying that $credit-names.conceptualization$ is coming out as true?

@cthoyt
Copy link
Author

cthoyt commented Sep 24, 2024

Are you saying that $credit-names.conceptualization$ is coming out as true?

Yes! This is what I'm experiencing

@jgm
Copy link
Owner

jgm commented Sep 24, 2024

I'm a bit confused because originally you had a nested map and now it seems not to be nested. With a nested map, in which the value of $credit-names.conceptualization$ is itself a map, it would render as true in the template. But with the non-nested map it should render as a string value.

@cthoyt
Copy link
Author

cthoyt commented Sep 24, 2024

I thought that the nested map was a bit over-complicated and unnecessary so I simplified it to a regular map. This means that the $credit-names[it.type]$ becomes true and the testing suite gives the error:

--- test/command/10152.md
+++ pandoc -s -t jats
+  25       vocab-term="true"
-  25       vocab-term="Software"
+  28     true
-  28     Software
+  33       vocab-term="true"
-  33       vocab-term="Methodology"
+  36     true
-  35     Methodology
+  45       vocab-term="true"
-  44       vocab-term="Software"
+  48     true
-  47     Software

ref: https://github.com/jgm/pandoc/actions/runs/11021333152/job/30608170051?pr=10153#step:9:1926

@jgm
Copy link
Owner

jgm commented Sep 24, 2024

Ah, I see what you are trying to do. I don't think variable[field] is a valid syntax for doctemplates.

The way to do this is to resolve these names in the writer. Probably the simplest approach would be to use your Map to add a name field to the role in the Context, before you pass it to the template. (You could do this only if a name field isn't already present, which would give the user the ability to override the name, or just do it always.)

@estedeahora
Copy link

estedeahora commented Sep 24, 2024

Ah, I see what you are trying to do. I don't think variable[field] is a valid syntax for doctemplates.

The way to do this is to resolve these names in the writer. Probably the simplest approach would be to use your Map to add a name field to the role in the Context, before you pass it to the template. (You could do this only if a name field isn't already present, which would give the user the ability to override the name, or just do it always.)

I believe that it may be useful to re-iterate this comment regarding this issue.

@jgm
Copy link
Owner

jgm commented Sep 24, 2024

Yes, overriding is actually important for people who don't want it to be in English!

@cthoyt cthoyt marked this pull request as ready for review September 25, 2024 09:23
@cthoyt
Copy link
Author

cthoyt commented Sep 25, 2024

Thanks @jgm and @estedeahora for the feedback. I've opted to make the input YAML format a bit more explicit, so you have to say exactly what the CRediT name is. You also have the option to override, to support the multi-language things.

I also improved the docs to give all of the different possibilities and added to the tests one for each. The tests (with narrative documentation) can be found in https://github.com/cthoyt/pandoc/blob/patch-1/test/command/10152.md.

@cthoyt
Copy link
Author

cthoyt commented Sep 25, 2024

cc @sneakers-the-rat - your input is also greatly appreciated!

@estedeahora
Copy link

It seems to me that this project is a significant and good modification for Pandoc. However, I don't seem to modify Pandoc without modifying the code in JATS.hs to generate vocab-term-identifier attributes from vocab-term (or vice versa) automatically. This change should leave the possibility to generate content automatically or custom.

Personally I think that if this change is not possible, it is better not to generate the change. If someone needs to implement it, they can read these discussions with alternative solutions with Lua filters and template modifications.

@cthoyt
Copy link
Author

cthoyt commented Sep 25, 2024

It seems to me that this project is a significant and good modification for Pandoc. However, I don't seem to modify Pandoc without modifying the code in JATS.hs to generate vocab-term-identifier attributes from vocab-term (or vice versa) automatically. This change should leave the possibility to generate content automatically or custom.

Personally I think that if this change is not possible, it is better not to generate the change. If someone needs to implement it, they can read these discussions with alternative solutions with Lua filters and template modifications.

Apologies @estedeahora, but I'm not quite sure what you're asking for.

@estedeahora
Copy link

I see that in the last commits you opted to generate the changes by incorporating everything manually in the yaml header and then these in the template (leaving aside the modification of JATS.hs). This is a “correct” solution. However, as you pointed out in #10152 it is an inefficient solution because writing three times variations of the same information can generate errors. I think it would be a contribution to generate this in Haskell, but I don't know how to do it yet. On the other hand, I don't think it is advisable to add this functionality to Pandoc if this point is not solved, because it generates too many fields making it not very concise.

As an alternative and provisional solution I think that users can use a lua filter and “inject” these fields through a dictionary. Then this can be placed in a custom template. I think our discussion provides elements for others to do this if they need to.

@jgm
Copy link
Owner

jgm commented Sep 25, 2024

It should be completely feasible to add the credit-name field automatically, from a standard map, if it is missing. @cthoyt if you can't see how to do this I may be able to help.

@cthoyt
Copy link
Author

cthoyt commented Sep 25, 2024

I am feeling like figuring out haskell and the specifics of this codebase might be too much of a lift for me at the moment. @jgm if you are willing to give a hand, then I would be very grateful.

I had squashed the commits earlier, but just added back the name lookup dictionary in JATS.hs

@jgm
Copy link
Owner

jgm commented Sep 25, 2024

Here are some tips. In Text.Pandoc.Writers.Shared you'll find

-- | Retrieve a field value from a template context.
getField   :: FromContext a b => Text -> Context a -> Maybe b

and

-- | Reset a field of a template context.  If the field already has a
-- value, the new value replaces it.
-- This is a utility function to be used in preparing template contexts.
resetField :: ToContext a b => Text -> b -> Context a -> Context a

So the basic template for adding these fields to the roles variable would be something like:

addCreditNames :: Context a -> Context a
addCreditNames context =
  case getField "roles" context of
    Nothing -> context
    Just roles -> resetField "roles" (map addCreditName roles) context

addCreditName :: M.Map Text Text -> M.Map Text Text
addCreditName rolemap =
  case M.lookup "credit-name" rolemap of
    Just _ -> rolemap
    Nothing -> maybe id (M.insert "credit-name")
      (M.lookup "credit-id" rolemap >>= flip M.lookup creditNames)

Then you'd put addCreditNames in the pipeline that currently has stuff like defField "blah" blah.

@jezcope
Copy link

jezcope commented Sep 25, 2024

I'm happy to give this a look, since I'm 100% in favour of getting more useful metadata into the journal ecosystem and getting CRediT roles into more JATS seems like a good thing. Probably won't have time to get properly into it until the weekend though!

@cthoyt
Copy link
Author

cthoyt commented Sep 25, 2024

Then you'd put addCreditNames in the pipeline that currently has stuff like defField "blah" blah.

Do you mean this section?

let context = defField "body" main
$ defField "back" back
$ resetField "title" title'
$ resetField "date" date
$ defField "mathml" (case writerHTMLMathMethod opts of
MathML -> True
_ -> False) metadata

I tried sticking it in the middle like

let context = defField "body" main
              $ defField "back" back
              $ addCreditNames
              $ resetField "title" title'
              $ resetField "date" date
              $ defField "mathml" (case writerHTMLMathMethod opts of
                                        MathML -> True
                                        _      -> False) metadata

but this doesn't appear to work correctly - what's the right way to apply it?

@jgm
Copy link
Owner

jgm commented Sep 25, 2024

The thing you said doesn't appear to work should work.

Each line in this pipeline has type Context Text -> Context Text. What error did you get?

("writing-original-draft", "Writing – original draft"),
("writing-review-editing", "Writing – review & editing")]

addCreditNames :: Context a -> Context a
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type might need to be changed to Context Text -> Context Text.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I think this helped. Now the error is inside the other functions:

src/Text/Pandoc/Writers/JATS.hs:79:8: error: [GHC-39999]
     No instance for Text.DocTemplates.Internal.FromContext
                         Text (M.Map Text Text)
        arising from a use of getField
     In the expression: getField "roles" context
      In the expression:
        case getField "roles" context of
          Nothing -> context
          Just roles -> resetField "roles" (map addCreditName roles) context
      In an equation for addCreditNames’:
          addCreditNames context
            = case getField "roles" context of
                Nothing -> context
                Just roles -> resetField "roles" (map addCreditName roles) context
   |
79 |   case getField "roles" context of
   |        ^^^^^^^^

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look at https://hackage.haskell.org/package/doctemplates-0.11.0.1/docs/Text-DocTemplates.html

You can see that there is a ToContext instance for ToContext b => Map Text b,
but there isn't a similar FromContext instance! Argh.

Probably I should just add one. Stand by.

Copy link
Owner

@jgm jgm Sep 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well actually we may be able to work around this by operating directly with a Val type instead of a Map.

addCreditName :: Val Text -> Val Text
addCreditName val =
  case val of
    MapVal ctx ->
      case getField "credit-name" ctx of
        Just _ -> val
        Nothing -> maybe val (MapVal . resetField "credit-name")
          (getField "credit-id" ctx >>= flip M.lookup creditNames)
    _ -> val

(UNTESTED! But I think something along these lines will work, without the need for changes in doctemplates.)

Copy link
Author

@cthoyt cthoyt Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried a couple variations on this. The code in #10153 (comment) gave this error:

src/Text/Pandoc/Writers/JATS.hs:95:40: error: [GHC-83865]
     Couldn't match type: Context a0 -> Context a0
                     with: Context Text
      Expected: Text -> Context Text
        Actual: Text -> Context a0 -> Context a0
     Probable cause: resetField is applied to too few arguments
      In the second argument of (.), namely resetField "credit-name"
      In the second argument of maybe, namely
        (MapVal . resetField "credit-name")
      In the expression:
        maybe
          val (MapVal . resetField "credit-name")
          (getField "credit-id" ctx >>= flip M.lookup creditNames)
   |
95 |         Nothing -> maybe val (MapVal . resetField "credit-name")
   |     

Unfortunately, I don't think it's very effective working this way and I'm not sure how to push this forward. If we agree on the base ideas (extending the documented data model, how the templating for JATS changes, and how the tests look) then I would say we should split this PR into two parts - one that introduces those changes, then a second as a follow-up that uses some haskell magic to help when data is incomplete (i.e. the changes we're talking about in this thread)

@jgm do you think that would work? Otherwise I fear that I might not be able to get the job done and am sucking up too much of your open source time, and/or this thread becomes impossible to follow

another idea is we could schedule a pair programming session and try to work through this together, if you have some time in the upcoming weeks. I'm on German time but relatively flexible

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alright, I think we're making progress, but now it seems that the way getField is defined is causing the issue:

src/Text/Pandoc/Writers/JATS.hs:93:12: error: [GHC-39999]
     Ambiguous type variable b0 arising from a use of getField
      prevents the constraint (Text.DocTemplates.Internal.FromContext
                                  Text b0) from being solved.
      Probable fix: use a type annotation to specify what b0 should be.
      Potentially matching instances:
        instance Text.DocTemplates.Internal.TemplateTarget a =>
                 Text.DocTemplates.Internal.FromContext a a
          -- Defined in ‘Text.DocTemplates.Internal’
        instance Text.DocTemplates.Internal.TemplateTarget a =>
                 Text.DocTemplates.Internal.FromContext a (Doc a)
          -- Defined in ‘Text.DocTemplates.Internal’
        ...plus two others
        (use -fprint-potential-instances to see them all)
     In the expression: getField "credit-name" ctx
      In the expression:
        case getField "credit-name" ctx of
          Just _ -> val
          Nothing
            -> case getField "credit-id" ctx of
                 Nothing -> val
                 Just creditId
                   -> case M.lookup creditId creditNames of
                        Nothing -> ...
                        Just creditName -> ...
      In a case alternative:
          MapVal ctx
            -> case getField "credit-name" ctx of
                 Just _ -> val
                 Nothing
                   -> case getField "credit-id" ctx of
                        Nothing -> val
                        Just creditId -> ...
   |
93 |       case getField "credit-name" ctx of
   |            ^^

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  Probable fix: use a type annotation to specify what ‘b0’ should be.

This is your hint. getField is polymorphic, so we may need to explicitly specify the type of the intended result, if it's not pinned down by other uses.

in this case, getField "credit-name" ctx is ambiguous because we don't do anythnig with the Just _ return value ,so it could be any type. Thus, we need to say what the type is that we expect:

Just (_ :: Text) -> val

if we want to leave it alone if it's a text value, or

Just (_ :: Val Text) -> val

if we want to leave it alone if it's any kind of value.

Copy link
Author

@cthoyt cthoyt Sep 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woohoo! It has a few warnings but managed to compile! I added a corresponding test to make sure that it works (i.e., if there's a credit-id and no credit-name, add the right credit-name.) This isn't passing yet, so this needs some more invesigation

Warnings
src/Text/Pandoc/Writers/JATS.hs:91:3: warning: [GHC-62161] [-Wincomplete-patterns]
    Pattern match(es) are non-exhaustive
    In a case alternative:
        Patterns of type Val Text not matched:
            SimpleVal _
            ListVal _
            BoolVal _
            NullVal
   |
91 |   case val of
   |   ^^^^^^^^^^^...

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At line 92 you match Just (MapVal _) but you don't match anything else.
You need a case for the catch-all "other" clause: _ -> val.
See my original example at #10153 (comment)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @jgm sorry I had to step away from this because I felt like I had hit a wall, but I realized an issue with the way my filtering was working was it didn't match the data structure (I was looking for a roles list, but the only thing that was available was an authors list, and there was a roles list in each)

I have (I hope) a final issue with the compiler, again related to me not understanding the typing system enough to annotate the cases correctly. Could you please give me another hint?

src/Text/Pandoc/Writers/JATS.hs:79:8: error: [GHC-39999]
    • No instance for ‘Text.DocTemplates.Internal.FromContext
                         Text (Context Text)’
        arising from a use of ‘getField’
    • In the expression: getField "authors" context
      In the expression:
        case getField "authors" context of
          Nothing -> error "No Authors!"
          Just authors
            -> resetField
                 "authors" (map addCreditNamesToAuthor authors) context
      In an equation for ‘addCreditNames’:
          addCreditNames context
            = case getField "authors" context of
                Nothing -> error "No Authors!"
                Just authors
                  -> resetField
                       "authors" (map addCreditNamesToAuthor authors) context
   |
79 |   case getField "authors" context of
   |        ^^

@cthoyt cthoyt changed the title Add author roles to JATS Add CRediT roles to JATS Sep 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for CRediT roles in JATS export
4 participants