Skip to content

Latest commit

 

History

History
441 lines (373 loc) · 16.6 KB

data_projects.adoc

File metadata and controls

441 lines (373 loc) · 16.6 KB

Data projects

For a given set of objects, change object hierarchy relationships to nonhierarchical relationships

Note
This client’s profile has a descriptionlevel field in the object record.

Object selection criteria:

  • descriptionlevel = item or group of items

  • responsibledepartment = archaeology, archaeology-nagpra, or ethnographic

  • objectnumber begins with A

First, we find all existing hierarchical relationships where the selected objects are children of other objects.

WITH depts AS (
	SELECT id, STRING_AGG(item, '|') AS departments
	FROM collectionobjects_common_responsibledepartments
	GROUP BY id
  ),
children AS (
	SELECT DISTINCT cc.id, cc.objectnumber, hier.name AS csid, cohc.descriptionlevel, depts.departments AS dept
  FROM collectionobjects_common AS cc
  INNER JOIN collectionobjects_ohc AS cohc ON
  cc.id = cohc.id AND
  (cohc.descriptionlevel = 'urn:cspace:cspace.client.org:vocabularies:name(descriptionlevel):item:name(item)''item'''
   OR
   cohc.descriptionlevel = 'urn:cspace:cspace.client.org:vocabularies:name(descriptionlevel):item:name(groupofitems1553529920283)''group of items'''
  )
  INNER JOIN collectionobjects_common_responsibledepartments AS rd ON
  cc.id = rd.id AND
  ( rd.item = 'archaeology'
      OR
    rd.item = 'archaeology-nagpra'
      OR
    rd.item = 'ethnographic'
  )
  INNER JOIN misc ON misc.id = cc.id AND misc.lifecyclestate != 'deleted'
  INNER JOIN hierarchy AS hier ON cc.id = hier.id
  LEFT JOIN depts ON cc.id = depts.id
  WHERE
  cc.objectnumber LIKE 'A %'
	)
SELECT children.objectnumber AS childobjnum,
REGEXP_REPLACE(children.descriptionlevel, $$.*'(.*)'$$, '\1', 'g') AS childdesclevel,
children.dept AS childdept,
rel.subjectcsid AS childcsid,
rel.relationshiptype,
cc.objectnumber AS parentobjnum,
REGEXP_REPLACE(cohc.descriptionlevel, $$.*'(.*)'$$, '\1', 'g') AS parentdesclevel,
depts.departments AS parentdept,
rel.objectcsid AS parentcsid
FROM relations_common rel
INNER JOIN children ON (
	children.csid = rel.subjectcsid
	AND
	rel.relationshiptype = 'hasBroader'
	)
INNER JOIN misc ON misc.id = rel.id AND misc.lifecyclestate != 'deleted'
INNER JOIN hierarchy hier ON rel.objectcsid = hier.name
INNER JOIN collectionobjects_common cc ON hier.id = cc.id
LEFT OUTER JOIN collectionobjects_ohc cohc ON cc.id = cohc.id
LEFT JOIN depts ON cc.id = depts.id

Then we find all existing non-hierarchical relationships where the selected objects are in a relationship with another object:

WITH depts AS (
	SELECT id, STRING_AGG(item, '|') AS departments
	FROM collectionobjects_common_responsibledepartments
	GROUP BY id
  ),
children AS (
	SELECT DISTINCT cc.id, cc.objectnumber, hier.name AS csid, cohc.descriptionlevel, depts.departments AS dept
  FROM collectionobjects_common AS cc
  INNER JOIN collectionobjects_ohc AS cohc ON
  cc.id = cohc.id AND
  (cohc.descriptionlevel = 'urn:cspace:cspace.client.org:vocabularies:name(descriptionlevel):item:name(item)''item'''
   OR
   cohc.descriptionlevel = 'urn:cspace:cspace.client.org:vocabularies:name(descriptionlevel):item:name(groupofitems1553529920283)''group of items'''
  )
  INNER JOIN collectionobjects_common_responsibledepartments AS rd ON
  cc.id = rd.id AND
  ( rd.item = 'archaeology'
      OR
    rd.item = 'archaeology-nagpra'
      OR
    rd.item = 'ethnographic'
  )
  INNER JOIN misc ON misc.id = cc.id AND misc.lifecyclestate != 'deleted'
  INNER JOIN hierarchy AS hier ON cc.id = hier.id
  LEFT JOIN depts ON cc.id = depts.id
  WHERE
  cc.objectnumber LIKE 'A %'
	)
SELECT children.objectnumber AS childobjnum,
REGEXP_REPLACE(children.descriptionlevel, $$.*'(.*)'$$, '\1', 'g') AS childdesclevel,
children.dept AS childdept,
rel.subjectcsid AS childcsid,
rel.relationshiptype,
cc.objectnumber AS parentobjnum,
REGEXP_REPLACE(cohc.descriptionlevel, $$.*'(.*)'$$, '\1', 'g') AS parentdesclevel,
depts.departments AS parentdept,
rel.objectcsid AS parentcsid
FROM relations_common rel
INNER JOIN children ON (
	children.csid = rel.subjectcsid
	AND
	rel.relationshiptype = 'affects'
	)
INNER JOIN misc ON misc.id = rel.id AND misc.lifecyclestate != 'deleted'
INNER JOIN hierarchy hier ON rel.objectcsid = hier.name
INNER JOIN collectionobjects_common cc ON hier.id = cc.id
LEFT OUTER JOIN collectionobjects_ohc cohc ON cc.id = cohc.id
LEFT JOIN depts ON cc.id = depts.id

We don’t want to needlessly recreate nonhierarchical relations that already exist, so we throw the results of our queries in separate sheets of an Excel workbook for some quick and dirty formula fun.

  • create a relid for each row by concatenating parentobjnum and childobjnum

  • use XMATCH to mark rows in existinghier that are not in existingnonhier, comparing relid

  • remove the marked rows

We use the remaining rows to populate the nonhierarchicalrelationship CSV template, to import our new nonhierarchical relationships into CS.

We use the final report from the ingest process to populate the objecthierarchy CSV template with data only from successfully created non-hierarchical relationships, so we can ingest the hierarchical relationships for delete.

We can use a similar process (via wrangling in Excel) to identify pairs of objects for which there are both hierarchical and non-hierarchical relationships, and use that to populate an objecthierarchy CSV template for deletes.

The queries can be run again after the work has been done to identify stragglers and issues.

Export objectNumber, objectProductionDate, objectName, and namedCollection for all collectionobjects

  • if there are multiple values in the objectProductionDate, objectName, or namedCollection fields, take only the value that appears first in the CS record

  • return the dateDisplayDate element for the date field

  • return the first termDisplayName element from the Work authority linked to in namedCollection

Warning

Somewhat dangerous because it assumes unique objectNumber values across objects. This is not enforced by CS, though I knew it was the case in my data set.

It also does not remove any items marked as deleted.

WITH objects AS (
  SELECT
  hier.id AS objid,
  common.objectnumber AS objnum
from collectionobjects_common common
inner join hierarchy hier on hier.id = common.id
inner join misc on misc.id = hier.id AND misc.lifecyclestate != 'deleted'
inner join collectionspace_core core on core.id = misc.id
), prod_date_objs AS (
	SELECT DISTINCT ON (objects.objnum) objects.objnum, sdg.datedisplaydate AS date
	FROM hierarchy hier
	INNER JOIN structureddategroup sdg ON hier.id = sdg.id
	INNER JOIN objects ON hier.parentid = objects.objid
	WHERE hier.name = 'collectionobjects_common:objectProductionDateGroupList'
	ORDER BY objects.objnum ASC, hier.pos ASC
), named_coll AS (
	SELECT objects.objnum, wt.termdisplayname AS coll
	FROM objects
	INNER JOIN collectionobjects_common_namedcollections nc ON nc.id = objects.objid
	INNER JOIN works_common wc ON nc.item = wc.refname AND nc.pos = 0
	INNER JOIN hierarchy hier ON hier.parentid = wc.id AND hier.pos = 0
	INNER JOIN worktermgroup wt ON hier.id = wt.id
), obj_names AS (
	SELECT DISTINCT ON (objects.objnum) objects.objnum, ong.objectname AS objname
	FROM hierarchy hier
	INNER JOIN objectnamegroup ong ON hier.id = ong.id
	INNER JOIN objects ON hier.parentid = objects.objid
	WHERE hier.name = 'collectionobjects_common:objectNameList'
	ORDER BY objects.objnum ASC, hier.pos ASC

)

SELECT objects.objnum, pdo.date, nc.coll, ong.objname
FROM objects
LEFT OUTER JOIN prod_date_objs pdo ON pdo.objnum = objects.objnum
LEFT OUTER JOIN named_coll nc ON nc.objnum = objects.objnum
LEFT OUTER JOIN obj_names ong ON ong.objnum = objects.objnum

Publish all media handling procedure records to "All"

  • This only handles adding a publishTo value to fields that have nothing in that field.

Count of non-deleted MH records published to All

SELECT COUNT(mh.id)
FROM media_common mh
LEFT JOIN media_common_publishtolist pt ON mh.id = pt.id
INNER JOIN misc ON misc.id = mh.id AND misc.lifecyclestate != 'deleted'
WHERE pt.item = E'urn:cspace:{INSTANCEDOMAIN}:vocabularies:name(publishto):item:name(all)\'All\''
AND mh.identificationnumber IS NOT NULL

Count of non-deleted MH records with no publishTo value

SELECT COUNT(mh.id)
FROM media_common mh
LEFT JOIN media_common_publishtolist pt ON mh.id = pt.id
INNER JOIN misc ON misc.id = mh.id AND misc.lifecyclestate != 'deleted'
WHERE pt.id IS NULL
AND mh.identificationnumber IS NOT NULL

Update the publishTo field of unpublished records, setting them to "All"

WITH not_published AS (
SELECT mh.id
FROM media_common mh
LEFT JOIN media_common_publishtolist pt ON mh.id = pt.id
INNER JOIN misc ON misc.id = mh.id AND misc.lifecyclestate != 'deleted'
WHERE pt.id IS NULL
AND mh.identificationnumber IS NOT NULL
), to_insert AS (
	SELECT not_published.id,
	CAST(0 AS integer) AS pos,
	CAST(E'urn:cspace:{INSTANCEDOMAIN}:vocabularies:name(publishto):item:name(all)\'All\'' AS text) AS item
FROM not_published
)
INSERT INTO media_common_publishtolist
select * from to_insert

Is an authority term used in a field in deleted object records?

Client could not delete an Archeological Culture Concept authority record. They were getting an error about the term being used in records, but the "Used In" listing in the term record and searching the CS application showed that no records were using the term.

We wanted to verify that there were hidden "deleted" object records lurking in the system, preventing term deletion.

This was made easier by the fact that only one field in the object record would have used terms from this particular authority.

select oppg.id, h.parentid, m.lifecyclepolicy, m.lifecyclestate from objectproductionpeoplegroup oppg
inner join hierarchy h
on h.id = oppg.id
inner join misc m
on h.parentid = m.id
where oppg.objectproductionpeople = 'urn:cspace:{INSTANCEDOMAIN}:conceptauthorities:name(archculture):item:name(Prehistoric1558556422893)''Pre-Contact **DELETE - DO NOT USE **'''

Copy value from old, single-valued client-custom field to new, multi-valued community profile field

  • implemented a single value namedCollection field, populated by Work authority, for a client

  • later added a multi-value namedCollections field, also populated by Work authority, to community profiles

  • to get the client closer to community profile and fix confusion about apparent duplicate fields in search/import/export forms, we needed to migrate their data from the old field to the new field

count instances of client-specific field

SELECT COUNT(coo.id) FROM public.collectionobjects_ohc coo
WHERE coo.namedcollection IS NOT NULL

count instances of community profile field

SELECT COUNT(coc.id) FROM public.collectionobjects_common_namedcollections coc
WHERE coc.item IS NOT NULL

count objects where client-specific field is populated, but community profile field is not

SELECT COUNT(coo.id) FROM public.collectionobjects_ohc coo
LEFT JOIN
public.collectionobjects_common_namedcollections con
ON con.id = coo.id
WHERE
con.id IS NULL
AND coo.namedcollection IS NOT NULL

copy values in client-specific field to community profile field

INSERT INTO public.collectionobjects_common_namedcollections (id, pos, item)
SELECT coo.id, 0, coo.namedcollection FROM public.collectionobjects_ohc coo
LEFT JOIN
public.collectionobjects_common_namedcollections con
ON con.id = coo.id
WHERE
con.id IS NULL
AND coo.namedcollection IS NOT NULL

delete values in client-specific field

UPDATE public.collectionobjects_ohc
SET namedcollection = null
WHERE namedcollection IS NOT NULL

Set object publishTo value to public browser for conditionally selected set of objects

Only update objects where:

  • object does not have a 'deleted' workflow state

  • objectNumber starts with A, H, or N, and

  • inventoryStatus is NOT deaccessioned, transferred, missing, or stolen, and

  • publishTo value (is set, but is not NAGPRA or Do not publish) OR (publishTo value is not set)

Tip
"publishTo value is set" means there is at least one row in collectionobjects_common_publishtolist with id matching to collectionobjects_common id. There can be rows for objects in collectionobjects_common_publishtolist with null item field. These should be updated with the new value.

This project needed two separate actions: update existing publishTo values meeting criteria, and inserting new publishTo values into records meeting other criteria but having no publishTo values

Update existing publishTo values meeting criteria
UPDATE collectionobjects_common_publishtolist
SET item = 'urn:cspace:core.collectionspace.org:vocabularies:name(publishto):item:name(cspacepub)''CollectionSpace Public Browser'''
WHERE id IN (
  SELECT coc_pub.id
  FROM collectionobjects_common AS coc
  LEFT JOIN collectionobjects_common_inventorystatuslist AS coc_inv ON (coc.id = coc_inv.id)
  LEFT JOIN collectionobjects_common_publishtolist AS coc_pub ON (coc.id = coc_pub.id)
  INNER JOIN misc ON misc.id = coc.id AND misc.lifecyclestate != 'deleted'
  WHERE
    (coc.objectnumber LIKE 'A%' OR coc.objectnumber LIKE 'H%' OR coc.objectnumber LIKE 'N%')
  AND
    (
      (coc_inv.item NOT LIKE '%deaccessioned%' OR coc_inv.item NOT LIKE '%transferred%'OR coc_inv.item NOT LIKE '%missing%' OR coc_inv.item NOT LIKE '%stolen%')
      OR
      coc_inv.item IS NULL)
  AND
    coc_pub.id IS NOT NULL
  AND
    (
  	  coc_pub.item IS NULL
  	  OR
  	  (coc_pub.item NOT LIKE '%NAGPRA%' AND coc_pub.item NOT LIKE '%Do not publish%')
    )
);
Insert new publishTo values into records meeting other criteria, but having no existing publishTo values
WITH not_published AS (
SELECT DISTINCT coc.id FROM collectionobjects_common AS coc
    LEFT JOIN collectionobjects_common_inventorystatuslist AS coc_inv ON (coc.id = coc_inv.id)
    LEFT JOIN collectionobjects_common_publishtolist AS coc_pub ON (coc.id = coc_pub.id)
    INNER JOIN misc ON misc.id = coc.id AND misc.lifecyclestate != 'deleted'
WHERE (coc.objectnumber LIKE 'A%' OR coc.objectnumber LIKE 'H%' OR coc.objectnumber LIKE 'N%')
  AND
  (
    (coc_inv.item NOT LIKE '%deaccessioned%' OR coc_inv.item NOT LIKE '%transferred%'OR coc_inv.item NOT LIKE '%missing%' OR coc_inv.item NOT LIKE '%stolen%')
    OR coc_inv.item IS NULL)
  AND coc_pub.id IS NULL
), to_insert AS (
  SELECT not_published.id,
  CAST(0 AS integer) AS pos,
  CAST('urn:cspace:core.collectionspace.org:vocabularies:name(publishto):item:name(cspacepub)''CollectionSpace Public Browser''' AS text) AS item
  FROM not_published
)
INSERT INTO collectionobjects_common_publishtolist
SELECT * from to_insert
Final count of objects published to public browser
SELECT COUNT(coc.id) FROM collectionobjects_common AS coc
LEFT JOIN collectionobjects_common_publishtolist AS coc_pub ON (coc.id = coc_pub.id)
INNER JOIN misc ON misc.id = coc.id AND misc.lifecyclestate != 'deleted'
WHERE coc_pub.item LIKE '%Public Browser%'

Upon re-indexing, logs showed errors like:

org.nuxeo.ecm.core.api.NuxeoException: Could not select: SELECT "id", "annotationauthor", "annotationdate", "annotationtype", "annotationnote" FROM "annotationgroup" WHERE "id" IN (?)
Caused by: org.postgresql.util.PSQLException: Bad value for type timestamp/date/time: {1}
Caused by: java.lang.NumberFormatException: Expected time to be colon-separated, got '/'
Find rows in annotationgroup where annotationdate value contains /
select * from annotationgroup
where annotationdate LIKE '%/%'

For the client, generate a report of object numbers, previous invalid annotationdate values, and what you changed them to. The following will give you the first two pieces of that.

select cc.objectnumber, ag.annotationdate from annotationgroup ag
inner join hierarchy h on ag.id = h.id
inner join hierarchy hh on h.parentid = hh.id
inner join collectionobjects_common cc on hh.id = cc.id
where ag.annotationdate LIKE '%/%'

You need a list of the object record CSIDs where this will be fixed, so that only those can be re-indexed.

Get CSIDs of records containing annotationgroup entries with annotationdate values containing /
SELECT HH.NAME AS CSID,
	AG.ID AS GROUPID
FROM ANNOTATIONGROUP AG
LEFT JOIN HIERARCHY H ON AG.ID = H.ID
LEFT JOIN HIERARCHY HH ON H.PARENTID = HH.ID
WHERE ANNOTATIONDATE LIKE '%/%';