memoriav / Memobase 2020 / utilities / Import Process CLI / Commits

Commit 0247f5f7, authored Oct 04, 2021 by Jonas Waeber
Parent: 208ba180

Remove old scripts

Changes: 56
**utilities/README.md** deleted 100644 → 0
## Utility Scripts

These scripts and files serve various convenience or support functions.

### Clean Up Validation Helm Charts

A script to remove all text-file-validation Helm releases from the Kubernetes cluster.
Run `./run.sh` to execute it.

### Formats

See the local README for details. Used to update the format facet labels.

### Kafka

Utility scripts to manage the Kafka cluster.

### Languages

See the local README for details. Used to update the language facet labels.

### Migration

Scripts and files that were used for the migration. Mostly obsolete.

### Publish

A script to change documents in an index to published (or unpublished).

### Reports

Scripts for managing the process reports that are added to Elasticsearch.
**utilities/cleanup-validation-helm-charts/remove_all_releases.py** deleted 100644 → 0
```python
import json
import subprocess

if __name__ == '__main__':
    # Expects the output of `helm list --output json` (written by run.sh).
    with open('current_releases.json', 'r') as fp:
        releases = json.load(fp)
    for release in releases:
        # Uninstall every release of the text-file-validation chart.
        if release['chart'].startswith('text-file-validation'):
            subprocess.run(["helm", "uninstall", release['name']])
```
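The chart check inside the loop can be pulled out into a pure function, which makes the selection logic testable without touching the cluster. A minimal sketch (the `releases_to_remove` helper is not part of the original script):

```python
def releases_to_remove(releases, chart_prefix='text-file-validation'):
    """Return the names of all Helm releases whose chart matches the prefix."""
    return [r['name'] for r in releases if r['chart'].startswith(chart_prefix)]

# Entry shape as produced by `helm list --output json`:
sample = [
    {'name': 'tfv-abc', 'chart': 'text-file-validation-1.2.0'},
    {'name': 'other', 'chart': 'import-process-0.4.1'},
]
print(releases_to_remove(sample))  # → ['tfv-abc']
```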
**utilities/cleanup-validation-helm-charts/run.sh** deleted 100755 → 0
```shell
helm list --output json > current_releases.json
python remove_all_releases.py
```
**utilities/delete/delete_institution.sh** deleted 100644 → 0
```shell
# Requires two tunnels to work properly:
# ssh -L 8080:dd-fed2:8080 swissbib@dd-fed2.ub.unibas.ch
# ssh -L 8081:mb-es1:8080 swissbib@mb-es1.memobase.unibas.ch

INSTITUTION_ID=""
INSTITUTION_INDEX="institutions-v6"
# This password can be found in the memobase keypass database.
FEDORA_PASSWORD=""

# Deletes the institution record from Fedora. This will only work if all record
# sets which reference this institution are deleted beforehand.
curl -u "fedoraAdmin:$FEDORA_PASSWORD" -XDELETE "http://localhost:8080/frcrep/rest/institution/$INSTITUTION_ID"

# Deletes the tombstone of the institution. This stores the fact that this
# identifier was once used but then deleted. Only delete it if you wish to free
# up the slot or do not want to keep a log of the deletion.
#curl -u "fedoraAdmin:$FEDORA_PASSWORD" -XDELETE "http://localhost:8080/frcrep/rest/institution/$INSTITUTION_ID/fcr:tombstone"

# Deletes the institution document from Elasticsearch.
curl -XDELETE "localhost:8081/$INSTITUTION_INDEX/_doc/$INSTITUTION_ID"
```
**utilities/delete/delete_record_set.sh** deleted 100644 → 0
```shell
# Requires two tunnels to work properly:
# ssh -L 8080:dd-fed2:8080 swissbib@dd-fed2.ub.unibas.ch
# ssh -L 8081:mb-es1:8080 swissbib@mb-es1.memobase.unibas.ch

RECORD_SET_ID=""
RECORD_SET_INDEX="record-sets-v7"
PASSWORD=""

# Deletes the record set from Fedora. This will only work if all the records
# which reference this record set are deleted beforehand.
curl -u "fedoraAdmin:$PASSWORD" -XDELETE "http://localhost:8080/frcrep/rest/recordSet/$RECORD_SET_ID"

# Deletes the tombstone of the record set. This stores the fact that this
# identifier was once used but then deleted. Only delete it if you wish to free
# up the slot or do not want to keep a log of the deletion.
#curl -u "fedoraAdmin:$PASSWORD" -XDELETE "http://localhost:8080/frcrep/rest/recordSet/$RECORD_SET_ID/fcr:tombstone"

# Deletes the record set document from Elasticsearch.
curl -XDELETE "localhost:8081/$RECORD_SET_INDEX/_doc/$RECORD_SET_ID"
```
**utilities/elastic_scripts/copy_to_backup.py** deleted 100644 → 0
```python
from simple_elastic import ElasticIndex

if __name__ == '__main__':
    backup_index = ElasticIndex('documents-v21', url='localhost:8081')
    prod_index = ElasticIndex('documents-v21', url='localhost:8080')

    # Hard-coded document count of the production index at the time of writing.
    total = 416398
    current = 1000
    for item in prod_index.scroll(size=1000):
        backup_index.bulk(item, identifier_key='id', keep_id_key=True)
        print(f"Loaded {current}/{total}")
        current += 1000
```
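The hand-maintained counter above can overshoot the total on the last batch; the progress line can instead be derived from the batch number. A small sketch (the `progress_line` helper is hypothetical, not part of the original script):

```python
def progress_line(batch_number: int, batch_size: int, total: int) -> str:
    """Progress message after `batch_number` scroll batches of `batch_size` documents."""
    loaded = min(batch_number * batch_size, total)
    return f"Loaded {loaded}/{total}"

print(progress_line(1, 1000, 416398))  # → Loaded 1000/416398
```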
**utilities/elastic_scripts/reindex.sh** deleted 100755 → 0
```shell
curl -X POST "localhost:8080/_reindex?pretty" -H 'Content-Type: application/json' -d '
{
  "source": {
    "index": "institutions-v5"
  },
  "dest": {
    "index": "institutions-v6"
  }
}
'
```
**utilities/elastic_scripts/udpate_by_query/query.json** deleted 100644 → 0
```json
{
  "query": {
    "term": {
      "recordSet.facet": "bar-001"
    }
  },
  "script": {
    "params": {
      "institution": [
        {
          "facet": [],
          "filter": "csa",
          "name": {
            "de": ["Cinémathèque suisse"],
            "fr": ["Cinémathèque suisse"],
            "it": ["Cinémathèque suisse"],
            "un": []
          }
        },
        {
          "facet": [],
          "filter": "bar",
          "name": {
            "de": ["Schweizerisches Bundesarchiv"],
            "fr": ["Archives fédérales suisses"],
            "it": ["Archivio federale svizzero"],
            "un": []
          }
        }
      ]
    },
    "source": "ctx._source['institution'] = params.institution"
  }
}
```
**utilities/elastic_scripts/udpate_by_query/update_by_query.sh** deleted 100644 → 0
```shell
curl -X POST -H 'Content-Type: application/json' -d "@query.json" "localhost:8080/documents-v21/_update_by_query"
```
**utilities/elastic_scripts/unpublish_record_set.json** deleted 100644 → 0
```json
{
  "query": {
    "term": {
      "id": "klu-002"
    }
  },
  "script": {
    "source": "ctx._source.published = 'false'",
    "lang": "painless"
  }
}
```
**utilities/elastic_scripts/unpublish_record_set.sh** deleted 100644 → 0
```shell
curl -X POST -H 'Content-Type: application/json' -d "@unpublish_record_set.json" "localhost:8080/record-sets-v7/_update_by_query"
```
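The query file above hard-codes one record set and one flag value; the same `_update_by_query` body can be generated for any record set. A minimal sketch (the `publish_body` helper is hypothetical, not part of these utilities; it mirrors the JSON above, where the flag is stored as a string):

```python
import json

def publish_body(record_set_id: str, published: bool) -> str:
    """Build an _update_by_query body that flips the published flag of one record set."""
    flag = 'true' if published else 'false'
    return json.dumps({
        "query": {"term": {"id": record_set_id}},
        "script": {
            "source": f"ctx._source.published = '{flag}'",
            "lang": "painless",
        },
    })

body = publish_body("klu-002", published=False)
```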
**utilities/formats/README.md** deleted 100644 → 0
## Format Facet Labels

**IMPORTANT**: The SPARQL query service requests do not work from within the university VPN.

Missing labels should be updated directly on Wikidata.

### Files

The base folder for managing the format facet mapping labels.

* `custom_labels.csv` contains a list of all labels for facet values which are not linked to Wikidata. It needs to be updated manually if there are any changes.
* `format_labels.csv` is generated by the script from the Wikidata labels and the custom labels.
* `missing_labels.csv` is generated by the script and lists the Wikidata facet values for which at least one label is missing.
* `query.sparql` contains the template for the SPARQL query used to retrieve the labels from the Wikidata Query Service.
* `script.py` is run to update the format labels.
**utilities/formats/custom_labels.csv** deleted 100644 → 0

```
"Andere","Andere","Autres","Altri"
"Unbekannt","Unbekannt","Inconnue","Sconosciuto"
"Keine Angabe","Keine Angabe","Aucune information","Nessuna informazione"
```
**utilities/formats/format_labels.csv** deleted 100644 → 0

```
"id","de","fr","it"
"Q5294","DVD","DVD","DVD"
"Q6293","fotografischer Film","pellicule photographique","pellicola fotografica"
"Q34467","Compact Disc","disque compact","compact disc"
"Q42591","MP3","MPEG-1/2 Audio Layer 3","MP3"
"Q47770","Blu-ray Disc","disque Blu-ray","Blu-ray Disc"
"Q149757","Compact Cassette","cassette audio","musicassetta"
"Q166816","Diafilm","film inversible",""
"Q179744","Daguerreotypie","daguerréotype","dagherrotipia"
"Q183976","VHS","Video Home System","VHS"
"Q192425","Postkarte","carte postale","cartolina postale"
"Q193663","Magnetband","bande magnétique","nastro magnetico"
"Q194383","16-mm-Film","Format 16 mm","16 millimetri"
"Q201093","RealAudio","RealAudio","RealAudio"
"Q217570","RIFF WAVE","Waveform Audio File Format","Waveform Audio File Format"
"Q219763","MPEG-4","MPEG-4","MPEG-4"
"Q226528","35-mm-Film","format 35 mm","pellicola cinematografica 35 millimetri"
"Q261242","70-mm-Film","Format 70 mm","70 millimetri"
"Q270183","8-mm-Film","Film 8 mm","8 millimetri"
"Q275007","9,5-mm-Film","Film 9,5 mm","9,5 millimetri"
"Q275079","MiniDisc","MiniDisc","Minidisc"
"Q278080","U-matic","U-matic","U-matic"
"Q280761","Windows Media Video","Windows Media Video","Windows Media Video"
"Q336316","MP4","MPEG-4 Part 14","MPEG-4 Part 14"
"Q420778","CD-R","Disque compact enregistrable","CD-R"
"Q592654","MPEG-1 Audio Layer 2","MPEG-1 Audio Layer II","MPEG-1 Layer II"
"Q595597","Negativfilm","film négatif","Pellicola per negativi"
"Q597615","Digital Audio Tape","Digital Audio Tape","Digital Audio Tape"
"Q690148","Betamax","Betamax","Betamax"
"Q691783","Phonographenwalze","cylindre phonographique","cilindro fonografico"
"Q830904","Betacam SP","",""
"Q830910","Betacam","Bétacam","Betacam"
"Q841983","Langspielplatte","Long Play","long playing"
"Q875215","High Definition Video","High Definition Video","High Definition Video"
"Q912760","Fotopapier","papier photographique","carta fotografica"
"Q942350","","Fichier Quicktime","QuickTime File Format"
"Q1004803","Lichttonverfahren","",""
"Q1050875","Super Video Home System","Super VHS","S-VHS"
"Q1136889","ProRes","ProRes 422","ProRes"
"Q1138868","Fotoplatte","plaque photographique","lastra fotografica"
"Q1155472","Video 8","Video 8","Video8"
"Q1194529","HDCAM","HDCAM",""
"Q1361160","DVCAM","DVCAM",""
"Q1412320","Super 8","Super 8","Super 8 millimetri"
"Q1509636","Mikrokassette","Microcassette","microcassetta"
"Q1751553","Digital Betacam","Betacam numérique",""
"Q2121997","Quadruplex","Bande vidéo 2 pouces","2 pollici Quadruplex"
"Q2302273","Hi8","Hi-8",""
"Q2581328","Digital Cinema Package","Digital Cinema Package","Digital Cinema Package"
"Q3072028","17,5-mm-Film","Film 17,5 mm",""
"Q3796889","Digital Video","Digital Video","Digital Video"
"Q5273930","Dictabelt","Dictabelt",""
"Q6957908","MiniDV","MiniDV",""
"Q15945314","Half-inch tape","Half-inch tape","Half-inch tape"
"Q17010713","Direktschnitt","",""
"Q20183259","DVCPro-Familie","",""
"Q26987229","","fichier audio",""
"Q28919138","","",""
"Q28919141","","",""
"Q56055236","Abzug","épreuve photographique",""
"Q61996834","","",""
"Andere","Andere","Autres","Altri"
"Unbekannt","Unbekannt","Inconnue","Sconosciuto"
"Keine Angabe","Keine Angabe","Aucune information","Nessuna informazione"
```
**utilities/formats/missing_labels.csv** deleted 100644 → 0

```
"Q166816","Diafilm","film inversible",""
"Q830904","Betacam SP","",""
"Q942350","","Fichier Quicktime","QuickTime File Format"
"Q1004803","Lichttonverfahren","",""
"Q1194529","HDCAM","HDCAM",""
"Q1361160","DVCAM","DVCAM",""
"Q1751553","Digital Betacam","Betacam numérique",""
"Q2302273","Hi8","Hi-8",""
"Q3072028","17,5-mm-Film","Film 17,5 mm",""
"Q5273930","Dictabelt","Dictabelt",""
"Q6957908","MiniDV","MiniDV",""
"Q17010713","Direktschnitt","",""
"Q17165350","","",""
"Q20183259","DVCPro-Familie","",""
"Q26987229","","fichier audio",""
"Q28919138","","",""
"Q28919141","","",""
"Q56055236","Abzug","épreuve photographique",""
"Q61996834","","",""
```
**utilities/formats/query.sparql** deleted 100644 → 0

```sparql
SELECT ?item
WHERE {
  wd:PLACEHOLDER rdfs:label ?item .
  FILTER(lang(?item) = "de" || lang(?item) = "fr" || lang(?item) = "it")
}
```
**utilities/formats/script.py** deleted 100644 → 0
```python
import csv
import logging
import sys

from SPARQLWrapper import SPARQLWrapper, JSON

logging.basicConfig(stream=sys.stdout, level=logging.INFO)


def read_csv_file(path: str):
    with open(path, 'r') as fp:
        csv_rows = csv.reader(fp, dialect='unix')
        ids = set()
        strings = set()
        # skip the header
        logging.info("Reading format mapping file.")
        next(csv_rows, None)
        for r in csv_rows:
            for index, item in enumerate(r[1:]):
                if index <= 5 and item != "":
                    ids.add(item)
                elif index > 5 and item != "":
                    strings.add(item)
    logging.info("Collected all mapped wikidata identifiers and custom strings.")
    logging.info(f"There are {len(ids)} unique wikidata identifiers present.")
    logging.info(f"The following custom facet values are present: {', '.join(strings)}")
    return ids, strings


if __name__ == '__main__':
    source_path = '../../global-configs/prod/transforms/formats.csv'
    wikidata_identifiers, custom_strings = read_csv_file(source_path)

    logging.info("Check if all custom facet values are mapped to a label.")
    with open('custom_labels.csv', 'r') as cl:
        custom_label_text = cl.read()
    custom_labels = csv.reader(custom_label_text.split('\n'), dialect='unix')
    defined_labels = set()
    for row in custom_labels:
        defined_labels.add(row[0])
    difference = custom_strings.difference(defined_labels)
    if len(difference) > 0:
        logging.error(f"The following custom facet values have no labels: {', '.join(difference)}.")
    else:
        logging.info("All custom facet values have a label defined.")

    logging.info("Setting up connection to service.")
    s = SPARQLWrapper("https://query.wikidata.org/sparql",
                      agent='Python Script (University Library Basel, jonas.waeber@unibas.ch)')

    logging.info("Reading SPARQL template.")
    with open('query.sparql', 'r') as sp:
        request_template = sp.read()

    logging.info("Writing the format labels file.")
    missing_labels = list()
    wikidata_identifiers = sorted(wikidata_identifiers, key=lambda x: int(x.replace('Q', '')))
    with open('format_labels.csv', 'w') as w:
        writer = csv.writer(w, dialect='unix')
        writer.writerow(['id', 'de', 'fr', 'it'])
        for q in wikidata_identifiers:
            request = request_template.replace('PLACEHOLDER', q)
            s.setQuery(request)
            s.setReturnFormat(JSON)
            logging.info(f"Query Wikidata service for value {q}.")
            results = s.query().convert()
            lang_values = dict()
            for row in results['results']['bindings']:
                lang_values[row['item']['xml:lang']] = row['item']['value']
            de = lang_values['de'] if 'de' in lang_values else ''
            fr = lang_values['fr'] if 'fr' in lang_values else ''
            it = lang_values['it'] if 'it' in lang_values else ''
            writer.writerow([q, de, fr, it])
            if de == '' or fr == '' or it == '':
                missing_labels.append([q, de, fr, it])
        # add the custom facet value labels at the end.
        w.write(custom_label_text)

    if len(missing_labels) > 0:
        logging.info("Writing missing labels.")
        missing_labels = sorted(missing_labels, key=lambda x: int(x[0].replace('Q', '')))
        with open('missing_labels.csv', 'w') as w:
            writer = csv.writer(w, dialect='unix')
            for row in missing_labels:
                writer.writerow(row)

    logging.info("Finished processing format labels.")
```
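The bindings-to-row step in the query loop can be isolated as a pure function, which makes it easy to test without calling the Wikidata service. A minimal sketch (the `bindings_to_row` helper is not part of the original script):

```python
def bindings_to_row(q, bindings):
    """Map SPARQL label bindings to an [id, de, fr, it] CSV row,
    using '' for any missing language."""
    lang_values = {b['item']['xml:lang']: b['item']['value'] for b in bindings}
    return [q] + [lang_values.get(lang, '') for lang in ('de', 'fr', 'it')]

# Result shape as returned by SPARQLWrapper's JSON format:
sample = [
    {'item': {'xml:lang': 'de', 'value': 'Diafilm'}},
    {'item': {'xml:lang': 'fr', 'value': 'film inversible'}},
]
print(bindings_to_row('Q166816', sample))  # → ['Q166816', 'Diafilm', 'film inversible', '']
```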
**utilities/import/README.md** deleted 100644 → 0
## Blank configurations for the import

This folder contains a blank `mapping.yml` and `localTransforms.yml` with example data. The folder `new-001` can be copied to the deployment, and the config files can be adapted for the import of a new record set.
**utilities/import/new-001/mappings/localTransforms.yml** deleted 100644 → 0
```yaml
# This is the local configuration for the step Metadata Normalization and Enrichment
splitEntity:
  - type: skos:Concept
    property: skos:prefLabel
    delimiter: ","
  - type: rico:Person
    property: rico:name
    delimiter: ";"
  - type: rico:CorporateBody
    property: rico:name
    delimiter: ";"
  - type: rico:Place
    property: rico:name
    delimiter: ","
  - type: rico:Language
    property: rico:name
    delimiter: ","
  - type: rico:Agent
    property: rico:name
    delimiter: ";"
normalizePerson:
  splitEntity:
    type: rico:Person
    property: rico:name
    delimiter: ";"
  creationRelationName:
    # only tries to extract a value if a DUMMY-VALUE rico:name property is present in the relation.
    pattern: "\\((?<relation>.+)\\)"
    # " are necessary to ensure the pattern is parsed correctly. The pattern needs to be double escaped!
    language: NONE
  nameOrder:  # "last-to-first" (i.e. Tester, Thea) or "first-to-last" (i.e. Thea Tester)
  singleNameIsLastName: true
  nameDelimiter: SPACE
```
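The `splitEntity` rules above split one harvested field value into several entities on a delimiter; the effect can be illustrated in a few lines of Python (a simplified illustration, not the actual normalization code):

```python
def split_entity(value: str, delimiter: str):
    """Split a single harvested field value into entity values,
    trimming whitespace and dropping empty parts."""
    return [part.strip() for part in value.split(delimiter) if part.strip()]

print(split_entity("Tester, Thea; Muster, Max", ";"))  # → ['Tester, Thea', 'Muster, Max']
```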
**utilities/import/new-001/mappings/mapping.yml** deleted 100644 → 0
```yaml
record:
  uri:
  identifiers:
    original:
  type:
  title:
  titles:
    - main:
        de:
    - series:
        de:
    - broadcast:
        de:
  scopeAndContent:
    de:
  sameAs:
  abstract:
  source:
  descriptiveNote:
  relation:
  conditionsOfUse:
  conditionsOfAccess:
  isSponsoredByMemoriav:  # true or false
  rights:
    holder:
  languages:
    - content:
    - caption:
  subject:
    - prefLabel:
  genre:
    - prefLabel:
  placeOfCapture:
    - name:
  relatedPlaces:
    - name:
  creationDate:
  issuedDate:
  temporal:
  creators:
    - agent:
        name:
        relationName:
    - person:
        name:
        relationName:
    - corporateBody:
        name:
        relationName:
  contributors:
    - agent:
        name:
        relationName:
    - person:
        name:
        relationName:
    - corporateBody:
        name:
        relationName:
  producers:
    - agent:
        name:
    - person:
        name:
    - corporateBody:
        name:
  relatedAgents:
    - agent:
        name:
    - person:
        name:
    - corporateBody:
        name:
  publishedBy:
    - agent:
        name:
    - person:
        name:
    - corporateBody:
        name:
physical:
  identifiers:
    callNumber:
  carrierType:
  descriptiveNote:
  duration:
  physicalCharacteristics:
    - prefix:
        value: "xxx: "
      field:
  colour:
  conditionsOfUse:
  conditionsOfAccess:
  rights:
    access:
    usage:
      name:
        const: "Copyright Not Evaluated (CNE)"
      sameAs:
        const: "http://rightsstatements.org/vocab/CNE/1.0/"
digital:
  descriptiveNote:
  locator:
  duration:
  conditionsOfUse:
  conditionsOfAccess:
  rights:
    access:
    usage:
      name:
        const: "Copyright Not Evaluated (CNE)"
      sameAs:
        const: "http://rightsstatements.org/vocab/CNE/1.0/"
```