Commit 0247f5f7 by Jonas Waeber: Remove old scripts (parent 208ba180)
## Utility Scripts
These scripts and files serve various convenience or support functions.
### Clean Up Validation Helm Charts
A script to remove all the text-file-validation charts deployed on the Kubernetes cluster.
Just run `./run.sh` to execute.
### Formats
See local readme for details. Used to update the format facet labels.
### Kafka
Utility scripts to manage the Kafka cluster.
### Languages
See local readme for details. Used to update the language facet labels.
### Migration
Scripts & files which were used for the migration. Mostly obsolete.
### Publish
A script to change documents in an index to published (or unpublished).
### Reports
Scripts for managing the reports that the processing steps add to Elasticsearch.
import json
import subprocess

if __name__ == '__main__':
    # current_releases.json is produced by run.sh (helm list --output json).
    with open('current_releases.json', 'r') as fp:
        releases = json.load(fp)
    # Uninstall every release of the text-file-validation chart.
    for release in releases:
        if release['chart'].startswith('text-file-validation'):
            subprocess.run(["helm", "uninstall", release['name']])
# Dump the currently deployed helm releases, then remove the text-file-validation charts.
helm list --output json > current_releases.json
python remove_all_releases.py
# requires two tunnels to work properly
# ssh -L 8080:dd-fed2:8080 swissbib@dd-fed2.ub.unibas.ch
# ssh -L 8081:mb-es1:8080 swissbib@mb-es1.memobase.unibas.ch
INSTITUTION_ID=""
INSTITUTION_INDEX="institutions-v6"
# This password can be found in the Memobase KeePass database.
FEDORA_PASSWORD=""
# Deletes the institution record from Fedora. This will only work if all record sets which reference this
# institution have been deleted beforehand.
curl -u"fedoraAdmin:$FEDORA_PASSWORD" -XDELETE http://localhost:8080/frcrep/rest/institution/$INSTITUTION_ID
# Deletes the tombstone of the institution. This stores the fact that this identifier was once used but then deleted.
# Only delete this if you wish to free up the slot or do not want to keep a log of the deletion.
#curl -u"fedoraAdmin:$FEDORA_PASSWORD" -XDELETE http://localhost:8080/frcrep/rest/institution/$INSTITUTION_ID/fcr:tombstone
curl -XDELETE "localhost:8081/$INSTITUTION_INDEX/_doc/$INSTITUTION_ID"
# requires two tunnels to work properly
# ssh -L 8080:dd-fed2:8080 swissbib@dd-fed2.ub.unibas.ch
# ssh -L 8081:mb-es1:8080 swissbib@mb-es1.memobase.unibas.ch
RECORD_SET_ID=""
RECORD_SET_INDEX="record-sets-v7"
PASSWORD=""
# Deletes the record set from Fedora. This will only work if all the records which reference this record set
# have been deleted beforehand.
curl -u"fedoraAdmin:$PASSWORD" -XDELETE http://localhost:8080/frcrep/rest/recordSet/RECORD_SET_ID
# Deletes the tombstone of the record set. This stores the fact that this identifier was once used but then deleted.
# Only delete this if you wish to free up the slot or do not want to keep a log of the deletion.
#curl -u"fedoraAdmin:$PASSWORD" -XDELETE http://localhost:8080/frcrep/rest/recordSet/RECORD_SET_ID/fcr:tombstone
curl -XDELETE "localhost:8081/$RECORD_SET_INDEX/_doc/$RECORD_SET_ID"
from simple_elastic import ElasticIndex

if __name__ == '__main__':
    # Copy all documents from the production index to the backup index in batches of 1000.
    backup_index = ElasticIndex('documents-v21', url='localhost:8081')
    prod_index = ElasticIndex('documents-v21', url='localhost:8080')
    # Document count of the production index at the time the script was written.
    total = 416398
    current = 1000
    for item in prod_index.scroll(size=1000):
        backup_index.bulk(item, identifier_key='id', keep_id_key=True)
        print(f"Loaded {current}/{total}")
        current += 1000
curl -X POST "localhost:8080/_reindex?pretty" -H 'Content-Type: application/json' -d'
{
"source": {
"index": "institutions-v5"
},
"dest": {
"index": "institutions-v6"
}
}
'
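
The same reindex can be triggered from Python; a minimal sketch using `requests` (an assumption, the repository itself only contains the curl call above):

import requests

# Trigger the same _reindex call; host and index names are taken from the curl command.
body = {"source": {"index": "institutions-v5"}, "dest": {"index": "institutions-v6"}}
response = requests.post("http://localhost:8080/_reindex?pretty", json=body)
response.raise_for_status()
print(response.text)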
{
  "query": {
    "term": {
      "recordSet.facet": "bar-001"
    }
  },
  "script": {
    "params": {
      "institution": [
        {
          "facet": [],
          "filter": "csa",
          "name": {
            "de": ["Cinémathèque suisse"],
            "fr": ["Cinémathèque suisse"],
            "it": ["Cinémathèque suisse"],
            "un": []
          }
        },
        {
          "facet": [],
          "filter": "bar",
          "name": {
            "de": ["Schweizerisches Bundesarchiv"],
            "fr": ["Archives fédérales suisses"],
            "it": ["Archivio federale svizzero"],
            "un": []
          }
        }
      ]
    },
    "source": "ctx._source['institution'] = params.institution"
  }
}
curl -X POST -H 'Content-Type: application/json' -d "@query.json" "localhost:8080/documents-v21/_update_by_query"
{
  "query": {
    "term": {
      "id": "klu-002"
    }
  },
  "script": {
    "source": "ctx._source.published = 'false'",
    "lang": "painless"
  }
}
curl -X POST -H 'Content-Type: application/json' -d "@unpublish_record_set.json" "localhost:8080/record-sets-v7/_update_by_query"
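
The same flip can be scripted; a minimal sketch (the helper name `set_published` and the use of `requests` are assumptions, the repository itself only ships the query file and curl call above):

import json
import requests

# Hypothetical helper: set the published flag of a record set through the
# Elasticsearch _update_by_query API (host and index taken from the curl above).
def set_published(record_set_id: str, published: bool, index: str = "record-sets-v7") -> None:
    body = {
        "query": {"term": {"id": record_set_id}},
        "script": {
            "source": f"ctx._source.published = '{str(published).lower()}'",
            "lang": "painless",
        },
    }
    response = requests.post(
        f"http://localhost:8080/{index}/_update_by_query",
        headers={"Content-Type": "application/json"},
        data=json.dumps(body),
    )
    response.raise_for_status()

# Example: republish the record set unpublished above.
# set_published('klu-002', True)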
## Format Facet Labels
**IMPORTANT**: The SPARQL query service requests do not work within the university VPN.
The missing labels should be directly updated on Wikidata.
### Files
The base folder to manage the format facet mapping labels.
* `custom_labels.csv` contains a list of all the labels for facet values
which are not linked to Wikidata. Needs to be updated manually if there are any changes.
* `format_labels.csv` is generated by the script based on Wikidata labels and the custom labels.
* `missing_labels.csv` is generated by the script and lists the Wikidata facet values for which at
least one label is missing.
* `query.sparql` contains the template for the SPARQL query used to retrieve the labels
from the Wikidata Query Service.
* `script.py` is run to update the format labels.
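
For illustration, a condensed sketch of what `script.py` does for a single identifier (Q5294 is just an example value; the full implementation is in `script.py` below):

from SPARQLWrapper import SPARQLWrapper, JSON

# Fill the query.sparql template for one Wikidata identifier and print its labels.
with open('query.sparql', 'r') as sp:
    template = sp.read()

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent='Python Script (University Library Basel, jonas.waeber@unibas.ch)')
sparql.setQuery(template.replace('PLACEHOLDER', 'Q5294'))
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()['results']['bindings']:
    print(row['item']['xml:lang'], row['item']['value'])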
"Andere","Andere","Autres","Altri"
"Unbekannt","Unbekannt","Inconnue","Sconosciuto"
"Keine Angabe","Keine Angabe","Aucune information","Nessuna informazione"
"id","de","fr","it"
"Q5294","DVD","DVD","DVD"
"Q6293","fotografischer Film","pellicule photographique","pellicola fotografica"
"Q34467","Compact Disc","disque compact","compact disc"
"Q42591","MP3","MPEG-1/2 Audio Layer 3","MP3"
"Q47770","Blu-ray Disc","disque Blu-ray","Blu-ray Disc"
"Q149757","Compact Cassette","cassette audio","musicassetta"
"Q166816","Diafilm","film inversible",""
"Q179744","Daguerreotypie","daguerréotype","dagherrotipia"
"Q183976","VHS","Video Home System","VHS"
"Q192425","Postkarte","carte postale","cartolina postale"
"Q193663","Magnetband","bande magnétique","nastro magnetico"
"Q194383","16-mm-Film","Format 16 mm","16 millimetri"
"Q201093","RealAudio","RealAudio","RealAudio"
"Q217570","RIFF WAVE","Waveform Audio File Format","Waveform Audio File Format"
"Q219763","MPEG-4","MPEG-4","MPEG-4"
"Q226528","35-mm-Film","format 35 mm","pellicola cinematografica 35 millimetri"
"Q261242","70-mm-Film","Format 70 mm","70 millimetri"
"Q270183","8-mm-Film","Film 8 mm","8 millimetri"
"Q275007","9,5-mm-Film","Film 9,5 mm","9,5 millimetri"
"Q275079","MiniDisc","MiniDisc","Minidisc"
"Q278080","U-matic","U-matic","U-matic"
"Q280761","Windows Media Video","Windows Media Video","Windows Media Video"
"Q336316","MP4","MPEG-4 Part 14","MPEG-4 Part 14"
"Q420778","CD-R","Disque compact enregistrable","CD-R"
"Q592654","MPEG-1 Audio Layer 2","MPEG-1 Audio Layer II","MPEG-1 Layer II"
"Q595597","Negativfilm","film négatif","Pellicola per negativi"
"Q597615","Digital Audio Tape","Digital Audio Tape","Digital Audio Tape"
"Q690148","Betamax","Betamax","Betamax"
"Q691783","Phonographenwalze","cylindre phonographique","cilindro fonografico"
"Q830904","Betacam SP","",""
"Q830910","Betacam","Bétacam","Betacam"
"Q841983","Langspielplatte","Long Play","long playing"
"Q875215","High Definition Video","High Definition Video","High Definition Video"
"Q912760","Fotopapier","papier photographique","carta fotografica"
"Q942350","","Fichier Quicktime","QuickTime File Format"
"Q1004803","Lichttonverfahren","",""
"Q1050875","Super Video Home System","Super VHS","S-VHS"
"Q1136889","ProRes","ProRes 422","ProRes"
"Q1138868","Fotoplatte","plaque photographique","lastra fotografica"
"Q1155472","Video 8","Video 8","Video8"
"Q1194529","HDCAM","HDCAM",""
"Q1361160","DVCAM","DVCAM",""
"Q1412320","Super 8","Super 8","Super 8 millimetri"
"Q1509636","Mikrokassette","Microcassette","microcassetta"
"Q1751553","Digital Betacam","Betacam numérique",""
"Q2121997","Quadruplex","Bande vidéo 2 pouces","2 pollici Quadruplex"
"Q2302273","Hi8","Hi-8",""
"Q2581328","Digital Cinema Package","Digital Cinema Package","Digital Cinema Package"
"Q3072028","17,5-mm-Film","Film 17,5 mm",""
"Q3796889","Digital Video","Digital Video","Digital Video"
"Q5273930","Dictabelt","Dictabelt",""
"Q6957908","MiniDV","MiniDV",""
"Q15945314","Half-inch tape","Half-inch tape","Half-inch tape"
"Q17010713","Direktschnitt","",""
"Q20183259","DVCPro-Familie","",""
"Q26987229","","fichier audio",""
"Q28919138","","",""
"Q28919141","","",""
"Q56055236","Abzug","épreuve photographique",""
"Q61996834","","",""
"Andere","Andere","Autres","Altri"
"Unbekannt","Unbekannt","Inconnue","Sconosciuto"
"Keine Angabe","Keine Angabe","Aucune information","Nessuna informazione"
"Q166816","Diafilm","film inversible",""
"Q830904","Betacam SP","",""
"Q942350","","Fichier Quicktime","QuickTime File Format"
"Q1004803","Lichttonverfahren","",""
"Q1194529","HDCAM","HDCAM",""
"Q1361160","DVCAM","DVCAM",""
"Q1751553","Digital Betacam","Betacam numérique",""
"Q2302273","Hi8","Hi-8",""
"Q3072028","17,5-mm-Film","Film 17,5 mm",""
"Q5273930","Dictabelt","Dictabelt",""
"Q6957908","MiniDV","MiniDV",""
"Q17010713","Direktschnitt","",""
"Q17165350","","",""
"Q20183259","DVCPro-Familie","",""
"Q26987229","","fichier audio",""
"Q28919138","","",""
"Q28919141","","",""
"Q56055236","Abzug","épreuve photographique",""
"Q61996834","","",""
SELECT ?item
WHERE
{
  wd:PLACEHOLDER rdfs:label ?item .
  FILTER(lang(?item) = "de" || lang(?item) = "fr" || lang(?item) = "it")
}
import csv
import logging
import sys

from SPARQLWrapper import SPARQLWrapper, JSON

logging.basicConfig(stream=sys.stdout, level=logging.INFO)


def read_csv_file(path: str):
    with open(path, 'r') as fp:
        csv_rows = csv.reader(fp, dialect='unix')
        ids = set()
        strings = set()
        # skip the header
        logging.info("Reading format mapping file.")
        next(csv_rows, None)
        for r in csv_rows:
            for index, item in enumerate(r[1:]):
                if index <= 5 and item != "":
                    ids.add(item)
                elif index > 5 and item != "":
                    strings.add(item)
    logging.info("Collected all mapped wikidata identifiers and custom strings.")
    logging.info(f"There are {len(ids)} unique wikidata identifiers present.")
    logging.info(f"The following custom facet values are present: {', '.join(strings)}")
    return ids, strings


if __name__ == '__main__':
    source_path = '../../global-configs/prod/transforms/formats.csv'
    wikidata_identifiers, custom_strings = read_csv_file(source_path)

    logging.info("Check if all custom facet values are mapped to a label.")
    with open('custom_labels.csv', 'r') as cl:
        custom_label_text = cl.read()
    custom_labels = csv.reader(custom_label_text.split('\n'), dialect='unix')
    defined_labels = set()
    for row in custom_labels:
        defined_labels.add(row[0])
    difference = custom_strings.difference(defined_labels)
    if len(difference) > 0:
        logging.error(f"The following custom facet values have no labels: {', '.join(difference)}.")
    else:
        logging.info("All custom facet values have a label defined.")

    logging.info("Setting up connection to service.")
    s = SPARQLWrapper("https://query.wikidata.org/sparql",
                      agent='Python Script (University Library Basel, jonas.waeber@unibas.ch)')

    logging.info("Reading SPARQL template.")
    with open('query.sparql', 'r') as sp:
        request_template = sp.read()

    logging.info("Writing the format labels file.")
    missing_labels = list()
    wikidata_identifiers = sorted(wikidata_identifiers, key=lambda x: int(x.replace('Q', '')))
    with open('format_labels.csv', 'w') as w:
        writer = csv.writer(w, dialect='unix')
        writer.writerow(['id', 'de', 'fr', 'it'])
        for q in wikidata_identifiers:
            request = request_template.replace('PLACEHOLDER', q)
            s.setQuery(request)
            s.setReturnFormat(JSON)
            logging.info(f"Query Wikidata service for value {q}.")
            results = s.query().convert()
            lang_values = dict()
            for row in results['results']['bindings']:
                lang_values[row['item']['xml:lang']] = row['item']['value']
            de = lang_values['de'] if 'de' in lang_values else ''
            fr = lang_values['fr'] if 'fr' in lang_values else ''
            it = lang_values['it'] if 'it' in lang_values else ''
            writer.writerow([q, de, fr, it])
            if de == '' or fr == '' or it == '':
                missing_labels.append([q, de, fr, it])
        # add the custom facet value labels at the end.
        w.write(custom_label_text)

    if len(missing_labels) > 0:
        logging.info("Writing missing labels.")
        missing_labels = sorted(missing_labels, key=lambda x: int(x[0].replace('Q', '')))
        with open('missing_labels.csv', 'w') as w:
            writer = csv.writer(w, dialect='unix')
            for row in missing_labels:
                writer.writerow(row)
    logging.info("Finished processing format labels.")
## Blank configurations for the import
This folder contains a blank `mapping.yml` and `localTransforms.yml` with example data. The folder `new-001` can be copied to the deployment and the config files adapted for the import of a new record set; see the sketch below.
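
A minimal sketch of the copy step (the destination path is an assumption; adjust it to the actual deployment checkout):

import shutil

# Copy the blank configuration folder for a new record set; the copied
# mapping.yml and localTransforms.yml are then adapted by hand.
shutil.copytree('new-001', '../deployment/new-001')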
# This is the local configuration for the step Metadata Normalization and Enrichment
splitEntity:
  - type: skos:Concept
    property: skos:prefLabel
    delimiter: ","
  - type: rico:Person
    property: rico:name
    delimiter: ";"
  - type: rico:CorporateBody
    property: rico:name
    delimiter: ";"
  - type: rico:Place
    property: rico:name
    delimiter: ","
  - type: rico:Language
    property: rico:name
    delimiter: ","
  - type: rico:Agent
    property: rico:name
    delimiter: ";"
normalizePerson:
  splitEntity:
    type: rico:Person
    property: rico:name
    delimiter: ";"
  creationRelationName: # only tries to extract a value if a DUMMY-VALUE rico:name property is present in the relation.
    pattern: "\\((?<relation>.+)\\)" # quotes are necessary to ensure the pattern is parsed correctly; the pattern needs to be double-escaped!
    language: NONE
  nameOrder: # "last-to-first" (e.g. Tester, Thea) or "first-to-last" (e.g. Thea Tester)
  singleNameIsLastName: true
  nameDelimiter: SPACE
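
To illustrate the creationRelationName pattern above, a minimal Python sketch (the input string is an invented example; note that Python spells the named group (?P<relation>...), while the config uses the Java-style (?<relation>...)):

import re

# Single-escaped form of the double-escaped YAML pattern "\\((?<relation>.+)\\)".
pattern = re.compile(r"\((?P<relation>.+)\)")

match = pattern.search("Tester, Thea (Regisseur)")  # example input, an assumption
if match:
    print(match.group("relation"))  # prints: Regisseur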
record:
  uri:
  identifiers:
    original:
  type:
  title:
  titles:
    - main:
        de:
    - series:
        de:
    - broadcast:
        de:
  scopeAndContent:
    de:
  sameAs:
  abstract:
  source:
  descriptiveNote:
  relation:
  conditionsOfUse:
  conditionsOfAccess:
  isSponsoredByMemoriav: # true or false
  rights:
    holder:
  languages:
    - content:
    - caption:
  subject:
    - prefLabel:
  genre:
    - prefLabel:
  placeOfCapture:
    - name:
  relatedPlaces:
    - name:
  creationDate:
  issuedDate:
  temporal:
  creators:
    - agent:
        name:
        relationName:
    - person:
        name:
        relationName:
    - corporateBody:
        name:
        relationName:
  contributors:
    - agent:
        name:
        relationName:
    - person:
        name:
        relationName:
    - corporateBody:
        name:
        relationName:
  producers:
    - agent:
        name:
    - person:
        name:
    - corporateBody:
        name:
  relatedAgents:
    - agent:
        name:
    - person:
        name:
    - corporateBody:
        name:
  publishedBy:
    - agent:
        name:
    - person:
        name:
    - corporateBody:
        name:
physical:
  identifiers:
    callNumber:
  carrierType:
  descriptiveNote:
  duration:
  physicalCharacteristics:
    - prefix:
        value: "xxx: "
      field:
  colour:
  conditionsOfUse:
  conditionsOfAccess:
  rights:
    access:
    usage:
      name:
        const: "Copyright Not Evaluated (CNE)"
      sameAs:
        const: "http://rightsstatements.org/vocab/CNE/1.0/"
digital:
  descriptiveNote:
  locator:
  duration:
  conditionsOfUse:
  conditionsOfAccess:
  rights:
    access:
    usage:
      name:
        const: "Copyright Not Evaluated (CNE)"
      sameAs:
        const: "http://rightsstatements.org/vocab/CNE/1.0/"