# Media Metadata Extractor

This service requests media metadata from two dedicated services,
[media-indexer](https://gitlab.switch.ch/memoriav/memobase/services/indexer)
and
[media-indexer-helper](https://gitlab.switch.ch/memoriav/memobase/services/histogram),
and enriches the record metadata with the returned media metadata.
media-indexer and media-indexer-helper use different tools to extract the
requested metadata:

- Siegfried: MIME type and PRONOM ID
- ffmpeg (especially ffprobe): AV metadata extraction; AV validation
- ImageMagick (especially identify and convert): image metadata extraction; image validation

If errors occur during the analysis, Media Metadata Extractor tries to enrich
the record as much as possible but nevertheless issues a WARNING report.
If everything went well, a SUCCESS report is propagated.
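The reporting behaviour described above can be sketched as follows. This is an illustration only, not the service's actual code: the function `enrich`, its parameters, and the shape of the record are hypothetical, and each extractor stands in for one call to the indexer services.

```python
from typing import Callable, Dict, List, Tuple


def enrich(
    record: dict, extractors: Dict[str, Callable[[], dict]]
) -> Tuple[dict, str, List[str]]:
    """Apply every extractor, keep whatever succeeds, and derive the report status.

    Hypothetical helper: `extractors` maps a tool name to a callable that
    returns the metadata fields contributed by that tool.
    """
    enriched = dict(record)  # work on a copy so the input record stays untouched
    errors: List[str] = []
    for name, extract in extractors.items():
        try:
            enriched.update(extract())
        except Exception as exc:
            # A failing tool must not discard the results of the other tools.
            errors.append(f"{name}: {exc}")
    # Any error downgrades the report to WARNING; otherwise it is SUCCESS.
    status = "WARNING" if errors else "SUCCESS"
    return enriched, status, errors
```

The point of the sketch is that a failure in one tool is recorded rather than raised, so the record still carries the metadata that the remaining tools produced.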

Although they run the same code, there are two deployments of Media
Metadata Extractor: one for images (reading from input topic
import-process-image-enrichment) and one for AV media (reading from input
topic import-process-av-enrichment).

## Configuration

The service requires the following environment variables to be set:

* `KAFKA_BOOTSTRAP_SERVERS`: Comma-separated list of Kafka bootstrap server addresses
* `APPLICATION_ID`: Id used by Kafka Streams application (see [Kafka documentation](https://kafka.apache.org/documentation/#streamsconfigs_application.id) for details)
* `TOPIC_IN`: Name of Kafka topic where messages are read from
* `TOPIC_OUT`: Name of Kafka topic where messages are written to (without environment postfix)
* `TOPIC_PROCESS`: Name of Kafka topic where status reports are written to
* `INDEXER_HOST`: Address of indexer service
* `INDEXER_CONNECT_TIMEOUT_MS`: Time in milliseconds after which a connection timeout occurs
* `INDEXER_READ_TIMEOUT_MS`: Time in milliseconds within which a response from the indexer is expected; choose a generous value, since processing a large media file can take a while
* `CONSUMER_MAX_POLL_INTERVAL_MS`: Maximum time of consumer idleness in milliseconds; after this period the consumer is considered failed (see [Kafka documentation](https://kafka.apache.org/documentation/#consumerconfigs_max.poll.interval.ms) for details)
* `CONSUMER_MAX_POLL_RECORDS`: Maximum number of records returned in a single call to poll() (see [Kafka documentation](https://kafka.apache.org/documentation/#consumerconfigs_max.poll.records) for details)
* `PARSER_ACTIONS_REMOTE`: Comma-separated list of actions which should be performed by the indexer when analysing a remote media file (see below for allowed actions)
* `PARSER_ACTIONS_LOCAL`: Comma-separated list of actions which should be performed by the indexer when analysing a locally available media file
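As an illustration of how these variables fit together, the following sketch reads and type-checks them. It is not the service's own configuration code: the function `load_config` and the grouping into integer and list variables are assumptions derived from the descriptions above.

```python
import os
from typing import Dict, Mapping

# Variable names as listed above.
REQUIRED_VARS = [
    "KAFKA_BOOTSTRAP_SERVERS", "APPLICATION_ID", "TOPIC_IN", "TOPIC_OUT",
    "TOPIC_PROCESS", "INDEXER_HOST", "INDEXER_CONNECT_TIMEOUT_MS",
    "INDEXER_READ_TIMEOUT_MS", "CONSUMER_MAX_POLL_INTERVAL_MS",
    "CONSUMER_MAX_POLL_RECORDS", "PARSER_ACTIONS_REMOTE", "PARSER_ACTIONS_LOCAL",
]
# Assumption: these hold plain integers (milliseconds or record counts).
INT_VARS = {
    "INDEXER_CONNECT_TIMEOUT_MS", "INDEXER_READ_TIMEOUT_MS",
    "CONSUMER_MAX_POLL_INTERVAL_MS", "CONSUMER_MAX_POLL_RECORDS",
}
# Described above as comma-separated lists.
LIST_VARS = {"KAFKA_BOOTSTRAP_SERVERS", "PARSER_ACTIONS_REMOTE", "PARSER_ACTIONS_LOCAL"}


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, object]:
    """Hypothetical loader: fail fast on missing variables, then convert types."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    config: Dict[str, object] = {}
    for name in REQUIRED_VARS:
        raw = env[name]
        if name in INT_VARS:
            config[name] = int(raw)  # raises ValueError on non-numeric input
        elif name in LIST_VARS:
            config[name] = [item.strip() for item in raw.split(",") if item.strip()]
        else:
            config[name] = raw
    return config
```

Checking for all missing variables at once, rather than failing on the first one, makes a misconfigured deployment quicker to fix.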

## Possible actions

* `siegfried`: Identify MIME type and PRONOM ID with [Siegfried](https://github.com/richardlehane/siegfried)
* `identify`: Run ImageMagick's [`identify`](https://imagemagick.org/script/identify.php) subcommand
* `ffprobe`: Run ffmpeg's [`ffprobe`](https://ffmpeg.org/ffprobe.html) subcommand
* `histogram`: Create a histogram from the analysed image
* `validateimage`: Validate image file with [ImageMagick](https://imagemagick.org)
* `validateav`: Validate audio or video file with [ffmpeg](https://ffmpeg.org)
* `exif`: Extract EXIF data with [ExifTool](https://exiftool.org)
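A value for `PARSER_ACTIONS_REMOTE` or `PARSER_ACTIONS_LOCAL` can be checked against this list before deployment. The following is a minimal sketch: the helper `parse_actions` is hypothetical, and how the service itself reacts to unknown action names is not documented here.

```python
from typing import List

# Action names as listed above.
ALLOWED_ACTIONS = frozenset(
    {"siegfried", "identify", "ffprobe", "histogram", "validateimage", "validateav", "exif"}
)


def parse_actions(value: str) -> List[str]:
    """Split a comma-separated action list and reject names that are not allowed."""
    actions = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [action for action in actions if action not in ALLOWED_ACTIONS]
    if unknown:
        raise ValueError("unknown actions: " + ", ".join(unknown))
    return actions


# Example value only, chosen for illustration; not a recommended configuration.
print(parse_actions("siegfried, identify, histogram"))
```

Rejecting unknown names early is a design choice of this sketch; it surfaces typos such as `validate_av` at start-up instead of at analysis time.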