3.1 Processing server (ferry)

3.1.1 Hardware

Requirements


DVA-Profession was built with efficiency in mind. A current off-the-shelf PC with average performance should do fine as a transcoding server, but because of the large amounts of data it has to process when used with multiple ingest clients, you should consider buying a more powerful machine.
  • Processor: 64-bit (x64), at least 3.3 gigahertz (GHz); dual-core or better recommended
  • System memory (RAM): at least 2 gigabytes (GB)
  • Possibility to attach at least 2 SATA disks (directly on the mainboard, or via an external controller)
  • Hard disks:
    • 2x >= 1 terabyte (TB) hard disks
  • 2x gigabit (Gb) Ethernet network adapters
    • One for communicating with the video ingest stations.
    • One for data transfer exclusively with the final video archive storage.
At the Austrian Mediathek, we are using the following hardware setup:
  • Processor: Intel Dual-Core i5 CPU 680 @ 3.60GHz
  • Motherboard: Intel DH57JG, Mini-ITX
  • System memory (RAM): 4 GB system memory
  • Graphics card (onboard): Intel Graphics Media Accelerator HD
  • Network adapters:
    • Onboard: Intel 82578DC Gigabit Network Connection
    • PCI-Express card: Intel 82571EB Gigabit Ethernet Controller (rev 06)
  • Hard disks:
    • 2x 1 TB Samsung HD103SJ

3.1.2 Operating system

3.1.2.1 Installation

GNU/Linux installation


Due to the high-availability demands of the background servers, we are using Debian Squeeze 6.0.2 as the operating system for the ferry servers in our setup.
Therefore, these installation instructions refer to Debian, although most of them might not be distribution specific.

Partitioning:


Create one partition on each hard disk.
Partition 1: System and storage:
  • Partition type: Primary
  • Use the whole space minus 2 GB and use it as "physical volume for RAID"
Now, the listing of both drives should display one RAID partition each. In our setup, this looks as follows:
SCSI1 (0,0,0) (sda) - 1.0 TB ATA SAMSUNG HD103SJ
    #1 primary    1000.2 GB     K raid
SCSI2 (0,0,0) (sdb) - 1.0 TB ATA SAMSUNG HD103SJ
    #1 primary    1000.2 GB     K raid
If a RAID 1 rebuild becomes necessary, the replacement drive simply has to be partitioned as described above and assigned to the RAID volume.
After creating those partitions, a new menu option is available:
"Configure Software RAID"

Create a new MD device (RAID1) and assign the RAID partition of each drive to it. No spare devices.
Now select that md0 software RAID device and apply the following partition settings:
  • Use as: ext3
  • Mount point: /
  • Mount options: noatime (for performance reasons)
  • Label: none
  • Reserved blocks: 5%
  • Typical usage: standard
NOTE: Why not create a swap partition?
GNU/Linux swap mechanisms have a history of not playing well with the scheduling algorithms of software RAIDs. Although this has been fixed and has been production-stable for years, rebuilding a broken mirrored RAID1 would still require disabling swap, mirroring the drive, then re-enabling swap. This adds administrative effort and know-how requirements in case of a failure.
The kernel has a built-in feature for handling multiple swap partitions by treating them like a RAID0 stripe set. Placing swap on plain partitions outside the RAID would therefore avoid the swap/RAID conflict, but if one drive fails, it could tear down the whole system if swap space on the failing drive was in use.
Since swap is only needed when RAM runs short, it is much better to simply equip the machines with sufficient memory than to let swap handling add further problems in case of a drive failure.
Example:
In our current production setup, we have 4 GB of RAM, of which only 1 GB is currently marked as used, so there should be no need for a swap partition.

Software selection


When the Debian installer offers you a predefined set of software collections to choose from, select the following:
  • Web server
  • DNS server
  • SSH server
  • Standard system utilities

3.1.2.2 Configuration

Repository sources

First of all: Remove the installation CD from the APT sources.

Removing it from the list forces apt to use the default online servers. If you keep the disk on the list, you might one day want to install a package without having the disk in the server's drive, and you won't be able to install that package until you have removed the disk from the list - so it's better to do that now ;)
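
A minimal sketch of this step, assuming the installer left a line starting with "deb cdrom:" in "/etc/apt/sources.list" (the usual place on Debian). The function name "disable_cdrom_source" is made up for this example. It comments the entry out instead of deleting it and keeps a backup copy of the file:

```shell
# Comment out every active "deb cdrom:" entry in an APT sources file.
# A backup of the original file is kept next to it with the suffix ".bak".
disable_cdrom_source() {
    sources_file="$1"
    cp "$sources_file" "$sources_file.bak" || return 1
    sed 's/^[[:space:]]*deb cdrom:/# &/' "$sources_file.bak" > "$sources_file"
}

# Usage (as root), followed by a refresh of the package list:
#   disable_cdrom_source /etc/apt/sources.list
#   apt-get update
```

Editing the file by hand in a text editor works just as well; the sketch only shows what the result should look like.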

Add "Debian Multimedia" repository:

"www.debian-multimedia.org" describes itself as follows:
"Unofficial repository of media utilities that cannot be included in main, contrib, or nonfree due to patents and other problems"
Go to their list of repository mirrors, choose a server suitable for you and copy/paste the deb/deb-src lines into the file "/etc/apt/sources.list.d/multimedia.list". Afterwards it should look somewhat like this:
    deb http://debian.mur.at/debian-multimedia/ stable main
    deb-src http://debian.mur.at/debian-multimedia/ stable main
Save it and then update your APT package list.

Enable "sudo"

Install the "sudo" package (as root):
$ apt-get install sudo
Add your user to the "sudo" group:
$ usermod -a -G sudo <username>
NOTE: Before you can use "sudo", you have to log out and log in again in order to update your user's group memberships.

Install some basic packages


Mandatory:
  • screen
    (Used for running multiple processing tasks in the background)
  • rsync
    (Used for transferring files to final archive location)
Optional:
  • vim
  • htop
  • phpsysinfo

Add the "dva-profession" group


Since the ferry must have access to sensitive data on the final archive storage, it is wise to configure the access rights in a proper manner. Therefore we'll create the system group "dva-profession":
$ sudo groupadd dva-profession

Members of this group will have write access to the mounted drives, such as video-clients and storage.

3.1.2.3 Network

Network interfaces

In a typical DVA-Profession setup, the ferry-servers directly access data on the video ingest workstations, e.g. the lossless high-resolution videos. Additionally, all produced data that will be archived must be transferred to the final storage. Since handling video data means transferring a huge amount of data across a network, it is desirable to have separate network interfaces for these transfers, in order to avoid congestion and delays for other services (e.g. regular office jobs).
A ferry-server must have at least 2 network interfaces:
  • video-network
    • All traffic between ingest clients and ferry-servers will use this network.
    • subnet: 192.168.200.0/24

  • storage-network
    • This network is solely intended for transferring data to and from the final archive storage.
    • subnet: 192.168.201.0/24

NOTE: Of course, you can use any other network addresses. The ones shown here are just an example.
Here is an example of "/etc/network/interfaces", with the network devices "eth0" for video and "eth1" for storage connections. Replace them with the actual device names of your setup.

    # This file describes the network interfaces available on your system and how
    # to activate them. For more information, see interfaces(5).  
    # The loopback network interface 
    auto lo 
    iface lo inet loopback

    # Video network:
    auto eth0
    iface eth0 inet static
        address ferry-1.dva-profession.local
        netmask 255.255.255.0

    # Storage connection:
    auto eth1
    iface eth1 inet static
        address ferry-1.storage.dva-profession.local
        netmask 255.255.255.0

Enable IP forwarding

Although not strictly necessary, it might be a good idea to enable IPv4 forwarding on the ferry-servers. Open the file "/etc/sysctl.conf", search for the text "ipv4.ip_forward" and uncomment the line to make it look somewhat like this:

    # Uncomment the next line to enable packet forwarding for IPv4 
    net.ipv4.ip_forward=1 
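
The same edit can be scripted. The following is only a sketch, assuming the stock Debian file contains the setting as a commented-out line ("#net.ipv4.ip_forward=1"); the function name "enable_ip_forward" is hypothetical. A backup of the file is kept with the suffix ".bak":

```shell
# Activate "net.ipv4.ip_forward=1" in a sysctl configuration file.
# An existing (commented or uncommented) setting is rewritten in place;
# if the setting is missing altogether, it is appended to the file.
enable_ip_forward() {
    conf="$1"
    pattern='^[[:space:]]*#*[[:space:]]*net\.ipv4\.ip_forward[[:space:]]*=.*'
    cp "$conf" "$conf.bak" || return 1
    if grep -q "$pattern" "$conf.bak"; then
        sed "s/$pattern/net.ipv4.ip_forward=1/" "$conf.bak" > "$conf"
    else
        echo 'net.ipv4.ip_forward=1' >> "$conf"
    fi
}

# Usage (as root); "sysctl -p" reloads /etc/sysctl.conf without a reboot:
#   enable_ip_forward /etc/sysctl.conf
#   sysctl -p
#   cat /proc/sys/net/ipv4/ip_forward
```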

3.1.3 DVA-Profession

3.1.3.1 Installation

Download and unpack the current release version of DVA-Profession into "/opt/dva-profession/":
Now there should be at least 2 folders:
  • bin
  • workflow
Make sure that the "bin" folder belongs to "root:root", and the "workflow" to the user the Apache webserver is running as (usually "www-data"):
$ chown -R root:root /opt/dva-profession/bin/
$ chown -R www-data:www-data /opt/dva-profession/workflow/

Now you have to make the web-interface, located in the "bin" folder, accessible through the Apache webserver.
By default, Apache's document root is located in "/var/www", which means we can create a link called "dva-profession" there, pointing to the DVA-Profession "web" folder:
$ ln -s /opt/dva-profession/bin/web/ /var/www/dva-profession

If everything was successful, you can now access the DVA-Profession web-interface on a client under the following name:
ferry-x.dva-profession.local/dva-profession
(Where "ferry-x" is the name of the ferry you are currently setting up)

Starting the background processing (task_trigger):


Due to access right issues, the background processing is done by the same user the web-interface is running as. This is currently the user Apache is running as, i.e. "www-data". You must give that user a password in order to be able to log in and start a GNU/screen session for that user:
$ sudo passwd www-data

Now, open a new terminal and login as the "www-data" user.
In the "etc" subfolder of "bin", you will find a GNU/screen configuration file:
/opt/dva-profession/bin/etc/screenrc-processing

This file is configured for a processing server called "ferry-1". If the server you are currently setting up has a different name, you must edit this file and change the name accordingly (e.g. "ferry-2" for a second server).
Start the screen session using that config file:
$ screen -c /opt/dva-profession/bin/etc/screenrc-processing

This will spawn a new GNU/screen session, with multiple task_trigger windows and automatically start the background processing scripts.

3.1.3.2 File structure

The DVA-Profession system is designed to work without any database at all. The reason for this is to have a system which is literally "what you see is what you get": all state-keeping and data storage is done directly in a file/folder structure, so there cannot be any difference between "what you see" in the user interface and "what you get" regarding your actual state and data.
Reasons for this design decision were:
The file/folder design resembles a classical "mechanical" approach, with all its pros and cons. Similar to a mechanical audio player, like a tape machine:
  • If you look at it, you can see if there is a carrier loaded
  • By looking at the size of the tape, you can estimate its duration
  • You can immediately see which state it is in (play, rewind, stop, ...)
The concept of files and folders is virtually timeless, meaning that it can easily be migrated from one generation of computer systems to the next. This is a very important issue when dealing with long-term archiving.
  • As administrator, you can use any tool you like to edit the configuration. Since everything is stored in text files (mostly XML), it is almost self-explanatory and there is no abstraction layer between you and the configuration.
  • With a database, a backup always requires backing up the database *and* the actual files. With DVA-Profession's file/folder based design, all you have to do is backup a single folder structure including its subfolders. That backup can instantly be restored onto any other system simply by unpacking there. No re-import of any database dump necessary!
  • You can watch and control the actual workflow data from any computer, any operating system that can access and edit the workflow file/folder structure - by using a basic file explorer tool and a plain text editor of your choice.
In practice, this means that what you see in the DVA-Profession web-interface is a literal representation of your actual workflow file structure.
For example:
  • Task types listed in the task drop down menu represent the task subfolders in the workflow folder.
  • The order of processing (=task order) is simply controlled by changing the alphabetical sort order of the task folders.
  • The listing of signatures and their state (To do, In progress, Error, ...) simply reflects which folder they are currently in.
  • Task metadata and tool configuration is simply done by copying and editing XML files in a subfolder of each task.
  • ...and many more!

Folder categories:


The workflow folder structure can be split into a few categories:
Task list
The tasks to be processed during the workflow, and their order, are defined by simply creating a folder for each task, with the following naming syntax:
"<xx>-<task_type_name>"

Where:
  • "<xx>" is a zero-padded numeric index number used to define which task comes in which order
  • "<task_type_name>" is the technical name of a DVA-Profession task. This name is hardcoded for each task type and independent of the human readable label of a task - and therefore also translation-independent.
As you can see, you can simply change the processing order by changing the index number within the folder's name. Adding / removing tasks from the processing workflow is also done by simply adding or removing their task-folders.
For example, the default task list structure of the setup used at the Austrian Mediathek looks as follows:
  • 01-request_video_capture
  • 02-video_capture
  • 03-capture_export
  • 04-generate_thumbnails
  • 05-generate_preview
  • 06-scene_cut_detection
  • 07-embed_avi_metadata
  • 08-generate_checksums
  • 09-check_digitization
  • 10-finalize_metadata
  • 11-move_to_archive
  • 12-clean_up
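
Because the task list is nothing more than folder names, it can be inspected and reordered with ordinary shell tools. The helpers below are a sketch, not part of DVA-Profession; the function names "list_tasks" and "renumber_task" are made up for this example, and the sketch assumes two-digit index numbers as in the list above:

```shell
# Print the tasks of a workflow folder in processing order, one per line,
# as "<xx> <task_type_name>". Only folders named "<xx>-<name>" are listed.
list_tasks() {
    for dir in "$1"/[0-9][0-9]-*; do
        [ -d "$dir" ] || continue            # skip files and unmatched globs
        folder=$(basename "$dir")
        echo "${folder%%-*} ${folder#*-}"
    done
}

# Move a task to another position by renaming its folder:
#   renumber_task <workflow_dir> <old_xx> <new_xx>
renumber_task() {
    for dir in "$1/$2"-*; do
        [ -d "$dir" ] || continue
        folder=$(basename "$dir")
        mv "$dir" "$1/$3-${folder#*-}"
    done
}

# Usage:
#   list_tasks /opt/dva-profession/workflow
#   renumber_task /opt/dva-profession/workflow 08 13
```

The shell expands the glob in alphabetical order, which is the same ordering principle the zero-padded index relies on.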

Task status:


Each task has a fixed set of subfolders, used to handle the different states of a signature within, and the data generated during this task. This folder is called "task-folder" and has the following subfolders and files:
  • config
    • tools
      • <workstation_1>
      • <workstation_2>
      • <workstation_...>
      • <task-template.xml>
  • error
  • final
    • <signature_1>
    • <signature_2>
    • <signature_...>
  • in_progress
  • log
Capture request files stored directly in the root of a task-folder represent signatures that are in state "To do" of that task.
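
Since a signature's state is simply the folder it is located in, a rough status overview can be obtained by counting entries. This is only a sketch: the function name "task_status" is hypothetical, and it assumes that every entry directly inside "in_progress", "error" and "final" stands for one signature:

```shell
# Print how many entries a task-folder holds in each state.
# "To do" items are the files stored directly in the task-folder's root;
# the other states are the entries of the corresponding subfolders.
task_status() {
    task_dir="$1"
    todo=$(find "$task_dir" -maxdepth 1 -type f | wc -l)
    echo "to_do $((todo))"
    for state in in_progress error final; do
        n=0
        if [ -d "$task_dir/$state" ]; then
            n=$(find "$task_dir/$state" -mindepth 1 -maxdepth 1 | wc -l)
        fi
        echo "$state $((n))"
    done
}

# Usage:
#   task_status /opt/dva-profession/workflow/02-video_capture
```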
Each task folder only contains files generated by this task type. This is necessary for the DVA-Profession workflow mechanism to know which files belong to which task type, in order to be able to remove them when resetting a task. For example, the metadata gathered in the task "request_video_capture" will be stored in the folder "xx-request_video_capture/final/<signature_xx>/metadata", whereas metadata gathered during other tasks of the workflow are stored in their "in_progress" or "final" subfolders respectively.
A flattened, task-independent structure is virtually generated in the so called "final-final" folder (see description below).
Configuration:
As seen in the "task status" paragraph, each task has a "config" folder, which contains the actual metadata describing the tasks and tools used in the workflow.
In order to make configuration easier, and keep the metadata of certain tools used at multiple tasks in the workflow consistent, there is a so called "global" configuration folder, in the root of the workflow folder, which contains 2 subfolders:
  • tasks
  • tools
These subfolders of the global config folder contain a list of task-/tool-template files which are available for selection in the corresponding drop-down menus of the web-based task/tool configuration.
Task-template:
Each task-config folder requires exactly one so called "task-template" XML file, which contains the human readable description that is logged in the final metadata stored in the archive.
Since DVA-Profession is file-based, the filename of these task-template XML files is important and must follow this syntax:
task-<task_id>.xml

IMPORTANT: The hyphen (-) character is the delimiter between the individual parts of the template filename and must therefore not be used in any field value (task_id, name, ...)
For example:
"task-request_video_capture.xml" for the task of type "request_video_capture"
Its XML content used for describing task metadata conforms to the entity "taskType" described in the "PMD" specification, defined by the Library of Congress.

Example:

    <task ID="request_video_capture">
        <task_label>Capture request</task_label>
        <task_description>Request digitization of video material</task_description>
    </task>

Tool-templates:
Tool templates contain metadata about the equipment, software, etc. being used to perform a certain task on a certain workstation.
Therefore, each workstation which performs any action within a task in the DVA-Profession system requires its tool-chain to be described. This is done by putting a so called "tool-template" XML file for each tool in a folder per workstation (see the "task-status" folder description for details about the structure).

Since DVA-Profession is file-based, the filename of these tool-template XML files is important and must follow this syntax:
tool-<tool_id>-<name>.xml

IMPORTANT: The hyphen (-) character is the delimiter between the individual parts of the template filename and must therefore not be used in any field value (tool_id, name, ...)
For example: "tool-workstation-videocube1.xml" for the ingest client "video-cube1", which is a tool of type "workstation".
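
To illustrate why hyphens are not allowed inside the field values, here is a sketch of how such a filename can be split at the hyphen. It is not DVA-Profession's own parser; the function name "parse_tool_template" is made up for this example:

```shell
# Split a tool-template filename "tool-<tool_id>-<name>.xml" into its parts
# and print them as "<tool_id> <name>". Returns non-zero for filenames that
# do not match the syntax, including those with a hyphen in a field value.
parse_tool_template() {
    file=$(basename "$1")
    case "$file" in
        tool-*-*.xml) ;;
        *) return 1 ;;
    esac
    rest=${file#tool-}            # "<tool_id>-<name>.xml"
    rest=${rest%.xml}             # "<tool_id>-<name>"
    case "$rest" in
        *-*-*) return 1 ;;        # more than one hyphen left: ambiguous
    esac
    echo "${rest%%-*} ${rest#*-}"
}

# Usage:
#   parse_tool_template tool-workstation-videocube1.xml
#   parse_tool_template tool-workstation-video-cube1.xml || echo "invalid name"
```

With a hyphen in the name, there would be no way to tell where the tool_id ends and the name begins, which is why the workstation "video-cube1" is written as "videocube1" in the filename.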
Its XML content used for describing tool metadata conforms to the entity "toolType" described in the "PMD" specification, defined by the Library of Congress.
Currently, there is no way of adding/deleting tools per workstation for a certain task.
In order to use the web-interface for configuring the tool-chain, one has to initially copy one tool-template per tool-type used into the workstation folder. The web-interface then allows you to select the actual tool out of the global config pool.
Example:
If you want to have tools for the task "request_video_capture" on workstation "video-cube1" with the following tool-types:
  • workstation
  • video_replayer
  • ad_converter
Just copy one file of each tool-type into the folder "01-request_video_capture/config/tools/video-cube1". For example:
  • tool-workstation-video_cube1.xml
  • tool-video_replayer-DV_01.xml
  • tool-ad_converter-None.xml
Now, the tool-configuration in the web interface can be used to select different tools for that task - but only out of these 3 tool-types. See the "user manual" for details about how to use the tool-configuration web interface.
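
The copy step described above could be scripted as follows. This is a sketch under assumptions: the function name "seed_tools" is hypothetical, and the global tool pool is assumed to be the "config/tools" folder in the root of the workflow folder:

```shell
# Copy one template per tool-type from the global tool pool into the
# workstation folder of a task, so the web interface can offer that type.
#   seed_tools <workflow_dir> <task_folder> <workstation> <template>...
seed_tools() {
    workflow_dir="$1"; task_folder="$2"; workstation="$3"
    shift 3
    target="$workflow_dir/$task_folder/config/tools/$workstation"
    mkdir -p "$target" || return 1
    for template in "$@"; do
        cp "$workflow_dir/config/tools/$template" "$target/" || return 1
    done
}

# Usage (run as the user owning the workflow folder, e.g. "www-data"):
#   seed_tools /opt/dva-profession/workflow 01-request_video_capture video-cube1 \
#       tool-workstation-video_cube1.xml tool-video_replayer-DV_01.xml \
#       tool-ad_converter-None.xml
```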
Logging:
Basically, there are 2 different kinds of logfiles written by the DVA-Profession system:
  • Task-related
    • Every action that is related to a certain task is logged in a logfile in a subfolder of exactly the task-type concerned.
  • Others
    • Some things done by the DVA-Profession system (especially actions performed by the background processing, or generic things in the web-interface) are not related to any task and are therefore logged in files stored in the global log folder, in the root of the workflow folder.
The "final-final" folder:
As mentioned above, files produced during a certain task are stored only in subfolders of exactly that task. In order for the DVA-Profession system to access files regardless of which task they have been created in, it generates a virtual file structure - the so called "final-final" folder.
In this folder you will find a subfolder for each archive signature, and below it categories like "metadata", "log", etc. - just like you would in the task folders. The big difference is that all files that belong to that signature are accessible from within a task-independent, flat structure.
For example: You will find all logfiles in the subfolder "log" and all metadata in the subfolder "metadata", although those entries actually point to the real files in the corresponding task folders. Currently this virtualization is implemented using "symbolic links" (http://en.wikipedia.org/wiki/Symbolic_link), as they are very common, well-known and production-stable among filesystems used in GNU/Linux environments. That way it was not necessary to reinvent the wheel in order to provide this virtualized file structure.
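
The principle can be sketched with a few lines of shell. This is not the code DVA-Profession uses to build the folder; the function name "link_signature" is made up, and the sketch assumes that the "final-final" folder sits in the root of the workflow folder and that only the "final" subfolders of the tasks are linked:

```shell
# Rebuild a flat, task-independent view of one signature by linking every
# file found under "<task>/final/<signature>/<category>/" into
# "<workflow_dir>/final-final/<signature>/<category>/".
# Pass an absolute <workflow_dir>, so the links stay valid from anywhere.
link_signature() {
    workflow_dir="$1"; signature="$2"
    for file in "$workflow_dir"/[0-9][0-9]-*/final/"$signature"/*/*; do
        [ -f "$file" ] || continue           # skip unmatched globs
        category=$(basename "$(dirname "$file")")
        target="$workflow_dir/final-final/$signature/$category"
        mkdir -p "$target" || return 1
        ln -sf "$file" "$target/$(basename "$file")"
    done
}

# Usage:
#   link_signature /opt/dva-profession/workflow <signature>
#   ls -l /opt/dva-profession/workflow/final-final/<signature>/metadata
```

In the "ls -l" output, each entry shows the task folder its data really lives in, while programs reading the flat structure do not need to know about tasks at all.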
Finished items:
The folder "finished" contains the workflow process files of items which have already been successfully written to their final archive location. These files can safely be deleted, but are currently kept here, as it might be useful to have the process metadata of tasks performed after "finalize_metadata" in case of analyzing issues with the system.
In future versions of DVA-Profession it is planned to include an automatic garbage collection for this in the task_trigger script and delete files which are older than a certain age (e.g. 6 months). Currently you have to manually delete this folder every now and then.
You could also use a cronjob running a command like the following, which deletes all files older than 180 days (~6 months):
$ find /opt/dva-profession/data/finished -type f -mtime +180 -exec rm {} \;
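
A slightly extended sketch of such a clean-up, which also removes the subfolders that are left empty afterwards. The function name "purge_finished" and the schedule in the crontab example are assumptions; "-mindepth", "-empty" and "-delete" are extensions of GNU find, as shipped with Debian:

```shell
# Delete files older than <max_age_days> days from the "finished" folder,
# then remove the subfolders that have become empty.
#   purge_finished <finished_dir> <max_age_days>
purge_finished() {
    finished_dir="$1"
    max_age_days="$2"
    find "$finished_dir" -type f -mtime +"$max_age_days" -exec rm -f {} + || return 1
    find "$finished_dir" -mindepth 1 -type d -empty -delete
}

# Usage:
#   purge_finished /opt/dva-profession/data/finished 180
#
# Example crontab entry for the one-line variant (every Sunday at 03:00):
#   0 3 * * 0 find /opt/dva-profession/data/finished -type f -mtime +180 -exec rm -f {} +
```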

3.1.3.3 Configuration

3.1.3.3.1 Config file


During the initial setup, you will mostly need to edit a configuration file (config.inc.php), which is actually a list of variable declarations in PHP code. But don't be afraid... :) It's just about changing text or numeric values - mostly path names. No programming skills required whatsoever.
The default path of the config file is:
/opt/dva-profession/bin/config.inc.php

Here is a list of the most important settings that you might want to verify/modify before using your DVA-Profession setup:
  • Paths
    • ROOT_SOURCE_DIR
    • ROOT_DATA_DIR
    • ROOT_ARCHIVE_DIR
    • WORKFLOW_BASE_DIR
  • Miscellaneous options
    • ORIGINATOR_NAME
    • XML_ENCODING
    • XML_VERSION
    • XSLT_METS
    • CHECKSUM_TYPE
    • THROTTLE_ARCHIVE_COPY
  • Diskspace limits
    • DISKSPACE_LIMIT_WORKFLOW
    • DISKSPACE_LIMITS
  • Transcoding
      DVD-compliant MPEG encoding:
    • LAVC_DVD_OPTIONS_HIGH
    • LAVC_DVD_OPTIONS_NORMAL
    • LAVC_DVD_OPTIONS_LOW
    • LAVC_DVD_OPTIONS
  • Command masks:
    • These strings define which commandline calls are executed in order to perform a certain action (transcode using ffmpeg, copy to archive using rsync, ...). They should not require any changes, unless you would like to use different tools or different parameters. In that case, feel free to edit these strings to fit your needs - but be aware of possible consequences.

3.1.3.3.2 Adding a new ingest workstation


Since this step only has to be performed once per workstation in the DVA-Profession system, creating a comfortable user interface for it has had little priority so far (it will probably be added in a future version of DVA-Profession).

1) Create and share the DVA-Profession folder on the client
  • If it does not already exist, create a folder called "DVA-Profession" on the client workstation
  • Share that folder as "DVA-Profession" with the following access rights:
    • Everyone: READ
    • <client_computername>\User: FULL CONTROL / CHANGE / READ
2) Create a mountpoint on the server
  • Create a mountpoint as follows:
    /mnt/video_clients/<client_computername>
  • Add an entry in "/etc/fstab" for mounting the workstation's DVA-Profession share:
    //<client_computername>/DVA-Profession /mnt/video_clients/<client_computername>  smbfs  noauto,user,uid=www-data,gid=dva-profession,file_mode=0664,dir_mode=0775,nounix,noserverino,credentials=/etc/samba/credentials/user_ingest
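
The fstab entry refers to a credentials file. A minimal sketch for creating it, assuming the usual "username=" / "password=" key format of such files; the function name "write_credentials" is hypothetical, and the account must match the one that was granted access to the share on the client:

```shell
# Write a credentials file for the "credentials=" mount option,
# readable and writable by its owner only.
#   write_credentials <file> <username> <password>
write_credentials() {
    (
        umask 077                          # newly created files get mode 0600
        mkdir -p "$(dirname "$1")" &&
        printf 'username=%s\npassword=%s\n' "$2" "$3" > "$1" &&
        chmod 600 "$1"                     # also covers a pre-existing file
    )
}

# Usage (as root), followed by creating the mountpoint and mounting the share:
#   write_credentials /etc/samba/credentials/user_ingest <username> <password>
#   mkdir -p /mnt/video_clients/<client_computername>
#   mount /mnt/video_clients/<client_computername>
```

Keeping the password in a separate file with restricted permissions avoids having it readable by every user in "/etc/fstab".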
OPTIONAL: As the DVA-Profession workflow user on the ferry (e.g. "www-data"), create a file on that share and delete it, in order to quickly verify the correctness of the access rights.
Here's an example (but you can use any method you prefer):
$ sudo -u <dva_user> sh -c 'TESTFILE="/mnt/video_clients/<client_computername>/DELME"; touch "$TESTFILE" && rm "$TESTFILE"'

3) Add the new workstation to "workstations.conf.php"

Add the client's workstation name to the DVA-Profession workstations config file: "/opt/dva-profession/workflow/config/workstations.conf.php".
Edit "workstations.conf.php" and search for the workstations array definition, which looks somewhat like this:
    // Names of available workstations:
    $WORKSTATIONS = array(
        // capture/ingest
        'ingest' => array(
            'video-cube1',
            'video-cube2',
            'video-cube3',
        ),

        // background processing
        'ferry' => array(
            'ferry-1',
            'ferry-2',
        ),

        // check digitization
        'check' => array(
        ),
    ); 

Add the new computername between the parentheses of "'ingest' => array()".
The 'ingest' part of the configuration should now look like this:
    // Names of available workstations:
    $WORKSTATIONS = array(
        // capture/ingest
        'ingest' => array(
            'video-cube1',
            'video-cube2',
            'video-cube3',
            '<client_computername>',
        ),

4) Create a tool-xml for the new workstation

The new workstation needs to be added as a "tool" to the tool-xml-files:
/opt/dva-profession/bin/etc/workflow/tools/tool-workstation-<client_computername>.xml
Easiest way to do this would be to copy the xml-file of an existing workstation and adapt the file to correctly describe the new one.

5) Assign task tools to the new workstation

For each task that will be operated on the new workstation, you have to create/copy a folder for the respective workstation. If you already have a workstation set up, just copy and rename that existing folder.
The path for this folder is: /opt/dva-profession/workflow/<task_folder>/config/tools/<client_computername>
The actual tool configuration for each task on the workstation takes place in the user interface: Admin - Tool configuration.
Depending on the type of workstation you're adding, you only need to configure tools for the tasks you will be handling from that workstation.
For example, an ingest workstation usually handles the following tasks:
  • request_video_capture
  • video_capture
  • check_digitization
...whereas ferry-servers handle the automated tasks like:
  • generate_thumbnails
  • generate_preview
  • scene_cut_detection
  • ...
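
Copying the tool folder of an existing workstation for every task it takes part in can be sketched as follows. The function name "clone_workstation_tools" is made up for this example; folders that already exist for the new workstation are left untouched:

```shell
# For every task folder that has a tool folder for an existing workstation,
# create a copy of that folder for the new workstation.
#   clone_workstation_tools <workflow_dir> <existing_workstation> <new_workstation>
clone_workstation_tools() {
    for src in "$1"/[0-9][0-9]-*/config/tools/"$2"; do
        [ -d "$src" ] || continue            # skip unmatched globs
        dest="$(dirname "$src")/$3"
        if [ ! -e "$dest" ]; then
            cp -R "$src" "$dest" || return 1
        fi
    done
}

# Usage (run as the user owning the workflow folder, e.g. "www-data"):
#   clone_workstation_tools /opt/dva-profession/workflow video-cube1 <client_computername>
```

The copied folders still contain the tool selection of the existing workstation, so the tools have to be checked in the web-interface afterwards.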
Now open the DVA-Profession web client interface and do the following things:
  • Open the "Admin" page and select the newly added <client_computername> as your workstation name.
  • Open the "Tool configuration" page and verify the tools for each task done by that client workstation.
You can now use the new workstation as ingest client.
The workflow folders will automatically be created as soon as you accept a "capture video" task on this workstation.
NOTE: Adding workstations of other types than 'ingest' is performed in the same way - except that you add the name in the appropriate array() section, matching the desired type (ferry, check, ...) - and configure it for different task types.

3.1.3.3.3 Introducing a new format to the system


3.1.3.3.3.1 Adding a new replayer

Create a tool-xml-file for the respective replayer

In the following folder you will find the tool xml-files used in your setup:
/opt/dva-profession/workflow/config/tools
Copy one of the existing XML files and adjust it to describe your replayer.

3.1.3.3.3.2 Physical carrier information

The physical carrier information, like type (VHS, DV, DigiBeta, ...), brand, name, etc. available as choice in the drop down fields of the carrier input is currently not editable except directly in the source code. We're very sorry for this inconvenience - this is planned as a feature for a future release (see Mantis issue #131).
If you need to edit/add physical carrier information, just open the file "video_carrier.conf.php" (/opt/dva-profession/workflow/config/) in your favorite text editor and edit the content of the "get_carrier_models()" function to fit your needs.
The properties for describing the physical carrier conform to the "physical_dataType" entity described in the "Video Metadata Description" (VMD) specification, defined by the Library of Congress. The following fields are currently supported:
  • phys_format:
    Name of the carrier.
    Example: "E-120" for a VHS tape, or "DV60" for DV
  • signal_format:
    PAL / NTSC / SECAM
  • stock_brand:
    Name of the manufacturer of the carrier
  • condition:
    A free text field for taking notes about the physical condition of the carrier
  • videotape_type:
    The actual video format like VHS, DigiBeta, DV, etc.
  • videotape_extras:
    This field can have several sub-nodes, containing information for describing additional properties of the video signal on the tape.
Examples:
  • VHS:
    • Longplay (LP)
    • Standard Play (SP)
  • DV: DV-stream format variants appearing on the carrier:
    • DV
    • DVCAM
    • HDV
    • DVC-PRO25
    • ...
NOTE: There has been no field for this kind of information in the VMD definition, so we decided to rather add that field than discard this information (No "embrace-extend-extinguish" intended!)
The structure of the $carrier_models variable is as follows:
    $carrier_models = array(
        videotape_type => array(
            phys_format => array(
                stock_brand => array(name1, name2, name3, ...)
            )
        )
    );
Contact:
Österreichische Mediathek
Mag. Hermann Lewetz
hermann.lewetz[at]mediathek.ac[dot]at
Österreichische Mediathek Digitalisierungsservice: