<!--
XML documentation system
Original author : Arjen Baart - arjen@andromeda.nl
- Version : $Revision: 1.1 $
+ Version : $Revision: 1.7 $
This document is prepared for XMLDoc. Transform to HTML,
LaTeX, Postscript or plain text with XMLDoc utilities and
<book>
<titlepage>
<title>Gnucomo - Computer Monitoring</title>
+ <subtitle>Design description</subtitle>
+ <para><picture src='logo.png' eps='logo' scale='0.7'/></para>
<author>Arjen Baart <code><arjen@andromeda.nl></code></author>
<author>Brenno de Winter<code><brenno@dewinter.com></code></author>
- <date>July 12, 2002</date>
+ <author>Peter Roozemaal<code><mathfox@xs4all.nl></code></author>
+ <date>November 26, 2003</date>
<docinfo>
- <infoitem label="Version">0.1</infoitem>
+ <infoitem label="Version">0.6</infoitem>
<infoitem label="Organization">Andromeda Technology & Automation</infoitem>
<infoitem label="Organization">De Winter Information Solutions</infoitem>
</docinfo>
<heading>Architecture</heading>
<para>
-The architecture of <strong>gnucomo</strong> is shown in the
-dataflow diagram below:
+The systems that are being monitored in <strong>gnucomo</strong> are called
+<emph>Objects</emph>.
+These may be computers, routers, switches or other active components that
+are capable of sending reports about their internal workings
+to the <strong>gnucomo</strong> server.
+An <emph>Object</emph> plays a central role in the <strong>gnucomo</strong> system.
+Two separate aspects of an <emph>Object</emph> are monitored: the static state and the
+dynamic behaviour.
+The static state of an <emph>Object</emph> is represented by a set of parameters
+and the values of these parameters' attributes.
+The dynamic behaviour of an <emph>Object</emph> is characterized by events that
+happen on an <emph>Object</emph>.
+One obvious way to collect a report of these events is to scan the log files of the
+system and its processes.
+<emph>Objects</emph> run services and these services are configured with a set
+of parameters. Also, services produce entries in log files.
</para>
<para>
- <picture src='dataflow.png' eps='dataflow.eps'/>
+The architecture of <strong>gnucomo</strong> is shown in the
+data flow diagram below:
+</para>
+
+<para>
+ <picture src='dataflow.png' eps='dataflow' scale='0.7'/>
+</para>
+
+<para>
+At the left of the diagram, information is acquired from the monitored system.
+Several agents can be used to obtain information from this system, in
+active or passive ways.
+A passive agent uses information which is available on the system anyway,
+such as log files or other lists.
+An active agent requests explicit data from the monitored system.
+One example of a passive agent is <emph>logrunner</emph>, a program which
+monitors system log files and sends regular updates to the <strong>gnucomo</strong>
+server.
+The agents on the monitored system send the data to some kind of transportation channel.
+This can be any form of transport, such as Email, SOAP, plain file copying or
+some special network connection.
+If desired, the transportation may provide security.
+Once arrived at the server, the information from monitored systems is captured
+by the <emph>gcm_input</emph> process.
+This process can obtain the data through many forms of transport and from
+a number of input formats.
+<emph>Gcm_input</emph> will try to recognize as much as possible from an
+input message and store the obtained information into the <emph>Raw Storage</emph>
+database.
+The <emph>Raw Storage</emph> data is processed further and analyzed by
+the <emph>gcm_daemon</emph>, which scans the data, gathers statistics and
+stores its results into the <emph>Derived Storage</emph> database where
+it is available for human review and further analysis.
</para>
<para>
Architectural items to consider:
<itemize>
-<item>Active and passive data aquisition</item>
+<item>Active and passive data acquisition</item>
<item>Monitoring static and dynamic system parameters</item>
<item>Upper and lower limits for system parameters</item>
</itemize>
<heading>Database design</heading>
<para>
-Log entries are stored in a database with at least the following fields:
+The design of the database is described extensively in
+<reference href="manifest.html">the Manifest</reference>.
+Assuming development is done on the same system on which the real (production)
+gnucomo database is maintained, there is a need for a separate database
+on which to perform development and integration tests.
+Quite often, the test database will need to be destroyed and recreated.
+To enable testing of <strong>gnucomo</strong> applications, all programs
+need to access either the test database or the production database.
+To accommodate this, each application needs an option to override the
+default name of the <ref to='configuration'>configuration</ref> file (gnucomo.conf).
+</para>
+
+<para>
+To create a convenient programming interface for object oriented languages,
+a class <emph>gnucomo_database</emph> provides an abstract layer which
+hides the details of the database implementation.
+An object of this class maintains the connection to the database server
+and provides convenience functions for accessing information in the
+database.
+A constructor of the <emph>gnucomo_database</emph> is passed a reference to
+the <emph>gnucomo_configuration</emph> object in order to access the database.
+This accommodates both production and test databases.
+The constructor will immediately try to connect to the database and check its
+validity.
+The destructor will of course close the database connection.
+</para>
+
+<para>
+Other methods provide access to the database in a low-level manner.
+There will be lots more in the future, but here are a few to begin with:
<itemize>
-<item>hostname</item>
-<item>timestamp</item>
-<item>service (kernel, daemon, ...)</item>
-<item>Log message</item>
+<item>Send a SQL query to the database.</item>
+<item>Read a tuple from a result set.</item>
+<item>Obtain the userid for the current database session.</item>
</itemize>
</para>
+
+<para>
+The information stored in the database as tuples is represented by classes in
+other programming languages such as C++ or PHP.
+Each class models a particular type of tuple (an <emph>entity</emph>)
+in the database.
+Such classes maintain the relation with the database on one end,
+while providing methods that are specific to the entity on the other end.
+All database communication and SQL queries are hidden inside the
+entity's class.
+This includes, for example, handling database result sets and access control.
+</para>
+<para>
+Properties and operations that are common to all classes that represent
+entities in the database are captured in a common base class.
+The base class, named <emph>database_entity</emph>, provides default
+implementations for loading and storing tuples, construction and destruction
+and iteration.
+Most derived classes will override these functions.
+Two examples of classes that represent entities in the database
+are <emph>object</emph> and <emph>service</emph>.
+Both are derived from a <emph>database_entity</emph>, as shown below:
+<para>
+ <picture src='class-database_entity.png' eps='class-database_entity'/>
+</para>
+</para>
+
+<para>
+Constructors of classes derived from <emph>database_entity</emph> come in
+two varieties: with or without database interaction.
+Constructors that do not interact with the database have only one argument:
+a reference to the <emph>gnucomo_database</emph> object which handles
+the low-level interaction with the database server.
+The example below shows a few of these constructors:
+<verbatim>
+
+ database_entity::database_entity(gnucomo_database &gdb)
+
+ object::object(gnucomo_database &gdb)
+
+ service::service(gnucomo_database &gdb)
+
+</verbatim>
+The objective of this type of constructor is to create a fresh tuple and
+store it in the database later on.
+All these constructors do is establish the connection to the database
+server and fill in the defaults for the fields in the tuple.
+A destructor will put the actual tuple into the database if any
+information in the object has changed.
+This may be by sending an INSERT if the object is completely fresh
+or an UPDATE if an already existing tuple was changed.
+The state information about the freshness of an object is a property
+common to all database entities and is therefore maintained in
+the <emph>database_entity</emph> class.
+</para>
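<para>
The save-on-destruction behaviour described above can be sketched as follows.
This is a minimal illustration, not the actual gnucomo code: the
<emph>gnucomo_database</emph> stand-in only records the SQL it is asked to send,
and the <code>query</code>, <code>flush</code>, <code>set_name</code> and
<code>mark_stored</code> names, as well as the table and column names in the
SQL text, are assumptions made for this example.
</para>

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for gnucomo_database: it only records the SQL it is
// asked to send, so the sketch runs without a database server.
struct gnucomo_database
{
   std::vector<std::string> sent;
   void query(const std::string &sql) { sent.push_back(sql); }
};

class database_entity
{
protected:
   gnucomo_database &db;
   bool fresh;     // true: no tuple exists in the database yet
   bool changed;   // true: a field was modified since construction

   // Called from the destructor of each derived class. A base-class
   // destructor cannot dispatch to derived overrides, hence this helper.
   void flush(const std::string &insert_sql, const std::string &update_sql)
   {
      if (changed)
      {
         db.query(fresh ? insert_sql : update_sql);
         fresh = false;
         changed = false;
      }
   }

public:
   explicit database_entity(gnucomo_database &gdb)
      : db(gdb), fresh(true), changed(false) {}
};

class service : public database_entity
{
   std::string name_;

public:
   explicit service(gnucomo_database &gdb) : database_entity(gdb) {}

   void set_name(const std::string &n) { name_ = n; changed = true; }
   void mark_stored() { fresh = false; }   // pretend the tuple was loaded

   ~service()
   {
      // Illustrative SQL only; real code must quote or bind the values.
      flush("INSERT INTO service (name) VALUES ('" + name_ + "')",
            "UPDATE service SET name = '" + name_ + "'");
   }
};

// Runs three short-lived entities and returns the SQL that was sent.
std::vector<std::string> demo_entity_lifetimes()
{
   gnucomo_database db;
   { service untouched(db); }                               // unchanged: no SQL
   { service s(db); s.set_name("httpd"); }                  // fresh: INSERT
   { service s(db); s.mark_stored(); s.set_name("sshd"); }  // existing: UPDATE
   return db.sent;
}
```

<para>
In this sketch an entity that was never changed sends nothing, a fresh and
changed entity sends an INSERT, and a changed entity that already exists in
the database sends an UPDATE.
</para>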
+<para>
+Constructors that do interact with the database accept additional
+arguments after the initial <emph>gnucomo_database</emph> reference.
+These extra arguments are used to retrieve a tuple from the database.
+Examples of such constructors are:
+<verbatim>
+
+ object::object(gnucomo_database &gdb, String hostname)
+
+ object::object(gnucomo_database &gdb, long long oid)
+
+ service::service(gnucomo_database &gdb, String name)
+
+</verbatim>
+The set of arguments must of course correspond to a set of fields that
+uniquely identify the tuple.
+The primary key of the database table would be ideally suitable.
+If the tuple is not found in the database, data members of the object
+are set to default values and the object is marked as being fresh and
+not changed.
+</para>
+
+<para>
+Methods with the same name as a field in a tuple read or change the
+value of that field.
+Without an argument, such a method returns the current value of the field.
+With a single argument, the field is set to the new value passed in the
+argument and the method returns the original value.
+Whenever a field is set to a new value, the object is marked as being
+'changed'.
+A destructor will then save the tuple to the database.
+</para>
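<para>
The accessor convention described above could look like the sketch below.
The class name <code>object_sketch</code> and the <code>is_changed</code>
method are hypothetical and exist only to make the example self-contained;
<code>std::string</code> stands in for the <code>String</code> type used in
the constructor signatures.
</para>

```cpp
#include <cassert>
#include <string>

// Hypothetical entity with a single field, 'hostname'.
class object_sketch
{
   std::string hostname_;
   bool changed_ = false;

public:
   // Without an argument: return the current value of the field.
   std::string hostname() const { return hostname_; }

   // With one argument: store the new value, mark the object as changed
   // and return the original value.
   std::string hostname(const std::string &new_value)
   {
      std::string original = hostname_;
      hostname_ = new_value;
      changed_ = true;
      return original;
   }

   bool is_changed() const { return changed_; }
};
```

<para>
Returning the original value lets a caller replace a field and restore it
later without a separate read.
</para>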
</section>
<section>
-<heading>Configuration</heading>
+<heading><label name='configuration'/>Configuration</heading>
<para>
-Configurational parameters are stored in a XML formatted configuration file.
+Configuration parameters are stored in an XML-formatted configuration file.
The config file contains a two-level hierarchy.
The first level denotes the section for which the parameter is used
and the second level is the parameter itself.
the system-wide value.
</para>
<para>
-At the moment, the gnucomo configuration has one section, holding
-four parameters which define how to access the gnucomo database:
+The following sections and parameters are defined for the Gnucomo
+configuration:
<itemize>
-<item>type</item>
-<item>name</item>
-<item>user</item>
-<item>password</item>
+<item>database
+ <itemize>
+ <item>type</item>
+ <item>name</item>
+ <item>user</item>
+ <item>password</item>
+ <item>host</item>
+ <item>port</item>
+ </itemize>
+</item>
+<item>logging
+ <itemize>
+ <item>method</item>
+ <item>destination</item>
+ <item>level</item>
+ </itemize>
+</item>
+<item>gcm_input
+ <itemize>
+ <item>dbuser</item>
+ <item>password</item>
+ </itemize>
+</item>
+<item>gcm_daemon
+ <itemize>
+ <item>dbuser</item>
+ <item>password</item>
+ </itemize>
+</item>
</itemize>
-The <emph>type</emph> parameter must have the content <code>PostgreSQL</code>.
+The <emph>database</emph> section defines how the database is accessed.
+The <emph>database/type</emph> parameter must have the content <code>PostgreSQL</code>.
Other database systems are not supported yet.
+The <emph>database/user</emph> and <emph>database/password</emph> provide default
+login information for the database server.
+Specific user names and passwords may be specified for separate applications, such
+as <emph>gcm_input</emph> and <emph>gcm_daemon</emph>.
</para>
<subsection>
<heading>gnucomo_config class</heading>
<para>
-Each Gnucomo application should have exectly one object of the
-<strong>gnucomo_config</strong> to obtain its configurational
+Each Gnucomo application should have exactly one object of the
+<strong>gnucomo_config</strong> class to obtain its configuration
parameters.
The following methods are supported in this class:
</section>
<section>
+<heading>gcm_input</heading>
+
+<para>
+<strong>gcm_input</strong> is the application which captures messages from client
+systems in one form or another and tries to store information from these messages
+into the database.
+A client message may arrive in a number of forms and through any kind of
+transportation channel.
+Here are a few examples:
+
+<itemize>
+<item>Obtained directly from a local client's file system.</item>
+<item>From the output of another process, through standard input.</item>
+<item>Copied remotely from a client's file system, e.g. using
+ <code>ftp</code>, <code>rcp</code> or <code>scp</code>.
+ This is usually handled through spooled files.
+</item>
+<item>Through an email.</item>
+<item>As a SOAP web service, carried through HTTP or SMTP.</item>
+<item>Through a TCP connection on a special socket.</item>
+</itemize>
+
+On top of that, any message may be encrypted, for example with PGP or GnuPG.
+In any of these situations, <strong>gcm_input</strong> should be able to extract
+as much information as possible from the client's message.
+In case the message is encrypted, it may not be possible to run <strong>gcm_input</strong>
+in the background, since human intervention is needed to enter the secret key.
+</para>
+<para>
+The primary function of <strong>gcm_input</strong> is to store lines from a client's
+log files into the <emph>log</emph> table or scan a report from a probe and update
+the <emph>parameter</emph> table.
+To do this, we need certain information about the client message that is usually not
+in the content of a log file.
+This information includes:
+<itemize>
+<item>The source of the log file, most often in the form of the client's hostname.</item>
+<item>The time at which the log file arrived on the server.</item>
+<item>The service on the client which produced the log file.</item>
+</itemize>
+
+Sometimes, this information is available from the message itself, as in an email header.
+On other occasions, the information needs to be supplied externally,
+e.g. by using command line options.
+In any case, this type of 'header' information is relevant to the message
+as a whole.
+As a result, <emph>gcm_input</emph> can accept one and only one message at a time.
+For example, it is not possible to connect the standard output of
+<emph>logrunner</emph> to the standard input of <emph>gcm_input</emph> and have
+a continuous stream of messages from different log sources.
+Each message should be fed to <emph>gcm_input</emph> separately.
+Also when <emph>logrunner</emph> uses a special socket to send logging data,
+a new connection must be created for each message.
+The dataflow diagram below shows how a message travels from the input source
+to the database.
+</para>
+
+<para>
+ <picture src='gcm_input-dataflow.png' eps='gcm_input-dataflow' scale='0.6'/>
+</para>
+
+<para>
+Internally, <emph>gcm_input</emph> handles <ref to='XML_input'>XML input</ref>
+and each input item must have its data fields split into appropriate XML elements.
+When data is offered in some other form, this data must be filtered
+and transformed into XML before <emph>gcm_input</emph> can handle it.
+Two levels of transformation are possible.
+At the highest level, the whole message is transformed into an XML
+document with a <code><message></code> root element and the
+appropriate <code><header></code> and <code><data></code>
+elements, all of which are put in the proper namespace.
+At the lowest level, each line of the message's data can be transformed
+into a <code><cooked> <log></code> element.
+Two classes of replaceable filter objects take care of these transformations.
+Depending on the content of the message and/or command line options to
+<emph>gcm_input</emph>, an appropriate filter object is inserted into
+the data stream.
+</para>
+
+<para>
+The <ref to='message_filter'><emph>message_filter</emph></ref> transforms
+the raw input data into an XML document.
+The XML document is processed by the XML parser and stored into the database
+or saved into a spool area for later processing.
+The latter happens, for example, when the database is unavailable.
+The task of the <emph>message_filter</emph> object is to create the <header>
+elements and the <data> element containing either a <log> or
+a <parameters> element, along with all their child elements.
+To do this, a <emph>message_filter</emph> object must work closely
+together with a <emph>line_cooker</emph> object.
+</para>
+<para>
+There are two major classes of <emph>message_filter</emph> objects:
+one to create a <log> element and one to create a <parameters>
+element.
+Either one of these must be capable of creating a <header> element which
+is filled with information from command line arguments or an email header
+in the input stream.
+The base <emph>message_filter</emph> is not much more than a short circuit,
+which merely copies the input stream into the internal XML buffer.
+This is used when the input is already in XML format.
+</para>
+
+<para>
+The <ref to='line_cooker'><emph>line_cooker</emph></ref>
+operates on a node in the DOM tree which is
+supposed to be a <raw> <log> element that contains one line
+from a log file.
+The <emph>line_cooker</emph> transforms a <emph>raw</emph> log line into
+its constituent parts that make up a <cooked> element.
+Since each type of log file uses a different layout and syntax,
+different line cookers can be used, depending on the type of log.
+This type is indicated by the <messagetype> element in the header
+part of the message.
+Clearly, the <emph>line_cooker</emph> is a polymorphic entity.
+Exactly which <emph>line_cooker</emph> is used is determined through
+<ref to='classifying'>classifying</ref>
+the content of the message or the message type indicated in the header.
+The <emph>line_cooker</emph> base class provides a default implementation
+for most methods, while derived classes provide the actual cooking.
+</para>
+
+<para>
+Output created by <emph>gcm_input</emph> for logging and debugging purposes
+can be sent to one of several destinations:
+<itemize>
+<item>standard error.</item>
+<item>a log file.</item>
+<item>the system log.</item>
+<item>an email address.</item>
+</itemize>
+The actual destination is stated in the <strong>gnucomo</strong>
+configuration file. The default is stderr.
+A <emph>log</emph> object filters output according to the debug level.
+</para>
+
+<subsection>
+<heading><label name='classifying'/>Classifying messages</heading>
+
+<para>
+Apart from determining information about the client's message, the content
+of the message needs to be analyzed in order to handle it properly.
+The body of the message may contain all sorts of information, such as:
+<itemize>
+<item>System log file</item>
+<item>Apache log file</item>
+<item>Report from a Gnucomo agent or other probe, for example "rpm -qa"
+ or "df -k".</item>
+<item>Generic XML input</item>
+<item>Something else...</item>
+</itemize>
+
+Basically, <strong>gcm_input</strong> accepts two kinds of input: Log lines
+and parameter reports.
+The message is analyzed to obtain information about what the message entails
+and where it came from.
+The message classification embodies the way in which a message must be
+handled and in what way information from the message can be put into
+the database.
+Aspects for handling the message are for example:
+<itemize>
+<item>Strip lines at the beginning or end.</item>
+<item>Store each line separately or store the message as a whole.</item>
+<item>How to extract hostname, arrival time and service from the message.</item>
+<item>How to break up the message into individual fields for a <emph>log</emph> record.</item>
+</itemize>
+These aspects are all handled in polymorphic <emph>message_filter</emph>
+and <emph>line_cooker</emph> classes.
+The result of classifying a message is the selection of the proper
+objects derived from these classes from a collection of such objects.
+</para>
+
+<para>
+The <strong>classify()</strong> method tries to extract that information.
+Sometimes, this information cannot be determined with absolute certainty.
+The certainty expresses how sure we are about the contents in the message.
+Classifying a message may be performed with an algorithm as shown in
+the following pseudo code:
+
+<verbatim>
+uncertainty = 1.0
+
+while uncertainty > ε AND not at end
+
+ Scan for a marker
+
+ if a marker matches
+
+ uncertainty = uncertainty * P // P < 1.0
+</verbatim>
+
+Here, <emph>uncertainty</emph> is the complement of the certainty.
+It expresses how unsure we are about the content of the message, as a
+number between 0.0 and 1.0.
+In fact, it is the probability that the message is not what we think it is.
+Initially, a message is not classified and the uncertainty is 1.0.
+Some lines point toward a certain type of message but do not absolutely determine
+the type of a message. Other pieces of text are typical for a certain message type.
+Such pieces of text, called <emph>markers</emph>, are discovered in a message,
+possibly by using regular expression matches.
+Examples of markers that determine the classification of a client message
+are discussed below.
+</para>
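<para>
The pseudo code above can be made concrete for a single marker as in the
sketch below. The function name <code>scan_for_marker</code> and its
parameters are assumptions for illustration; the marker is expressed as a
regular expression, which the text mentions as one possible way to find
markers.
</para>

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Scan lines until the uncertainty is no longer above epsilon or the input
// ends. Every line that matches the marker multiplies the uncertainty by P,
// where 0 < P < 1.
double scan_for_marker(const std::vector<std::string> &lines,
                       const std::regex &marker,
                       double P, double epsilon)
{
   double uncertainty = 1.0;   // nothing is known about the message yet

   for (std::size_t i = 0; i < lines.size() && uncertainty > epsilon; ++i)
   {
      if (std::regex_search(lines[i], marker))
      {
         uncertainty *= P;
      }
   }
   return uncertainty;
}
```

<para>
After n matching lines the uncertainty equals P to the power n, so a smaller
P or a larger threshold makes the scan stop earlier.
</para>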
+
+<para>
+To determine the message type, <strong>classify()</strong> uses the collection
+of <ref to='line_cooker'><emph>line_cooker</emph></ref> objects and maintains
+the uncertainty associated with each <emph>line_cooker</emph> object.
+A line of input from the message is tested using the <emph>line_cooker::check_pattern</emph>
+method for each <emph>line_cooker</emph> object.
+When a marker matches, we are a bit more sure about the content of the message
+and the uncertainty for that <emph>line_cooker</emph> object decreases by
+multiplying the uncertainty by <strong>P</strong>, a number between 0 and 1.
+This process continues line after line from the input message until the
+uncertainty for one of the <emph>line_cooker</emph> objects is sufficiently low
+(i.e. less than a preset threshold, ε).
+At the end, the <emph>line_cooker</emph> object with the lowest uncertainty
+is selected.
+
+<verbatim>
+From - Sat Sep 14 15:01:15 2002
+</verbatim>
+
+This is almost certainly a UNIX style mail header.
+There should be lines beginning with <code>From:</code> and <code>Date:</code>
+before the first empty line is encountered.
+The hostname of the client that sent the message and the time of arrival
+can be determined from these email header lines.
+The content of the message is still to be determined by matching
+other markers.
+
+<verbatim>
+-----BEGIN PGP MESSAGE-----
+</verbatim>
+
+Such a line in the message certainly means that the message is PGP or GnuPG
+encrypted.
+Decrypting is possible only if someone or something provides a secret key.
+
+<verbatim>
+<?xml version='1.0'?>
+</verbatim>
+
+The XML header declares the message to be generic XML input.
+The structure of the XML message that <strong>gcm_input</strong> accepts
+is described in the next section.
+
+<verbatim>
+Sep 1 04:20:00 kithira kernel: solo1: unloading
+</verbatim>
+
+The general pattern of a system log file is an abbreviated month name, a day,
+a time, a name of a host without the domain, the name of a service followed
+by a colon and finally, the message of that service.
+We can match this with a regular expression to see if the message holds syslog lines.
+Similar matches can be used to find Apache log lines or output from the <emph>dump</emph>
+backup program or anything else.
+</para>
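<para>
As an illustration of the system log pattern described above, the sketch
below splits one line into its parts with a regular expression.
The <code>syslog_fields</code> structure and the
<code>cook_syslog_line</code> function are hypothetical names, and the
expression is only one plausible way to write the pattern; it does not claim
to be the one used by <strong>gcm_input</strong>.
</para>

```cpp
#include <cassert>
#include <regex>
#include <string>

// The parts of a system log line, all kept as text.
struct syslog_fields
{
   std::string month, day, time, host, service, message;
};

// Returns true and fills 'out' if the line looks like a syslog line:
// month name, day, time, short hostname, service followed by a colon,
// and finally the message of that service.
bool cook_syslog_line(const std::string &line, syslog_fields &out)
{
   static const std::regex pattern(
      R"(^([A-Z][a-z]{2}) +(\d{1,2}) (\d{2}:\d{2}:\d{2}) (\S+) ([^:]+): (.*)$)");

   std::smatch m;
   if (!std::regex_match(line, m, pattern))
   {
      return false;
   }
   out.month   = m.str(1);
   out.day     = m.str(2);
   out.time    = m.str(3);
   out.host    = m.str(4);
   out.service = m.str(5);
   out.message = m.str(6);
   return true;
}
```

<para>
A line that does not fit the pattern, such as a PGP banner, is rejected, so
the same function can double as a marker test during classification.
</para>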
+</subsection>
+
+<subsection>
+<heading><label name='XML_input'/>Generic XML input</heading>
+
+<para>
+
+Since <strong>gcm_input</strong> can not understand every conceivable form
+of input, a client can offer its input in a more generic form which reflects
+the structure of the Gnucomo database.
+In this case, the input is structured in an XML document that contains the input
+data in a form that allows <strong>gcm_input</strong> to store the information
+into the database without knowing the nature of the input.
+The XML root element for <strong>gcm_input</strong> is a <emph><message></emph>, defined
+in the namespace with namespace name <code>http://gnucomo.org/transport/</code>.
+All other elements and attributes of the <emph><message></emph> must be defined
+within this namespace.
+</para>
+<para>
+Within the <emph><message></emph> element there is a <emph><header></emph>
+and a <emph><data></emph> element.
+The <emph><data></emph> element may contain the log data in an externally
+specified format.
+The <emph><header></emph> element contains a number of elements (fields), some
+mandatory, some optional. The text of the element contains the value of
+the element.
+The following elements have been defined:
+
+<itemize>
+<item>
+<emph><messagetype></emph> mandatory
+ <para>
+ The type (format) of the log data in the data element. The message type
+ determines the way in which raw log elements are parsed and split up
+ into separate fields for insertion into the database.
+ The message types gcm_input understands are:
+ <itemize>
+ <item><code>system log</code> : The most common form of UNIX system logs.
+ Also used in most Linux distributions.
+ </item>
+ <item><code>IRIX system log</code> : Variation of system log, used by SGI.
+ </item>
+ <item><code>apache access log</code> : Access log of the Apache http daemon,
+ in default form.
+ </item>
+ <item><code>apache error log</code> : Error log of the Apache http daemon,
+ in default form.
+ </item>
+ </itemize>
+ There must also be a 'generic' system log in case all elements are
+ cooked already.
+ </para>
+</item>
+<item>
+<emph><hostname></emph> mandatory
+ <para>
+ The name of the system that generated the data in the data block.
+ This can be different from the computer composing the message.
+ </para>
+</item>
+<item>
+<emph><service></emph> optional
+ <para>
+ The (default) value of the service running on the host that
+ generated the message data. Used for log files that do not contain an
+ embedded service name.
+ </para>
+</item>
+<item>
+<emph><time></emph> optional
+ <para>
+ The best approximation to the time that the data was generated.
+ Used for (log) data that does not contain an embedded date stamp.
+ </para>
+</item>
+</itemize>
+
+The following example shows an XML message for <strong>gcm_input</strong>
+with a filled-in header and an empty <emph><data></emph> element:
+
+<verbatim>
+ <gcmt:message xmlns:gcmt='http://gnucomo.org/transport/'>
+ <gcmt:header>
+ <gcmt:messagetype>apache error log</gcmt:messagetype>
+ <gcmt:hostname>client.gnucomo.org</gcmt:hostname>
+ <gcmt:service>httpd</gcmt:service>
+ <gcmt:time>2003-04-17 14:40:46.312895+01:00</gcmt:time>
+ </gcmt:header>
+ <gcmt:data/>
+ </gcmt:message>
+</verbatim>
+</para>
+
+<para>
+The <emph>data</emph> element can hold one of two possible child
+elements: <emph><log></emph> or <emph><parameters></emph>.
+The <emph><log></emph> element may contain any number of lines from
+a system's log file, each line in a separate element.
+A single log line is the content of either a <emph><raw></emph> or
+a <emph><cooked></emph> element.
+The <emph><raw></emph> element contains the log line "as is" and nothing more.
+This is the easiest way to provide XML data for <strong>gcm_input</strong>.
+However, the log line itself must be in a form that <strong>gcm_input</strong>
+can understand.
+After all, <strong>gcm_input</strong> still needs to extract meaningful information
+from that line, such as the time stamp and the service that created the log.
+The client can also choose to provide that information separately by encapsulating
+the log line in a <emph><cooked></emph> element.
+This element may have up to four child elements, two of which are mandatory:
+<itemize>
+<item><emph><timestamp></emph> mandatory.
+ <para>
+ The time at which the log line was generated by the client.
+ </para>
+</item>
+<item><emph><hostname></emph> optional.
+ <para>
+ For logs that include a hostname in each line. This hostname is checked
+ against the hostname in the <emph><header></emph> element.
+ </para>
+</item>
+<item><emph><service></emph> optional.
+ <para>
+ If the service that generated the log is not provided in the <emph><header></emph>
+ the service must be stated for each log line separately.
+ Otherwise, each log line is assumed to be generated by the same service.
+ </para>
+</item>
+<item><emph><raw></emph> mandatory.
+ <para>
+ The content of the full log line. This would have the same content as the singular
+ <emph><raw></emph> element if the log line was not provided in a
+ <emph><cooked></emph> element.
+ </para>
+</item>
+</itemize>
+The following shows an example of the log message with two lines in the
+<emph><log></emph> element, one raw and one cooked:
+
+<verbatim>
+ <gcmt:data xmlns:gcmt='http://gnucomo.org/transport/'>
+ <gcmt:log>
+ <gcmt:raw>
+ Apr 13 04:31:03 schiza kernel: attempt to access beyond end of device
+ </gcmt:raw>
+ <gcmt:cooked>
+ <gcmt:timestamp>2003-04-13 04:31:03+02:00</gcmt:timestamp>
+ <gcmt:hostname>schiza</gcmt:hostname>
+ <gcmt:service>kernel</gcmt:service>
+ <gcmt:raw>
+ Apr 13 04:31:03 schiza kernel: 03:05: rw=0, want=1061109568, limit=2522173
+ </gcmt:raw>
+ </gcmt:cooked>
+ </gcmt:log>
+ </gcmt:data>
+</verbatim>
+</para>
+
+<para>
+The <emph><parameters></emph> element contains a list of parameters
+of the same class. The class is provided as an attribute in the
+<emph><parameters></emph> open tag.
+There is a <emph><parameter></emph> element for each parameter in the list.
+The child elements of a <emph><parameter></emph> are one optional
+<emph><description></emph> element and zero or more <emph><property></emph>
+elements.
+The names of a parameter and a property are provided by the mandatory <emph>name</emph>
+attributes in the respective elements.
+The following example shows a possible parameter report from a "df -k":
+<verbatim>
+ <gcmt:data xmlns:gcmt='http://gnucomo.org/transport/'>
+ <gcmt:parameters gcmt:class='filesystem'>
+ <gcmt:parameter gcmt:name='root'>
+ <gcmt:description>Root filesystem</gcmt:description>
+ <gcmt:property gcmt:name='size'>303344</gcmt:property>
+ <gcmt:property gcmt:name='used'>104051</gcmt:property>
+ <gcmt:property gcmt:name='available'>183632</gcmt:property>
+ <gcmt:property gcmt:name='device'>/dev/hda1</gcmt:property>
+ <gcmt:property gcmt:name='mountpoint'>/</gcmt:property>
+ </gcmt:parameter>
+ <gcmt:parameter gcmt:name='usr'>
+ <gcmt:description>Usr filesystem</gcmt:description>
+ <gcmt:property gcmt:name='size'>5044188</gcmt:property>
+ <gcmt:property gcmt:name='used'>3073716</gcmt:property>
+ <gcmt:property gcmt:name='available'>1714236</gcmt:property>
+ <gcmt:property gcmt:name='device'>/dev/hdd2</gcmt:property>
+ <gcmt:property gcmt:name='mountpoint'>/usr</gcmt:property>
+ </gcmt:parameter>
+ </gcmt:parameters>
+ </gcmt:data>
+</verbatim>
+</para>
+
+</subsection>
+
+<subsection>
+<heading>Gcm_input classes</heading>
+<para>
+The figure below shows the class diagram that is used for <strong>gcm_input</strong>:
+ <para>
+ <picture src='classes-gcm_input.png' eps='classes-gcm_input' scale='0.8'/>
+ </para>
+
+The heart of the application is a <emph>client_message</emph> object.
+This object reads the message through the
+<ref to='message_buffer'><emph>message_buffer</emph></ref> from some
+input stream (file, string, stdin or socket), classifies the message and
+enters information from the message into the database.
+The <emph>client_message</emph> object holds a collection of <emph>message_filter</emph>
+and associated <emph>line_cooker</emph> objects.
+The association is maintained in a <emph>xform</emph> object.
+Note that several <emph>line_cooker</emph> objects can be associated
+with a single <emph>message_filter</emph> object.
+For example, a system log or a web server log are processed in a similar
+manner, i.e. each line is transformed into a <log> element.
+The patterns of the individual lines, however, are entirely different.
+During the classification of the input data, one combination of a
+<emph>message_filter</emph> and <emph>line_cooker</emph> is selected.
+The classification process works by calculating the uncertainty with which
+a <emph>line_cooker</emph> matches with the input data.
+The one with the least uncertainty is selected.
+</para>
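+<para>
+The selection step can be sketched as follows. This is only a minimal
+illustration and not the actual <strong>gcm_input</strong> code: the
+<code>xform_entry</code> structure, the <code>uncertainty</code> function and the
+way the uncertainty is measured (the fraction of sampled lines that did not
+match) are assumptions made for the example.
+</para>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an xform entry: the name of a
// message_filter/line_cooker combination and the uncertainty of its match.
struct xform_entry
{
   std::string name;
   double      uncertainty;   // 1.0 means "no evidence of a match at all"
};

// Assumed measure: the fraction of sampled lines the cooker did NOT match.
double uncertainty(int lines_checked, int lines_matched)
{
   if (lines_checked <= 0)
   {
      return 1.0;
   }
   return 1.0 - static_cast<double>(lines_matched) / lines_checked;
}

// Return the index of the entry with the least uncertainty,
// or -1 if the collection is empty.
int select_least_uncertain(const std::vector<xform_entry> &xforms)
{
   int best = -1;

   for (std::size_t i = 0; i < xforms.size(); i++)
   {
      if (best < 0 ||
          xforms[i].uncertainty < xforms[static_cast<std::size_t>(best)].uncertainty)
      {
         best = static_cast<int>(i);
      }
   }
   return best;
}
```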
+
+<subsubsection>
+<heading><label name='client_message'/>client_message</heading>
+<para>
+The <emph>client_message</emph> has a relationship with a <emph>gnucomo_database</emph> object which
+is an abstraction of the tables in the database.
+These are the methods for the <emph>client_message</emph> class:
+
+<itemize>
+<item><code>client_message::client_message(istream *in, gnucomo_database *db)</code>
+ <para>
+ Constructor.
+ </para>
+</item>
+<item><code>void add_cooker(line_cooker *lc, message_filter *mf)</code>
+ <para>
+ Add another <emph>line_cooker</emph> object with the associated
+ <emph>message_filter</emph> object to the collection.
+ This initializes the uncertainty with which the <emph>line_cooker</emph>
+ is selected to 1.0.
+ </para>
+</item>
+<item><code>double client_message::classify(String host, date arrival_d,
+ hour arrival_t, String serv)</code>
+ <para>
+ Try to classify the message and return the certainty with which the class of the
+ message could be determined.
+ If the hostname, arrival time and service cannot be extracted from the message,
+ the arguments are used as defaults.
+ This will examine the first few lines of the input data
+ to select one of the <emph>message_filter</emph> with
+ associated <emph>line_cooker</emph> objects
+ from the collection built with the <emph>add_cooker</emph> method.
+ </para>
+</item>
+<item><code>int enter()</code>
+ <para>
+ Insert the message contents into the gnucomo database.
+ Returns the number of records inserted.
+ The input data from the <emph>message_buffer</emph>
+ is first transformed into an XML document (a strstream object)
+ by invoking the <emph>message_filter</emph> and <emph>line_cooker</emph> objects.
+ The XML document in the internal buffer is then parsed into an
+ XML DOM tree, using the Gnome XML parser.
+ The XML document may also be validated against an XML Schema definition.
+ </para>
+ <para>
+ After extracting and checking the <header> elements,
+ the data nodes are extracted and inserted into the database,
+ possibly using a <emph>line_cooker</emph> object to cook raw
+ log elements.
+ If an error occurs in some stage of this process, the XML document
+ is dumped in a spool area for later processing.
+ </para>
+</item>
+</itemize>
+
+</para>
+</subsubsection>
+
+<subsubsection>
+<heading><label name='message_filter'/>message_filter</heading>
+<para>
+A <emph>message_filter</emph> transforms the raw input data into an
+XML document suitable for further parsing and storage into the database.
+The base class, <emph>message_filter</emph>, does nothing but copy the
+input stream into an internal XML buffer.
+An object of this class is used when the input is already in XML format.
+Classes derived from <emph>message_filter</emph> read the input line by line,
+possibly extracting information from an email header if available.
+<itemize>
+<item>
+ <code>message_filter::construct_XML(message_buffer &in, strstream &xml)</code>
+ <para>
+ Simply copy the input stream into the internal XML buffer.
+ The base class function is used when the input is already in XML format.
+ Derived classes will override this function.
+ </para>
+</item>
+</itemize>
+</para>
+
+<para>
+Classes derived from <emph>message_filter</emph> transform various kinds
+of input into an XML document. The following diagram shows some examples:
+
+ <para>
+ <picture src='classes-message_filter.png' eps='classes-message_filter' scale='0.8'/>
+ </para>
+
+The two classes derived directly from <emph>message_filter</emph> reflect
+that <strong>gcm_input</strong> can handle two kinds of input:
+<emph>log_filter</emph> for log files
+and <emph>parameters_filter</emph> for parameter reports.
+Each of these types of input is transformed into an entirely different XML
+document and is stored quite differently in the database.
+Classes that are derived further down the hierarchy will handle more specific
+forms of input.
+</para>
+</subsubsection>
+
+<subsubsection>
+<heading><label name='line_cooker'/>line_cooker</heading>
+<para>
+To turn a raw line from a log file into separate parts that can be stored
+in the database, i.e. parse the line, the <emph>client_message</emph>
+object uses a <emph>line_cooker</emph> object.
+This is a polymorphic object, so each type of log can have its own parser,
+while the <emph>client_message</emph> object uses a common interface
+for each one.
+For each message, one specific <emph>line_cooker</emph> object is
+selected as determined by the message type.
+E.g., the derived class <emph>syslog_cooker</emph> is used for system logs.
+When the <emph>client_message</emph> object encounters a <strong>raw</strong>
+log element, it takes the following steps to turn this into a
+<strong>cooked</strong> log element:
+<enumerate>
+<item>
+ Remove the TEXT node of the "raw" element and save its content.
+</item>
+<item>
+ Change the name of the element into "cooked".
+</item>
+<item>
+ Check if the content matches the syntax of the type of log we're
+ processing at the moment.
+ This depends on the message type and is therefore a task for the
+ <emph>line_cooker</emph> object.
+</item>
+<item>
+ Have the <emph>line_cooker</emph> parse the content and extract
+ the time stamp and optionally the hostname and service.
+</item>
+<item>
+ Insert new child elements into the cooked element.
+</item>
+</enumerate>
+After that, the <emph>cooked</emph> element is ready for further processing
+and possibly storing into the database.
+</para>
+
+<para>
+The <emph>line_cooker</emph> base class holds three protected members
+that must be filled with information by the derived classes:
+<itemize>
+<item><code>UTC ts</code> : the timestamp</item>
+<item><code>String hn</code> : the hostname</item>
+<item><code>String srv</code> : the service</item>
+</itemize>
+Corresponding base class methods (<emph>timestamp</emph>,
+<emph>hostname</emph> and <emph>service</emph>) will do
+nothing more than return these values.
+It is up to the derived class's <emph>cook_this</emph> method
+to properly initialize these members.
+</para>
+
+<para>
+The methods for the <emph>line_cooker</emph> class are:
+<itemize>
+<item>
+ <code>bool line_cooker::check_pattern(String logline)</code>
+ <para>
+ Tries to match the <emph>logline</emph> against the patterns
+ that describe the message syntax.
+ In the <emph>line_cooker</emph> base class, this is a pure virtual function.
+ Returns true if the log line adheres to the message-specific syntax.
+ </para>
+</item>
+<item>
+ <code>bool line_cooker::cook_this(String logline, UTC arrival)</code>
+ <para>
+ Extracts information from the <emph>logline</emph>.
+ The <emph>arrival</emph> can be used to make corrections to incomplete
+ time stamps in the log file.
+ The implementation in a derived class must properly initialize the protected
+ members described above.
+ In the <emph>line_cooker</emph> base class, this is a pure virtual function.
+ Returns false if the pattern does not match.
+ </para>
+</item>
+<item>
+ <code>String line_cooker::message_type()</code>
+ <para>
+ Return the message type for which this line cooker is intended.
+ </para>
+</item>
+<item>
+ <code>UTC line_cooker::timestamp()</code>
+ <para>
+ Returns the timestamp that was previously extracted from the log line
+ or a 'null' timestamp if that information could not be extracted.
+ </para>
+</item>
+<item>
+ <code>String line_cooker::hostname()</code>
+ <para>
+ Returns the hostname that was previously extracted from the log line,
+ or an empty string if that information could not be extracted.
+ </para>
+</item>
+<item>
+ <code>String line_cooker::service()</code>
+ <para>
+ Returns the service that was previously extracted from the log line,
+ or an empty string if that information could not be extracted.
+ </para>
+</item>
+</itemize>
+</para>
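+<para>
+As an illustration of this interface, the sketch below shows a base class with
+the three protected members and a derived cooker. It is not the actual
+<strong>gnucomo</strong> source: <code>std::string</code> and <code>std::time_t</code>
+stand in for the <emph>String</emph> and <emph>UTC</emph> classes, the
+<emph>message_type</emph> method is left out, and <code>demo_cooker</code> with its
+"host service: text" line format is a hypothetical example.
+</para>

```cpp
#include <cassert>
#include <ctime>
#include <sstream>
#include <string>

class line_cooker
{
protected:
   std::time_t ts;    // the timestamp (stand-in for UTC)
   std::string hn;    // the hostname
   std::string srv;   // the service

public:
   line_cooker() : ts(0) {}
   virtual ~line_cooker() {}

   virtual bool check_pattern(std::string logline) = 0;
   virtual bool cook_this(std::string logline, std::time_t arrival) = 0;

   // The base class methods do nothing more than return the members.
   std::time_t timestamp() { return ts; }
   std::string hostname()  { return hn; }
   std::string service()   { return srv; }
};

// Hypothetical cooker for lines of the form "host service: text".
class demo_cooker : public line_cooker
{
public:
   bool check_pattern(std::string logline)
   {
      std::istringstream in(logline);
      std::string host, service;

      return (in >> host >> service) && service.size() > 1
             && service[service.size() - 1] == ':';
   }

   bool cook_this(std::string logline, std::time_t arrival)
   {
      if (!check_pattern(logline))
      {
         return false;
      }

      std::istringstream in(logline);
      std::string service;

      in >> hn >> service;
      srv = service.substr(0, service.size() - 1);   // drop the ':'
      ts  = arrival;   // this demo format carries no time stamp of its own
      return true;
   }
};
```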
+</subsubsection>
+
+<subsubsection>
+<heading><label name='message_buffer'/>message_buffer</heading>
+<para>
+Some kind of input buffering is needed when a client message is being processed.
+The contents of the message are not entirely clear until a few lines are analyzed,
+and these lines probably need to be read again.
+When the message is stored in a file, this is no problem; a simple lseek allows us
+to read any line over and over again.
+However, when the message comes from an input stream, like a TCP socket or just
+plain old stdin, this is a different matter.
+Lines of the messages that are already read will be lost forever, unless they are
+stored somewhere.
+To store an input stream temporarily, there are two options:
+<enumerate>
+<item>In an internal memory buffer.</item>
+<item>In an external (temporary) file.</item>
+</enumerate>
+The <emph>message_buffer</emph> class takes care of the input buffering, thus
+hiding these implementation details.
+On the outside, a <emph>message_buffer</emph> can be read line by line until the
+end of the input is reached.
+Lines of input can be read again by backing up to the beginning of the message
+by using the <strong>rewind()</strong> method or by backing up one line
+with the <strong>--</strong> operator.
+The <emph>message_buffer</emph> object maintains a pointer to the next
+available line.
+The <strong>++</strong> operator, being the opposite of the <strong>--</strong>
+operator, skips one line.
+</para>
+
+<para>
+The <strong>>></strong> operator reads data from the message
+into the second (String) operand, just like the <strong>>></strong>
+operator for an istream.
+There is a small difference, though.
+The <strong>>></strong> operator for a <emph>message_buffer</emph>
+returns a boolean value which states if there actually was input available.
+This value will usually turn to <code>false</code> at the end of file.
+A second difference is the fact that input data can only be read into
+<emph>String</emph> objects a line at a time.
+There are no functions for reading integer or floating point numbers.
+The <strong>>></strong> operator reads the next line either from
+an internal buffer or from the external input stream if the internal
+buffer is exhausted.
+Lines read from the input stream are cached in the internal buffer,
+so they are available for reading another time, e.g. after
+rewinding to the beginning of the message.
+</para>
+
+<para>
+Methods for the <emph>message_buffer</emph> class:
+
+<itemize>
+<item><code>message_buffer::message_buffer(istream *in)</code>
+ <para>
+ Constructor.
+ </para>
+</item>
+<item><code>bool operator >>(message_buffer &, String &)</code>
+</item>
+<item><code>message_buffer::rewind()</code></item>
+<item><code>message_buffer::operator --</code></item>
+<item><code>message_buffer::operator ++</code></item>
+</itemize>
+</para>
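+<para>
+The buffering behaviour described above can be sketched as follows. This is a
+simplified illustration that only implements the in-memory option and uses
+<code>std::string</code> in place of <emph>String</emph>; the member names
+<code>cache</code> and <code>next</code> are assumptions and do not come from the
+actual implementation.
+</para>

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

class message_buffer
{
   std::istream            *input;
   std::vector<std::string> cache;   // lines already read from the stream
   std::size_t              next;    // index of the next available line

public:
   explicit message_buffer(std::istream *in) : input(in), next(0) {}

   // Back up to the beginning of the message.
   void rewind() { next = 0; }

   // Back up one line.
   message_buffer &operator--()
   {
      if (next > 0)
      {
         next--;
      }
      return *this;
   }

   // Skip one line, caching it if it was not read before.
   message_buffer &operator++()
   {
      std::string skipped;
      *this >> skipped;
      return *this;
   }

   // Read the next line; returns false if no more input is available.
   friend bool operator>>(message_buffer &b, std::string &line)
   {
      if (b.next == b.cache.size())
      {
         std::string fresh;
         if (!std::getline(*b.input, fresh))
         {
            return false;
         }
         b.cache.push_back(fresh);
      }
      line = b.cache[b.next++];
      return true;
   }
};
```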
+
+</subsubsection>
+</subsection>
+
+<subsection>
+
+<heading>Command arguments</heading>
+
+<para>
+<strong>Gcm_input</strong> understands the following command line arguments:
+<itemize>
+<item>-c <name> : Configuration name</item>
+<item>-d <date> : Arrival time of the message</item>
+<item>-h <hostname> : FQDN of the client</item>
+<item>-s <service> : service that created the log</item>
+<item>-v : verbose output. Print lots of debug information</item>
+<item>-V : Print version and exit</item>
+</itemize>
+</para>
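+<para>
+One possible way to parse these arguments is with the POSIX <code>getopt</code>
+function, as in the sketch below. Whether <strong>gcm_input</strong> actually uses
+<code>getopt</code> is an assumption; the <code>input_options</code> structure and
+the <code>parse_arguments</code> function are hypothetical names.
+</para>

```cpp
#include <cassert>
#include <string>
#include <unistd.h>   // POSIX getopt(), optarg, optind

// Hypothetical container for the command line settings.
struct input_options
{
   std::string config;        // -c <name>
   std::string date;          // -d <date>
   std::string hostname;      // -h <hostname>
   std::string service;       // -s <service>
   bool        verbose;       // -v
   bool        show_version;  // -V
   bool        ok;            // false if an unknown option was seen

   input_options() : verbose(false), show_version(false), ok(true) {}
};

input_options parse_arguments(int argc, char *argv[])
{
   input_options opts;
   int           c;

   optind = 1;   // start scanning at the first argument
   while ((c = getopt(argc, argv, "c:d:h:s:vV")) != -1)
   {
      switch (c)
      {
      case 'c': opts.config       = optarg; break;
      case 'd': opts.date         = optarg; break;
      case 'h': opts.hostname     = optarg; break;
      case 's': opts.service      = optarg; break;
      case 'v': opts.verbose      = true;   break;
      case 'V': opts.show_version = true;   break;
      default:  opts.ok           = false;  break;
      }
   }
   return opts;
}
```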
+</subsection>
+</section>
+
+<section>
+<heading>gcm_daemon</heading>
+<para>
+<strong>Gcm_daemon</strong> is the application that processes data that has just
+arrived in the database.
+It handles the log-information delivered by <strong>gcm_input</strong>
+in the <emph>log</emph> table of the database.
+With this data, further storage and classification can be done.
+Whereas <strong>gcm_input</strong> is a highly versatile application that is
+started and stopped all the time, the daemon is continuously available, monitoring
+the entire system. Basically, the daemon monitors everything that happens
+in the database, executes continuous checks and processes all the data.
+The two applications (gcm_input and gcm_daemon) together are the core of the central system.
+The application has the following tasks:
+<itemize>
+<item>Processing data into other tables so that easy detection can take place</item>
+<item>Raising notifications based on the available input</item>
+<item>Maintaining the status of notifications and changing priority when needed</item>
+<item>Periodically performing checks for alerts that are communicated through the notification-table</item>
+<item>Performing updates on the database when a new version of the software is loaded</item>
+</itemize>
+</para>
+
+<subsection>
+<heading>Performing checks</heading>
+
+<para>
+One of the most difficult tasks for the daemon is performing the automatic checks.
+Every check is different and will be made up of several parameters that have to test negative.
+That makes it hard to define this in software.
+Another downside is that some work may be very redundant.
+For that reason, a more generic control structure is needed, based on the technologies used.
+The logical choice is then to focus on the capabilities of the database and perform
+the job by executing queries.
+</para>
+<para>
+Since the system is about detecting problems and issues, we build the detection in
+such a way that queries on the database result in 'suspicious' log records.
+So-called 'innocent' records can be ignored. So if a query gives a result, a
+problem is present; if there is no result, there is no problem.
+When we look for common ground in the process of identifying problems,
+it can be said that all results are based on the log-table
+(as stated in the manifest, the log-table is the one and only table where input
+arrives and is stored for later use).
+Furthermore there are two ways of determining if a problem is present:
+<itemize>
+<item>
+A single log-record or a group of log-records is within or outside the boundaries set.
+If it is outside the boundaries, the log record(s) is/are a potential problem.
+If several boundaries are set, all of them need to be applied.
+Results can be derived directly from fixed data.
+</item>
+<item>
+A set of records outlines a trend that over time may turn out
+to be a problem. These types of values are not fixed and directly legible,
+but are more or less derived data. That data is input for some checks (previous bullet).
+</item>
+</itemize>
+In both cases a set of queries can be run.
+If several queries are to be executed, the later queries can be executed on
+only the results of the earlier ones. For that reason, intermediate results have to be stored in a
+temporary table for later reuse.
+Saving a session in combination with the log-records found is sufficient.
+This holds because log records are the basis of all derived data in
+the numerous log_adv_xxx-tables, which always have a reference to the log-table.
+</para>
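+<para>
+The idea of running later queries on only the results of earlier ones can be
+illustrated with the in-memory sketch below. In the real system this would be
+done with SQL queries and a temporary table; here a vector named
+<code>session</code> plays that role. The <code>log_record</code> structure, the
+<code>run_check</code> function and the two example queries are all hypothetical.
+</para>

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, heavily simplified record from the log-table.
struct log_record
{
   int         logid;
   std::string service;
   std::string text;
};

// A query selects the records it considers suspicious.
typedef std::function<bool(const log_record &)> query;

// Two example queries, made up for this illustration.
bool from_sshd(const log_record &r)
{
   return r.service == "sshd";
}

bool mentions_failure(const log_record &r)
{
   return r.text.find("Failed") != std::string::npos;
}

// Run the queries in order; each query only sees the results of the
// previous one. Returns the ids of the records that survive all queries.
std::vector<int> run_check(const std::vector<log_record> &log,
                           const std::vector<query> &queries)
{
   // 'session' stands in for the temporary table of intermediate results.
   // It holds references (positions) into the log-table.
   std::vector<std::size_t> session;
   for (std::size_t i = 0; i < log.size(); i++)
   {
      session.push_back(i);
   }

   for (std::size_t q = 0; q < queries.size(); q++)
   {
      std::vector<std::size_t> result;
      for (std::size_t s = 0; s < session.size(); s++)
      {
         if (queries[q](log[session[s]]))
         {
            result.push_back(session[s]);
         }
      }
      session.swap(result);
   }

   std::vector<int> suspicious;
   for (std::size_t s = 0; s < session.size(); s++)
   {
      suspicious.push_back(log[session[s]].logid);
   }
   return suspicious;
}
```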
+<para>
+Building the checks will thus be nothing more than combining queries
+and adding a classification to the results of those queries.
+If this generic structure is built properly, with a simple (easy to understand) interface,
+many combinations can be made. People who have a logically correct idea
+but insufficient programming skills will be able to build checks.
+As a consequence, we can offer the interface to users,
+who in turn can make particular checks for their own unique environment.
+This - of course - doesn't mean that a clear SQL-interface shouldn't be offered.
+</para>
+<para>
+Whenever something happens that is out of the ordinary, a line will be written to syslogd.
+This will enable users and developers to trace exactly what happened.
+The gcm_daemon will also log startup and ending so that abrupt endings of the daemon will be detected.
+</para>
+</subsection>
+
+<subsection>
+<heading>The initial process</heading>
+<para>
+When gcm_daemon starts, some basic actions first take place that go beyond
+just opening a connection to the database. The following actions also need to take place:
+<itemize>
+<item>
+Check whether the database is still the most recent version.
+The daemon will check the version number of the database.
+If the database is not the same version as gcm_daemon, an update will be performed.
+When the database is up-to-date, normal processing will continue.
+</item>
+<item>
+If the database reflects that the previously used version of gcm_daemon is less recent than
+the running version (i.e. a new version has been installed), all records
+in the log-table that weren't recognized before will be set to unprocessed, since
+there is a fair chance that they will be recognized this time. This will ensure that no data is lost.
+</item>
+</itemize>
+</para>
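+<para>
+The two startup actions can be summarized in the sketch below. It models the
+database state in memory purely for illustration; the structures, the field
+names and the <code>startup</code> function are assumptions, and the actual
+database update is reduced to setting a version number.
+</para>

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-record flags from the log-table.
struct log_state
{
   bool processed;
   bool recognized;
};

// Hypothetical in-memory stand-in for the relevant database state.
struct database_state
{
   int                    db_version;       // version of the database
   int                    daemon_version;   // gcm_daemon version that ran last
   std::vector<log_state> log;
};

void startup(database_state &db, int running_version)
{
   // Action 1: bring the database to the version of the running daemon.
   if (db.db_version != running_version)
   {
      db.db_version = running_version;   // stands in for the real update
   }

   // Action 2: a newer daemon gets another try at unrecognized records.
   if (db.daemon_version < running_version)
   {
      for (std::size_t i = 0; i < db.log.size(); i++)
      {
         if (!db.log[i].recognized)
         {
            db.log[i].processed = false;
         }
      }
      db.daemon_version = running_version;
   }
}
```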
+
+</subsection>
+</section>
+
+<section>
<heading>Design ideas</heading>
<para>
-Use of a neural network to analyse system logs:
+Use of a neural network to analyze system logs:
<itemize>
<item>Classify words</item>
<item>Classify message based on word classification</item>
</itemize>
</para>
-
</section>
</chapter>