Added Solaris support in Gnucomo report scripts.

diff --git a/doc/design.xml b/doc/design.xml

index a09b579..a2387ac 100644
--- a/doc/design.xml
+++ b/doc/design.xml
@@ -6,7 +6,7 @@
  <!--
        XML documentation system
        Original author :  Arjen Baart - arjen@andromeda.nl
-      Version         : $Revision: 1.3 $
+      Version         : $Revision: 1.7 $
  
        This document is prepared for XMLDoc. Transform to HTML,
        LaTeX, Postscript or plain text with XMLDoc utilities and
@@ -17,14 +17,13 @@
  <titlepage>
     <title>Gnucomo - Computer Monitoring</title>
     <subtitle>Design description</subtitle>
-<!--
     <para><picture src='logo.png' eps='logo' scale='0.7'/></para>
--->
     <author>Arjen Baart <code>&lt;arjen@andromeda.nl&gt;</code></author>
     <author>Brenno de Winter<code>&lt;brenno@dewinter.com&gt;</code></author>
-   <date>September 10, 2002</date>
+   <author>Peter Roozemaal<code>&lt;mathfox@xs4all.nl&gt;</code></author>
+   <date>November 26, 2003</date>
     <docinfo>
-      <infoitem label="Version">0.1</infoitem>
+      <infoitem label="Version">0.6</infoitem>
        <infoitem label="Organization">Andromeda Technology &amp; Automation</infoitem>
        <infoitem label="Organization">De Winter Information Solutions</infoitem>
     </docinfo>
@@ -49,7 +48,26 @@ and is based upon the development manifest.
  <heading>Architecture</heading>
  
  <para>
-The architecture of <strong>gnucomo</strong> is shown in the
+The systems that are being monitored in <strong>gnucomo</strong> are called
+<emph>Objects</emph>.
+These may be computers, routers, switches or other active components that
+are capable of sending reports about their internal workings
+to the <strong>gnucomo</strong> server.
+An <emph>Object</emph> plays a central role in the <strong>gnucomo</strong> system.
+Two separate aspects of an <emph>Object</emph> are monitored: the static state and the
+dynamic behaviour.
+The static state of an <emph>Object</emph> is represented by a set of parameters
+and the values of these parameters' attributes.
+The dynamic behaviour of an <emph>Object</emph> is characterized by events that
+happen on an <emph>Object</emph>.
+One obvious way to collect a report of these events is to scan the log files of the
+system and its processes.
+<emph>Objects</emph> run services and these services are configured with a set
+of parameters. Also, services produce entries in log files.
+</para>
+
+<para>
+The dataflow architecture of <strong>gnucomo</strong> is shown in the
  data flow diagram below:
  </para>
  
@@ -58,6 +76,33 @@ data flow diagram below:
  </para>
  
  <para>
+At the left of the diagram, information is acquired from the monitored system.
+Several agents can be used to obtain information from this system, in
+active or passive ways.
+A passive agent uses information which is available on the system anyway,
+such as log files or other lists.
+An active agent requests explicit data from the monitored system.
+One example of a passive agent is <emph>logrunner</emph>, a program which
+monitors system log files and sends regular updates to the <strong>gnucomo</strong>
+server.
+The agents on the monitored system send the data to some kind of transportation channel.
+This can be any form of transport, such as Email, SOAP, plain file copying or
+some special network connection.
+If desired, the transportation may provide security.
+Once arrived at the server, the information from monitored systems is captured
+by the <emph>gcm_input</emph> process.
+This process can obtain the data through many forms of transport and from
+a number of input formats.
+<emph>Gcm_input</emph> will try to recognize as much as possible from an
+input message and store the obtained information into the <emph>Raw Storage</emph>
+database.
+The <emph>Raw Storage</emph> data is processed further and analyzed by
+the <emph>gcm_daemon</emph>, which scans the data, gathers statistics and
+stores its results into the <emph>Derived Storage</emph> database where
+it is available for human review and further analysis.
+</para>
+
+<para>
  Architectural items to consider:
  <itemize>
  <item>Active and passive data acquisition</item>
@@ -94,7 +139,7 @@ Quite often, the test database will need to be destroyed and recreated.
  To enable testing of <strong>gnucomo</strong> applications, all programs
  need to access either the test database or the production database.
  To accommodate this, each application needs an option to override the
-default name of the configuration file (gnucomo.conf).
+default name of the <ref to='configuration'>configuration</ref> file (gnucomo.conf).
  </para>
  
  <para>
@@ -113,17 +158,105 @@ The destructor will of course close the database connection.
  </para>
  
  <para>
-Other methods provide access to the database.
+Other methods provide access to the database in a low-level manner.
  There will be lots more in the future, but here are a few to begin with:
  <itemize>
-<item>Find the objectid of a host, given its hostname</item>
-<item>Insert a log record into the log table</item>
+<item>Send a SQL query to the database.</item>
+<item>Read a tuple from a result set.</item>
+<item>Obtain the userid for the current database session.</item>
  </itemize>
  </para>
+
+<para>
+The information stored in the database as tuples is represented by classes in
+other programming languages such as C++ or PHP.
+Each class models a particular type of tuple (an <emph>entity</emph>)
+in the database.
+Such classes maintain the relation with the database on one end,
+while providing methods that are specific to the entity on the other end.
+All database communication and SQL queries are hidden inside the
+entity's class.
+This includes, for example, handling database result sets and access control.
+</para>
+<para>
+Properties and operations that are common to all classes that represent
+entities in the database are collected in a common base class.
+The base class, named <emph>database_entity</emph>, provides default
+implementations for loading and storing tuples, construction and destruction
+and iteration.
+Most derived classes will override these functions.
+Two examples of classes that represent entities in the database
+are <emph>object</emph> and <emph>service</emph>.
+Both are derived from a <emph>database_entity</emph>, as shown below:
+<para>
+  <picture src='class-database_entity.png' eps='class-database_entity'/>
+</para>
+</para>
+
+<para>
+Constructors of classes derived from <emph>database_entity</emph> come in
+two varieties: with or without database interaction.
+Constructors that do not interact with the database have only one argument:
+a reference to the <emph>gnucomo_database</emph> object which handles
+the low-level interaction with the database server.
+The example below shows a few of these constructors:
+<verbatim>
+
+  database_entity::database_entity(gnucomo_database &amp;gdb)
+
+  object::object(gnucomo_database &amp;gdb)
+
+  service::service(gnucomo_database &amp;gdb)
+
+</verbatim>
+The objective of this type of constructor is to create a fresh tuple and
+store it in the database later on.
+All these constructors do is establish the connection to the database
+server and fill in the defaults for the fields in the tuple.
+A destructor will put the actual tuple into the database if any
+information in the object has changed.
+This may be by sending an INSERT if the object is completely fresh
+or an UPDATE if an already existing tuple was changed.
+The state information about the freshness of an object is a property
+common to all database entities and is therefore maintained in
+the <emph>database_entity</emph> class.
+</para>
+<para>
+Constructors that do interact with the database accept additional
+arguments after the initial <emph>gnucomo_database</emph> reference.
+These extra arguments are used to retrieve a tuple from the database.
+Examples of such constructors are:
+<verbatim>
+
+   object::object(gnucomo_database &amp;gdb, String hostname)
+
+   object::object(gnucomo_database &amp;gdb, long long oid)
+
+   service::service(gnucomo_database &amp;gdb, String name)
+
+</verbatim>
+The set of arguments must of course correspond to a set of fields that
+uniquely identify the tuple.
+The primary key of the database table would be ideal.
+If the tuple is not found in the database, data members of the object
+are set to default values and the object is marked as being fresh and
+not changed.
+</para>
+
+<para>
+Methods with the same name as a field in a tuple read or change the
+value of that field.
+Without an argument, such a method returns the current value of the field.
+With a single argument, the field is set to the new value passed in the
+argument and the method returns the original value.
+Whenever a field is set to a new value, the object is marked as being
+'changed'.
+A destructor will then save the tuple to the database.
+</para>
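The persistence rules described above (fresh versus changed state, accessor methods that return the previous value, and a destructor that saves the tuple) can be illustrated with a minimal C++ sketch. This is not the Gnucomo implementation: the `gnucomo_database` stand-in, its `query()` method, the `sent` member and the default name `"unknown"` are assumptions made for illustration, and the actual SQL text is left out because the table layout is not shown in this section.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for gnucomo_database: it only records which kind
// of statement it was asked to send, so no database server is needed.
struct gnucomo_database {
    std::vector<std::string> sent;
    void query(const std::string &statement) { sent.push_back(statement); }
};

// State common to all entities: 'fresh' (no tuple in the database yet)
// and 'changed' (the object differs from the stored tuple).
class database_entity {
public:
    explicit database_entity(gnucomo_database &gdb) : db(gdb) {}
    virtual ~database_entity() = default;

protected:
    gnucomo_database &db;
    bool fresh = true;
    bool changed = false;
};

class service : public database_entity {
public:
    // Constructor without database interaction: fill in defaults only.
    explicit service(gnucomo_database &gdb) : database_entity(gdb) {}

    // The destructor stores the tuple only if something changed:
    // INSERT for a fresh object, UPDATE for an existing tuple.
    ~service() override {
        if (changed)
            db.query(fresh ? "INSERT" : "UPDATE");  // real SQL omitted
    }

    // Without an argument: return the current value of the field.
    std::string name() const { return name_; }

    // With an argument: set the field, mark the object as changed
    // and return the original value.
    std::string name(const std::string &new_name) {
        std::string old = name_;
        name_ = new_name;
        changed = true;
        return old;
    }

private:
    std::string name_ = "unknown";  // assumed default value
};
```

In this sketch, a `service` that is constructed and destroyed without any setter call sends nothing, while one whose `name()` was set sends a single INSERT when it goes out of scope.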
  </section>
  
  <section>
-<heading>Configuration</heading>
+<heading><label name='configuration'/>Configuration</heading>
  
  <para>
 Configuration parameters are stored in an XML-formatted configuration file.
@@ -151,17 +284,47 @@ The value of a user-specific configuration parameter overrides
  the system-wide value.
  </para>
  <para>
-At the moment, the gnucomo configuration has one section, holding
-four parameters which define how to access the gnucomo database:
+The following sections and parameters are defined for the Gnucomo
+configuration:
  <itemize>
-<item>type</item>
-<item>name</item>
-<item>user</item>
-<item>password</item>
+<item>database
+   <itemize>
+   <item>type</item>
+   <item>name</item>
+   <item>user</item>
+   <item>password</item>
+   <item>host</item>
+   <item>port</item>
+   </itemize>
+</item>
+<item>logging
+   <itemize>
+   <item>method</item>
+   <item>destination</item>
+   <item>level</item>
+   </itemize>
+</item>
+<item>gcm_input
+   <itemize>
+   <item>dbuser</item>
+   <item>password</item>
+   </itemize>
+</item>
+<item>gcm_daemon
+   <itemize>
+   <item>dbuser</item>
+   <item>password</item>
+   </itemize>
+</item>
  </itemize>
  
-The <emph>type</emph> parameter must have the content <code>PostgreSQL</code>.
+The <emph>database</emph> section defines how the database is accessed.
+The <emph>database/type</emph> parameter must have the content <code>PostgreSQL</code>.
  Other database systems are not supported yet.
+The <emph>database/user</emph> and <emph>database/password</emph> parameters provide the default
+login information for the database server.
+Specific user names and passwords may be specified for separate applications, such
+as <emph>gcm_input</emph> and <emph>gcm_daemon</emph>.
  </para>
  
  <subsection>
@@ -217,10 +380,15 @@ transportation channel.
  Here are a few examples:
  
  <itemize>
-<item>Copied directly from a local client's file system.</item>
+<item>Obtained directly from a local client's file system.</item>
+<item>From the output of another process, through standard input.</item>
  <item>Copied remotely from a client's file system, e.g. using
-<code>ftp</code>, <code>rcp</code> or <code>scp</code>.</item>
+   <code>ftp</code>, <code>rcp</code> or <code>scp</code>.
+   This is usually handled through spooled files.
+</item>
  <item>Through an email.</item>
+<item>As a SOAP web service, carried through HTTP or SMTP.</item>
+<item>Through a TCP connection on a special socket.</item>
  </itemize>
  
  On top of that, any message may be encrypted, for example with PGP or GnuPG.
@@ -230,8 +398,9 @@ In case the message is encrypted, it may not be possible to run <strong>gcm_inpu
  in the background, since human intervention is needed to enter the secret key.
  </para>
  <para>
-The primary function of <strong>gcm_input</strong> is to store lines from a client's log files
-into the <emph>log</emph> table.
+The primary function of <strong>gcm_input</strong> is to store lines from a client's
+log files into the <emph>log</emph> table or scan a report from a probe and update
+the <emph>parameter</emph> table.
  To do this, we need certain information about the client message that is usually not
  in the content of a log file.
  This information includes:
@@ -244,7 +413,101 @@ This information includes:
  Sometimes, this information is available from the message itself, as in an email header.
  On other occasions, the information needs to be supplied externally,
  e.g. by using command line options.
+In any case, this type of 'header' information is relevant to the message
+as a whole.
+As a result, <emph>gcm_input</emph> can accept one and only one message at a time.
+For example, it is not possible to connect the standard output of
+<emph>logrunner</emph> to the standard input of <emph>gcm_input</emph> and have
+a continuous stream of messages from different log sources.
+Each message should be fed to <emph>gcm_input</emph> separately.
+Likewise, when <emph>logrunner</emph> uses a special socket to send logging data,
+a new connection must be created for each message.
+The dataflow diagram below shows how a message travels from the input source
+to the database.
  </para>
+
+<para>
+   <picture src='gcm_input-dataflow.png' eps='gcm_input-dataflow' scale='0.6'/>
+</para>
+
+<para>
+Internally, <emph>gcm_input</emph> handles <ref to='XML_input'>XML input</ref>
+and each input item must have its data fields split into appropriate XML elements.
+When data is offered in some other form, this data must be filtered
+and transformed into XML before <emph>gcm_input</emph> can handle it.
+Two levels of transformation are possible.
+At the highest level, the whole message is transformed into an XML
+document with a <code>&lt;message&gt;</code> root element and the
+appropriate <code>&lt;header&gt;</code> and <code>&lt;data&gt;</code>
+elements, all of which are put in the proper namespace.
+At the lowest level, each line of the message's data can be transformed
+into a <code>&lt;cooked&gt; &lt;log&gt;</code> element.
+Two classes of replaceable filter objects take care of these transformations.
+Depending on the content of the message and/or command line options to
+<emph>gcm_input</emph>, an appropriate filter object is inserted into
+the data stream.
+</para>
+
+<para>
+The <ref to='message_filter'><emph>message_filter</emph></ref> transforms
+the raw input data into an XML document.
+The XML document is processed by the XML parser and stored into the database
+or saved into a spool area for later processing.
+The latter happens, for example, when the database is unavailable.
+The task of the <emph>message_filter</emph> object is to create the &lt;header&gt;
+elements and the &lt;data&gt; element containing either a &lt;log&gt; or
+a &lt;parameters&gt; element, along with all their child elements.
+To do this, a <emph>message_filter</emph> object must work closely
+together with a <emph>line_cooker</emph> object.
+</para>
+<para>
+There are two major classes of <emph>message_filter</emph> objects:
+one to create a &lt;log&gt; element and one to create a &lt;parameters&gt;
+element.
+Either one of these must be capable of creating a &lt;header&gt; element which
+is filled with information from command line arguments or an email header
+in the input stream.
+The base <emph>message_filter</emph> is not much more than a short circuit,
+which merely copies the input stream into the internal XML buffer.
+This is used when the input is already in XML format.
+</para>
+
+<para>
+The <ref to='line_cooker'><emph>line_cooker</emph></ref>
+operates on a node in the DOM tree which is
+supposed to be a &lt;raw&gt; &lt;log&gt; element that contains one line
+from a log file.
+The <emph>line_cooker</emph> transforms a <emph>raw</emph> log line into
+its constituent parts that make up a &lt;cooked&gt; element.
+Since each type of log file uses a different layout and syntax,
+different line cookers can be used, depending on the type of log.
+This type is indicated by the &lt;messagetype&gt; element in the header
+part of the message.
+Clearly, the <emph>line_cooker</emph> is a polymorphic entity.
+Exactly which <emph>line_cooker</emph> is used is determined through
+<ref to='classifying'>classifying</ref>
+the content of the message or the message type indicated in the header.
+The <emph>line_cooker</emph> base class provides a default implementation
+for most methods, while derived classes provide the actual cooking.
+</para>
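A polymorphic line cooker of the kind described above might look like the following C++ sketch. Only the class name `line_cooker` and the method name `check_pattern` come from this document. The `cooked_line` struct, the `cook()` method, the `syslog_cooker` class and the regular expression are assumptions for illustration, not the actual Gnucomo code.

```cpp
#include <regex>
#include <string>

// The fields of a <cooked> element as listed in the design text.
struct cooked_line {
    std::string timestamp, hostname, service, raw;
};

// Base class with default implementations that recognize nothing.
class line_cooker {
public:
    virtual ~line_cooker() = default;
    virtual bool check_pattern(const std::string &) const { return false; }
    virtual bool cook(const std::string &, cooked_line &) const { return false; }
};

// Derived cooker for classic syslog lines such as
// "Sep  1 04:20:00 kithira kernel: solo1: unloading".
class syslog_cooker : public line_cooker {
public:
    bool check_pattern(const std::string &line) const override {
        return std::regex_match(line, pattern());
    }

    // Split the raw line into its constituent parts.
    bool cook(const std::string &line, cooked_line &out) const override {
        std::smatch m;
        if (!std::regex_match(line, m, pattern()))
            return false;
        out.timestamp = m.str(1);
        out.hostname = m.str(2);
        out.service = m.str(3);
        out.raw = line;
        return true;
    }

private:
    // Groups: 1 = date and time, 2 = host, 3 = service, 4 = optional [pid],
    // 5 = the remaining message text.
    static const std::regex &pattern() {
        static const std::regex re(
            R"(^([A-Z][a-z]{2} [ 0-9][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2}) ([^ ]+) ([^:\[ ]+)(\[[0-9]+\])?: (.*)$)");
        return re;
    }
};
```

A cooker for another log type, such as an Apache access log, would derive from the same base class and supply its own pattern.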
+
+<para>
+Output created by <emph>gcm_input</emph> for logging and debugging purposes
+can be sent to one of several destinations:
+<itemize>
+<item>standard error.</item>
+<item>a log file.</item>
+<item>the system log.</item>
+<item>an email address.</item>
+</itemize>
+The actual destination is stated in the <strong>gnucomo</strong>
+configuration file. The default is stderr.
+A <emph>log</emph> object filters output according to the debug level.
+</para>
+
+<subsection>
+<heading><label name='classifying'/>Classifying messages</heading>
+
  <para>
  Apart from determining information about the client's message, the content
  of the message needs to be analyzed in order to handle it properly.
@@ -252,12 +515,33 @@ The body of the message may contain all sorts of information, such as:
  <itemize>
  <item>System log file</item>
  <item>Apache log file</item>
-<item>Report from a Gnucomo agent</item>
+<item>Report from a Gnucomo agent or other probe, for example "rpm -qa"
+      or "df -k".</item>
+<item>Generic XML input</item>
  <item>Something else...</item>
  </itemize>
  
+Basically, <strong>gcm_input</strong> accepts two kinds of input: Log lines
+and parameter reports.
  The message is analyzed to obtain information about what the message entails
  and where it came from.
+The message classification embodies the way in which a message must be
+handled and in what way information from the message can be put into
+the database.
+Aspects for handling the message are for example:
+<itemize>
+<item>Strip lines at the beginning or end.</item>
+<item>Store each line separately or store the message as a whole.</item>
+<item>How to extract hostname, arrival time and service from the message.</item>
+<item>How to break up the message into individual fields for a <emph>log</emph> record.</item>
+</itemize>
+These aspects are all handled in polymorphic <emph>message_filter</emph>
+and <emph>line_cooker</emph> classes.
+The result of classifying a message is the selection of the proper
+objects derived from these classes from a collection of such objects.
+</para>
+
+<para>
  The <strong>classify()</strong> method tries to extract that information.
  Sometimes, this information can not be determined with absolute 100% certainty.
  The certainty expresses how sure we are about the contents in the message.
@@ -265,18 +549,44 @@ Classifying a message may be performed with an algorithm as shown in
  the following pseudo code:
  
  <verbatim>
-while certainty &lt; &epsilon; AND not at end
+uncertainty = 1.0
+
+while uncertainty &gt; &epsilon; AND not at end
  
     Scan for a marker
  
-   Adjust certainty
+   if a marker matches
+
+      uncertainty = uncertainty * P    //  P &lt; 1.0
  </verbatim>
  
-Initially, a message is not classified and the certainty is 0.0.
-Some lines point toward a certain class of message but do not absolutely determine
-the class of a message. Other pieces of text are typical for a certain message class.
+Here, <emph>uncertainty</emph> is the complement of the certainty.
+It expresses how unsure we are about the content of the message, as a
+number between 0.0 and 1.0.
+In fact, it is the probability that the message is not what we think it is.
+Initially, a message is not classified and the uncertainty is 1.0.
+Some lines point toward a certain type of message but do not absolutely determine
+the type of a message. Other pieces of text are typical for a certain message type.
+Such pieces of text, called <emph>markers</emph>, are discovered in a message,
+possibly by using regular expression matches.
  Examples of markers that determine the classification of a client message
  are discussed below.
+</para>
+
+<para>
+To determine the message type, <strong>classify()</strong> uses the collection
+of <ref to='line_cooker'><emph>line_cooker</emph></ref> objects and maintains
+the uncertainty associated with each <emph>line_cooker</emph> object.
+A line of input from the message is tested using the <emph>line_cooker::check_pattern</emph>
+method for each <emph>line_cooker</emph> object.
+When a marker matches, we are a bit more sure about the content of the message
+and the uncertainty for that <emph>line_cooker</emph> object decreases by
+multiplying the uncertainty by <strong>P</strong>, a number between 0 and 1.
+This process continues line after line from the input message until the
+uncertainty for one of the <emph>line_cooker</emph> objects is sufficiently low
+(i.e. less than a preset threshold, &epsilon;).
+At the end, the <emph>line_cooker</emph> object with the lowest uncertainty
+is selected.
  
  <verbatim>
  From - Sat Sep 14 15:01:15 2002
@@ -299,6 +609,14 @@ encrypted.
  Decrypting is possible only if someone or something provides a secret key.
  
  <verbatim>
+&lt;?xml version='1.0'?&gt;
+</verbatim>
+
+The XML header declares the message to be generic XML input.
+The structure of the XML message that <strong>gcm_input</strong> accepts
+is described in the next section.
+
+<verbatim>
  Sep  1 04:20:00 kithira kernel: solo1: unloading
  </verbatim>
  
@@ -309,58 +627,441 @@ We can match this with a regular expression to see if the message holds syslog l
  Similar matches can be used to find Apache log lines or output from the <emph>dump</emph>
  backup program or anything else.
  </para>
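The classification loop from the pseudo code can be sketched in C++ as follows. This is a simplified illustration under stated assumptions: the `candidate` struct, the free function `classify()`, the helper `default_candidates()` and the two marker patterns are hypothetical and stand in for the collection of line cookers; the values of P and ε are chosen by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// One possible message type with its marker and current uncertainty.
struct candidate {
    std::string type;
    std::regex marker;
    double uncertainty;

    candidate(std::string t, const std::string &pattern)
        : type(std::move(t)), marker(pattern), uncertainty(1.0) {}
};

// Two illustrative markers: a syslog time stamp and an XML declaration.
std::vector<candidate> default_candidates() {
    std::vector<candidate> c;
    c.emplace_back("system log",
                   R"(^[A-Z][a-z]{2} [ 0-9][0-9] [0-9]{2}:[0-9]{2}:[0-9]{2} )");
    c.emplace_back("generic XML", R"(^<\?xml )");
    return c;
}

// Scan the lines; each marker match multiplies that candidate's
// uncertainty by p (0 < p < 1). Stop early once any candidate drops
// below epsilon. Returns the index of the least uncertain candidate.
std::size_t classify(std::vector<candidate> &cands,
                     const std::vector<std::string> &lines,
                     double p, double epsilon) {
    assert(!cands.empty() && p > 0.0 && p < 1.0);
    bool sure_enough = false;
    for (const std::string &line : lines) {
        for (candidate &c : cands) {
            if (std::regex_search(line, c.marker)) {
                c.uncertainty *= p;
                if (c.uncertainty < epsilon)
                    sure_enough = true;
            }
        }
        if (sure_enough)
            break;
    }
    std::size_t best = 0;
    for (std::size_t i = 1; i < cands.size(); ++i)
        if (cands[i].uncertainty < cands[best].uncertainty)
            best = i;
    return best;
}
```

With P = 0.5, two matching syslog lines reduce the uncertainty of the "system log" candidate from 1.0 to 0.25, while candidates whose markers never match stay at 1.0.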
+</subsection>
+
+<subsection>
+<heading><label name='XML_input'/>Generic XML input</heading>
  
  <para>
-The message classification embodies the way in which a message must be
-handled and in what way information from the message can be put into
-the database.
-Aspects for handling the message are for example:
+
+Since <strong>gcm_input</strong> can not understand every conceivable form
+of input, a client can offer its input in a more generic form which reflects
+the structure of the Gnucomo database.
+In this case, the input is structured in an XML document that contains the input
+data in a form that allows <strong>gcm_input</strong> to store the information
+into the database without knowing the nature of the input.
+The XML root element for <strong>gcm_input</strong> is a <emph>&lt;message&gt;</emph>, defined
+in the namespace with namespace name <code>http://gnucomo.org/transport/</code>.
+All other elements and attributes of the <emph>&lt;message&gt;</emph> must be defined
+within this namespace.
+</para>
+<para>
+Within the <emph>&lt;message&gt;</emph> element there is a <emph>&lt;header&gt;</emph>
+and a <emph>&lt;data&gt;</emph> element.
+The <emph>&lt;data&gt;</emph> element may contain the log data in an externally
+specified format.
+The <emph>&lt;header&gt;</emph> element contains a number of elements (fields), some
+mandatory, some optional. The text of the element contains the value of
+the element.
+The following elements have been defined:
+
  <itemize>
-<item>Strip lines at the beginning or end.</item>
-<item>Store each line separately or store the message as a whole.</item>
-<item>How to extract hostname, arrival time and service from the message.</item>
-<item>How to break up the message into individual fields for a <emph>log</emph> record.</item>
+<item>
+<emph>&lt;messagetype&gt;</emph> mandatory
+     <para>
+      The type (format) of the log data in the data element. The message type
+      determines the way in which raw log elements are parsed and split up
+      into separate fields for insertion into the database.
+      The message types gcm_input understands are:
+      <itemize>
+      <item><code>system log</code> : The most common form of UNIX system logs.
+            Also used in most Linux distributions.
+      </item>
+      <item><code>IRIX system log</code> : Variation of system log, used by SGI.
+      </item>
+      <item><code>apache access log</code> : Access log of the Apache http daemon,
+            in default form.
+      </item>
+      <item><code>apache error log</code> : Error log of the Apache http daemon,
+            in default form.
+      </item>
+      </itemize>
+      There must also be a 'generic' system log in case all elements are
+      cooked already.
+     </para>
+</item>
+<item>
+<emph>&lt;hostname&gt;</emph> mandatory
+     <para>
+      The name of the system that generated the data in the data block.
+      This can be different from the computer composing the message.
+     </para>
+</item>
+<item>
+<emph>&lt;service&gt;</emph> optional
+     <para>
+      The (default) value of the service running on the host that
+      generated the message data. Used for log files that do not contain an
+      embedded service name.
+     </para>
+</item>
+<item>
+<emph>&lt;time&gt;</emph> optional
+     <para>
+      The best approximation to the time that the data was generated.
+      Used for (log) data that does not contain an embedded date stamp.
+     </para>
+</item>
  </itemize>
+
+The following example shows an XML message for <strong>gcm_input</strong>
+with a filled-in header and an empty <emph>&lt;data&gt;</emph> element:
+
+<verbatim>
+  &lt;gcmt:message xmlns:gcmt='http://gnucomo.org/transport/'&gt;
+     &lt;gcmt:header&gt;
+        &lt;gcmt:messagetype&gt;apache error log&lt;/gcmt:messagetype&gt;
+        &lt;gcmt:hostname&gt;client.gnucomo.org&lt;/gcmt:hostname&gt;
+        &lt;gcmt:service&gt;httpd&lt;/gcmt:service&gt;
+        &lt;gcmt:time&gt;2003-04-17 14:40:46.312895+01:00&lt;/gcmt:time&gt;
+     &lt;/gcmt:header&gt;
+     &lt;gcmt:data/&gt;
+  &lt;/gcmt:message&gt;
+</verbatim>
  </para>
  
  <para>
+The <emph>data</emph> element can hold one of two possible child
+elements: <emph>&lt;log&gt;</emph> or <emph>&lt;parameters&gt;</emph>.
+The <emph>&lt;log&gt;</emph> element may contain any number of lines from
+a system's log file, each line in a separate element.
+A single log line is the content of either a <emph>&lt;raw&gt;</emph> or
+a <emph>&lt;cooked&gt;</emph> element.
+The <emph>&lt;raw&gt;</emph> element contains the log line "as is" and nothing more.
+This is the easiest way to provide XML data for <strong>gcm_input</strong>.
+However, the log line itself must be in a form that <strong>gcm_input</strong>
+can understand.
+After all, <strong>gcm_input</strong> still needs to extract meaningful information
+from that line, such as the time stamp and the service that created the log.
+The client can also choose to provide that information separately by encapsulating
+the log line in a <emph>&lt;cooked&gt;</emph> element.
+This element may have up to four child elements, two of which are mandatory:
+<itemize>
+<item><emph>&lt;timestamp&gt;</emph> mandatory.
+   <para>
+   The time at which the log line was generated by the client.
+   </para>
+</item>
+<item><emph>&lt;hostname&gt;</emph> optional.
+   <para>
+   For logs that include a hostname in each line. This hostname is checked
+   against the hostname in the <emph>&lt;header&gt;</emph> element.
+   </para>
+</item>
+<item><emph>&lt;service&gt;</emph> optional.
+   <para>
+   If the service that generated the log is not provided in the <emph>&lt;header&gt;</emph>
+   the service must be stated for each log line separately.
+   Otherwise, each log line is assumed to be generated by the same service.
+   </para>
+</item>
+<item><emph>&lt;raw&gt;</emph> mandatory.
+   <para>
+   The content of the full log line. This would have the same content as the singular
+   <emph>&lt;raw&gt;</emph> element if the log line was not provided in a
+   <emph>&lt;cooked&gt;</emph> element.
+   </para>
+</item>
+</itemize>
+The following shows an example of the log message with two lines in the
+<emph>&lt;log&gt;</emph> element, one raw and one cooked:
+
+<verbatim>
+  &lt;gcmt:data xmlns:gcmt='http://gnucomo.org/transport/'&gt;
+    &lt;gcmt:log&gt;
+     &lt;gcmt:raw&gt;
+       Apr 13 04:31:03 schiza kernel: attempt to access beyond end of device
+     &lt;/gcmt:raw&gt;
+     &lt;gcmt:cooked&gt;
+        &lt;gcmt:timestamp&gt;2003-04-13 04:31:03+02:00&lt;/gcmt:timestamp&gt;
+        &lt;gcmt:hostname&gt;schiza&lt;/gcmt:hostname&gt;
+        &lt;gcmt:service&gt;kernel&lt;/gcmt:service&gt;
+        &lt;gcmt:raw&gt;
+         Apr 13 04:31:03 schiza kernel: 03:05: rw=0, want=1061109568, limit=2522173
+        &lt;/gcmt:raw&gt;
+     &lt;/gcmt:cooked&gt;
+    &lt;/gcmt:log&gt;
+  &lt;/gcmt:data&gt;
+</verbatim>
+</para>
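On the client side, wrapping raw log lines in such a message takes little more than string handling. The C++ sketch below builds a message with the two mandatory header elements and one `raw` element per log line. The function names `xml_escape` and `compose_message` are hypothetical and not part of Gnucomo; only the element names and the namespace are taken from the text above.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Escape the characters that may not appear literally in element content.
std::string xml_escape(const std::string &text) {
    std::string out;
    for (char c : text) {
        switch (c) {
        case '&': out += "&amp;"; break;
        case '<': out += "&lt;";  break;
        case '>': out += "&gt;";  break;
        default:  out += c;
        }
    }
    return out;
}

// Compose a message with the mandatory header elements and a <log>
// element that holds each log line in its own <raw> element.
std::string compose_message(const std::string &messagetype,
                            const std::string &hostname,
                            const std::vector<std::string> &raw_lines) {
    std::ostringstream xml;
    xml << "<gcmt:message xmlns:gcmt='http://gnucomo.org/transport/'>\n"
        << "  <gcmt:header>\n"
        << "    <gcmt:messagetype>" << xml_escape(messagetype)
        << "</gcmt:messagetype>\n"
        << "    <gcmt:hostname>" << xml_escape(hostname)
        << "</gcmt:hostname>\n"
        << "  </gcmt:header>\n"
        << "  <gcmt:data>\n"
        << "    <gcmt:log>\n";
    for (const std::string &line : raw_lines)
        xml << "      <gcmt:raw>" << xml_escape(line) << "</gcmt:raw>\n";
    xml << "    </gcmt:log>\n"
        << "  </gcmt:data>\n"
        << "</gcmt:message>\n";
    return xml.str();
}
```

Escaping matters here because log lines routinely contain characters such as `<` and `&`, which would otherwise produce a document the XML parser rejects.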
+
+<para>
+The <emph>&lt;parameters&gt;</emph> element contains a list of parameters
+of the same class. The class is provided as an attribute in the
+<emph>&lt;parameters&gt;</emph> open tag.
+There is a <emph>&lt;parameter&gt;</emph> element for each parameter in the list.
+The child elements of a <emph>&lt;parameter&gt;</emph> are one optional
+<emph>&lt;description&gt;</emph> element and zero or more <emph>&lt;property&gt;</emph>
+elements.
+The names of a parameter and a property are provided by the mandatory <emph>name</emph>
+attributes in the respective elements.
+The following example shows a possible parameter report from a "df -k":
+<verbatim>
+  &lt;gcmt:data xmlns:gcmt='http://gnucomo.org/transport/'&gt;
+    &lt;gcmt:parameters gcmt:class='filesystem'&gt;
+      &lt;gcmt:parameter gcmt:name='root'&gt;
+         &lt;gcmt:description&gt;Root filesystem&lt;/gcmt:description&gt;
+         &lt;gcmt:property gcmt:name='size'&gt;303344&lt;/gcmt:property&gt;
+         &lt;gcmt:property gcmt:name='used'&gt;104051&lt;/gcmt:property&gt;
+         &lt;gcmt:property gcmt:name='available'&gt;183632&lt;/gcmt:property&gt;
+         &lt;gcmt:property gcmt:name='device'&gt;/dev/hda1&lt;/gcmt:property&gt;
+         &lt;gcmt:property gcmt:name='mountpoint'&gt;/&lt;/gcmt:property&gt;
+      &lt;/gcmt:parameter&gt;
+      &lt;gcmt:parameter gcmt:name='usr'&gt;
+         &lt;gcmt:description&gt;Usr filesystem&lt;/gcmt:description&gt;
+         &lt;gcmt:property gcmt:name='size'&gt;5044188&lt;/gcmt:property&gt;
+         &lt;gcmt:property gcmt:name='used'&gt;3073716&lt;/gcmt:property&gt;
+         &lt;gcmt:property gcmt:name='available'&gt;1714236&lt;/gcmt:property&gt;
+         &lt;gcmt:property gcmt:name='device'&gt;/dev/hdd2&lt;/gcmt:property&gt;
+         &lt;gcmt:property gcmt:name='mountpoint'&gt;/usr&lt;/gcmt:property&gt;
+      &lt;/gcmt:parameter&gt;
+    &lt;/gcmt:parameters&gt;
+  &lt;/gcmt:data&gt;
+</verbatim>
+</para>
+
+</subsection>
+
+<subsection>
+<heading>Gcm_input classes</heading>
+<para>
  The figure below shows the class diagram that is used for <strong>gcm_input</strong>:
     <para>
-   <picture src='classes-gcm_input.png' eps='classes-gcm_input'/>
+   <picture src='classes-gcm_input.png' eps='classes-gcm_input' scale='0.8'/>
     </para>
  
  The heart of the application is a <emph>client_message</emph> object.
-This object reads the message through the <emph>message_buffer</emph> from some
+This object reads the message through the
+<ref to='message_buffer'><emph>message_buffer</emph></ref> from some
  input stream (file, string, stdin or socket), classifies the message and
  enters information from the message into the database.
-It has a relationship with a <emph>gnucomo_database</emph> object which
+The <emph>client_message</emph> object holds a collection of <emph>message_filter</emph>
+and associated <emph>line_cooker</emph> objects.
+The association is maintained in a <emph>xform</emph> object.
+Note that several <emph>line_cooker</emph> objects can be associated
+with a single <emph>message_filter</emph> object.
+For example, a system log or a web server log are processed in a similar
+manner, i.e. each line is transformed into a &lt;log&gt; element.
+The patterns of the individual lines, however are entirely different.
+During the classification of the input data, one combination of a
+<emph>message_filter</emph> and <emph>line_cooker</emph> is selected.
+The classification process works by calculating the uncertainty with which
+a <emph>line_cooker</emph> matches with the input data.
+The one with the least uncertainty is selected.
+</para>
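The selection rule described above can be sketched in C++ as follows. This is only an illustration, not the gcm_input implementation: the names <code>cooker_sketch</code>, <code>candidate</code> and <code>select_least_uncertain</code> are hypothetical, the two pattern checks are deliberately crude, and the uncertainty is assumed to be the fraction of sampled lines that a cooker fails to match (the text does not state how it is actually calculated).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a line_cooker: only the pattern check is modelled.
struct cooker_sketch {
    std::string type;
    bool (*check_pattern)(const std::string &line);
};

// A cooker together with its current uncertainty (1.0 means nothing is known yet).
struct candidate {
    cooker_sketch cooker;
    double uncertainty;
};

// Crude assumption: system log lines start with a three-letter month and a space.
inline bool looks_like_syslog(const std::string &line) {
    return line.size() > 3 && line[3] == ' ' &&
           std::isalpha(static_cast<unsigned char>(line[0])) != 0;
}

// Crude assumption: web server log lines hold a bracketed time and a quoted request.
inline bool looks_like_access_log(const std::string &line) {
    return line.find('[') != std::string::npos &&
           line.find('"') != std::string::npos;
}

// Score every candidate against the first few lines and return the index of
// the one with the least uncertainty, or -1 if there are no candidates.
inline int select_least_uncertain(std::vector<candidate> &candidates,
                                  const std::vector<std::string> &first_lines) {
    int best = -1;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        std::size_t misses = 0;
        for (const std::string &line : first_lines) {
            if (!candidates[i].cooker.check_pattern(line)) ++misses;
        }
        if (!first_lines.empty()) {
            candidates[i].uncertainty =
                static_cast<double>(misses) / first_lines.size();
        }
        if (best < 0 ||
            candidates[i].uncertainty < candidates[best].uncertainty) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

With one candidate for system logs and one for web server logs, a handful of syslog lines drives the first candidate's uncertainty to zero while the second stays at 1.0, so the first one is selected.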
+
+<subsubsection>
+<heading><label name='client_message'/>client_message</heading>
+<para>
+The <emph>client_message</emph> has a relationship with a <emph>gnucomo_database</emph> object which
  is an abstraction of the tables in the database.
  These are the methods for the <emph>client_message</emph> class:
  
  <itemize>
-<item>client_message::client_message(istream *in, gnucomo_database *db)
+<item><code>client_message::client_message(istream *in, gnucomo_database *db)</code>
     <para>
     Constructor.
     </para>
  </item>
-<item>double client_message::classify(String host, date arrival_d, hour arrival_t, String serv)
+<item><code>void add_cooker(line_cooker *lc, message_filter *mf)</code>
+  <para>
+  Add another <emph>line_cooker</emph> object with the associated
+  <emph>message_filter</emph> object to the collection.
+  This initializes the uncertainty with which the <emph>line_cooker</emph>
+  is selected to 1.0.
+  </para>
+</item>
+<item><code>double client_message::classify(String host, date arrival_d,
+                                             hour arrival_t, String serv)</code>
    <para>
    Try to classify the message and return the certainty with which the class of the
    message could be determined.
    If the hostname, arrival time and service can not be extracted from the message,
    use the arguments as default.
+  This will examine the first few lines of the input data
+  to select one of the <emph>message_filter</emph> with
+  associated <emph>line_cooker</emph> objects
+  from the collection built with the <emph>add_cooker</emph> method.
    </para>
  </item>
-<item>int enter()
+<item><code>int enter()</code>
    <para>
-  Insert the message contents into the <emph>log</emph> table of the gnucomo
-  database.
+  Insert the message contents into the gnucomo database.
    Returns the number of records inserted.
+  The input data from the <emph>message_buffer</emph>
+  is first transformed into an XML document (a strstream object)
+  by invoking the <emph>message_filter</emph> and <emph>line_cooker</emph> objects.
+  The XML document in the internal buffer is then parsed into an
+  XML DOM tree, using the Gnome XML parser.
+  The XML document may also be validated against an XML Schema definition.
    </para>
+  <para>
+  After extracting and checking the &lt;header&gt; elements,
+  the data nodes are extracted and inserted into the database,
+  possibly using a <emph>line_cooker</emph> object to cook raw
+  log elements.
+  If an error occurs in some stage of this process, the XML document
+  is dumped in a spool area for later processing.
+  </para>
+</item>
+</itemize>
+
+</para>
+</subsubsection>
+
+<subsubsection>
+<heading><label name='message_filter'/>message_filter</heading>
+<para>
+A <emph>message_filter</emph> transforms the raw input data into an
+XML document suitable for further parsing and storage into the database.
+The base class, <emph>message_filter</emph>, does nothing but copy the
+input stream into an internal XML buffer.
+An object of this class is used when the input is already in XML format.
+Classes derived from <emph>message_filter</emph> read the input line by line,
+possibly extracting information from an email header if available.
+<itemize>
+<item>
+   <code>message_filter::contruct_XML(message_buffer &amp;in, strstream &amp;xml)</code>
+   <para>
+   Simply copy the input stream into the internal XML buffer.
+   The base class function is used when the input is already in XML format.
+   Derived classes will override this function.
+   </para>
  </item>
  </itemize>
+</para>
+
+<para>
+Classes derived from <emph>message_filter</emph> transform various kinds
+of input into an XML document. The following diagram shows some examples:
+
+   <para>
+   <picture src='classes-message_filter.png' eps='classes-message_filter' scale='0.8'/>
+   </para>
  
+The two classes derived directly from <emph>message_filter</emph> reflect
+that <strong>gcm_input</strong> can handle two kinds of input:
+<emph>log_filter</emph> for log files
+and <emph>parameters_filter</emph> for parameter reports.
+Each of these types of input is transformed into an entirely different XML
+document and is stored quite differently in the database.
+Classes that are derived further down the hierarchy will handle more specific
+forms of input.
  </para>
+</subsubsection>
+
+<subsubsection>
+<heading><label name='line_cooker'/>line_cooker</heading>
+<para>
+To turn a raw line from a log file into separate parts that can be stored
+in the database, i.e. parse the line, the <emph>client_message</emph>
+object uses a <emph>line_cooker</emph> object.
+This is a polymorphic object, so each type of log can have its own parser,
+while the <emph>client_message</emph> object uses a common interface
+for each one.
+For each message, one specific <emph>line_cooker</emph> object is
+selected as determined by the message type.
+E.g., the derived class <emph>syslog_cooker</emph> is used for system logs.
+When the <emph>client_message</emph> object encounters a <strong>raw</strong>
+log element, it takes the following steps to turn this into a
+<strong>cooked</strong> log element:
+<enumerate>
+<item>
+  Remove the TEXT node of the "raw" element and save its content.
+</item>
+<item>
+  Change the name of the element into "cooked".
+</item>
+<item>
+  Check if the content matches the syntax of the type of log we're
+  processing at the moment.
+  This depends on the message type and is therefore a task for the
+  <emph>line_cooker</emph> object.
+</item>
+<item>
+  Have the <emph>line_cooker</emph> parse the content and extract
+  the time stamp and optionally the hostname and service.
+</item>
+<item>
+  Insert new child elements into the cooked element.
+</item>
+</enumerate>
+After that, the <emph>cooked</emph> element is ready for further processing
+and possibly storing into the database.
+</para>
+
+<para>
+The <emph>line_cooker</emph> base class holds three protected members
+that must be filled with information by the derived classes:
+<itemize>
+<item><code>UTC ts</code> : the timestamp.</item>
+<item><code>String hn</code> : the hostname</item>
+<item><code>String srv</code> : the service</item>
+</itemize>
+Corresponding base class methods (<emph>timestamp</emph>,
+<emph>hostname</emph> and <emph>service</emph>) will do
+nothing more than return these values.
+It is up to the derived class's <emph>cook_this</emph> method
+to properly initialize these members.
+</para>
+
+<para>
+The methods for the <emph>line_cooker</emph> class are:
+<itemize>
+<item>
+  <code>bool line_cooker::check_pattern(String logline)</code>
+  <para>
+  Tries to match the <emph>logline</emph> against the patterns
+  that describe the message syntax.
+  In the <emph>line_cooker</emph> base class, this is a pure virtual function.
+  Returns true if the log line adheres to the message-specific syntax.
+  </para>
+</item>
+<item>
+  <code>bool line_cooker::cook_this(String logline, UTC arrival)</code>
+  <para>
+  Extracts information from the <emph>logline</emph>.
+  The <emph>arrival</emph> can be used to make corrections to incomplete
+  time stamps in the log file.
+  The implementation in a derived class must properly initialize the protected
+  members described above.
+  In the <emph>line_cooker</emph> base class, this is a pure virtual function.
+  Returns false if the pattern does not match.
+  </para>
+</item>
+<item>
+  <code>String line_cooker::message_type()</code>
+  <para>
+  Return the message type for which this line cooker is intended.
+  </para>
+</item>
+<item>
+  <code>UTC line_cooker::timestamp()</code>
+  <para>
+  Returns the timestamp that was previously extracted from the log line
+  or a 'null' timestamp if that information could not be extracted.
+  </para>
+</item>
+<item>
+   <code>String line_cooker::hostname()</code>
+  <para>
+  Returns the hostname that was previously extracted from the log line,
+  or an empty string if that information could not be extracted.
+  </para>
+</item>
+<item>
+   <code>String line_cooker::service()</code>
+  <para>
+  Returns the service that was previously extracted from the log line,
+  or an empty string if that information could not be extracted.
+  </para>
+</item>
+</itemize>
+</para>
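The interface described above might look roughly like the following C++ sketch. Several things here are assumptions made to keep the example self-contained: <code>std::string</code> and <code>std::time_t</code> stand in for the <code>String</code> and <code>UTC</code> types, <code>message_type</code> is omitted, and <code>simple_cooker</code> with its "host service: text" line format is invented purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <string>

// Sketch of the base class: two pure virtual functions and three accessors
// that merely return the protected members.
class line_cooker_sketch {
public:
    virtual ~line_cooker_sketch() {}
    virtual bool check_pattern(const std::string &logline) = 0;
    virtual bool cook_this(const std::string &logline, std::time_t arrival) = 0;

    std::time_t timestamp() const { return ts; }
    std::string hostname() const { return hn; }
    std::string service() const { return srv; }

protected:
    std::time_t ts = 0;   // the timestamp (0 plays the role of 'null' here)
    std::string hn;       // the hostname
    std::string srv;      // the service
};

// Hypothetical derived class for lines of the form "<host> <service>: <text>".
class simple_cooker : public line_cooker_sketch {
public:
    bool check_pattern(const std::string &logline) override {
        std::size_t space = logline.find(' ');
        if (space == std::string::npos || space == 0) return false;
        std::size_t colon = logline.find(':', space + 1);
        return colon != std::string::npos && colon > space + 1;
    }

    bool cook_this(const std::string &logline, std::time_t arrival) override {
        if (!check_pattern(logline)) return false;
        std::size_t space = logline.find(' ');
        std::size_t colon = logline.find(':', space + 1);
        hn = logline.substr(0, space);
        srv = logline.substr(space + 1, colon - space - 1);
        ts = arrival;  // this format has no time stamp: fall back on arrival
        return true;
    }
};
```

Because the accessors live in the base class, code that holds only a pointer to the base class can read the timestamp, hostname and service without knowing which derived cooker filled them in.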
+</subsubsection>
+
+<subsubsection>
+<heading><label name='message_buffer'/>message_buffer</heading>
  <para>
  Some kind of input buffering is needed when a client message is being processed.
  The contents of the message are not entirely clear until a few lines are analyzed,
@@ -412,10 +1113,11 @@ rewinding to the beginning of the message.
  Methods for the <emph>message_buffer</emph> class:
  
  <itemize>
-<item>message_buffer::message_buffer(istream *in)</item>
+<item>message_buffer::message_buffer(istream *in)
    <para>
    Constructor.
    </para>
+</item>
  <item>bool operator &gt;&gt;(message_buffer &amp;, String &amp;)
  </item>
  <item>message_buffer::rewind()</item>
@@ -423,6 +1125,10 @@ Methods for the <emph>message_buffer</emph> class:
  <item>message_buffer::operator ++</item>
  </itemize>
  </para>
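The buffering idea can be illustrated with a small C++ sketch. It is not the actual <emph>message_buffer</emph> class: the name <code>rewindable_lines</code> and the <code>read</code> method are hypothetical (the real class uses <code>operator &gt;&gt;</code>), and only line-wise reading and rewinding are modelled. Every line taken from the input stream is cached, so the message can be read again from the start even when the underlying stream (stdin or a socket, for instance) cannot seek.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical rewindable line reader over an arbitrary input stream.
class rewindable_lines {
public:
    explicit rewindable_lines(std::istream &in) : input(in), next(0) {}

    // Return the next line, from the cache if it was read before,
    // otherwise from the stream. Returns false at end of input.
    bool read(std::string &line) {
        if (next < cache.size()) {
            line = cache[next++];
            return true;
        }
        if (!std::getline(input, line)) return false;
        cache.push_back(line);
        ++next;
        return true;
    }

    // Start again at the first line of the message.
    void rewind() { next = 0; }

private:
    std::istream &input;
    std::vector<std::string> cache;  // all lines read so far
    std::size_t next;                // index of the next line to hand out
};
```

A classifier can thus peek at the first few lines, call <code>rewind()</code>, and let the selected filter process the whole message from the beginning.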
+
+</subsubsection>
+</subsection>
+
  <subsection>
  
  <heading>Command arguments</heading>
@@ -442,50 +1148,43 @@ Methods for the <emph>message_buffer</emph> class:
  </section>
  
  <section>
-<heading>Design ideas</heading>
-
-<para>
-Use of a neural network to analyze system logs:
-<itemize>
-<item>Classify words</item>
-<item>Classify message based on word classification</item>
-</itemize>
-</para>
-
-<header>gcm_daemon</header>
+<heading>gcm_daemon</heading>
  <para>
-<strong>Gcm_daemon</strong> is the application that process data just
+<strong>Gcm_daemon</strong> is the application that processes data just
  arrived in the database.
  It handles the log-information delivered by <strong>gcm_input</strong>
-in the table <strong>log</strong>-table of the database.
+in the <emph>log</emph> table of the database.
  With the data further storage and classification can be done.
  Where <strong>gcm_input</strong> is a highly versatile application that is
-loaded and ended all the time the daemon is continously available monitoring
+loaded and ended all the time the daemon is continuously available monitoring
  the entire system. Basically the daemon monitors everything that happens
-in the database and executes continous checks and processes all the data.
+in the database and executes continuous checks and processes all the data.
  The two applications (gcm_input and gcm_daemon) together are the core of the central system. 
  The application has the following tasks:
  <itemize>
  <item>Processing data into other tables so that easy detection can take place</item>
  <item>Raising notifications based on the available input</item>
  <item>Maintain the status of notifications and changing priority when needed</item>
-<item>Priodically perform checks for alerts that are communicated through the notification-table</item>
+<item>Periodically perform checks for alerts that are communicated through the notification-table</item>
  <item>Perform updates on the database when a new version of the software is loaded</item>
  </itemize>
  </para>
  
-</section>
  <subsection>
-<header>Performing checks</header>
-<para>One of the most difficult tasks for the daemon is performing the automatic checks.
+<heading>Performing checks</heading>
+
+<para>
+One of the most difficult tasks for the daemon is performing the automatic checks.
  Every check is different and will be made up of several parameters that have to test negative.
-This that makes it hard to define this into software.
+That makes it hard to define this in software.
  Another downside is that some work may be very redundant.
  For that reason a more generic control structure is needed based on the technologies used.
  The logical choice is then to focus on the capabilities in the database and perform
-the job by executing queries.</para>
-<para>Since the system is about detecting problems and issues we build the detection in
-such a way that queries on the database result in 'suspicious' logrecords.
+the job by executing queries.
+</para>
+<para>
+Since the system is about detecting problems and issues we build the detection in
+such a way that queries on the database result in 'suspicious' log records.
  So called 'innocent' records can be ignored. So if a query gives a result a
  problem is present, if there is no result there isn't a problem.
  As soon as we seek for common ground in the process of identifying problems
@@ -494,50 +1193,78 @@ it can be said that all results are based on the log-table
  will arrive and stored for later use).
  Furthermore there are two ways of determining if a problem is present: 
  <itemize>
-<item>A single log-record or a group of log-records is within or outside the boundaries set.
-If it is outside the boundaries the logrecord(s) is/are a potential problem.
+<item>
+A single log-record or a group of log-records is within or outside the boundaries set.
+If it is outside the boundaries the log record(s) is/are a potential problem.
  If there are more boundaries set all of these need to be applied.
-Based on fixed data results can be derived.</item>
-<item>A set of records outline a trend that throughout time may turnout
+Based on fixed data results can be derived.
+</item>
+<item>
+A set of records outlines a trend that over time may turn out
 to be a problem. These types of values are not fixed and directly legible
-but more or less derived data. That data is input for some checks (previous bullet).</item>
+but more or less derived data. That data is input for some checks (previous bullet).
+</item>
  </itemize>
  In both cases a set of queries can be run.
  If there are more queries to be executed the later queries can be executed on
  only the results. For that reason intermediate results have to be stored in a
  temporary table for later reuse.
 Saving a session in combination with the found log-records is sufficient.
-This is also true since logrecords are the basis of all derived presences in
+This is also true since log records are the basis of all derived presences in
  the numerous log_adv_xxx-tables and always have a reference to the log-table. 
  </para>
-<para>Building the checks will thus be nothing more than combining queries
+<para>
+Building the checks will thus be nothing more than combining queries
  and adding a classification to the results of that query.
-If this generic structure is being build properly with a simple (easy to understand) interface,
-many combinations can be made. People having a logic correct idea,
+If this generic structure is being built properly with a simple (easy to understand) interface,
+many combinations can be made. People having a logically correct idea,
  but insufficient skills to program will be able to build checks.
  As a consequence we can offer the interface to the user,
  that in turn can also make particular checks for the environment that is unique to him/her.
  This - of course - doesn't mean that a clear SQL-interface shouldn't be offered.
  </para>
-<para>When ever something happens, that is less than standard a line will be written to the syslogd.
-This will enable users and developpers to trace exactly what happened.
+<para>
+Whenever something out of the ordinary happens, a line will be written to syslogd.
+This will enable users and developers to trace exactly what happened.
  The gcm_daemon will also log startup and ending so that abrupt endings of the daemon will be detected.
-<subsubsection>
-<header>The initial process</header>
+</para>
+</subsection>
+
+<subsection>
+<heading>The initial process</heading>
  <para>
  When gcm_daemon starts first some basic actions take place that go beyond
  just opening a connection to the database. The following actions also need to take place:
  <itemize>
-<item>Check the database version if it is still the most recent version.
+<item>
+Check whether the database version is still the most recent version.
  The daemon will check the version-number of the database.
  If the database is not the same version as gcm_daemon an update will be performed.
  When the database is up-to-date normal processing will continue.
-<item>If the database reflects that the used version gcm_daemon is less recent than
+</item>
+<item>
+If the database reflects that the version of gcm_daemon used before is less recent than
 the running version (i.e. a new version has been installed) all records
 in the log-table that weren't recognized before will now be set to unprocessed since
 there is a fair chance that they might be recognized this time. This will ensure that no data is lost.
+</item>
  </itemize>
+</para>
+
  </subsection>
+</section>
+
+<section>
+<heading>Design ideas</heading>
+
+<para>
+Use of a neural network to analyze system logs:
+<itemize>
+<item>Classify words</item>
+<item>Classify message based on word classification</item>
+</itemize>
+</para>
+</section>
  
  </chapter>