Prepared for release 0.0.4

[gnucomo.git] / doc / design.xml
diff --git a/doc/design.xml b/doc/design.xml

index f14ef8f..505bd03 100644
--- a/doc/design.xml
+++ b/doc/design.xml
@@ -6,7 +6,7 @@
  <!--
        XML documentation system
        Original author :  Arjen Baart - arjen@andromeda.nl
-      Version         : $Revision: 1.1 $
+      Version         : $Revision: 1.4 $
  
        This document is prepared for XMLDoc. Transform to HTML,
        LaTeX, Postscript or plain text with XMLDoc utilities and
@@ -16,11 +16,15 @@
  <book>
  <titlepage>
     <title>Gnucomo - Computer Monitoring</title>
+   <subtitle>Design description</subtitle>
+<!--
+   <para><picture src='logo.png' eps='logo' scale='0.7'/></para>
+-->
     <author>Arjen Baart <code>&lt;arjen@andromeda.nl&gt;</code></author>
     <author>Brenno de Winter<code>&lt;brenno@dewinter.com&gt;</code></author>
-   <date>July 12, 2002</date>
+   <date>December 17, 2002</date>
     <docinfo>
-      <infoitem label="Version">0.1</infoitem>
+      <infoitem label="Version">0.2</infoitem>
        <infoitem label="Organization">Andromeda Technology &amp; Automation</infoitem>
        <infoitem label="Organization">De Winter Information Solutions</infoitem>
     </docinfo>
@@ -46,17 +50,17 @@ and is based upon the development manifest.
  
  <para>
  The architecture of <strong>gnucomo</strong> is shown in the
-dataflow diagram below:
+data flow diagram below:
  </para>
  
  <para>
-   <picture src='dataflow.png' eps='dataflow.eps'/>
+   <picture src='dataflow.png' eps='dataflow' scale='0.7'/>
  </para>
  
  <para>
  Architectural items to consider:
  <itemize>
-<item>Active and passive data aquisition</item>
+<item>Active and passive data acquisition</item>
  <item>Monitoring static and dynamic system parameters</item>
  <item>Upper and lower limits for system parameters</item>
  </itemize>
@@ -81,12 +85,39 @@ database as described in the manifest.
  <heading>Database design</heading>
  
  <para>
-Log entries are stored in a database with at least the following fields:
+The design of the database is described extensively in
+<reference href="manifest.html">the Manifest</reference>.
+Assuming development is done on the same system on which the real (production)
+gnucomo database is maintained, there is a need for a separate database
+on which to perform development and integration tests.
+Quite often, the test database will need to be destroyed and recreated.
+To enable testing of <strong>gnucomo</strong> applications, all programs
+need to access either the test database or the production database.
+To accommodate this, each application needs an option to override the
+default name of the configuration file (gnucomo.conf).
+</para>
+
+<para>
+To create a convenient programming interface for object oriented languages,
+a class <emph>gnucomo_database</emph> provides an abstract layer which
+hides the details of the database implementation.
+An object of this class maintains the connection to the database server
+and provides convenience functions for accessing information in the
+database.
+A constructor of the <emph>gnucomo_database</emph> is passed a reference to
+the <emph>gnucomo_configuration</emph> object in order to access the database.
+This accommodates both production and test databases.
+The constructor will immediately try to connect to the database and check its
+validity.
+The destructor will of course close the database connection.
+</para>
+
+<para>
+Other methods provide access to the database.
+More will be added in the future; these are a few to begin with:
  <itemize>
-<item>hostname</item>
-<item>timestamp</item>
-<item>service (kernel, daemon, ...)</item>
-<item>Log message</item>
+<item>Find the objectid of a host, given its hostname</item>
+<item>Insert a log record into the log table</item>
  </itemize>
  </para>
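The database layer described above can be sketched as follows. This is a minimal illustration in Python, not the gnucomo implementation: it uses an in-memory SQLite database as a stand-in for the real database server, and the table layout (`object`, `log`), the column names, the `config` dictionary and the method names `find_host` and `insert_log` are all assumptions made for the example.

```python
import sqlite3


class GnucomoDatabase:
    """Thin layer that hides the SQL details from the applications."""

    def __init__(self, config):
        # 'config' stands in for the configuration object; here it is just
        # a dict with a hypothetical 'database' key. Pointing it at another
        # database is all it takes to switch between test and production.
        self.conn = sqlite3.connect(config["database"])
        # Hypothetical schema, created only so that the sketch can run.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS object "
            "(objectid INTEGER PRIMARY KEY, objectname TEXT UNIQUE)")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS log "
            "(logid INTEGER PRIMARY KEY, objectid INTEGER, "
            "arrival TEXT, service TEXT, message TEXT)")

    def find_host(self, hostname):
        """Return the objectid of a host, or None if the host is unknown."""
        row = self.conn.execute(
            "SELECT objectid FROM object WHERE objectname = ?",
            (hostname,)).fetchone()
        return row[0] if row else None

    def insert_log(self, objectid, arrival, service, message):
        """Insert one record into the log table and return its id."""
        cur = self.conn.execute(
            "INSERT INTO log (objectid, arrival, service, message) "
            "VALUES (?, ?, ?, ?)", (objectid, arrival, service, message))
        self.conn.commit()
        return cur.lastrowid

    def close(self):
        # Plays the role of the destructor: release the connection.
        self.conn.close()
```

Because the database name comes from the configuration, a test run only needs a different configuration to stay away from the production data.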
  </section>
@@ -95,7 +126,7 @@ Log entries are stored in a database with at least the following fields:
  <heading>Configuration</heading>
  
  <para>
-Configurational parameters are stored in a XML formatted configuration file.
+Configuration parameters are stored in an XML formatted configuration file.
  The config file contains a two-level hierarchy.
  The first level denotes the section for which the parameter is used
  and the second level is the parameter itself.
@@ -137,8 +168,8 @@ Other database systems are not supported yet.
  <heading>gnucomo_config class</heading>
  
  <para>
-Each Gnucomo application should have exectly one object of the
-<strong>gnucomo_config</strong> to obtain its configurational
+Each Gnucomo application should have exactly one object of the
+<strong>gnucomo_config</strong> class to obtain its configuration
  parameters.
  The following methods are supported in this class:
  
@@ -174,17 +205,360 @@ The following methods are supported in this class:
  </subsection>
  </section>
  
+<section>
+<heading>gcm_input</heading>
+
+<para>
+<strong>gcm_input</strong> is the application which captures messages from client
+systems in one form or another and tries to store information from these messages
+into the database.
+A client message may arrive in a number of forms and through any kind of
+transportation channel.
+Here are a few examples:
+
+<itemize>
+<item>Copied directly from a local client's file system.</item>
+<item>Copied remotely from a client's file system, e.g. using
+<code>ftp</code>, <code>rcp</code> or <code>scp</code>.</item>
+<item>Through an email.</item>
+</itemize>
+
+On top of that, any message may be encrypted, for example with PGP or GnuPG.
+In any of these situations, <strong>gcm_input</strong> should be able to extract
+as much information as possible from the client's message.
+In case the message is encrypted, it may not be possible to run <strong>gcm_input</strong>
+in the background, since human intervention is needed to enter the secret key.
+</para>
+<para>
+The primary function of <strong>gcm_input</strong> is to store lines from a client's
+log files into the <emph>log</emph> table.
+To do this, we need certain information about the client message that is usually not
+in the content of a log file.
+This information includes:
+<itemize>
+<item>The source of the log file, most often in the form of the client's hostname.</item>
+<item>The time at which the log file arrived on the server.</item>
+<item>The service on the client which produced the log file.</item>
+</itemize>
+
+Sometimes, this information is available from the message itself, as in an email header.
+On other occasions, the information needs to be supplied externally,
+e.g. by using command line options.
+</para>
+<para>
+Apart from determining information about the client's message, the content
+of the message needs to be analyzed in order to handle it properly.
+The body of the message may contain all sorts of information, such as:
+<itemize>
+<item>System log file</item>
+<item>Apache log file</item>
+<item>Report from a Gnucomo agent</item>
+<item>Something else...</item>
+</itemize>
+
+The message is analyzed to obtain information about what the message entails
+and where it came from.
+The <strong>classify()</strong> method tries to extract that information.
+Sometimes, this information cannot be determined with absolute certainty.
+The certainty expresses how sure we are about the contents of the message.
+Classifying a message may be performed with an algorithm as shown in
+the following pseudo code:
+
+<verbatim>
+while certainty &lt; &epsilon; AND not at end
+
+   Scan for a marker
+
+   Adjust certainty
+</verbatim>
+
+Initially, a message is not classified and the certainty is 0.0.
+Some lines point toward a certain class of message but do not absolutely determine
+the class of a message. Other pieces of text are typical for a certain message class.
+Examples of markers that determine the classification of a client message
+are discussed below.
+
+<verbatim>
+From - Sat Sep 14 15:01:15 2002
+</verbatim>
+
+This is almost certainly a UNIX style mail header.
+There should be lines beginning with <code>From:</code> and <code>Date:</code>
+before the first empty line is encountered.
+The hostname of the client that sent the message and the time of arrival
+can be determined from these email header lines.
+The content of the message is still to be determined by matching
+other markers.
+
+<verbatim>
+-----BEGIN PGP MESSAGE-----
+</verbatim>
+
+Such a line in the message certainly means that the message is PGP or GnuPG
+encrypted.
+Decrypting is possible only if someone or something provides a secret key.
+
+<verbatim>
+Sep  1 04:20:00 kithira kernel: solo1: unloading
+</verbatim>
+
+The general pattern of a system log file is an abbreviated month name, a day,
+a time, a name of a host without the domain, the name of a service followed
+by a colon and finally, the message of that service.
+We can match this with a regular expression to see if the message holds syslog lines.
+Similar matches can be used to find Apache log lines or output from the <emph>dump</emph>
+backup program or anything else.
+</para>
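The classification loop and the three example markers can be sketched as below. This is an illustration in Python under stated assumptions, not the actual classify() method: the class names, the weight attached to each marker, the threshold value and the rule for combining evidence are invented for the example, and the sketch assigns a single class per message, whereas a mail message would still need its body classified separately.

```python
import re

# Markers taken from the examples in the text; the weights are made-up values.
MARKERS = [
    # UNIX style mail header: "From - Sat Sep 14 15:01:15 2002"
    ("mail", re.compile(
        r"^From \S+ +\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}"), 0.9),
    # PGP or GnuPG armor line
    ("pgp", re.compile(r"^-----BEGIN PGP MESSAGE-----"), 1.0),
    # syslog: month, day, time, short hostname, "service:", message
    ("syslog", re.compile(
        r"^[A-Z][a-z]{2} [ \d]\d \d\d:\d\d:\d\d \S+ [^:\s]+: "), 0.5),
]


def classify(lines, epsilon=0.9):
    """Return (message_class, certainty) for a list of message lines."""
    certainty = {}                 # per-class certainty, implicitly 0.0
    best, best_c = None, 0.0
    for line in lines:             # "not at end"
        if best_c >= epsilon:      # "certainty < epsilon" no longer holds
            break
        for name, pattern, weight in MARKERS:   # scan for a marker
            if pattern.match(line):
                # Adjust certainty: each match closes part of the
                # remaining gap between the current value and 1.0.
                c = certainty.get(name, 0.0)
                certainty[name] = c + (1.0 - c) * weight
                if certainty[name] > best_c:
                    best, best_c = name, certainty[name]
    return best, best_c
```

A single syslog line leaves the certainty at the marker's weight; every further matching line raises it, so the loop stops reading as soon as the threshold is reached.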
+
+<para>
+The message classification embodies the way in which a message must be
+handled and in what way information from the message can be put into
+the database.
+Aspects for handling the message are for example:
+<itemize>
+<item>Strip lines at the beginning or end.</item>
+<item>Store each line separately or store the message as a whole.</item>
+<item>How to extract hostname, arrival time and service from the message.</item>
+<item>How to break up the message into individual fields for a <emph>log</emph> record.</item>
+</itemize>
+</para>
+
+<para>
+The figure below shows the class diagram that is used for <strong>gcm_input</strong>:
+   <para>
+   <picture src='classes-gcm_input.png' eps='classes-gcm_input'/>
+   </para>
+
+The heart of the application is a <emph>client_message</emph> object.
+This object reads the message through the <emph>message_buffer</emph> from some
+input stream (file, string, stdin or socket), classifies the message and
+enters information from the message into the database.
+It has a relationship with a <emph>gnucomo_database</emph> object which
+is an abstraction of the tables in the database.
+These are the methods for the <emph>client_message</emph> class:
+
+<itemize>
+<item>client_message::client_message(istream *in, gnucomo_database *db)
+   <para>
+   Constructor.
+   </para>
+</item>
+<item>double client_message::classify(String host, date arrival_d, hour arrival_t, String serv)
+  <para>
+  Try to classify the message and return the certainty with which the class of the
+  message could be determined.
+  If the hostname, arrival time and service cannot be extracted from the message,
+  use the arguments as defaults.
+  </para>
+</item>
+<item>int client_message::enter()
+  <para>
+  Insert the message contents into the <emph>log</emph> table of the gnucomo
+  database.
+  Returns the number of records inserted.
+  </para>
+</item>
+</itemize>
+
+</para>
+<para>
+Some kind of input buffering is needed when a client message is being processed.
+The contents of the message are not entirely clear until a few lines are analyzed,
+and these lines probably need to be read again.
+When the message is stored in a file, this is no problem; a simple lseek allows us
+to read any line over and over again.
+However, when the message comes from an input stream, like a TCP socket or just
+plain old stdin, this is a different matter.
+Lines of the message that are already read will be lost forever, unless they are
+stored somewhere.
+To store an input stream temporarily, there are two options:
+<enumerate>
+<item>In an internal memory buffer.</item>
+<item>In an external (temporary) file.</item>
+</enumerate>
+The <emph>message_buffer</emph> class takes care of the input buffering, thus
+hiding these implementation details.
+On the outside, a <emph>message_buffer</emph> can be read line by line until the
+end of the input is reached.
+Lines of input can be read again by backing up to the beginning of the message
+by using the <strong>rewind()</strong> method or by backing up one line
+with the <strong>--</strong> operator.
+The <emph>message_buffer</emph> object maintains a pointer to the next
+available line.
+The <strong>++</strong> operator, being the opposite of the <strong>--</strong>
+operator, skips one line.
+</para>
+
+<para>
+The <strong>&gt;&gt;</strong> operator reads data from the message
+into the second (String) operand, just like the <strong>&gt;&gt;</strong>
+operator for an istream.
+There is a small difference, though.
+The <strong>&gt;&gt;</strong> operator for a <emph>message_buffer</emph>
+returns a boolean value which states if there actually was input available.
+This value will usually turn to <code>false</code> at the end of file.
+A second difference is the fact that input data can only be read into
+<emph>String</emph> objects a line at a time.
+There are no functions for reading integer or floating point numbers.
+The <strong>&gt;&gt;</strong> operator reads the next line either from
+an internal buffer or from the external input stream if the internal
+buffer is exhausted.
+Lines read from the input stream are cached in the internal buffer,
+so they are available for reading another time, e.g. after
+rewinding to the beginning of the message.
+</para>
+
+<para>
+Methods for the <emph>message_buffer</emph> class:
+
+<itemize>
+<item>message_buffer::message_buffer(istream *in)
+  <para>
+  Constructor.
+  </para>
+</item>
+<item>bool operator &gt;&gt;(message_buffer &amp;, String &amp;)
+</item>
+<item>message_buffer::rewind()</item>
+<item>message_buffer::operator --</item>
+<item>message_buffer::operator ++</item>
+</itemize>
+</para>
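The buffering behaviour described above can be sketched as follows. This is a minimal illustration in Python rather than the C++ class itself: `read_line()` stands in for the `>>` operator (returning `None` where the operator would return false), `back()` for `--` and `skip()` for `++`. These method names are assumptions for the example, and only the internal memory buffer option is shown.

```python
import io


class MessageBuffer:
    """Line-oriented reader that caches what it has read from a stream."""

    def __init__(self, stream):
        self._stream = stream
        self._lines = []   # cache of every line read from the stream so far
        self._next = 0     # index of the next available line

    def read_line(self):
        """Return the next line, or None when the input is exhausted."""
        if self._next == len(self._lines):
            # Internal buffer exhausted: fetch one more line from the stream.
            raw = self._stream.readline()
            if raw == "":
                return None
            self._lines.append(raw.rstrip("\n"))
        line = self._lines[self._next]
        self._next += 1
        return line

    def rewind(self):
        """Back up to the beginning of the message."""
        self._next = 0

    def back(self):
        """Back up one line."""
        if self._next > 0:
            self._next -= 1

    def skip(self):
        """Skip one line."""
        self.read_line()


# Usage: a line can be read again even though the underlying stream
# has already moved past it.
demo = MessageBuffer(io.StringIO("first\nsecond\n"))
first = demo.read_line()
demo.rewind()
assert demo.read_line() == first
```

Because every line passes through the cache, the same code works for a file, a socket or stdin, at the cost of keeping the whole message in memory.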
+<subsection>
+
+<heading>Command arguments</heading>
+
+<para>
+<strong>Gcm_input</strong> understands the following command line arguments:
+<itemize>
+<item>-c &lt;name&gt; : Configuration name</item>
+<item>-d &lt;date&gt; : Arrival time of the message</item>
+<item>-h &lt;hostname&gt; : FQDN of the client</item>
+<item>-s &lt;service&gt; : service that created the log</item>
+<item>-v : verbose output. Print lots of debug information</item>
+<item>-V : Print version and exit</item>
+</itemize>
+</para>
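The option list above could be parsed along these lines. This is a sketch in Python using the standard argparse module, not the parser gcm_input actually uses; the attribute names and the default configuration name `gnucomo` are assumptions. Note that the automatic help option has to be switched off, because `-h` means "hostname" here.

```python
import argparse


def parse_args(argv):
    """Parse the gcm_input command line options listed above."""
    # add_help=False: argparse would otherwise claim -h for its help text.
    parser = argparse.ArgumentParser(prog="gcm_input", add_help=False)
    parser.add_argument("-c", dest="config", default="gnucomo",
                        help="configuration name (default is an assumption)")
    parser.add_argument("-d", dest="date",
                        help="arrival time of the message")
    parser.add_argument("-h", dest="hostname",
                        help="FQDN of the client")
    parser.add_argument("-s", dest="service",
                        help="service that created the log")
    parser.add_argument("-v", dest="verbose", action="store_true",
                        help="verbose output")
    parser.add_argument("-V", dest="version", action="store_true",
                        help="print version and exit")
    return parser.parse_args(argv)
```

Options that are left out stay at `None` (or `False` for the flags), which lets the caller fall back on whatever can be extracted from the message itself.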
+</subsection>
+</section>
+
+<section>
+<heading>gcm_daemon</heading>
+<para>
+<strong>Gcm_daemon</strong> is the application that processes data just
+arrived in the database.
+It handles the log-information delivered by <strong>gcm_input</strong>
+in the <emph>log</emph> table of the database.
+With this data, further storage and classification can be done.
+Where <strong>gcm_input</strong> is a highly versatile application that is
+started and stopped all the time, the daemon is continuously available, monitoring
+the entire system. Basically, the daemon monitors everything that happens
+in the database, executes continuous checks and processes all the data.
+The two applications (gcm_input and gcm_daemon) together are the core of the central system. 
+The application has the following tasks:
+<itemize>
+<item>Processing data into other tables so that easy detection can take place</item>
+<item>Raising notifications based on the available input</item>
+<item>Maintaining the status of notifications and changing priority when needed</item>
+<item>Periodically performing checks for alerts that are communicated through the notification table</item>
+<item>Performing updates on the database when a new version of the software is loaded</item>
+</itemize>
+</para>
+
+<subsection>
+<heading>Performing checks</heading>
+
+<para>
+One of the most difficult tasks for the daemon is performing the automatic checks.
+Every check is different and will be made up of several parameters that have to test negative.
+That makes it hard to define this in software.
+Another downside is that some work may be very redundant.
+For that reason a more generic control structure is needed based on the technologies used.
+The logical choice is then to focus on the capabilities in the database and perform
+the job by executing queries.
+</para>
+<para>
+Since the system is about detecting problems and issues, we build the detection in
+such a way that queries on the database result in 'suspicious' log records.
+So-called 'innocent' records can be ignored. So if a query gives a result, a
+problem is present; if there is no result, there isn't a problem.
+When we look for common ground in the process of identifying problems,
+it can be said that all results are based on the log table
+(as stated in the manifest, the log table is the one and only table where input
+arrives and is stored for later use).
+Furthermore, there are two ways of determining whether a problem is present:
+<itemize>
+<item>
+A single log record or a group of log records is within or outside the boundaries set.
+If it is outside the boundaries, the log record(s) is/are a potential problem.
+If more than one boundary is set, all of them need to be applied.
+Results can be derived directly from this fixed data.
+</item>
+<item>
+A set of records outlines a trend that, over time, may turn out
+to be a problem. These types of values are not fixed and directly readable,
+but are more or less derived data. That data is input for some checks (previous bullet).
+</item>
+</itemize>
+In both cases a set of queries can be run.
+If several queries are to be executed, the later queries can be run on
+only the results of the earlier ones. For that reason, intermediate results have to be stored in a
+temporary table for later reuse.
+Storing a session together with the log records found is sufficient.
+This works because log records are the basis of all derived entries in
+the numerous log_adv_xxx tables, which always have a reference to the log table.
+</para>
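The idea of chaining queries through a temporary table can be sketched as follows. This is an illustration in Python with an in-memory SQLite database, not the gcm_daemon implementation: the `check_result` table, the simplified `log` columns and the `run_check` function are hypothetical, and the WHERE fragments are assumed to be trusted, hand-written SQL.

```python
import sqlite3


def run_check(conn, session, first_where, later_wheres=()):
    """Run a chain of queries; return the ids of the suspicious log records.

    An empty result means that the check found no problem.
    """
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS check_result "
                 "(session INTEGER, logid INTEGER)")
    # The first query runs on the whole log table.
    conn.execute("INSERT INTO check_result "
                 "SELECT ?, logid FROM log WHERE " + first_where, (session,))
    # Later queries only look at the records that earlier queries kept.
    for where in later_wheres:
        keep = conn.execute(
            "SELECT l.logid FROM log l "
            "JOIN check_result r ON r.logid = l.logid "
            "WHERE r.session = ? AND (" + where + ")", (session,)).fetchall()
        conn.execute("DELETE FROM check_result WHERE session = ?", (session,))
        conn.executemany("INSERT INTO check_result VALUES (?, ?)",
                         [(session, row[0]) for row in keep])
    rows = conn.execute(
        "SELECT logid FROM check_result WHERE session = ? ORDER BY logid",
        (session,)).fetchall()
    return [row[0] for row in rows]
```

Storing only the session and the log record id keeps the intermediate results small, since everything else can be looked up again through the reference to the log table.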
+<para>
+Building the checks will thus be nothing more than combining queries
+and adding a classification to the results of that query.
+If this generic structure is built properly with a simple (easy to understand) interface,
+many combinations can be made. People who have a logically correct idea
+but insufficient programming skills will be able to build checks.
+As a consequence, we can offer the interface to users,
+who in turn can make particular checks for the environment that is unique to them.
+This, of course, doesn't mean that a clear SQL interface shouldn't be offered.
+</para>
+<para>
+Whenever something out of the ordinary happens, a line will be written to syslog.
+This will enable users and developers to trace exactly what happened.
+The gcm_daemon will also log its startup and shutdown, so that abrupt endings of the daemon can be detected.
+</para>
+</subsection>
+
+<subsection>
+<heading>The initial process</heading>
+<para>
+When gcm_daemon starts, some basic actions take place first that go beyond
+just opening a connection to the database. The following actions also need to take place:
+<itemize>
+<item>
+Check whether the database version is still the most recent version.
+The daemon will check the version number of the database.
+If the database is not the same version as gcm_daemon, an update will be performed.
+When the database is up to date, normal processing will continue.
+</item>
+<item>
+If the database shows that the previously used version of gcm_daemon is less recent than
+the running version (i.e. a new version has been installed), all records
+in the log table that weren't recognized before will be set to unprocessed, since
+there is a fair chance that they will be recognized this time. This ensures that no data is lost.
+</item>
+</itemize>
+</para>
+
+</subsection>
+</section>
+
  <section>
  <heading>Design ideas</heading>
  
  <para>
-Use of a neural network to analyse system logs:
+Use of a neural network to analyze system logs:
  <itemize>
  <item>Classify words</item>
  <item>Classify message based on word classification</item>
  </itemize>
  </para>
-
  </section>
  
  </chapter>