1. Introducing SIETS

This guide introduces SIETS from an application developer’s perspective and provides reference material for building customized applications based on SIETS.

This section includes the following:

1.1. What is SIETS?

SIETS is a system for information storage and retrieval. The SIETS system consists of the SIETS server and an application programming interface (API) for building information storage and retrieval applications.

The SIETS server is an operational unit that performs information storing and retrieval tasks by executing a predefined set of commands.

The SIETS API is used for building applications customized to your company's needs.

Note: The SIETS package is delivered with integrated applications for the most common source data formats and retrieval scenarios. However, these applications cannot cover every possible scenario; therefore, this guide provides reference material for building your own applications.

The amount of unstructured data in companies is increasing very rapidly; the only effective way to retrieve such data from collections, and therefore make the data usable, is full text search (FTS). Full text search is the main methodology implemented in the SIETS server for information indexing and searching.

Full text search in SIETS is based on an optimized mathematical model, which ensures very high performance, compared to traditional SQL systems, when searching large amounts of poorly structured information. For this purpose, SIETS stores data in an inverted index.

For more information on how data are stored in SIETS, see Understanding Storing Information in SIETS.

Subjects for full text search can be any unstructured data, for example, text collections, separate phrases or words in text documents, Web pages, Web addresses, several special markups for textual and numerical data, bookmarks of HTML or XML pages, domain names, SQL database entry key IDs, file names, and so on.

The following figure illustrates the SIETS system from a high level:

Figure 1: SIETS operational diagram

In Figure 1, users access the SIETS server via FTS queries. However, SIETS also implements other mechanisms for storing and manipulating data, for example, retrieval queries, update requests, status and control commands, and XML queries using XPath notation.

1.2. SIETS in Corporate Networks

The SIETS system can be integrated into an existing corporate network system. The SIETS server is incorporated into the network system just like any other server. The following figure describes a sample corporate network with the SIETS server:

Figure 2: SIETS server in a corporate network

Application servers and transaction processors from an existing corporate network can access the SIETS server as active SIETS API clients; in that case, for security reasons, end users cannot directly access the SIETS server. In this way, the SIETS server can be used in any corporate network, independently of the operating system, database environment, or programming language used for application development.

For very large amounts of data, SIETS supports server clustering, which allows performance to scale.

For more information on the SIETS multi-server architecture, see Multi-Server Architecture.

1.3. Understanding SIETS Environment

This section describes the SIETS environment from a user's perspective.

This section contains the following topics:

1.3.1. Overview

As already described in the previous section, the SIETS system is used for data storage and retrieval. The SIETS server is an operational unit in the SIETS system that performs these tasks. There is a predefined set of commands that are understood and executed by the SIETS server. The commands are implemented as XML requests and replies and transported via HTTP POST.

This implies that, before commands can be sent to the SIETS server, they must be created in the form of XML request messages. To make this easier, the SIETS system includes a server-side mechanism that reads HTTP GET parameters and composes XML request messages from them. In that case, you do not have to create XML request messages yourself; you only pass the right parameters, from which XML request messages are automatically created and sent to the SIETS server.
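
The idea of composing an XML request from HTTP GET parameters can be illustrated with a minimal Python sketch. This guide section does not define the actual SIETS message format, so the element names (request, command, content) and the parameter names used here are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl


def compose_request(query_string: str) -> str:
    """Build an XML request message from an HTTP GET query string.

    The <request>/<command>/<content> layout is hypothetical; it only
    illustrates the parameter-to-XML translation described above.
    """
    params = dict(parse_qsl(query_string))
    root = ET.Element("request")
    # The command name is taken from a (hypothetical) "command" parameter.
    ET.SubElement(root, "command").text = params.pop("command", "search")
    content = ET.SubElement(root, "content")
    # Every remaining parameter becomes one child element.
    for name, value in params.items():
        ET.SubElement(content, name).text = value
    return ET.tostring(root, encoding="unicode")


xml_request = compose_request("command=search&query=full+text&docs=10")
print(xml_request)
```

An application that prefers to build its own XML messages could use the same function on the application side instead of relying on the server-side mechanism.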

Of course, you can also create XML messages at the application side at your own convenience.

Similarly, there is a mechanism that formats XML reply messages received from the SIETS server by using an XSLT stylesheet. Again, it is your decision whether to handle XML reply messages on the application side or to create an XSLT stylesheet with which results received from the SIETS server are automatically formatted and can be passed directly to end users.
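
As a rough sketch of the first option, handling an XML reply on the application side, the following Python code extracts a result list from a reply message. The reply layout (reply, hits, document, id, title) is invented for this example and is not the documented SIETS reply format.

```python
import xml.etree.ElementTree as ET

# Hypothetical reply message; the real SIETS reply structure may differ.
SAMPLE_REPLY = """<reply>
  <hits>2</hits>
  <document><id>17</id><title>Full text search</title></document>
  <document><id>42</id><title>Inverted index</title></document>
</reply>"""


def extract_results(xml_reply: str) -> list:
    """Return (document ID, title) pairs found in an XML reply message."""
    root = ET.fromstring(xml_reply)
    return [
        (int(doc.findtext("id")), doc.findtext("title"))
        for doc in root.findall("document")
    ]


results = extract_results(SAMPLE_REPLY)
for doc_id, title in results:
    print(doc_id, title)
```

With the XSLT option, this parsing step would be unnecessary because the formatted output would be produced before it reaches the application.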

1.3.2. Accessing SIETS Server

The following diagram describes how the SIETS server is accessed:

Figure 3: Accessing SIETS server

The following steps provide a general description of how the SIETS server is accessed:

Users enter commands, such as search queries, for the SIETS server from a user interface of an application, for example, a Web search form.
A custom built application calls a SIETS API command with its parameters.
The Web server receives the HTTP request and passes it to the SIETS Web server module using the Common Gateway Interface (CGI) or the Apache API.
The SIETS Web server module translates each user command into an XML request and submits it to the SIETS server via UNIX domain sockets.
The SIETS server responds to the application by returning XML replies, which are optionally formatted using an XSLT stylesheet and can then be displayed and viewed through the application user interface, for example, a Web page.

The steps just described imply that the SIETS server is accessed using the SIETS API.
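
From the application's point of view, the second and third steps amount to sending an HTTP request to the Web server. The following Python sketch prepares such a request with the standard library. The URL and the XML body are placeholders assumed for illustration, not actual SIETS endpoints or commands.

```python
import urllib.request

# Hypothetical address of the Web server hosting the SIETS Web server module.
SIETS_URL = "http://localhost/cgi-bin/siets"


def build_post(xml_request: str, url: str = SIETS_URL) -> urllib.request.Request:
    """Prepare an HTTP POST request that carries an XML request message."""
    return urllib.request.Request(
        url,
        data=xml_request.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


request = build_post("<request><command>status</command></request>")
print(request.get_method(), request.full_url)

# Sending is left out so that the sketch runs without a server:
# with urllib.request.urlopen(request) as reply:
#     xml_reply = reply.read().decode("utf-8")
```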

However, for debugging purposes, the SIETS server can also be accessed via the SIETS console.

For a detailed SIETS console description, see SIETS Console.

1.4. Concepts

This section introduces and briefly explains concepts that readers must be familiar with before going into details.

This section contains the following topics:

1.4.1. SIETS Server

SIETS server is a stand-alone server for storing and retrieving information such as plain texts or XML structured documents. It can be run in one or more instances per computer.

For more information, see Multiple Storages Architecture.

1.4.2. SIETS FTS Capability

SIETS is designed to retrieve stored information by using full text search queries.

1.4.3. SIETS API

SIETS application programming interface (API) is a standardized set of commands for accessing the SIETS server.

1.4.4. SIETS Web Server Module

SIETS Web server module is a module integrated with the Web server that receives requests from an application through the Web server via HTTP POST and dispatches them to the SIETS server via UNIX domain sockets.

The SIETS Web server module also includes the functionality for composing XML request messages from HTTP GET or POST parameters and for optionally formatting the XML reply messages with a given XSLT stylesheet.

The SIETS system is designed so that the SIETS Web server module can be integrated with the Web server through the Common Gateway Interface (CGI) or the Apache API. Thus, it can be integrated with any Web server through CGI, and with the Apache Web server through the Apache API, which increases the efficiency of the whole system.

1.4.5. SIETS Console

SIETS console is a simple text application for accessing the SIETS server directly using the same functions as in SIETS API.

1.4.6. SIETS Document

SIETS document is a unit in the SIETS storage against which searching is performed. It can be unstructured or XML structured.

1.4.7. SIETS Storage

SIETS storage is a data collection for storing SIETS documents in a format that allows searches to be performed very fast. The SIETS storage is serviced by one SIETS server instance and consists of a vocabulary, a document repository, and an inverted index. Multiple storages can be run on a single computer.

1.4.8. Vocabulary

Vocabulary is a list of all unique words in the SIETS storage. Unique words are found in documents and added to the vocabulary while storing these documents to the SIETS storage. Each SIETS storage has its own vocabulary. Each word in the vocabulary has an ID of the integer type assigned to it. Vocabulary is stored in RAM for better performance.
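
The word-to-ID mapping can be pictured with a minimal in-memory Python sketch. The class and method names are hypothetical, and splitting text on whitespace is a simplifying assumption; the actual SIETS tokenization rules are not described in this section.

```python
class Vocabulary:
    """Toy vocabulary: maps each unique word to an integer ID."""

    def __init__(self):
        self._ids = {}

    def word_id(self, word: str) -> int:
        # A new word receives the next free integer ID;
        # a known word keeps the ID it was given earlier.
        return self._ids.setdefault(word, len(self._ids) + 1)

    def lookup(self, word: str):
        # Returns None for words that were never stored.
        return self._ids.get(word)


vocabulary = Vocabulary()
ids = [vocabulary.word_id(w) for w in "full text search of full text".split()]
print(ids)
```

Repeated words map to the same ID, so the vocabulary grows only with the number of unique words, not with the total amount of text.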

1.4.9. Document Repository

Document repository is the place where all SIETS documents are kept in the format in which they were submitted to the SIETS system, so that the documents can be returned in response to a search request. Each SIETS storage has its own document repository.

1.4.10. Inverted Index

Inverted index is a list of words, where each word has a list of pointers to the SIETS documents in which the word occurs. The inverted index ensures fast FTS functionality and makes it possible to build different logical expressions when performing a search. Each SIETS storage has its own inverted index.
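
A toy Python sketch shows why an inverted index lends itself to logical expressions: each word maps to a set of document IDs, and Boolean operators become set operations. The sample documents and the whitespace tokenization are assumptions for illustration only and say nothing about the internal SIETS data layout.

```python
from collections import defaultdict


def build_inverted_index(documents: dict) -> dict:
    """Map every word to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word].add(doc_id)
    return index


documents = {
    1: "storage and retrieval",
    2: "full text retrieval",
    3: "full text search",
}
index = build_inverted_index(documents)

both = index["full"] & index["retrieval"]    # AND: documents with both words
either = index["search"] | index["storage"]  # OR: documents with either word
without = index["full"] - index["search"]    # NOT: "full" but not "search"
print(both, either, without)
```

No document text is scanned at query time; only the precomputed ID sets are combined.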

1.5. SIETS Architecture

This section describes SIETS from various architectural perspectives.

This section contains the following topics:

1.5.1. Client — Server Architecture

The following figure describes SIETS architecture from the client — server perspective:

Figure 4: SIETS client — server architecture

From the client — server perspective, the SIETS system consists of the client part and the server part.

On the client side, developers and administrators work with an application server, which initiates SIETS API command calls and sends them to the SIETS server via HTTP.

On the server side, the SIETS server executes these SIETS API commands, accessing data in the SIETS storage, and sends a reply back to the application server on the client side.

1.5.2. Multiple Storages Architecture

The following figure describes how multiple SIETS server instances can be run on a single computer:

Figure 5: SIETS multiple storages architecture

Multiple instances of the SIETS server can be run on a single computer, each of which works with its own SIETS storage.

1.5.3. Multi-Server Architecture

To scale to larger amounts of data, the SIETS server can be clustered, sharing a single SIETS storage across many computers.

The following figure describes how a single SIETS storage can be distributed on several SIETS servers:

Figure 6: SIETS multi-server architecture

For more information on SIETS clustering, see SIETS Clustering.

1.5.4. Understanding Storing Information in SIETS

The following figure describes how data are imported and stored in SIETS:

Figure 7: Storing information in SIETS

Data are entered by end users in custom built applications.
Using SIETS API commands, data are submitted to the SIETS server via HTTP.
From the submitted data, the SIETS server creates an inverted index, a vocabulary, and a document repository, all of which are contained in the SIETS storage.

For more information on the SIETS storage, see Indexing Documents in SIETS Storage and Querying SIETS Storage.

1.5.5. Indexing Documents in SIETS Storage

Note: This section contains some SIETS system implementation details. The description provided in this section is very general and does not include implementation details for all SIETS functionality.

Note: The knowledge provided in this section is not required for SIETS application developers. However, it can be useful for a better understanding of the SIETS system.

The following figure describes the general principles of how a document is indexed in the SIETS storage:

Figure 8: Indexing documents in SIETS storage

When the SIETS server receives an XML request containing a document that must be imported into the SIETS storage, the Control demon parses the XML request.
The Control demon sends the document to the Document repository demon.
The Document repository demon stores the document in the document repository, assigns a unique ID of the integer type to it and sends the ID to the Control demon.
The Control demon sends all textual data from the document to the Vocabulary demon.
The Vocabulary demon translates all words in the document to unique IDs of the integer type and sends them to the Control demon.
The Control demon sends the document ID and all IDs of the words contained by it to the Inverted index demon.
The Inverted index demon links word IDs with the document ID and inserts them in the inverted index.
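
The flow above can be mimicked with a small single-process Python sketch in which plain functions stand in for the demons. This is an illustrative model under simplifying assumptions (in-memory dictionaries, whitespace tokenization, plain text instead of a parsed XML request); the function names are hypothetical and the sketch is not the SIETS implementation.

```python
from collections import defaultdict

repository = {}                     # document ID -> stored document text
vocabulary = {}                     # word -> integer word ID
inverted_index = defaultdict(set)   # word ID -> set of document IDs


def store_document(text: str) -> int:
    """Document repository role: keep the document and assign an integer ID."""
    doc_id = len(repository) + 1
    repository[doc_id] = text
    return doc_id


def translate_words(text: str) -> list:
    """Vocabulary role: translate each word to its integer ID."""
    return [vocabulary.setdefault(w, len(vocabulary) + 1) for w in text.split()]


def link_words(doc_id: int, word_ids: list) -> None:
    """Inverted index role: link every word ID with the document ID."""
    for word_id in word_ids:
        inverted_index[word_id].add(doc_id)


def index_document(text: str) -> int:
    """Control role: pass the document through the three components in order."""
    doc_id = store_document(text)
    word_ids = translate_words(text)
    link_words(doc_id, word_ids)
    return doc_id


first = index_document("full text search")
second = index_document("text retrieval")
print(dict(inverted_index))
```

After both calls, the word ID for "text" points to both documents, while every other word ID points to exactly one.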

1.5.6. Querying SIETS Storage

Note: The knowledge provided in this section is not required for SIETS application developers. However, it can be useful for a better understanding of the SIETS system.

The following figure describes the general principles of how a query is processed in the SIETS storage:

Figure 9: Querying SIETS storage

When the SIETS server receives an XML request containing a query, the Control demon parses the XML request.
The Control demon sends the query to the Vocabulary demon.
The Vocabulary demon translates words from the query to IDs and sends them to the Control demon.
The Control demon sends the IDs to the Inverted index demon.
The Inverted index demon searches the inverted index and returns to the Control demon a list of document IDs that are linked to the query word IDs.
The Control demon sends the list of document IDs to the Document repository demon.
The Document repository demon searches and returns a result set containing a document list that matches the query.
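
The query flow can be sketched in Python against a small, prebuilt storage. The sample data is invented, and the sketch assumes that all query words must occur in a document (an implicit AND); the guide section does not state the default operator, so treat this as one possible reading.

```python
# Hypothetical prebuilt storage for three documents.
repository = {1: "full text search", 2: "text retrieval", 3: "full index"}
vocabulary = {"full": 1, "text": 2, "search": 3, "retrieval": 4, "index": 5}
inverted_index = {1: {1, 3}, 2: {1, 2}, 3: {1}, 4: {2}, 5: {3}}


def run_query(query: str) -> list:
    """Return the documents that contain every word of the query."""
    # Vocabulary role: translate query words to word IDs.
    word_ids = [vocabulary.get(word) for word in query.split()]
    if not word_ids or None in word_ids:
        return []  # empty query, or a word that is not in the vocabulary
    # Inverted index role: collect document IDs linked to all word IDs.
    doc_ids = set.intersection(*(inverted_index[w] for w in word_ids))
    # Document repository role: return the matching documents.
    return [repository[doc_id] for doc_id in sorted(doc_ids)]


print(run_query("full text"))
print(run_query("text"))
```

Only integer IDs travel between the components until the last step, which is when the stored documents themselves are fetched.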

1.6. Standards Compatibility

SIETS is designed to comply with the following standards:

Standard	Reference
XML 1.0	http://www.w3.org/TR/REC-xml
UTF-8	RFC 2279: UTF-8, a transformation format of ISO 10646
HTTP	Hypertext Transfer Protocol
XPath 1.0	http://www.w3.org/TR/xpath

1.7. Features

The following table lists the SIETS features and the section of this guide in which each feature is described:

Feature	Section
FTS	Search
Relevance	Relevance
Multi-language support	Multi-language Support and Character Encoding
Case support	Search
Boolean expressions	Boolean Expressions
Stemming	Stemming
Wildcard search	Wildcard Patterns
Fuzzy search	Alternatives
Markup search	Search within Markup

1.8. SIETS and the Future

The following is planned for upcoming SIETS releases:

support for public key cryptography for document encryption and authentication
