This section describes the SIETS API specification, which is implemented in XML.
This section contains the following topics:
XML request and reply messages are exchanged between the application and the SIETS storage over HTTP, using port 80 by default.
As mentioned earlier, SIETS commands can be sent to the SIETS server in two ways: as XML messages, with replies received as XML messages, or as HTTP GET parameters, with replies received as formatted XML.
Both options are described in the following sections:
The following figure illustrates submitting SIETS commands and receiving replies via XML messages directly:
A request is sent as a POST method.
As the HTTP resource identification, the URL http://host/cgi-bin/siets/api.cgi must be used, where host is the SIETS server host name.
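As a minimal sketch of the POST transport described above, the following Python code prepares an HTTP request without sending it. The host name siets.example.com, the Content-Type header, and the helper name build_post_request are assumptions made for illustration and are not part of the SIETS specification.

```python
import urllib.request


def build_post_request(host: str, xml_body: str) -> urllib.request.Request:
    """Prepare (but do not send) an HTTP POST that carries a SIETS XML request."""
    url = f"http://{host}/cgi-bin/siets/api.cgi"
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        # Assumed header; the specification does not name a content type.
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


# Placeholder body; a real request carries the full XML message envelope.
request = build_post_request("siets.example.com", "<placeholder/>")
print(request.get_method(), request.full_url)
```

Passing the prepared object to urllib.request.urlopen would send it and return the XML reply.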
The following figure illustrates submitting SIETS commands as HTTP GET parameters and receiving formatted XML replies:
A request is sent as a GET or POST method.
As the HTTP resource identification, the URL http://host/cgi-bin/siets/api.cgi must be used, where host is the SIETS server host name. Command specific parameters must be included in the query string or passed as POST data.
As described previously, each XML message contains a command name, content data that are specific for the command, and other information, such as user name and request identifier, which is common for all XML messages and included in the so called XML message envelope.
For more information on the XML message envelope, see SIETS XML Message Envelope.
The following figure illustrates the common part for all XML messages and content part that is specific for each command:
Description of SIETS API commands is organized so that the common part is described in SIETS XML Message Envelope, and only the content parts are described for each command in separate sections named after the command.
XML elements are presented as they appear in messages, and each XML element is described within its tags.
The command syntax consists of an XML request and an XML reply, and as mentioned, XML requests can be submitted as HTTP GET or POST parameters. To describe XML request, XML reply, and HTTP GET parameters syntax, each section contains the following subsections:
Subsection |
Description |
---|---|
XML Request |
Lists all XML request elements that are specific for the command as they appear in XML request messages. Each element is described within its tags. The description within the tags ends with an asterisk (*) if the element is mandatory. |
HTTP GET Parameters |
Describes HTTP GET parameter syntax in the form of an example. The example looks as follows: http://host/cgi-bin/siets/api.cgi?param1=value&param2=value where:
Note: The examples describe HTTP GET parameters; however, the same parameters can also be submitted as POST data. |
XML Reply |
Lists all XML reply elements that are specific for the command as they appear in XML reply messages. Each element is described within its tags. |
Some elements in XML requests, and thus the respective parameters when the request is submitted as HTTP GET parameters, are mandatory, and some are not. The mandatory elements are marked with an asterisk (*) in the XML request description.
However, some XML request elements are mandatory only when submitted in an XML request and are optional when submitted as HTTP GET parameters. Such parameters must first be defined in the SIETS Web server module configuration file; after that, they do not have to be submitted with each command. The following parameters can be defined in the SIETS Web server module configuration file:
user name
user password
For more information on the SIETS Web server module configuration file, see the SIETS Administration and Configuration Guide.
This section describes the common parts of the XML request and reply for all SIETS API commands.
<?xml version="1.0" encoding="REQUEST-ENCODING"?>
<siets:request xmlns:siets="www.siets.net">
<siets:storage>storage name*</siets:storage>
<siets:command>command name*</siets:command>
<siets:timestamp>message creation date and time</siets:timestamp>
<siets:requestid>message number</siets:requestid>
<siets:application>creator of message</siets:application>
<siets:user>user name*</siets:user>
<siets:password>user password*</siets:password>
<siets:timeout>function timeout period</siets:timeout>
<siets:reply_encoding>reply encoding</siets:reply_encoding>
<siets:content>command specific data</siets:content>
</siets:request>
<?xml version="1.0" encoding="REPLY-ENCODING"?>
<siets:reply xmlns:siets="www.siets.net">
<siets:storage>storage name</siets:storage>
<siets:timestamp>reply creation date and time</siets:timestamp>
<siets:content>command specific data</siets:content>
<siets:command>command name for which the reply is created</siets:command>
<siets:requestid>message number for which the reply is created</siets:requestid>
<siets:seconds>time period for the reply creation</siets:seconds>
<siets:replyid>unique message id created by the SIETS server</siets:replyid>
</siets:reply>
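The request envelope can be assembled programmatically. The following Python sketch uses the standard xml.etree.ElementTree module and fills in only the mandatory envelope elements; the helper name build_request and the sample values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SIETS_NS = "www.siets.net"
# Make the serializer emit the "siets:" prefix used in the specification.
ET.register_namespace("siets", SIETS_NS)


def build_request(storage, command, user, password, content=None):
    """Return a SIETS request envelope as a string (mandatory elements only)."""
    root = ET.Element(f"{{{SIETS_NS}}}request")
    fields = (
        ("storage", storage),
        ("command", command),
        ("user", user),
        ("password", password),
    )
    for tag, value in fields:
        ET.SubElement(root, f"{{{SIETS_NS}}}{tag}").text = value
    content_element = ET.SubElement(root, f"{{{SIETS_NS}}}content")
    if content is not None:
        # Command specific data, for example a <document> element.
        content_element.append(content)
    return ET.tostring(root, encoding="unicode")


# Hypothetical storage name and credentials.
xml_text = build_request("test", "status", "demo_user", "demo_password")
print(xml_text)
```

Optional elements such as timestamp, requestid, or timeout can be added in the same way when they are needed.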
This section describes the following data manipulation commands:
The insert command adds a document to the SIETS storage. If a document with the given ID already exists, the command returns an error.
If a document with the given ID exists in the SIETS storage, the update command updates the document. If a document with the given ID is not in the SIETS storage, the update command adds it to the SIETS storage.
The replace command replaces the contents of a document in the SIETS storage. If a document with the given ID is not in the SIETS storage, the command returns an error.
<siets:content>
<document>document content</document>
</siets:content>
Where the document content consists of document structure elements. The default SIETS document structure is as follows:
<document>
<id> document id * </id>
<title> document title </title>
<rate> document rate </rate>
<domain> document domain </domain>
<info> meta data </info>
<text> textual information, which is used for indexing </text>
<hidden> textual information, which is used for indexing, but which is not shown</hidden>
</document>
For more information on the default SIETS document structure, see Creating Document Structure with Application.
http://host/cgi-bin/siets/api.cgi?command=insert&storage=test&id=1&title=Doc1
http://host/cgi-bin/siets/api.cgi?command=update&storage=test&id=1&title=Doc1
http://host/cgi-bin/siets/api.cgi?command=replace&storage=test&id=1&title=Doc1
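Request URLs of this form can be generated rather than typed by hand. The following Python sketch uses urllib.parse.urlencode, which also escapes characters that are not allowed in a query string; the helper name build_get_url is an assumption made for illustration.

```python
from urllib.parse import urlencode


def build_get_url(host, command, **params):
    """Build a SIETS HTTP GET URL for the given command and parameters."""
    query = urlencode({"command": command, **params})
    return f"http://{host}/cgi-bin/siets/api.cgi?{query}"


url = build_get_url("host", "insert", storage="test", id=1, title="Doc1")
print(url)
```

The same helper can be used for the update and replace commands by changing the command argument.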
If the command is executed successfully, the XML reply does not contain any command specific data.
If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
Note: The binary files conversion functionality is available only starting from the SIETS server version 3.2.8.
Binary files conversion is an integrated feature of the insert, update, and replace commands. The binary files conversion functionality converts binary file contents into plain text. Thus, it is possible to add Microsoft Office files and other binary files to the SIETS storage and perform full text search on them.
The following table lists extensions of binary files that can be added to the SIETS storage:
Extension |
Description |
---|---|
DOC |
Microsoft Word document. |
XLS |
Microsoft Excel document. |
PPT |
Microsoft PowerPoint document. |
RTF |
Rich text format document. |
PDF |
Adobe Portable Document Format document. |
PS |
PostScript document. |
To use the binary files conversion functionality, in the XML request, use the file tag in place of the text tag, in the following format:
<file store="yes/no"> <!-- If store is yes, the original document is stored in the SIETS storage and returned when retrieved. The default value is no. -->
<ext> extension of binary file </ext>
<data> data of binary file converted to the base64 encoding </data>
</file>
As described in the data tag, binary file contents must first be converted to the base64 encoding. This is because XML does not support storing binary data within a tag.
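The base64 conversion step can be sketched in Python with the standard base64 module. The helper name build_file_element and the sample bytes are assumptions made for illustration; a real application would read the bytes from the binary file.

```python
import base64
import xml.etree.ElementTree as ET


def build_file_element(data: bytes, ext: str, store: bool = False) -> ET.Element:
    """Wrap binary file contents in a <file> element for a SIETS request."""
    file_element = ET.Element("file", {"store": "yes" if store else "no"})
    ET.SubElement(file_element, "ext").text = ext
    # base64 output is plain ASCII, so it is safe inside an XML element.
    ET.SubElement(file_element, "data").text = base64.b64encode(data).decode("ascii")
    return file_element


sample_bytes = b"\x00\x01 binary contents of a document"  # illustrative data only
element = build_file_element(sample_bytes, "DOC", store=True)
print(ET.tostring(element, encoding="unicode"))
```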
The delete command deletes a document from the SIETS storage. If a document with the given ID is not in the SIETS storage, the command returns an error.
<siets:content>
<document>
<id>document id *</id>
</document>
</siets:content>
http://host/cgi-bin/siets/api.cgi?command=delete&storage=test&id=1
If the command is executed successfully, the XML reply does not contain any command specific data.
If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
After inserting, updating, replacing, or deleting documents in the SIETS storage, the SIETS server must permanently save the changes made to the inverted index. The SIETS server can decide on its own when to start saving the changes. However, to optimize performance for large data amounts, it is recommended to inform the server when a portion of documents has been loaded and no more documents are going to be loaded in the near future, so that the SIETS server can allocate all resources to the process of indexing.
The index command tells the SIETS server to start the process of indexing.
The <siets:content> element does not contain any command specific data.
http://host/cgi-bin/siets/api.cgi?command=index
If the command is executed successfully, the XML reply does not contain any command specific data.
If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
The clear command deletes all documents from the SIETS storage. This command should be used only when a complete re-indexing of the SIETS storage is necessary.
The <siets:content> element does not contain any command specific data.
http://host/cgi-bin/siets/api.cgi?command=clear
If the command is executed successfully, the XML reply does not contain any command specific data.
If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
The get_scheme command retrieves the document structure definition, in other words, the scheme, from the SIETS storage.
Note: It is also possible to review and edit the document policy scheme from SIETS Enterprise Manager. For information on SIETS Enterprise Manager, see the SIETS Administration and Configuration Guide, Configuring SIETS Storage.
The <siets:content> element does not contain any command specific data.
http://host/cgi-bin/siets/api.cgi?command=get_scheme
<siets:content>
<scheme>
<part>
<location>location of the document part in XPath notation</location>
<policy>policy for the document part, policy=value </policy>
</part>
</scheme>
</siets:content>
The set_scheme command sets the document structure definition, in other words, the scheme, for the SIETS storage.
Note: If you modify the scheme, it applies to all documents that are to be imported to the SIETS storage. However, it does not automatically modify the document structure for documents that already are imported to the SIETS storage.
Note: It is also possible to review and edit the document policy scheme from SIETS Enterprise Manager. For information on SIETS Enterprise Manager, see the SIETS Administration and Configuration Guide.
<siets:content>
<scheme>
<part>
<location>location of the document part in XPath notation</location>
<policy>policy for the document part, policy=value</policy>
</part>
</scheme>
</siets:content>
For more information on document policies, see Importing XML Structured Data.
The set_scheme command cannot be submitted as HTTP GET parameters.
If the command is executed successfully, the XML reply does not contain any command specific data.
If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
This section describes the status command.
The status command returns status information of the SIETS server instance. The status information includes:
number of documents in the SIETS storage
number of words in the vocabulary
total number of words in the SIETS storage
number of executed commands since the last startup of the instance
number of errors that have occurred since the last startup of the instance
The <siets:content> element does not contain any command specific data.
http://host/cgi-bin/siets/api.cgi?command=status
If the command is executed successfully, the XML reply contains the following command specific data.
<siets:content>
<status>
<ctrld>
<started> date and time, when the SIETS server was started </started>
<age>time period the SIETS server is working since it was started</age>
<total_time_elapsed>total time spent by the SIETS server executing commands</total_time_elapsed>
<transactions><!-- This element contains information about executed commands. -->
<total> total number of commands executed</total>
<successful>number of commands that were successfully executed</successful>
<failed> number of commands that were executed unsuccessfully </failed>
<requests command="command name">number of times the command was executed </requests> <!-- This element is repeated for every command that was executed. -->
</transactions>
<last_modified> date and time, when modifications in SIETS storage occurred last time </last_modified>
<queue> number of commands executed simultaneously </queue>
<version> SIETS version number</version>
</ctrld>
<mtxd> <!-- This element contains information about the inverted index. -->
<journal>
<usage> indexing memory cache usage in percent</usage>
</journal>
<pool_state> index state: normal, expanding, or collapsing</pool_state>
</mtxd>
<wordd> <!-- This element contains information about the vocabulary. -->
<unique_words>unique words in the SIETS storage</unique_words>
<total_words>total number of all words</total_words>
</wordd>
<docd>
<documents>total number of documents</documents>
<domains> number of distinct domains of documents</domains>
</docd>
</status>
</siets:content>
When importing data to the SIETS storage:
If the memory reserved for memory cache is enough for the data amount being imported, the index state is normal.
If the memory reserved for memory cache is not enough for the data amount being imported, the index state is one of the following:
Title |
Description |
---|---|
expanding |
The data being imported are written to another cache, which is written to the disk. |
collapsing |
When the importing is complete, the SIETS server commits the data written on the disk to the SIETS storage. |
Note: While the index state is expanding or collapsing, the data written to the disk are not available for FTS. The data become available for FTS only after they are added to the inverted index.
If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
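As a sketch of how an application might read the status reply, the following Python code extracts the index state and the document count with xml.etree.ElementTree. The sample reply is shortened and its values are invented for illustration; the helper name read_status is also an assumption.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative reply; the values are not real server output.
SAMPLE_REPLY = """<siets:reply xmlns:siets="www.siets.net">
  <siets:content>
    <status>
      <mtxd><pool_state>normal</pool_state></mtxd>
      <docd><documents>1200</documents></docd>
    </status>
  </siets:content>
</siets:reply>"""


def read_status(reply_xml):
    """Return the index state and the document count from a status reply."""
    root = ET.fromstring(reply_xml)
    # Envelope elements are namespaced; command specific elements are not.
    status = root.find("{www.siets.net}content/status")
    return {
        "pool_state": status.findtext("mtxd/pool_state").strip(),
        "documents": int(status.findtext("docd/documents")),
    }


info = read_status(SAMPLE_REPLY)
print(info)
```

An importing application could poll this value and wait until pool_state returns to normal before running searches on newly loaded data.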
This section describes the following data retrieval commands:
The lookup command searches for a document in the SIETS storage and returns information on whether a document with the given ID exists in the SIETS storage.
The retrieve command returns a document from the SIETS storage. If a document with the given ID is not in the SIETS storage, the command returns an error.
<siets:content>
<document>
<id>document id *</id>
</document>
</siets:content>
http://host/cgi-bin/siets/api.cgi?command=lookup&storage=test&id=1
http://host/cgi-bin/siets/api.cgi?command=retrieve&storage=test&id=1
If the command is executed successfully, the XML reply contains the following command specific data.
<siets:content>
<found>indicator 1 or 0 if a document is found or not, respectively</found>
<results>
<document>
meta data for the lookup command
textual information for the retrieve command
</document>
</results>
</siets:content>
Meta data for the lookup command is information included in tags for which the list policy is set to YES. By default, these are the id, title, and rate tags.
For more information on policies, see Importing XML Structured Data.
If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
The search command performs full text search (FTS) in the SIETS storage.
<siets:content>
<query> search query *</query>
<docs> number of documents in the result set </docs>
<offset> offset from the beginning of the result set</offset>
<case_sensitive> Boolean type parameter: YES to enable case sensitivity of the first letter of words when performing the search, NO not to enable case sensitivity </case_sensitive>
<relevance> Boolean type parameter: YES to order results by relevance, NO not to order results by relevance </relevance>
<max_from_domain> Maximum number of documents from one domain. Results from one domain are grouped together within one result page. If the parameter is not set, the default value is 0, which implies that no grouping by domains is performed and no limit is set. </max_from_domain>
<rate_from> searching documents within a rate range: the FROM value </rate_from>
<rate_to> searching documents within a rate range: the TO value </rate_to>
<wildcards> <!-- This element contains parameters for configuring wildcard patterns support. Functionality of this tag is available only starting from the SIETS server version 3.2.8.-->
<allow> Information whether the wildcard patterns search is enabled. Values yes or no.</allow>
<cover_factor> When wildcard patterns are used to define a class of words to be searched, only a limited number of statistically frequent words are searched for to ensure a higher performance. This element defines the limit in percent from the sum of all words created from the wildcard pattern appearance in the SIETS storage.</cover_factor>
<min_expand> The minimum limit of the wildcard patterns matching set from the SIETS storage vocabulary in absolute numbers. This parameter overrides the cover_factor parameter. For example, if only 2 words fall within the cover_factor, but min_expand is 4, then 4 words are used in the search.</min_expand>
<max_expand> The maximum limit of the wildcard patterns matching set from the SIETS storage vocabulary in absolute numbers. This parameter overrides the cover_factor parameter. For example, if 20 words fall within the cover_factor, but max_expand is 16, then only 16 words are used in the search.</max_expand>
</wildcards>
</siets:content>
If values for the wildcards tag are not defined, the corresponding parameters set in the SIETS storage configuration file are used.
For more information on configuring the SIETS storage, see the SIETS Administration and Configuration Guide.
This section contains the following topics:
SIETS provides several mechanisms for specifying your search query. Each mechanism has a definite syntax, which is described in the following subsections. For a better understanding, each subsection also contains an example of the mechanism described and an explanation about what the example search query returns.
This section contains the following topics:
To search for documents that contain a single search term, the search term must be entered as is.
Example:
John
returns documents that contain the word John.
To search for documents that contain all of several terms, which are not necessarily next to each other, the search terms must be separated by the space character.
Example:
John Smith
returns documents that contain the word John and the word Smith.
To search for documents that contain an exact phrase, the search phrase must be enclosed in quotation marks.
Example:
"John Smith"
returns documents that contain the exact phrase John Smith.
To search for documents that contain any of the search terms, the search terms must be enclosed in { } and separated with the space character.
Example:
{John Smith}
returns documents that contain either the word John or the word Smith.
To search for documents that do not contain the search term, the search term must be preceded with ~.
Example:
~John
returns documents that do not contain the word John.
AND, OR, and NOT logical connectives can be combined in more complex search expressions using the brackets ( ), which allows you to build any Boolean expression.
Example:
{(John Smith) (Abby Brown)}
returns documents that contain either the word John and the word Smith, or the word Abby and the word Brown.
{(A B ~C) D E}
is parsed in the expression tree as follows:
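Read with the rules described above, this expression corresponds to OR(AND(A, B, NOT C), D, E). The following Python sketch builds such query strings from small helper functions; the helper names all_of, any_of, none_of, and phrase are assumptions made for illustration and are not part of the SIETS API.

```python
def all_of(*terms):
    """AND: every term must be present; grouped with ( )."""
    return "(" + " ".join(terms) + ")"


def any_of(*terms):
    """OR: at least one term must be present; grouped with { }."""
    return "{" + " ".join(terms) + "}"


def none_of(term):
    """NOT: the term must be absent; prefixed with ~."""
    return "~" + term


def phrase(text):
    """Exact phrase; enclosed in quotation marks."""
    return '"' + text + '"'


query = any_of(all_of("A", "B", none_of("C")), "D", "E")
print(query)  # {(A B ~C) D E}
```

Nesting the helper calls mirrors the shape of the expression tree, which makes complex queries easier to review than hand-written strings.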
To search for documents that contain a class of words, use wildcard patterns that represent:
exactly one unknown character using the question mark ?
one or more unknown characters using the asterisk *
a range of definite characters for one unknown character occurrence using the square brackets [ ]
Note: When wildcard patterns are used to define a class of words to be searched, only a limited number of statistically frequent words are searched for. This limitation is introduced to preserve the high performance of the SIETS server. However, the maximum number of the words being searched can be increased or decreased, when configuring the SIETS server. For more information on configuring the SIETS server, see the SIETS Administration and Configuration Guide.
Example:
ca?
returns documents that contain the word car, cat, cap, can, and so on.
Joh*
returns documents that contain the word John, Johnson, Johnny, and so on.
ca[pt]
returns documents that contain only the word cap or cat.
c?[au]*
returns documents that contain the word counter, club, chapter, country, change, chat, council, class, cpu, challenge, church, couple, championship, and so on.
By default, SIETS ignores common words and characters such as and, where, and how, as well as certain single characters and single letters, because they tend to slow down the search without improving the search results. Common words and characters like this are called ignored words.
The SIETS server detects words that appear in the SIETS storage most often and adds them to the ignored words list. It is possible to edit the limit of the ignored words list. For more information on managing the ignored word list limit, see the SIETS Administration and Configuration Guide.
If a common word or a character is essential to getting the results you want, you can include it by preceding it with a plus sign +.
Example:
John +and Abby
returns documents that contain all three words: John, and, and Abby.
It is possible to include in one search request a word and its declinations, for example, go and going.
This feature is especially useful for so-called synthetic languages, in which syntactic relations within sentences are expressed by the change in the form of a word that indicates distinctions of tense, person, gender, number, mood, voice, and case, for example, German and Latin.
To enable the declination search, a shared library must be implemented, which exports a function that extracts word roots.
For information on installing the shared library for the SIETS server, see the SIETS Administration and Configuration Guide.
To search for documents that contain words in declinations, a word or a phrase must be enclosed in dollar signs $ $.
Example:
$John$
returns documents that contain the word John and Johns.
To search for documents that contain the search term in a specific tag, the search term must be enclosed in the appropriate tags.
Note: Searching within markup can be performed only if the index policy with the value xml or all is used. For the default document structure, the index policy with the value xml is set by default. For more information on policies, see Importing XML Structured Data.
Example:
<person>John Smith</person>
returns documents that contain the word John in the <person> tag and the word Smith in the <person> tag.
{<person>John</person> <address>New York</address>}
returns documents that contain either the word John in the <person> tag or the phrase New York in the <address> tag.
It is possible to define the maximum number of words that may appear between certain search terms. These search terms are also defined in the search query. This feature is called proximity search.
To use the proximity search feature, the search term must be as follows:
@ N term1 term2 @
where N is the maximum count of words between the search terms, and term1 and term2 are search terms. Any number of search terms can be included in the proximity search.
If N is 1, then the search is exactly the same as if the phrase search was used.
For more information on the phrase search, see Phrase Search.
Example:
@ 3 street city @
returns documents that contain the words street and city not further than 3 words from each other.
Note: The numeric search functionality is available only starting from the SIETS server version 3.3.
Because the SIETS server indexes numeric information as well as text, it is possible to perform numeric search. Numeric search allows searching for documents that contain numeric values within a numeric interval.
For example, suppose each document contains information about an object, including its geographic coordinates. In that case, the numeric search can be performed to retrieve all objects within a definite range of geographic coordinates. Thus, SIETS can be used in online maps, where people can find information on different objects in a definite area.
The numeric search can be performed only together with a text search.
Numeric values in documents are indexed and stored as floating point numbers, regardless of whether they are integers or floating point numbers in the original documents.
The fractional part is stored with up to six digits.
To use the numeric search functionality, the search term must be as follows:
To perform numeric search within a range of two numeric values, enter _textual search term_ X .. Y, where X is the minimum value of the numeric search interval, and Y is the maximum.
To perform numeric search for documents that contain a numeric value greater than the given value, enter _textual search term_ >X.
To perform numeric search for documents that contain a numeric value smaller than the given value, enter _textual search term_ <X.
It does not matter whether the textual search term is entered before or after the numeric search term.
Example:
Document content:
<document>
<id>32423</id>
<title>Johns profile</title>
<text>
<name>John Smith</name>
<age>32</age>
</text>
</document>
Search query that matches the document:
<query>
<name>John</name> <age>30 .. 40</age>
</query>
<numeric_ordering>center</numeric_ordering>
Note: When performing numeric search in one document tag, such as the <age> tag in the previous example, only one numeric interval can be entered. If you enter more than one numeric interval for one tag, nothing is returned, since numeric intervals are joined with the AND logical operation.
For information on performing numeric search in more than one tag, see Numeric Search in More Than One Tag.
The <numeric_ordering> tag in the example denotes the order in which search results must be returned.
Possible values for numeric ordering are the following:
Title |
Description |
---|---|
|
No numeric ordering is applied. |
center |
Results that are closer to the mean value of the numeric search interval are returned first. This value is allowed only for numeric search within a range of two numeric values. |
|
Numeric search results are returned in ascending order. |
|
Numeric search results are returned in descending order. |
Numeric Search in More Than One Tag
It is possible to perform numeric search in more than one tag. This means that a numeric search range can be specified for each tag that contains numeric information.
Example:
Document content:
<document>
<id>32423</id>
<title>Johns profile</title>
<text>
<name>John Smith</name>
<age>32</age>
<children>2</children>
</text>
</document>
Search query that matches the document:
<query>
<name>John</name> <age>30 .. 40</age> <children>&lt; 3</children>
</query>
<numeric_ordering>center</numeric_ordering>
Numeric search in more than one tag is especially useful for geographic coordinate searching, where it is necessary to search for an object by its longitude and latitude.
For numeric search in more than one tag, result ordering is combined into one ordering for all tags.
The following table describes how result ordering is combined:
Ordering type |
Description |
---|---|
|
Results are ordered ascending by the sum of all numeric values from tags in which the numeric search is performed. |
|
Results are ordered descending by the sum of all numeric values from tags in which the numeric search is performed. |
center |
Results are ordered by the shortest distance to the center of the intervals in multi-dimensional space, where each dimension represents a tag in which the numeric search is performed. The distance to the center of the intervals is calculated by the following formula: ((x-xc)/xr)^2 + ((y-yc)/yr)^2 + ... + ((z-zc)/zr)^2, where x, y, ..., z are the numeric values in the searched tags; xc, yc, ..., zc are the centers of the respective intervals; and xr, yr, ..., zr are half of the respective numeric interval ranges. |
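The distance formula can be illustrated with a short Python sketch that orders candidate values by their normalized distance to the interval centers. The function name center_distance and the sample values are assumptions made for illustration; the ordering inside the SIETS server may differ in detail.

```python
def center_distance(values, intervals):
    """Sum of squared, range-normalized offsets from each interval center."""
    total = 0.0
    for value, (low, high) in zip(values, intervals):
        center = (low + high) / 2
        half_range = (high - low) / 2
        total += ((value - center) / half_range) ** 2
    return total


# One searched tag (for example <age>) with the interval 30 .. 40.
intervals = [(30, 40)]
candidates = [(32,), (36,), (39,)]  # hypothetical values from three documents

ordered = sorted(candidates, key=lambda values: center_distance(values, intervals))
print(ordered)  # the value closest to the center (35) comes first
```

Dividing by half of the interval range puts every dimension on the same scale, so a wide interval in one tag does not dominate a narrow interval in another.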
Numeric search functionality in several tags, or in several dimensions, has an additional feature that allows returning numeric search results that match:
a hypercube of all numeric intervals, which is the default, or
only a hypersphere of all numeric intervals.
For example, if geographic coordinates of ATMs in a city are indexed, it is possible to search for an ATM that is not farther than 1 kilometer from a definite location. That is, you need to retrieve only those ATMs that match the circle (a hypersphere with 2 dimensions in this case) with a radius of 1 kilometer.
If, in the previous example, the default numeric search is performed, results that match a square with a side length of 2 kilometers are returned. This means that ATMs up to the square root of 2 kilometers away, which is approximately 1.41 kilometers, are also returned.
As said before, the default value for the multi-dimensional shape feature is a hypercube. The value for the multi-dimensional shape feature is defined in the <md_shape> tag, which is included in the search command syntax.
Possible values for the <md_shape> tag are the following:
Value |
Description |
---|---|
|
Results that match a hypercube are returned. |
sphere |
Results that match a hypersphere are returned. |
Example:
Document content:
<document>
<id>32425</id>
<title>ATMs profile</title>
<text>
<name>ATM</name>
<x>1.2</x>
<y>3.7</y>
</text>
</document>
Search query that matches the document and finds ATMs within the distance of 1 kilometer from point (2.0, 4.0):
<query>
<name>ATM</name> <x>1.0 .. 3.0</x> <y>3.0 .. 5.0</y>
</query>
<numeric_ordering>center</numeric_ordering>
<md_shape>sphere</md_shape>
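The difference between the two shapes can be checked with a short Python sketch. It assumes that a point matches the hypersphere when its normalized distance to the interval centers is at most 1; the function names and the second test point are assumptions made for illustration.

```python
def normalized_distance_sq(point, intervals):
    """Sum of squared, range-normalized offsets from each interval center."""
    total = 0.0
    for value, (low, high) in zip(point, intervals):
        center = (low + high) / 2
        half_range = (high - low) / 2
        total += ((value - center) / half_range) ** 2
    return total


def in_cube(point, intervals):
    """Hypercube match: every value lies inside its own interval."""
    return all(low <= value <= high for value, (low, high) in zip(point, intervals))


def in_sphere(point, intervals):
    """Hypersphere match (assumed rule): normalized distance of at most 1."""
    return normalized_distance_sq(point, intervals) <= 1.0


intervals = [(1.0, 3.0), (3.0, 5.0)]  # the x and y intervals from the query
atm = (1.2, 3.7)                      # the document from the example
corner = (2.9, 4.9)                   # hypothetical point near a corner of the square

print(in_cube(atm, intervals), in_sphere(atm, intervals))
print(in_cube(corner, intervals), in_sphere(corner, intervals))
```

The example document matches both shapes, while the corner point matches only the square, which is the case the sphere value is meant to exclude.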
It is possible to perform case-sensitive search for proper names, which means that case sensitivity is applied to the first letter of a search term.
The case sensitivity feature is switched on or off by setting the <case_sensitive> parameter in the search command's XML request.
For more information on the search command's XML request, see XML Request.
Example:
If the <case_sensitive> parameter is set to YES, and the search query contains Bank, then the search command returns documents in which the word Bank starts with a capital letter. Note that in this case, documents in which the word BANK is written in all capitals are also returned, since case sensitivity is applied only to the first letter of a search term.
It is possible to set the maximum number of documents from one domain that are returned in a search result. If this feature is used, documents from one domain are grouped together within one result page.
The grouping results by domain feature is enabled by setting the <max_from_domain> parameter in the search command's XML request to a value larger than 0.
If the parameter is not set, the default value is 0, which implies that no grouping by domains is performed and no limit is set.
For more information on the search command's XML request, see XML Request.
It is possible to filter search results by document rate by setting the minimum and maximum of the rate range. A document appears in the search result only if its rate is within this range.
Document rate is of the integer type. However, any date and time can be converted into an integer using the UNIX timestamp, which is the number of seconds from 01/01/1970 until the given date and time. Thus, it is possible to set a date and time as the document rate and to search for documents within a certain time interval.
Filtering results by rate is defined by setting the <rate_from> and <rate_to> parameters in the search command's XML request.
For more information on the search command's XML request, see XML Request.
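The conversion of a time interval into a rate range can be sketched as follows. The helper names are hypothetical, and the sketch assumes that timestamps are taken in UTC, which the text above does not specify.

```python
from datetime import datetime, timezone


def to_rate(moment):
    """Convert a timezone-aware datetime into whole seconds since 01/01/1970 UTC."""
    return int(moment.timestamp())


def rate_filter(start, end):
    """Return the <rate_from> and <rate_to> elements for a time interval."""
    return "<rate_from>%d</rate_from>\n<rate_to>%d</rate_to>" % (
        to_rate(start), to_rate(end))


# Example interval: January 2005 (UTC is an assumption of this sketch).
start = datetime(2005, 1, 1, tzinfo=timezone.utc)
end = datetime(2005, 2, 1, tzinfo=timezone.utc)
print(rate_filter(start, end))
```

The same conversion must be used when the document rate is set at import time, otherwise the stored rates and the filter boundaries are not comparable.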
SIETS is designed with Web applications in mind. In many cases, paging is used to display results on the Web. Paging means that the search result records are divided into parts, where each part is displayed on its own page and contains a fixed number of records.
The Web friendly result navigation feature is defined by setting the <docs> and <offset> parameters in the search command's XML request.
For more information on the search command's XML request, see XML Request.
Example:
If the <docs> parameter is set to 20 and the <offset> parameter is set to 40, the search command returns results 40 through 59.
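The paging arithmetic from the example can be sketched as follows. The function names are hypothetical, and page numbers are assumed to start at zero.

```python
def page_offset(page, docs):
    """Return the <offset> value for a zero-based page number."""
    return page * docs


def result_range(docs, offset):
    """Return the first and last result positions covered by one page."""
    return offset, offset + docs - 1


docs = 20
offset = page_offset(2, docs)      # third page when counting from zero
print(offset, result_range(docs, offset))
```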
XML drilldown is a feature that allows grouping documents into a hierarchical structure and searching within this structure. Using this feature, you can create catalogues, index files, directories, and more.
Setting classify policy
The classify policy should be set for those tags for which menus should be generated.
<part>
<location>//document/spectags</location>
<policy>index=classify</policy>
</part>
The policy schema should be set before indexing data.
Document import
Once you have set the policy schema, you can import documents into the storage.
<document>
<id>3049223</id>
<title>Article</title>
<text>This is article</text>
<spectags>
<type>News_item<type>Comment</type></type>
<author>John_Smith</author>
</spectags>
</document>
Note that subtags of type or author can also be included to enable multi-level navigation.
Menu generation
Within the content tag of a search request, supply a menu tag. It identifies the XPath location, relative to the tag for which the classify policy is defined, of the tag for which to generate a menu with hit distribution. Note that the menu is generated along with the search query and represents the number of hits.
Example request (single level drilldown):
<siets:command>search</siets:command>
<siets:content>
<query>article</query>
<menu>/type</menu>
</siets:content>
Example response (single level drilldown):
<siets:content>
<menu>
<item hits="25">News_item</item>
<item hits="1">Comment</item>
<item hits="7">Question</item>
<item hits="7">Reply</item>
</menu>
</siets:content>
For multi level drilldown, pass the correspondingly deeper XPath location. Be sure to add =<selected value> to each parent category, or you will receive invalid hits.
Example request (multi level drilldown):
<siets:command>search</siets:command>
<siets:content>
<query>article</query>
<menu>/type=News_item/type</menu>
</siets:content>
Example response (multi level drilldown):
<siets:content>
<menu>
<item hits="10">Comment</item>
<item hits="1">Question</item>
<item hits="3">Reply</item>
</menu>
</siets:content>
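Building the value of the menu tag for a given drilldown level can be sketched as follows. The function name is hypothetical. The sketch only reproduces the path format seen in the two example requests and assumes that tag names and selected values contain no characters that need escaping.

```python
def menu_path(selected, tag):
    """Build a <menu> value from already selected (tag, value) pairs plus the tag to expand."""
    parts = ["/%s=%s" % (name, value) for name, value in selected]
    parts.append("/" + tag)
    return "".join(parts)


print(menu_path([], "type"))                        # single level drilldown
print(menu_path([("type", "News_item")], "type"))   # multi level drilldown
```

Each parent category carries its selected value, which is what keeps the hit counts of the deeper level restricted to the chosen branch.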
http://host/cgi-bin/siets/api.cgi?command=search&storage=test&query=Jhon
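A request URL of this form can be assembled programmatically, which also takes care of encoding special characters in parameter values. The following is a minimal sketch; the function name is hypothetical, and host and test are the placeholder values from the example above.

```python
from urllib.parse import urlencode


def api_url(host, command, storage, **params):
    """Build a SIETS HTTP GET URL from a command, a storage name and command parameters."""
    query = {"command": command, "storage": storage}
    query.update(params)
    return "http://%s/cgi-bin/siets/api.cgi?%s" % (host, urlencode(query))


print(api_url("host", "search", "test", query="Jhon"))
```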
If the command is executed successfully, the XML reply contains the following command specific data.
<siets:content>
<ignored> common words that are ignored when performing the search </ignored>
<realquery> real query that was used to perform the search, including the words derived from wildcard usage and excluding the ignored words </realquery>
<found> number of documents found </found>
<hits> approximate total amount of results that match the search query </hits>
<more> number that indicates how many more documents that match the search query are found but are not yet returned to the result set; a precise number if in the form of =N, and a minimum number if in the form of >N </more>
<from> documents in the result set within a numerical range: the FROM value </from>
<to> documents in the result set within a numerical range: the TO value </to>
<results>
<document> meta data of the document found </document>
</results>
</siets:content>
If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
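The two forms of the more element can be told apart on the client side with a small helper. This is a sketch with a hypothetical function name; it assumes that the value consists of the = or > sign followed directly by an integer.

```python
def parse_more(value):
    """Split a <more> value into (count, is_exact)."""
    text = value.strip()
    if text.startswith("="):
        return int(text[1:]), True      # precise number of remaining documents
    if text.startswith(">"):
        return int(text[1:]), False     # at least this many remain
    raise ValueError("unexpected <more> value: %r" % value)


print(parse_more("=12"))
print(parse_more(">100"))
```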
Note: The select command is available only starting from the SIETS server version 3.2.6.
The select command searches for documents by their identifiers. It is possible to select one document by a precisely entered document identifier or to use a wildcard pattern to select all documents whose identifiers match the pattern. For example, if only the asterisk * is entered, identifiers for all documents in the SIETS storage are returned.
The default number of document identifiers returned to the result set is 1024, but this number can be changed by entering a different number in the <docs> tag.
<siets:content>
<document>
<id>document id *</id>
</document>
<docs> number of document identifiers in the result set </docs>
<offset> offset from the beginning of the result set </offset>
</siets:content>
If the command is executed successfully, the XML reply contains the following command specific data.
<siets:content>
<found> number of document identifiers matched </found>
<from> document identifiers in the result set within a numerical range: the FROM value </from>
<to> document identifiers in the result set within a numerical range: the TO value </to>
<results>
<id> meta data of the document found </id>
</results>
</siets:content>
If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
The similar command searches the SIETS storage for documents that are similar to textual information, which is either given directly or contained in a document. The textual information for which similar documents are searched is also referred to as the input text.
The algorithm uses statistical information about how many times the words contained in the input text, the so-called keywords, appear in documents, and finds documents that are similar to the input text fragment or to the document with a given ID.
You must take into account that the algorithm uses statistical information about words and does not know their meaning. Therefore, similar documents might not be semantically alike. However, practice with large text collections that contain medium-large documents shows that the algorithm works well.
<siets:content>
<id> document id to which similar documents must be searched for ** </id>
<text> textual information to which similar documents must be searched for ** </text>
<len> number of keywords in the input text * </len>
<quota> minimum number of keywords that must be found in documents that are returned in the search result * </quota>
<docs> number of documents to be returned in the result set </docs>
<offset> offset from the beginning of the result set </offset>
</siets:content>
For large text collections in the SIETS storage, practice shows that setting the len element to 20 and the quota element to 4 gives the best results. However, you can experiment to find the best values for your specific text collection.
The two asterisks ** mean that exactly one of the two elements must be entered. In other words, the relationship between these two elements is XOR.
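The XOR relationship can be enforced on the client side before a request is sent. The following is a minimal sketch with a hypothetical function name. The default values 20 and 4 are the recommended len and quota values mentioned above, not documented server defaults.

```python
def similar_params(doc_id=None, text=None, length=20, quota=4):
    """Return parameters for the similar command, requiring exactly one of id or text."""
    if (doc_id is None) == (text is None):
        raise ValueError("enter exactly one of id or text")
    params = {"command": "similar"}
    if doc_id is not None:
        params["id"] = doc_id
    else:
        params["text"] = text
    params["len"] = str(length)
    params["quota"] = str(quota)
    return params


print(similar_params(doc_id="Doc1"))
```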
http://host/cgi-bin/siets/api.cgi?command=similar&storage=test&id=Doc1&len=20&quota=4
http://host/cgi-bin/siets/api.cgi?command=similar&storage=test&text=Jhon&len=20&quota=4
If the command is executed successfully, the XML reply contains the following command specific data.
<siets:content>
<found> number of documents found </found>
<hits> approximate total amount of results that match the search query </hits>
<more> number that indicates how many more documents that match the search query are found but are not yet returned to the result set; a precise number if in the form of =N, and a minimum number if in the form of >N </more>
<from> documents in the result set within a numerical range: the FROM value </from>
<to> documents in the result set within a numerical range: the TO value </to>
<results>
<document> meta data of the document found </document>
</results>
</siets:content>
If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
If the alternatives search is performed, the system returns a set of alternative words from the SIETS storage vocabulary that are similar in spelling or have a different declension. For example, if you enter bote, then bite and byte are offered for searching. Note that only words from the SIETS storage are returned.
This feature can be used for fuzzy searches and for correcting spelling errors.
Alternative words are returned from the vocabulary, which ensures that they are actual words that have been imported into the SIETS storage. When searching for alternative words, the alternatives command considers the statistical information about the occurrence of the alternative word in the vocabulary and the similarity of the alternative word to the search term. In other words, alternatives that occur in the SIETS storage more often and that are more similar to the search term are returned.
<siets:content>
<query> search query * </query>
<cr> Minimum ratio between the occurrence of the alternative and the occurrence of the search term that is required to include the alternative. If you increase this parameter, fewer results are returned to the result set, but performance is improved.</cr><!-- Functionality of this tag is available only starting from the SIETS server version 3.2.8.-->
<idif> Maximum number that indicates how much the alternative differs from the search term; the greater the idif value, the greater the difference. If you increase this parameter, more results are returned to the result set, but performance is reduced.</idif><!-- Functionality of this tag is available only starting from the SIETS server version 3.2.8.-->
<h> Minimum number that gives an overall estimation of the quality of the alternative; the greater the cr value and the smaller the idif value, the greater the h value. If you increase this parameter, fewer results are returned to the result set, but performance is improved.</h><!-- Functionality of this tag is available only starting from the SIETS server version 3.2.8.-->
</siets:content>
If values for the cr, idif, or h tags are not defined, the corresponding parameters set in the SIETS storage configuration file are used.
For more information on configuring SIETS storage, see the SIETS Administration and Configuration Guide.
http://host/cgi-bin/siets/api.cgi?command=alternatives&storage=test&query=Jhon
If the command is executed successfully, the XML reply contains the following command specific data.
<siets:content>
<alternatives_list>
<alternatives>
<to> alternative search term </to>
<count> number of times the alternative search term occurs in the SIETS storage</count>
<word count="number of times the alternative occurs in the SIETS storage" cr="ratio between the occurrence of the alternative and the occurrence of the search term" idif="number that indicates how much the alternative differs from the search term; the greater the idif value, the greater the difference" h="number that gives an overall estimation of the quality of the alternative; the greater the cr value and the smaller the idif value, the greater the h value"> alternative </word>
</alternatives>
</alternatives_list>
</siets:content>
If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
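On the client side, the returned alternatives can be ordered by their h attribute, for example to offer the best candidate first. The following sketch makes several assumptions: the function name is hypothetical, the reply fragment is reduced to word elements only, and all attribute values in the sample are invented for illustration and do not come from a real SIETS reply.

```python
import xml.etree.ElementTree as ET

# Hypothetical reply fragment; attribute values are invented for illustration.
SAMPLE = """
<alternatives>
  <word count="120" cr="0.8" idif="1" h="0.9">bite</word>
  <word count="45" cr="0.3" idif="1" h="0.5">byte</word>
</alternatives>
"""


def rank_alternatives(xml_text):
    """Return the alternative words ordered by descending h value."""
    root = ET.fromstring(xml_text.strip())
    words = [(float(w.get("h")), w.text.strip()) for w in root.iter("word")]
    ranked = sorted(words, key=lambda item: item[0], reverse=True)
    return [text for _, text in ranked]


print(rank_alternatives(SAMPLE))
```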
Note: The list-last command is available only starting from the SIETS server version 3.2.9.
The list-last command searches the SIETS storage for the documents that have most recently been inserted, updated, or replaced using the insert, update, or replace commands, respectively.
<siets:content>
<docs> number of documents in the result set </docs>
<offset> offset from the beginning of the result set </offset>
</siets:content>
http://host/cgi-bin/siets/api.cgi?command=list-last&storage=test&docs=10&offset=100
If the command is executed successfully, the XML reply contains the following command specific data.
<siets:content>
<found>number of documents returned to the result set</found>
<results>
<document> meta data for the list-last command</document>
</results>
</siets:content>
Meta data for the list-last command is the information included in tags for which the list policy is set to YES. By default, these are the id, title, and rate tags.
For more information on policies, see Importing XML Structured Data.
If the command is not executed successfully, an error is returned. For more information on errors, see Error Handling.
SIETS Server provides alerting functionality. Alerts are defined as search queries that can be performed against the storage inside the server. Alerts are not triggered automatically; a special command must be used. This gives the user application more flexibility in alert handling.
Alerting API commands are sent to the server using standard SIETS XML messaging.
This command adds a trigger, identified by the supplied ID, that matches documents against the query supplied in the filter tag.
<siets:command>add_trigger</siets:command>
<siets:content>
<id>Trigger id</id>
<filter>Trigger filter query</filter>
<recipient>Recipient of notification</recipient>
</siets:content>
This command removes a specific trigger.
<siets:command>remove_trigger</siets:command>
<siets:content>
<id>Trigger id</id>
</siets:content>
This command clears all triggers.
<siets:command>clear_triggers</siets:command>
This command tests the document whose ID is supplied against all triggers. If the notify parameter is set to yes, the shell script is executed for each trigger that matches the document. The reply to this command also contains the list of IDs of the triggers that matched the document.
<siets:command>examine</siets:command>
<siets:content>
<document>
<id>document id to examine</id>
</document>
<notify>yes/no to send message or not</notify>
</siets:content>
The storage configuration can be used to specify the shell script that is executed when a trigger matches a document.
<config>
<alerts>
<action>Shell script to execute</action>
</alerts>
</config>
If a command sent to the SIETS server is not executed successfully, an error is returned in the following XML reply message:
<?xml version="1.0"?>
<siets:reply>
<siets:timestamp> date and time </siets:timestamp>
<siets:storage> storage name </siets:storage>
<siets:requestid>XML request ID</siets:requestid>
<siets:error>
<code>error code</code>
<text> error textual message</text>
<level> error severity</level>
<source>subsystem in which the error occurred</source>
</siets:error>
<siets:seconds> time period in which the XML reply is returned </siets:seconds>
</siets:reply>
The error severity can be one of the following:
| Title | Description |
|---|---|
| Warning | Returned when the command is executed successfully, but there are some indications of problems. |
| Failed | Returned when the input data is incorrect. |
| Error | Returned when an error occurs during command execution. |
| Fatal | Returned when the system is not functioning. |
The purpose of the error severity is to tell the calling system how to react:
If the error severity is fatal or error, the system's work must be interrupted and the system administrator must be informed.
If the error severity is failed or warning, the errors can be logged and analyzed, while the system's work can continue.
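The reaction rules above can be sketched as a small decision function for a client application. The names are hypothetical, and the sketch assumes that the level element contains one of the four severity titles, compared without regard to letter case.

```python
HALT_LEVELS = {"fatal", "error"}
CONTINUE_LEVELS = {"failed", "warning"}


def must_interrupt(level):
    """Return True if work must stop and the system administrator must be informed."""
    name = level.strip().lower()
    if name in HALT_LEVELS:
        return True
    if name in CONTINUE_LEVELS:
        return False    # log and analyze, then continue
    raise ValueError("unknown error severity: %r" % level)


for level in ("Warning", "Failed", "Error", "Fatal"):
    print(level, must_interrupt(level))
```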
SIETS is a transaction-based system, which means that commands have a predefined timeout period. If a command is not executed within this timeout period, the command returns an error.
It is possible to define a timeout period for the request, or configure it for the SIETS server.
For more information on configuring timeout periods for the SIETS server, see the SIETS Administration and Configuration Guide.