This section contains the following frequently asked questions:
How can I make SIETS automatically ignore common words when performing FTS?
Is it possible to return more than 1000 documents to the result set?
To import binary data, such as MS Word or PDF document files, to the SIETS storage, you must enter them in the info document part.
Note: Data in the info part are not available for FTS. If you want your data to be available for FTS, they must be stored as plain text.
Usually, binary data do not comply with the XML formatting standard, but data must comply with it to be imported to the SIETS storage. Therefore, before importing, you must encode the binary data in base64 or another XML-safe encoding.
For more information on document parts, see Understanding SIETS Document Structure.
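The encoding step can be sketched in Python as follows. This is a minimal illustration only: the element names (document, id, title, info, file) and the encoding attribute are assumptions made for the example, not the documented SIETS document schema.

```python
import base64
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_document_xml(doc_id: str, title: str, binary_data: bytes) -> str:
    """Wrap base64-encoded binary data in a hypothetical info part."""
    # Base64 output uses only ASCII letters, digits, '+', '/' and '=',
    # so it is always safe to place inside an XML element.
    encoded = base64.b64encode(binary_data).decode("ascii")
    # Element names below are illustrative assumptions, not the SIETS schema.
    return (
        "<document>"
        f"<id>{escape(doc_id)}</id>"
        f"<title>{escape(title)}</title>"
        f'<info><file encoding="base64">{encoded}</file></info>'
        "</document>"
    )


# Stand-in bytes; in practice these would be read from a Word or PDF file.
pdf_bytes = b"%PDF-1.4\x00\x01\x02 binary payload"
xml_fragment = build_document_xml("doc-1", "Sample PDF", pdf_bytes)

# Round trip: the fragment parses as XML and decodes to the original bytes.
root = ET.fromstring(xml_fragment)
restored = base64.b64decode(root.find("info/file").text)
```

Because the encoded data are stored in the info part, they can be returned with the document but are not searchable with FTS.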
The SIETS server automatically detects words that appear in the SIETS storage most often and adds them to the ignored words list. These words are considered to be common words that are ignored during FTS.
It is possible to change the limit of the ignored words list. For more information on managing the ignored words list limit, see the SIETS Administration and Configuration Guide.
For more information on ignored words, see Ignored Words.
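The idea of frequency-based ignored words can be illustrated with a short Python sketch. It is not the SIETS implementation: how the server counts word frequency and tokenizes text is not described here, so raw occurrence counts and a simple word pattern are assumed.

```python
import re
from collections import Counter


def most_common_words(documents, limit):
    """Return the `limit` most frequent words across all documents."""
    counts = Counter()
    for text in documents:
        # Assumed tokenization: lowercase runs of word characters.
        counts.update(re.findall(r"\w+", text.lower()))
    return [word for word, _ in counts.most_common(limit)]


docs = ["the cat and the dog", "the bird and the fish", "and a cat"]
ignored = most_common_words(docs, 2)
```

Here the limit argument plays the role of the ignored words list limit: raising it moves more of the frequent words into the ignored list.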
Often the actual query that is used for FTS differs from the one you entered as a search query. The reasons for this can be the following:
If the original query contains words from the ignored words list, they are dropped from the actual search query.
If the original query contains wildcard patterns, the class of words generated from each wildcard pattern is entered in the actual search query.
To see the actual query used for FTS, use the <real_query> tag of the XML reply to the search command.
For more information on the search command, see Search.
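Reading the tag from a reply can be sketched as follows. The sample reply is a made-up stand-in: only the real_query tag name comes from this guide, while the surrounding reply and hits elements and the query text are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical reply to a search for "the quick fox", where "the" was ignored.
sample_reply = """<reply>
  <real_query>quick fox</real_query>
  <hits>3</hits>
</reply>"""


def extract_real_query(reply_xml):
    """Return the text of the first real_query element, or None if absent."""
    root = ET.fromstring(reply_xml)
    element = root.find(".//real_query")
    return element.text if element is not None else None


real_query = extract_real_query(sample_reply)
```

Comparing the returned value with the query you sent shows which words were dropped or expanded.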
The vocabulary is a list of all unique words in the SIETS storage. Unique words are found in documents and added to the vocabulary while storing these documents to the SIETS storage. Each SIETS storage has its own vocabulary.
Unfortunately, it is not possible to export the vocabulary with any of the SIETS API commands. However, at the file level the vocabulary is stored in a text file in which each line contains one word. You can copy this text file and view it.
For information on the vocabulary text file, see the SIETS Administration and Configuration Guide.
For more information on vocabulary, see Understanding Storing Information in SIETS.
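Once you have a copy of the vocabulary file, it can be read with a few lines of Python. The file name and the UTF-8 encoding below are assumptions; the sketch writes its own stand-in file so that it can be run as is.

```python
import tempfile
from pathlib import Path


def load_vocabulary(path):
    """Read a one-word-per-line text file into a set of words."""
    with open(path, encoding="utf-8") as handle:  # encoding is an assumption
        # Skip blank lines and strip line endings.
        return {line.strip() for line in handle if line.strip()}


# Stand-in for a copy of the vocabulary file; the real name and location
# are described in the SIETS Administration and Configuration Guide.
with tempfile.TemporaryDirectory() as tmp:
    vocab_file = Path(tmp) / "vocabulary_copy.txt"
    vocab_file.write_text("apple\nbanana\ncherry\n", encoding="utf-8")
    vocabulary = load_vocabulary(vocab_file)
```

Work on a copy of the file, as described above, rather than on the file used by the running server.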
Importing data to the SIETS server, just like any other operation with the SIETS server, is performed by transporting XML requests and replies via HTTP.
When importing a large amount of data to the SIETS server, many TCP/IP connections are opened. After the connections are closed, they remain in the TIME_WAIT state for a fixed time period.
By default, in the Windows NT 4.0 or Windows 2000 environment, the limit of connections is quite low and the TIME_WAIT period is long.
Therefore, because the number of new connections created per second can be very large and the closed connections remain in the TIME_WAIT state for some time, the number of connections can reach the limit very quickly.
In that case, the system does not allow new connections to be created and an error is returned.
To configure the limit of the connections and the TIME_WAIT state time period, set the following value in the Windows NT 4.0 or Windows 2000 registry:
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"TcpTimedWaitDelay"=dword:00000015
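where the value is the TIME_WAIT period in seconds. Note that in this registry file notation a dword value is written in hexadecimal, so 00000015 corresponds to 21 in decimal.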
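Because dword values in registry file notation are hexadecimal, it is easy to set a different number of seconds than intended. The following Python sketch converts between the two forms; the helper names are made up for this example, and the value of 30 seconds is only an illustration, not a recommendation.

```python
def parse_reg_dword(entry: str) -> int:
    """Return the decimal value of a '"Name"=dword:XXXXXXXX' registry line."""
    _name, _, value = entry.partition("=")
    prefix, _, hex_digits = value.partition(":")
    if prefix.strip().lower() != "dword":
        raise ValueError(f"not a dword entry: {entry!r}")
    # dword values in .reg notation are hexadecimal digits.
    return int(hex_digits, 16)


def format_reg_dword(name: str, seconds: int) -> str:
    """Format a decimal number of seconds as a registry file dword line."""
    return f'"{name}"=dword:{seconds:08x}'


seconds = parse_reg_dword('"TcpTimedWaitDelay"=dword:00000015')
entry_30 = format_reg_dword("TcpTimedWaitDelay", 30)
```

With these helpers, the entry shown above parses to 21, and 30 seconds is written as dword:0000001e.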
By default, the limit of documents to be returned to the result set is 1000. It is possible to increase this limit. However, the following functions are designed for a result set of at most 1000 documents:
sorting search results by relevance
grouping search results by a domain
If you increase the limit of documents in the result set, the new limit applies to all functions except these two: when search results are sorted by relevance or grouped by a domain, at most 1000 documents are returned to the result set.
If you increase the limit of documents in the result set, transactions in the SIETS server take longer to complete. Therefore, you should also increase the timeout period of functions.
For more information on configuring the limit of documents in the result set, see the SIETS Administration and Configuration Guide.
For more information on relevance, see Relevance.
For more information on grouping documents by a domain, see Search.
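The rule described above can be summarized in a small Python sketch. The function and parameter names are hypothetical and serve only to show how the configured limit and the fixed cap of 1000 interact.

```python
# Cap that applies when sorting by relevance or grouping by a domain.
RELEVANCE_AND_GROUPING_CAP = 1000


def effective_result_limit(configured_limit, sort_by_relevance=False,
                           group_by_domain=False):
    """Return how many documents a search can return at most."""
    if sort_by_relevance or group_by_domain:
        # These two functions never return more than 1000 documents.
        return min(configured_limit, RELEVANCE_AND_GROUPING_CAP)
    return configured_limit


plain_limit = effective_result_limit(5000)
relevance_limit = effective_result_limit(5000, sort_by_relevance=True)
```

With a configured limit of 5000, a plain search can return up to 5000 documents, while a search sorted by relevance still returns at most 1000.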
When importing data to the SIETS storage, if the memory reserved for the memory cache is not enough for the amount of data being imported, then:
The data being imported are written to another cache, which is written to the disk, and the index state is expanding.
When the import is complete, the SIETS server commits the data written on the disk to the inverted index, and the index state is collapsing.
While the index state is expanding or collapsing, the data written to the disk are not available for FTS. The data become available for FTS only after they are added to the inverted index.
For example, if the amount of data to be imported is tens of GB, these data will not be available for FTS for a few hours.
For more information on the index state, see Status.
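A client that needs to know when imported data become searchable could poll the index state, as in the following Python sketch. The get_index_state callable is a hypothetical stand-in for a request to the status command, and the state name "normal" is an assumption; only expanding and collapsing are taken from this section.

```python
import time

# States during which data written to the disk are not yet searchable.
BUSY_STATES = {"expanding", "collapsing"}


def wait_until_searchable(get_index_state, poll_seconds=60.0,
                          timeout_seconds=None, sleep=time.sleep):
    """Poll until the index state is neither expanding nor collapsing."""
    waited = 0.0
    while True:
        state = get_index_state()
        if state not in BUSY_STATES:
            return state
        if timeout_seconds is not None and waited >= timeout_seconds:
            raise TimeoutError(f"index still {state} after {waited} seconds")
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated sequence of states; "normal" is an assumed idle state name.
states = iter(["expanding", "expanding", "collapsing", "normal"])
final_state = wait_until_searchable(lambda: next(states), poll_seconds=0)
```

For large imports, a long polling interval is sufficient, because the expanding and collapsing phases can last for hours.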