How it works

The API Data Flow Model

Data gateway

Developers must define how they will access their data repository. This can be as simple as opening a local file or accessing remote data through a standard protocol such as JDBC or HTTP, or it can mean writing a custom set of access methods (gateways) for another repository, such as a news feed application, help desk application, or document management system.
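To make the gateway idea concrete, here is a minimal sketch in Python. The names DataGateway, FileGateway, list_ids and fetch are assumptions made for illustration; they are not taken from the Interseek / API.

```python
import os
from abc import ABC, abstractmethod
from typing import Iterator


class DataGateway(ABC):
    """Hypothetical gateway interface: enumerate and fetch repository items."""

    @abstractmethod
    def list_ids(self) -> Iterator[str]:
        """Yield an identifier for every item in the repository."""

    @abstractmethod
    def fetch(self, item_id: str) -> bytes:
        """Return the raw bytes of one item."""


class FileGateway(DataGateway):
    """Gateway over a local directory; the item id is the file name."""

    def __init__(self, root: str) -> None:
        self.root = root

    def list_ids(self) -> Iterator[str]:
        # Sorted so that repeated runs enumerate items in the same order.
        for name in sorted(os.listdir(self.root)):
            if os.path.isfile(os.path.join(self.root, name)):
                yield name

    def fetch(self, item_id: str) -> bytes:
        with open(os.path.join(self.root, item_id), "rb") as handle:
            return handle.read()
```

A gateway for a database or an HTTP source would implement the same two methods, so the rest of the pipeline would not need to know where the data lives.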

Document parsing

Parsing refers to the extraction of human-readable content and related field information from the data repository. This task uses a parser that matches the MIME type of the data: a third-party parser, such as an HTML or XML parser, or a custom-written one. Parsing is usually as simple as telling the system how to map SQL columns to index fields or how to extract sections of a text document (e.g., e-mail files or news feeds).
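The column-to-field mapping mentioned above might look like the following sketch, which uses Python's built-in sqlite3 module as a stand-in for any SQL source. The table, the column names and the COLUMN_TO_FIELD mapping are invented for the example.

```python
import sqlite3

# Hypothetical mapping from SQL column names to index field names.
COLUMN_TO_FIELD = {"subject": "title", "body": "content", "sender": "author"}


def parse_rows(connection: sqlite3.Connection, table: str) -> list[dict[str, str]]:
    """Read every row of `table` and rename mapped columns to index fields."""
    # The table name is interpolated directly, so it must come from trusted code.
    cursor = connection.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    documents = []
    for row in cursor:
        record = dict(zip(columns, row))
        documents.append(
            {
                field: str(record[column])
                for column, field in COLUMN_TO_FIELD.items()
                if column in record
            }
        )
    return documents


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE mail (subject TEXT, body TEXT, sender TEXT)")
connection.execute(
    "INSERT INTO mail VALUES ('Hello', 'First message', 'ann@example.com')"
)
parsed = parse_rows(connection, "mail")
```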

Indexable object

The output from the parsing stage is an indexable object consisting of content and metadata. Developers specify mappings of the content into indexable fields, if necessary, and define specific rules of relevance most appropriate for the application. Relevance is used to make searches more successful. The developer can further manipulate the object's data prior to sending it to the tokenizer.
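One possible shape for such an object is sketched below. IndexableObject and its per-field weights are assumptions chosen for illustration; the actual object model and relevance rules of the Interseek / API may differ.

```python
from dataclasses import dataclass, field


@dataclass
class IndexableObject:
    """Hypothetical container for parsed content plus metadata."""

    doc_id: str
    fields: dict[str, str] = field(default_factory=dict)
    # Assumed relevance rule: a per-field weight, defaulting to 1.0.
    weights: dict[str, float] = field(default_factory=dict)

    def weight_of(self, name: str) -> float:
        return self.weights.get(name, 1.0)


doc = IndexableObject(
    doc_id="mail-1",
    fields={"title": "Quarterly report", "content": "Revenue grew."},
    weights={"title": 3.0},  # a match in the title counts more than in the body
)

# Manipulate the object before it goes to the tokenizer, e.g. normalize case.
doc.fields = {name: text.lower() for name, text in doc.fields.items()}
```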

Rules-based tokenizer

Once parsed, the content of the fields is cut up into a stream of tokens that are then submitted to the indexer. The Interseek / API gives the developer the option of supplying alphabets and rules that define tokens for words and characters. This feature can enable searches in non-Latin languages and can also greatly improve the success rate of incoming searches. For example, you can specify that punctuation marks and other non-alphanumeric characters are indexed only in certain cases (e.g., in decimal numbers or in proper nouns such as C++ and OS/2), and make other similar definitions.
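The effect of such rules can be sketched with an ordered regular expression in Python. The rule set below is an assumption made for illustration and does not reflect the rule syntax of the Interseek / API.

```python
import re

# Alternatives are tried in order; the first one that matches wins.
TOKEN_RULES = re.compile(
    r"""
    \d+\.\d+            # decimal numbers keep their point, e.g. 3.14
  | [A-Za-z]+\+\+       # names ending in two plus signs, e.g. C++
  | [A-Za-z]+/\d+       # names with a slash and a number, e.g. OS/2
  | \w+                 # ordinary words and integers
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> list[str]:
    """Split text into lower-cased tokens according to TOKEN_RULES."""
    return [match.group(0).lower() for match in TOKEN_RULES.finditer(text)]


tokens = tokenize("C++ runs on OS/2, version 3.14.")
# tokens == ['c++', 'runs', 'on', 'os/2', 'version', '3.14']
```

Punctuation that no rule claims, such as the comma and the final full stop, is simply dropped.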

Indexer

The Indexer has direct access to Index data structures. Different schedules for batch processing or real-time indexer calls can be implemented in the target application. When using live index collections, the indexer operates without a commit cycle or search process suspension.
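The idea of indexing without a commit cycle can be illustrated with a toy in-memory inverted index. This is a sketch under the assumption of a simple token-to-postings map; the real index format is proprietary and not described here.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy in-memory index: token -> {doc_id: [positions]}."""

    def __init__(self) -> None:
        self.postings: dict[str, dict[str, list[int]]] = defaultdict(dict)

    def add(self, doc_id: str, tokens: list[str]) -> None:
        # Postings are updated in place, so the document is searchable as
        # soon as add() returns; there is no separate commit step.
        for position, token in enumerate(tokens):
            self.postings[token].setdefault(doc_id, []).append(position)

    def lookup(self, token: str) -> dict[str, list[int]]:
        # .get() avoids creating empty entries for unknown tokens.
        return self.postings.get(token, {})


index = InvertedIndex()
index.add("d1", ["fast", "search", "fast"])
```

A batch schedule would call add() in a loop over many documents, while a real-time setup would call it once per incoming document.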

Index data structures

The data structures store index information in a complex proprietary format designed for maximum search speeds. Developers can adjust the structure size to conserve storage space by using Write-only/Read-only mode, and can determine whether the index will be stored completely in RAM for faster searches, or whether some of it will reside on disk.

Searcher

The searcher implements special algorithms that do not compromise between functionality (available search techniques) and performance.

Stemmer

Stemming attempts to remove certain surface markings from search words to reveal a root form, which improves the users' search experience. Stemming techniques are language-dependent and must handle searches in which, for example, the query contains the plural form "leaves" and the document contains the singular "leaf". Developers can implement their own stemmer for a preferred language, or use the included implementation of the Porter stemmer or a more precise English morphological stemmer.
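A custom stemmer can be as small as an ordered list of suffix rules. The sketch below is a toy English plural stripper written for illustration; it is not the Porter algorithm and not the stemmer shipped with the Interseek / API, and it will mishandle many irregular words.

```python
# Ordered (suffix, replacement) rules; the first matching rule is applied.
PLURAL_RULES = [
    ("ves", "f"),    # leaves -> leaf
    ("ies", "y"),    # queries -> query
    ("sses", "ss"),  # classes -> class
    ("ss", "ss"),    # glass stays glass
    ("s", ""),       # documents -> document
]


def stem(word: str) -> str:
    """Reduce a word to a rough singular form using PLURAL_RULES."""
    word = word.lower()
    for suffix, replacement in PLURAL_RULES:
        # Require at least two characters of stem so short words are kept.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word
```

With these rules, stem("leaves") and stem("leaf") both yield "leaf", so a query for one form matches documents containing the other.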

Tokenizer

The same tokenization rules that were used at the indexing phase are applied to the query.

Query parser

The Query parser parses the query string and translates it into search commands and parameters.
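The query syntax is not described here, so the sketch below assumes a hypothetical one: a leading "+" marks a required term, a leading "-" marks an excluded term, and double quotes group words into a phrase. ParsedQuery and parse_query are illustrative names only.

```python
import re
from dataclasses import dataclass, field

# Optional sign, followed by either a quoted phrase or a run of non-spaces.
TERM = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


@dataclass
class ParsedQuery:
    required: list[str] = field(default_factory=list)
    excluded: list[str] = field(default_factory=list)
    optional: list[str] = field(default_factory=list)


def parse_query(query: str) -> ParsedQuery:
    """Translate a query string into lists of required/excluded/optional terms."""
    parsed = ParsedQuery()
    for sign, phrase, word in TERM.findall(query):
        term = (phrase or word).lower()
        if not term:
            continue  # e.g. an empty pair of quotes
        if sign == "+":
            parsed.required.append(term)
        elif sign == "-":
            parsed.excluded.append(term)
        else:
            parsed.optional.append(term)
    return parsed


parsed_query = parse_query('+index -draft "data flow" Model')
```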

Search method call with a query string

Available search methods are:

  • Ranked search with proximity option
  • Grouped ranked search with proximity option
  • Ranked QBE (query-by-example) search
  • Grouped ranked QBE search
  • Unranked search
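As one reading of "ranked search with proximity option", the sketch below ranks documents that contain all query terms by total term frequency and optionally keeps only those in which the terms occur close together. The scoring formula, the INDEX contents and the max_distance parameter are assumptions; the actual ranking algorithms are not documented here.

```python
from itertools import product

# token -> {doc_id: positions}; a hand-built toy index for the example.
INDEX = {
    "data": {"d1": [0, 7], "d2": [3]},
    "flow": {"d1": [1], "d2": [9]},
}


def ranked_search(terms, max_distance=None):
    """Return (doc_id, score) pairs for documents containing all terms.

    If max_distance is given, a document is kept only when some choice of
    one occurrence per term spans at most that many positions.
    """
    postings = [INDEX.get(term, {}) for term in terms]
    if not postings:
        return []
    candidates = set(postings[0]).intersection(*postings[1:])
    results = []
    for doc_id in candidates:
        position_lists = [posting[doc_id] for posting in postings]
        if max_distance is not None:
            spans = (max(combo) - min(combo) for combo in product(*position_lists))
            if min(spans) > max_distance:
                continue
        score = sum(len(positions) for positions in position_lists)
        results.append((doc_id, score))
    # Highest score first; ties are broken by document id for stable output.
    return sorted(results, key=lambda item: (-item[1], item[0]))
```

An unranked search would return the candidate set directly, skipping the scoring and sorting steps.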

Search results

Search results are internally represented by a list of logical pointers to the actual files, documents, records, etc. from the data repository. Actual data can then be transparently retrieved by using the data gateway. This universal approach allows users of Interseek / API applications to focus on searching for information, regardless of the formats in which the information is stored.
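A logical pointer of this kind can be sketched as a small record that names a repository and an item within it, resolved on demand through whichever gateway serves that repository. ResultPointer, GATEWAYS and resolve are hypothetical names, and the in-memory "repository" stands in for a real data gateway.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ResultPointer:
    """Hypothetical logical pointer: which repository, and which item in it."""

    repository: str
    item_id: str
    score: float


# Toy repository; a real gateway would read a file, a row, or a remote page.
_MEMORY_STORE = {"doc-1": "Quarterly report", "doc-2": "Meeting notes"}

# One fetch function per repository, standing in for the data gateways.
GATEWAYS: dict[str, Callable[[str], str]] = {
    "memory": _MEMORY_STORE.__getitem__,
}


def resolve(pointer: ResultPointer) -> str:
    """Fetch the actual data that a search result points to."""
    return GATEWAYS[pointer.repository](pointer.item_id)


hit = ResultPointer(repository="memory", item_id="doc-2", score=0.8)
content = resolve(hit)
```

Because the result list holds only pointers, the search layer never needs to know the storage format of the underlying data.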

© Interseek Ltd. | Production: Creatim RP