The API Data Flow Model
Data gateway
Developers must define how they will access their data repository. This may be as simple as opening a file in the local file system or accessing remote data through a standard protocol such as JDBC or HTTP, or it may mean writing a custom set of access methods (gateways) for another repository, such as a news feed application, help desk application, or document management system.
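As an illustration only, a gateway can be thought of as a small interface that hands raw documents to the next stage. The sketch below is not the product's API: the names DataGateway, LocalFileGateway, fetch and demo_gateway are hypothetical, and only the local-file case is shown.

```python
import os
import tempfile
from abc import ABC, abstractmethod
from typing import Iterator, Tuple


class DataGateway(ABC):
    """Hypothetical gateway interface: yields (document id, raw bytes) pairs."""

    @abstractmethod
    def fetch(self) -> Iterator[Tuple[str, bytes]]:
        ...


class LocalFileGateway(DataGateway):
    """Reads every regular file in one directory of the local file system."""

    def __init__(self, directory: str) -> None:
        self.directory = directory

    def fetch(self) -> Iterator[Tuple[str, bytes]]:
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            if os.path.isfile(path):
                with open(path, "rb") as handle:
                    yield name, handle.read()


def demo_gateway() -> list:
    """Write one sample file to a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "note.txt"), "wb") as handle:
            handle.write(b"hello index")
        # Consume the generator while the temporary directory still exists.
        return list(LocalFileGateway(tmp).fetch())
```

A JDBC-style or HTTP gateway would implement the same fetch method, so the later stages would not need to know where the data came from.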
Document parsing
Parsing refers to the extraction of human-readable content and related field information from the data repository. This task uses either a third-party parser that matches the MIME type of the data, such as an HTML or XML parser, or a custom-written parser. Parsing is usually as simple as specifying how to map SQL columns to index fields or how to extract sections of a text document (e.g., e-mail files or news feeds).
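The column-to-field mapping mentioned above can be sketched with Python's built-in sqlite3 module. This is a minimal example under stated assumptions: the table news, its columns, and the mapping COLUMN_TO_FIELD are invented for illustration, and the table name is assumed to come from trusted configuration because it is interpolated into the query.

```python
import sqlite3

# Hypothetical mapping from SQL column names to index field names.
COLUMN_TO_FIELD = {"headline": "title", "body_text": "content", "posted": "date"}


def parse_rows(connection, table):
    """Turn each row of `table` into a dict keyed by index field name."""
    connection.row_factory = sqlite3.Row  # allows access to values by column name
    columns = ", ".join(COLUMN_TO_FIELD)
    cursor = connection.execute(f"SELECT {columns} FROM {table}")
    return [
        {COLUMN_TO_FIELD[column]: row[column] for column in COLUMN_TO_FIELD}
        for row in cursor
    ]


def demo_parse():
    """Build a one-row in-memory table and parse it into field records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE news (headline TEXT, body_text TEXT, posted TEXT)")
    conn.execute(
        "INSERT INTO news VALUES (?, ?, ?)",
        ("Launch", "Product ships today.", "2001-05-01"),
    )
    records = parse_rows(conn, "news")
    conn.close()
    return records
```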
Indexable object
The output from the parsing stage is an indexable object consisting of content and metadata. Developers specify mappings of the content into indexable fields, if necessary, and define the relevance rules most appropriate for the application. These rules influence how matches are ranked, which makes searches more successful. The developer can further manipulate the object's data before sending it to the tokenizer.
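One possible shape for such an object is sketched below. The class IndexableObject, its attributes, and the rule that title matches weigh three times as much as other fields are all assumptions made for this example, not part of the documented API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class IndexableObject:
    """Hypothetical container for parsed content, metadata and relevance weights."""

    doc_id: str
    fields: Dict[str, str]
    metadata: Dict[str, str] = field(default_factory=dict)
    weights: Dict[str, float] = field(default_factory=dict)

    def weight_of(self, name: str) -> float:
        """Relevance weight of a field; fields without a rule default to 1.0."""
        return self.weights.get(name, 1.0)


def build_object(doc_id: str, record: Dict[str, str]) -> IndexableObject:
    """Map a parsed record into indexable fields and attach relevance rules."""
    obj = IndexableObject(
        doc_id=doc_id,
        fields={"title": record["title"], "content": record["content"]},
        metadata={"date": record.get("date", "")},
        weights={"title": 3.0},  # assumed rule: title matches count three times
    )
    # Further manipulation before tokenizing: collapse runs of whitespace.
    obj.fields = {name: " ".join(text.split()) for name, text in obj.fields.items()}
    return obj
```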
Rules-based tokenizer
Once parsed, the content of each field is cut up into a stream of tokens that are then submitted to the indexer. The Interseek / API gives the developer the option of supplying alphabets and rules that define tokens for words and characters. This feature can enable searches in non-Latin languages and can also greatly improve the success rate of incoming searches. For example, you can specify that punctuation marks and other non-alphanumeric characters are indexed only in certain cases (e.g., in decimal numbers or in proper nouns such as C++ and OS/2), and make other similar definitions.
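To make the idea of token rules concrete, here is a small regular-expression sketch. It does not reproduce the product's rule syntax: PROPER_NOUNS, TOKEN_RULES and tokenize are hypothetical names, and the rules are simply tried in order at each position (listed proper nouns, then decimal numbers, then runs of letters and digits in any alphabet).

```python
import re

# Hypothetical exception list: names whose punctuation must be kept.
PROPER_NOUNS = ["C++", "OS/2"]

TOKEN_RULES = re.compile(
    "|".join(
        [
            # Rule 1: listed proper nouns, not followed by a word character.
            "|".join(re.escape(name) + r"(?!\w)" for name in PROPER_NOUNS),
            # Rule 2: decimal numbers keep their decimal point.
            r"\d+\.\d+",
            # Rule 3: runs of letters and digits (Unicode-aware, no underscore).
            r"[^\W_]+",
        ]
    ),
    re.IGNORECASE,
)


def tokenize(text):
    """Return the lower-cased tokens of `text`; all other characters are dropped."""
    return [match.group(0).lower() for match in TOKEN_RULES.finditer(text)]
```

With these rules, the text "C++ and OS/2 cost 3.14, really." yields the tokens c++, and, os/2, cost, 3.14 and really, while the comma and the final full stop are discarded.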
Indexer
The indexer has direct access to the index data structures. Different schedules for batch processing or real-time indexer calls can be implemented in the target application. When using live index collections, the indexer operates without a commit cycle and without suspending the search process.
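The difference between batch and real-time calls can be illustrated with a toy in-memory inverted index. This is only a sketch of the general technique and says nothing about how the product stores its index: LiveIndex, add, search and index_batch are hypothetical names, and the example is single-threaded.

```python
from collections import defaultdict


class LiveIndex:
    """Toy inverted index: additions are searchable immediately, with no commit step."""

    def __init__(self):
        self._postings = defaultdict(set)  # token -> set of document ids

    def add(self, doc_id, tokens):
        """Real-time call: index one document's tokens."""
        for token in tokens:
            self._postings[token].add(doc_id)

    def search(self, token):
        """Return the sorted ids of documents containing `token`."""
        return sorted(self._postings.get(token, ()))


def index_batch(index, batch):
    """Batch schedule: index many (doc_id, tokens) pairs in one call."""
    for doc_id, tokens in batch:
        index.add(doc_id, tokens)
```

Here a document added through add is visible to the very next search call, which mirrors the "no commit cycle" behaviour described above in the simplest possible form.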