Information Retrieval System

Module 1: Introduction


Information Retrieval (IR):

[Figure: IRS cycle]
Information retrieval (IR) is the activity of obtaining information system resources that are relevant to an information need from a collection of those resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for the metadata that describes data, and for databases of texts, images or sounds.


Information Retrieval System:

An IR system is a software system that provides access to books, journals and other documents, and that stores and manages those documents. Web search engines are the most visible IR applications.

Automated information retrieval systems are used to reduce what has been called information overload. 


Information Retrieval Process:


[Figure: IR process]


An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example, search strings in web search engines. In information retrieval, a query does not uniquely identify a single object in the collection. Instead, several objects may match the query, perhaps with different degrees of relevancy.


First of all, before the retrieval process can even be initiated, it is necessary to define the text database. This is usually done by the manager of the database, who specifies the following:

(a) the documents to be used,

(b) the operations to be performed on the text, and

(c) the text model (i.e., the text structure and what elements can be retrieved).

The text operations transform the original documents and generate a logical view of them.


Once the logical view of the documents is defined, the database manager (using the DB Manager Module) builds an index of the text. An index is a critical data structure because it allows fast searching over large volumes of data. Different index structures might be used, but the most popular one is the inverted file. The resources (time and storage space) spent on defining the text database and building the index are amortized by querying the retrieval system many times.
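As a rough illustration of the inverted file mentioned above, the sketch below builds a term-to-documents mapping in Python. The `text_operations` function (lowercasing and splitting into alphanumeric tokens) and the three sample documents are assumptions made for this example only; a real system would apply whatever text operations the database manager has chosen.

```python
import re
from collections import defaultdict

def text_operations(text):
    """Assumed text operations: lowercase the text and keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each term to the sorted list of ids of the documents that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text_operations(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

# Hypothetical sample collection.
documents = {
    1: "Information retrieval systems index text.",
    2: "An index allows fast searching over text.",
    3: "Web search engines are retrieval systems.",
}

index = build_inverted_index(documents)
print(index["index"])      # ids of documents containing "index"
print(index["retrieval"])  # ids of documents containing "retrieval"
```

Building the index costs one pass over the whole collection, but every later lookup is a single dictionary access, which is the amortization argument made in the paragraph above.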


Given that the document database is indexed, the retrieval process can be initiated. The user first specifies a user need which is then parsed and transformed by the same text operations applied to the text. Then, query operations might be applied before the actual query, which provides a system representation for the user need, is generated. The query is then processed to obtain the retrieved documents. Fast query processing is made possible by the index structure previously built.
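The retrieval step can be sketched in the same spirit. The example below assumes a small prebuilt inverted index (`INDEX`, hypothetical) and treats the query as a Boolean AND of its keywords, which is only one of several possible query operations. The point it tries to show is that the user need goes through the same text operations as the documents before the index is consulted.

```python
import re

def text_operations(text):
    """Assumed text operations, identical to those applied to the documents."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical prebuilt inverted index: term -> sorted document ids.
INDEX = {
    "index": [1, 2],
    "retrieval": [1, 3],
    "systems": [1, 3],
    "text": [1, 2],
}

def process_query(user_need, index):
    """Return the ids of documents that contain every query term (Boolean AND)."""
    terms = text_operations(user_need)
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

print(process_query("Retrieval SYSTEMS", INDEX))
print(process_query("text index", INDEX))
```

Because only the posting lists of the query terms are touched, the cost of answering a query does not depend on scanning the documents themselves.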


Definition:

An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video and other multimedia objects. Although the form of an object in an Information Retrieval System is diverse, the text aspect has been the only data type that lent itself to full functional processing. The other data types have been treated as highly informative sources, but are primarily linked for retrieval based upon a search of the text.


An Information Retrieval System consists of a software program that helps a user find the information the user needs. The system may use standard computer hardware or specialized hardware to support the search subfunction. The gauge of the success of an information system is how well it can minimize the overhead for a user to find the needed information. Overhead from a user's perspective is the time required to find the information needed, excluding the time for actually reading the relevant data. Thus search composition, search execution and reading non-relevant items are all aspects of information retrieval overhead.



Objectives of the IRS:

The first objective of an Information Retrieval System is the support of user search generation.

The general objective of an Information Retrieval System is to minimize
the overhead of a user locating needed information. Overhead can be expressed as
the time a user spends in all of the steps leading to reading an item containing the
needed information (e.g., query generation, query execution, scanning results of
query to select items to read, reading non-relevant items).


An Information Retrieval System is composed of four major functional processes:

  • Item Normalization
  • Selective Dissemination of Information (i.e., "Mail")
  • Archival Document Database Search
  • Index Database Search, along with the Automatic File Build process that supports Index Files



Components of an Information System:

  1. Computer Hardware
  2. Computer Software
  3. Databases
  4. Networks
  5. Human Resources


1. Computer Hardware:


Hardware is the physical equipment used for input, output and processing. Which hardware to use depends on the type and size of the organisation. It consists of input and output devices, a processor, and media devices, and it also includes computer peripheral devices.


2. Computer Software:


Software consists of the programs (including application programs) used to control and coordinate the hardware components and to analyse and process data. These programs consist of sets of instructions used for processing information.


Software is further classified into 3 types:


  • System Software
  • Application Software
  • Procedures


3. Databases:


Data are the raw, unorganised facts and figures that are later processed to generate information. Software is used for organising and serving data to the user and for managing the physical storage media and virtual resources. Just as hardware cannot work without software, software needs data to process. Data are managed using a database management system.

Database software is used for efficient access to required data and to manage knowledge bases.


4. Network:


Network resources refer to telecommunication networks such as intranets, extranets and the internet.

These resources facilitate the flow of information in the organisation.

Networks consist of both physical devices, such as network cards, routers, hubs and cables, and software, such as operating systems, web servers, data servers and application servers.

Telecommunications networks consist of computers, communications processors, and other devices interconnected by communications media and controlled by software.

Networks include communication media and network support.


5. Human Resources:


Human resources are the people required to run and manage the system. People are the end users of the information system: they use the information it produces for their own purposes, and the main purpose of the information system is to benefit them. End users can be accountants, engineers, salespersons, customers, clerks, or managers. People are also responsible for developing and operating information systems; these include systems analysts, computer operators, programmers, and other clerical and managerial IS personnel.



Types of Information System:

There are various types of information systems, for example:

  • Transaction Processing Systems
  • Management Information Systems
  • Decision Support Systems
  • Expert Systems


1. Transaction Processing System (TPS):


Transaction Processing Systems are information systems that process data resulting from the occurrence of business transactions.

Their objective is to process transactions in order to update records and generate reports, i.e., to perform a storekeeping function.

Transactions are processed in two ways: batch processing and online transaction processing.

Example: Billing systems, payroll systems, stock control systems.


2. Management Information System (MIS):


A Management Information System is designed to take the relatively raw data available through a Transaction Processing System and convert them into a summarized and aggregated form for the manager, usually in a report format. Its reports tend to be used by middle management and operational supervisors.

Many different types of report are produced by an MIS, including summary reports, on-demand reports, ad-hoc reports and exception reports.

Example: Sales management systems, human resource management systems.


3. Decision Support System (DSS):


A Decision Support System is an interactive information system that provides information, models and data manipulation tools to help make decisions in semi-structured and unstructured situations.

A Decision Support System comprises tools and techniques that help in gathering relevant information and analyzing the options and alternatives. The end user is more involved in creating a DSS than an MIS.

Example: Financial planning systems, bank loan management systems.


4. Expert System:


Expert systems embody expertise in order to aid managers in diagnosing problems or in problem-solving. These systems are based on the principles of artificial intelligence research.

An expert system is a knowledge-based information system. It uses its knowledge about a specific area to act as an expert consultant to users.

A knowledge base and software modules are the components of an expert system. These modules perform inference on the knowledge and offer answers to a user’s question.



Module 3: Query Processing and Operations


Keyword-based Querying

A query is the formulation of a user information need. In its simplest form, a query is composed of keywords and the documents containing such keywords are searched for. Keyword-based queries are popular because they are intuitive, easy to express, and allow for fast ranking. Thus, a query can be (and in many cases is) simply a word, although it can, in general, be a more complex combination of operations involving several words.


Pattern Matching


A pattern is a set of syntactic features that must occur in a text segment. Those segments satisfying the pattern specifications are said to ‘match’ the pattern. We are interested in documents containing segments which match a given search pattern. Each system allows the specification of some types of patterns, which range from very simple (for example, words) to rather complex (such as regular expressions). In general, the more powerful the set of patterns allowed, the more involved the queries that the user can formulate and the more complex the implementation of the search. The most used types of patterns are:


  • Words: A string (sequence of characters) which must be a word in the text. This is the most basic pattern.
  • Prefixes: A string which must form the beginning of a text word. For instance, given the prefix ‘comput’, all the documents containing words such as ‘computer’, ‘computation’, ‘computing’, etc. are retrieved.
  • Suffixes: A string which must form the termination of a text word. For instance, given the suffix ‘ters’, all the documents containing words such as ‘computers’, ‘testers’, ‘painters’, etc. are retrieved.
  • Substrings: A string which can appear within a text word. For instance, given the substring ‘tal’, all the documents containing words such as ‘coastal’, ‘talk’, ‘metallic’, etc. are retrieved. This query can be restricted to find the substrings inside words, or it can go further and search the substring anywhere in the text (in this case the query is not restricted to be a sequence of letters but can contain word separators). For instance, a search for ‘any flow’ will match in the phrase ‘...many flowers...’.
  • Ranges: A pair of strings which matches any word lying between them in lexicographical order. Alphabets are normally sorted, and this induces an order into the strings which is called lexicographical order (this is indeed the order in which words in a dictionary are listed). For instance, the range between words ‘held’ and ‘hold’ will retrieve strings such as ‘hoax’ and ‘hissing’.
  • Allowing errors: A word together with an error threshold. This search pattern retrieves all text words which are ‘similar’ to the given word. The concept of similarity can be defined in many ways.
  • Regular expressions: Some text retrieval systems allow searching for regular expressions. A regular expression is a rather general pattern built up from simple strings and operators such as union, concatenation and repetition.
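The pattern types listed above can be tried out on a small word list. The following Python sketch is only illustrative: the vocabulary is made up, and edit (Levenshtein) distance is assumed as the similarity measure for the "allowing errors" case, although the text notes that similarity can be defined in many ways.

```python
import re

# Hypothetical vocabulary of text words.
VOCABULARY = ["computer", "computers", "coastal", "held", "hissing",
              "hoax", "hold", "metallic", "painters", "talk"]

def match_prefix(words, prefix):
    return [w for w in words if w.startswith(prefix)]

def match_suffix(words, suffix):
    return [w for w in words if w.endswith(suffix)]

def match_substring(words, sub):
    return [w for w in words if sub in w]

def match_range(words, low, high):
    """Words lying between low and high (inclusive) in lexicographical order."""
    return [w for w in words if low <= w <= high]

def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]

def match_with_errors(words, word, max_errors):
    """Assumed similarity: a word matches if its edit distance is within the threshold."""
    return [w for w in words if edit_distance(w, word) <= max_errors]

def match_regex(words, pattern):
    compiled = re.compile(pattern)
    return [w for w in words if compiled.fullmatch(w)]

print(match_prefix(VOCABULARY, "comput"))
print(match_suffix(VOCABULARY, "ters"))
print(match_substring(VOCABULARY, "tal"))
print(match_range(VOCABULARY, "held", "hold"))
print(match_with_errors(VOCABULARY, "hald", 1))
print(match_regex(VOCABULARY, r"h(e|o)ld"))
```

Each function scans the vocabulary linearly, which keeps the sketch short; a production system would typically answer prefix and range queries from a sorted or tree-structured vocabulary instead.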



Structural Queries:

 

Text collections tend to have some structure built into them, and allowing the user to query those texts based on their structure (and not only their content) is becoming attractive. The standardization of languages for representing structured texts, such as HTML, has pushed in this direction.


1. Fixed Structure:


The structure allowed in texts was traditionally quite restrictive. The documents had a fixed set of fields, much like a filled form. Each field had some text inside. Some fields were not present in all documents. Only rarely could the fields appear in any order or repeat across a document. 

This model is reasonable when the text collection has a fixed structure.

For instance, a mail archive could be regarded as a set of mails, where each mail has a sender, a receiver, a date, a subject, and a body field. 
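A minimal sketch of querying such a fixed structure is given below. The archive contents, the field names as dictionary keys, and the `search_field` helper are all hypothetical; the sketch only shows that a query names a field and a keyword, and that each mail is checked in that field alone.

```python
# Hypothetical mail archive: every mail has the same fixed set of fields.
MAIL_ARCHIVE = [
    {"sender": "alice", "receiver": "bob", "date": "1999-03-01",
     "subject": "index structures", "body": "Notes on the inverted file."},
    {"sender": "carol", "receiver": "alice", "date": "1999-03-02",
     "subject": "meeting", "body": "The inverted file talk is on Friday."},
    {"sender": "alice", "receiver": "carol", "date": "1999-03-05",
     "subject": "re: meeting", "body": "See you there."},
]

def search_field(archive, field, keyword):
    """Return the mails whose given field contains the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [mail for mail in archive if keyword in mail.get(field, "").lower()]

by_sender = search_field(MAIL_ARCHIVE, "sender", "alice")
by_body = search_field(MAIL_ARCHIVE, "body", "inverted file")
print([mail["subject"] for mail in by_sender])
print([mail["subject"] for mail in by_body])
```

Note that the second mail is not returned for the sender query even though "alice" appears in its receiver field, which is exactly what restricting a query to a field means.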


2. Hypertext:


Hypertexts probably represent the maximum freedom with respect to structuring power. A hypertext is a directed graph where the nodes hold some text and the links represent connections between nodes or between positions inside the nodes. Hypertexts have received a lot of attention since the explosion of the Web, which is indeed a gigantic hypertext-like database spread across the world.


3. Hierarchical Structure:


An intermediate structuring model which lies between fixed structure and hypertext is the hierarchical structure. This model represents a recursive decomposition of the text and is a natural model for many text collections (e.g., books, articles, legal documents, structured programs, etc.). 
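The recursive decomposition can be pictured as a tree of nested components. In the hedged sketch below, the node layout (`type`, `title`, `text`, `children`) and the sample book are assumptions made for illustration; the query asks for components of a given type whose text contains a keyword and reports the path from the root to each match.

```python
def find_in_hierarchy(node, node_type, keyword, path=()):
    """Recursively collect the paths of nodes of node_type whose text contains keyword."""
    here = path + (node["title"],)
    matches = []
    if node["type"] == node_type and keyword.lower() in node.get("text", "").lower():
        matches.append(" / ".join(here))
    for child in node.get("children", []):
        matches.extend(find_in_hierarchy(child, node_type, keyword, here))
    return matches

# Hypothetical book decomposed into chapters and sections.
BOOK = {
    "type": "book", "title": "IR Notes", "children": [
        {"type": "chapter", "title": "Indexing", "children": [
            {"type": "section", "title": "Inverted Files",
             "text": "An inverted file maps terms to documents."},
            {"type": "section", "title": "Cost",
             "text": "Index construction cost is amortized over many queries."},
        ]},
        {"type": "chapter", "title": "Querying", "children": [
            {"type": "section", "title": "Keywords",
             "text": "Keyword queries look up terms in the inverted file."},
        ]},
    ],
}

for match in find_in_hierarchy(BOOK, "section", "inverted file"):
    print(match)
```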



Query Protocols:


Some query languages are proposed as standards for querying CD-ROMs or as intermediate languages to query library systems. Because they are not intended for human use, we refer to them as protocols rather than languages.


  • Z39.50 is a protocol approved as a standard in 1995 by ANSI and NISO. This protocol is intended to query bibliographical information using a standard interface between the client and the host database manager which is independent of the client user interface and of the query database language at the host.

  • WAIS (Wide Area Information Service) is a suite of protocols that was popular at the beginning of the 1990s before the boom of the Web. The goal of WAIS was to be a network publishing protocol and to be able to query databases through the Internet.

  • CCL (Common Command Language) is a NISO proposal (Z39.58 or ISO 8777) based on Z39.50. It defines 19 commands that can be used interactively. It is more popular in Europe, although very few products use it. It is based on the classical Boolean model.

  • CD-RDx (Compact Disk Read-only Data exchange) uses a client-server architecture and has been implemented on most platforms. The client is generic while the server is designed and provided by the CD-ROM publisher, who includes it with the database on the CD-ROM. It allows fixed-length fields, images, and audio, and is supported by such US national agencies as the CIA, NASA, and GSA.

  • SFQL (Structured Full-text Query Language) is based on SQL and also has a client-server architecture. SFQL has been adopted as a standard by the aerospace community (the Air Transport Association/Aircraft Industry Association). Documents are rows in a relational table and can be tagged using SGML. The language defines the format of the answer, which has a header and a variable-length message area. The language does not define any specific formatting or markup.

For example, a query in SFQL is:

Select abstract from journal.papers where title contains "text search"



User Relevance Feedback:


Relevance feedback is the most popular query reformulation strategy. In a relevance feedback cycle, the user is presented with a list of the retrieved documents and, after examining them, marks those which are relevant. In practice, only the top 10 (or 20) ranked documents need to be examined. The main idea consists of selecting important terms, or expressions, attached to the documents that have been identified as relevant by the user, and of enhancing the importance of these terms in a new query formulation. The expected effect is that the new query will be moved towards the relevant documents and away from the non-relevant ones.


Relevance feedback presents the following main advantages over other query reformulation strategies: 


(a) it shields the user from the details of the query reformulation process, because all the user has to provide is a relevance judgement on documents;

(b) it breaks down the whole searching task into a sequence of small steps which are easier to grasp; and

(c) it provides a controlled process designed to emphasize some terms (relevant ones) and de-emphasize others (non-relevant ones).


User relevance feedback can be used to

(a) expand queries with the vector model, 

(b) reweight query terms with the probabilistic model, and 

(c) reweight query terms with a variant of the probabilistic model.



a)  Expand queries with the vector model


The application of relevance feedback to the vector model considers that the term-weight vectors of the documents identified as relevant (to a given query) have similarities among themselves (i.e., relevant documents resemble each other). Further, it is assumed that non-relevant documents have term-weight vectors which are dissimilar from the ones for the relevant documents. The basic idea is to reformulate the query such that it gets closer to the term-weight vector space of the relevant documents.
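One widely used way to realise this idea is the formulation commonly attributed to Rocchio, in which the new query is the old query plus a weighted average of the relevant document vectors minus a weighted average of the non-relevant ones. The sketch below assumes sparse vectors stored as Python dictionaries; the constants `alpha`, `beta` and `gamma` and the sample vectors are illustrative choices, not values taken from the text.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Return the modified query vector as a dict mapping term -> weight.

    query, and each element of relevant / non_relevant, is a dict term -> weight.
    alpha, beta and gamma are tuning constants (illustrative defaults).
    """
    terms = set(query)
    for doc in relevant + non_relevant:
        terms.update(doc)

    new_query = {}
    for term in terms:
        weight = alpha * query.get(term, 0.0)
        if relevant:
            weight += beta * sum(d.get(term, 0.0) for d in relevant) / len(relevant)
        if non_relevant:
            weight -= gamma * sum(d.get(term, 0.0) for d in non_relevant) / len(non_relevant)
        # Negative weights are commonly clipped to zero; this is a design choice.
        new_query[term] = max(weight, 0.0)
    return new_query

# Hypothetical feedback cycle: two documents marked relevant, one non-relevant.
query = {"retrieval": 1.0}
relevant = [{"retrieval": 1.0, "index": 1.0}, {"retrieval": 1.0, "ranking": 1.0}]
non_relevant = [{"cooking": 1.0}]

new_query = rocchio(query, relevant, non_relevant)
print(sorted(new_query.items()))
```

In this example the terms "index" and "ranking" enter the query because they occur in relevant documents, which is the query expansion effect, while "cooking" ends up with weight zero.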


b) Reweight query terms with the probabilistic model


The probabilistic model dynamically ranks documents similar to a query q according to the probabilistic ranking principle. 

The main advantages of this relevance feedback procedure are that the feedback process is directly related to the derivation of new weights for query terms and that the term reweighting is optimal under the assumptions of term independence and binary document indexing ($w_{i,q} \in \{0, 1\}$ and $w_{i,j} \in \{0, 1\}$). The disadvantages include: (1) document term weights are not taken into account during the feedback loop; (2) weights of terms in the previous query formulations are also disregarded; and (3) no query expansion is used (the same set of index terms in the original query is reweighted over and over again). As a result of these disadvantages, the probabilistic relevance feedback methods do not, in general, operate as effectively as the conventional vector modification methods.

To extend the probabilistic model with query expansion capabilities, different approaches have been proposed in the literature ranging from term weighting for query expansion to term clustering techniques based on spanning trees. All of these approaches treat probabilistic query expansion separately from probabilistic term reweighting. 
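To make the reweighting step concrete, the sketch below estimates a term's weight from feedback counts using the usual binary independence form with a 0.5 adjustment factor. The function name, parameter names and the numbers in the example are assumptions for illustration; the text itself does not spell out the formula, so this should be read as one standard formulation rather than the only one.

```python
import math

def reweight_term(n_relevant_with_term, n_relevant, n_docs_with_term, n_docs):
    """Estimate a query term weight from relevance feedback counts.

    n_relevant_with_term: relevant documents (marked by the user) containing the term
    n_relevant:           relevant documents marked by the user
    n_docs_with_term:     documents in the collection containing the term
    n_docs:               documents in the collection
    The 0.5 and 1 adjustment factors avoid probabilities of exactly 0 or 1.
    """
    p_rel = (n_relevant_with_term + 0.5) / (n_relevant + 1)
    p_non = (n_docs_with_term - n_relevant_with_term + 0.5) / (n_docs - n_relevant + 1)
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_non) / p_non)

# Hypothetical collection of 100 documents, 4 of them judged relevant.
weight_rare = reweight_term(3, 4, 10, 100)    # in 3 of 4 relevant docs, rare overall
weight_common = reweight_term(1, 4, 50, 100)  # in 1 of 4 relevant docs, common overall
print(round(weight_rare, 3), round(weight_common, 3))
```

Only terms already in the query receive new weights here, which reflects disadvantage (3) above: the procedure reweights but does not expand the query.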


c) Reweight query terms with a variant of the probabilistic model


This variant of probabilistic term reweighting is more flexible (and also more powerful) and has the following advantages:

(1) it takes into account the within-document frequencies;

(2) it adopts a normalized version of these frequencies; and

(3) it introduces the constants C and K which provide for greater flexibility.

However, it constitutes a more complex formulation and, as before, it operates solely on the terms originally in the query (without query expansion).



Automatic Local Analysis:


In a user relevance feedback cycle, the user examines the top-ranked documents and separates them into two classes: the relevant ones and the non-relevant ones. This information is then used to select new terms for query expansion.

The reasoning is that the expanded query will retrieve more relevant documents. Thus, there is an underlying notion of clustering supporting the feedback strategy. According to this notion, known relevant documents contain terms which can be used to describe a larger cluster of relevant documents. In this case, the description of this larger cluster of relevant documents is built interactively with assistance from the user.


In a local strategy, the documents retrieved for a given query q are examined at query time to determine terms for query expansion. This is similar to a relevance feedback cycle but might be done without assistance from the user (i.e., the approach might be fully automatic). Two local strategies are local clustering and local context analysis. The first is based on the work done by Attar and Fraenkel in 1977 and establishes many of the fundamental ideas and concepts regarding the usage of clustering for query expansion. The second is based on work done by Xu and Croft in 1996 and illustrates the advantages of combining techniques from both local and global analysis.
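A very small sketch of the local clustering idea follows. It assumes a simple association measure, the sum over the locally retrieved documents of the product of two terms' frequencies, and adds the most strongly associated terms to the query. The whitespace tokenization, the `per_term` parameter and the sample documents are assumptions; published local clustering methods use stems and more refined cluster definitions.

```python
from collections import Counter

def association_expand(query_terms, local_docs, per_term=1):
    """Expand a query with the terms that co-occur most with each query term.

    local_docs holds the texts retrieved for the original query (the local set).
    The association score of terms u and v is the sum over local documents
    of freq(u, doc) * freq(v, doc).
    """
    freqs = [Counter(doc.lower().split()) for doc in local_docs]
    vocabulary = set().union(*freqs)
    expanded = list(query_terms)
    for u in query_terms:
        scores = {
            v: sum(f[u] * f[v] for f in freqs)
            for v in vocabulary
            if v != u and v not in expanded
        }
        # Highest score first; ties broken alphabetically for determinism.
        best = sorted(scores, key=lambda v: (-scores[v], v))[:per_term]
        expanded.extend(v for v in best if scores[v] > 0)
    return expanded

# Hypothetical local document set retrieved for the query "index".
local_docs = [
    "inverted index index compression",
    "index compression methods",
    "query ranking",
]
print(association_expand(["index"], local_docs))
print(association_expand(["index"], local_docs, per_term=2))
```

No relevance judgements are used: the expansion terms come from the retrieved set alone, which is what makes the strategy automatic.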



Automatic Global Analysis:


The methods of local analysis discussed above extract information from the local set of documents retrieved to expand the query. It is well accepted that such a procedure yields improved retrieval performance with various collections. An alternative approach is to expand the query using information from the whole set of documents in the collection. Strategies based on this idea are called global analysis procedures. Until the beginning of the 1990s, global analysis was considered to be a technique which failed to yield consistent improvements in retrieval performance with general collections. This perception has changed with the appearance of modern procedures for global analysis. Two of these modern variants are both based on a thesaurus-like structure built using all the documents in the collection. However, the approach taken for building the thesaurus and the procedure for selecting terms for query expansion are quite distinct in the two cases.



Multimedia IR Query Languages :


1. The SQL-3 Multimedia Language:


From the multimedia point of view, several aspects make SQL3 suitable for use as an interface language for multimedia applications. In particular, the ability to deal with external functions and user-defined data types enables the language to deal with objects with a complex structure, such as multimedia objects. Note that, without this characteristic, the ability to deal with BLOBs would have been useless, since it reduces the view of multimedia data to single large uninterpreted data values, which are not adequate for the rich semantics of multimedia data. By the use of triggers, spatial and temporal constraints can be enforced, thus preserving database consistency. Finally, as SQL3 is a widespread standard, it allows one to model multimedia objects in the framework of a well-understood technology.


The extensible type system and in general the ability to deal with complex objects make SQL3 suitable for modelling multimedia data. From the query language point of view, the major improvements of SQL3 with respect to SQL-92 can be summarized as follows:


*Functions and stored procedures. SQL3 allows the user to integrate external functionalities with data manipulation. This means that the functions of an external library can be introduced into a database system as external functions. Such functions can either be implemented in an external language, in which case SQL3 only specifies which language is used and where the function can be found, or be implemented directly in SQL3 itself. In this way, the impedance mismatch between two different programming languages and type systems is avoided. Of course, this approach requires an extension of SQL with imperative programming language constructs.

*Active database facilities. Another important property of SQL3 is the support of active rules, by which the database is able to react to some system- or user-dependent events by executing specific actions. Active rules, or triggers, are very useful to enforce integrity constraints.
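The two facilities above can be approximated with Python's built-in `sqlite3` module. SQLite is used here only as a convenient stand-in and is not an SQL3 implementation; the `clips` table, the `keyword_count` function and the trigger name are hypothetical. The sketch registers a function written in an external language (Python) so that it can be called from SQL, and uses a trigger to enforce a simple temporal constraint.

```python
import sqlite3

def keyword_count(text, keyword):
    """External function: number of times keyword occurs as a word in text."""
    return text.lower().split().count(keyword.lower())

conn = sqlite3.connect(":memory:")

# Make the Python function callable from SQL (name, number of arguments, callable).
conn.create_function("keyword_count", 2, keyword_count)

conn.execute(
    "CREATE TABLE clips (id INTEGER PRIMARY KEY, caption TEXT, start_s REAL, end_s REAL)"
)

# Trigger enforcing a temporal constraint: a clip must end after it starts.
conn.execute("""
    CREATE TRIGGER clip_interval_check
    BEFORE INSERT ON clips
    WHEN NEW.end_s <= NEW.start_s
    BEGIN
        SELECT RAISE(ABORT, 'clip must end after it starts');
    END
""")

insert = "INSERT INTO clips (caption, start_s, end_s) VALUES (?, ?, ?)"
conn.execute(insert, ("sunset over the sea", 0.0, 12.5))
conn.execute(insert, ("city traffic at sunset sunset", 3.0, 9.0))

rows = conn.execute(
    "SELECT id, keyword_count(caption, 'sunset') AS hits FROM clips ORDER BY hits DESC"
).fetchall()
print(rows)

try:
    conn.execute(insert, ("broken clip", 10.0, 5.0))
    rejected = False
except sqlite3.DatabaseError:
    rejected = True  # the trigger aborted the insert
print(rejected)
```

Any ordering of results by the function's value has to be requested explicitly by the application, as in the `ORDER BY` above; the database engine itself has no notion of relevance ranking.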


Though the above facilities make SQL3 suitable for use as an interface for multimedia applications, there are also some limitations.


The main drawback is related to retrieval support and, as a consequence, optimization. Indeed, no IR techniques are integrated into the SQL3 query processor. This means that the ability to perform content-based search is application dependent. As a consequence, objects are not ranked and are therefore returned to the application as a single unordered set. Moreover, specialized indexing techniques can be used but they are not transparent to the user.


2. The MULTOS Query Language


The development of the MULTOS query language has been driven by a number of requirements: first, it should be possible to easily navigate through the document structure. Path-names can be used for this purpose. Path-names can be total if the path identifies only one component, or partial if several components are identified by the path. Path-names are similar to object-oriented path expressions. Queries both on the content and on document structure must be supported.

Query predicates on complex components must be supported. In this case, the predicate applies to all the document subcomponents that have a type compatible with the type required by the query. This possibility is very useful when a user does not recall the structure of a complex component.

In general, a MULTOS query has the form:


FIND DOCUMENTS VERSION version-clause

SCOPE scope-clause

TYPE  type-clause

WHERE condition-clause

WITH component


where:


The version-clause specifies which versions of the documents should be considered by the query.

 

The scope-clause restricts the query to a particular set of documents. This set of documents is either a user-defined document collection or a set of documents retrieved by a previous query.

The type-clause allows the restriction of a query to documents belonging to a prespecified set of types. The conditions expressed by the condition-clause only apply to the documents belonging to these types and their subtypes. When no type is specified, the query is applied to all document types.

The condition-clause is a Boolean combination of simple conditions (i.e., predicates) on document components. Predicates are expressed on conceptual components of documents. Conceptual components are referenced by path-names. The general form of a predicate is:

component restriction

where component is a path-name and restriction is an operator followed by an expression.

The with-clause allows one to express structural predicates. The component is a path-name and the clause looks for all documents structurally containing such a component.


Different types of conditions can be specified in order to query different types of media. In particular, MULTOS supports three main classes of predicates: predicates on data attributes, on which an exact match search is performed; predicates on textual components, determining all objects containing some specific strings; and predicates on images, specifying conditions on the image content. Image predicates allow one to specify conditions on the class to which an image should belong or conditions on the existence of a specified object within an image and on the number of occurrences of an object within an image.
