TMG_QUERY - Text to Matrix Generator, query vector constructor
    TMG_QUERY parses a query text collection and generates the 
    query vectors corresponding to the supplied dictionary.
    Q = TMG_QUERY(FILENAME, DICTIONARY) returns the query 
    vectors, that corresponds to the text collection contained 
    in files of directory FILENAME. DICTIONARY is the array of 
    terms corresponding to a text collection.
    Each query must be separeted by a blank line (or another 
    delimiter that is defined by OPTIONS argument) in each file. 
    [Q, WORDS_PER_QUERY] = TMG_QUERY(FILENAME, DICTIONARY) 
    returns statistics for each query, i.e. the number of terms 
    for each query. 
    Finally, [Q, WORDS_PER_QUERY, TITLES, FILES] = 
    TMG_QUERY(FILENAME) returns in FILES the filenames contained 
    in directory (or file) FILENAME and a cell array (TITLES) 
    that containes a declaratory title for each query, as well 
    as the query's first line.
 
    TMG_QUERY(FILENAME, DICTIONARY, OPTIONS) defines optional 
    parameters: 
        - OPTIONS.delimiter: The delimiter between queries within 
          the same file. Possible values are 'emptyline' (default), 
          'none_delimiter' (treats each file as a single query) 
          or any other string.
        - OPTIONS.line_delimiter: Defines if the delimiter takes a 
          whole line of text (default, 1) or not.
        - OPTIONS.stoplist: The filename for the stoplist, i.e. a 
          list of common words that we don't use for the indexing 
          (default no stoplist used).
        - OPTIONS.stemming: Indicates if the stemming algorithm is 
          used (1) or not (0 - default).
        - OPTIONS.update_step: The step used for the incremental 
          built of the inverted index (default 10,000).
        - OPTIONS.local_weight: The local term weighting function 
          (default 't'). Possible values (see [1, 2]): 
                't': Term Frequency
                'b': Binary
                'l': Logarithmic
                'a': Alternate Log
                'n': Augmented Normalized Term Frequenct
        - OPTIONS.global_weights: The vector of term global 
          weights (returned by tmg).
        - OPTIONS.dsp: Displays results (default 1) or not (0).
        - OPTIONS.remove_num: Indicates if we remove the numbers from the
          dictionary (value 1) or not (value 0- default).
        - OPTIONS.remove_al: Indicates if we remove the alphanumerics from 
            the dictionary (value 1) or not (value 0- default).
        - OPTIONS.parse_subd: Indicates if we parse all the subdirectories
          without be questioned (value 1), or we are asked which 
          subdirectories to parse (value 0-default). This option is
          recommended for large collections with many subdirectories
          so that they can be run in batch mode. Setting this options we
          are avoiding questions during the parsing.
 
    REFERENCES: 
    [1] M.Berry and M.Browne, Understanding Search Engines, 
    Mathematical Modeling and Text Retrieval, Philadelphia, 
    PA: Society for Industrial and Applied Mathematics, 1999.
    [2] T.Kolda, Limited-Memory Matrix Methods with Applications,
    Tech.Report CS-TR-3806, 1997.
 
  Copyright 2011 Dimitrios Zeimpekis, Eugenia Maria Kontopoulou, 
                 Efstratios Gallopoulos
					
				

Return to main page