TMG - Text to Matrix Generator
TMG parses a text collection and generates the term-document
matrix.
A = TMG(FILENAME) returns the term-document matrix
that corresponds to the text collection contained in
the files of directory (or in the file) FILENAME.
Within each file, documents must be separated by a blank
line (or by another delimiter defined by the OPTIONS
argument).
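The delimiter rule above can be illustrated with a short sketch. This is
Python written only for illustration, not TMG's own MATLAB code; the
function name split_documents is hypothetical, and the handling of a
custom delimiter as a plain substring is an assumption.

```python
import re

def split_documents(text, delimiter="emptyline"):
    """Split the contents of one file into documents (illustrative only)."""
    if delimiter == "none_delimiter":
        docs = [text]                       # the whole file is one document
    elif delimiter == "emptyline":
        docs = re.split(r"\n\s*\n", text)   # one or more blank lines
    else:
        docs = text.split(delimiter)        # assumption: literal string match
    # Drop empty pieces and surrounding whitespace.
    return [d.strip() for d in docs if d.strip()]

sample = "first document\nstill first\n\nsecond document\n"
docs = split_documents(sample)
print(len(docs))
```

With the default setting the sample yields two documents; passing
'none_delimiter' would keep the file as a single document.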
[A, DICTIONARY] = TMG(FILENAME) also returns the
dictionary for the collection, while [A, DICTIONARY,
GLOBAL_WEIGHTS, NORMALIZATION_FACTORS] = TMG(FILENAME)
also returns the vector of global weights for the dictionary
and the normalization factor for each document, in case
such a factor is used. If normalization is not used, TMG
returns a vector of all ones.
[A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS,
WORDS_PER_DOC] = TMG(FILENAME) also returns statistics for
each document, i.e. the number of terms in each document.
[A, DICTIONARY, GLOBAL_WEIGHTS, NORMALIZATION_FACTORS,
WORDS_PER_DOC, TITLES, FILES] = TMG(FILENAME) returns in
FILES the filenames contained in directory (or file)
FILENAME and in TITLES a cell array that contains a
declaratory title for each document, as well as the
document's first line. Finally, [A, DICTIONARY,
GLOBAL_WEIGHTS, NORMALIZATION_FACTORS, WORDS_PER_DOC,
TITLES, FILES, UPDATE_STRUCT] = TMG(FILENAME) returns a
structure that keeps the essential information for
updating (or downdating) the collection.
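To make the first outputs concrete, the following Python sketch builds a
small term-document matrix, a dictionary and per-document term counts.
It is an illustration under stated assumptions, not TMG's implementation:
the function name, the lowercase alphanumeric tokenizer, the dense
list-of-lists matrix and the reading of WORDS_PER_DOC as the number of
indexed term occurrences are all assumptions. The length limits 3 and 30
are the defaults listed for OPTIONS.min_length and OPTIONS.max_length.

```python
import re
from collections import Counter

def term_document_matrix(docs, min_length=3, max_length=30):
    """Return (A, dictionary, words_per_doc) for a list of document strings."""
    counts = []
    for doc in docs:
        # Assumption: terms are runs of lowercase letters and digits.
        tokens = re.findall(r"[a-z0-9]+", doc.lower())
        tokens = [t for t in tokens if min_length <= len(t) <= max_length]
        counts.append(Counter(tokens))
    # One dictionary entry per distinct term, in sorted order.
    dictionary = sorted(set().union(*counts))
    # Rows correspond to terms, columns to documents.
    A = [[c[term] for c in counts] for term in dictionary]
    # Assumption: statistics count indexed term occurrences per document.
    words_per_doc = [sum(c.values()) for c in counts]
    return A, dictionary, words_per_doc

A, dictionary, words_per_doc = term_document_matrix(
    ["the cat sat", "the cat and the dog"])
print(dictionary)
print(A)
print(words_per_doc)
```

Row i of A holds the raw frequency of dictionary[i] in each document, so
the matrix has one row per term and one column per document.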
TMG(FILENAME, OPTIONS) defines optional parameters:
- OPTIONS.use_mysql: Indicates if results are to be
stored in MySQL.
- OPTIONS.db_name: The name of the directory where
the results are to be saved.
- OPTIONS.delimiter: The delimiter between documents
within the same file. Possible values are 'emptyline'
(default), 'none_delimiter' (treats each file as a
single document) or any other string.
- OPTIONS.line_delimiter: Defines if the delimiter
takes a whole line of text (1, default) or not (0).
- OPTIONS.stoplist: The filename for the stoplist,
i.e. a list of common words that are not used for
indexing (default: no stoplist).
- OPTIONS.stemming: Indicates if the stemming algorithm
is used (1) or not (0 - default).
- OPTIONS.update_step: The step used for the incremental
construction of the inverted index (default 10,000).
- OPTIONS.min_length: The minimum length for a term
(default 3).
- OPTIONS.max_length: The maximum length for a term
(default 30).
- OPTIONS.min_local_freq: The minimum local frequency for
a term (default 1).
- OPTIONS.max_local_freq: The maximum local frequency for
a term (default inf).
- OPTIONS.min_global_freq: The minimum global frequency
for a term (default 1).
- OPTIONS.max_global_freq: The maximum global frequency
for a term (default inf).
- OPTIONS.local_weight: The local term weighting function
(default 't'). Possible values (see [1, 2]):
't': Term Frequency
'b': Binary
'l': Logarithmic
'a': Alternate Log
'n': Augmented Normalized Term Frequency
- OPTIONS.global_weight: The global term weighting function
(default 'x'). Possible values (see [1, 2]):
'x': None
'e': Entropy
'f': Inverse Document Frequency (IDF)
'g': GfIdf
'n': Normal
'p': Probabilistic Inverse
- OPTIONS.normalization: Indicates if we normalize the
document vectors (default 'x'). Possible values:
'x': None
'c': Cosine
- OPTIONS.dsp: Displays results (default 1) or not (0) to
the command window.
- OPTIONS.remove_num: Indicates if numbers are removed from the
dictionary (1) or not (0 - default).
- OPTIONS.remove_al: Indicates if alphanumerics are removed from
the dictionary (1) or not (0 - default).
- OPTIONS.parse_subd: Indicates if all subdirectories are parsed
without prompting (1), or if the user is asked which
subdirectories to parse (0 - default). Setting this option to 1
is recommended for large collections with many subdirectories,
so that they can be processed in batch mode without any
questions during parsing.
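The weighting and normalization options can be illustrated with a small
Python sketch covering a subset of the codes ('t', 'b', 'l' local; 'x',
'f' global; 'x', 'c' normalization). It is not TMG's code. The formulas
are the commonly cited textbook definitions (see [1]); the base-2
logarithm, the function name weight_matrix and the choice to report the
normalization factor as the reciprocal of the column norm are
assumptions that the text above does not confirm.

```python
import math

def weight_matrix(F, local="t", glob="x", norm="x"):
    """F holds raw counts, one row per term. Returns (A, g, factors)."""
    n_docs = len(F[0]) if F else 0

    def local_w(f):
        if local == "t":                 # term frequency
            return float(f)
        if local == "b":                 # binary
            return 1.0 if f > 0 else 0.0
        if local == "l":                 # logarithmic (base 2 assumed)
            return math.log2(1 + f)
        raise ValueError("local weight not covered by this sketch")

    g = []                               # one global weight per term
    for row in F:
        if glob == "x":                  # no global weighting
            g.append(1.0)
        elif glob == "f":                # inverse document frequency
            df = sum(1 for f in row if f > 0)
            g.append(math.log2(n_docs / df) if df else 0.0)
        else:
            raise ValueError("global weight not covered by this sketch")

    A = [[g[i] * local_w(f) for f in row] for i, row in enumerate(F)]

    factors = [1.0] * n_docs             # all ones when not normalizing
    if norm == "c":                      # cosine: unit-length columns
        for j in range(n_docs):
            length = math.sqrt(sum(A[i][j] ** 2 for i in range(len(A))))
            if length > 0:
                factors[j] = 1.0 / length
                for i in range(len(A)):
                    A[i][j] *= factors[j]
    return A, g, factors

F = [[1, 0], [1, 2]]
A_cos, g_none, fac = weight_matrix(F, local="t", glob="x", norm="c")
A_idf, g_idf, fac_idf = weight_matrix(F, local="t", glob="f", norm="x")
```

In this sketch each entry is the product of a local weight and a global
weight, and cosine normalization then scales every document column to
unit Euclidean length.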
REFERENCES:
[1] M. Berry and M. Browne, Understanding Search Engines: Mathematical
Modeling and Text Retrieval, Philadelphia, PA: Society for Industrial
and Applied Mathematics, 1999.
[2] T. Kolda, Limited-Memory Matrix Methods with Applications,
Tech. Report CS-TR-3806, 1997.
Copyright 2011 Dimitrios Zeimpekis, Eugenia Maria Kontopoulou,
Efstratios Gallopoulos