VICTOR/VICTORWeb normalization rules
What follows is a comparison of the effects of punctuation, special characters and diacritics on searching in VICTOR vs. VICTORWeb. CARL normalization rules are contrasted with actual normalization in both versions of USM's catalogs. The author wishes to hear of any necessary additions or corrections to these normalization descriptions.
Normalization of:
"A", "An", "The" as initial words in title browse
Stopwords--neither VICTOR nor VICTORWeb observes any stopwords in name searches, except that
VICTOR will not permit searching of Boolean operator words as elements of name headings via name keyword (//n).
Also, VICTOR keyword (//w) searches containing stopwords fail.
CARL normalization rules
as recorded in B500 appendix E (12/97)
In general, the normalization rules that apply to a text string when it is indexed are the same ones which apply to a text string when it is entered by a user in PAC. For instance, the rules for each type of keyword search (Word, Name, and Subject Word) are the same. However, the rules for browsable searches (Name Browse, Subject Browse, Title Browse, Series Browse, and Control Number Browse) vary for each specific search.
Word (//W), Name (//N), and Subject Word (//SW/)
- Diacritics and other non-printable characters are translated to printed equivalents, if available; otherwise they are removed.
- All letters are shifted to uppercase.
- Asterisks are retained if they are embedded in a word; otherwise, they are removed.
M*A*S*H --> M*A*S*H
- Commas are removed if they are embedded in a number; otherwise, they are replaced with spaces.
1,001 --> 1001
A,B,C --> A B C
- Periods are retained if they are embedded in a number; otherwise, they are turned into spaces.
6.0 --> 6.0
A.L.A --> A L A
- Hyphens are replaced with spaces.
stress-induced --> STRESS INDUCED
on-line --> ON LINE
- Exclamation points and slashes are replaced with spaces.
snap!crackle!pop --> SNAP CRACKLE POP
snap/crackle/pop --> SNAP CRACKLE POP
- Any other punctuation is removed.
its' --> ITS
it's --> ITS
C++ --> C
etc.
- The following stopwords are not indexed in the Word or Subject Word indexes:
A AN
AND BY
EDITED FOR
IN OF
ON THE
TO WITH
There are no stopwords in the Name index.
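The keyword rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not CARL's implementation: the function names are hypothetical, Unicode NFKD folding stands in for CARL's (unspecified) table of "printed equivalents", "embedded" is read as having a letter or digit on both sides, and the collapsing of extra spaces is an assumption the keyword rules do not state.

```python
import re
import unicodedata

# Stopwords listed above for the Word and Subject Word indexes.
STOPWORDS = {"A", "AN", "AND", "BY", "EDITED", "FOR",
             "IN", "OF", "ON", "THE", "TO", "WITH"}


def fold_diacritics(text):
    """Approximate 'printed equivalents': decompose, then drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def normalize_keyword(text):
    """Hypothetical sketch of the Word / Name / Subject Word rules."""
    s = fold_diacritics(text).upper()
    # Asterisks survive only between two letters or digits.
    s = re.sub(r"(?<![A-Z0-9])\*|\*(?![A-Z0-9])", "", s)
    # Commas inside a number are removed; all others become spaces.
    s = re.sub(r"(?<=\d),(?=\d)", "", s).replace(",", " ")
    # Periods inside a number are kept; all others become spaces.
    s = re.sub(r"(?<!\d)\.|\.(?!\d)", " ", s)
    # Hyphens, exclamation points and slashes become spaces.
    s = re.sub(r"[-!/]", " ", s)
    # Any other punctuation is removed outright.
    s = re.sub(r"[^A-Z0-9\s*.]", "", s)
    # Assumption: runs of spaces are collapsed.
    return " ".join(s.split())


def index_terms(text, index="word"):
    """Split into terms; the Name index keeps stopwords, the others drop them."""
    terms = normalize_keyword(text).split()
    if index == "name":
        return terms
    return [t for t in terms if t not in STOPWORDS]


print(normalize_keyword("stress-induced"))   # STRESS INDUCED
print(normalize_keyword("C++"))              # C
print(index_terms("The cat in the hat"))     # ['CAT', 'HAT']
```

Because the same function would be applied both at indexing time and to the user's search string, a search for "on-line" and one for "on line" produce the same terms.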
Name Browse (//NB/)
- Diacritics and other non-printable characters are translated to printed equivalents, if available; otherwise, they are removed.
- The first character is changed to uppercase while the remaining characters are changed to lowercase.
- The first comma is retained while the others are replaced with spaces.
Hemingway, Ernest --> Hemingway, ernest
- Hyphens are retained if followed by a number; otherwise, they are replaced with spaces. Spaces around hyphens are removed.
1899-1961 --> 1899-1961
1899- --> 1899
Masters-Johnson --> Masters johnson
1899 - 1961 --> 1899-1961
- Apostrophes are retained.
O'Neill --> O'neill
- Any other punctuation is replaced with spaces.
U.S. --> U s
Masters/Johnson --> Masters johnson
- Leading and trailing spaces are removed. Multiple spaces are collapsed into one space.
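As a rough illustration, the browse rules just listed (first comma kept, hyphens kept only before a digit, apostrophes kept, other punctuation blanked) might be coded as below. The function names are hypothetical, and NFKD folding is an assumed stand-in for CARL's own diacritic translation, which the manual does not spell out.

```python
import re
import unicodedata


def fold_diacritics(text):
    """Approximate 'printed equivalents': decompose, then drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def normalize_name_browse(text):
    """Hypothetical sketch of the name browse rules listed above."""
    s = fold_diacritics(text)
    # Spaces around hyphens are removed first ...
    s = re.sub(r"\s*-\s*", "-", s)
    # ... then a hyphen is kept only when a digit follows it.
    s = re.sub(r"-(?!\d)", " ", s)
    # The first comma is retained; later commas become spaces.
    head, sep, tail = s.partition(",")
    s = head + sep + tail.replace(",", " ")
    # Apostrophes stay; any other punctuation becomes a space.
    s = re.sub(r"[^A-Za-z0-9\s,'-]", " ", s)
    # Trim and collapse spaces, then uppercase the first character only.
    return " ".join(s.split()).capitalize()


print(normalize_name_browse("Hemingway, Ernest"))  # Hemingway, ernest
print(normalize_name_browse("1899 - 1961"))        # 1899-1961
print(normalize_name_browse("U.S."))               # U s
```

Capitalization is applied last so that the "first character" is the first one remaining after punctuation and leading spaces have been dealt with.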
Subject Browse
- Diacritics and other non-printable characters are translated to printed equivalents, if available; otherwise, they are removed.
- The first character is changed to uppercase while the remaining characters are changed to lowercase.
- Punctuation is retained/blanked/removed as described in the section about Word (//W). Exception: for tags 600, 610, and 611, punctuation is treated as described in the section about Name Browse (//NB/).
- Leading and trailing spaces are removed. Multiple spaces are collapsed into one space.
Title Browse
- Characters represented by the nonfiling indicator for tags 130, 240, 245, 730, and 740 are removed from the beginning of the title.
_4^aThe moon by night --> Moon by night
- Diacritics and other non-printable characters are translated to printed equivalents, if available; otherwise, they are removed.
- The first character is changed to uppercase while the remaining characters are changed to lowercase.
- Asterisks are removed if they occur at the beginning of the title; otherwise, they are retained.
- Commas are removed if they are embedded in a number; otherwise, they are replaced with spaces.
1,001 dalmatians --> 1001 dalmatians
Snap, crackle, pop --> Snap crackle pop
- Periods are retained if they are embedded in a number; otherwise, they are replaced with spaces.
DOS 6.0 --> Dos 6.0
A.L.A. directory --> A l a directory
- Hyphens are retained if they are embedded in a number; otherwise, they are replaced with spaces.
1994-95 catalog --> 1994-95 catalog
On-line searching --> On line searching
- Apostrophes are retained if they are embedded in a word; otherwise, they are removed.
It's alive --> It's alive
Its' annual report --> Its annual report
'Tis magic --> Tis magic
- Slashes (/), colons (:), quotes ("), semicolons (;), question marks (?), left and right brackets ([]), underscores (_), grave accents (`), left and right braces ({}), vertical lines (|), and tildes (~) are replaced with spaces. Other punctuation is retained.
- Leading and trailing spaces are removed. Multiple spaces are collapsed into one space.
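The title rules above can be sketched in the same spirit. Everything here is an assumption made for illustration: the function names are hypothetical, the nonfiling count is passed in as a plain integer rather than parsed from a MARC record, "embedded" is read as having a digit (or letter or digit, for apostrophes) on both sides, and NFKD folding approximates the diacritic translation.

```python
import re
import unicodedata


def fold_diacritics(text):
    """Approximate 'printed equivalents': decompose, then drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def normalize_title_browse(title, nonfiling=0):
    """Hypothetical sketch of the title rules listed above."""
    # Drop the characters counted by the nonfiling indicator.
    s = fold_diacritics(title[nonfiling:])
    # Asterisks at the very beginning of the title are removed.
    s = re.sub(r"^\s*\*+", "", s)
    # Commas inside a number are removed; all others become spaces.
    s = re.sub(r"(?<=\d),(?=\d)", "", s).replace(",", " ")
    # Periods and hyphens are kept only between two digits.
    s = re.sub(r"(?<!\d)\.|\.(?!\d)", " ", s)
    s = re.sub(r"(?<!\d)-|-(?!\d)", " ", s)
    # Apostrophes are kept only inside a word; otherwise removed.
    s = re.sub(r"(?<![A-Za-z0-9])'|'(?![A-Za-z0-9])", "", s)
    # The listed punctuation marks become spaces; the rest is retained.
    s = re.sub(r'[/:";?\[\]_`{}|~]', " ", s)
    # Trim and collapse spaces, then uppercase the first character only.
    return " ".join(s.split()).capitalize()


print(normalize_title_browse("The moon by night", nonfiling=4))  # Moon by night
print(normalize_title_browse("A.L.A. directory"))                # A l a directory
print(normalize_title_browse("'Tis magic"))                      # Tis magic
```

Unlike the keyword rules, this keeps apostrophes and most other punctuation, so "It's alive" and "Its alive" would file as different browse strings.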
© 1999 University of Maryland Libraries
Last Revised: May 23, 1999