help.search.old

http://www.ccl.net/cca/instructions/help.search.old.shtml
------------- HOW TO SEARCH COMPUTATIONAL CHEMISTRY ARCHIVES -------------

This file can be obtained from anonymous ftp at www.ccl.net as
 pub/chemistry/instructions/help.search
or via e-mail by sending a message:
 HELP SEARCH
to MAILSERV@ccl.net

Computational chemistry archives can be searched by sending a search
query to the address: chemistry-search@ccl.net. The list of files which
satisfy the query will then be sent back via e-mail to the originator
of the search request. This is an experimental service and
may be improved (or discontinued) in the future, so please send your
comments and ideas to the author: jkl@ccl.net.

The following document describes the format of the search query. The format
is not simple, but it allows for precise and elaborate search queries.
If you have suggestions on how to make it simpler without losing its
generality, please tell me. I will appreciate your comments.
If you do not have time to read these instructions, do not bother to send
your search query, unless you want to check the "garbage in, garbage out"
paradigm. Reading through these instructions is not a complete waste of
time, though. You will learn about regular expressions, a concept which is
very popular in UNIX and elsewhere.

The search query specifies:
 1. How many text patterns to look for,
 2. The text patterns to look for, the so-called "regular expressions",
 3. The logical relation between patterns which needs to be satisfied for
    a file to be included in the search results.

Ad. 1) You may want to look for several pieces of text and then choose
       only those files which contain all of them, e.g.,
       files which mention MOPAC and CHARGE, or MM2 and PARAMETERS.
       However, you have the freedom to choose a more complicated relation
       between regular expressions.
Ad. 2) If the regular expression is satisfied (i.e., there is a match for
       it in the text of the file) it assumes value TRUE. If there is
       no match for it in the text, its value is FALSE.
Ad. 3) The logical relation tells when the file name should be
       reported. It represents a relation between values of regular
       expressions after they were matched with the contents of the file.
       E.g., if you look for words MOPAC (1) and CHARGE (2), you may have
       the following relations:
         a) Report files where both /MOPAC/ AND /CHARGE/ appear. [1 AND 2]
         b) Report files where /MOPAC/ OR /CHARGE/ appear (one of them
            is enough). [1 OR 2].
         c) Report files where word /MOPAC/ appears but not the word /CHARGE/
            (all files with MOPAC where CHARGE is not mentioned). [1 AND Not 2]
         d) Report files where word /MOPAC/ is not mentioned but word CHARGE is
            (i.e., all files where CHARGE is mentioned, but not in the context
            of MOPAC). [Not 1 AND 2].
         e) Report files where /MOPAC/ appears or /CHARGE/ does not appear.
            [1 OR Not 2].
         f) Report files where /MOPAC/ does not appear or /CHARGE/ appears.
            [Not 1 OR 2].
         g) Report files where /MOPAC/ is not mentioned and /CHARGE/ is not
            mentioned (neither of two words is mentioned). [Not 1 AND Not 2].
         h) Report files where either /MOPAC/ or /CHARGE/ is not mentioned
            (or both are not mentioned).  [Not 1 OR Not 2].
       Two words and so many possibilities. Moreover, some of these relations
       can be written in a different way. E.g., in relation g) the expression
       [Not 1 AND Not 2] is equivalent to [Not(1 OR 2)], i.e., files
       which contain either word should be rejected. Likewise, expression h)
       [Not 1 OR Not 2] can be written as [Not(1 AND 2)], i.e., reject files
       which contain both words. So please think about your logical relation
       before you write it down, though you will soon find out that logical
       thinking is not that easy...
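
The eight relations and the two equivalences above can be checked mechanically.
The following is only a sketch in Python (not the language of the search
program itself); the function name relations is made up for this illustration.

```python
from itertools import product

def relations(one: bool, two: bool) -> dict:
    """Truth values of relations a) through h) for a single file.

    `one` means "pattern 1 (MOPAC) was found", `two` means "pattern 2
    (CHARGE) was found". The function name is hypothetical.
    """
    return {
        "a": one and two,
        "b": one or two,
        "c": one and not two,
        "d": (not one) and two,
        "e": one or not two,
        "f": (not one) or two,
        "g": (not one) and (not two),
        "h": (not one) or (not two),
    }

# Check the two rewritten forms for every combination of FOUND / NOT_FOUND.
for one, two in product([False, True], repeat=2):
    r = relations(one, two)
    assert r["g"] == (not (one or two))    # [Not 1 AND Not 2] == [Not(1 OR 2)]
    assert r["h"] == (not (one and two))   # [Not 1 OR Not 2] == [Not(1 AND 2)]
```

For a file which mentions MOPAC but not CHARGE, relations(True, False) reports
relations b), c), e) and h) as satisfied.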

1. REGULAR EXPRESSIONS
======================
In its simplest form, a regular expression is just a word which you want to
find in the file name and/or the text of the file itself. This word has to be
surrounded by a unique character, called a delimiter. This character must not
appear inside the regular expression. For example: /Gaussian/, bGaussianb,
.Gaussian., all denote the same regular expression which matches the word
Gaussian. Since leading and trailing spaces are removed from each line during
processing, the delimiter is the only way to distinguish between significant
and insignificant spaces.

Some characters (so called "metacharacters") and groups of characters have
a special meaning within the regular expression and if their original meaning
is needed, they have to be "quoted" by preceding them with a backslash
character "\". On the other hand, quoting some ordinary letters may
attach a special meaning to them, so use the backslash judiciously.
The list below corresponds to the Perl convention for regular expressions.
It is slightly different from the one used in UNIX regular expressions.
Remember also that:
  a) the file appears to the searching program as a single, long line of
     text where words are separated by single spaces (new lines are replaced
     with spaces, multiple spaces, tabs and other white space is contracted
     into a single space, hyphenated words at the end of the line are joined).
     There are two exceptions to this rule: 1) file names, if searched, are
     treated as separate pieces of text, 2) when searching files containing
     archived messages posted to the list at a given date, each message is
     searched separately, as if it was a separate file. If the logical
     expression is satisfied for the message, then the name of the file
     is reported. Please note that only text files are scanned for text
     (the contents of binary files are not). However, all file names (for
     binary as well as text files) are scanned, if requested.
  b) The search is case insensitive, i.e., searching for the words: charge,
     Charge, cHaRgE, etc., will produce the same result.
  c) The description below includes full metacharacter definitions; however,
     some characters will never be found in the text, since, e.g., there are
     no new lines and tabs. Also, due to case insensitivity, A-Z is
     the same as a-z, Mopac, MOPAC, mopac are equivalent, etc.
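
The preprocessing described in point a) can be pictured with the following
Python sketch. It is an assumption about how such a step might look, not the
actual code of the search program; the names normalize and found are invented
for this illustration.

```python
import re

def normalize(raw: str) -> str:
    """Turn the text of a file into one long line (hypothetical helper)."""
    # Join words hyphenated at the end of a line: "contrac-\nted" -> "contracted".
    text = re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", raw)
    # New lines, tabs and runs of white space all become a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

def found(pattern: str, raw: str) -> bool:
    """Case-insensitive match against the normalized text."""
    return re.search(pattern, normalize(raw), re.IGNORECASE) is not None

sample = "basis\n  set\tof contrac-\nted  gaussians"
print(normalize(sample))
```

One consequence of this kind of joining is that a genuine hyphen which happens
to fall at the end of a line is lost as well, so it pays to allow for both
spellings in the regular expression.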
     
Constructs in regular expressions:

     .     --- Period matches any character (except a new line, however there
               are no new lines here).
     \     --- Backslash character. It is used for quoting special characters.
               Some characters have a special meaning (e.g., the period above).
               If you want their original meaning (i.e., that . matches the
               dot, i.e., a decimal point or a period at the end of a phrase)
               you need to quote them with a backslash. In this case, quote
               the dot, i.e., use: \. to protect its native meaning.
               Since backslash is not present on some keyboards and is
               sometimes "swallowed" by network gateways, you can substitute
               it with a backtick ` (grave accent) or ~ (tilde).
     ~     --- tilde, is a substitute for backslash, i.e., ~b and \b are
               equivalent. You cannot search for this character or use it
               as a delimiter of the regular expression.
     `     --- backtick, grave accent is a substitute for backslash, i.e.,
               the `S+ and \S+ are equivalent. You cannot search for this
               character or use it as a delimiter of the regular expression.
    [ ]    --- Any character within the square brackets matches, e.g. [a,b]
               matches a, comma, b. Ranges are also allowed: [0-9a-z] will
               match any digit or letter. Note that if you are searching for
               "-", you must put it just before the right ], or it will be
               treated as a range. Negation within square brackets is achieved
               with a caret character (circumflex accent) immediately following
               the left bracket "[^", e.g., [^a-z] will match everything but
               letters, [^_] will match everything but underscore. Note that
               within brackets, the characters: .?*+|()$^{} should not be quoted,
               while the characters [ or \ or ] should be entered as \[ or \\
               or \] in case you need their original meaning. Characters:
               ?.*+|()$^{}[] outside square brackets should be quoted by
               backslash if their original meaning is needed (but you would
               rarely need their original meaning, unless you are searching
               for a particular line in the C-program or a UNIX script).
    \d     --- Matches any digit (i.e., is a shorthand for [0-9]).
    \D     --- Matches everything but digit (i.e., is a shorthand for [^0-9]).
    \w     --- Matches "word" characters, i.e., letters, digits and underscore
               (same as [a-zA-Z0-9_]).
    \W     --- Matches "nonword" character (same as [^a-zA-Z0-9_]).
    \s     --- Matches a white-space (i.e., space, new_line, tab, etc.).
    \S     --- Matches a non-white-space character (i.e., characters which
               use pigment in your printer).
    \xxx   --- Matches the character with ASCII octal code xxx (xxx are digits).
    x?     --- matches 0 or 1 occurrences of character x (or any other, i.e.,
               [a-z]? matches 0 or 1 letter).
    x*     --- matches 0 or more occurrences of character x.
    x+     --- matches 1 or more occurrences of character x.
    x{m,n} --- matches at least m, but no more than n occurrences of x, e.g.,
               [0-9.+-]{2,3} will match: 1.2, .2, +11, -1, 123, +-.
    |      --- alternative: \son\s|\sin\s|\sup\s|\sat\s will match words:
               on, in, up, and at.
    \b     --- matches a word boundary (outside [] only). It corresponds to
               white space, punctuation marks and the very beginning and
               end of the text.
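
A few of the constructs above can be tried out with Python's re module, whose
syntax agrees with the Perl convention for everything used here. This is only
an illustration; the search service itself is not written in Python, and the
sample strings are made up.

```python
import re

# x{m,n}: two or three characters taken from digits, ".", "+" and "-".
QUANT = re.compile(r"[0-9.+-]{2,3}")
samples = ["1.2", ".2", "+11", "-1", "123", "+-"]
all_match = all(QUANT.fullmatch(s) is not None for s in samples)

# |: alternatives, each one a short word surrounded by white space.
ALT = re.compile(r"\son\s|\sin\s|\sup\s|\sat\s", re.IGNORECASE)
hits = [m.group().strip() for m in ALT.finditer(" look up the value in table 2 ")]

# . versus \. : the unquoted period matches any character.
loose = re.search(r"3.1", "Amber3x1") is not None    # True
strict = re.search(r"3\.1", "Amber3x1") is not None  # False

print(all_match, hits, loose, strict)
```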

The following constructs are valid in the Perl language, but SHOULD NOT BE
USED HERE, since they may confuse the internal workings of the search.
    ^      --- outside [] marks the beginning of the string. DO NOT USE HERE.
    $      --- outside [] marks the end of the string. DO NOT USE HERE.
    \n     --- Matches new line. DO NOT USE HERE.
    \r     --- Matches carriage return. DO NOT USE HERE.
    \t     --- Matches a tab. DO NOT USE HERE.
    \f     --- Matches a formfeed. DO NOT USE HERE.
    \b     --- Matches a backspace inside []. DO NOT USE HERE.
    ()     --- () are special characters to quote substrings for substitution.
               We do not do substitutions here. If you search for parentheses,
               use \( and \) outside square brackets.
    \1, \2 ... \9  --- used only in substitution strings. DO NOT USE HERE.
    \B     --- matches a non-word boundary. DO NOT USE HERE.

Since some (few, to be truthful) computers and gateways "do not like" the
backslash \ character, I provided the alternatives ` (backtick, grave accent)
or ~ (tilde). For reasons which I do not want to go into, you must not
use the characters : or ! inside your regular expressions.

Note that in UNIX, parentheses and braces are quoted to get their special
meaning, while here, they need to be quoted to get their ordinary meaning.

Regular expressions bracketed with a delimiter character can be optionally
preceded with a label and a search scope identifier, followed by a colon ":".
The label is an integer number and the scope is one of the letters:
   T - text only (file names will not be matched to a regular expression),
   F - file name only (text inside the file will not be scanned),
   B - both file name and text will be scanned for matching (default).
Both the label and the scope can be omitted, but if either exists, the colon
must be present. For the purpose of the search, all the expressions below
are identical:
   1B : /[MA][MO]PAC/
   2:   #ampac|mopac#
   :    +MOPAC|AMPAC+
   B4:  *Ampac|Mopac*
        -[am][mo]pac-
Note that the numerical value of the integer label is disregarded by the
program, which assigns numbers to regular expressions based on the order in
which they were specified. The label is there only for your convenience.
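
The rules for the optional label, scope and delimiter can be summarized in a
short Python sketch. The function parse_expression is hypothetical and only
mirrors the description above; it relies on the rule that ":" never appears
inside a regular expression.

```python
def parse_expression(line: str):
    """Split one query line into (scope, pattern); the label is ignored."""
    scope = "B"                      # default: both file name and text
    body = line.strip()
    if ":" in body:
        prefix, body = body.split(":", 1)
        # Drop the integer label and any spaces; what is left is the scope.
        letter = prefix.strip().upper().strip("0123456789 ")
        if letter not in ("", "T", "F", "B"):
            raise ValueError(f"unknown scope in {line!r}")
        scope = letter or "B"
        body = body.strip()
    if len(body) < 3 or body[0] != body[-1]:
        raise ValueError(f"expression in {line!r} is not enclosed in a delimiter")
    delimiter, pattern = body[0], body[1:-1]
    if delimiter in pattern:
        raise ValueError(f"delimiter {delimiter!r} appears inside the expression")
    return scope, pattern

for line in ["1B : /[MA][MO]PAC/", "2:   #ampac|mopac#", ":    +MOPAC|AMPAC+",
             "B4:  *Ampac|Mopac*", "     -[am][mo]pac-"]:
    print(parse_expression(line))
```

All five lines come back with scope B; only the spelling of the pattern differs.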



2. LOGICAL RELATION
===================
Once you have specified your regular expressions, you need to specify a logical
relation between them, i.e., under which conditions the file should be
included in the search report. The logical relation consists of numbers
referring to regular expressions and logical operators. It is evaluated
after the regular expressions have been matched to the file's text and/or name.
The expression numbers refer to the order in which regular expressions
were specified. The operators are:
  &  AND   (you can also use && if you are a UNIX or C fan)
  |  OR    (you can also use || if you are a UNIX or C fan)
  !  NOT
The relation may also contain parentheses. For example:
   1 OR 2 is equivalent to 1 | 2 is equivalent to 1 || 2
while
   1 AND NOT 2 is the same as 1 & NOT 2 is the same as 1&!2 is equivalent to
   1 AND (NOT 2) is equivalent to (1and(nOT2))
Spaces are optional and the operators AND, OR, NOT can be written in any
lettercase. Each number represents the status (FOUND or NOT_FOUND) of
matching the corresponding regular expression with the text/name of the file.
Remember also that in logic the & takes precedence over |, and the ! is
a unary operator and takes precedence over both of them, but use
parentheses for better readability.
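
The precedence rules can be made concrete with a small evaluator. The Python
sketch below is an assumption about one way to evaluate such a relation (a
recursive descent parser); the names tokenize and evaluate are invented here
and nothing is implied about how the search program does it.

```python
import re

TOKEN = re.compile(r"\s*(\d+|&&?|\|\|?|!|\(|\)|and|or|not)", re.IGNORECASE)
SYNONYMS = {"&&": "&", "and": "&", "||": "|", "or": "|", "not": "!"}

def tokenize(relation: str):
    tokens, pos = [], 0
    relation = relation.rstrip()
    while pos < len(relation):
        m = TOKEN.match(relation, pos)
        if m is None:
            raise ValueError(f"cannot read {relation[pos:]!r}")
        tok = m.group(1).lower()
        tokens.append(SYNONYMS.get(tok, tok))
        pos = m.end()
    return tokens

def evaluate(relation: str, found: dict) -> bool:
    """`found` maps the expression number to True (FOUND) or False."""
    tokens, pos = tokenize(relation), 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def parse_or():                  # lowest precedence
        value = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "&":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():                 # highest precedence, unary
        tok = take()
        if tok == "!":
            return not parse_not()
        if tok == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if tok is None or not tok.isdigit():
            raise ValueError(f"expected an expression number, got {tok!r}")
        return found[int(tok)]

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected text at the end of the relation")
    return result
```

With found = {1: True, 2: False}, every spelling of "1 AND NOT 2" listed above
evaluates to True.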


3. QUERY FORMAT AND EXAMPLES
============================
The complete query has the following format: 
    Number of regular expressions (N)
    regular expression 1
    regular expression 2
       ....  
    regular expression N
    logical relation

The query should be sent to chemistry-search@ccl.net, and when the search
is finished, the resulting list of files satisfying your query will be
sent to you automatically. The list will also include the matched pieces
of text.  Do not be impatient and do not send your next query before the
previous results arrive, since your request will be denied. Please remember
that this is a flat file search and it is very demanding as far as I/O and
CPU are concerned. Therefore only one query will be running at any given
moment. Before you send a query, try to analyze it, and make it specific.
You do not want to get a listing of the whole archive.
Queries which search only file names are much faster than queries
which scan the whole file.

If you look only for a single expression, you can use the abbreviated
one-line format which consists of a single line containing a regular
expression.
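
The two query layouts (the full format and the abbreviated one-line format) can
be told apart as in the Python sketch below. The function split_query is
hypothetical; it assumes that blank lines are ignored and that a one-line
query is reported whenever its single expression matches.

```python
def split_query(message: str):
    """Return (expression_lines, logical_relation) for a query message."""
    lines = [ln.strip() for ln in message.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty query")
    if len(lines) == 1:
        # Abbreviated format: a single regular expression, relation "1".
        return lines, "1"
    count = int(lines[0])            # number of regular expressions (N)
    if len(lines) != count + 2:
        raise ValueError(f"expected {count} expressions and one relation")
    return lines[1:count + 1], lines[-1]

print(split_query("2\n/basis/\n/set/\n1 & 2\n"))
print(split_query("/amber/"))
```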

Now, a few examples. Please analyse them before attempting your search.


Example 1.
----------
Assume that you want to find files which mention AMBER. You could do
it by saying:
    1
    /amber/
    1
or, since you have only one regular expression, you can say
  /amber/
The lettercase does not matter, so you can say also: /AMBER/, /AmbeR/, /Amber/.
However, note that in principle your query could also find the words:
camber, chamber, chamberlain, clamber or lambert, and this is not what you
want. To be more specific, you need to request the word "amber" itself,
not any amber-containing word. You could do it as:
    / amber /
i.e., putting spaces around it. However, what if somebody said:
  For these calculations I used AMBER.
You would not find it, since "." is not a space. It is therefore best to use
\b, i.e., the "word" boundary character:
    /\bamber\b/
This is not without problems, however. What if somebody said:
       Amber3.0 is slower than Amber3.1.
You would not find it, because digits and underscores are considered parts of
the word (by programmers, at least). I think that in this case, the best
compromise is to look for
    /\bamber[^a-z]/
i.e., for the word "amber" which is not followed by a letter. I hope that you
now realize how important it is to analyze all possible combinations which will
match your regular expression. You do not want to get too many unrelated files,
but you want to be sure that you included all the files which relate to the
topic of your search.
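
The four variants discussed in this example can be compared side by side. The
Python sketch below uses the re module as a stand-in for the Perl-style
matching of the search program; the sample sentences and the dictionary keys
are made up for the comparison.

```python
import re

PATTERNS = {
    "plain": r"amber",
    "spaces": r" amber ",
    "boundary": r"\bamber\b",
    "non_letter": r"\bamber[^a-z]",
}
SAMPLES = [
    "a lot of chamber music",
    "For these calculations I used AMBER.",
    "Amber3.0 is slower than Amber3.1.",
]

def matches(name: str, text: str) -> bool:
    # The search is case insensitive, hence re.IGNORECASE.
    return re.search(PATTERNS[name], text, re.IGNORECASE) is not None

table = {name: [matches(name, s) for s in SAMPLES] for name in PATTERNS}
for name, row in table.items():
    print(f"{name:10s}", row)
```

Only the last pattern rejects "chamber" while still accepting both "AMBER."
and "Amber3.0".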


Example 2.
----------
Assume that you want to search for the information on MM2, MM3 or MM2P or MMP2.
You can search the archives by giving the following query:
   4
   /MM2/
   /MM3/
   /MM2P/
   /MMP2/
   1 | 2 | 3 | 4
You can also write it differently:
   1
   /MM2|MM3|MM2P|MMP2/
   1
or even as:
   /MM2|MM3|MM2P|MMP2/
However, if you analyzed example 1, you might want to change your
query to:

  1
  1B: /\bMM[23]\b|\bMMP2\b|\bMM2P\b/
  1

which is equivalent to saying:

B: /\bMM[23]\b|\bMMP2\b|\bMM2P\b/

or

/\bMM[23]\b|\bMMP2\b|\bMM2P\b/

since the default is both text and file names. You search for all text files
which refer to MM2, MM3 or MM2P, MMP2 or file names which contain MM2, MM3,
MMP2, MM2P.


Example 3.
----------
Now the following query:
   /basis\sset/
The \s stand for a white space character. Actually, since all tabs and new
lines are converted to single spaces, and multiple spaces are contracted
to single ones, the query above is equivalent to:
   /basis set/
Beside the term "basis set", it will also find "basis sets" and "basis set."
or "basis set,", etc. 
Note that this example is not equivalent to the query:
   2
   /basis/
   /set/
   1 & 2
since in the first case, the words "basis" and "set" must be side by side,
while in the latter case they may be separated by many words and in fact
the "set" may be found before the "basis" is. Also, the latter case will
find all the file names having "basis" or "set" in them, while there are no
file names in the archive which have a space embedded in them.
Actually, none of the above queries is good if you really want to find
all the references about basis sets. People frequently say "basis" or
"set", sometimes they say "basis functions" or "contracted gaussians",
or whatever. You would need a more elaborate expression to be confident
that you found most of the references to this topic. It could look like:
   /basis|set|gaussians|contracted|6-31G\*|631G\*|gaussian exponent/
and could be much larger, but you would run out of line length (there
are no continuation lines). But you can make your query look like:
   2
   /basis|set|gaussians| contracted |6-31G\*|631G\*|gaussian exponent/
   /\bDZP\b|\bTZP\b|\bDZ\b|\bTZ\b|gaussian function|STO-?\dG|\+G\*/
   1 | 2
Note a few points. I did not use "gaussian" but "gaussians" as a separate
term (by term, I mean the text between "|" signs) in the regular expression
since otherwise, I would get all the files referring to the GAUSSIAN program,
and there are plenty of them, not necessarily about basis sets. Please note that
the star was quoted with a backslash character, since otherwise it would mean
"0 or more occurrences of G" and match much more than intended (you might want
this side effect, by the way, if you wanted sets like 6-31G(3d,2f) or 6-31G
without polarization). At the end of the second regular expression, you have
some strange characters. "-?" means 0 or 1 minus signs (some people say STO-3G,
and some incorrectly say STO3G). The \d means any digit (e.g., STO-3G or
STO-4G, etc.). The term "\+G\*" means G preceded with a plus sign and followed
with a star. The backslashes are necessary, since otherwise the + sign would be
interpreted as "1 or more occurrences" and the star as "0 or more occurrences"
of G, and this is not what we want.
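
The effect of quoting can be seen with a few test strings. This is a Python
sketch (the re module follows the same convention for these constructs); the
helper name found and the sample strings are assumptions for the illustration.

```python
import re

def found(pattern: str, text: str) -> bool:
    return re.search(pattern, text, re.IGNORECASE) is not None

# "-?" allows STO-3G as well as STO3G; \d accepts any single digit.
sto = r"STO-?\dG"
# Quoted plus and star: a literal "+G*".
plus_g_star = r"\+G\*"

print(found(sto, "a minimal STO-3G basis"), found(sto, "the STO3G results"))
print(found(plus_g_star, "6-31+G* geometries"), found(plus_g_star, "6-31G* geometries"))

# Quoted star: only the literal "6-31G*" is accepted.
print(found(r"6-31G\*", "6-31G(3d,2f)"))
# Unquoted star: "0 or more G", so 6-31G(3d,2f) is accepted too.
print(found(r"6-31G*", "6-31G(3d,2f)"))
```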


Example 4
---------
Somebody wants to search for files which talk about MNDO and d-orbitals.
Here is an example of a query which could be used for this purpose:
   2
   /\bMNDO[^a-z]/
   /\bd[^a-z]|\bd[_\s-]orbital|\bd[_\s-]function/
   1 & 2
It will look for MNDO and the strings: "d", "d-orbital", "d orbital",
"d_orbital", "d-function", etc. Note the use of [] brackets. In this case
you say: _, space or - can go there. According to the rules, the "-" can only
be used as the last item within brackets to stand for itself. It would be
treated as a range declaration if put in the middle of the [] contents. The
logical expression requires that only files where both MNDO and "d" are
mentioned will be collected. Of course, I could also write it as:
   3
   /\bMNDO\b/
   /\bd[_\s-]/
   /\borbital|\bfunction/
   1 & 2 & 3
It is not equivalent to the previous one, but is close. In both examples, I did
not put \b at the end of "orbital" or "function", so "orbitals", "orbital,",
"orbital.", etc. can also be included as matches.

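The difference between the two queries of Example 4 shows up on a text which
mentions "d" without a following space, hyphen or underscore. The Python sketch
below is only an illustration; the function names and sample sentences are
made up.

```python
import re

def found(pattern: str, text: str) -> bool:
    return re.search(pattern, text, re.IGNORECASE) is not None

def first_query(text: str) -> bool:
    # 1 & 2 with the two expressions of the first form of the query.
    return (found(r"\bMNDO[^a-z]", text)
            and found(r"\bd[^a-z]|\bd[_\s-]orbital|\bd[_\s-]function", text))

def second_query(text: str) -> bool:
    # 1 & 2 & 3 with the three expressions of the second form.
    return (found(r"\bMNDO\b", text)
            and found(r"\bd[_\s-]", text)
            and found(r"\borbital|\bfunction", text))

for text in ["MNDO with d-orbitals on sulfur",
             "MNDO results for third-row elements (d)",
             "d orbitals in AM1"]:
    print(first_query(text), second_query(text), text)
```

The second sentence is reported by the first query only, which is why the two
forms are close but not equivalent.
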
Example 5.
----------
Now a real one... It should be now elementary, my dear Watson.
  3
  1T: /\bMOPAC[^a-z]|\bAMPAC[^a-z]|\bAM1\b|\bPM3\b|\bMNDO\b|\bMINDO/
  2T: /\bCHARGE/
  3T: /\bHYDROGEN[\s_-]?BOND[si\s.,;]|\bH[\s_-]?BOND[si\s.,;]/
  1 AND (2 OR 3)
Three regular expressions were specified. Expression 1 looks for the words
MOPAC, AMPAC, AM1, PM3, MNDO, MINDO. Note that it may be either MOPAC or MOPAC6 so
it is safer to require a non-letter after MOPAC rather than a space
or word boundary. The 2nd expression can find "CHARGE ", "CHARGES",
"CHARGE,", "CHARGE.", "CHARGE=", "CHARGE-", etc. The 3rd one is
a challenge for you. Note that people may say: HYDROGEN BOND, HYDROGEN-BOND,
HYDROGEN_BOND, HYDROGENBOND, H-BOND, H BOND, H_BOND, HBOND, and may say BONDS,
BONDING, and may put .,; after BOND. Note that all the regular expressions
given above request searching for the text of the file only, not its name.
The logical relation simply says: "find me the files which mention MOPAC or
AMPAC or AM1 or PM3 or MNDO or MINDO and say also something about
CHARGes or HYDROGEN BONDs". 
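
To see how the whole query behaves, the three expressions (without their
delimiters and labels) and the relation 1 AND (2 OR 3) can be applied to a few
sentences. This Python sketch is only an approximation of what the search
program does; the function name report and the sample sentences are invented.

```python
import re

EXPRESSIONS = [
    r"\bMOPAC[^a-z]|\bAMPAC[^a-z]|\bAM1\b|\bPM3\b|\bMNDO\b|\bMINDO",
    r"\bCHARGE",
    r"\bHYDROGEN[\s_-]?BOND[si\s.,;]|\bH[\s_-]?BOND[si\s.,;]",
]

def report(text: str) -> bool:
    """Evaluate 1 AND (2 OR 3) for one piece of text."""
    one, two, three = (
        re.search(p, text, re.IGNORECASE) is not None for p in EXPRESSIONS
    )
    return one and (two or three)

for text in [
    "Mulliken charges from MOPAC 6 are too small.",
    "AM1 underestimates H-bonding energies.",
    "Hydrogen bonds in AMBER",
    "MOPAC geometry optimization",
]:
    print(report(text), text)
```

The first two sentences are reported; the third lacks a match for expression 1
and the fourth lacks a match for expressions 2 and 3.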


In short:
  1. Prepare your search query as described above and look at it for a few
     minutes. It is not easy to write a good query!
  2. Send it to chemistry-search@ccl.net
  3. Wait at least 2 hours for an answer (the wait time depends on how many
     searches are pending).



4. USING SEARCH FROM WITHIN MAILSERV
====================================

When using the MAILSERV program, you can perform searches of the current
directory and all subdirectories by first CD-ing to the appropriate directory
and then issuing the following command:

SEARCH
number of regular expressions
/regular expression 1/
/regular expression 2/
.
.
.
/regular expression n/
logical expression

or

SEARCH
/regular expression/

Thus, the format of the MAILSERV search query is the same as before, but with
the word SEARCH preceding it.  Note that the search only takes place in the
current directory and its subdirectories.  This can be used to reduce search
time if you have a reasonably good idea of where your target file or files will
be, and if they aren't spread out all over the place.  Unlike the search
program above, this can be used to search not only the Computational Chemistry
archives but the Russian archives as well.



-----------------------------
I will welcome suggestions for improving this description and corrections
to my spelling and grammar (as you know, I am not a native English speaker).

Jan Labanowski
jkl@ccl.net
Modified: Thu Jan 6 17:00:00 1994 GMT