------------- HOW TO SEARCH COMPUTATIONAL CHEMISTRY ARCHIVES -------------

This file can be obtained from anonymous ftp at www.ccl.net as
   pub/chemistry/instructions/help.search
or via e-mail by sending a message:
   HELP SEARCH
to MAILSERV@ccl.net

Computational chemistry archives can be searched by sending a search query to the address: chemistry-search@ccl.net. The list of files which satisfy the query will be sent back via e-mail to the originator of the search request. This is an experimental service and may be improved (or discontinued) in the future, so please send your comments and ideas to the author: jkl@ccl.net.

The following document describes the format of the search query. The format is not simple. However, it allows for precise and elaborate search queries. If you have suggestions on how to make it simpler without losing its generality, please tell me. I will appreciate your comments.

If you do not have time to read these instructions, do not bother to send your search query, unless you want to check the "garbage in, garbage out" paradigm. Reading through these instructions is not a complete waste of time, though. You will learn about regular expressions, a concept which is very popular in UNIX and elsewhere.

The search query specifies:
   1. How many text patterns to look for,
   2. Text patterns to look for --- so called "regular expressions",
   3. Logical relation between patterns which needs to be satisfied for a file to be included in the search results.

Ad. 1) You may want to look for several pieces of text and then choose only those files which contain all of them, e.g., files which mention MOPAC and CHARGE, or MM2 and PARAMETERS. However, you are free to choose a more complicated relation between regular expressions.

Ad. 2) If the regular expression is satisfied (i.e., there is a match for it in the text of the file) it assumes the value TRUE. If there is no match for it in the text, its value is FALSE.

Ad.
3) The logical relation tells when the file name should be reported. It represents a relation between the values of regular expressions after they were matched with the contents of the file. E.g., if you look for words MOPAC (1) and CHARGE (2), you may have the following relations:
   a) Report files where both /MOPAC/ AND /CHARGE/ appear. [1 AND 2]
   b) Report files where /MOPAC/ OR /CHARGE/ appear (one of them is enough). [1 OR 2]
   c) Report files where /MOPAC/ appears but /CHARGE/ does not (all files with MOPAC where CHARGE is not mentioned). [1 AND Not 2]
   d) Report files where /MOPAC/ is not mentioned but /CHARGE/ is (i.e., all files where CHARGE is mentioned, but not in the context of MOPAC). [Not 1 AND 2]
   e) Report files where /MOPAC/ appears or /CHARGE/ does not appear. [1 OR Not 2]
   f) Report files where /MOPAC/ does not appear or /CHARGE/ appears. [Not 1 OR 2]
   g) Report files where /MOPAC/ is not mentioned and /CHARGE/ is not mentioned (neither of the two words is mentioned). [Not 1 AND Not 2]
   h) Report files where either /MOPAC/ or /CHARGE/ is not mentioned (or both are not mentioned). [Not 1 OR Not 2]

Two words and so many possibilities. Moreover, some of these relations can be written in a different way. E.g., in relation g) the expression [Not 1 AND Not 2] is equivalent to [Not(1 OR 2)], i.e., files which contain either word should be rejected. Expression h) [Not 1 OR Not 2] can be written as [Not(1 AND 2)], i.e., reject files which contain both words. So please, think about your logical relation before you write it down, though you will soon find out that logical thinking is not that easy...

1. REGULAR EXPRESSIONS
======================

In its simplest form, a regular expression is just a word which you want to find in the file name and/or the text of the file itself. This word has to be surrounded by a unique character, called a delimiter. This character must not appear inside the regular expression.
For example: /Gaussian/, bGaussianb, .Gaussian., all denote the same regular expression which matches the word Gaussian. Since leading and trailing spaces are removed from each line during processing, the delimiter is the only way to distinguish between significant and insignificant spaces.

Some characters (so called "metacharacters") and groups of characters have a special meaning within the regular expression, and if their original meaning is needed, they have to be "quoted" by preceding them with a backslash character "\". On the other hand, quoting some ordinary letters may attach a special meaning to them, so use the backslash judiciously. The list below corresponds to the Perl convention for regular expressions. It is slightly different from the one used in UNIX regular expressions.

Remember also that:

a) The file appears to the searching program as a single, long line of text where words are separated by single spaces (new lines are replaced with spaces; multiple spaces, tabs and other white space are contracted into a single space; hyphenated words at the end of the line are joined). There are two exceptions to this rule:
   1) file names, if searched, are treated as separate pieces of text,
   2) when searching files containing archived messages posted to the list at a given date, each message is searched separately, as if it were a separate file. If the logical expression is satisfied for the message, then the name of the file is reported.
Please note that only text files are scanned for text (the contents of binary files are not). However, all file names (for binary as well as text files) are scanned, if requested.

b) The search is lettercase insensitive, i.e., searching for words: charge, Charge, cHaRgE, etc., will produce the same result.

c) The description below includes full metacharacter definitions; however, some characters will never be found in the text since, e.g., there are no new_lines and tabs.
Also, due to lettercase insensitivity, A-Z is the same as a-z; Mopac, MOPAC, mopac are equivalent, etc.

Constructs in regular expressions:

.    --- Period matches any character (except a new line; however, there are no new lines here).

\    --- Backslash character. It is used for quoting special characters. Some characters have a special meaning (e.g., the period above). If you want their original meaning (i.e., that . matches the dot, i.e., a decimal point or a period at the end of the phrase) you need to quote them with a backslash. In this case, quote the dot, i.e., use: \. to protect its native meaning. Since the backslash is not present on some keyboards and is sometimes "swallowed" by network gateways, you can substitute it with a backtick ` (grave accent) or ~ (tilde).

~    --- Tilde is a substitute for backslash, i.e., ~b and \b are equivalent. You cannot search for this character or use it as a delimiter of the regular expression.

`    --- Backtick (grave accent) is a substitute for backslash, i.e., `S+ and \S+ are equivalent. You cannot search for this character or use it as a delimiter of the regular expression.

[ ]  --- Any character within the square brackets matches, e.g., [a,b] matches a, comma, b. Ranges are also allowed: [0-9a-z] will match any digit or letter. Note that if you are searching for "-", you must put it just before the right ], or it will be treated as a range. Negation within square brackets is achieved with a caret character (circumflex accent) immediately following the left bracket "[^", e.g., [^a-z] will match everything but letters, [^_] will match everything but underscore. Note that within brackets, characters: .?*+|()$^{} should not be quoted, while character [ or \ or ] should be entered as \[ or \\ or \] in case you need their original meaning.
Characters: ?.*+|()$^{}[] outside square brackets should be quoted by a backslash if their original meaning is needed (but you would rarely need their original meaning, unless you are searching for a particular line in a C program or a UNIX script).

\d   --- Matches any digit (i.e., is a shorthand for [0-9]).

\D   --- Matches everything but a digit (i.e., is a shorthand for [^0-9]).

\w   --- Matches "word" characters, i.e., letters, digits and underscore (same as [a-zA-Z0-9_]).

\W   --- Matches a "nonword" character (same as [^a-zA-Z0-9_]).

\s   --- Matches a white-space character (i.e., space, new_line, tab, etc.).

\S   --- Matches a non-white-space character (i.e., characters which use pigment in your printer).

\xxx --- Matches the character whose ASCII octal code is xxx (xxx are digits).

x?   --- Matches 0 or 1 occurrences of character x (or any other construct, i.e., [a-z]? matches 0 or 1 letter).

x*   --- Matches 0 or more occurrences of character x.

x+   --- Matches 1 or more occurrences of character x.

x{m,n} --- Matches at least m, but no more than n occurrences of x, e.g., [0-9.+-]{2,3} will match: 1.2, .2, +11, -1, 123, +-.

|    --- Alternative: \son\s|\sin\s|\sup\s|\sat\s will match words: on, in, up, and at.

\b   --- Matches a word boundary (outside [] only). It corresponds to white space, punctuation marks and the very beginning and end of the text.

The following constructs are valid in the Perl language, but SHOULD NOT BE USED HERE, since they may confuse the internal working of the search.

^    --- Outside [] marks the beginning of the string. DO NOT USE HERE.

$    --- Outside [] marks the end of the string. DO NOT USE HERE.

\n   --- Matches a new line. DO NOT USE HERE.

\r   --- Matches a carriage return. DO NOT USE HERE.

\t   --- Matches a tab. DO NOT USE HERE.

\f   --- Matches a formfeed. DO NOT USE HERE.

\b   --- Matches a backspace inside []. DO NOT USE HERE.

()   --- Parentheses are special characters which mark substrings for substitution. We do not do substitutions here. If you search for parentheses, use \( and \) outside square brackets.
\1, \2 ... \9 --- Used only in substitution strings. DO NOT USE HERE.

\B   --- Matches a non-word boundary. DO NOT USE HERE.

Since some (few, to be truthful) computers and gateways "do not like" the backslash "\" character, I provided the alternatives ` (backtick, grave accent) and ~ (tilde). For reasons which I do not want to go into, you must not use the characters : or ! inside your regular expressions. Note that in UNIX, parentheses and braces are quoted to get their special meaning, while here they need to be quoted to get their ordinary meaning.

Regular expressions, bracketed with a delimiter character, can be optionally preceded with a label and a search scope identifier followed by a colon ":". The label is an integer number and the scope is one of the letters:
   T - text only (file names will not be matched to the regular expression),
   F - file name only (text inside the file will not be scanned),
   B - both file name and text will be scanned for matching (default).
Both the label and the scope can be omitted, but if either exists, the colon must be present. For the purpose of the search, all the expressions below are identical (strictly speaking, [MA][MO]PAC would also match MMPAC and AOPAC, but these are unlikely to occur):
   1B : /[MA][MO]PAC/
   2: #ampac|mopac#
   : +MOPAC|AMPAC+
   B4: *Ampac|Mopac*
   -[am][mo]pac-
Note that the numerical value of the integer label is disregarded by the program, and the program assigns the numbers to regular expressions based on the order in which they were specified. The label is there only for your convenience.

2. LOGICAL RELATION
===================

Once you have specified your regular expressions, you need to specify a logical relation between them, i.e., under which conditions the file should be included in the search report. The logical relation consists of numbers referring to regular expressions and of logical operators. It is evaluated after the regular expressions were matched to the file's text and/or name. The expression numbers refer to the order in which the regular expressions were specified.
The operators are:
   &   AND  (you can also use && if you are a UNIX or C fan)
   |   OR   (you can also use || if you are a UNIX or C fan)
   !   NOT
The relation may also contain parentheses. For example:
   1 OR 2   is equivalent to   1 | 2   is equivalent to   1 || 2
while
   1 AND NOT 2   is the same as   1 & NOT 2   is the same as   1&!2
   is equivalent to   1 AND (NOT 2)   is equivalent to   (1and(nOT2))
Spaces are optional and the operators AND, OR, NOT can be written in any lettercase. Each expression number represents the status (FOUND or NOT_FOUND) of matching that regular expression against the text/name of the file. Remember also that in logic the & takes precedence over |, and the ! is a unary operator which takes precedence over both of them, but use parentheses for better readability.

3. QUERY FORMAT AND EXAMPLES
============================

The complete query has the following format:

   Number of regular expressions (N)
   regular expression 1
   regular expression 2
   ....
   regular expression N
   logical relation

The query should be sent to chemistry-search@ccl.net, and when the search is finished, the resulting list of files satisfying your query will be sent to you automatically. The list will also include the matched pieces of text. Do not be impatient and do not send your next query before the previous results arrive, since your request will be denied. Please remember that this is a flat file search and it is very demanding as far as I/O and CPU are concerned. Therefore only one query will be running at any given moment.

Before you send a query, try to analyze it, and make it specific. You do not want to get a listing of the whole archive. Queries which search only file names are much faster than queries which scan the whole file. If you look only for a single expression, you can use the abbreviated one-line format, which consists of a single line containing a regular expression.

Now, a few examples. Please analyze them before attempting your search.

Example 1.
----------

Assume that you want to find files which mention AMBER. You could do it by saying:

   1
   /amber/
   1

or, since you have only one regular expression, you can say

   /amber/

The lettercase does not matter, so you can also say: /AMBER/, /AmbeR/, /Amber/. However, note that in principle your query could find the words: camber, chamber, chamberlain, clamber or lambert, and this is not what you want. To be more specific, you need to request the word "amber", not any amber-containing word. You could do it as:

   / amber /

i.e., by putting spaces around it. However, what if somebody said: "For these calculations I used AMBER." You would not find it, since "." is not a space. It is therefore best to use \b, i.e., the "word" boundary character:

   /\bamber\b/

This is not without problems, however. What if somebody said: "Amber3.0 is slower than Amber3.1." You would not find it, because digits and underscores are considered parts of the word (by programmers, at least), so there is no word boundary between "Amber" and "3". I think that in this case, the best compromise is to look for

   /\bamber[^a-z]/

i.e., for the word "amber" which is not followed by a letter. I hope that you now realize how important it is to analyze all possible combinations which will match your regular expression. You do not want to get too many unrelated files, but you want to be sure that you included all the files which relate to the topic of your search.

Example 2.
----------

Assume that you want to search for information on MM2, MM3, MM2P or MMP2. You can search the archives by giving the following query:

   4
   /MM2/
   /MM3/
   /MM2P/
   /MMP2/
   1 | 2 | 3 | 4

You can also write it differently:

   1
   /MM2|MM3|MM2P|MMP2/
   1

or even as:

   /MM2|MM3|MM2P|MMP2/

However, if you analyzed example 1, you might want to change your query to:

   1
   1B: /\bMM[23]\b|\bMMP2\b|\bMM2P\b/
   1

which is equivalent to saying:

   B: /\bMM[23]\b|\bMMP2\b|\bMM2P\b/

or

   /\bMM[23]\b|\bMMP2\b|\bMM2P\b/

since the default is both text and file names.
This searches for all text files which refer to MM2, MM3, MM2P or MMP2, and for file names which contain MM2, MM3, MMP2 or MM2P.

Example 3.
----------

Now the following query:

   /basis\sset/

The \s stands for a white space character. Actually, since all tabs and new lines are converted to single spaces, and multiple spaces are contracted to single ones, the query above is equivalent to:

   /basis set/

Besides the term "basis set", it will also find "basis sets" and "basis set." or "basis set,", etc. Note that this example is not equivalent to the query:

   2
   /basis/
   /set/
   1 & 2

since in the first case, the words "basis" and "set" must be side by side, while in the latter case they may be separated by many words, and in fact "set" may be found before "basis" is. Also, the latter case will find all the file names having "basis" or "set" in them, while there are no file names in the archive which have a space embedded in them.

Actually, none of the above queries is good if you really want to find all the references about basis sets. People frequently say "basis" or "set", sometimes they say "basis functions" or "contracted gaussians", or whatever. You would need a more elaborate expression to be confident that you found most of the references to this topic. It could look like:

   /basis|set|gaussians|contracted|6-31G\*|631G\*|gaussian exponent/

and could be much larger, but you would run out of line length (there are no continuation lines). But you can make your query look like:

   2
   /basis|set|gaussians|contracted|6-31G\*|631G\*|gaussian exponent/
   /\bDZP\b|\bTZP\b|\bDZ\b|\bTZ\b|gaussian function|STO-?\dG|\+G\*/
   1 | 2

Note a few points. I did not use "gaussian" but "gaussians" as a separate term (by term, I mean the text between "|" signs) in the regular expression since otherwise I would get all the files referring to the GAUSSIAN program, and there are plenty of them, not necessarily about basis sets.
Please note that the star was quoted with a backslash character, since otherwise it would mean "0 or more occurrences of G" and match more than intended (you might want this side effect, by the way, if you wanted sets like 6-31G(3d,2f) or 6-31G without polarization). At the end of the second regular expression, you have some strange characters. "-?" means 0 or 1 minus signs (some people say STO-3G, and some incorrectly say STO3G). The \d means any digit (e.g., STO-3G or STO-4G, etc.). The term "\+G\*" means G preceded by a plus sign and followed by a star. The backslashes are necessary, since otherwise the + sign would be interpreted as "1 or more occurrences" and the star as "0 or more occurrences" of the preceding character, and this is not what we want.

Example 4.
----------

Somebody wants to search for files which talk about MNDO and d-orbitals. Here is an example of a query which one could use for this purpose:

   2
   /\bMNDO[^a-z]/
   /\bd[^a-z]|\bd[_\s-]orbital|\bd[_\s-]function/
   1 & 2

It will look for MNDO and the strings: "d", "d-orbital", "d orbital", "d_orbital", "d-function", etc. Note the use of [] brackets. In this case you say: _, space or - can go there. According to the rules, the "-" stands for itself only when it is the last item within the brackets. It would be treated as a range declaration if put in the middle of the [] contents. The logical expression requires that only files where both MNDO and "d" are mentioned will be collected. Of course, I could also write it as:

   3
   /\bMNDO\b/
   /\bd[_\s-]/
   /\borbital|\bfunction/
   1 & 2 & 3

It is not equivalent to the previous one, but is close. In both examples, I did not put \b at the end of "orbital" or "function", so "orbitals", "orbital,", "orbital.", etc. can also be included as matches.

Example 5.
----------

Now a real one... It should be now elementary, my dear Watson.

   3
   1T: /\bMOPAC[^a-z]|\bAMPAC[^a-z]|\bAM1\b|\bPM3\b|\bMNDO\b|\bMINDO/
   2T: /\bCHARGE/
   3T: /\bHYDROGEN[\s_-]?BOND[si\s.,;]|\bH[\s_-]?BOND[si\s.,;]/
   1 AND (2 OR 3)

Three regular expressions were specified.
Expression 1 looks for the words MOPAC, AMPAC, AM1, PM3, MNDO, MINDO. Note that the text may say either MOPAC or MOPAC6, so it is safer to require a non-letter after MOPAC rather than a space or word boundary. The 2nd expression can find "CHARGE ", "CHARGES", "CHARGE,", "CHARGE.", "CHARGE=", "CHARGE-", etc. The 3rd one is a challenge for you. Note that people may say: HYDROGEN BOND, HYDROGEN-BOND, HYDROGEN_BOND, HYDROGENBOND, H-BOND, H BOND, H_BOND, HBOND, and may say BONDS, BONDING, and may put .,; after BOND. Note that all the regular expressions given above request searching the text of the file only, not its name. The logical relation simply says: "find me the files which mention MOPAC or AMPAC or AM1 or PM3 or MNDO or MINDO and say also something about CHARGes or HYDROGEN BONDs".

In short:
   1. Prepare your search query as described above and look at it for a few minutes. It is not easy to write a good query!
   2. Send it to chemistry-search@ccl.net
   3. Wait (at least 2 hours) for an answer; the wait time depends on how many searches are pending.

4. USING SEARCH FROM WITHIN MAILSERV
====================================

When using the MAILSERV program, you can perform searches of the current directory and all subdirectories by first CD-ing to the appropriate directory and then issuing the following command:

   SEARCH number of regular expressions
   /regular expression 1/
   /regular expression 2/
   .
   .
   .
   /regular expression n/
   logical expression

or

   SEARCH /regular expression/

Thus, the format of the MAILSERV search query is the same as before, but with the word SEARCH preceding it. Note that the search only takes place in the current directory and its subdirectories. This can be used to reduce search time if you have a reasonably good idea of where your target file or files will be, and if they are not spread out all over the place. Unlike the search program above, this can be used to search not only the Computational Chemistry archives but the Russian archives as well.
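
Since only one query runs at a time and answers take hours, it may be worth trying a query on a few sample sentences of your own before mailing it. The Python sketch below is NOT the archive's search program; it only imitates the rules described in this document (text flattened to one line, lettercase ignored, T/F/B scope, ` and ~ standing in for the backslash, logical relation over numbered expressions). The function names normalize, parse_pattern, evaluate and run_query are made up for this illustration, and Python's regular expressions differ from Perl's in some details, so treat a local match only as a rough indication.

```python
import re

def normalize(text):
    """Flatten text to one line, roughly as the instructions describe."""
    text = re.sub(r"-\s*\n\s*", "", text)    # join words hyphenated at line end
    return re.sub(r"\s+", " ", text).strip()  # contract all white space

def parse_pattern(line):
    """Turn e.g. '1T: /regex/' into (scope, compiled regex); the label is ignored."""
    prefix, rest = "", line.strip()
    if ":" in rest:                           # ':' may not occur inside a regex
        prefix, rest = rest.split(":", 1)
        rest = rest.strip()
    letters = [c for c in prefix.upper() if c in "TFB"]
    scope = letters[0] if letters else "B"    # default scope: both name and text
    if len(rest) < 2 or rest[0] != rest[-1]:
        raise ValueError("regular expression must be enclosed in a delimiter")
    body = rest[1:-1].replace("`", "\\").replace("~", "\\")
    return scope, re.compile(body, re.IGNORECASE)

def evaluate(relation, found):
    """Evaluate a relation such as '1 AND (2 OR NOT 3)' over a list of booleans."""
    tokens = re.findall(r"\d+|&&?|\|\|?|!|[()]|and|or|not", relation, re.I)
    if "".join(tokens) != re.sub(r"\s+", "", relation):
        raise ValueError("unrecognised text in logical relation")
    words = []
    for tok in tokens:
        low = tok.lower()
        if low.isdigit():
            if int(low) < 1:
                raise ValueError("expression numbers start at 1")
            words.append(str(found[int(low) - 1]))
        elif low in ("&", "&&", "and"):
            words.append("and")
        elif low in ("|", "||", "or"):
            words.append("or")
        elif low in ("!", "not"):
            words.append("not")
        else:
            words.append(low)                 # a parenthesis
    # Only whitelisted tokens reach eval; Python's precedence (not, and, or)
    # agrees with the precedence stated in section 2.
    return bool(eval(" ".join(words), {"__builtins__": {}}))

def run_query(query, name, text):
    """Apply a full or one-line query to one file name and its text."""
    lines = [ln for ln in query.splitlines() if ln.strip()]
    if len(lines) == 1:                       # abbreviated one-line format
        patterns, relation = lines, "1"
    else:
        n = int(lines[0])
        patterns, relation = lines[1:1 + n], lines[1 + n]
    flat = normalize(text)
    found = []
    for line in patterns:
        scope, rx = parse_pattern(line)
        in_text = scope in "TB" and rx.search(flat) is not None
        in_name = scope in "FB" and rx.search(name) is not None
        found.append(in_text or in_name)
    return evaluate(relation, found)

if __name__ == "__main__":
    query = r"""2
/\bMNDO[^a-z]/
/\bd[^a-z]|\bd[_\s-]orbital|\bd[_\s-]function/
1 & 2"""
    print(run_query(query, "notes.txt", "MNDO with d-orbitals on sulfur"))
```

The sample call at the bottom uses the query of Example 4 on a made-up sentence and file name. Replacing the sentence with text that should NOT match (e.g., one mentioning "chamber" when testing the AMBER expressions of Example 1) is a quick way to see whether a pattern is too permissive.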
-----------------------------

I will welcome suggestions for improving this description and corrections to my spelling and grammar (as you know, I am not a native English speaker).

Jan Labanowski
jkl@ccl.net