Searching The CCL Archives


Contents

  1. A Brief Overview
  2. Regular Expressions
  3. Logical Relations
  4. Searching Archives via E-mail

1. A Brief Overview

Sometimes searching for a given word in a text is not enough. You may want to find different forms of the same word, or the same word may be spelled in a number of ways. Some people join two words with a hyphen, some write them together, and others write them separately. Hence, flexible searches are needed, and regular expressions provide them.

Also, you may want to look for a file which contains several pieces of information, e.g., you may want to find files which contain information about MOON and JUPITER. In other words, you want to introduce a logical relation between regular expressions.

Searching a collection of files saves time, since it helps you decide which files may be of interest. The archives can be searched via a Web form or, for those of you who cannot access the World-Wide Web for some reason, through an e-mail interface. The functions of the e-mail searcher are identical to those of the Web searcher, but e-mail is more awkward to use. The Web searcher also has the advantage of letting you retrieve or view interesting files with the click of a button. On the other hand, if a search is time-consuming (e.g., if you search through the entire large archive), you will appreciate performing the search offline and receiving the results via e-mail, rather than waiting at your terminal for half an hour or more. In fact, the Web-form software will refuse to run the search interactively if the estimated time needed is greater than 10 minutes. For longer searches you will need to enter a valid e-mail address and select E-mail mode on the entry form.

You are also presented with two choices of output format (for both the E-mail and the Interactive mode), namely Plain Text and HTML (HyperText Markup Language). If you do not have Web access, choose Plain Text. However, if you are connected to the Web, use HTML by all means. Even if you are getting the search results via e-mail, you will be able to display your results as a Web page and view selected files with a click of a mouse button.

This description is written specifically for searches of the archives available at OSC. Searching scripts are written in the Perl programming language and therefore Perl syntax for regular expressions is used. It is essentially identical to the UNIX egrep syntax. Also, only a subset of the full syntax of regular expressions is described here, because, in our case, the text being searched is initially preprocessed (only for the purpose of the search) in the following ways:

  1. Most punctuation marks (namely: ( ) [ ] { } , ; . : ? ! = < > ' ` " & ^ @ | \ / ~ ) are converted to spaces. Do not search for these characters, since you will not find them (though they are most likely present in the original text). Note that + # - _ * are not converted to spaces. All white space characters (i.e., TABs, NEW-LINEs, FORM-FEEDs, and CARRIAGE-RETURNs) are converted to a space.

  2. Multiple spaces are converted to a single space character.

  3. Hyphenated words which are split between two lines are joined.

For the search software the text looks like a single looooong line of text, without punctuation marks, and only single spaces between words. Moreover, this long line always starts with a space and ends with a space.
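
The preprocessing steps above can be sketched in Python as follows. This is only an illustration, not the archive's actual Perl code: the function name preprocess, the order in which the steps are applied, and the choice to drop the hyphen when rejoining a split word are all assumptions.

```python
import re

# Punctuation marks the page lists as converted to spaces.
# Note that + # - _ * are deliberately absent from this list.
PUNCTUATION = "()[]{},;.:?!=<>'`\"&^@|\\/~"


def preprocess(raw):
    """Approximate the search-time preprocessing described above."""
    # Step 3 (done first here, an assumption): rejoin words that were
    # hyphenated across a line break, dropping the hyphen.
    text = re.sub(r"(\w)-[ \t]*\r?\n[ \t]*(\w)", r"\1\2", raw)
    # Step 1: every listed punctuation mark becomes a space.
    text = text.translate({ord(ch): " " for ch in PUNCTUATION})
    # Steps 1 and 2: any run of white space collapses to a single space.
    text = re.sub(r"\s+", " ", text)
    # The searched line always starts and ends with one space.
    return " " + text.strip(" ") + " "


sample = "Sonatas, by Mozart;\n  C++ and density-\nfunctional (DFT)!"
line = preprocess(sample)
```

Here line holds the single space-padded string that a search would see; the "++" of "C++" survives because + is not among the converted characters.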

To help you decide if the file selected by the search is of interest, a portion of the text (context) surrounding the match is also displayed. You can make this up to 300 characters long. Again, what is displayed is not the actual text, but the text stripped of punctuation and formatting.
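
A minimal Python sketch of how such a context could be cut out of the preprocessed line is shown here. The function name first_match_context is hypothetical, and centering the window on the match is an assumption; the page does not say how the real software places it.

```python
import re

MAX_CONTEXT = 300  # upper limit stated on this page


def first_match_context(line, pattern, width=MAX_CONTEXT):
    """Return up to `width` characters surrounding the first match, or None."""
    width = min(width, MAX_CONTEXT)
    match = re.search(pattern, line, re.IGNORECASE)
    if match is None:
        return None
    # Assumption: center the window on the middle of the match.
    middle = (match.start() + match.end()) // 2
    start = max(0, middle - width // 2)
    return line[start:start + width]


long_line = " " + "x " * 200 + "Mozart" + " y" * 200 + " "
context = first_match_context(long_line, "mozart", 40)
```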



2. Regular Expressions

A regular expression is a way to specify flexible matches, as opposed to rigid keywords. While letters and digits have their verbatim meaning in regular expressions, most punctuation marks have a special meaning and are therefore called metacharacters. Since in our case the search is not sensitive to letter case, it does not matter whether you use capital or lower case letters within the regular expression. Regular expressions are enclosed within a pair of identical characters (delimiters), which must be different from the characters used inside the regular expression itself. This is done to make visible any spaces that are part of the regular expression. For example, ? Mozart ? and ! MoZarT ! will find the same text. Some possible constructs within regular expressions are:
  • Alternative -- a | character separates alternative pieces of text, e.g. /Chopin|moZart|Kuhlau/ will match Chopin, or Mozart, or Kuhlau.
  • Atoms, i.e., the basic elements of a regular expression. They may represent a single character or a group of characters, or even a complete regular expression enclosed within parentheses:
    1. A letter, digit, - , or # will match itself (except that letters may be written as capital or small, i.e., K is the same as k ). To match metacharacters you need to precede them with a backslash (e.g., \+, \*, \$, etc.). When in doubt, use the \ to ensure that the original meaning is preserved. Remember, however, that most of these characters were temporarily changed to spaces for searching and you will not find them. Do not use a backslash before ordinary letters, since some sequences have a special meaning:
      1. \d matches any digit,
      2. \D matches a non-digit,
      3. \s matches a space,
      4. \S matches a non-space character (i.e., letter, digit, punctuation, etc.),
      5. \w corresponds to a word character, i.e., any letter, digit, or _ (underscore).
      6. \W matches characters which are not matched by \w .
      The above sequences are useful. There are also other sequences of this type but they would not be useful for searching here as they deal with characters which are explicitly removed from the text before searching.

    2. A . (period) matches any single character.

    3. A [list] , i.e., a list or a range of characters surrounded by square brackets, matches any character on the list. E.g., [abc012] will match any of the first three letters or digits. The list may include ranges (e.g., [a-z] represents any letter). The list may also be negated with a ^ (caret) character, e.g., [^0-9+-] signifies: all characters but digits, plus or minus signs. Note that the minus sign specifies a range only when surrounded by two ordinary characters. Some ranges do not make sense: if the first character comes later in the table of character codes than the second one, as in [f-a] , the range is nonsense. More than one range may be enclosed in brackets, e.g., [a-z0-9] will match any alphanumeric character.

    4. A \ backslash followed by the octal code, will match the ASCII (or extended ASCII) code of the character. Only eagles should dare. For example, /G\351za/ , and /G\363recki/ will match Géza, and Górecki, respectively, if the ISO Latin-1 character set is used in the file.

    5. Any regular expression enclosed within parentheses is an atom, e.g., (\d\d\d) is an atom which matches 3 consecutive digits.

  • A Quantifier follows an atom and denotes how many times the atom needs to occur:
    1. ? -- 0 or 1 time,
    2. * -- 0 or more times,
    3. + -- 1 or more times,
    4. {2}, {3}, ... -- 2, 3, ... times. In general: {n} means n times,
    5. {2,}, {3,}, ... -- at least 2 times, at least 3 times, ... In general: {n,} means at least n times,
    6. {n,m} -- at least n times but no more than m times.
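
The delimiter rule and a few of the atoms and quantifiers listed above can be tried out with the Python sketch below. Python's re module is used here only as a stand-in for the Perl syntax; the helper compile_delimited and the sample line are made up for illustration.

```python
import re


def compile_delimited(expr):
    """Compile an expression written between two identical delimiters."""
    if len(expr) < 2 or expr[0] != expr[-1]:
        raise ValueError("expression must begin and end with the same delimiter")
    delimiter, body = expr[0], expr[1:-1]
    if delimiter in body:
        raise ValueError("delimiter must not occur inside the expression")
    # The archive search is not sensitive to letter case.
    return re.compile(body, re.IGNORECASE)


# A made-up line in preprocessed form: single spaces, no punctuation.
line = " Mozart wrote 18 sonatas and G\u00e9za played opus 331 "

same_a = compile_delimited("? Mozart ?").search(line)
same_b = compile_delimited("! MoZarT !").search(line)
three_digits = compile_delimited(r"/(\d\d\d)/").search(line)  # atom in parentheses
quantified = compile_delimited(r"/ \d{2,3} /").search(line)   # 2 or 3 digits between spaces
octal = compile_delimited(r"/G\351za/").search(line)          # \351 is e-acute in ISO Latin-1
```

Both Mozart expressions find the same piece of text, and three_digits finds 331 rather than 18 because the atom requires three consecutive digits.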

You can build very powerful searches using these elements. For example, if you are looking for dates from 1820 until 1899, / 18[2-9][0-9] / will do (note the spaces around it). If you want Sonaten, Sonatas, Sonate, Sonata, Sonatine, Sonatinen, Sonatensatz, etc., / SONAT[AEI]/ may help (note: no space at the end). Different spellings of the same word can also be searched for, e.g., Schroedinger or Schrodinger or Schrödinger. To catch all variants use: /Schr(o|oe|\366)dinger/ . Looking for a million dollars or more? This one is easy: /\d( ?\d{3}){2,}/ i.e., find at least 2 consecutive groups of 3 digits (each group may, but does not have to, start with a space) following a digit. Be aware that the numbers could be written as 53,123,456 , 53123456 , or 53 123 456 and that the commas, if present, were changed to spaces before searching.
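
The four expressions in the paragraph above can be checked with a short Python sketch. Python's re module stands in for Perl here, and the sample phrases and the helper found are invented for the test.

```python
import re


def found(pattern, text):
    """True if the pattern occurs in the text, padded with spaces like the real line."""
    return re.search(pattern, " " + text + " ", re.IGNORECASE) is not None


YEARS = r" 18[2-9][0-9] "
SONATA = r" SONAT[AEI]"
SCHROEDINGER = r"Schr(o|oe|\366)dinger"
MILLION = r"\d( ?\d{3}){2,}"

# (pattern, sample text, whether a match is expected)
checks = [
    (YEARS, "composed in 1849", True),
    (YEARS, "composed in 1810", False),
    (SONATA, "two Sonatinen", True),
    (SONATA, "a sonnet", False),
    (SCHROEDINGER, "the Schr\u00f6dinger equation", True),
    (SCHROEDINGER, "the Schroedinger equation", True),
    (SCHROEDINGER, "the Schrodinger equation", True),
    (MILLION, "a prize of 53 123 456", True),
    (MILLION, "a prize of 53123456", True),
    (MILLION, "a prize of 999 999", False),
]

results = [found(pattern, text) == expected for pattern, text, expected in checks]
```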

Here are some examples of valid regular expressions:

/m?ethane/
would match either ethane or methane.
/ab*c/
would match ac, abc, abbc, abbbc, etc., that is any string that starts with an a, is followed by 0 or more b's, and ends with a c.
/ab+c/
would not match ac, but it would match abc, abbc, abbbc, etc.
/cyclo.*ane/
would match cyclodecane, cyclohexane and even cyclones drive me insane. Any string that starts with cyclo, is followed by an arbitrary string, and ends with ane will be matched. Note that the null string will be matched by the period-star pair; thus, cycloane would be matched by the above expression. If you wanted to search for articles on cyclodecane and cyclohexane, but didn't want to match articles about how cyclones drive one insane, you could string together three periods, as follows: /cyclo...ane/
/ c\++ /
would match c+ , c++ , c+++ , etc., while / c\+\+ / would only match c++ .
/\W[^f-h]ood\W/
matches any four-letter word ending in ood except for food, good or hood. (Thus mood and wood would both be matched.)
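
To experiment with the examples above, the following Python sketch lists which candidate strings each expression picks out. Python's re module is used in place of Perl, and the helper hits and the candidate lists are made up; each candidate is padded with spaces to imitate the preprocessed line.

```python
import re


def hits(pattern, candidates):
    """Return the candidates in which the pattern is found."""
    return [c for c in candidates
            if re.search(pattern, " " + c + " ", re.IGNORECASE)]


alkanes = hits(r"m?ethane", ["ethane", "methane", "ethene"])
star = hits(r"ab*c", ["ac", "abc", "abbbc", "adc"])
plus = hits(r"ab+c", ["ac", "abc", "abbbc", "adc"])

cyclo = ["cyclohexane", "cycloane", "cyclones drive me insane", "cyclone"]
any_gap = hits(r"cyclo.*ane", cyclo)
three_gap = hits(r"cyclo...ane", cyclo)

c_plus = hits(r" c\++ ", ["c", "c+", "c++", "c+++"])
c_plus_plus = hits(r" c\+\+ ", ["c", "c+", "c++", "c+++"])
ood = hits(r"\W[^f-h]ood\W", ["food", "good", "hood", "mood", "wood"])
```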

Although regular expressions seem somewhat cryptic and complicated, they are compact. After only a little practice you can become a guru. One important piece of advice for beginners, though, is to practice on a small directory that does not contain large files before you submit your final exhaustive search.



3. Logical Relations

Sometimes you may want to find out if the text contains a number of specific words, not only a single keyword. For example you may be interested in a text file which specifically mentions:
1:/Beethoven/ and 2:/Sonatas/.
In short, you want: 1 AND 2.
Or, if you want to learn who wrote:
1:/Sonatas/ besides 2:/Beethoven/ and 3:/Mozart/
you may want to have a logical relation like:
1 AND NOT (2 OR 3)
which is equivalent to:
1 AND (NOT 2 AND NOT 3).
The numbers refer to regular expressions which are to be matched with the preprocessed text of the file. Of course, you may want to create more complicated relations between matched pieces of text. This is needed when you look for a specific topic. But beware: you can miss useful information if you search mechanically. In the above example, the most valuable file may have started with the phrase "Below is a list of all Sonatas, except those composed by Mozart and Beethoven", and the search would have missed it. On the other hand, if you are looking specifically for
1:/Sonatas/ of 2:/Beethoven/ or 3:/Mozart/, that is:
1 AND (2 OR 3)
you would find this file, even if you do not need it. You have to be aware that even the most specific searches will turn up some useless material. For this reason, when a match to the regular expressions is found, a fragment of the surrounding text is displayed to help you decide if the information is useful. However, only the first match is shown, and you never know what is written later in the file.
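
The relations discussed in this section can be illustrated with a small Python sketch. It is not the archive's own evaluator: the function names are hypothetical, and the sample line is the phrase quoted above after the kind of preprocessing this page describes.

```python
import re
from itertools import product

# The numbered regular expressions from the example.
EXPRESSIONS = {1: "Sonatas", 2: "Beethoven", 3: "Mozart"}


def by_others(e1, e2, e3):
    return e1 and not (e2 or e3)          # 1 AND NOT (2 OR 3)


def by_others_expanded(e1, e2, e3):
    return e1 and (not e2 and not e3)     # 1 AND (NOT 2 AND NOT 3)


def by_either(e1, e2, e3):
    return e1 and (e2 or e3)              # 1 AND (2 OR 3)


def evaluate(line, relation):
    """Match each numbered expression against the line, then apply the relation."""
    hit = {n: re.search(p, line, re.IGNORECASE) is not None
           for n, p in EXPRESSIONS.items()}
    return relation(hit[1], hit[2], hit[3])


# The two ways of writing the first relation agree for every combination.
equivalent = all(by_others(*v) == by_others_expanded(*v)
                 for v in product([False, True], repeat=3))

tricky = " Below is a list of all Sonatas except those composed by Mozart and Beethoven "
missed = not evaluate(tricky, by_others)      # the valuable file is skipped
found_anyway = evaluate(tricky, by_either)    # the unwanted file is returned
```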



4. Searching Archives via E-mail

The archives can also be searched via e-mail by sending a message to MAILSERV@server.ccl.net. To get an overview of all commands send a single word:
  help
in the body of your message to MAILSERV@server.ccl.net. More information is available on each command, and the details of e-mail archive searching can be obtained by sending the line:
  help search
to MAILSERV@server.ccl.net. There is also a more detailed help file available regarding email searching. An example of a typical search query follows:
  select chemistry
  cd archived-messages/95
  search HTML 250
  7
  1T:  /charges/
  2T:  / MP[2-4] /
  3T:  / MBPT /
  4T:  / DFT /
  5T:  / LSD /
  6T:  / LDF /
  7T:  / Density[ -]Functional /
  1 AND (2 or 3 or 4 or 5 or 6 or 7)
  dir 
  quit
In this example:
  • Line 1: selects the particular archives called "chemistry".
  • Line 2: specifies the top directory to be searched. In this case it corresponds to archived messages which appeared in 1995 on the computational chemistry list.
  • Line 3: the search command with parameters: HTML (i.e., requesting that the results be formatted as a Web page; you could also use "Plain" to receive plain ASCII text), and a 250-character-long context to be shown.
  • Line 4: the number of regular expressions (7) to be matched with files.
  • Lines 5-11: represent regular expressions preceded by the optional running number and the scope of matching. A T specifies that only the text of the file should be searched for a match (other searching scopes are: N for matching file names only, and B for matching regular expressions against both the text and the name of the file -- this is the default).
  • Line 12: the logical relation to be satisfied by the result of regular expression matching. In this case, somebody is interested in finding information about atomic charge calculations by different quantum chemical methods.
  • Line 13: a request for a directory listing of the current directory.
  • Line 14: the command which marks the end of the commands.
You may send multiple search queries and other commands within a single request to MAILSERV@server.ccl.net. The searches will be executed in the order received, one at a time.
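
If you compose such requests often, the layout of the example query can be assembled by a small helper. The Python sketch below is hypothetical: the function build_search_request and its parameters are invented here, it only reproduces the line order shown in the example (without the optional dir command), and its output has not been checked against the mail server.

```python
def build_search_request(archive, directory, expressions, relation,
                         output="HTML", context=250):
    """Assemble a message body in the layout of the example query.

    `expressions` is a list of (scope, regex) pairs, where scope is
    "T" (text only), "N" (file names only) or "B" (both).
    """
    lines = [
        "select " + archive,
        "cd " + directory,
        "search {} {}".format(output, context),
        str(len(expressions)),  # number of regular expressions that follow
    ]
    for number, (scope, regex) in enumerate(expressions, start=1):
        lines.append("{}{}:  {}".format(number, scope, regex))
    lines.append(relation)
    lines.append("quit")
    return "\n".join(lines)


request = build_search_request(
    "chemistry",
    "archived-messages/95",
    [("T", "/charges/"), ("T", "/ MP[2-4] /"), ("T", "/ DFT /")],
    "1 AND (2 or 3)",
)
```

The resulting text would then be sent as the body of a message to MAILSERV@server.ccl.net.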



Modified: Thu Sep 13 22:03:04 2001 GMT