CAR Execution Modification

The researcher can modify CAR's execution of processes in five ways:

  1. parameterisation of hard-code
  2. invocation of system library
  3. invocation of research library
  4. dynamic interface
  5. version control.

Parameterisation of hard-code

By hard-code we mean Java class code that is fully compiled and integrated into CAR. Specific methods have been provided as hard-code in anticipation of the researcher's long-term needs. For example, the method setField is a parameterisable feature of documentUpdate which allows the researcher to specify a field name and value in the session file. All hard-coded features are defined in the CAR User Manual.

Invocation of system library

Each of the four processes documentUpdate, collectionUpdate, generalUpdate, and match has a CAR system library associated with it, namely CarLibraryA, CarLibraryB, CarLibraryC, and CarLibraryD, respectively.

Each library is implemented as a Java class and provides one or more methods that the researcher can invoke by specifying the class (library) and method name in a library parameter in the session file. Such methods are invoked 'by name', and invoking them is slower than invoking hard-coded methods. This is hardly significant for the collectionUpdate and generalUpdate processes, which require just one invocation (the library method itself processes all of the documents and is executed just once), but for the documentUpdate and match processes an invocation occurs for each document. The overhead is therefore especially significant for large document collections.
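This document does not show how CAR performs invocation 'by name'. The following is a minimal sketch of one plausible approach, using standard Java reflection: the class and method are looked up from their names once, and the resulting handle is then invoked once per document. ByNameSketch and DemoLibrary are hypothetical names, documents are reduced to strings, and the code assumes a current JDK rather than the JDK 1.3 named in the Makefile.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Method;

// Sketch only: ByNameSketch and DemoLibrary are hypothetical, not CAR classes.
class ByNameSketch {

    // Resolve the class and method from their names once, then reuse the
    // resulting Method handle for every document.
    static String[] run(String className, String[] methodArgs, String[] docs) {
        try {
            Class<?> cls = Class.forName(className);
            Constructor<?> ctor = cls.getConstructor(String[].class);
            Object library = ctor.newInstance((Object) new String[] { className });
            Method method = cls.getMethod(methodArgs[0], String.class, String[].class);

            String[] out = new String[docs.length];
            for (int i = 0; i < docs.length; i++) {
                // One reflective call per document: this is the per-document cost.
                out[i] = (String) method.invoke(library, docs[i], methodArgs);
            }
            return out;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot invoke " + className, e);
        }
    }

    public static void main(String[] args) {
        String[] result = run(DemoLibrary.class.getName(),
                              new String[] { "tag", "seen" },
                              new String[] { "doc1", "doc2" });
        for (String r : result) {
            System.out.println(r);   // doc1[seen] then doc2[seen]
        }
    }
}

// Stand-in for a library class: a public constructor and public methods.
class DemoLibrary {
    public DemoLibrary(String[] classArgs) { }

    public String tag(String document, String[] methodArgs) {
        return document + "[" + methodArgs[1] + "]";
    }
}
```

In this sketch the name lookup happens only once, but each document still costs one reflective call, which is the kind of overhead that grows with the size of the collection.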

So why have a library? Implementation of library methods is relatively simple, involving programming the method in the particular class and informing researchers of its availability (the mechanism for invoking such methods is general and requires no change). Hard-coding involves considerably greater implementation effort. During the research-intensive phase of CAR it's therefore convenient to have libraries.

All system-library features are defined in the CAR User Manual.

Invocation of research library

The concept of research libraries is similar to that of system libraries. The difference is that the researcher provides the library. A research library is invoked through the same library parameter that is used to invoke system libraries.

Imagine then that a separate library folder named ResearchLibraries has been set up and a class DocumentUpdateResearchLibrary has been developed by the researcher. In order to access the compiled class the Java executor requires its path (the folder containing it) to be included in the classpath. Assuming we are invoking CAR using the make utility we might have a Makefile as follows:


######################## CAR Makefile1 ####################
#                                                         #
#  this Makefile:                                         #
#        - reads and runs a session file from stdIn       #
#        - it uses a JAR file of all CAR classes          #
#        - and optionally research libraries              #
#        - and a JAR file of Bali (for dynamic runs)      #
#        - the classpath is set at run time               #
#                                                         #
#  author: Lindsey Ford                                   #
#                                                         #
###########################################################

PROJDIR       =   c:\Car

JAR           =   car.jar

JDKDIR        =   c:\jdk1.3
JAVA          =   $(JDKDIR)\bin\java
RUNCLASSPATH=$(PROJDIR)\System\$(JAR);$(PROJDIR)\System\bali.jar;$(JDKDIR)\jre\lib\rt.jar

go: 
    cd $(PROJDIR)
    $(JAVA) -classpath $(PROJDIR)\ResearchLibraries;$(RUNCLASSPATH) MainCar < Xamples\sessionTest8a.txt
        

Note that in the last line, the one that invokes Java, the classpath causes the Java executor to first look in the ResearchLibraries folder in its search for class files.

For demonstration purposes, for each of the four processes documentUpdate, collectionUpdate, generalUpdate, and match a research library has been set up in the folder ResearchLibraries, namely DocumentUpdateResearchLibrary, CollectionUpdateResearchLibrary, GeneralUpdateResearchLibrary, and MatchResearchLibrary, respectively. These correspond directly to CarLibraryA, CarLibraryB, CarLibraryC, and CarLibraryD, respectively.

Furthermore, we have chosen to exemplify these libraries using functionality already found in hard-code and/or exemplified in system libraries and/or dynamic interfaces. This serves to show that these methods of execution modification are functionally equivalent, differing only in implementation detail.

Example 1 (documentUpdate: Research Library)

<!-- session Test 8a -->

<process id=p1 type=documentUpdate>
    input=          external;
    inputName=      "c:\Car\Xamples\eNotes3.txt";
    output=         external;
    outputName=     "c:\Car\Xamples\docs9Updated.txt"
</process>

<parameter id=param1 type=library>
    class=          DocumentUpdateResearchLibrary;
    method=         attributeChange,location,weight,1.2;
</parameter>

<action id=a1>
    processId=      p1;
    parameterId=    param1;
</action>

Note that the class name is DocumentUpdateResearchLibrary (as opposed to CarLibraryA if we had been doing a document update using a system library). That is the only difference as far as the session file is concerned.

As shown in Example 6 (documentUpdate: library) of the CAR User Manual, several methods for the same class can be invoked from one parameter, and indeed several classes can be used in one parameter. The following arrangement, for example, is possible:

<parameter id=param1 type=library>
    class=          C1;
    method=         m1;
    method=         m2;
    ...
    method=         mn;

    class=          C2;
    method=         m1;
    method=         m2;
    ...
    method=         mn;

...

    class=          Cn;
    method=         m1;
    method=         m2;
    ...
    method=         mn;
</parameter>

Apart from having a working knowledge of Java and of CAR classes such as ValueTuple, the researcher needs to be familiar with how CAR invokes their library methods. Below is an extract from DocumentUpdateResearchLibrary showing the class constructor and the method used in Example 1 above. (Researchers can implement their libraries however they wish, using whatever fields and methods they desire, provided there is a public class constructor and public methods conforming to the ones specified in the session file.)

  //----------------------------
  public DocumentUpdateResearchLibrary(ValueTuple valueTuple) { // constructor
  }

  //----------------------------
  public void attributeChange(Document doc, ValueTuple vt) {
      Field field = doc.getField(vt.getStringValue(1));
      // above searches for field with given name (1)
      if(field==null) return;    // document doesn't contain the field
      field.setAttribute(vt.getStringValue(2),vt.getStringValue(3));
      // above sets the attribute (2) with the value (3)
  }

CAR passes a single argument, the ValueTuple for the labelled value 'class=' from the session file, to the constructor. The first (and perhaps only) value of the tuple is the name of the class, but there is no reason why other comma-separated values shouldn't be included in the labelled value and passed to the constructor, for the researcher to use as desired.

The constructor is invoked during session file validation, to ensure the session file's validity and to provide an instance of the class that will later be used to validate the named methods for it.

CAR passes two arguments to each method of the class specified in the session file. The first is a reference to the document to be processed, the second a ValueTuple for the labelled value 'method='. During session file validation, instances of such methods are generated to check the validity of their specification in the session file and to facilitate their later execution.

When a document update process is actioned with a library parameter, the instance of the attributeChange method is invoked for each incoming document, thereby allowing the researcher's method to update the document as desired. The field name, the name of the attribute, and its value are contained in the ValueTuple (the second argument of the method, the first value of which is the method name).
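The indexing used in attributeChange can be illustrated with a small stand-in for ValueTuple. TupleSketch is hypothetical and is not the CAR class: it assumes only what the extract above implies, namely that the comma-separated values of a labelled value are numbered from 0, so that value 0 is the method name and values 1 to 3 are its arguments.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for CAR's ValueTuple, for illustration only.
class TupleSketch {
    private final List<String> values = new ArrayList<>();

    // Split a labelled value such as "attributeChange,location,weight,1.2".
    TupleSketch(String labelledValue) {
        for (String part : labelledValue.split(",")) {
            values.add(part.trim());
        }
    }

    String getStringValue(int index) {
        return values.get(index);
    }

    int size() {
        return values.size();
    }

    public static void main(String[] args) {
        TupleSketch vt = new TupleSketch("attributeChange,location,weight,1.2");
        System.out.println("method    = " + vt.getStringValue(0)); // attributeChange
        System.out.println("field     = " + vt.getStringValue(1)); // location
        System.out.println("attribute = " + vt.getStringValue(2)); // weight
        System.out.println("value     = " + vt.getStringValue(3)); // 1.2
    }
}
```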

Example 2 (collectionUpdate: Research Library)

<!-- session Test 8b -->

<process id=p1 type=collectionUpdate>
  input=          external;
  inputName=      "c:\Car\Xamples\eNotes1c.txt"; 
  output=         external;
  outputName=     "c:\Car\Xamples\docs8updated.txt"
</process>

<parameter id=param1 type=library>
  class=          CollectionUpdateResearchLibrary;
  method=         identify,LF;
  method=         delete,LF1
</parameter>

<action id=a1> processId=p1;
  parameterId=param1
</action>

The example shows two methods of CollectionUpdateResearchLibrary being applied. Because this is a collectionUpdate, each method is applied in turn to a collection. The first method creates an anonymous internal collection, which is input to the second method, which in turn outputs its collection to the external file.
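The chaining of the two methods can be sketched independently of CAR. In the sketch below, ChainSketch and Note are hypothetical: a document is reduced to a note text plus an id, and the two static methods mimic the effect of identify,LF followed by delete,LF1, with the output of the first used as the input of the second.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: ChainSketch and Note are hypothetical, not CAR classes.
class ChainSketch {

    // Minimal document: a note text plus an optional id attribute.
    static class Note {
        final String text;
        final String id;

        Note(String text, String id) {
            this.text = text;
            this.id = id;
        }
    }

    // Mimics identify: give every note an id of prefix + running number.
    static List<Note> identify(List<Note> in, String prefix) {
        List<Note> out = new ArrayList<>();
        int suffix = 0;
        for (Note n : in) {
            out.add(new Note(n.text, prefix + (suffix++)));
        }
        return out;
    }

    // Mimics delete: drop any note whose id equals deleteId.
    static List<Note> delete(List<Note> in, String deleteId) {
        List<Note> out = new ArrayList<>();
        for (Note n : in) {
            if (!deleteId.equals(n.id)) {
                out.add(n);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Note> input = new ArrayList<>();
        input.add(new Note("first", null));
        input.add(new Note("second", null));
        input.add(new Note("third", null));

        List<Note> intermediate = identify(input, "LF");   // ids LF0, LF1, LF2
        List<Note> output = delete(intermediate, "LF1");   // LF1 removed

        for (Note n : output) {
            System.out.println(n.id + " " + n.text);       // LF0 first, LF2 third
        }
    }
}
```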

Below is an extract from CollectionUpdateResearchLibrary showing the class constructor and the methods used in example 2 above.

  //----------------------------
  public CollectionUpdateResearchLibrary(ValueTuple valueTuple) {    // argument not used at present
  }

  //----------------------------
  public Collection identify(Collection docs, ValueTuple vt) {
    String prefix = vt.getStringValue(1);
    int suffix=0;
    Collection outDocs = new Collection();
    for(Enumeration eDocs = docs.elements();eDocs.hasMoreElements();) { // for each document ...
      Document doc = (Document)((Document)eDocs.nextElement()).clone(); // need to take a copy of the inDoc
      Field field = doc.getField("note");
      if(field!=null) field.setAttribute("id",prefix+(suffix++));
      outDocs.addElement(doc);
    }
    return outDocs;
  }

  //----------------------------
  public Collection delete(Collection docs, ValueTuple vt) {
    String deleteId = vt.getStringValue(1);
    Collection outDocs = new Collection();
    for(Enumeration eDocs = docs.elements();eDocs.hasMoreElements();) { // for each document ...
      Document doc = (Document)((Document)eDocs.nextElement()).clone(); // need to take a copy of the inDoc
      Field field = doc.getField("note");
      String value=null;
      if(field!=null) value = field.getAttributeValue("id");
      if(value!=null && value.equals(deleteId)) continue; // found a match - don't include this document
      outDocs.addElement(doc);
    }
    return outDocs;
  }

CAR passes a single argument, the ValueTuple for the labelled value 'class=' from the session file, to the constructor. The first (and perhaps only) value of the tuple is the name of the class.

The constructor is invoked during session file validation, to ensure the session file's validity and to provide an instance of the class that will later be used to validate the named methods for it.

CAR passes two arguments to each method of the class specified in the session file. The first is a reference to the collection of documents to be processed, the second a ValueTuple for the labelled value 'method='. During session file validation, instances of such methods are generated to check the validity of their specification in the session file and to facilitate their later execution.

Note that (a) each method returns a collection (the output), and (b) when processing the input collection, a clone of each input document is taken before it is modified and passed to the output collection. The reason for the latter is that, if a clone were not used, the internal representation of the original input collection would not be preserved in its original form (and any subsequent actions using it in its internal form would receive the updated version).
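The effect of point (b) can be shown with a minimal sketch that does not use CAR classes. Here a document is modelled, purely as an assumption for illustration, as a map of attribute names to values; CloneSketch and its two methods are hypothetical. Updating in place changes the caller's input collection, whereas updating a copy leaves it untouched.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: documents are modelled as maps, not as CAR Document objects.
class CloneSketch {

    // Modifies each input document directly, so the input collection changes too.
    static List<Map<String, String>> updateInPlace(List<Map<String, String>> in, String id) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> doc : in) {
            doc.put("id", id);
            out.add(doc);
        }
        return out;
    }

    // Copies each input document first, so the input collection is preserved.
    static List<Map<String, String>> updateCopy(List<Map<String, String>> in, String id) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> doc : in) {
            Map<String, String> copy = new HashMap<>(doc);
            copy.put("id", id);
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("note", "hello");
        List<Map<String, String>> input = new ArrayList<>();
        input.add(doc);

        updateCopy(input, "LF0");
        System.out.println("after updateCopy:    " + doc.containsKey("id")); // false

        updateInPlace(input, "LF0");
        System.out.println("after updateInPlace: " + doc.containsKey("id")); // true
    }
}
```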

Unlike document update methods, collection update methods are invoked just once (in this case one invocation for each of the two methods). Thus once a method is invoked it runs at the same speed as if it had been a hard-coded one.

Example 3 (generalUpdate: Research Library)

<!-- session Test Dynamic 8c -->

<process id=p1 type=generalUpdate>
  input=            external,external,external; 
  inputName=        "Xamples\diary.txt","Xamples\context.txt","Xamples\eNotes1c.txt";
  output=           external;
  outputName=       "Xamples\concat.txt";
</process>

<parameter id=param1 type=library>
  class=          GeneralUpdateResearchLibrary;
  method=         concatenate;
</parameter>

<action id=a1>      
  processId=        p1;
  parameterId=      param1;
</action>
    

A general update has an arbitrary number of input and output collections. Because of this generality it is usual for there to be only one parameter and indeed only one method for a process. This example shows the concatenate method taking three input collections and concatenating them into one output collection.

Below is an extract from GeneralUpdateResearchLibrary showing the class constructor and the method used in example 3 above.

  public GeneralUpdateResearchLibrary(ValueTuple valueTuple) {
    t("cstr starting, vt="+valueTuple);
  }

  //----------------------------
  public Collection[] concatenate(Collection[] inCollections, ValueTuple vt) {
    Collection outDocs = new Collection();
    for(int i=0;i<inCollections.length;i++) {
      Collection collection = inCollections[i];
      for(int j=0;j<collection.size();j++) {
        Document doc = (Document)((Document)collection.elementAt(j)).clone();
        if(!doc.hasField("note")) doc.setField("note");
        outDocs.addElement(doc);
      }
    }
    return new Collection[]{outDocs};
  }

CAR passes a single argument, the ValueTuple for the labelled value 'class=' from the session file, to the constructor. The first (and perhaps only) value of the tuple is the name of the class.

The constructor is invoked during session file validation, to ensure the session file's validity and to provide an instance of the class that will later be used to validate the named methods for it.

CAR passes two arguments to each method of the class specified in the session file. The first is a reference to the array of document collections to be processed, the second a ValueTuple for the labelled value 'method='. During session file validation, instances of such methods are generated to check the validity of their specification in the session file and to facilitate their later execution.

Note that the method returns an array of collections (the output). Also note that when processing documents from any of the input collections a clone of each input document is taken and passed to the output collection.

Unlike document update methods, general update methods are invoked just once. Thus once a method is invoked it runs at the same speed as if it had been a hard-coded one.

Example 4 (match: Research Library)

<!-- session Test Dynamic 8d -->

<process id=p1 type=match>
  input=            external,external; 
  inputName=        "Xamples\context4.txt","Xamples\eNotes4.txt";
  output=           external;
  outputName=       "Xamples\match8.txt";
</process>

<parameter id=param1 type=matchSpecA>
  activeTags=       text,location;
  threshold=        1.0;          
  scores=           true;
  currentContext=   ACTIVE;
  document=         ANY;
  direction=        interactive;
</parameter>

<parameter id=param2 type=library>
  class=            MatchResearchLibrary;
  method=           match;
</parameter>

<action id=a1>      
  processId=        p1;
  parameterId=      param1;
  parameterId=      param2;
</action>

A match process using MatchResearchLibrary requires two parameters: a matchSpecA and a library. This is identical to the way match is implemented in CarLibraryD.

A match process has two inputs (a context and the collection to be matched against) and a single output collection.

Below is an extract from MatchResearchLibrary showing the class constructor and the method used in example 4 above.

//----------------------------
public MatchResearchLibrary(ValueTuple valueTuple) {
}

//----------------------------
public Document match(Document target, Document query, Document doc,
      Double dthreshold, Boolean bscores, Object aFields, ValueTuple vt) {

    // get args into more convenient type...
    double threshold = dthreshold.doubleValue();
    boolean scores = bscores.booleanValue();
    String[] activeFields = (String[])aFields;

    Field tField,qField;                                      // target and query fields
    double fScore,noteScore=1.0;                              // field and note score

    // remove all score attributes in output doc ... (the ones for activeFields are put in later)
    doc.removeAttribute("score");

    int numScores=0;
    // match the two Notes
    for(int j=0;j<target.size();j++) {                     // for each target field
        tField = target.getField(j);
        String tName=tField.getFieldName().toLowerCase();       // name of target field
        t("tName="+tName);
        if(tName.equals("note")) continue;                      // ignore note field for scoring purposes
        if(!Utils.member(tName,activeFields)) continue;         // ignore inactive fields in target

        // got an active target field...
        // process each matching queryField 
        // remembering there can be 'duplicates' i.e. same name different attributes
        for (int qX=0, numQFields = query.numberOfFields(); qX < numQFields; qX++) {
            qField=query.getField(qX);
            if(!qField.getFieldName().equals(tName))
              continue;    // not the right query field
            fScore = tField.score(qField);       // score the field
            fScore = (double)Math.round(fScore * 100.0) / 100.0;  // round to nearest 2nd dec place
            noteScore *= fScore;       // accumulate the note score
            if(scores)                  // scores are required in output
                                        // put score in output doc...
              doc.getNamedFieldWithoutAttribute(tName,"score").setAttribute("score",stringScore(fScore));
            numScores++;
            t("numScores="+numScores);
            t("noteScore="+noteScore);
        }    // end for each query field
    } // end for each target field

    // all target fields have been processed, calculate note score, check against threshold...
    // noteScore is the geometric mean of scores
    noteScore = Math.pow(noteScore,1.0/(double)numScores);
    noteScore = (double)Math.round(noteScore * 100.0)/100.0;  // round to nearest 2nd dec place
    if(noteScore < threshold)
      doc=null;                       // not going to output the doc
    else {                      // set the <note> score attribute
        Field field = doc.getField("note");
        if(scores) field.setAttribute("score",stringScore(noteScore));
    }
    return doc; // return doc (which may be null if score less than threshold)
}

CAR passes a single argument, the ValueTuple for the labelled value 'class=' from the session file, to the constructor. The first (and perhaps only) value of the tuple is the name of the class.

The constructor is invoked during session file validation, to ensure the session file's validity and to provide an instance of the class that will later be used to validate the named methods for it.

CAR passes seven arguments to the method of the class specified in the session file. The first is a reference to the target document, then a reference to the query document, then a template of the output document, followed by the threshold, whether to output scores, and the active fields. The final argument is a ValueTuple for the labelled value 'method='. During session file validation, instances of such methods are generated to check the validity of their specification in the session file and to facilitate their later execution.

Note that the method returns a document (the output), which is set to null if it is not to be output.

Like document update methods, match methods are invoked once for each relevant document. Although they will be faster than dynamic methods, they will be significantly slower than hard-coded ones.
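The note score computed in the match method above is the geometric mean of the field scores, with each field score and the final result rounded to two decimal places. The arithmetic can be isolated in a small sketch; NoteScoreSketch and its method names are hypothetical, and the field scores are supplied directly rather than obtained from Field.score().

```java
// Sketch only: isolates the rounding and geometric-mean arithmetic of the match extract.
class NoteScoreSketch {

    // Round to the nearest second decimal place, as in the extract.
    static double round2(double x) {
        return Math.round(x * 100.0) / 100.0;
    }

    // Geometric mean of the rounded field scores, itself rounded to two places.
    static double noteScore(double[] fieldScores) {
        double product = 1.0;
        for (double score : fieldScores) {
            product *= round2(score);
        }
        return round2(Math.pow(product, 1.0 / fieldScores.length));
    }

    public static void main(String[] args) {
        double[] fieldScores = { 0.5, 0.8 };
        double score = noteScore(fieldScores);   // sqrt(0.5 * 0.8), rounded
        System.out.println("note score = " + score);                 // 0.63
        System.out.println("passes threshold 1.0? " + (score >= 1.0)); // false
    }
}
```

With field scores of 0.5 and 0.8 the note score is 0.63, so with the threshold of 1.0 used in Example 4 such a document would not be output. In this sketch an empty array of field scores yields 0.0, because Math.pow(1.0, Infinity) is NaN and Math.round(NaN) is 0.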

Dynamic interface

Dynamic interfaces allow the researcher to provide code in the session file for each of the four processes documentUpdate, collectionUpdate, generalUpdate, and match. The procedure is described in the CAR Dynamic Interface document.

Version control

Parameters, libraries, and dynamic methods provide the researcher with the ability to influence processing at predefined points. Sometimes, however, the researcher may wish to influence matters more directly.

This can be achieved by, in effect, superseding an integrated class with one of the researcher's own. Any class that makes up the CAR software can be superseded in this way. This is done by suitably amending the classpath of the run command line. It must be borne in mind that the replacement will influence all processes that use the class (parameterised hard-code, system and research libraries, dynamic programs) in the session run with the amended classpath.

Suppose, for example, the researcher wishes to insert their own scoring mechanism into the matching process. The detailed part of scoring is undertaken in the class Value, so the researcher: (1) takes a copy of the Java source of the Value class; (2) suitably amends the score() method and compiles the source; (3) places the path of the folder containing the resulting class file first in the classpath of the session run; (4) runs the session as normal. The Java executor uses the version of a class that appears first in the classpath (later duplicates are ignored).

It is important that the signatures of all methods in the superseded class are retained; otherwise exceptions may arise in processing.

Below is a Makefile used for initiating a CAR session.

######################## CAR Makefile1 ####################
#                                                         #
#  this Makefile:                                         #
#        - reads and runs a session file from stdIn       #
#        - it uses a JAR file of all CAR classes          #
#        - and optionally research libraries              #
#        - and a JAR file of Bali (for dynamic runs)      #
#        - the classpath is set at run time               #
#                                                         #
#  author: Lindsey Ford                                   #
#                                                         #
###########################################################

PROJDIR       =   c:\Car

JAR           =   car.jar

JDKDIR        =   c:\jdk1.3
JAVA          =   $(JDKDIR)\bin\java
RUNCLASSPATH=$(PROJDIR)\System\$(JAR);$(PROJDIR)\System\bali.jar;$(JDKDIR)\jre\lib\rt.jar

go: 
  cd $(PROJDIR)
  $(JAVA) -classpath $(PROJDIR)\ResearchLibraries;$(RUNCLASSPATH) MainCar < Xamples\sessionTest4.txt

In it we have placed the ResearchLibraries folder first in the classpath. Beforehand we took a copy of the Value.java file and amended the way string fields are scored: instead of scoring 1.0 for a partially matched string value, we express the score as the fraction of the query value found in the target value. We then compiled the new Value class into the ResearchLibraries folder.

When we run the session (which contains a match process) the new version of the score() method is executed instead of the original.
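One way to confirm which version of a class the Java executor has actually picked up is to ask the loaded class where it came from. The following is a minimal sketch using the standard ProtectionDomain and CodeSource APIs; ClassOriginSketch is a hypothetical helper, not part of CAR, and it assumes a current JDK. Classes loaded by the bootstrap loader (such as java.lang.String) report no code source.

```java
import java.security.CodeSource;

// Sketch only: a hypothetical helper for checking classpath precedence.
class ClassOriginSketch {

    // Returns the location (folder or JAR) a class was loaded from, if known.
    static String origin(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // The class name is taken from the command line, e.g. Value.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " loaded from " + origin(Class.forName(name)));
    }
}
```

Run with the same classpath as the session and with Value as its argument, this would be expected to report the ResearchLibraries folder rather than car.jar when the superseding class is being used.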

Sunday 25 November 2001