CAR Dynamic Interface

Introduction

documentUpdate

Example 1 (set an attribute of a particular field)

Example 2 (apply a formula to a value)

collectionUpdate

Example 3 (puts an attribute in the note field of each document)

Example 4 (applies a document filter)

Example 5 (produces a summary document, and displays it)

Example 6 (extracts particular documents and particular fields)

generalUpdate

Example 7 (a pre-processor to establish field weights)

match

Example 8 (default matcher)

Introduction

The session file parameter 'dynamic' allows a researcher to provide at run time a small program to be applied to a documentUpdate, collectionUpdate, or generalUpdate process. The program is written in Bali and is the string value of the label 'program' in the parameter field. Note that the program text is placed between <markup> and </markup> tags to enable the program to contain such characters as '"' and '<' without them interfering with the syntax of the session file.

During the parsing of the session file each Bali program (there may be an arbitrary number) is compiled and held in internal form. When CAR actions a process the associated Bali program is executed. It can access the process's data through an interface (which varies according to the type of process).

The interface is defined in the class Dynamic. It is defined below and exemplified for the 4 relevant process types:

1. documentUpdate

2. collectionUpdate

3. generalUpdate

4. match.

A researcher wishing to use the Dynamic Interface will need (a) an understanding of Java, and (b) familiarity with the CAR classes that relate to the structure of CAR data (Collection, Document, Field, and so on). Details of these latter classes can be found in the CAR class documentation.

documentUpdate

The class Dynamic contains a static method Dynamic.getDocument() which delivers to the program a clone (copy) of the current document of the input collection. The program updates the document as necessary.

Example 1 (set an attribute of a particular field)

input= external;

inputName= "c:\Car\Xamples\docs9.txt";

output= external;

outputName= "c:\Car\Xamples\docs9Updated.txt"

</process>

program=

// this program, because it is a parameter to a documentUpdate process,

// is executed for each document;

// it gets a document, updates it as necessary, and terminates;

// CAR handles everything else

class SetWeights {

static void main(String[] args) {

Document doc = Dynamic.getDocument(); // get the current document to be processed

// loop that processes each field ...

// sets a weight attribute value, overwriting or inserting the attribute...

for(int numFields=doc.numberOfFields(), i=0; i<numFields; i++) {

Field field = doc.getField(i);

if(field.getFieldName().equals("location")) field.setAttribute("weight","1.5");

} // end for each field

} // end main()

} // end class definition

</markup>

</parameter>

processId= p1;

parameterId= param1;

</action>

The program is a complete Bali/Java program in the sense that it defines a class (called 'SetWeights' although the name can be any valid identifier). Crucially the program contains a static main() method. It could contain other methods, and indeed a program could contain other classes, but they are not necessary in this simple example.

CAR causes the main() method to be executed for each document.

Note how the first line of the body of main() gets the current document using the Dynamic interface. The variable 'doc' is, in Java terms, a reference to the object provided by Dynamic.getDocument(). When the main() method completes its execution the Dynamic class still holds a reference to the object that has been processed, and CAR places that object in the output collection. You must therefore ensure that the reference never becomes null. Document deletions can be accomplished in collectionUpdate or generalUpdate processes.
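
The per-document arrangement described above can be pictured with a small stand-alone sketch. It does not use CAR at all: SketchDoc, SketchDynamic, SetWeightsSketch, and HostLoopSketch are hypothetical stand-ins invented for illustration, and the real Dynamic class may work quite differently internally. The sketch only shows the shape of the contract: the host hands the program a copy of each document, runs main(), and then places the object it still holds in the output.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a CAR document: field name -> weight attribute (null if unset).
class SketchDoc {
    final Map<String, String> weightByField = new LinkedHashMap<>();

    SketchDoc copy() {
        SketchDoc c = new SketchDoc();
        c.weightByField.putAll(weightByField);
        return c;
    }
}

// Hypothetical stand-in for the Dynamic interface of a documentUpdate process.
class SketchDynamic {
    static SketchDoc current;                      // the copy handed to the program
    static SketchDoc getDocument() { return current; }
}

// Plays the role of the researcher's program: it updates the document it is given.
class SetWeightsSketch {
    static void main(String[] args) {
        SketchDoc doc = SketchDynamic.getDocument();
        if (doc.weightByField.containsKey("location")) {
            doc.weightByField.put("location", "1.5");
        }
    }
}

// Plays the role of the host: runs the program once per input document.
class HostLoopSketch {
    static List<SketchDoc> run(List<SketchDoc> input) {
        List<SketchDoc> output = new ArrayList<>();
        for (SketchDoc original : input) {
            SketchDynamic.current = original.copy();   // the program sees a copy
            SetWeightsSketch.main(new String[0]);      // executed for each document
            output.add(SketchDynamic.current);         // the host collects the result
        }
        return output;
    }
}
```

Because the host keeps its own reference, the program has nothing to return; it only needs to leave the object in the state it wants written out.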

After the program gets a document it processes each field and sets an attribute. The logic is straightforward, but the code serves to show that a knowledge of CAR's classes is essential, i.e. the researcher needs to know the data structures (as explained in the Glossary of the User Manual) and the classes that relate to them.

Example 2 (apply a formula to a value)

input= external;

inputName= "c:\Car\Xamples\docs9.txt";

output= external;

outputName= "c:\Car\Xamples\docs9formula.txt"

</process>

program=

// this program, for a documentUpdate, is executed for each document;

// it converts any fahrenheit fields to centigrade as necessary

// CAR handles everything else

class Formula {

static void main(String[] args) {

Document doc = Dynamic.getDocument(); // get the current document to be processed

// loop that processes each field ...

for(int numFields=doc.numberOfFields(), i=0; i<numFields; i++) {

Field field = doc.getField(i);

if(!field.getFieldName().toLowerCase().equals("temperature")) continue; // ignore

if(!field.hasAttribute("scale")) continue; // ignore

if(!field.getAttributeValue("scale").toLowerCase().equals("fahrenheit")) continue; // ignore

if(!field.hasValue()) continue; // ignore

// we have an appropriate field...

double value = field.getNumericValue(); // get the fahrenheit value

value = (value - 32) * 5 / 9; // convert it to centigrade

field.setNumericValue(value);

field.setAttribute("scale","centigrade");

} // end for each field

} // end main()

} // end class definition

</markup>

</parameter>

processId= p1;

parameterId= param1;

</action>

This program has the same basic structure as the previous one, but exemplifies different features of CAR classes. It looks for a temperature field that has the attribute scale="Fahrenheit", converts the value to centigrade and changes the attribute to scale="centigrade".
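
The arithmetic can be checked in isolation. The sketch below is plain Java with no CAR classes; the class and method names are invented for illustration. It also suggests why holding the value in a double matters: with int operands the same expression uses integer division and discards the fractional part.

```java
// Stand-alone check of the conversion used in the example (hypothetical helper names).
class TemperatureSketch {
    // Same expression as in the example: subtract 32, then scale by 5/9.
    static double toCentigrade(double fahrenheit) {
        return (fahrenheit - 32) * 5 / 9;
    }

    // The same formula on int operands: the division truncates toward zero.
    static int toCentigradeTruncated(int fahrenheit) {
        return (fahrenheit - 32) * 5 / 9;
    }
}
```

For example, toCentigrade(212) gives 100.0 and toCentigrade(32) gives 0.0, whereas toCentigradeTruncated(100) gives 37 rather than roughly 37.78.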

Note, however, that it uses the String method toLowerCase() in order to make string comparisons (example 1 omitted this piece of logic for 'location' and would not have set the weight for a location field name containing any capital letters). This is necessary because field names (tags) are not case sensitive but Java is.
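
The difference between the two styles of comparison can be seen in plain Java. The class and method names below are invented for illustration; equalsIgnoreCase() is a standard method of java.lang.String, though whether a given Bali implementation supports it is an assumption that should be checked.

```java
// Plain-Java illustration of case-sensitive versus case-insensitive name tests.
class NameCompareSketch {
    // Case-sensitive: "Location" does not equal "location".
    static boolean exactMatch(String fieldName, String wanted) {
        return fieldName.equals(wanted);
    }

    // Normalise first, as in the example; 'wanted' is assumed to be lower case already.
    static boolean lowerCaseMatch(String fieldName, String wanted) {
        return fieldName.toLowerCase().equals(wanted);
    }

    // Standard-library alternative that ignores case on both sides.
    static boolean ignoreCaseMatch(String fieldName, String wanted) {
        return fieldName.equalsIgnoreCase(wanted);
    }
}
```

With the field name "Location", exactMatch(..., "location") is false while the other two return true.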

The line containing if(!field.hasAttribute("scale")) continue; is important: without it the next line would produce an execution error for any field that lacks a 'scale' attribute. The method field.getAttributeValue("scale") returns null in that case (that is how the class Field defines its return value for an attribute that doesn't exist in the field), and because the rest of the conditional expression calls toLowerCase() on the returned reference, a run-time exception is thrown (nothing new here; it would happen in any Java program with similar logic).
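
The same failure can be reproduced in ordinary Java. In the sketch below a java.util.Map stands in for a field's attributes, because Map.get() also returns null for a missing key; the class and method names are hypothetical and nothing here is a CAR class.

```java
import java.util.Map;

// A Map stands in for a field's attributes: get() returns null when the key is absent.
class NullAttributeSketch {
    // No guard: calling toLowerCase() on null throws NullPointerException.
    static boolean isFahrenheitUnsafe(Map<String, String> attributes) {
        return attributes.get("scale").toLowerCase().equals("fahrenheit");
    }

    // Guarded: the containsKey() test plays the role of hasAttribute("scale").
    static boolean isFahrenheitSafe(Map<String, String> attributes) {
        if (!attributes.containsKey("scale")) return false;
        return attributes.get("scale").toLowerCase().equals("fahrenheit");
    }
}
```

Called with an empty map, the guarded version returns false, while the unguarded version throws a NullPointerException.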

The line containing field.getNumericValue() is acceptable provided we are sure we are dealing with a non-range, 1-tuple value. The logic is a little more complex if we have to deal with these cases: it involves getting the ValueTuple and processing each element appropriately (reference to the documentation of the ValueTuple and Value classes would be necessary).

collectionUpdate

In a collectionUpdate process there is one input and one output collection.

The class Dynamic contains:

· a static method Dynamic.getCollection() which delivers to the program a reference to the input collection

· a static method Dynamic.putCollection(collection) which passes a reference to the output collection to Dynamic.

CAR looks after the output collection after Dynamic has received it.

Example 3 (puts an attribute in the note field of each document)

input= external;

inputName= "c:\Car\Xamples\docs9.txt";

output= external;

outputName= "c:\Car\Xamples\docs9Updated.txt"

</process>

program=

// this program puts an attribute "id=PJBn" into each note field,

// incrementing 'n' by 1 in the value "PJBn" and n starting from a given suffix

class Identify {

static void main(String[] args) {

String prefix = "PJB"; // prefix

int suffix=7; // initial suffix

Collection inDocs = Dynamic.getCollection(); // get the current collection to be processed

Collection outDocs = new Collection();

int numDocs = inDocs.numberOfDocuments();

Document doc;

for (int i=0; i<numDocs; i++) {

doc = inDocs.getDocumentCopy(i); // get the next input document

Field field = doc.getField("note");

if(field!=null) field.setAttribute("id",prefix+(suffix++));

outDocs.addDocument(doc); // add the document to the output collection

}

Dynamic.putCollection(outDocs); // pass the new collection to Dynamic

} // end main()

} // end class definition

</markup>

</parameter>

processId= p1;

parameterId= param1;

</action>

This example program 'identifies' each note in a collection by creating a uniquely valued 'id' attribute for each <note> field. In the main() method a reference to the input document collection is obtained (inDocs), and a new, empty output document collection is created (outDocs). The idea is to process each input document, add the result to outDocs, and then pass outDocs to the Dynamic interface at the end of all processing.

The simple loop ensures that each document of the input collection is processed. The first line in the loop's body containing doc = inDocs.getDocumentCopy(i); is important. The method getDocumentCopy(i) takes a clone (copy) of the ith input document. If a clone was not taken, and instead a reference was taken with the method getDocument(i), then the original document (as held in memory) would be updated. This will cause problems in situations where the input is designated 'internal' and is used by other processes (since the internal collection will have changed).
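
The distinction between working on a copy and working through a reference is ordinary Java behaviour and can be shown without CAR. NoteSketch and CopyVersusReferenceSketch below are hypothetical classes invented for illustration; a java.util.List plays the part of a collection held in memory.

```java
import java.util.List;

// Hypothetical stand-in for a document held in an in-memory collection.
class NoteSketch {
    String id;
    NoteSketch(String id) { this.id = id; }
    NoteSketch copy() { return new NoteSketch(id); }
}

class CopyVersusReferenceSketch {
    // Updates a copy: the stored document is left alone.
    static NoteSketch updateCopy(List<NoteSketch> stored, int i, String newId) {
        NoteSketch doc = stored.get(i).copy();
        doc.id = newId;
        return doc;
    }

    // Updates through the reference: the stored document changes too.
    static NoteSketch updateInPlace(List<NoteSketch> stored, int i, String newId) {
        NoteSketch doc = stored.get(i);
        doc.id = newId;
        return doc;
    }
}
```

After updateCopy() the stored object keeps its old id; after updateInPlace() any other code that reads the stored list sees the new id.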

In the next line of the loop's body notice that the method doc.getField("note") can return a null value if the document does not contain a <note> field (it's always wise to check the class documentation to see what range of values can be returned, and then to handle all eventualities in your program).

Two lines later the document is added to the outDocs collection. When the loop terminates outDocs is passed to the Dynamic interface.
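
The id values themselves come from the expression prefix+(suffix++), which uses the current suffix and then increments it. A stand-alone sketch of just that numbering follows; the class and method names are invented for illustration. Unlike the example, it generates one id per call rather than one per document that actually has a <note> field.

```java
import java.util.ArrayList;
import java.util.List;

class IdSequenceSketch {
    // Builds 'count' ids from a prefix and a starting suffix.
    static List<String> ids(String prefix, int firstSuffix, int count) {
        List<String> result = new ArrayList<>();
        int suffix = firstSuffix;
        for (int i = 0; i < count; i++) {
            result.add(prefix + (suffix++));   // post-increment: use the value, then add 1
        }
        return result;
    }
}
```

With the example's settings, ids("PJB", 7, 3) yields PJB7, PJB8, and PJB9.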

Example 4 (applies a document filter)

input= external;

inputName= "c:\Car\Xamples\docs9.txt";

output= external;

outputName= "c:\Car\Xamples\docs9good.txt"

</process>

program=

// this program deletes documents that lack a <temperature> field or a <location> field.

class Filter {

static void main(String[] args) {

Collection inDocs = Dynamic.getCollection(); // get the current collection to be processed

Collection outDocs = new Collection();

int numDocs = inDocs.numberOfDocuments();

Document doc;

for (int i=0; i<numDocs; i++) {

doc = inDocs.getDocumentCopy(i); // get the next input document

if(doc.getField("temperature")==null || doc.getField("location")==null) continue; // ignore (delete)

outDocs.addDocument(doc); // add the document to the output collection

}

Dynamic.putCollection(outDocs); // pass the new collection to Dynamic

} // end main()

} // end class definition

</markup>

</parameter>

processId= p1;

parameterId= param1;

</action>

This example shows how to delete documents from a collection. Within the loop is a conditional which, if true, moves on to the next document without adding the current one to the output collection. In this program documents that lack either a <temperature> or a <location> field are deleted, i.e. are not passed to the output collection.
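
The filtering pattern (skip with continue, otherwise add to the output) can be tried without CAR. In this sketch each document is modelled, purely as an assumption for illustration, by the set of its lower-case field names; FilterSketch is an invented name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class FilterSketch {
    // Keeps only documents that have both a temperature and a location field.
    static List<Set<String>> keepComplete(List<Set<String>> docs) {
        List<Set<String>> kept = new ArrayList<>();
        for (Set<String> fields : docs) {
            // Same test as the example: missing either field means the document is dropped.
            if (!fields.contains("temperature") || !fields.contains("location")) continue;
            kept.add(fields);
        }
        return kept;
    }
}
```

Nothing is removed from the input list; "deletion" simply means the document never reaches the output.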

Example 5 (produces a summary document, and displays it)

input= external;

inputName= "c:\Car\Xamples\docs9.txt";

output= external;

outputName= "c:\Car\Xamples\docs9summary.txt"

</process>

program=

// this program creates a summary of the input documents in the output collection

// the output collection will thus end up as a single-document collection

// finally the output collection is displayed on standard output

class Summary {

static void main(String[] args) {

Collection inDocs = Dynamic.getCollection(); // get the current collection to be processed

int numDocs = inDocs.numberOfDocuments();

int numTemperatureFields=0,numLocationFields=0;

Document doc;

for (int i=0; i<numDocs; i++) {

doc = inDocs.getDocumentCopy(i); // get the next input document

if(doc.getField("temperature")!=null) numTemperatureFields++;

if(doc.getField("location")!=null) numLocationFields++;

}

Document summaryDoc=new Document();

summaryDoc.addComment("summary of docs9.txt");

summaryDoc.addField(new Field("number of documents",numDocs));

summaryDoc.addField(new Field("number of temperature fields",numTemperatureFields));

summaryDoc.addField(new Field("number of location fields",numLocationFields));

Collection outDocs = new Collection();

outDocs.addDocument(summaryDoc); // add the summary document to the output collection

Dynamic.putCollection(outDocs); // pass the new collection to Dynamic

System.out.println("Summary:\n" + outDocs);

} // end main()

} // end class definition

</markup>

</parameter>

processId= p1;

parameterId= param1;

</action>

This program shows how to create a new document. It creates just one document for the output collection, a summary of the input collection. At the end of processing it displays the summary document. The logic is straightforward. Note that in the method call System.out.println("Summary:\n" + outDocs); the argument is a String and that outDocs, the collection, is automatically converted to an appropriate String. To do this Java looks for a method toString() in Collection, and if it exists invokes it for outDocs. You will find that the toString() method has been provided in all CAR classes that form part of the structure of a Collection e.g. Document, Field.
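
The implicit call to toString() is standard Java and can be seen with any class. SummarySketch below is a hypothetical class, and the text it produces is an invented format, not CAR's output format.

```java
// A class with its own toString(), standing in for Collection, Document, or Field.
class SummarySketch {
    final int numDocs;
    SummarySketch(int numDocs) { this.numDocs = numDocs; }

    @Override
    public String toString() {
        return "<number of documents> " + numDocs;
    }
}

class ToStringDemo {
    static String describe(SummarySketch s) {
        return "Summary:\n" + s;   // s.toString() is invoked implicitly by the + operator
    }
}
```

If a class does not override toString(), the concatenation still works but uses the default inherited from Object, which is rarely useful for display.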

This is the first example that shows a Bali program using part of Java's API, namely System.out.println(). In principle any classes of Java's API can be used.

Example 6 (extracts particular documents and particular fields)

input= external;

inputName= "c:\Car\Xamples\docs9.txt";

output= external;

outputName= "c:\Car\Xamples\docs9extraction.txt"

</process>

program=

// this program extracts from the input documents only those documents

// that have a <temperature> and <location> field ...

// and removes other fields from them ...

// finally the output collection is displayed on standard output

class Extraction {

static void main(String[] args) {

Collection inDocs = Dynamic.getCollection(); // get the current collection to be processed

Collection outDocs = new Collection();

int numDocs = inDocs.numberOfDocuments();

Document doc;

for (int i=0; i<numDocs; i++) {

doc = inDocs.getDocumentCopy(i); // get the next input document

if(doc.getField("temperature")==null || doc.getField("location")==null) continue; // ignore

// need to create a new document and copy across required fields for it

Document newDoc = new Document();

// loop that processes each input field...

for(int numFields=doc.numberOfFields(), j=0; j<numFields; j++) {

Field field = doc.getField(j);

if(field.getFieldName().toLowerCase().equals("note") ||

field.getFieldName().toLowerCase().equals("temperature") ||

field.getFieldName().toLowerCase().equals("location")

)

newDoc.addField(field);

}

outDocs.addDocument(newDoc); // add the new document to the output collection

}

Dynamic.putCollection(outDocs); // pass the new collection to Dynamic

System.out.println("New collection:\n" + outDocs);

} // end main()

} // end class definition

</markup>

</parameter>

processId= p1;

parameterId= param1;

</action>

This program passes to the output collection only those documents that have both temperature and location fields, and within those documents only note, temperature, and location fields are passed. The condition doc.getField("temperature")==null || doc.getField("location")==null detects the documents to be skipped by testing whether the getField() method returns null (i.e. the field is not present) for either name.

Once a relevant document is found a new output document is created, which initially has no fields. There then follows an inner loop to process each field of the input document. Relevant fields of it are added to the new output document which is in turn added to the output collection.
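
The inner loop's copy-the-wanted-fields step can be sketched in plain Java. Here a document is modelled, as an assumption for illustration only, by an ordered map from field name to value; a real CAR document can hold several fields with the same name, which a map cannot. ExtractionSketch is an invented name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class ExtractionSketch {
    // Returns a new document containing only the note, temperature, and location fields.
    static Map<String, String> extract(Map<String, String> doc) {
        Map<String, String> newDoc = new LinkedHashMap<>();   // starts with no fields
        for (Map.Entry<String, String> field : doc.entrySet()) {
            String name = field.getKey().toLowerCase();       // names are compared in lower case
            if (name.equals("note") || name.equals("temperature") || name.equals("location")) {
                newDoc.put(field.getKey(), field.getValue()); // copy the wanted field across
            }
        }
        return newDoc;
    }
}
```

Lower-casing the name once per field, rather than once per comparison as in the example, is only a tidiness choice; the result is the same.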

generalUpdate

In a generalUpdate process there is an arbitrary number of inputs and an arbitrary number of outputs (and not necessarily the same number of inputs as outputs). Each input is a collection and each output is a collection.

The class Dynamic contains:

· a static method Dynamic.getCollections() which delivers to the program a reference to an array of input collections

· a static method Dynamic.putCollections(collections) which passes a reference to an array of output collections to Dynamic.

CAR looks after the array of output collections after Dynamic has received it.

The paper Needs of the Matcher-library (Brown) describes various scenarios in which several inputs might be needed for an update. For example, updating a context diary would require a diary and a current context as inputs, with a new diary being the single output. A pre-processor to massage the current context on the basis of history would require a diary and context as inputs, with a new context being the single output. A pre-processor to set field weights would require similar inputs and would create a single document of weights (which could then be used in a dynamic matching process, an example that we show below). And there are numerous others that fall within the generalUpdate category.

Example 7 (a pre-processor to establish field weights)

input= external,external;

inputName= "Xamples\diary.txt","Xamples\context.txt";

output= external;

outputName= "Xamples\weights.txt";

</process>

program=

// this program, because it is a parameter to a generalUpdate process, is executed just once;

// it gets an array of input collections, processes them, outputs collections as necessary, and terminates;

// CAR handles everything else.

// Specifically it works through a diary determining which fields in the current context

// have changed value most often, and uses this to create a weights document.

class CreateFieldWeights {

static void main(String[] args) {

Collection[] inCollections = Dynamic.getCollections(); // get the collections to be processed

Collection[] outCollections = new Collection[1];

// check that we have just two inputs...

if(inCollections.length!=2) Error.error("In the program 'CreateFieldWeights' there should be 2 inputs");

Collection diary = inCollections[0];

if(inCollections[1].numberOfDocuments()<1) Error.error("In the program 'CreateFieldWeights' the second input should contain a context (but it contains no documents)");

if(inCollections[1].numberOfDocuments()>1) Error.warn("In the program 'CreateFieldWeights' the second input should contain a single context (but it contains more than one) ... only the first will be used");

Document context = inCollections[1].getDocument(0); // get the context document

/* stage 1 - create a weights document based on the current context

- set each value (weighting factor) of each field to zero...

the factor will be incremented for each occurrence of the field in the diary...

provided its value has changed from the last occurrence

stage 2 - create a lastValues document based on the current context...

that will be used to record a field's last value

stage 3 - for each document in the diary...

for each field in the document...

if its name is in the weights document and...

if its value is different to the equivalent value in the lastValues document...

increment the value in the appropriate field of the weights document

stage 4 - using the weights document ...

sum all the weights and for each field in the weights document...

express its value as a fraction of the sum

stage 5 - output the weights document.

*/

////// stage 1 //////

Document weights = context.copy(); // make a copy of the context document

for(int numFields=weights.numberOfFields(), i=0; i<numFields; i++) {

Field field = weights.getField(i);

field.setNumericValue(0.0);

}

////// stage 2 //////

Document lastValues = context.copy(); // make a copy of the context document

for(int numFields=lastValues.numberOfFields(), i=0; i<numFields; i++) {

Field field = lastValues.getField(i);

field.removeValue();

}

////// stage 3 //////

int numDiaryDocs = diary.numberOfDocuments();

Document diaryDoc;

for (int i=0; i<numDiaryDocs; i++) {

diaryDoc = diary.getDocumentCopy(i); // get the next diary document

for (int j=0, numFields=diaryDoc.numberOfFields(); j < numFields; j++) {

Field dField = diaryDoc.getField(j);

String fName = dField.getFieldName().toLowerCase();

if(fName.equals("note")) continue; // ignore

Field wField;

if((wField=weights.getField(fName)) == null) continue; // ignore

Field lvField=lastValues.getField(fName);

if(!dField.sameValue(lvField)) { // change of value

double val = wField.getNumericValue();

wField.setNumericValue(++val);

lvField.setValueTuple(dField.getValueTuple());

} // end change of value

} // end for each diary field

} // end for each diary document

////// stage 4 //////

double sumWeights=0;

for(int numFields=weights.numberOfFields(), i=0; i<numFields; i++) {

Field field = weights.getField(i);

sumWeights+=field.getNumericValue();

}

for(int numFields=weights.numberOfFields(), i=0; i<numFields; i++) {

Field field = weights.getField(i);

field.setNumericValue(field.getNumericValue() / sumWeights);

}

Collection outDocs = new Collection();

outDocs.addDocument(weights);

outCollections[0]=outDocs;

Dynamic.putCollections(outCollections);

} // end main()

} // end class definition

</markup>

</parameter>

processId= p1;

parameterId= prog1;

</action>

This example is largely self-explanatory. Note how the CAR class Error is used to report any incidents that occur during execution of the program. The method Error.error() causes an error message to be output on standard output and the session is halted. The method Error.warn(), on the other hand, issues a warning and continues processing.
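
The counting and normalising stages can be followed more easily in a version stripped of CAR classes. The sketch below is an illustration under stated assumptions, not the example's actual code: documents are modelled as maps from field name to value, the class and method names are invented, and the guard against a zero sum is an addition that the example does not contain.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FieldWeightsSketch {
    // diary: documents in time order, each a map of field name -> value.
    // contextFields: the field names of the current context.
    static Map<String, Double> weights(List<Map<String, String>> diary, List<String> contextFields) {
        Map<String, Double> weights = new LinkedHashMap<>();
        Map<String, String> lastValues = new LinkedHashMap<>();   // stage 2: nothing recorded yet
        for (String name : contextFields) weights.put(name, 0.0); // stage 1: every weight is zero

        for (Map<String, String> doc : diary) {                   // stage 3: count changes of value
            for (Map.Entry<String, String> field : doc.entrySet()) {
                String name = field.getKey();
                if (!weights.containsKey(name)) continue;         // not a context field
                if (!field.getValue().equals(lastValues.get(name))) {
                    weights.put(name, weights.get(name) + 1.0);
                    lastValues.put(name, field.getValue());
                }
            }
        }

        double sum = 0.0;                                         // stage 4: express as fractions
        for (double w : weights.values()) sum += w;
        if (sum > 0.0) {                                          // added guard: avoid dividing by zero
            for (Map.Entry<String, Double> e : weights.entrySet()) e.setValue(e.getValue() / sum);
        }
        return weights;
    }
}
```

As in the example, the first occurrence of a field counts as a change because no last value has been recorded. With a diary in which location takes three different values while temperature stays constant, the result is 0.75 for location and 0.25 for temperature.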

match

In a match process there are two input collections and a single output collection. Matching is the process whereby a comparison is made between one or more contexts and a document collection, and those comparisons that are deemed important form the basis of output.

For a discussion of matching see STICK-E NOTES: the Context Matcher User Manual (Brown); see also Active fields and the rules for document matching, which provides the rules for matching. A variety of other papers discuss particular sub-topics of matching in greater detail; these are contained in Peter Brown's collection of discussion and specification papers. Implementation details of matching can be found in the javadoc for Matcher.

There are 2 parameter types for match:

· matchSpecA

· dynamic

matchSpecA is compulsory – it specifies the tags to be matched, the type of match (interactive or proactive), a scoring threshold, and what is to be output.

The class Dynamic contains for match:

· a static method Dynamic.getTargetDocument() which gets a reference to the current target document

· a static method Dynamic.getQueryDocument() which gets a reference to the current query document

· a static method Dynamic.getOutputDocument() which gets a reference to the current document from the collection

· a static method Dynamic.getThreshold() which gives the threshold score as a double

· a static method Dynamic.getScores() which gives a boolean indicating whether score attributes are to be output

· a static method Dynamic.getActiveFields() which gives a reference to a String[] which contains lower case names of active fields

· a static method Dynamic.deleteOutputDocument() which allows the program to indicate that the current output document is not to form part of the output (if it scores below the threshold for example).

CAR passes control to main() for each matching document from the collection, the one provided by Dynamic.getOutputDocument(). Other information is passed as indicated above. The main() method updates this document, using the query and target documents, and when it terminates CAR adds the document to the output collection (CAR also deals with the output of the context at the head of the output collection).

Example 8 (default matcher)

<!-- this session reads an external context and an external eNote document collection,

does a proactive match on them and puts the output in an external doc collection

-->

input= external,external;

inputName= "Xamples\context4.txt","Xamples\eNotes4.txt";

output= external;

outputName= "Xamples\match4.txt";

</process>

activeTags= text,location;

threshold= 1.0;

scores= true;

currentContext= ACTIVE;

document= ANY;

direction= proactive;

</parameter>

program=

// this program, because it is a parameter to a match process, is executed for each match;

// it gets a target document, a query document, an output document, updates it as necessary, and terminates;

// CAR handles everything else.

class MatchScorerA {

static boolean TRACE=false; // set to false if no trace messages required

//---------------------------------

static void t(String s) {if(TRACE) System.out.println("MatchScorerA- "+s);} // simple debugger

//-------------------------------

private static String stringScore(double d) {

// convert score to string, truncating scores ending .0

if(d==0.0) return "0";

String s = ""+d;

int i=s.indexOf(".");

if( (i + 3) < s.length() ) s = s.substring(0,i+3);

while(s.endsWith("0")) s = s.substring(0,s.length()-1);

if(s.endsWith(".")) s = s.substring(0,s.length()-1);

return s;

}

//---------------------------------

static void main(String[] args) {

Document target = Dynamic.getTargetDocument(); // get the current target document

Document query = Dynamic.getQueryDocument(); // get the current query document

Document doc = Dynamic.getOutputDocument(); // get the current document from the collection

double threshold = Dynamic.getThreshold(); // get input 'threshold' (default 0.0)

boolean scores = Dynamic.getScores(); // get input 'scores' (default true)

String[] activeFields = Dynamic.getActiveFields(); // get activeField names

t("\n\n\n\n\n================ new document ====================================");

t("target="+target);

t("query="+query);

t("outDoc="+doc);

t("scores="+scores);

Field tField,qField; // target and query fields

double fScore,noteScore=1.0; // field and note score

// remove all score attributes in output doc ... (the ones for activeFields are put in later)

doc.removeAttribute("score");

int numScores=0;

// match the two Notes

for(int j=0;j<target.numberOfFields();j++) { // for each target field

tField = target.getField(j);

String tName=tField.getFieldName().toLowerCase(); // name of target field

t("tName="+tName);

if(tName.equals("note")) continue; // ignore note field for scoring purposes

if(!Utils.member(tName,activeFields)) continue; // ignore inactive fields in target

// got an active target field...

// process each matching queryField

// remembering there can be 'duplicates' i.e. same name different attributes

for (int qX=0, numQFields = query.numberOfFields(); qX < numQFields; qX++) {

qField=query.getField(qX); // get the matching query field

if(!qField.getFieldName().toLowerCase().equals(tName)) continue; // not the right query field (tName is lower case)

fScore = tField.score(qField); // score the field

fScore = (double)Math.round(fScore * 100.0) / 100.0; // round to nearest 2nd dec place

noteScore *= fScore; // accumulate the note score

if(scores) // scores are required in output

// put score in output doc...

doc.getNamedFieldWithoutAttribute(tName,"score").setAttribute("score",stringScore(fScore));

numScores++;

t("numScores="+numScores);

t("noteScore="+noteScore);

} // end for each query field

} // end for each target field

// all target fields have been processed, calculate note score, check against threshold...

// noteScore is the geometric mean of scores

noteScore = Math.pow(noteScore,1.0/(double)numScores);

noteScore = (double)Math.round(noteScore * 100.0)/100.0; // round to nearest 2nd dec place

if(noteScore < threshold) Dynamic.deleteOutputDocument(); // not going to output the doc

else { // set the <note> score attribute

Field field = doc.getField("note");

if(scores) field.setAttribute("score",stringScore(noteScore));

}

} // end main()

} // end class definition

</markup>

</parameter>

processId= p1;

parameterId= m1;

parameterId= prog1;

</action>

This example shows two parameters, one for matchSpecA, the other for dynamic.

The dynamic parameter contains a program that, at the time of writing, is an implementation of the current matcher (the one that is invoked when there is no dynamic parameter). It thus serves to provide a basis for your own dynamic matching program.

In the example there are two methods other than main(). The first, t(), is to be found in many CAR classes; it simply outputs strings to standard output if the boolean TRACE is set. The second, stringScore(), converts a double score to a string for output, keeping at most two decimal places and dropping trailing zeros.
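
The behaviour of stringScore() is easiest to see by running it on a few values. The method body below is taken from the example; only the wrapper class name StringScoreSketch is invented, and the method is made package-visible so that it can be called from outside the class.

```java
class StringScoreSketch {
    // Body as in the example: converts a score to a short string.
    static String stringScore(double d) {
        if (d == 0.0) return "0";
        String s = "" + d;
        int i = s.indexOf(".");
        if ((i + 3) < s.length()) s = s.substring(0, i + 3);        // keep at most two decimals
        while (s.endsWith("0")) s = s.substring(0, s.length() - 1); // drop trailing zeros
        if (s.endsWith(".")) s = s.substring(0, s.length() - 1);    // drop a bare decimal point
        return s;
    }
}
```

So 1.0 becomes "1", 0.5 stays "0.5", and 0.333 becomes "0.33". The method truncates rather than rounds (0.339 also becomes "0.33"), which is harmless in the example because main() has already rounded each score to two decimal places before calling it.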

The program is largely self-explanatory. There are two nested loops. The outer loop processes each target document field; if the field is active, it is matched against the appropriate query document field or fields (CAR's Matcher has ensured there must be a matching one). This is accomplished by the inner loop, which is mainly concerned with scoring. You may wish to replace the default scoring mechanism in ValueTuple and Value with your own; this can be done in the inner loop at the point where matching target and query fields are established. You'll need to use methods in Field, ValueTuple, and Value to get at the values.
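
The way individual field scores are combined into a note score can be isolated from the CAR classes. The sketch below follows the arithmetic of the example (round each field score to two decimal places, multiply, take the nth root, round again, compare with the threshold); the class and method names are invented, and the field scores are simply passed in as an array rather than obtained from Field.score().

```java
class NoteScoreSketch {
    // Rounds to the nearest second decimal place, as the example does.
    static double round2(double x) {
        return (double) Math.round(x * 100.0) / 100.0;
    }

    // Geometric mean of the rounded field scores; assumes at least one score.
    static double noteScore(double[] fieldScores) {
        double product = 1.0;
        for (double s : fieldScores) product *= round2(s);
        return round2(Math.pow(product, 1.0 / fieldScores.length));
    }

    // True if a document with these field scores would be kept for output.
    static boolean passes(double[] fieldScores, double threshold) {
        return noteScore(fieldScores) >= threshold;
    }
}
```

Field scores of 0.5 and 0.8 give a note score of 0.63, which fails a threshold of 1.0. Because the scores are multiplied, a single field score of zero forces the note score to zero; a replacement scoring scheme that wants different behaviour would change the combination step here.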

Sunday, November 18, 2001
