Example 1 (set an attribute of a particular field)
Example 2 (apply a formula to a value)
Example 3 (puts an attribute in the note field of each document)
Example 4 (applies a document filter)
Example 5 (produces a summary document, and displays it)
Example 6 (extracts particular documents and particular fields)
Example 7 (a pre-processor to establish field weights)
The session file parameter 'dynamic' allows a researcher to provide at run time a small program to be applied to a documentUpdate, collectionUpdate, generalUpdate, or match process. The program is written in Bali and is the string value of the label 'program' in the parameter field. Note that the program text is placed between <markup> and </markup> tags so that it can contain characters such as '"' and '<' without interfering with the syntax of the session file.
During the parsing of the session file each Bali program (there may be an arbitrary number) is compiled and held in internal form. When CAR actions a process, the associated Bali program is executed. The program accesses the process's data through an interface, which varies according to the type of process.
The interface is defined in the class Dynamic. It is defined below and exemplified for the 4 relevant process types:
1. documentUpdate
2. collectionUpdate
3. generalUpdate
4. match.
A researcher wishing to use the Dynamic Interface will need (a) an understanding of Java, and (b) familiarity with the CAR classes that relate to the structure of CAR data (Collection, Document, Field, and so on). Details of these classes can be found in the CAR class documentation.
For a documentUpdate process the class Dynamic contains a static method Dynamic.getDocument() which delivers to the program a clone (copy) of the current document of the input collection. The program updates the document as necessary.
<process id=p1 type=documentUpdate>
input= external;
inputName= "c:\Car\Xamples\docs9.txt";
output= external;
outputName= "c:\Car\Xamples\docs9Updated.txt"
</process>
<parameter id=param1 type=dynamic>
program=
<markup>
// this program, because it is a parameter to a documentUpdate process,
// is executed for each document;
// it gets a document, updates it as necessary, and terminates;
// CAR handles everything else
class SetWeights {
  static void main(String[] args) {
    Document doc = Dynamic.getDocument(); // get the current document to be processed
    // loop that processes each field ...
    // sets a weight attribute value, overwriting or inserting the attribute...
    for(int numFields=doc.numberOfFields(), i=0; i<numFields; i++) {
      Field field = doc.getField(i);
      if(field.getFieldName().equals("location")) field.setAttribute("weight","1.5");
    }
  }
}
</markup>
</parameter>
<action id=a1>
processId= p1;
parameterId= param1;
</action>
The program is a complete Bali/Java program in the sense that it defines a class (called 'SetWeights' although the name can be any valid identifier). Crucially the program contains a static main() method. It could contain other methods, and indeed a program could contain other classes, but they are not necessary in this simple example.
CAR causes the main() method to be executed for each document.
Note how the first line of the body of main() gets the current document using the Dynamic interface. The variable 'doc' is, in Java terms, a reference to the object provided by Dynamic.getDocument(). When the main() method completes its execution the Dynamic class has a reference to the object that has been processed, and CAR causes the object to be placed in the output collection. You must therefore not cause the reference to have a value null. Document deletions can be accomplished in collectionUpdate or generalUpdate processes.
After the program gets a document it processes each field and sets an attribute. The logic is straightforward, but the code shows that a knowledge of CAR's classes is essential, i.e. the researcher needs to know the data structures (as explained in the Glossary of the User Manual) and the classes that relate to them.
<!-- session Test Dynamic1b -->
<process id=p1 type=documentUpdate>
input= external;
inputName= "c:\Car\Xamples\docs9.txt";
output= external;
outputName= "c:\Car\Xamples\docs9formula.txt"
</process>
<parameter id=param1 type=dynamic>
program=
<markup>
// this program, for a documentUpdate, is executed for each document;
// it converts any fahrenheit fields to centigrade as necessary
// CAR handles everything else
class Formula {
  static void main(String[] args) {
    Document doc = Dynamic.getDocument(); // get the current document to be processed
    // loop that processes each field ...
    for(int numFields=doc.numberOfFields(), i=0; i<numFields; i++) {
      Field field = doc.getField(i);
      if(!field.getFieldName().toLowerCase().equals("temperature")) continue; // ignore
      if(!field.hasAttribute("scale")) continue; // ignore
      if(!field.getAttributeValue("scale").toLowerCase().equals("fahrenheit")) continue; // ignore
      if(!field.hasValue()) continue; // ignore
      // we have an appropriate field...
      double value = field.getNumericValue(); // get the fahrenheit value
      value = (value - 32) * 5 / 9; // convert it to centigrade
      field.setNumericValue(value);
      field.setAttribute("scale","centigrade");
    }
  }
}
</markup>
</parameter>
<action id=a1>
processId= p1;
parameterId= param1;
</action>
This program has the same basic structure as the previous one, but exemplifies different features of CAR classes. It looks for a temperature field that has the attribute scale="Fahrenheit", converts the value to centigrade and changes the attribute to scale="centigrade".
Note, however, that it uses the String method toLowerCase() in order to make string comparisons (example 1 omitted this piece of logic for 'location' and would not have set the weight for a location field name containing any capital letters). This is necessary because field names (tags) are not case sensitive but Java is.
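The comparison issue can be illustrated outside CAR with plain Java strings. The following is a minimal sketch only: the class NameCompare and the helper sameFieldName() are hypothetical and are not part of CAR. It also shows String.equalsIgnoreCase(), a standard Java method that has the same effect for this purpose.

```java
import java.util.Locale;

// Hypothetical helper, not part of CAR: compares field names without regard to case.
class NameCompare {
    static boolean sameFieldName(String a, String b) {
        // Locale.ROOT avoids locale-specific case mappings
        return a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println("Location".equals("location"));            // false: equals() is case sensitive
        System.out.println(sameFieldName("Location", "location"));    // true
        System.out.println("Location".equalsIgnoreCase("location")); // true
    }
}
```

Lower-casing the field name once and reusing the result, as the later examples do with a local variable, avoids repeating the conversion in every comparison.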
The line containing if(!field.hasAttribute("scale")) continue; is important: if it were not present, the next line would produce an execution error whenever the field did not have a 'scale' attribute. Why is this? The method field.getAttributeValue("scale") would return null (that is how the class Field defines its return value for an attribute that does not exist in the field), and because the rest of the conditional expression assumes a String reference has been returned, a run-time exception is thrown (nothing new here; it would happen in any Java program with similar logic).
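The same failure and its guard can be reproduced in plain Java. This is a sketch under stated assumptions: a java.util.Map stands in for a field's attributes, and the class NullGuard and method isFahrenheit() are hypothetical names, not CAR's API.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a field's attributes; CAR's Field class is not used here.
class NullGuard {
    // Returns true only when a "scale" attribute exists and equals "fahrenheit" (any case).
    static boolean isFahrenheit(Map<String, String> attributes) {
        String scale = attributes.get("scale"); // null when the attribute is absent
        if (scale == null) return false;        // guard: without it the next line would throw
        return scale.toLowerCase().equals("fahrenheit");
    }

    public static void main(String[] args) {
        Map<String, String> withScale = new HashMap<>();
        withScale.put("scale", "Fahrenheit");
        Map<String, String> withoutScale = new HashMap<>();

        System.out.println(isFahrenheit(withScale));    // true
        System.out.println(isFahrenheit(withoutScale)); // false, and no exception
    }
}
```

An alternative in standard Java is to put the constant first, as in "fahrenheit".equalsIgnoreCase(scale), which returns false for a null argument instead of throwing.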
The line containing field.getNumericValue() is acceptable provided we are sure we are dealing with a non-range, 1-tuple value. The logic is a little more complex if we have to deal with these cases: it involves getting the ValueTuple and processing each element appropriately (reference to the documentation of the ValueTuple and Value classes would be necessary).
In a collectionUpdate process there is one input and one output collection.
The class Dynamic contains:
· a static method Dynamic.getCollection() which delivers to the program a reference to the input collection
· a static method Dynamic.putCollection(collection) which passes a reference to the output collection to Dynamic.
CAR looks after the output collection after Dynamic has received it.
<!-- session Test Dynamic2a -->
<process id=p1 type=collectionUpdate>
input= external;
inputName= "c:\Car\Xamples\docs9.txt";
output= external;
outputName= "c:\Car\Xamples\docs9Updated.txt"
</process>
<parameter id=param1 type=dynamic>
program=
<markup>
// this program puts an attribute "id=PJBn" into each note field,
// incrementing 'n' by 1 in the value "PJBn" and n starting from a given suffix
class Identify {
  static void main(String[] args) {
    String prefix = "PJB"; // prefix
    int suffix=7; // initial suffix
    Collection inDocs = Dynamic.getCollection(); // get the current collection to be processed
    Collection outDocs = new Collection();
    int numDocs = inDocs.numberOfDocuments();
    Document doc;
    for (int i=0; i<numDocs; i++) {
      doc = inDocs.getDocumentCopy(i); // get the next input document
      Field field = doc.getField("note");
      if(field!=null) field.setAttribute("id",prefix+(suffix++));
      outDocs.addDocument(doc); // add the document to the output collection
    }
    Dynamic.putCollection(outDocs); // pass the new collection to Dynamic
  }
}
</markup>
</parameter>
<action id=a1>
processId= p1;
parameterId= param1;
</action>
This example program 'identifies' each note in a collection by creating a uniquely valued 'id' attribute for each <note> field. In the main() method a reference to the input document collection is obtained (inDocs) and a new, empty output document collection is created (outDocs). The idea is to process each input document, add the processed copy to outDocs, and then pass outDocs to the Dynamic interface at the end of all processing.
The simple loop ensures that each document of the input collection is processed. The first line in the loop's body containing doc = inDocs.getDocumentCopy(i); is important. The method getDocumentCopy(i) takes a clone (copy) of the ith input document. If a clone was not taken, and instead a reference was taken with the method getDocument(i), then the original document (as held in memory) would be updated. This will cause problems in situations where the input is designated 'internal' and is used by other processes (since the internal collection will have changed).
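The difference between taking a copy and taking a reference can be shown with a small plain-Java sketch. The class Doc below is a hypothetical stand-in and is not CAR's Document class; its copy() method plays the role that getDocumentCopy(i) plays above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a document; it is not CAR's Document class.
class Doc {
    String note;
    Doc(String note) { this.note = note; }
    Doc copy() { return new Doc(note); } // independent clone
}

class CopyVsReference {
    public static void main(String[] args) {
        List<Doc> input = new ArrayList<>();
        input.add(new Doc("original"));

        Doc clone = input.get(0).copy();       // analogous to getDocumentCopy(i)
        clone.note = "changed";
        System.out.println(input.get(0).note); // original: the input is untouched

        Doc ref = input.get(0);                // analogous to getDocument(i)
        ref.note = "changed";
        System.out.println(input.get(0).note); // changed: the input itself was updated
    }
}
```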
In the next line of the loop's body notice that the method doc.getField("note") can return a null value if the document does not contain a <note> field (it is always wise to check the class documentation to see what range of values can be returned, and then to handle all eventualities in your program).
Two lines later the document is added to the outDocs collection. When the loop terminates outDocs is passed to the Dynamic interface.
<!-- session Test Dynamic2b -->
<process id=p1 type=collectionUpdate>
input= external;
inputName= "c:\Car\Xamples\docs9.txt";
output= external;
outputName= "c:\Car\Xamples\docs9good.txt"
</process>
<parameter id=param1 type=dynamic>
program=
<markup>
// this program deletes documents that do not have <temperature> AND <location> fields.
class Filter {
  static void main(String[] args) {
    Collection inDocs = Dynamic.getCollection(); // get the current collection to be processed
    Collection outDocs = new Collection();
    int numDocs = inDocs.numberOfDocuments();
    Document doc;
    for (int i=0; i<numDocs; i++) {
      doc = inDocs.getDocumentCopy(i); // get the next input document
      if(doc.getField("temperature")==null || doc.getField("location")==null) continue; // ignore (delete)
      outDocs.addDocument(doc); // add the document to the output collection
    }
    Dynamic.putCollection(outDocs); // pass the new collection to Dynamic
  }
}
</markup>
</parameter>
<action id=a1>
processId= p1;
parameterId= param1;
</action>
This example shows how to delete documents from a collection. Within the loop is a conditional which, if true, loops again without adding the document to the output collection. In this program documents that do not have both of the fields <temperature> and <location> are deleted, i.e. are not passed to the output collection.
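The "delete by not copying" pattern can be sketched in plain Java. The assumptions here are loud ones: each document is modelled as a java.util.Map from field name to value, the collection as a java.util.List, and the class FilterSketch, the method keepComplete() and the sample data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: documents are modelled as maps, not as CAR Document objects.
class FilterSketch {
    // Returns a new list holding only documents with both a temperature and a location field.
    static List<Map<String, String>> keepComplete(List<Map<String, String>> inDocs) {
        List<Map<String, String>> outDocs = new ArrayList<>();
        for (Map<String, String> doc : inDocs) {
            if (!doc.containsKey("temperature") || !doc.containsKey("location")) continue; // ignore (delete)
            outDocs.add(doc); // keep
        }
        return outDocs;
    }

    public static void main(String[] args) {
        List<Map<String, String>> in = new ArrayList<>();
        in.add(Map.of("temperature", "20", "location", "office")); // invented sample data
        in.add(Map.of("temperature", "15"));
        in.add(Map.of("note", "no sensor data"));
        System.out.println(keepComplete(in).size()); // 1
    }
}
```

The input list is never modified; deletion is simply the absence of a document from the output.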
<!-- session Test Dynamic2c -->
<process id=p1 type=collectionUpdate>
input= external;
inputName= "c:\Car\Xamples\docs9.txt";
output= external;
outputName= "c:\Car\Xamples\docs9summary.txt"
</process>
<parameter id=param1 type=dynamic>
program=
<markup>
// this program creates a summary of the input documents in the output collection
// the output collection will thus end up as a single-document collection
// finally the output collection is displayed on standard output
class Summary {
  static void main(String[] args) {
    Collection inDocs = Dynamic.getCollection(); // get the current collection to be processed
    int numDocs = inDocs.numberOfDocuments();
    int numTemperatureFields=0,numLocationFields=0;
    Document doc;
    for (int i=0; i<numDocs; i++) {
      doc = inDocs.getDocumentCopy(i); // get the next input document
      if(doc.getField("temperature")!=null) numTemperatureFields++;
      if(doc.getField("location")!=null) numLocationFields++;
    }
    Document summaryDoc=new Document();
    summaryDoc.addComment("summary of docs9.txt");
    summaryDoc.addField(new Field("number of documents",numDocs));
    summaryDoc.addField(new Field("number of temperature fields",numTemperatureFields));
    summaryDoc.addField(new Field("number of location fields",numLocationFields));
    Collection outDocs = new Collection();
    outDocs.addDocument(summaryDoc); // add the summary document to the output collection
    Dynamic.putCollection(outDocs); // pass the new collection to Dynamic
    System.out.println("Summary:\n" + outDocs);
  }
}
</markup>
</parameter>
<action id=a1>
processId= p1;
parameterId= param1;
</action>
This program shows how to create a new document. It creates just one document for the output collection, a summary of the input collection. At the end of processing it displays the summary document. The logic is straightforward. Note that in the method call System.out.println("Summary:\n" + outDocs); the argument is a String and that outDocs, the collection, is automatically converted to an appropriate String. To do this Java looks for a method toString() in Collection, and if it exists invokes it for outDocs. You will find that the toString() method has been provided in all CAR classes that form part of the structure of a Collection, e.g. Document and Field.
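The conversion described above is standard Java behaviour and can be seen with any class that overrides toString(). The class Reading below is a hypothetical illustration and is not part of CAR.

```java
// Hypothetical class, not part of CAR, showing how toString() is used in string concatenation.
class Reading {
    final String name;
    final double value;

    Reading(String name, double value) {
        this.name = name;
        this.value = value;
    }

    @Override
    public String toString() {
        return "<" + name + "> " + value;
    }

    public static void main(String[] args) {
        Reading r = new Reading("temperature", 21.5);
        // the + operator calls r.toString() implicitly
        System.out.println("Summary:\n" + r);
    }
}
```

If a class does not override toString(), Java falls back to the version inherited from Object, which prints the class name and a hash code rather than the object's contents.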
This is the first example that shows a Bali program using part of Java's API, namely System.out.println(). In principle any classes of Java's API can be used.
<!-- session Test Dynamic2d -->
<process id=p1 type=collectionUpdate>
input= external;
inputName= "c:\Car\Xamples\docs9.txt";
output= external;
outputName= "c:\Car\Xamples\docs9extraction.txt"
</process>
<parameter id=param1 type=dynamic>
program=
<markup>
// this program extracts from the input documents only those documents
// that have a <temperature> and <location> field ...
// and removes other fields from them ...
// finally the output collection is displayed on standard output
class Extraction {
  static void main(String[] args) {
    Collection inDocs = Dynamic.getCollection(); // get the current collection to be processed
    Collection outDocs = new Collection();
    int numDocs = inDocs.numberOfDocuments();
    Document doc;
    for (int i=0; i<numDocs; i++) {
      doc = inDocs.getDocumentCopy(i); // get the next input document
      if(doc.getField("temperature")==null || doc.getField("location")==null) continue; // ignore
      // need to create a new document and copy across required fields for it
      Document newDoc = new Document();
      // loop that processes each input field...
      for(int numFields=doc.numberOfFields(), j=0; j<numFields; j++) {
        Field field = doc.getField(j);
        if(field.getFieldName().toLowerCase().equals("note") ||
           field.getFieldName().toLowerCase().equals("temperature") ||
           field.getFieldName().toLowerCase().equals("location")
          ) newDoc.addField(field);
      }
      outDocs.addDocument(newDoc); // add the new document to the output collection
    }
    Dynamic.putCollection(outDocs); // pass the new collection to Dynamic
    System.out.println("New collection:\n" + outDocs);
  }
}
</markup>
</parameter>
<action id=a1>
processId= p1;
parameterId= param1;
</action>
This program passes to the output collection only those documents that have both temperature and location fields, and within those documents only note, temperature, and location fields are passed. The condition doc.getField("temperature")==null || doc.getField("location")==null is used to reject documents that are not relevant, by testing whether the getField() method returns null (i.e. the field is not present).
Once a relevant document is found a new output document is created, which initially has no fields. There then follows an inner loop to process each field of the input document. Relevant fields of it are added to the new output document which is in turn added to the output collection.
In a generalUpdate process there is an arbitrary number of inputs and an arbitrary number of outputs (and not necessarily the same number of inputs as outputs). Each input is a collection and each output is a collection.
The class Dynamic contains:
· a static method Dynamic.getCollections() which delivers to the program a reference to an array of input collections
· a static method Dynamic.putCollections(collections) which passes a reference to an array of output collections to Dynamic.
CAR looks after the array of output collections after Dynamic has received it.
The paper Needs of the Matcher-library (Brown) describes various scenarios in which several inputs might be needed for an update. For example, updating a context diary would require a diary and a current context as inputs, with a new diary being the single output. A pre-processor to massage the current context on the basis of history would require a diary and context as inputs, with a new context being the single output. A pre-processor to set field weights would require similar inputs and would create a single document of weights (which could then be used in a dynamic matching process, an example that we show below). And there are numerous others that fall within the generalUpdate category.
<!-- session Test Dynamic 4b -->
<process id=p1 type=generalUpdate>
input= external,external;
inputName= "Xamples\diary.txt","Xamples\context.txt";
output= external;
outputName= "Xamples\weights.txt";
</process>
<parameter id=prog1 type=dynamic>
program=
<markup>
// this program, because it is a parameter to a generalUpdate process, is executed just once;
// it gets an array of input collections, processes them, outputs collections as necessary, and terminates;
// CAR handles everything else.
// Specifically it works through a diary determining which fields in the current context
// have changed value most often, and uses this to create a weights document.
class CreateFieldWeights {
static void main(String[] args) {
Collection[] inCollections = Dynamic.getCollections(); // get the collections to be processed
Collection[] outCollections = new Collection[1];
// check that we have just two inputs...
if(inCollections.length!=2) Error.error("In the program 'CreateFieldWeights' there should be 2 inputs");
Collection diary = inCollections[0];
if(inCollections[1].numberOfDocuments()<1) Error.error("In the program 'CreateFieldWeights' the second input should contain a context (but it contains no documents)");
if(inCollections[1].numberOfDocuments()>1) Error.warn("In the program 'CreateFieldWeights' the second input should contain a single context (but it contains more than one) ... only the first will be used");
Document context = inCollections[1].getDocument(0); // get the context document
/* stage 1 - create a weights document based on the current context
- set each value (weighting factor) of each field to
zero...
the factor will be incremented for each occurrence of the field
in the diary...
provided its value has changed from the last occurrence
stage 2 - create a lastValues document based on the current context...
that will be used to record a field's last value
stage 3 - for each document in the diary...
for each field in the document...
if its name is in the weights document and...
if its value is different to the equivalent
value in the lastValues document...
increment the value in the appropriate
field of the weights document
stage 4 - using the weights document ...
sum all the weights and for each field in the weights
document...
express its value as a fraction of the sum
stage 5 - output the weights document.
*/
////// stage 1 //////
Document weights = context.copy(); // make a copy of the context document
for(int numFields=weights.numberOfFields(), i=0; i<numFields; i++) {
Field field = weights.getField(i);
field.setNumericValue(0.0);
}
////// stage 2 //////
Document lastValues = context.copy(); // make a copy of the context document
for(int numFields=lastValues.numberOfFields(), i=0; i<numFields; i++) {
Field field = lastValues.getField(i);
field.removeValue();
}
////// stage 3 //////
int numDiaryDocs = diary.numberOfDocuments();
Document diaryDoc;
for (int i=0; i<numDiaryDocs; i++) {
diaryDoc = diary.getDocumentCopy(i); // get the next diary document
for (int j=0, numFields=diaryDoc.numberOfFields(); j < numFields; j++) {
Field dField = diaryDoc.getField(j);
String fName = dField.getFieldName().toLowerCase();
if(fName.equals("note")) continue; // ignore
Field wField;
if((wField=weights.getField(fName)) == null) continue; // ignore
Field lvField=lastValues.getField(fName);
if(!dField.sameValue(lvField)) { // change of value
double val = wField.getNumericValue();
wField.setNumericValue(++val);
lvField.setValueTuple(dField.getValueTuple());
}
}
}
////// stage 4 //////
double sumWeights=0;
for(int numFields=weights.numberOfFields(), i=0; i<numFields; i++) {
Field field = weights.getField(i);
sumWeights+=field.getNumericValue();
}
for(int numFields=weights.numberOfFields(), i=0; i<numFields; i++) {
Field field = weights.getField(i);
field.setNumericValue(field.getNumericValue() / sumWeights);
}
Collection outDocs = new Collection();
outDocs.addDocument(weights);
outCollections[0]=outDocs;
Dynamic.putCollections(outCollections);
}
}
</markup>
</parameter>
<action id=a1>
processId= p1;
parameterId= prog1;
</action>
This example is largely self-explanatory. Note how the CAR class Error is used to report any incidents that occur during execution of the program. The method Error.error() causes an error message to be output on standard output and the session is halted. The method Error.warn(), on the other hand, issues a warning and continues processing.
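The counting and normalising logic of stages 1 to 4 can be followed more easily in a self-contained sketch that does not depend on CAR. Everything here is an assumption made for illustration: documents are modelled as maps from field name to value, and the class WeightsSketch, the method fieldWeights() and the sample diary are hypothetical. The guard against a zero sum is an addition that the program above does not contain.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the weighting idea; documents are plain maps, not CAR Documents.
class WeightsSketch {
    static Map<String, Double> fieldWeights(List<Map<String, String>> diary, List<String> contextFields) {
        Map<String, Double> weights = new LinkedHashMap<>();
        Map<String, String> lastValues = new LinkedHashMap<>();     // stage 2: no last values yet
        for (String name : contextFields) weights.put(name, 0.0);   // stage 1: all weights zero

        for (Map<String, String> doc : diary) {                      // stage 3: count value changes
            for (Map.Entry<String, String> f : doc.entrySet()) {
                String name = f.getKey();
                if (!weights.containsKey(name)) continue;            // not a context field
                if (!f.getValue().equals(lastValues.get(name))) {    // first occurrence counts as a change
                    weights.put(name, weights.get(name) + 1.0);
                    lastValues.put(name, f.getValue());
                }
            }
        }

        double sum = 0.0;                                            // stage 4: normalise
        for (double w : weights.values()) sum += w;
        if (sum == 0.0) return weights;                              // added guard: avoid dividing by zero
        for (Map.Entry<String, Double> e : weights.entrySet()) e.setValue(e.getValue() / sum);
        return weights;
    }

    public static void main(String[] args) {
        List<Map<String, String>> diary = List.of(                   // invented sample diary
            Map.of("location", "A", "temperature", "20"),
            Map.of("location", "B", "temperature", "20"),
            Map.of("location", "C", "temperature", "21"));
        // location changes 3 times, temperature twice, so the weights are 3/5 and 2/5
        System.out.println(fieldWeights(diary, List.of("location", "temperature")));
    }
}
```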
In a match process there are two input collections and a single output collection. Matching is the process whereby a comparison is made between one or more contexts and a document collection, and those comparisons that are deemed important form the basis of output.
For a discussion of matching see STICK-E NOTES: the Context Matcher User Manual (Brown); see also Active fields and the rules for document matching, which provides the rules for matching. A variety of other papers discuss particular sub-topics of matching in greater detail; these are contained in Peter Brown's collection of discussion and specification papers. Implementation details of matching can be found in the javadoc for Matcher.
There are 2 parameter types for match:
matchSpecA is compulsory: it specifies the tags to be matched, the type of match (interactive or proactive), a scoring threshold, and what is to be output.
dynamic is optional: it supplies a Bali program that is used in place of the default matcher.
The class Dynamic contains for match:
· a static method Dynamic.getTargetDocument() which gets a reference to the current target document
· a static method Dynamic.getQueryDocument() which gets a reference to the current query document
· a static method Dynamic.getOutputDocument() which gets a reference to the current document from the collection
· a static method Dynamic.getThreshold() which gives a double threshold score
· a static method Dynamic.getScores() which gives a boolean indicating whether score attributes are to be output
· a static method Dynamic.getActiveFields() which gives a reference to a String[] which contains lower case names of active fields
· a static method Dynamic.deleteOutputDocument() which allows the program to indicate that the current output document is not to form part of the output (if it scores below the threshold, for example).
CAR passes control to main() for each matching document from the collection, the one provided by Dynamic.getOutputDocument(). Other information is passed as indicated above. The main() method updates this document, using the query and target documents, and when it terminates CAR adds the document to the output collection (CAR also deals with the output of the context at the head of the output collection).
<!-- session Test Dynamic 3a -->
<!-- this session reads an external context and an external eNote document collection,
does a proactive match on them and puts the output in an external doc collection -->
<process id=p1 type=match>
input= external,external; <!-- context always before collection -->
inputName= "Xamples\context4.txt","Xamples\eNotes4.txt";
output= external;
outputName= "Xamples\match4.txt";
</process>
<parameter id=m1 type=matchSpecA>
activeTags= text,location;
threshold= 1.0;
scores= true;
currentContext= ACTIVE;
document= ANY;
direction= proactive;
</parameter>
<parameter id=prog1 type=dynamic>
program=
<markup>
// this program, because it is a parameter to a match process, is executed for each match;
// it gets a target document, a query document, an output document, updates it as necessary, and terminates;
// CAR handles everything else.
class MatchScorerA {
static boolean TRACE=false; // set to false if no trace messages required
//---------------------------------
static void t(String s) {if(TRACE) System.out.println("MatchScorerA- "+s);} // simple debugger
//-------------------------------
private static String stringScore(double d) {
// convert score to string, truncating scores ending .0
if(d==0.0) return "0";
String s = ""+d;
int i=s.indexOf(".");
if( (i + 3) < s.length() ) s = s.substring(0,i+3);
while(s.endsWith("0")) s = s.substring(0,s.length()-1);
if(s.endsWith(".")) s = s.substring(0,s.length()-1);
return s;
}
//---------------------------------
static void main(String[] args) {
Document target = Dynamic.getTargetDocument(); // get the current target document
Document query = Dynamic.getQueryDocument(); // get the current query document
Document doc = Dynamic.getOutputDocument(); // get the current document from the collection
double threshold = Dynamic.getThreshold(); // get input 'threshold' (default 0.0)
boolean scores = Dynamic.getScores(); // get input 'scores' (default true)
String[] activeFields = Dynamic.getActiveFields(); // get activeField names
t("\n\n\n\n\n================ new document ====================================");
t("target="+target);
t("query="+query);
t("outDoc="+doc);
t("scores="+scores);
Field tField,qField; // target and query fields
double fScore,noteScore=1.0; // field and note score
// remove all score attributes in output doc ... (the ones for activeFields are put in later)
doc.removeAttribute("score");
int numScores=0;
// match the two Notes
for(int j=0;j<target.numberOfFields();j++) { // for each target field
tField = target.getField(j);
String tName=tField.getFieldName().toLowerCase(); // name of target field
t("tName="+tName);
if(tName.equals("note")) continue; // ignore note field for scoring purposes
if(!Utils.member(tName,activeFields)) continue; // ignore inactive fields in target
// got an active target field...
// process each matching queryField
// remembering there can be 'duplicates' i.e. same name different attributes
for (int qX=0, numQFields = query.numberOfFields(); qX < numQFields; qX++) {
qField=query.getField(qX); // get the matching query field
if(!qField.getFieldName().equals(tName)) continue; // not the right query field
fScore = tField.score(qField); // score the field
fScore = (double)Math.round(fScore * 100.0) / 100.0; // round to nearest 2nd dec place
noteScore *= fScore; // accumulate the note score
if(scores) // scores are required in output
// put score in output doc...
doc.getNamedFieldWithoutAttribute(tName,"score").setAttribute("score",stringScore(fScore));
numScores++;
t("numScores="+numScores);
t("noteScore="+noteScore);
} // end for each query field
} // end for each target field
// all target fields have been processed, calculate note score, check against threshold...
// noteScore is the geometric mean of scores
noteScore = Math.pow(noteScore,1.0/(double)numScores);
noteScore = (double)Math.round(noteScore * 100.0)/100.0; // round to nearest 2nd dec place
if(noteScore < threshold) Dynamic.deleteOutputDocument(); // not going to output the doc
else { // set the <note> score attribute
Field field = doc.getField("note");
if(scores) field.setAttribute("score",stringScore(noteScore));
}
} // end main()
} // end class definition
</markup>
</parameter>
<action id=a1>
processId= p1;
parameterId= m1;
parameterId= prog1;
</action>
This example shows two parameters, one for matchSpecA, the other for dynamic.
The dynamic parameter contains a program that, at the time of writing, is an implementation of the current matcher (the one that is invoked when there is no dynamic parameter). It thus serves to provide a basis for your own dynamic matching program.
In the example we see that there are two methods other than main(). The first, t(), is to be found in many CAR classes. It simply outputs strings to standard output if the boolean TRACE is set. The second method, stringScore(), is used to truncate doubles appropriately for output.
The program is largely self-explanatory. There are two main loops. The first processes each target document field which, if it is an active field, is matched against an appropriate query document field (CAR's Matcher has ensured there must be a matching one); this is accomplished by the inner loop, which is mainly concerned with scoring. You may wish to replace the default scoring mechanism in ValueTuple and Value with your own; this can be done in the inner loop at the point where matching target and query fields are established. You'll need to use methods in Field, ValueTuple, and Value to get at the values.
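The note-score arithmetic used in the program (round each field score to two decimal places, multiply, take the geometric mean, round again, compare with the threshold) can be tried in isolation. This is a sketch only: the class NoteScoreSketch and its methods are hypothetical, the field scores are invented, and it assumes at least one field score is supplied.

```java
// Hypothetical sketch of the note-score arithmetic; it does not use CAR classes.
class NoteScoreSketch {
    // Round to the nearest second decimal place, as the matcher program does.
    static double round2(double d) {
        return Math.round(d * 100.0) / 100.0;
    }

    // Geometric mean of the rounded field scores; assumes fieldScores is not empty.
    static double noteScore(double[] fieldScores) {
        double product = 1.0;
        for (double s : fieldScores) product *= round2(s);
        return round2(Math.pow(product, 1.0 / fieldScores.length));
    }

    public static void main(String[] args) {
        double[] scores = {0.5, 0.8};   // invented field scores
        double threshold = 0.6;         // invented threshold
        double note = noteScore(scores); // sqrt(0.5 * 0.8) is about 0.632, which rounds to 0.63
        System.out.println(note);
        System.out.println(note < threshold ? "delete output document" : "keep output document");
    }
}
```

Because the mean is geometric, a single field score of zero forces the note score to zero, so one wholly unmatched active field is enough to take the document below any positive threshold.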
Sunday, November 18, 2001