Table of contentsThis document describes the processing of libxslt, the XSLTC library developed for the Gnomeproject. Note: this documentation is by definition incomplete and I am not good
atspelling, grammar, so patches and suggestions are really welcome. XSLT is a transformation language. It takes an input document and
astylesheet document and generates an output document: Libxslt is written in C. It relies on libxml, the XML C library for Gnome,
forthe following operations: - parsing files
- building the in-memory DOM structure associated with the
documentshandled
- the XPath implementation
- serializing back the result document to XML and HTML. (Text is
handleddirectly.)
Libxslt is not very specialized. It is built under the assumption that
allnodes from the source and output document can fit in the virtual memory
ofthe system. There is a big trade-off there. It is fine for reasonably
sizeddocuments but may not be suitable for large sets of data. The gain is
that itcan be used in a relatively versatile way. The input or output may
never beserialized, but the size of documents it can handle are limited by
the sizeof the memory available. More specialized memory handling approaches are possible, like buildingthe
input tree from a serialization progressively as it is consumed,factoring
repetitive patterns, or even on-the-fly generation of the output asthe input
is parsed but it is possible only for a limited subset of thestylesheets. In
general the implementation of libxslt follows the followingpattern: - KISS (keep it simple stupid)
- when there is a clear bottleneck optimize on top of this
simpleframework and refine only as much as is needed to reach the
expectedresult
The result is not that bad, clearly one can do a better job but
morespecialized too. Most optimization like building the tree on-demand
wouldneed serious changes to the libxml XPath framework. An easy step would
be toserialize the output directly (or call a set of SAX-like output handler
tokeep this a flexible interface) and hence avoid the memory consumption of
theresult. DOM-like trees, as used and generated by libxml and libxslt, arerelatively
complex. Most node types follow the given structure except a fewvariations
depending on the node type: Nodes carry a nameand the node
typeindicates the kind of node it represents, the most
common ones are: - document nodes
- element nodes
- text nodes
For the XSLT processing, entity nodes should not be generated (i.e.
theyshould be replaced by their content). Most nodes also contains the
following"navigation" informations: - the containing document
- the parentnode
- the first childrennode
- the lastchildren node
- the previous sibling
- the following sibling (next)
Elements nodes carries the list of attributes in the properties,
anattribute itself holds the navigation pointers and the children list
(theattribute value is not represented as a simple string to allow usage
ofentities references). The nspoints to the namespace declaration for
thenamespace associated to the node, nsDefis the linked
listof namespace declaration present on element nodes. Most nodes also carry an _privatepointer which can beused
by the application to hold specific data on this node. There are a few steps which are clearly decoupled at the
interfacelevel: - parse the stylesheet and generate a DOM tree
- take the stylesheet tree and build a compiled version of it
(thecompilation phase)
- take the input and generate a DOM tree
- process the stylesheet against the input tree and generate an
outputtree
- serialize the output tree
A few things should be noted here: - the steps 1/ 3/ and 5/ are optional
- the stylesheet obtained at 2/ can be reused by multiple processing
4/(and this should also work in threaded programs)
- the tree provided in 2/ should never be freed using xmlFreeDoc, but
byfreeing the stylesheet.
- the input tree 4/ is not modified except the _private field which maybe
used for labelling keys if used by the stylesheet
This is the second step described. It takes a stylesheet tree,
and"compiles" it. This associates to each node a structure stored in
the_private field and containing information computed in the stylesheet: One xsltStylesheet structure is generated per document parsed for
thestylesheet. XSLT documents allow includes and imports of other
documents,imports are stored in the importslist (hence
keeping thetree hierarchy of includes which is very important for a proper
XSLTprocessing model) and includes are stored in the
doclistlist. An imported stylesheet has a parent link to
allow browsing of thetree. The DOM tree associated to the document is stored in
doc.It is preprocessed to remove ignorable empty nodes and
all the nodes in theXSLT namespace are subject to precomputing. This usually
consist ofextracting all the context information from the context tree
(attributes,namespaces, XPath expressions), and storing them in an
xsltStylePreCompstructure associated to the _privatefield of
the node. A couple of notable exceptions to this are XSLT template nodes (more
onthis later) and attribute value templates. If they are actually
templates,the value cannot be computed at compilation time. (Some
preprocessing couldbe done like isolation and preparsing of the XPath
subexpressions but it'snot done, yet.) The xsltStylePreComp structure also allows storing of the precompiled
formof an XPath expression that can be associated to an XSLT element (more
onthis later). A proper handling of templates lookup is one of the keys of fast
XSLTprocessing. (Given a node in the source document this is the process
offinding which templates should be applied to this node.) Libxslt follows
thehint suggested in the 5.2Patternssection of the XSLT
Recommendation, i.e. it doesn't evaluate itas an XPath expression but
tokenizes it and compiles it as a set of rules tobe evaluated on a candidate
node. There usually is an indication of the nodename in the last step of this
evaluation and this is used as a key check forthe match. As a result libxslt
builds a relatively more complex set ofstructures for the templates: Let's describe a bit more closely what is built. First the
xsltStylesheetstructure holds a pointer to the template hash table. All the
XSLT patternscompiled in this stylesheet are indexed by the value of the the
targetelement (or attribute, pi ...) name, so when a element or an attribute
"foo"needs to be processed the lookup is done using the name as a key. Each of the patterns is compiled into an xsltCompMatch structure. It
holdsthe set of rules based on the tokenization of the pattern stored in
reverseorder (matching is easier this way). It also holds some information
about theprevious matches used to speed up the process when one iterates over
a set ofsiblings. (This optimization may be defeated by trashing when
runningthreaded computation, it's unclear that this is a big deal in
practice.)Predicate expressions are not compiled at this stage, they may be
at run-timeif needed, but in this case they are compiled as full XPath
expressions (theuse of some fixed predicate can probably be optimized, they
are not yet). The xsltCompMatch are then stored in the hash table, the clash list
isitself sorted by priority of the template to implement "naturally" the
XSLTpriority rules. Associated to the compiled pattern is the xsltTemplate itself
containingthe information required for the processing of the pattern
including, ofcourse, a pointer to the list of elements used for building the
patternresult. Last but not least a number of patterns do not fit in the hash
tablebecause they are not associated to a name, this is the case for
patternsapplying to the root, any element, any attributes, text nodes, pi
nodes, keysetc. Those are stored independently in the stylesheet structure as
separatelinked lists of xsltCompMatch. The processing is defined by the XSLT specification (the basis of
thealgorithm is explained in the
Introductionsection). Basically it works by taking the root of the input
document andapplying the following algorithm: - Finding the template applying to it. This is a lookup in the
templatehash table, walking the hash list until the node satisfies all
the stepsof the pattern, then checking the appropriate(s) global
templates to seeif there isn't a higher priority rule to apply
- If there is no template, apply the default rule (recurse on
thechildren)
- else walk the content list of the selected templates, for each of them:
- if the node is in the XSLT namespace then the node has a
_privatefield pointing to the preprocessed values, jump to the
specificcode
- if the node is in an extension namespace, look up the
associatedbehavior
- otherwise copy the node.
The closure is usually done through the
XSLTapply-templatesconstruct recursing by applying
theadequate template on the input node children or on the result of
anassociated XPath selection lookup.
Note that large parts of the input tree may not be processed by a
givenstylesheet and that on the opposite some may be processed multiple
times.(This often is the case when a Table of Contents is built). The module transform.c is the one implementing most of
thislogic. xsltApplyStylesheet()is the entry point,
itallocates an xsltTransformContext containing the following: - a pointer to the stylesheet being processed
- a stack of templates
- a stack of variables and parameters
- an XPath context
- the template mode
- current document
- current input node
- current selected node list
- the current insertion points in the output document
- a couple of hash tables for extension elements and functions
Then a new document gets allocated (HTML or XML depending on the type
ofoutput), the user parameters and global variables and parameters
areevaluated. Then xsltProcessOneNode()which implements
the1-2-3 algorithm is called on the root element of the input. Step 1/
isimplemented by calling xsltGetTemplate(), step 2/
isimplemented by xsltDefaultProcessOneNode()and step 3/
isimplemented by xsltApplyOneTemplate(). The XPath support is actually implemented in the libxml module (where itis
reused by the XPointer implementation). XPath is a relatively
classicexpression language. The only uncommon feature is that it is working
on XMLtrees and hence has specific syntax and types to handle them. XPath expressions are compiled using xmlXPathCompile().It
will take an expression string in input and generate a structurecontaining
the parsed expression tree, for example the expression: /doc/chapter[title='Introduction'] will be compiled as Compiled Expression : 10 elements
SORT
COLLECT 'child' 'name' 'node' chapter
COLLECT 'child' 'name' 'node' doc
ROOT
PREDICATE
SORT
EQUAL =
COLLECT 'child' 'name' 'node' title
NODE
ELEM Object is a string : Introduction
COLLECT 'child' 'name' 'node' title
NODE This can be tested using the testXPath command (in thelibxml
codebase) using the --tree option. Again, the KISS approach is used. No optimization is done. This could bean
interesting thing to add. MichaelKay
describesa lot of possible and interesting optimizations done inSaxon
which would be possible at this level. I'm unsure they would providemuch gain
since the expressions tends to be relatively simple in general andstylesheets
are still hand generated. Optimizations at the interpretationsounds likely to
be more efficient. The interpreter is implemented by
xmlXPathCompiledEval()which is the front-end to
xmlXPathCompOpEval()the functionimplementing the evaluation
of the expression tree. This evaluation followsthe KISS approach again. It's
recursive and callsxmlXPathNodeCollectAndTest()to collect
nodes set whenevaluating a COLLECT node. An evaluation is done within the framework of an XPath context stored inan
xmlXPathContextstructure, in the framework of
atransformation the context is maintained within the XSLT context. Its
contentfollows the requirements from the XPath specification: - the current document
- the current node
- a hash table of defined variables (but not used by XSLT)
- a hash table of defined functions
- the proximity position (the place of the node in the current
nodelist)
- the context size (the size of the current node list)
- the array of namespace declarations in scope (there also is a
namespacehash table but it is not used in the XSLT transformation).
For the purpose of XSLT an extrapointer has been
addedallowing to retrieve the XSLT transformation context. When an
XPathevaluation is about to be performed, an XPath parser context is
allocatedcontaining and XPath object stack (this is actually an XPath
evaluationcontext, this is a remain of the time where there was no separate
parsing andevaluation phase in the XPath implementation). Here is an overview
of the setof contexts associated to an XPath evaluation within an
XSLTtransformation: Clearly this is a bit too complex and confusing and should be refactoredat
the next set of binary incompatible releases of libxml. For example
thexmlXPathCtxt has a lot of unused parts and should probably be merged
withxmlXPathParserCtxt. An XPath expression manipulates XPath objects. XPath defines the
defaulttypes boolean, numbers, strings and node sets. XSLT adds the result
treefragment type which is basically an unmodifiable node set. Implementation-wise, libxml follows again a KISS approach,
thexmlXPathObject is a structure containing a type description and the
variouspossibilities. (Using an enum could have gained some bytes.) In the
case ofnode sets (or result tree fragments), it points to a separate
xmlNodeSetobject which contains the list of pointers to the document
nodes: The XPath
API(andits 'internal'part)
includes a number of functions to create, copy, compare, convert orfree XPath
objects. All the XPath functions available to the interpreter are registered in
thefunction hash table linked from the XPath context. They all share the
samesignature: void xmlXPathFunc (xmlXPathParserContextPtr ctxt, int nargs); The first argument is the XPath interpretation context, holding
theinterpretation stack. The second argument defines the number of
objectspassed on the stack for the function to consume (last argument is on
top ofthe stack). Basically an XPath function does the following: - check
nargs for proper handling of errors or functionswith
variable numbers of parameters
- pop the parameters from the stack using
obj
=valuePop(ctxt);
- do the function specific computation
- push the result parameter on the stack using
valuePush(ctxt,res);
- free up the input parameters
with
xmlXPathFreeObject(obj);
- return
Sometime the work can be done directly by modifying in-situ the top
objecton the stack ctxt->value . Not to be confused with XPath object stack, this stack holds the
XSLTvariables and parameters as they are defined through the recursive calls
ofcall-template, apply-templates and default templates. This is used to
definethe scope of variables being called. This part seems to be the most urgent attention right now, first it isdone
in a very inefficient way since the location of the variables andparameters
within the stylesheet tree is still done at run time (it reallyshould be done
statically at compile time), and I am still unsure that myunderstanding of
the template variables and parameter scope is actuallyright. This part of the documentation is still to be written once this part ofthe
code will be stable. TODO There is a separate document explaining how
theextension support works. Michael Kay wrote areally
interesting article on Saxon internalsand the work he did onperformance
issues. I wishes I had read it before starting libxslt design (Iwould
probably have avoided a few mistakes and progressed faster). A lot ofthe
ideas in his papers should be implemented or at least tried inlibxslt. The libxml documentation, especially the I/O interfacesand the memory management. redesign the XSLT stack frame handling. Far too much work is done
atexecution time. Similarly for the attribute value templates handling,
atleast the embedded subexpressions ought to be precompiled. Allow output to be saved to a SAX like output (this notion of SAX like
APIfor output should be added directly to libxml). Implement and test some of the optimization explained by Michael
Kayespecially: - static slot allocation on the stack frame
- specific boolean interpretation of an XPath expression
- some of the sorting optimization
- Lazy evaluation of location path. (this may require more changes
butsounds really interesting. XT does this too.)
- Optimization of an expression tree (This could be done as a
completelyindependent module.)
Error reporting, there is a lot of case where the XSLT
specificationspecify that a given construct is an error are not checked
adequately bylibxslt. Basically one should do a complete pass on the XSLT
spec again andadd all tests to the stylesheet compilation. Using the DTD
provided in theappendix and making direct checks using the libxml validation
API sounds agood idea too (though one should take care of not raising errors
forelements/attributes in different namespaces). Double check all the places where the stylesheet compiled form might
bemodified at run time (extra removal of blanks nodes, hint on
thexsltCompMatch). Daniel Veillard |