KAR Encapsulation Specification
STATUS:
The content on this page is outdated. The page is archived for reference only. For more information about current work, please contact the Framework team.
Overview
The purpose of this file format is to encapsulate all of the data and services (actors) in a workflow, as well as the workflow itself and any nested workflows, into a single archive file for ease of transport between Kepler clients. The vision is for a file, similar to a JAR file, that contains a manifest of the entities contained within the archive. The difference between a JAR file and a KSW file is that the contents would include annotation information for use with the Kepler ontology system. A user would be able to "drop" a KSW file into Kepler, and Kepler would recognize the file and automatically load the workflow and all of its pieces contained in the KSW. Multiple KSW files could also be dropped into Kepler to replace the basicKeplerActorLibrary.xml file, so that the actor library can be generated directly by the KSW subsystem (thus eliminating the need for the basicKeplerActorLibrary.xml file altogether).
Requirements
- must contain Java class files and other binary executable files
- must contain native libraries
- must contain MoML and other XML-based text
- must contain an OWL document with semantic ordering for the contained objects
- must contain data in binary or ASCII formats; the data itself may be a zip-type file
- optimally, the archive would be compressed for efficient network transfers
- each object in the KSW must have an ID (LSID) and a type; these will be stored in the manifest
- each item in the KSW should also have a dependency tree associated with it
File format
We have talked about using existing formats, such as tar, gzip, jar, etc., for the KSW file. Since Kepler is a Java application, it seems logical to use a jar file. Dan has pointed out an interesting article from IBM on jar files with internal jar files. This article touches on some class loading and organizational issues that will probably come up with the KSW.
Classloaders
The object manager will need a custom classloader (or loaders, if we go with the method described in the IBM article) to handle the special loading and unloading of class files associated with the behavior of KSW files. When a KSW file is loaded at run-time, the classloader will need to check whether the classes in the KSW have already been loaded.
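As a rough illustration of the behavior described above, the sketch below (the class name KSWClassLoader is an assumption, not actual Kepler API) builds on URLClassLoader, whose standard parent-delegation model already ensures a class loaded higher in the hierarchy is never loaded a second time:

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical sketch: a KSW classloader that relies on the standard
// parent-delegation model, so classes already loaded by a parent loader
// (e.g. the common actors) are reused rather than loaded again.
class KSWClassLoader extends URLClassLoader {
    KSWClassLoader(URL[] kswClasspath, ClassLoader parent) {
        super(kswClasspath, parent);
    }

    // True only if THIS loader has already defined the class; classes loaded
    // by a parent are found through the inherited loadClass() delegation.
    boolean definedLocally(String className) {
        return findLoadedClass(className) != null;
    }
}
```

The inherited loadClass() first consults the parent chain, which is exactly the "check whether the classes have already been loaded" step.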
User interface for creating KSW files
From within Kepler, a user must have the ability to encapsulate their workflow into a KSW file. This could be done in several different ways:
- Menu item accessible from the right-click menu on the canvas
- Menu item accessible from the menu bar
- Toolbar button
If each actor in the actor library is, itself, a KSW file, then a KSW file packaged by a user for an entire workflow would just contain nested KSW files. When an actor or composite is referenced inside the main workflow, its LSID would be found in the manifest and its KSW would be parsed and loaded recursively.
Loading KSW files
Kepler must have the ability to accept KSW files from a user and to dynamically load the executable code and workflow(s) into the system at runtime. Much of the functionality needed to load the KSW will probably be provided by the Kepler Object Manager (KOM). The rough algorithm for loading the KSW files will probably look something like this:
- unpack the KSW file
- locate the manifest and/or ontology
- analyze the ontology to find the workflow(s)
- find all java classes needed by the workflow
- load any classes that weren't already loaded by the classloader (note that the more common actors may already be loaded and thus can't be loaded again with the default classloader).
- move data into a usable area in the KOM (i.e. cache it)
- load the workflow in Kepler
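The first steps of the algorithm above can be sketched as follows. This is a hypothetical illustration (KSWLoaderSketch and the comment placeholders are assumptions; the KOM hand-off points do not exist yet), using the standard java.util.jar API since a KSW is jar-based:

```java
import java.io.File;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical sketch of the KSW loading pipeline; only the archive/manifest
// steps are concrete, the rest are placeholders for future KOM integration.
class KSWLoaderSketch {
    // Opens the archive and returns its manifest (steps 1 and 2); returns
    // null when the archive has no manifest.
    static Manifest readManifest(File kswFile) throws Exception {
        try (JarFile ksw = new JarFile(kswFile)) {   // 1. unpack/open the KSW
            return ksw.getManifest();                // 2. locate the manifest/ontology
            // 3. analyze the ontology to find the workflow(s)
            // 4./5. load classes not already loaded (delegation avoids duplicates)
            // 6. cache data in the Kepler Object Manager
            // 7. load the workflow into Kepler
        }
    }
}
```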
The case arises here where a user might drop in a KSW with an actor that has a different annotation from one already loaded into Kepler. If this happens, the actor should be loaded into the actor library in both places (i.e., the original location and the new, user-defined location).
Actors will also need to be versioned through their IDs. If a workflow designed to work with version 1 of an actor is packaged into a KSW file, and the system into which it is being dropped already has version 2 loaded, the system must recognize this and load the older actor as well. It should probably be pointed out to the user that the actor they are using is old or deprecated.
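A minimal sketch of the version check described above, assuming the LSID convention used elsewhere in this document, where the final colon-separated field is the revision number (e.g. urn:lsid:kepler-project.org:actor:1:2). The class and method names are illustrative only:

```java
// Illustrative sketch: comparing actor versions by the trailing revision
// field of an LSID. Not part of any actual Kepler API.
class LsidVersion {
    // Returns the trailing revision number of an LSID, e.g. 2 for
    // "urn:lsid:kepler-project.org:actor:1:2".
    static int revisionOf(String lsid) {
        String[] parts = lsid.split(":");
        return Integer.parseInt(parts[parts.length - 1]);
    }

    // True when the archived actor is older than the installed one, in which
    // case the older version must be loaded alongside it and the user warned.
    static boolean isOlder(String archivedLsid, String installedLsid) {
        return revisionOf(archivedLsid) < revisionOf(installedLsid);
    }
}
```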
Security
We will need to sign KSW files and have a security manager in place, probably based on current Java sandboxing techniques.
Defining Actors, Workflows, Archiving and Publishing
After much discussion, it was decided that the definition of "actor" within this system was somewhat ambiguous. Here, I try to clarify the definitions of several terms and describe how these interrelated systems will work.
Atomic Actors
An actor, as the user sees it, is the small icon that is dragged onto the canvas. To the user, each box on the canvas is a distinct actor, even though several of these boxes might be based upon the same Java class and simply be a soft-coded, parameterized modification of the base class. Because of this, the KSW and object manager (OM) should model actors in a different way than has previously been discussed.
The Java class, which could be the base for many actors, should be a unique object contained by the OM and have a unique ID. This class would have one or more metadata files associated with it. Each metadata file would correspond to what a user sees as an actor. The metadata would describe the port setup, dependency information, ontological annotations, and human-readable documentation. Each of these metadata files would be unique within the OM and have its own ID. If the base Java class were to change (it gets updated and recompiled for some reason), the object manager would consider this updated class a new object with a different ID, thereby preserving the referential integrity of the actors. To use this new class, the actor metadata would have to be updated and, like the class, be assigned a new ID. This ensures that if an actor says it uses a specific class, that one class will always be used, no matter what the updates. This is necessary for data provenance and other lineage tracking.
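The one-class-to-many-metadata model above can be sketched with two simple maps. Everything here (class name, method names, the idea of storing class bytes) is a hypothetical illustration, not the actual object manager design:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the model: one immutable class object per ID, with any
// number of actor-metadata records pointing at it. Updating either the class
// or the metadata yields a NEW object with a NEW ID, never a mutation.
class ObjectManagerSketch {
    final Map<String, byte[]> classObjects = new HashMap<>();  // class ID -> compiled bytes
    final Map<String, String> actorMetadata = new HashMap<>(); // metadata ID -> class ID

    void registerClass(String classId, byte[] bytes) {
        classObjects.put(classId, bytes);
    }

    // Each metadata file (what the user sees as an actor) pins exactly one
    // class ID, preserving referential integrity for provenance.
    void registerActor(String metadataId, String classId) {
        actorMetadata.put(metadataId, classId);
    }

    String classFor(String metadataId) {
        return actorMetadata.get(metadataId);
    }
}
```

The point of the design is visible in the lookup: an actor's metadata ID always resolves to the same class ID, no matter what newer versions exist.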
Workflows
Given that actors are now just objects with IDs, the job of the workflow document itself is simplified. It basically serves as a mechanism to link actors together and (at least in the meantime) to specify the visual layout for those actors. A workflow is just another metadata document, so it will be assigned an ID, and the OM will provide version control for it as well.
A workflow can also be a composite actor within the system. In this case, there are two different ways we could use them in the OM.
- We could treat the composite as a blackbox and have a metadata description for it (just like for class based atomic actors)
- We could add additional metadata into the <group> element of the existing moml and parse it from within the file.
I think #2 is problematic because I would like to think of all actors (composite or atomic) as being described by the same kind of metadata document, one that describes a base class (i.e., generalize the concept of an actor). If we use a standard metadata format for both atomics and composites, it makes things simpler.
In any case, the MoML parser will need to be modified to handle ID references instead of hard-coded class names. The parser will need to recognize that a reference to an LSID has been made, then call the OM to locate the object with that ID and instantiate it (this may be a Java class, or it might just be another metadata file).
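The parser change described above boils down to one branch: does the reference look like an LSID or a class name? A hedged sketch, where ReferenceResolver and the ObjectManager interface are assumptions standing in for the real OM API:

```java
// Hypothetical sketch of the MoML parser change: references that look like
// LSIDs are handed to the object manager for resolution, while plain class
// names pass through unchanged.
class ReferenceResolver {
    interface ObjectManager {
        // Resolves an LSID to a class name or metadata location (assumed API).
        String locate(String lsid);
    }

    private final ObjectManager om;

    ReferenceResolver(ObjectManager om) {
        this.om = om;
    }

    // An attribute value starting with "urn:lsid:" is treated as a reference
    // to be resolved by the OM; anything else is a hard-coded class name.
    String resolve(String ref) {
        if (ref.startsWith("urn:lsid:")) {
            return om.locate(ref);
        }
        return ref;
    }
}
```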
Publishing
If a user has a workflow or actor that s/he wants to share, it should be possible to publish it to the grid. The OM should also be able to check for updated components on the grid. The OM should keep track of which objects come from which server(s), and should know that if a workflow is introduced with an ID referring to an object on a different server, it can go and get the object.
The OM needs to be smart enough to realize that when a KSW is downloaded from the grid, there will probably be some collisions in the annotations, file names, IDs, etc. It has to manage all of these collisions in a user-friendly way. Some items might already be loaded and can be discarded.
Grid server technology
With a model such as this, where we have a metadata document describing some other object (maybe a binary one), we will need a server that is capable of capturing, serving, and searching metadata while linking multiple metadata files to one non-metadata object. It occurs to me that Metacat would work perfectly for this purpose. The actors and workflows are now just XML metadata documents that need to link to a Java class. Metacat can store the actors and even provide the link to the binary class file. It could also store binary shared-object (DLL) libraries, which will be linked to through the actor's dependency information.
Metacat's web interface would make it easy for users to see what components are stored on the server, and its replication features would make it easy to keep multiple copies of components on different servers.
Archive vs. Transport KSWs
The KSW actually encompasses two different ideas. One is that the user should be able to completely archive a workflow with all of its dependent files/libraries. This would allow the KSW to be shipped to a user with a completely different version of Kepler and still have the WF inside the KSW run. It would also allow the WF to run in the future, after version changes in much of the underlying software. This file would be loaded with all libraries, metadata, and data needed to run the WF, and thus might be very large. This mode could also be thought of as a "snapshot" of the WF.
The second idea behind the KSW is to use it as a transport mechanism. In this mode, the KSW would probably not preserve the state of the WF for posterity. It would be stripped down to the bare necessities, containing only those objects that could not be found on a server (the grid) or in the basic Kepler installation. This would allow a user to email (or ftp, or whatever) the KSW to a colleague and allow that colleague to run the WF inside the KSW with little trouble. This would not be an archive, though, because it would lack the actual executable objects needed to run the WF. Instead, it would be a lightweight version suitable for sending around to other users.
Draft API
public class KSWFile extends java.util.jar.JarFile {
    /**
     * Adds an attribute to the manifest for the KSW object with the specified LSID.
     * @param lsid the LSID of the KSW object that you are setting the attribute for.
     *        If this LSID has not been added to the KSW, an lsidNotFoundException
     *        will be thrown.
     * @param attributeName the name of the attribute that you want to add.
     * @param attributeValue the value of the attribute that you want to add.
     * @throws lsidNotFoundException if the object with lsid has not been added to the KSW yet.
     */
    public void addManifestAttribute(String lsid, String attributeName, Object attributeValue)
        throws lsidNotFoundException;
}

public class KSWEntry extends java.util.jar.JarEntry {
    /**
     * The constructor. The entry is referenced by LSID.
     * @param lsid the LSID of the KSWEntry
     */
    public KSWEntry(String lsid);
}
Draft Schema for KSW metadata file
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:sms="http://seek.ecoinformatics.org/sms"
           xmlns="http://kepler-project.org/actorMetadata"
           targetNamespace="http://kepler-project.org/actorMetadata">

  <xs:element name="metadata" type="KeplerMetadata"/>

  <xs:complexType name="KeplerMetadata">
    <xs:sequence>
      <xs:group ref="DescriptionGroup"/>
      <xs:element name="ports">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="port" minOccurs="1" maxOccurs="unbounded">
              <xs:complexType>
                <xs:sequence>
                  <xs:group ref="DescriptionGroup" minOccurs="0"/>
                </xs:sequence>
                <xs:attribute name="name" type="xs:string" use="required"/>
                <xs:attribute name="type" type="xs:string" use="required"/>
                <xs:attribute name="flow" type="PortType" use="required"/>
                <xs:attribute name="multi" type="MultiType" use="required"/>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="dependencies">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="depend" maxOccurs="unbounded">
              <xs:complexType>
                <xs:sequence>
                  <xs:group ref="DescriptionGroup" minOccurs="0"/>
                  <xs:element name="os" type="OSType" minOccurs="0" maxOccurs="unbounded"/>
                </xs:sequence>
                <xs:attribute name="lsid" type="xs:string" use="required"/>
              </xs:complexType>
            </xs:element>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
      <xs:element name="semtypes">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="semtype" type="sms:SemanticType" minOccurs="0" maxOccurs="unbounded"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attribute name="id" type="xs:string" use="required"/>
  </xs:complexType>

  <xs:group name="DescriptionGroup">
    <xs:sequence>
      <xs:element name="description">
        <xs:complexType mixed="true">
          <xs:sequence>
            <xs:element name="short" type="xs:string"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
  </xs:group>

  <xs:simpleType name="MultiType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="true"/>
      <xs:enumeration value="false"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="PortType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="input"/>
      <xs:enumeration value="output"/>
      <xs:enumeration value="both"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="OSType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="windows"/>
      <xs:enumeration value="linux"/>
      <xs:enumeration value="osx"/>
      <xs:enumeration value="solaris"/>
      <xs:enumeration value="os/2"/>
      <xs:enumeration value="bsd"/>
      <xs:enumeration value="unix"/>
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
Example metadata doc
<?xml version="1.0" encoding="UTF-8"?>
<kepler:metadata xmlns:kepler="http://kepler-project.org/actorMetadata"
                 xmlns:sms="http://seek.ecoinformatics.org/sms"
                 id="urn:lsid:kepler-project.org:actor:1:1">
  <description><short>This actor is a string constant.</short>
    It can be used to supply basic static input to other actors and workflows.
  </description>
  <ports>
    <port name="input0" type="string" flow="input" multi="false">
      <description><short>This is the first input port.</short>
        It takes data in.
      </description>
    </port>
    <port name="input1" type="float" flow="input" multi="false"/>
    <port name="output0" type="string" flow="output" multi="false"/>
  </ports>
  <dependencies>
    <depend lsid="urn:lsid:kepler-project.org:jar:1:1">
      <description><short>This is a dependency for a jar file.</short></description>
      <os>windows</os>
      <os>linux</os>
      <os>osx</os>
    </depend>
    <depend lsid="urn:lsid:kepler-project.org:shared_object:1:1">
      <description><short>This is an SO file for linux.</short></description>
      <os>linux</os>
    </depend>
    <depend lsid="urn:lsid:kepler-project.org:shared_object:1:1">
      <description><short>This is an SO file for OSX.</short></description>
      <os>osx</os>
    </depend>
    <depend lsid="urn:lsid:kepler-project.org:dll:1:1">
      <description><short>This is a dll file for windows.</short></description>
      <os>windows</os>
    </depend>
  </dependencies>
  <semtypes>
    <semtype id="urn:lsid:kepler-project.org:semtype:1:1">
      <sms:Label name="Constant" resource="urn:lsid:kepler-project.org:actor:1:1"/>
      <sms:Label name="ConstantActor" resource="http://seek.ecoinformatics.org/ontology#ConstantActor"/>
      <sms:Annotation object="Constant" meaning="ConstantActor"/>
    </semtype>
  </semtypes>
</kepler:metadata>
KSW metadata file use cases
- An atomic actor needs structured metadata so it can be automatically classified in the actor library.
- A user archives a workflow into a KSW file, and each actor in that workflow needs structured metadata associated with it. The metadata and objects are linked in the manifest of the KSW.
- A composite actor is archived. Like a workflow, each actor needs documentation for automated re-ingestion.
- Each actor needs basic, human-readable documentation for the automated documentation system (javadoc) and for tooltips and user docs within the GUI.
- The metadata could be used to preprocess actors/workflows and pre-load them before invoking the MoML processor within Kepler.
Discussion of options
There is some confusion about implementing this because the KSW file is trying to fill at least three different goals: 1) an archive, 2) an exchange format, and 3) an internal actor/WF classification system (aka the actor library). In addition to these three things, we are also talking about doing this both in an autonomous local system and in a distributed/client-server architecture. In case 1, the KSW needs to contain the contents of an entire workflow; this includes all dependent jars, ontologies, data, etc. In case 2, the KSW could contain references to already-registered components that can be found on a server or local to all Kepler installations (thus making the KSW file much smaller and easier to handle).
If the manifest/metadata for KSW files is going to be an XML document, this would allow us to use Metacat servers to store KSWs in a distributed way. We might also want P2P for computational distribution, but as far as registering and storing actors and ontologies goes, Metacat would probably fit the bill well.
Developer Comments
---Dan Higgins---
Comments about the classloader needing to determine if a class is already loaded appear in several places. This function will be handled automatically by new classloaders: the system looks at all classloaders above a new one in the hierarchy and checks whether the class is already loaded, so simply extending existing classloaders should handle this behavior automatically. A good reference is the book 'Component Development for the Java Platform' by Stuart Dabbs Halloway. See (http://www.develop.com/us/technology/developmentorseries.aspx) for a downloadable copy of the book.
---Shawn Bowers---
Here are some high-level comments and observations:
- The term "actor" is ambiguous in Ptolemy/Kepler. The term "actor" is used to denote: (1) the java class that implements the actor; (2) a parameterized version of an actor, e.g., a particular use of the expression actor parameterized with a formula or a configured web service; and (3) a particular "copy" of an actor within a workflow (i.e., actor A may occur n times within a workflow, where each occurrence is a "copy"). The first two actor denotations are important to support for archival and interchange. The third can be viewed as an "instance" of either type (1) or (2). Of course, if the "copy" is further configured, it may become a new "actor" of type (2).
- It is important to clarify the use of identifiers and the impact that adding identifiers may have on Kepler. Currently, identifiers are not explicitly used in Ptolemy/Kepler (besides the actor lib. extension prototype). If one wishes to save a workflow (composite actor), each actor description is "copied" into the resulting MoML file (whose name serves as a local identifier). Implicitly, the identity of an actor is based on the java class that implements the actor (thus, one can distinguish type (1) actors based on the class). Ptolemy/Kepler currently do not have a convenient mechanism to explicitly check equality among actors, e.g., to determine if two type (3) actors are a copy of the same type (1) or type (2) actor. One can check if two actors are of the same java class.
- Identifiers are useful for a number of reasons. First, identifiers reduce the need to store redundant information. For example, one can store all the information concerning an item in one place. And, references to the item can be made via the identifier, instead of copying the item. Identifiers also allow information concerning an item to be "broken into pieces" and stored in multiple places (for example, performing "vertical" partitioning). Identifiers can also make checking whether two items are equivalent trivial: two items are equivalent if they have the same identifier. Separately, one can determine whether two items are value equivalent.
- For the local case (i.e., kepler "out of the box"), identifiers could be used as follows. First, we distinguish between actor definitions and actor references. An actor definition is where the actor is "tied" to an identifier. For example, we might use the tag <entity id="urn:lsid:kepler-project.org:1:1"> to start an actor definition, and sub-elements would further define the actor, e.g., we could give it a standard name <property name="name" value="Expression"/>, define its class <property name="class" value="ptolemy.actor.lib.Expression"/>, ports, comments, and so on. For type (2) actors, we can do a similar thing: define the entity with an id, perhaps state that it "specializes" a particular type (1) actor (and thus, not requiring a class property) and define its specific configuration. Alternatively, for cases when the actor is being used in a workflow (composite actor), we would reference the actor (instead of "copy" it). For example, we might say <entity name="Expression2" ref="urn:lsid:kepler-project.org:1:1"> and possibly include any additional information about the copy such as its location on the canvas. Here we use the name "Expression2" to identify a particular copy within the workflow. Note that the workflow itself (composite actor) is a definition, and can also be referenced, e.g., as a sub-workflow.
- For archival and distribution, actor definitions (but not the particular actor references) would require additional information such as library dependencies and so on.
- Also, with this approach, the question arises as to whether semantic types should be included (and even intertwined) in actor definitions. By intertwined, we mean, e.g., embedded in port definitions. In general, this decision is simply an implementation detail. Here are some arguments for separating semantic types from actor definitions. However, these are primarily based on implementation convenience. The current semantic type language will be designed to support (in a limited way) either approach (but not intertwined metadata definition). A counter-argument to those given below is that there will be a single place in which the semantic type(s) of an actor can be found (i.e., in the actor definition itself).
- Semantic types are used for both data sets and actors, making it convenient to have a single representation for both, that does not depend on the particular metadata language used for either.
- There may be many semantic types for an actor (e.g., made by different users or generated by an actor's usage within a workflow).
- The semantic type(s) of an actor may not be known a priori, and instead may evolve over time. This situation is problematic because in a distributed environment an actor definition may be at various sites (each requiring updating, instead of adding a single semantic type).
- Because semantic types will be used for search, they should be stored in a single location. If semantic types are embedded within actors (and/or datasets), they would require "harvesting".
- Finally, given this approach, the KSW archive would need to include an additional "index" file listing the actor definitions (i.e., where in the archive the actor definitions can be found), the semantic annotations (if not included in the actor definitions themselves, then where in the archive the annotations can be found), ontology definitions for semantic types (i.e., where the ontology definitions are located in the archive), where the relevant libraries are located in the archive, and so on. Note the index would only include the items that have been explicitly archived.
Implementation road map for the KSW functionality
I would like to start implementing the KSW and associated framework ASAP. I would like to start by manually creating a KSW file with all the needed metadata and then building a parser to access it programmatically.
The next step would be to automate the creation of the KSW from within Kepler. At first this would be a one-way process (i.e., you could build a KSW but would not yet be able to read one back into Kepler).
Once Kepler can create a KSW, work will begin on building the basic subsystems of the object manager so that KSW files can be read back into the system and used to build the actor library.