Download the code for this article: XMLFiles0405.exe (143KB)
XML's ubiquity and continually improving tool support has created a magnetism that attracts organizations everywhere. As organizations move to XML, they must also provide a coherent data migration strategy that allows their users to bring old files forward. This is a nontrivial problem that typically requires tedious code to implement the transformation process. The System.Xml namespace in the Microsoft® .NET Framework, however, can greatly simplify these data migration challenges through its extensible APIs and support for XSLT.
An example of data migration is what's currently happening around genealogy data formats. Genealogists have long relied on the Genealogical Data Communications (GEDCOM) 5.5 format for sharing genealogical information. GEDCOM 5.5 is text based but not XML based. A beta version of GEDCOM 6.0 is available and is completely based on XML. But what about the gigabytes of genealogical information that can still be found in GEDCOM 5.5 format? This presents an interesting data migration challenge that really should not be ignored.
In this column I'll walk you through solving this data migration problem using System.Xml. This process can serve as a blueprint for other data migration problems you may face.
GEDCOM was developed to facilitate exchanging genealogical data across different genealogy programs and systems. A common format like GEDCOM allows users to share their work with others regardless of the program they're using.
Version 5.5 is the most widely used version of the various GEDCOM specifications (see The GEDCOM Standard Release 5.5). GEDCOM 5.5 relies on a simple text-based grammar that leverages line delimiters and level numbers to structure family tree information. Figure 1 provides a sample GEDCOM 5.5 file.
Each GEDCOM line contains a level number, a tag, and an optional value. Multiple lines constitute a GEDCOM record. A level of 0 marks the beginning of a new GEDCOM record. Every line that follows is part of the record until you reach another line with a level of 0. The tag name conveys meaning about the information on the line as defined in the specification.
In the example shown in Figure 1, HEAD is the first record and it contains six children (SOUR, DATE, GEDC, CHAR, SUBM, SUBN). SOUR contains two children (VERS and NAME) while DATE contains one child (TIME). The increasing level numbers indicate parent-to-child relationships. A line may also contain a unique identifier, as shown in the following snippet:
The latest GEDCOM specification, version 6.0, defines a full-fledged XML format for GEDCOM information. The specification even provides a Document Type Definition (DTD) that defines the elements and attributes that make up the complete GEDCOM 6.0 vocabulary.
The GEDCOM 6.0 format is much different from the one my GedcomReader simulates. In order to migrate to GEDCOM 6.0, you have to either modify the GedcomReader implementation in order to simulate the new format or write an XSLT that performs the transformation in a subsequent step.
The GEDCOM 6.0 format is much more complex than the simple mapping I simulated in GedcomReader. Consequently, trying to simulate GEDCOM 6.0 in the GedcomReader code would be extremely difficult and error prone. Using an XSLT transformation to accomplish this step is a more tractable problem.
In Figure 7 I've provided an XSLT in the sample project that illustrates how to generate a GEDCOM 6.0 file from the intermediate XML format shown in Figure 3. The XSLT covers the most common GEDCOM 6.0 use cases, and it produces files that pass validation against the GEDCOM 6.0 DTD.
You can use this XSLT by taking advantage of the System.Xml.Xsl.XslTransform class, as shown here:GedcomReader gr = new GedcomReader(gedcomFileName); XmlDocument doc = new XmlDocument(); doc.Load(gr); gr.Close(); // done using GedcomReader XslTransform tx = new XslTransform(); tx.Load("gedcom6.xsl"); FileStream fs = new FileStream("skonnord6.xml", FileMode.Create); tx.Transform(doc, null, fs, null);
The output file, skonnord6.xml, will now be GEDCOM 6.0-compliant. At this point it's also possible to write other XSLT transformations that change the intermediate XML format into another format of your choosing. For example, you could write an XSLT transformation that produces human-readable HTML pages for viewing and navigating the GEDCOM family tree information.
As you can see, it didn't take much code to implement a complete GEDCOM migration path. The ability to programmatically move between GEDCOM 5.5 and GEDCOM 6.0 greatly simplifies the migration scenarios that are involved in building a complete genealogy system around the new GEDCOM 6.0 data model.
Where Are We?
I've walked through a real-world data migration scenario using the System.Xml classes in .NET. I started with the GEDCOM 5.5 format and moved to an intermediate XML format that can be processed in a variety of ways. This is made possible by a custom XmlReader implementation called GedcomReader.
The intermediate XML format can also be transformed into a variety of other formats. I was able to migrate to GEDCOM 6.0 by using an XSLT transformation (see Figure 8).
Figure 8 Migrating to GEDCOM 6.0
The extensibility points provided by System.Xml facilitate dealing with a variety of data migration scenarios like this one. If you have old data formats lying around that would benefit from migrating to XML, follow the approach discussed in this column to implement your own migration path.
Send your questions and comments for Aaron to email@example.com.
Aaron Skonnard teaches at Northface University in Salt Lake City. Aaron coauthored Essential XML Quick Reference (Addison-Wesley, 2001) and Essential XML (Addison-Wesley, 2000), and frequently speaks at conferences. Reach him at http://www.skonnard.com.
From the May 2004 issue of MSDN Magazine.
In this example, the SUBM record has a unique identifier of SUB01. This ID must be unique within the scope of the document. Identifiers make it possible to establish links between records. For example, the HEAD record referenced this SUBM record on the following line of code:
2 CONT Salt Lake City, Utah 84150
1 SUBM @SUB01@
These are the basics of the GEDCOM 5.5 grammar. GEDCOM 5.5 is capable of representing sophisticated tree structures through its use of level numbers, identifiers, and cross-referencing capabilities. I don't have enough space in this column to cover GEDCOM semantics in more detail. Suffice it to say that GEDCOM 5.5 makes it possible to express a wide range of genealogy-related information.
Since GEDCOM was primarily designed to represent tree structures (family trees) in an interoperable manner, moving GEDCOM to an XML format seems like a perfect fit.
Mapping GEDCOM to XML
The GEDCOM 5.5 grammar maps nicely to XML. One way to define a simple XML mapping is to convert each GEDCOM line into an XML element with the same name. The line's optional data, if any, can be placed in an attribute named "value". Identifiers can be placed in an attribute named "id" and references to other records can be placed in an attribute named "idref". Figure 2 may help you visualize the mapping. Applying this simple mapping to the GEDCOM 5.5 file shown in Figure 1 produces the XML document shown in Figure 3.
Figure 2 Mapping GEDCOM 5.5 to GEDCOM XML
The XML version conveys the same information, but now it can be processed by a much wider range of tools and technologies. For example, you could process the XML file with your favorite XML API (such as DOM, SAX, XmlTextReader, or XPathNavigator), a query language like XPath or XQuery, or a transformation language like XSLT. Ultimately, once you have GEDCOM data in XML format, you can do just about anything with it.