The Problem
I have written several pages of content for my personal web site in XML, and I wanted to know which tags I had used across all of my XML source files. The first thing I did was create a subclass of WCMSBaseParser:
class TagLister(WCMSBaseParser):
    def __init__(self, source, cms, menu):
        WCMSBaseParser.__init__(self, source, cms, menu)
        self.taglist = []

    def do_unknowntag(self, node):
        # Record each tag name once, then keep walking the tree
        if node.tagName not in self.taglist:
            self.taglist.append(node.tagName)
        for child in node.childNodes:
            self.parse(child)
All this parser does is build a list of the tags used in one XML file. Next I needed a content manager to process all of my XML files.
class SiteTagLister(WebContentManager):
    def __init__(self, pth):
        WebContentManager.__init__(self, pth)
        self.path = pth
        self.tagdict = {}

    def FindTags(self, report=1):
        filestoupdate = self.FilterInputFiles()
        for file in filestoupdate:
            tp = TagLister(file, None, None)
            tp.Convert()  # Process everything
            # Grab the type of the file
            thistype = tp.source.tagName
            for tag in tp.taglist:
                if tag in self.tagdict:
                    self.tagdict[tag].append((thistype, file))
                else:
                    self.tagdict[tag] = [(thistype, file)]
        # Print each tag, followed by the (type, file) pairs that use it
        for key in sorted(self.tagdict):
            print(key)
            for entry in self.tagdict[key]:
                print('\t%s->%s' % entry)
This gave me a list of every tag, each followed by the files that use it. I also had it report what is essentially the DOCTYPE of each file. Two lines of code were all I needed to start the whole thing:
myLister = SiteTagLister("Zaphod:Web")
myLister.FindTags(1)
So, I asked a question about my data, and spent less than 10 minutes getting the answer. If that isn’t efficiency, I don’t know what is.
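For readers without the WCMS classes at hand, the same idea can be sketched with only the standard library. This is an approximation under stated assumptions, not the WCMS implementation: the names list_tags and build_tag_index and the sample documents are made up for illustration, the documents are passed in as strings rather than found by FilterInputFiles, and the "type" of each document is taken to be the name of its root element.

```python
import xml.dom.minidom
from collections import defaultdict


def list_tags(node, found=None):
    """Return the distinct element names under node, in first-seen order."""
    if found is None:
        found = []
    for child in node.childNodes:
        # Skip text, comments and other non-element nodes
        if child.nodeType == child.ELEMENT_NODE:
            if child.tagName not in found:
                found.append(child.tagName)
            list_tags(child, found)
    return found


def build_tag_index(documents):
    """Map each tag to the (root tag, document name) pairs that use it.

    documents maps a name to XML text (hypothetical input shape);
    a real run would read the files from disk instead.
    """
    index = defaultdict(list)
    for name, text in sorted(documents.items()):
        dom = xml.dom.minidom.parseString(text)
        root_tag = dom.documentElement.tagName
        for tag in list_tags(dom):
            index[tag].append((root_tag, name))
    return dict(index)


if __name__ == "__main__":
    # Made-up sample documents, for illustration only
    samples = {
        "about.xml": "<page><title>About</title><para>Hi <em>there</em></para></page>",
        "news.xml": "<article><title>News</title><para>Text</para></article>",
    }
    index = build_tag_index(samples)
    for tag in sorted(index):
        print(tag)
        for entry in index[tag]:
            print("\t%s->%s" % entry)
```

The report at the bottom uses the same tag, then tab-indented type->file layout as FindTags.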
Why do this?
Each XML source file gets processed by a specific parser, and those parsers, all subclasses of WCMSBaseParser, need to handle every tag. With this list I can double-check my parsers to see if I’ve missed anything. It would also be possible, I suppose, to have the do_unknowntag method raise an error or add a line to a report, but I found this approach easier.
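The cross-check itself could be automated along these lines. This is only a sketch, and it rests on a guess: that a parser handles a tag through a method named do_ plus the tag name, which the name do_unknowntag suggests but which is not confirmed here. The helper unhandled_tags and the ArticleParser class are hypothetical stand-ins, not part of WCMS.

```python
def unhandled_tags(parser_class, tags_in_use):
    """Return, sorted, the tags that parser_class has no do_<tag> method for."""
    return sorted(
        tag for tag in set(tags_in_use)
        if not callable(getattr(parser_class, "do_" + tag, None))
    )


class ArticleParser(object):
    """Hypothetical stand-in for a WCMSBaseParser subclass."""

    def do_title(self, node):
        pass

    def do_para(self, node):
        pass


if __name__ == "__main__":
    # Tags as they might come out of the tag list for one file type
    print(unhandled_tags(ArticleParser, ["title", "para", "em", "code"]))
```

Feeding it the tags collected for one file type and the parser class for that type would list the tags that still need a handler.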