The Incomplete Works of Josh English

Writings and Ramblings about almost anything...


The Problem

I have written several pages of content for my personal web site in XML, and I wanted to know which tags I used across all of my XML source files. The first thing I did was create a subclass of WCMSBaseParser:


class TagLister(WCMSBaseParser):
	def __init__(self,source,cms,menu):
		WCMSBaseParser.__init__(self,source,cms,menu)
		self.taglist = []
		
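	def do_unknowntag(self,node):
		# Record each tag name the first time it is seen
		if node.tagName not in self.taglist:
			self.taglist.append(node.tagName)
		# Keep walking so that nested tags are recorded too
		for child in node.childNodes:
			self.parse(child)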

All this parser will do is create a list of tags in one XML file. Now I needed to create a content manager to process all of my XML files.
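
WCMSBaseParser is part of my own code, so here is a minimal stand-alone sketch of the same idea built on the standard library's xml.dom.minidom. It assumes the base parser looks up a handler named do_<tagname> and falls back to do_unknowntag when none exists. MiniBaseParser, MiniTagLister, and list_tags are hypothetical names used only for illustration, not the real WCMS classes.

```python
from xml.dom import minidom


class MiniBaseParser:
    """Hypothetical stand-in for WCMSBaseParser (assumed dispatch scheme)."""

    def parse(self, node):
        # Only element nodes have a tagName; skip text, comments, etc.
        if node.nodeType != node.ELEMENT_NODE:
            return
        # Assumption: handlers are named do_<tagname>, with do_unknowntag
        # as the fallback for any tag that has no handler of its own.
        handler = getattr(self, "do_" + node.tagName, self.do_unknowntag)
        handler(node)

    def do_unknowntag(self, node):
        for child in node.childNodes:
            self.parse(child)


class MiniTagLister(MiniBaseParser):
    """Defines no specific handlers, so every tag reaches do_unknowntag."""

    def __init__(self):
        self.taglist = []

    def do_unknowntag(self, node):
        if node.tagName not in self.taglist:
            self.taglist.append(node.tagName)
        for child in node.childNodes:
            self.parse(child)


def list_tags(xml_text):
    """Return the tag names in xml_text, in order of first appearance."""
    doc = minidom.parseString(xml_text)
    lister = MiniTagLister()
    lister.parse(doc.documentElement)
    return lister.taglist
```

Because the lister defines no tag-specific handlers, every element falls through to do_unknowntag, which is why overriding that one method is enough to see every tag.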


class SiteTagLister(WebContentManager):
	def __init__(self,pth):
		WebContentManager.__init__(self,pth)
		self.path = pth
		self.tagdict = {}
	
	def FindTags(self,report=1):
		filestoupdate = self.FilterInputFiles()
		for file in filestoupdate:
			tp = TagLister(file,None,None)
			tp.Convert()	# Process everything
			# Grab the type of the file
			thistype = tp.source.tagName
			
			for tag in tp.taglist:
				# Start a new list the first time a tag is seen, then record this use
				self.tagdict.setdefault(tag,[]).append((thistype,file))
		for key in sorted(self.tagdict):
			print(key)
			for use in self.tagdict[key]:
				# Each entry is a (document type, file) tuple
				print('\t%s->%s' % use)

This gave me a list of every tag, along with the files that use it. I also added the ability to report the root element of each file, which is essentially its DOCTYPE. To get this information, all I needed was two lines of code to start the whole thing:


	myLister = SiteTagLister("Zaphod:Web")
	myLister.FindTags(1)

So, I asked a question about my data, and spent less than 10 minutes getting the answer. If that isn’t efficiency, I don’t know what is.
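
The same site-wide report can be sketched without the WCMS classes, again using only the standard library. This is an illustration under assumptions, not the code that runs my site: tags_in, build_tag_index, read_xml_files, and format_report are hypothetical helpers, and the sketch assumes the source files end in .xml.

```python
from collections import defaultdict
from pathlib import Path
from xml.dom import minidom


def tags_in(xml_data):
    """Return (root tag name, set of all tag names) for one XML document."""
    root = minidom.parseString(xml_data).documentElement
    tags = {root.tagName}
    # getElementsByTagName("*") matches every descendant element.
    tags.update(el.tagName for el in root.getElementsByTagName("*"))
    return root.tagName, tags


def build_tag_index(sources):
    """Map each tag to a sorted list of (root tag, file name) pairs.

    `sources` is an iterable of (file name, XML data) pairs.
    """
    index = defaultdict(list)
    for name, xml_data in sources:
        root_tag, tags = tags_in(xml_data)
        for tag in tags:
            index[tag].append((root_tag, name))
    return {tag: sorted(uses) for tag, uses in index.items()}


def read_xml_files(folder):
    """Yield (file name, raw bytes) for every .xml file under folder."""
    for path in sorted(Path(folder).rglob("*.xml")):
        yield path.name, path.read_bytes()


def format_report(index):
    """Render the index in the same tag / root->file layout as above."""
    lines = []
    for tag in sorted(index):
        lines.append(tag)
        for root_tag, name in index[tag]:
            lines.append("\t%s->%s" % (root_tag, name))
    return "\n".join(lines)
```

With these in place, print(format_report(build_tag_index(read_xml_files("some/folder")))) would produce the report, where "some/folder" is a placeholder for wherever the XML sources live.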

Why do this?

Each XML source file gets processed by a specific parser, and those parsers, all subclasses of WCMSBaseParser, need to handle each tag, so I can double-check my parsers against this list to see if I’ve missed anything. It would also be possible, I suppose, to have the do_unknowntag method raise an error or add a line to a report, but I found this approach easier.
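
The double check itself could be automated along these lines. This sketch assumes each handler is a method named do_<tagname>, which is a guess based on the do_unknowntag name. ArticleParser and missing_handlers are hypothetical and exist only for this example.

```python
def missing_handlers(parser_class, tags):
    """Return, sorted, the tags that have no do_<tag> method on parser_class."""
    return sorted(
        tag for tag in tags
        if not callable(getattr(parser_class, "do_" + tag, None))
    )


class ArticleParser:
    """Hypothetical parser with handlers for only two tags."""

    def do_title(self, node):
        pass

    def do_p(self, node):
        pass


# Tags found by the lister, checked against the handlers the parser defines.
print(missing_handlers(ArticleParser, {"title", "p", "em", "code"}))
# prints ['code', 'em']
```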