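Before anything else: there will probably be some duplicate-question suggestions, and I doubt this is the first time this has been asked, so I will keep it as brief as possible. The title gives the basic idea.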
Here is an example XML (case 1):
<root>
  <Item>
    <ItemID>4504216603</ItemID>
    <ListingDetails>
      <StartTime>10:00:10.000Z</StartTime>
      <EndTime>10:00:30.000Z</EndTime>
      <ViewItemURL>http://url</ViewItemURL>
      ....
  </Item>
Here is an example XML (case 2):
  <Item>
    <ItemID>4504216604</ItemID>
    <ListingDetails>
      <StartTime>10:30:10.000Z</StartTime>
      <!-- Start difference from case 1 -->
      <averages>
        <AverageTime>value1</AverageTime>
        <category type="TX">9823</category>
        <category type="TY">9112</category>
        <AveragePrice>value2</AveragePrice>
      </averages>
      <!-- End difference from case 1 -->
      <EndTime>11:00:10.000Z</EndTime>
      <ViewItemURL>http://url</ViewItemURL>
      ....
  </Item>
</root>
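I borrowed this XML from Google. In any case, my objects are not always the same; sometimes there are extra elements, as in case 2. Now I would like to produce a CSV like this from both cases: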
ItemID,StartTime,EndTime,ViewItemURL,AverageTime,AveragePrice
4504216603,10:00:10.000Z,10:00:30.000Z,http://url
4504216604,10:30:10.000Z,11:00:10.000Z,http://url,value1,value2
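The first line is the header, and it should also be included in the CSV. I got some useful links today, but I really don't know what the right/optimal approach is. I have been struggling with this for 3 days now and am not really willing to give up yet.
Tell me how you think you would solve this.
I forgot to mention that this is a very large XML file, up to 1 GB.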
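BOUNTY UPDATE:
I am looking for a more generic approach, meaning that it should work for any number of nodes at any depth. Sometimes, as in the example XML, one item object has more nodes than the next/previous one, so that case should be handled as well (with all columns and values lining up in the CSV).
It can also happen that nodes have the same name/localName but different values and attributes; if so, a new column should appear in the CSV with the appropriate value. (I added an example of this inside the <averages> tag, called category.)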
Solution
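The code provided should be considered a sketch rather than the finished article. I am not a SAX expert, and the implementation could be improved for better performance, simpler code, and so on. That said, SAX should be able to cope with streaming large XML files.
I would approach this problem with two passes using the SAX parser. (Incidentally, I would also use a CSV-generating library to create the output, as this would deal with the fiddly details CSV involves, such as quoting and escaping, but I have not implemented that in my sketch.)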
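To make the escaping concern concrete, here is a minimal sketch of the kind of work a CSV library would do for each field. It assumes RFC 4180 style quoting; the class and method names (`CsvEscape`, `escape`, `toCsvLine`) and the sample values are made up for illustration, and a real library would cover more cases.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the answer's sketch.
class CsvEscape {

    // Quote a field only when it contains a comma, a double quote, CR or LF.
    // Embedded double quotes are doubled, as in RFC 4180.
    static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Join already-ordered fields into one CSV line.
    static String toCsvLine(List<String> fields) {
        return fields.stream().map(CsvEscape::escape).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        // Made-up values chosen to trigger quoting.
        System.out.println(toCsvLine(Arrays.asList("4504216603", "Smith, John", "say \"hi\"")));
        // prints: 4504216603,"Smith, John","say ""hi"""
    }
}
```

In the sketch further down, the plain `sb.append(d)` calls are where such a helper (or a library writer) would slot in.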
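First pass: determine the header columns
Second pass: output the CSV
I assume that the XML file is well formed. I assume that we do not have a schema/DTD with a predefined order.
In the first pass I have assumed that a CSV column will be added for every XML element containing text content and for every attribute (I have assumed attributes will contain something!).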
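The header bookkeeping of the first pass can be illustrated in isolation. The following is only a sketch under the assumption that each item has already been reduced to the list of column names it produced; `HeaderMerge` and its methods are hypothetical names. For every name it keeps the largest count seen in any single item, so repeated elements such as `category` get one column per occurrence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper showing the counting idea only (no XML parsing here).
class HeaderMerge {

    // Count how often each column name occurs within one item, in first-seen order.
    static Map<String, Integer> countNames(List<String> names) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String n : names) {
            counts.merge(n, 1, Integer::sum);
        }
        return counts;
    }

    // Keep, per name, the largest count seen in any item; new names are appended.
    static void mergeMax(Map<String, Integer> total, Map<String, Integer> item) {
        for (Map.Entry<String, Integer> e : item.entrySet()) {
            total.merge(e.getKey(), e.getValue(), Integer::max);
        }
    }

    // Expand the counts into the flat header row, repeating names as needed.
    static List<String> expand(Map<String, Integer> total) {
        List<String> header = new ArrayList<>();
        for (Map.Entry<String, Integer> e : total.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                header.add(e.getKey());
            }
        }
        return header;
    }

    public static void main(String[] args) {
        Map<String, Integer> total = new LinkedHashMap<>();
        mergeMax(total, countNames(Arrays.asList(
                "ItemID", "StartTime", "EndTime", "ViewItemURL")));
        mergeMax(total, countNames(Arrays.asList(
                "ItemID", "StartTime", "AverageTime", "category", "type",
                "category", "type", "AveragePrice", "EndTime", "ViewItemURL")));
        System.out.println(String.join(",", expand(total)));
        // prints: ItemID,StartTime,EndTime,ViewItemURL,AverageTime,category,category,type,type,AveragePrice
    }
}
```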
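The second pass, having established the number of target columns, does the actual CSV output.
Based on your example XML, my code sketch would produce: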
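ItemID,StartTime,EndTime,ViewItemURL,AverageTime,category,category,type,type,AveragePrice
4504216603,10:00:10.000Z,10:00:30.000Z,http://url,,,,,,
4504216604,10:30:10.000Z,11:00:10.000Z,http://url,value1,9823,9112,TX,TY,value2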
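How one item's values are laid out against the header row can be sketched separately from the parsing. This is an illustration, not the code used in the answer: it swaps in a plain `Map` of queues (hypothetical class `RowFill`) so that each repeated column name consumes the next value for that name, and columns the item lacks become empty fields.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: assigns collected values to header positions.
class RowFill {

    // Record one value for a column name, keeping insertion order per name.
    static void put(Map<String, Deque<String>> values, String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayDeque<>()).addLast(value);
    }

    // Walk the header left to right; each occurrence of a name takes the next
    // unused value for that name, or "" when none is left. Consumes the queues.
    static String[] fillRow(String[] header, Map<String, Deque<String>> values) {
        String[] row = new String[header.length];
        for (int i = 0; i < header.length; i++) {
            Deque<String> queue = values.get(header[i]);
            row[i] = (queue == null || queue.isEmpty()) ? "" : queue.pollFirst();
        }
        return row;
    }

    public static void main(String[] args) {
        String[] header = {"ItemID", "category", "category", "type", "type", "AveragePrice"};

        Map<String, Deque<String>> item = new HashMap<>();
        put(item, "ItemID", "4504216604");
        put(item, "category", "9823");
        put(item, "type", "TX");
        put(item, "category", "9112");
        put(item, "type", "TY");
        put(item, "AveragePrice", "value2");

        System.out.println(String.join(",", fillRow(header, item)));
        // prints: 4504216604,9823,9112,TX,TY,value2
    }
}
```

Unlike a set-based multimap, the queues keep duplicate values, so two `category` elements with identical text would still fill two columns.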
Note that I have used the Google Collections LinkedHashMultimap, as it is helpful when associating multiple values with a single key. I hope you find this useful!
import com.google.common.collect.LinkedHashMultimap;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map.Entry;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class App {

    public static void main(String[] args) throws SAXException, FileNotFoundException, IOException {
        // First pass - to determine headers
        XMLReader xr = XMLReaderFactory.createXMLReader();
        HeaderHandler handler = new HeaderHandler();
        xr.setContentHandler(handler);
        xr.setErrorHandler(handler);
        FileReader r = new FileReader("test1.xml");
        xr.parse(new InputSource(r));

        // Expand the per-name counts into a flat array of column headers
        LinkedHashMap<String, Integer> headers = handler.getHeaders();
        int totalnumberofcolumns = 0;
        for (int headercount : headers.values()) {
            totalnumberofcolumns += headercount;
        }
        String[] columnheaders = new String[totalnumberofcolumns];
        int i = 0;
        for (Entry<String, Integer> entry : headers.entrySet()) {
            for (int j = 0; j < entry.getValue(); j++) {
                columnheaders[i] = entry.getKey();
                i++;
            }
        }
        StringBuilder sb = new StringBuilder();
        for (String h : columnheaders) {
            sb.append(h);
            sb.append(',');
        }
        System.out.println(sb.substring(0, sb.length() - 1));

        // Second pass - collect and output data
        xr = XMLReaderFactory.createXMLReader();
        DataHandler datahandler = new DataHandler();
        datahandler.setHeaderArray(columnheaders);
        xr.setContentHandler(datahandler);
        xr.setErrorHandler(datahandler);
        r = new FileReader("test1.xml");
        xr.parse(new InputSource(r));
    }

    public static class HeaderHandler extends DefaultHandler {

        private String content;
        private String currentElement;
        private boolean insideElement = false;
        private Attributes attribs;
        // Column-name counts for the item currently being parsed
        private LinkedHashMap<String, Integer> itemHeader;
        // Largest count seen per column name across all items
        private LinkedHashMap<String, Integer> accumulativeHeader = new LinkedHashMap<String, Integer>();

        public HeaderHandler() {
            super();
        }

        private LinkedHashMap<String, Integer> getHeaders() {
            return accumulativeHeader;
        }

        private void addItemHeader(String headerName) {
            if (itemHeader.containsKey(headerName)) {
                itemHeader.put(headerName, itemHeader.get(headerName) + 1);
            } else {
                itemHeader.put(headerName, 1);
            }
        }

        @Override
        public void startElement(String uri, String name, String qName, Attributes atts) {
            if ("item".equalsIgnoreCase(qName)) {
                itemHeader = new LinkedHashMap<String, Integer>();
            }
            currentElement = qName;
            content = null;
            insideElement = true;
            attribs = atts;
        }

        @Override
        public void endElement(String uri, String name, String qName) {
            if (!"item".equalsIgnoreCase(qName) && !"root".equalsIgnoreCase(qName)) {
                // One column for the element's text content, if it has any
                if (content != null && qName.equals(currentElement) && content.trim().length() > 0) {
                    addItemHeader(qName);
                }
                // One column per attribute
                if (attribs != null) {
                    int attsLength = attribs.getLength();
                    for (int i = 0; i < attsLength; i++) {
                        String attName = attribs.getLocalName(i);
                        addItemHeader(attName);
                    }
                }
            }
            if ("item".equalsIgnoreCase(qName)) {
                // Merge this item's counts into the accumulated header, keeping the maximum
                for (Entry<String, Integer> entry : itemHeader.entrySet()) {
                    String headerName = entry.getKey();
                    Integer count = entry.getValue();
                    if (accumulativeHeader.containsKey(headerName)) {
                        if (count > accumulativeHeader.get(headerName)) {
                            accumulativeHeader.put(headerName, count);
                        }
                    } else {
                        accumulativeHeader.put(headerName, count);
                    }
                }
            }
            insideElement = false;
            currentElement = null;
            attribs = null;
        }

        @Override
        public void characters(char ch[], int start, int length) {
            // Note: SAX may deliver one text node in several calls; this sketch keeps only the last chunk
            if (insideElement) {
                content = new String(ch, start, length);
            }
        }
    }

    public static class DataHandler extends DefaultHandler {

        private String content;
        private String currentElement;
        private boolean insideElement = false;
        private Attributes attribs;
        // Values collected for the current item, keyed by column name
        private LinkedHashMultimap<String, String> dataMap;
        private String[] headerArray;

        public DataHandler() {
            super();
        }

        @Override
        public void startElement(String uri, String name, String qName, Attributes atts) {
            if ("item".equalsIgnoreCase(qName)) {
                dataMap = LinkedHashMultimap.create();
            }
            currentElement = qName;
            content = null;
            insideElement = true;
            attribs = atts;
        }

        @Override
        public void endElement(String uri, String name, String qName) {
            if (!"item".equalsIgnoreCase(qName) && !"root".equalsIgnoreCase(qName)) {
                if (content != null && qName.equals(currentElement) && content.trim().length() > 0) {
                    dataMap.put(qName, content);
                }
                if (attribs != null) {
                    int attsLength = attribs.getLength();
                    for (int i = 0; i < attsLength; i++) {
                        String attName = attribs.getLocalName(i);
                        dataMap.put(attName, attribs.getValue(i));
                    }
                }
            }
            if ("item".equalsIgnoreCase(qName)) {
                String data[] = new String[headerArray.length];
                int i = 0;
                for (String h : headerArray) {
                    if (dataMap.containsKey(h)) {
                        // Take the first remaining value for this column name and consume it
                        String first = dataMap.get(h).iterator().next();
                        data[i] = first;
                        dataMap.remove(h, first);
                    } else {
                        data[i] = "";
                    }
                    i++;
                }
                StringBuilder sb = new StringBuilder();
                for (String d : data) {
                    sb.append(d);
                    sb.append(',');
                }
                System.out.println(sb.substring(0, sb.length() - 1));
            }
            insideElement = false;
            currentElement = null;
            attribs = null;
        }

        @Override
        public void characters(char ch[], int start, int length) {
            if (insideElement) {
                content = new String(ch, start, length);
            }
        }

        public void setHeaderArray(String[] headerArray) {
            this.headerArray = headerArray;
        }
    }
}
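One possible refinement, offered as a sketch rather than as part of the answer: a SAX parser is allowed to report a single run of text through several `characters()` calls (this commonly happens around entity references), so assigning `content` from one call can lose part of the value. The hypothetical `TextCollector` below accumulates text in a `StringBuilder` instead; the `Note` element is invented for the demonstration, and the handler deliberately ignores mixed content.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects "name=text" for every element with text content.
class TextCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    final List<String> leaves = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // a child starts: drop any text that belonged to the parent
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times for one text node
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            leaves.add(qName + "=" + value);
        }
        text.setLength(0);
    }

    // Parse an in-memory document; checked exceptions are wrapped to keep the demo short.
    static List<String> parse(String xml) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.leaves;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(
                "<Item><ItemID>4504216603</ItemID><Note>a &amp; b</Note></Item>"));
        // prints: [ItemID=4504216603, Note=a & b]
    }
}
```

The same accumulation could be applied to `content` in both handlers above by appending in `characters()` and reading the buffer in `endElement()`.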