Translation | Tutorial | Editor: Hu Tao | 2022-08-29 10:51:06.847 | Read 358 times
Summary: This article explains how to extract text from Word documents using Java.
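Aspose.Words is an advanced Word document processing API for document management and manipulation tasks. It supports generating, modifying, converting, rendering, and printing documents without using Microsoft Word directly in cross-platform applications. It also supports all popular word-processing file formats and can export or convert Word documents to fixed-layout formats and to the most commonly used image and multimedia formats.
Extracting text from Word documents is useful in many scenarios, for example to analyze the text, or to pull out specific parts of a document and combine them into a single document. In this article, you will learn how to extract text from Word documents programmatically in Java. We also cover how to dynamically extract the content between specific elements such as paragraphs and tables.
Aspose.Words for Java is a powerful library that lets you create MS Word documents from scratch. It also lets you manipulate existing Word documents for encryption, conversion, text extraction, and more. We will use this library to extract text from Word DOCX or DOC documents. You can download the API's JAR or install it using the following Maven configuration.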
<repository>
    <id>AsposeJavaAPI</id>
    <name>Aspose Java API</name>
    <url>//repository.aspose.com/repo/</url>
</repository>
<dependency>
    <groupId>com.aspose</groupId>
    <artifactId>aspose-words</artifactId>
    <version>22.6</version>
    <type>pom</type>
</dependency>
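An MS Word document consists of various elements, including paragraphs, tables, and images. The requirements for text extraction therefore vary from one scenario to another. For example, you may need to extract the text between paragraphs, bookmarks, comments, and so on.
Every type of element in a Word DOC/DOCX is represented as a node, so processing a document means working with nodes. Let's look at how to extract text from Word documents in different scenarios.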
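To make the node idea concrete, here is a minimal sketch that uses a hypothetical stand-in class, SketchNode, rather than the Aspose.Words API. It models a document as a tree and climbs from an inline node to its block-level ancestor (the direct child of the body), which is the same kind of walk that extractContent performs later in this article. The class name, the string type tags, and the helper method are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a document node; this is NOT the Aspose.Words Node class.
final class SketchNode {
    final String type; // e.g. "BODY", "PARAGRAPH", "RUN"
    final List<SketchNode> children = new ArrayList<>();
    SketchNode parent;

    SketchNode(String type) {
        this.type = type;
    }

    // Attach a child and return it so that calls can be chained.
    SketchNode add(SketchNode child) {
        child.parent = this;
        children.add(child);
        return child;
    }

    // Climb until the parent is the body, i.e. until a block-level node is reached.
    static SketchNode blockLevelAncestor(SketchNode node) {
        SketchNode current = node;
        while (current.parent != null && !"BODY".equals(current.parent.type)) {
            current = current.parent;
        }
        return current;
    }

    public static void main(String[] args) {
        SketchNode body = new SketchNode("BODY");
        SketchNode paragraph = body.add(new SketchNode("PARAGRAPH"));
        SketchNode run = paragraph.add(new SketchNode("RUN"));
        System.out.println(blockLevelAncestor(run).type); // prints PARAGRAPH
    }
}
```

A node that is already a direct child of the body maps to itself, so block-level markers pass through unchanged.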
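In this section, we implement a Java text extractor for Word documents. The workflow is: choose the nodes that mark the start and end of the content, extract the content between them (with or without the marker nodes themselves), and use the cloned nodes, for example to build a new Word document.
Now let's write a method named extractContent, to which we pass the nodes and some other parameters to perform the extraction. The method parses the document and clones the nodes. It takes the following parameters: startNode and endNode, the nodes that mark where extraction starts and ends, and isInclusive, a flag that determines whether the markers themselves are included in the result.
The following is the complete implementation of the extractContent method, which extracts the content between the nodes passed to it.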
// For complete examples and data files, please go to //github.com/aspose-words/Aspose.Words-for-Java
// Imports needed by the snippets in this article.
import com.aspose.words.*;
import java.util.ArrayList;
import java.util.Collections;

public static ArrayList<Node> extractContent(Node startNode, Node endNode, boolean isInclusive) throws Exception {
    // First check that the nodes passed to this method are valid for use.
    verifyParameterNodes(startNode, endNode);

    // Create a list to store the extracted nodes.
    ArrayList<Node> nodes = new ArrayList<Node>();

    // Keep a record of the original nodes passed to this method so we can split marker nodes if needed.
    Node originalStartNode = startNode;
    Node originalEndNode = endNode;

    // Extract content based on block-level nodes (paragraphs and tables). Traverse through parent nodes to find them.
    // We will split the content of the first and last nodes depending on whether the marker nodes are inline.
    while (startNode.getParentNode().getNodeType() != NodeType.BODY)
        startNode = startNode.getParentNode();

    while (endNode.getParentNode().getNodeType() != NodeType.BODY)
        endNode = endNode.getParentNode();

    boolean isExtracting = true;
    boolean isStartingNode = true;
    boolean isEndingNode;
    // The current node we are extracting from the document.
    Node currNode = startNode;

    // Begin extracting content. Process all block-level nodes and specifically split the first and last nodes
    // when needed so paragraph formatting is retained.
    // The method is a little more complex than a regular extractor because it also has to handle inline nodes,
    // fields, bookmarks, and so on.
    while (isExtracting) {
        // Clone the current node and its children to obtain a copy.
        CompositeNode cloneNode = null;
        if (currNode.isComposite()) {
            cloneNode = (CompositeNode) currNode.deepClone(true);
        } else if (currNode.getNodeType() == NodeType.BOOKMARK_END) {
            // A block-level BookmarkEnd is not composite, so wrap its clone in a new paragraph.
            Paragraph paragraph = new Paragraph(currNode.getDocument());
            paragraph.getChildNodes().add(currNode.deepClone(true));
            cloneNode = (CompositeNode) paragraph.deepClone(true);
        }

        isEndingNode = currNode.equals(endNode);

        if (isStartingNode || isEndingNode) {
            // We need to process each marker separately, so pass it off to a separate method instead.
            if (isStartingNode) {
                processMarker(cloneNode, nodes, originalStartNode, isInclusive, isStartingNode, isEndingNode);
                isStartingNode = false;
            }

            // The conditional needs to be separate because the block-level start and end markers may be the same node.
            if (isEndingNode) {
                processMarker(cloneNode, nodes, originalEndNode, isInclusive, isStartingNode, isEndingNode);
                isExtracting = false;
            }
        } else if (cloneNode != null) {
            // Node is not a start or end marker, simply add the copy to the list.
            // Non-composite nodes other than BookmarkEnd have no clone and are skipped.
            nodes.add(cloneNode);
        }

        // Move to the next node and extract it. If the next node is null, the rest of the content is in a different section.
        if (currNode.getNextSibling() == null && isExtracting) {
            // Move to the next section.
            Section nextSection = (Section) currNode.getAncestor(NodeType.SECTION).getNextSibling();
            currNode = nextSection.getBody().getFirstChild();
        } else {
            // Move to the next node in the body.
            currNode = currNode.getNextSibling();
        }
    }

    // Return the nodes between the node markers.
    return nodes;
}
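The effect of the isInclusive flag on block-level markers can be illustrated with a small sketch. It works on a plain list of strings instead of Aspose.Words nodes, so the class RangeSketch and its between method are hypothetical names used only for this illustration, and inline markers (which need splitting) are deliberately left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model: each string stands for one block-level node (paragraph or table).
final class RangeSketch {
    // Copy the blocks from startIndex to endIndex; the two markers are kept only when inclusive is true.
    static List<String> between(List<String> blocks, int startIndex, int endIndex, boolean inclusive) {
        if (startIndex < 0 || endIndex >= blocks.size() || startIndex > endIndex) {
            throw new IllegalArgumentException("The end marker must not come before the start marker");
        }
        int from = inclusive ? startIndex : startIndex + 1;
        int to = inclusive ? endIndex : endIndex - 1;
        List<String> result = new ArrayList<>();
        for (int i = from; i <= to; i++) {
            result.add(blocks.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> blocks = Arrays.asList("P1", "P2", "Table", "P3", "P4");
        System.out.println(between(blocks, 1, 3, true));  // prints [P2, Table, P3]
        System.out.println(between(blocks, 1, 3, false)); // prints [Table]
    }
}
```

When both markers are the same block, the inclusive result holds that single block and the exclusive result is empty.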
The extractContent method also needs some helper methods to complete the extraction, as shown below.
/**
 * Checks that the input parameters are correct and can be used. Throws an exception
 * if there is any problem.
 */
private static void verifyParameterNodes(Node startNode, Node endNode) throws Exception {
    // The order in which these checks are done is important.
    if (startNode == null)
        throw new IllegalArgumentException("Start node cannot be null");
    if (endNode == null)
        throw new IllegalArgumentException("End node cannot be null");

    if (!startNode.getDocument().equals(endNode.getDocument()))
        throw new IllegalArgumentException("Start node and end node must belong to the same document");

    if (startNode.getAncestor(NodeType.BODY) == null || endNode.getAncestor(NodeType.BODY) == null)
        throw new IllegalArgumentException("Start node and end node must be a child or descendant of a body");

    // Check that the end node is after the start node in the DOM tree.
    // First check whether they are in different sections; if they are not, check
    // their position in the body of the section they share.
    Section startSection = (Section) startNode.getAncestor(NodeType.SECTION);
    Section endSection = (Section) endNode.getAncestor(NodeType.SECTION);

    int startIndex = startSection.getParentNode().indexOf(startSection);
    int endIndex = endSection.getParentNode().indexOf(endSection);

    if (startIndex == endIndex) {
        if (startSection.getBody().indexOf(startNode) > endSection.getBody().indexOf(endNode))
            throw new IllegalArgumentException("The end node must be after the start node in the body");
    } else if (startIndex > endIndex)
        throw new IllegalArgumentException("The section of the end node must be after the section of the start node");
}

/**
 * Checks if a node passed is an inline node.
 */
private static boolean isInline(Node node) throws Exception {
    // Test whether the node is a descendant of a Paragraph or Table node and is not itself
    // a paragraph or a table. A paragraph inside a comment, which is itself a descendant of
    // a paragraph, is possible.
    return ((node.getAncestor(NodeType.PARAGRAPH) != null || node.getAncestor(NodeType.TABLE) != null)
            && !(node.getNodeType() == NodeType.PARAGRAPH || node.getNodeType() == NodeType.TABLE));
}
/**
 * Removes the content before or after the marker in the cloned node depending
 * on the type of marker.
 */
private static void processMarker(CompositeNode cloneNode, ArrayList<Node> nodes, Node node, boolean isInclusive,
        boolean isStartMarker, boolean isEndMarker) throws Exception {
    // If we are dealing with a block-level node, just see if it should be included
    // and add it to the list.
    if (!isInline(node)) {
        // Don't add the node twice if the markers are the same node.
        if (!(isStartMarker && isEndMarker)) {
            if (isInclusive)
                nodes.add(cloneNode);
        }
        return;
    }

    // If a marker is a FieldStart node, check whether it is to be included or not.
    // We assume for simplicity that the FieldStart and FieldEnd appear in the same
    // paragraph.
    if (node.getNodeType() == NodeType.FIELD_START) {
        // If the marker is a start node and is not to be included, then skip to the end of
        // the field.
        // If the marker is an end node and it is to be included, then move to the end
        // field so the field will not be removed.
        if ((isStartMarker && !isInclusive) || (!isStartMarker && isInclusive)) {
            while (node.getNextSibling() != null && node.getNodeType() != NodeType.FIELD_END)
                node = node.getNextSibling();
        }
    }

    // If either marker is part of a comment, then to include the comment itself we
    // need to move the pointer forward to the Comment node found after the
    // CommentRangeEnd node.
    if (node.getNodeType() == NodeType.COMMENT_RANGE_END) {
        while (node.getNextSibling() != null && node.getNodeType() != NodeType.COMMENT)
            node = node.getNextSibling();
    }

    // Find the corresponding node in our cloned node by index and return it.
    // If the start and end node are the same, some child nodes might already have
    // been removed. Subtract the difference to get the right index.
    int indexDiff = node.getParentNode().getChildNodes().getCount() - cloneNode.getChildNodes().getCount();

    // Child node count identical.
    if (indexDiff == 0)
        node = cloneNode.getChildNodes().get(node.getParentNode().indexOf(node));
    else
        node = cloneNode.getChildNodes().get(node.getParentNode().indexOf(node) - indexDiff);

    // Remove the nodes up to/from the marker.
    boolean isSkip;
    boolean isProcessing = true;
    boolean isRemoving = isStartMarker;
    Node nextNode = cloneNode.getFirstChild();

    while (isProcessing && nextNode != null) {
        Node currentNode = nextNode;
        isSkip = false;

        if (currentNode.equals(node)) {
            if (isStartMarker) {
                isProcessing = false;
                if (isInclusive)
                    isRemoving = false;
            } else {
                isRemoving = true;
                if (isInclusive)
                    isSkip = true;
            }
        }

        nextNode = nextNode.getNextSibling();
        if (isRemoving && !isSkip)
            currentNode.remove();
    }

    // After processing, the composite node may have become empty. If it has, don't include it.
    if (!(isStartMarker && isEndMarker)) {
        if (cloneNode.hasChildNodes())
            nodes.add(cloneNode);
    }
}
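The removal loop at the end of processMarker may be easier to follow in isolation. The sketch below reproduces its keep/remove rule on a plain list of strings: for a start marker everything before the marker is dropped, for an end marker everything after it is dropped, and the marker itself survives only when the extraction is inclusive. MarkerTrimSketch and trim are hypothetical names, and the field and comment adjustments from the real method are not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model: each string stands for one child (run, bookmark, ...) of a cloned paragraph.
final class MarkerTrimSketch {
    static List<String> trim(List<String> children, int markerIndex, boolean isStartMarker, boolean inclusive) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < children.size(); i++) {
            boolean keep;
            if (i == markerIndex) {
                keep = inclusive;          // the marker itself survives only when inclusive
            } else if (isStartMarker) {
                keep = i > markerIndex;    // start marker: drop everything before it
            } else {
                keep = i < markerIndex;    // end marker: drop everything after it
            }
            if (keep) {
                kept.add(children.get(i));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList("Run1", "BookmarkStart", "Run2", "Run3");
        System.out.println(trim(children, 1, true, false)); // prints [Run2, Run3]
        System.out.println(trim(children, 1, false, true)); // prints [Run1, BookmarkStart]
    }
}
```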
public static Document generateDocument(Document srcDoc, ArrayList<Node> nodes) throws Exception {
    // Create a blank document.
    Document dstDoc = new Document();
    // Remove the first paragraph from the empty document.
    dstDoc.getFirstSection().getBody().removeAllChildren();

    // Import each node from the list into the new document. Keep the original
    // formatting of the node.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);

    for (Node node : nodes) {
        Node importNode = importer.importNode(node, true);
        dstDoc.getFirstSection().getBody().appendChild(importNode);
    }

    // Return the generated document.
    return dstDoc;
}
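We are now ready to use these methods to extract text from a Word document.
Let's see how to extract the content between two paragraphs in a Word document. The steps are: load the document with the Document class, get references to the start and end paragraphs with getChild, call extractContent with those paragraphs, then pass the result to generateDocument and save the new document.
The following code sample shows how to extract the text between the 7th and 11th paragraphs of a Word document in Java.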
// Load document
Document doc = new Document("TestFile.doc");

// Gather the nodes. The getChild method uses a 0-based index.
Paragraph startPara = (Paragraph) doc.getFirstSection().getChild(NodeType.PARAGRAPH, 6, true);
Paragraph endPara = (Paragraph) doc.getFirstSection().getChild(NodeType.PARAGRAPH, 10, true);

// Extract the content between these nodes in the document. Include these
// markers in the extraction.
ArrayList extractedNodes = extractContent(startPara, endPara, true);

// Insert the content into a new separate document and save it to disk.
Document dstDoc = generateDocument(doc, extractedNodes);
dstDoc.save("output.doc");
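You can also extract content between nodes of different types. To demonstrate, let's extract the content between a paragraph and a table, insert the extracted copy after the table, and save the result to a new Word file. The steps are: load the document, get references to the starting paragraph and the ending table with getChild, call extractContent with those nodes, insert the extracted nodes after the table, and save the document.
The following code sample shows how to extract the text between a paragraph and a table in a Word document using Java.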
// Load document
Document doc = new Document("TestFile.doc");

// Get references to the starting paragraph and the ending table.
Paragraph startPara = (Paragraph) doc.getLastSection().getChild(NodeType.PARAGRAPH, 2, true);
Table endTable = (Table) doc.getLastSection().getChild(NodeType.TABLE, 0, true);

// Extract the content between these nodes in the document. Include these markers in the extraction.
ArrayList extractedNodes = extractContent(startPara, endTable, true);

// Let's reverse the list to make inserting the content back into the document easier.
Collections.reverse(extractedNodes);

while (extractedNodes.size() > 0) {
    // Insert the last node from the reversed list.
    endTable.getParentNode().insertAfter((Node) extractedNodes.get(0), endTable);
    // Remove this node from the list after insertion.
    extractedNodes.remove(0);
}

// Save the generated document to disk.
doc.save("output.doc");
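Why reversing the list helps may not be obvious: every node is inserted directly after the same anchor (the table), so each new insertion lands in front of the previous one. Inserting in reverse order therefore restores the original order. The sketch below shows this on plain strings; InsertAfterSketch and insertAllAfter are hypothetical names, not part of Aspose.Words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified model: strings stand for block-level nodes in a document body.
final class InsertAfterSketch {
    // Insert every extracted item directly after the anchor, keeping the extracted order.
    static List<String> insertAllAfter(List<String> body, String anchor, List<String> extracted) {
        List<String> result = new ArrayList<>(body);
        int anchorIndex = result.indexOf(anchor);
        if (anchorIndex < 0) {
            throw new IllegalArgumentException("Anchor not found: " + anchor);
        }
        List<String> pending = new ArrayList<>(extracted);
        Collections.reverse(pending);
        for (String item : pending) {
            // Always insert right after the anchor; earlier insertions are pushed back.
            result.add(anchorIndex + 1, item);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> body = Arrays.asList("Intro", "Table", "Outro");
        List<String> extracted = Arrays.asList("A", "B", "C");
        System.out.println(insertAllAfter(body, "Table", extracted)); // prints [Intro, Table, A, B, C, Outro]
    }
}
```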
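Now let's see how to extract the content between paragraphs based on their styles. To demonstrate, we will extract the content between the first "Heading 1" and the first "Heading 3" in a Word document. The steps are: load the document, collect the paragraphs that use each heading style with the paragraphsByStyleName helper, take the first paragraph of each style as the start and end markers, call extractContent without including the markers, then pass the result to generateDocument and save the new document.
The following code sample shows how to extract the content between paragraphs based on their styles.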
// Load document
Document doc = new Document("TestFile.doc");

// Gather a list of the paragraphs using the respective heading styles.
ArrayList parasStyleHeading1 = paragraphsByStyleName(doc, "Heading 1");
ArrayList parasStyleHeading3 = paragraphsByStyleName(doc, "Heading 3");

// Use the first instance of the paragraphs with those styles.
Node startPara1 = (Node) parasStyleHeading1.get(0);
Node endPara1 = (Node) parasStyleHeading3.get(0);

// Extract the content between these nodes in the document. Don't include these markers in the extraction.
ArrayList extractedNodes = extractContent(startPara1, endPara1, false);

// Insert the content into a new separate document and save it to disk.
Document dstDoc = generateDocument(doc, extractedNodes);
dstDoc.save("output.doc");
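The sample calls paragraphsByStyleName, which is not defined anywhere in this article. As a rough sketch of what such a helper has to do, the code below filters paragraphs by style name using a hypothetical stand-in class, Para, instead of Aspose.Words types. With the real library the style name would presumably be read from each paragraph (for example through paragraph.getParagraphFormat().getStyle().getName()) while iterating over the document's paragraph nodes; treat that mapping as an assumption rather than the article's own implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StyleFilterSketch {
    // Hypothetical stand-in for a paragraph: only its style name and text are modelled.
    static final class Para {
        final String styleName;
        final String text;

        Para(String styleName, String text) {
            this.styleName = styleName;
            this.text = text;
        }
    }

    // Collect, in document order, every paragraph whose style name matches exactly.
    static List<Para> paragraphsByStyleName(List<Para> paragraphs, String styleName) {
        List<Para> matches = new ArrayList<>();
        for (Para paragraph : paragraphs) {
            if (styleName.equals(paragraph.styleName)) {
                matches.add(paragraph);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Para> paragraphs = Arrays.asList(
                new Para("Heading 1", "Chapter 1"),
                new Para("Normal", "Some body text"),
                new Para("Heading 3", "Details"));
        List<Para> headings = paragraphsByStyleName(paragraphs, "Heading 1");
        System.out.println(headings.get(0).text); // prints Chapter 1
    }
}
```

Because the sample takes element 0 of each result, a real implementation should also handle the case where no paragraph uses the requested style.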
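That is how to extract text from Word documents with Java. If you have other questions about the product, feel free to contact us or join our official technical discussion group.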