Aspose.Words Word Processing Control Feature Demo: Extracting Text from Word Documents in Java

Translation | Tutorial | Editor: Hu Tao | 2022-08-29 10:51:06.847 | 358 views

Overview: This article explains how to extract text from Word documents in Java.

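Aspose.Words is an advanced Word document processing API for performing a wide range of document management and manipulation tasks. The API supports generating, modifying, converting, rendering, and printing documents without using Microsoft Word directly in cross-platform applications. It also supports all popular word processing file formats and can export or convert Word documents to fixed-layout file formats and the most commonly used image and multimedia formats.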

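Extracting text from Word documents comes up in many scenarios, for example when analyzing the text, or when pulling specific parts of a document and combining them into a single document. In this article, you will learn how to extract text from Word documents programmatically in Java. We will also cover how to dynamically extract the content between specific elements such as paragraphs and tables.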

Java Library to Extract Text from Word Documents

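Aspose.Words for Java is a powerful library that lets you create MS Word documents from scratch. It also lets you manipulate existing Word documents for encryption, conversion, text extraction, and more. We will use this library to extract text from Word DOCX or DOC documents. You can download the API's JAR or install it using the following Maven configuration.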

<repository>
<id>AsposeJavaAPI</id>
<name>Aspose Java API</name>
<url>https://repository.aspose.com/repo/</url>
</repository>
<dependency>
<groupId>com.aspose</groupId>
<artifactId>aspose-words</artifactId>
<version>22.6</version>
<type>pom</type>
</dependency>

Extract Text from Word DOC/DOCX in Java

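An MS Word document consists of various elements, including paragraphs, tables, and images. The requirements for text extraction therefore vary from one scenario to another. For example, you may need to extract the text between paragraphs, bookmarks, comments, and so on.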

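Every type of element in a Word DOC/DOCX is represented as a node, so processing a document means working with nodes. Let's see how to extract text from Word documents in different scenarios.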
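To make the node idea concrete, the following sketch lists the block-level nodes that sit directly under the body of the first section. It is not part of the original article: the class name BodyNodes and the method blockNodeKinds are hypothetical, and the sketch assumes the Aspose.Words for Java JAR is on the classpath.

```java
import com.aspose.words.Body;
import com.aspose.words.Document;
import com.aspose.words.Node;
import java.util.ArrayList;
import java.util.List;

class BodyNodes {
    // Returns the class name of every node that is a direct child of the
    // first section's body (typically Paragraph or Table).
    static List<String> blockNodeKinds(Document doc) {
        Body body = doc.getFirstSection().getBody();
        List<String> kinds = new ArrayList<>();
        // Walk the siblings in document order, the same way extractContent does.
        for (Node node = body.getFirstChild(); node != null; node = node.getNextSibling()) {
            kinds.add(node.getClass().getSimpleName());
        }
        return kinds;
    }
}
```

Calling blockNodeKinds on a loaded document gives a quick overview of which nodes are available as start and end markers.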

Extract Text from a Word DOC in Java

In this section, we will implement a Java text extractor for Word documents. The text extraction workflow is as follows:

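First, we define the nodes to include in the text extraction process.
Then, we extract the content between the specified nodes (including or excluding the start and end nodes).
Finally, we use the clones of the extracted nodes, for example to create a new Word document containing the extracted content.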

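Now let's write a method named extractContent, to which we pass the nodes and a few other parameters to perform the text extraction. This method parses the document and clones the nodes. The parameters we pass to this method are described below.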

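startNode and endNode are the start and end points of the content extraction, respectively. They can be block-level nodes (Paragraph, Table) or inline-level nodes (for example Run, FieldStart, BookmarkStart).
1. To pass a field, pass the corresponding FieldStart object.
2. To pass a bookmark, pass the BookmarkStart and BookmarkEnd nodes.
3. For comments, use the CommentRangeStart and CommentRangeEnd nodes.
isInclusive defines whether the markers are included in the extraction. If this option is set to false and the same node or consecutive nodes are passed, an empty list is returned.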
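The effect of isInclusive on block-level markers can be illustrated without the library. The following is a simplified model, not the Aspose.Words implementation: it treats the document body as a plain list of strings and identifies the markers by index. The names InclusiveDemo and between are hypothetical and used only here.

```java
import java.util.ArrayList;
import java.util.List;

class InclusiveDemo {
    // Returns the items from start to end, with or without the two marker items.
    // Simplified model: block-level markers only, identified by list index.
    static List<String> between(List<String> body, int start, int end, boolean isInclusive) {
        if (start < 0 || end >= body.size() || start > end)
            throw new IllegalArgumentException("Markers must be in order and inside the body");
        int from = isInclusive ? start : start + 1; // first index to keep
        int to = isInclusive ? end : end - 1;       // last index to keep
        List<String> result = new ArrayList<>();
        for (int i = from; i <= to; i++)
            result.add(body.get(i));
        return result;
    }

    public static void main(String[] args) {
        List<String> body = List.of("P0", "P1", "Table", "P2");
        System.out.println(between(body, 1, 3, true));  // [P1, Table, P2]
        System.out.println(between(body, 1, 3, false)); // [Table]
        System.out.println(between(body, 1, 2, false)); // [] (consecutive markers)
        System.out.println(between(body, 1, 1, false)); // [] (same marker)
    }
}
```

The last two calls mirror the rule stated above: with isInclusive set to false, identical or consecutive markers leave nothing to extract.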

The following is the complete implementation of the extractContent method, which extracts the content between the nodes passed to it.

// For complete examples and data files, please go to https://github.com/aspose-words/Aspose.Words-for-Java
// Imports needed by the snippets in this article (place them at the top of the enclosing class file).
import com.aspose.words.*;
import java.util.ArrayList;
import java.util.Collections;

public static ArrayList extractContent(Node startNode, Node endNode, boolean isInclusive) throws Exception {
    // First check that the nodes passed to this method are valid for use.
    verifyParameterNodes(startNode, endNode);

    // Create a list to store the extracted nodes.
    ArrayList nodes = new ArrayList();

    // Keep a record of the original nodes passed to this method so we can split marker nodes if needed.
    Node originalStartNode = startNode;
    Node originalEndNode = endNode;

    // Extract content based on block-level nodes (paragraphs and tables). Traverse through parent nodes to find them.
    // We will split the content of the first and last nodes depending on whether the marker nodes are inline.
    while (startNode.getParentNode().getNodeType() != NodeType.BODY)
        startNode = startNode.getParentNode();

    while (endNode.getParentNode().getNodeType() != NodeType.BODY)
        endNode = endNode.getParentNode();

    boolean isExtracting = true;
    boolean isStartingNode = true;
    boolean isEndingNode;
    // The current node we are extracting from the document.
    Node currNode = startNode;

    // Begin extracting content. Process all block-level nodes and specifically split the first and last nodes
    // when needed so paragraph formatting is retained.
    // The method is a little more complex than a regular extractor because it also has to handle inline nodes,
    // fields, bookmarks and so on as markers.
    while (isExtracting) {
        // Clone the current node and its children to obtain a copy.
        CompositeNode cloneNode = null;
        if (currNode.isComposite()) {
            cloneNode = (CompositeNode) currNode.deepClone(true);
        } else if (currNode.getNodeType() == NodeType.BOOKMARK_END) {
            // A block-level BookmarkEnd is not a composite node, so wrap its clone in a paragraph.
            Paragraph paragraph = new Paragraph(currNode.getDocument());
            paragraph.getChildNodes().add(currNode.deepClone(true));
            cloneNode = (CompositeNode) paragraph.deepClone(true);
        }

        isEndingNode = currNode.equals(endNode);

        if (isStartingNode || isEndingNode) {
            // We need to process each marker separately, so pass it off to a separate method instead.
            if (isStartingNode) {
                processMarker(cloneNode, nodes, originalStartNode, isInclusive, isStartingNode, isEndingNode);
                isStartingNode = false;
            }

            // The conditional needs to be separate because the block-level start and end markers may be the same node.
            if (isEndingNode) {
                processMarker(cloneNode, nodes, originalEndNode, isInclusive, isStartingNode, isEndingNode);
                isExtracting = false;
            }
        } else if (cloneNode != null) {
            // Node is not a start or end marker, simply add the copy to the list.
            // Non-composite nodes other than BookmarkEnd produce no clone and are skipped.
            nodes.add(cloneNode);
        }

        // Move to the next node and extract it. If the next node is null, the rest of the content
        // is found in a different section.
        if (currNode.getNextSibling() == null && isExtracting) {
            // Move to the next section.
            Section nextSection = (Section) currNode.getAncestor(NodeType.SECTION).getNextSibling();
            currNode = nextSection.getBody().getFirstChild();
        } else {
            // Move to the next node in the body.
            currNode = currNode.getNextSibling();
        }
    }

    // Return the nodes between the node markers.
    return nodes;
}

The extractContent method also needs some helper methods to complete the text extraction, as shown below.

/**
 * Checks the input parameters are correct and can be used. Throws an exception
 * if there is any problem.
 */
private static void verifyParameterNodes(Node startNode, Node endNode) throws Exception {
    // The order in which these checks are done is important.
    if (startNode == null)
        throw new IllegalArgumentException("Start node cannot be null");
    if (endNode == null)
        throw new IllegalArgumentException("End node cannot be null");

    if (!startNode.getDocument().equals(endNode.getDocument()))
        throw new IllegalArgumentException("Start node and end node must belong to the same document");

    if (startNode.getAncestor(NodeType.BODY) == null || endNode.getAncestor(NodeType.BODY) == null)
        throw new IllegalArgumentException("Start node and end node must be a child or descendant of a body");

    // Check the end node is after the start node in the DOM tree.
    // First check if they are in different sections, then if they're not, check
    // their position in the body of the section they share.
    Section startSection = (Section) startNode.getAncestor(NodeType.SECTION);
    Section endSection = (Section) endNode.getAncestor(NodeType.SECTION);

    int startIndex = startSection.getParentNode().indexOf(startSection);
    int endIndex = endSection.getParentNode().indexOf(endSection);

    if (startIndex == endIndex) {
        if (startSection.getBody().indexOf(startNode) > endSection.getBody().indexOf(endNode))
            throw new IllegalArgumentException("The end node must be after the start node in the body");
    } else if (startIndex > endIndex)
        throw new IllegalArgumentException("The section of the end node must be after the section of the start node");
}

/**
 * Checks if a node passed is an inline node.
 */
private static boolean isInline(Node node) throws Exception {
    // Test if the node is a descendant of a Paragraph or Table node and is not itself
    // a paragraph or a table. A paragraph inside a comment, which is in turn a
    // descendant of a paragraph, is possible.
    return ((node.getAncestor(NodeType.PARAGRAPH) != null || node.getAncestor(NodeType.TABLE) != null)
            && !(node.getNodeType() == NodeType.PARAGRAPH || node.getNodeType() == NodeType.TABLE));
}

/**
 * Removes the content before or after the marker in the cloned node depending
 * on the type of marker.
 */
private static void processMarker(CompositeNode cloneNode, ArrayList nodes, Node node, boolean isInclusive,
        boolean isStartMarker, boolean isEndMarker) throws Exception {
    // If we are dealing with a block-level node, just see if it should be included
    // and add it to the list.
    if (!isInline(node)) {
        // Don't add the node twice if the markers are the same node.
        if (!(isStartMarker && isEndMarker)) {
            if (isInclusive)
                nodes.add(cloneNode);
        }
        return;
    }

    // If a marker is a FieldStart node, check if it's to be included or not.
    // We assume for simplicity that the FieldStart and FieldEnd appear in the same
    // paragraph.
    if (node.getNodeType() == NodeType.FIELD_START) {
        // If the marker is a start node and is not to be included, then skip to the end of
        // the field.
        // If the marker is an end node and it is to be included, then move to the end
        // of the field so the field will not be removed.
        if ((isStartMarker && !isInclusive) || (!isStartMarker && isInclusive)) {
            while (node.getNextSibling() != null && node.getNodeType() != NodeType.FIELD_END)
                node = node.getNextSibling();
        }
    }

    // If either marker is part of a comment, then to include the comment itself we
    // need to move the pointer forward to the Comment node found after the
    // CommentRangeEnd node.
    if (node.getNodeType() == NodeType.COMMENT_RANGE_END) {
        while (node.getNextSibling() != null && node.getNodeType() != NodeType.COMMENT)
            node = node.getNextSibling();
    }

    // Find the corresponding node in our cloned node by index.
    // If the start and end node are the same, some child nodes might already have
    // been removed. Subtract the difference to get the right index.
    int indexDiff = node.getParentNode().getChildNodes().getCount() - cloneNode.getChildNodes().getCount();

    // Child node count identical.
    if (indexDiff == 0)
        node = cloneNode.getChildNodes().get(node.getParentNode().indexOf(node));
    else
        node = cloneNode.getChildNodes().get(node.getParentNode().indexOf(node) - indexDiff);

    // Remove the nodes up to/from the marker.
    boolean isSkip;
    boolean isProcessing = true;
    boolean isRemoving = isStartMarker;
    Node nextNode = cloneNode.getFirstChild();

    while (isProcessing && nextNode != null) {
        Node currentNode = nextNode;
        isSkip = false;

        if (currentNode.equals(node)) {
            if (isStartMarker) {
                isProcessing = false;
                if (isInclusive)
                    isRemoving = false;
            } else {
                isRemoving = true;
                if (isInclusive)
                    isSkip = true;
            }
        }

        nextNode = nextNode.getNextSibling();
        if (isRemoving && !isSkip)
            currentNode.remove();
    }

    // After processing, the composite node may become empty. If it has, don't include
    // it.
    if (!(isStartMarker && isEndMarker)) {
        if (cloneNode.hasChildNodes())
            nodes.add(cloneNode);
    }
}

public static Document generateDocument(Document srcDoc, ArrayList nodes) throws Exception {

    // Create a blank document.
    Document dstDoc = new Document();
    // Remove the first paragraph from the empty document.
    dstDoc.getFirstSection().getBody().removeAllChildren();

    // Import each node from the list into the new document. Keep the original
    // formatting of the node.
    NodeImporter importer = new NodeImporter(srcDoc, dstDoc, ImportFormatMode.KEEP_SOURCE_FORMATTING);

    for (Node node : (Iterable<Node>) nodes) {
        Node importNode = importer.importNode(node, true);
        dstDoc.getFirstSection().getBody().appendChild(importNode);
    }

    // Return the generated document.
    return dstDoc;
}

Now we are ready to use these methods to extract text from Word documents.

Extract Text between Paragraphs in a Word DOC in Java

Let's see how to extract the content between two paragraphs in a Word DOCX document. The steps to do this in Java are as follows.

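First, load the Word document using the Document class.
Get references to the start and end paragraphs into two objects using the Document.getFirstSection().getChild(NodeType.PARAGRAPH, int, boolean) method.
Call the extractContent(startPara, endPara, true) method to extract the nodes into an object.
Call the generateDocument(Document, extractedNodes) helper method to create a document containing the extracted content.
Finally, save the returned document using the Document.save(String) method.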

The following code sample shows how to extract the text between the 7th and 11th paragraphs of a Word document in Java.

// Load document
Document doc = new Document("TestFile.doc");

// Gather the nodes. The getChild method uses a 0-based index.
Paragraph startPara = (Paragraph) doc.getFirstSection().getChild(NodeType.PARAGRAPH, 6, true);
Paragraph endPara = (Paragraph) doc.getFirstSection().getChild(NodeType.PARAGRAPH, 10, true);
// Extract the content between these nodes in the document. Include these
// markers in the extraction.
ArrayList extractedNodes = extractContent(startPara, endPara, true);

// Insert the content into a new separate document and save it to disk.
Document dstDoc = generateDocument(doc, extractedNodes);
dstDoc.save("output.doc");
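The sample above saves the extracted nodes as a document. If only the characters are needed, one option is to concatenate Node.getText() over the extracted nodes. The following is a sketch under that assumption, not part of the original article: ExtractedText and toPlainText are hypothetical names. getText() returns raw text that includes control characters (such as the paragraph break) and field codes, so each node's text is trimmed here; cell markers inside tables are left as they are.

```java
import com.aspose.words.Node;
import java.util.List;

class ExtractedText {
    // Joins the text of each extracted node, one node per line.
    // Accepts the raw ArrayList returned by extractContent.
    static String toPlainText(List<?> nodes) {
        StringBuilder text = new StringBuilder();
        for (Object item : nodes) {
            // getText() ends a paragraph with a control character; trim() drops it.
            text.append(((Node) item).getText().trim()).append('\n');
        }
        return text.toString();
    }
}
```

With the variables from the sample above, the call would be ExtractedText.toPlainText(extractedNodes).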

Extract Text between Different Types of Nodes in a DOC in Java

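You can also extract the content between different types of nodes. To demonstrate, let's extract the content between a paragraph and a table and save the result to a new Word file. The steps to extract the text between different nodes of a Word document in Java are as follows.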

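Load the Word document using the Document class.
Get references to the start and end nodes into two objects using the getChild(NodeType, int, boolean) method of a section (the sample below uses Document.getLastSection()).
Call the extractContent(startPara, endTable, true) method to extract the nodes into an object.
Insert the extracted nodes back into the document after the end table, as the sample below does; alternatively, call the generateDocument(Document, extractedNodes) helper method to create a separate document containing the extracted content.
Save the document using the Document.save(String) method.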

The following code sample shows how to extract the text between a paragraph and a table in a Word document using Java.

// Load documents
Document doc = new Document("TestFile.doc");

// Get reference of starting paragraph
Paragraph startPara = (Paragraph) doc.getLastSection().getChild(NodeType.PARAGRAPH, 2, true);
Table endTable = (Table) doc.getLastSection().getChild(NodeType.TABLE, 0, true);

// Extract the content between these nodes in the document. Include these markers in the extraction.
ArrayList extractedNodes = extractContent(startPara, endTable, true);

// Let's reverse the list to make inserting the content back into the document easier.
Collections.reverse(extractedNodes);

while (extractedNodes.size() > 0) {
    // Insert the last node from the reversed list
    endTable.getParentNode().insertAfter((Node) extractedNodes.get(0), endTable);
    // Remove this node from the list after insertion.
    extractedNodes.remove(0);
}

// Save the generated document to disk.
doc.save("output.doc");

Extract Text between Paragraphs Based on Styles in a DOCX in Java

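Now let's see how to extract the content between paragraphs based on their styles. To demonstrate, we will extract the content between the first "Heading 1" and the first "Heading 3" in a Word document. The following steps show how to do this in Java.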

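First, load the Word document using the Document class.
Then, collect the paragraphs into an object using the paragraphsByStyleName(Document, "Heading 1") helper method.
Collect the paragraphs into another object using the paragraphsByStyleName(Document, "Heading 3") helper method.
Call the extractContent(startPara, endPara, false) method, passing the first element of each paragraph list as the first and second arguments; false leaves the two heading paragraphs themselves out of the result.
Call the generateDocument(Document, extractedNodes) helper method to create a document containing the extracted content.
Finally, save the returned document using the Document.save(String) method.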

The following code sample shows how to extract the content between paragraphs based on their styles.

// Load document
Document doc = new Document(dataDir + "TestFile.doc");

// Gather a list of the paragraphs using the respective heading styles.
ArrayList parasStyleHeading1 = paragraphsByStyleName(doc, "Heading 1");
ArrayList parasStyleHeading3 = paragraphsByStyleName(doc, "Heading 3");

// Use the first instance of the paragraphs with those styles.
Node startPara1 = (Node) parasStyleHeading1.get(0);
Node endPara1 = (Node) parasStyleHeading3.get(0);

// Extract the content between these nodes in the document. Don't include these markers in the extraction.
ArrayList extractedNodes = extractContent(startPara1, endPara1, false);

// Insert the content into a new separate document and save it to disk.
Document dstDoc = generateDocument(doc, extractedNodes);
dstDoc.save("output.doc");
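The sample above calls paragraphsByStyleName, which is not listed in this article, and uses dataDir, which is presumably a String holding the folder of the test file. A minimal sketch of such a helper follows, assuming it only needs to collect every paragraph whose style name matches exactly. The wrapper class name StyleHelper is hypothetical; in practice the method would live in the same class as extractContent.

```java
import com.aspose.words.Document;
import com.aspose.words.Node;
import com.aspose.words.NodeCollection;
import com.aspose.words.NodeType;
import com.aspose.words.Paragraph;
import java.util.ArrayList;

class StyleHelper {
    // Collects every paragraph in the document whose style name equals styleName.
    static ArrayList<Paragraph> paragraphsByStyleName(Document doc, String styleName) throws Exception {
        ArrayList<Paragraph> paragraphsWithStyle = new ArrayList<>();
        // isDeep = true searches the whole tree, including paragraphs inside tables.
        NodeCollection paragraphs = doc.getChildNodes(NodeType.PARAGRAPH, true);
        for (Node node : paragraphs.toArray()) {
            Paragraph paragraph = (Paragraph) node;
            if (paragraph.getParagraphFormat().getStyle().getName().equals(styleName))
                paragraphsWithStyle.add(paragraph);
        }
        return paragraphsWithStyle;
    }
}
```

The returned list can be empty when the document has no paragraph with the given style, so checking its size before calling get(0) avoids an IndexOutOfBoundsException.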

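That covers how to extract text from Word documents in Java. If you have other questions about the product, feel free to contact us or join our official technical discussion group.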
