Tuesday, September 17, 2013

Saving HTML text as a txt file with Java: how to remove body {font-family: SimSun; font-size: 22px; ...}

I wrote a Java class that saves a web page (HTML) as a txt file. The text content of the saved txt file is correct, but it always comes with this block:

body {
font-family: SimSun;
font-size: 22px;
font-style: italic;
font-weight: bold;
color: #00F;
}

I do not know how to get rid of it. Any help would be appreciated.

Part of the Java code:
package format.conversion;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;

import javax.servlet.jsp.tagext.BodyTag;
import javax.swing.JFileChooser;
import javax.swing.filechooser.FileNameExtensionFilter;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.nodes.RemarkNode;
import org.htmlparser.nodes.TextNode;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.tags.MetaTag;
import org.htmlparser.tags.StyleTag;
import org.htmlparser.tags.TitleTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.HtmlPage;

public class HtmlToTxt {

    public static void main(String[] args) throws Exception {
        HtmlToTxt test = new HtmlToTxt();
        test.go();
    }

    public void go() {
        try {
            // Let the user choose where to save the txt file.
            JFileChooser fileSave = new JFileChooser(".");

            FileNameExtensionFilter extension = new FileNameExtensionFilter("txt Files (.txt)", "txt");
            fileSave.setFileFilter(extension);

            fileSave.showSaveDialog(null);
            File file = fileSave.getSelectedFile();
            if (file == null) {
                return; // the dialog was cancelled, nothing to save
            }
            if (!file.getPath().endsWith(".txt")) {
                file = new File(file.getPath() + ".txt");
            }
            String outputFile = file.toString();

            // Read the HTML, extract its text, and write it to the chosen file.
            FileWriter writer = new FileWriter(outputFile);
            String content = readTextFile("WebRoot/Report.html", "UTF-8");
            String txtcontent = getText(content);
            writer.write(txtcontent);
            writer.close();
            System.out.println("txt file saved successfully!");
            System.out.println("file path is: " + new File(outputFile).toURI().toURL());
        } catch (IOException ex) {
            System.out.println("Failed to save the txt file!");
        } catch (ParserException ex) {
            System.out.println("Character conversion failed");
        }
    }
    /* ---------------- get the text content and title ---------------- */
    public static String getText(String content) throws ParserException {
        Parser myParser; // htmlparser Parser that parses the HTML page
        NodeList nodeList = null;
        StringBuilder result = new StringBuilder();
        myParser = Parser.createParser(content, "UTF-8");
        NodeFilter textFilter = new NodeClassFilter(TextNode.class);
        NodeFilter linkFilter = new NodeClassFilter(LinkTag.class);

        // Accept a node if it is either a text node or a link tag.
        OrFilter lastFilter = new OrFilter();
        lastFilter.setPredicates(new NodeFilter[] {textFilter, linkFilter});
        nodeList = myParser.parse(lastFilter); // get the list of nodes
        Node[] nodes = nodeList.toNodeArray(); // get the array of nodes
        String line = "";

        for (int i = 0; i < nodes.length; i++) {
            Node anode = (Node) nodes[i];
            if (anode instanceof TextNode) {
                TextNode textnode = (TextNode) anode;
                line = textnode.getText();
            } else if (anode instanceof LinkTag) {
                LinkTag linknode = (LinkTag) anode;

                line = linknode.getLink();
            }

            if (isTrimEmpty(line))
                continue;

            result.append(line);
        }

        return result.toString();
    }

    /* ------------------- read the html file ------------------- */
    public static String readTextFile(String sFileName, String sEncode) {
        StringBuffer sbStr = new StringBuffer(); // accumulates the file content
        try {
            File ff = new File(sFileName);
            InputStreamReader read = new InputStreamReader(new FileInputStream(
                    ff), sEncode);
            BufferedReader ins = new BufferedReader(read);
            String dataLine = "";
            while (null != (dataLine = ins.readLine())) {
                sbStr.append(dataLine);
                sbStr.append("\r\n");
            }
            ins.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return sbStr.toString();
    }
    // True if the string is null, empty, or contains only whitespace.
    public static boolean isTrimEmpty(String astr) {
        if ((null == astr) || (astr.length() == 0)) {
            return true;
        }
        if (isBlank(astr.trim())) {
            return true;
        }
        return false;
    }

    // True if the string is null or has length zero.
    public static boolean isBlank(String astr) {
        if ((null == astr) || (astr.length() == 0)) {
            return true;
        } else {
            return false;
        }
    }

}





------ Solution --------------------------------------------

Error:
Exception in thread "main" java.lang.Error: Unresolved compilation problems:
The method replace(char, char) in the type String is not applicable for the arguments (s)
Syntax error on token "?", delete this token
s cannot be resolved to a type
Syntax error, insert ")" to complete MethodInvocation
Syntax error, insert ";" to complete Statement
body cannot be resolved to a variable
Syntax error on tokens, delete these tokens
Syntax error, insert "}" to complete Block

at format.conversion.HtmlToTxt.go(HtmlToTxt.java:60)
at format.conversion.HtmlToTxt.main(HtmlToTxt.java:39)

Which tool are you compiling with? Eclipse, isn't it?
html.replace("(?s)body{(.*?)}", "");
I left out the two quotation marks.

I use MyEclipse. I tried html.replace("(?s)body{(.*?)}", ""); it compiles without errors, but the body {} block is still in the txt text.

Are you sure? Is yours written as
body {}
or
body{}
with no space in the middle?
Or use String str = html.replaceAll("(?s)body.\\{.*?\\}", ""); so that you do not have to worry about whether there is a space before the {.
Also, I forgot to escape the special characters in my earlier version.
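
Why the first attempt compiled but changed nothing can be shown with a small sketch. String.replace treats its first argument as literal text, while String.replaceAll compiles it as a regular expression, so only the latter can match the block. The class and method names below are hypothetical, and the sample text is made up to resemble the output described in the question.

```java
final class ReplaceVsReplaceAll {
    // The pattern from the accepted reply: any one character between
    // "body" and the escaped "{", then a lazy match up to the first "}".
    static final String CSS_RULE_REGEX = "(?s)body.\\{.*?\\}";

    // String.replace looks for the pattern text itself, character by character.
    static String literalReplace(String text) {
        return text.replace(CSS_RULE_REGEX, "");
    }

    // String.replaceAll interprets the same string as a regular expression.
    static String regexReplace(String text) {
        return text.replaceAll(CSS_RULE_REGEX, "");
    }

    public static void main(String[] args) {
        String text = "body {\r\nfont-family: SimSun;\r\ncolor: #00F;\r\n}\r\nReport text";
        // The literal search finds nothing, so the text comes back unchanged.
        System.out.println(literalReplace(text).equals(text));
        // The regex search removes the whole block.
        System.out.println(regexReplace(text).trim());
    }
}
```

The braces are escaped because { and } are regex metacharacters, and (?s) lets the dot match the line breaks inside the block.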
------ For reference only ---------------------------------------
html.replace((?s)body{(.*?)}, "");

------ For reference only ---------------------------------- -----

Error:
Exception in thread "main" java.lang.Error: Unresolved compilation problems:
The method replace(char, char) in the type String is not applicable for the arguments (s)
Syntax error on token "?", delete this token
s cannot be resolved to a type
Syntax error, insert ")" to complete MethodInvocation
Syntax error, insert ";" to complete Statement
body cannot be resolved to a variable
Syntax error on tokens, delete these tokens
Syntax error, insert "}" to complete Block

at format.conversion.HtmlToTxt.go(HtmlToTxt.java:60)
at format.conversion.HtmlToTxt.main(HtmlToTxt.java:39)
------ For reference only ------------------ ---------------------

Which tool are you compiling with? Eclipse, isn't it?
html.replace("(?s)body{(.*?)}", "");
I left out the two quotation marks.
------ For reference only ------------------------- --------------

I use MyEclipse. I tried html.replace("(?s)body{(.*?)}", ""); it compiles without errors, but the body {} block is still in the txt text.
------ For reference only ------------------------------- --------

Yes, it is
body {
font-family: SimSun;
font-size: 22px;
font-style: italic;
font-weight: bold;
color: #00F;
}

This is the code in the HTML page that sets the font format and color.
------ For reference only ---------------------- -----------------

Thank you very much, this works.
------ For reference only ---------------------------------------

So you just want to get rid of that whole block, right?
String str = html.replaceAll("(?s)body\\s{0,1}\\{.*?\\}", "");
This should work; I tested it.
Make sure that what you write out at the end is str, not html.
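
The last remark matters because Java strings are immutable: replaceAll returns a new string and leaves html as it was. A minimal sketch of that point follows, with \\s* swapped in for \\s{0,1} as an assumption so that a line break or several spaces before the { are also tolerated. BodyRuleRemover and removeBodyRule are hypothetical names.

```java
final class BodyRuleRemover {
    // Hypothetical helper: removes every "body { ... }" rule from the text.
    static String removeBodyRule(String text) {
        // \\s* accepts "body{", "body {" and "body" followed by a line break;
        // (?s) lets .*? run across the line breaks inside the braces.
        return text.replaceAll("(?s)body\\s*\\{.*?\\}", "");
    }

    public static void main(String[] args) {
        String html = "body {\r\nfont-size: 22px;\r\n}Report text";

        // The result of this call is discarded, so html keeps the block.
        html.replaceAll("(?s)body\\s*\\{.*?\\}", "");
        System.out.println(html.startsWith("body"));

        // The cleaned text exists only in the returned value.
        String str = removeBodyRule(html);
        System.out.println(str);
    }
}
```

In the class from the question, that would mean passing the returned string to writer.write rather than the original one.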
------ For reference only ------------------------------------ ---

Ah, thank you~
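
A different technique from the one used in this thread, offered only as a sketch: assuming the body {...} rule sits inside a <style> element of Report.html (the thread does not show the HTML), the whole element could be stripped from the HTML string before it is handed to getText. StyleStripper and stripStyleBlocks are hypothetical names, and only java.util.regex is used.

```java
import java.util.regex.Pattern;

final class StyleStripper {
    // (?i) ignores case in the tag name, (?s) lets '.' match line breaks.
    // The lazy .*? stops at the first closing </style> tag.
    private static final Pattern STYLE_BLOCK =
            Pattern.compile("(?is)<style\\b[^>]*>.*?</style\\s*>");

    // Returns the HTML with every <style>...</style> element removed.
    static String stripStyleBlocks(String html) {
        return STYLE_BLOCK.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<html><head><style type=\"text/css\">\r\n"
                + "body {\r\n  font-family: SimSun;\r\n  color: #00F;\r\n}\r\n"
                + "</style></head><body><p>Report text</p></body></html>";
        System.out.println(stripStyleBlocks(html));
    }
}
```

This removes all CSS rules in such elements, not just the body rule, which may or may not be what is wanted.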
