I have this function to make sure every img tag has an absolute URL:
    function absoluteSrc($html, $encoding = 'utf-8')
    {
        $dom = new DOMDocument();

        // Workaround to use proper encoding
        $prehtml  = "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset={$encoding}\"></head><body>";
        $posthtml = "</body></html>";

        if ($dom->loadHTML($prehtml . trim($html) . $posthtml)) {
            foreach ($dom->getElementsByTagName('img') as $img) {
                if ($img instanceof DOMElement) {
                    $src = $img->getAttribute('src');
                    // Prefix every src that does not already start with http://
                    if (strpos($src, 'http://') !== 0) {
                        $img->setAttribute('src', 'http://my.server/' . $src);
                    }
                }
            }

            $html = $dom->saveHTML();

            // Remove remains of workaround / DOMDocument additions
            $cut_start  = strpos($html, '<body>') + 6;
            $cut_length = -1 * (1 + strlen($posthtml));
            $html = substr($html, $cut_start, $cut_length);
        }

        return $html;
    }
It works fine, but it returns the entities decoded as Unicode characters:
    $html = <<< EOHTML
    <p><img src="images/lorem.jpg" alt="lorem" align="left">
    Lorem ipsum dolor sit amet consectetuer Nullam felis laoreet
    Cum magna. Suscipit sed vel tincidunt urna.<br>
    Vel consequat pretium Curabitur faucibus justo adipiscing elit.
    <img src="others/ipsum.png" alt="ipsum" align="right"></p>
    <center>&copy; Dr Jekyll &amp; Mr Hyde</center>
    EOHTML;

    echo absoluteSrc($html);
Output:
    <p><img src="http://my.server/images/lorem.jpg" alt="lorem" align="left">
    Lorem ipsum dolor sit amet consectetuer Nullam felis laoreet
    Cum magna. Suscipit sed vel tincidunt urna.<br>
    Vel consequat pretium Curabitur faucibus justo adipiscing elit.
    <img src="http://my.server/others/ipsum.png" alt="ipsum" align="right"></p>
    <center>© Dr Jekyll &amp; Mr Hyde</center>
As you can see in the last line, the &copy; entity has been turned into the literal © character.
I would like entities to stay exactly as they are in the input string.
I would like to know the answer to this question too.
I ended up converting &…; entities to **ENTITY-...-ENTITY** placeholders before parsing and converting them back once parsing is done.
The following code, which uses ^name^ as its placeholder format, seems to work:
        // Fragment: this is the body of a class method. The method header and
        // the getInnerHTML() helper were not included in the original post.
        $dom = new DOMDocument('1.0', 'UTF-8');
        $dom->loadHTML($this->htmlentities2stringcode(rawurldecode($content)));
        $dom->preserveWhiteSpace = true;

        // Strip the wrapper markup and empty paragraphs added around the fragment
        $innerHTML = $this->getInnerHTML($dom);
        $innerHTML = str_replace("<p></p>", "", $innerHTML);
        $innerHTML = str_replace("+", "%2B", $innerHTML);
        $innerHTML = str_replace("</body></html>", "", $innerHTML);
        $innerHTML = str_replace("<html></html><html><body>", "", $innerHTML);

        return $this->stringcode2htmlentities($innerHTML);
    }

    // ----------------------------------------------------------
    function htmlentities2stringcode($string)
    {
        // Converts HTML entities such as &euro; into the pseudo-string form ^euro^
        $from = array_keys($this->getHTMLEntityStringCodeArray());
        $to   = array_values($this->getHTMLEntityStringCodeArray());
        return str_replace($from, $to, $string);
    }

    // ----------------------------------------------------------
    function stringcode2htmlentities($string)
    {
        // Converts pseudo-strings such as ^euro^ back to the original entity &euro;
        $from = array_values($this->getHTMLEntityStringCodeArray());
        $to   = array_keys($this->getHTMLEntityStringCodeArray());
        return str_replace($from, $to, $string);
    }

    // ----------------------------------------------------------
    function getHTMLEntityStringCodeArray()
    {
        // Each entity maps to a unique ^name^ placeholder so the mapping is reversible
        return array(
            '&Alpha;'=>'^Alpha^', '&Beta;'=>'^Beta^', '&Chi;'=>'^Chi^', '&Dagger;'=>'^Dagger^',
            '&Delta;'=>'^Delta^', '&Epsilon;'=>'^Epsilon^', '&Eta;'=>'^Eta^', '&Gamma;'=>'^Gamma^',
            '&Iota;'=>'^Iota^', '&Kappa;'=>'^Kappa^', '&Lambda;'=>'^Lambda^', '&Mu;'=>'^Mu^',
            '&Nu;'=>'^Nu^', '&OElig;'=>'^OElig^', '&Omega;'=>'^Omega^', '&Omicron;'=>'^Omicron^',
            '&Phi;'=>'^Phi^', '&Pi;'=>'^Pi^', '&Prime;'=>'^Prime^', '&Psi;'=>'^Psi^',
            '&Rho;'=>'^Rho^', '&Scaron;'=>'^Scaron^', '&Sigma;'=>'^Sigma^', '&Tau;'=>'^Tau^',
            '&Theta;'=>'^Theta^', '&Upsilon;'=>'^Upsilon^', '&Xi;'=>'^Xi^', '&Yuml;'=>'^Yuml^',
            '&Zeta;'=>'^Zeta^', '&alefsym;'=>'^alefsym^', '&alpha;'=>'^alpha^', '&and;'=>'^and^',
            '&ang;'=>'^ang^', '&asymp;'=>'^asymp^', '&bdquo;'=>'^bdquo^', '&beta;'=>'^beta^',
            '&bull;'=>'^bull^', '&cap;'=>'^cap^', '&chi;'=>'^chi^', '&circ;'=>'^circ^',
            '&clubs;'=>'^clubs^', '&cong;'=>'^cong^', '&crarr;'=>'^crarr^', '&cup;'=>'^cup^',
            '&dArr;'=>'^dArr^', '&dagger;'=>'^dagger^', '&darr;'=>'^darr^', '&delta;'=>'^delta^',
            '&diams;'=>'^diams^', '&empty;'=>'^empty^', '&emsp;'=>'^emsp^', '&ensp;'=>'^ensp^',
            '&epsilon;'=>'^epsilon^', '&equiv;'=>'^equiv^', '&eta;'=>'^eta^', '&euro;'=>'^euro^',
            '&exist;'=>'^exist^', '&fnof;'=>'^fnof^', '&forall;'=>'^forall^', '&frasl;'=>'^frasl^',
            '&gamma;'=>'^gamma^', '&ge;'=>'^ge^', '&hArr;'=>'^hArr^', '&harr;'=>'^harr^',
            '&hearts;'=>'^hearts^', '&hellip;'=>'^hellip^', '&image;'=>'^image^', '&infin;'=>'^infin^',
            '&int;'=>'^int^', '&iota;'=>'^iota^', '&isin;'=>'^isin^', '&kappa;'=>'^kappa^',
            '&lArr;'=>'^lArr^', '&lambda;'=>'^lambda^', '&lang;'=>'^lang^', '&larr;'=>'^larr^',
            '&lceil;'=>'^lceil^', '&ldquo;'=>'^ldquo^', '&le;'=>'^le^', '&lfloor;'=>'^lfloor^',
            '&lowast;'=>'^lowast^', '&loz;'=>'^loz^', '&lrm;'=>'^lrm^', '&lsaquo;'=>'^lsaquo^',
            '&lsquo;'=>'^lsquo^', '&mdash;'=>'^mdash^', '&minus;'=>'^minus^', '&mu;'=>'^mu^',
            '&nabla;'=>'^nabla^', '&ndash;'=>'^ndash^', '&ne;'=>'^ne^', '&ni;'=>'^ni^',
            '&notin;'=>'^notin^', '&nsub;'=>'^nsub^', '&nu;'=>'^nu^', '&oelig;'=>'^oelig^',
            '&oline;'=>'^oline^', '&omega;'=>'^omega^', '&omicron;'=>'^omicron^', '&oplus;'=>'^oplus^',
            '&or;'=>'^or^', '&otimes;'=>'^otimes^', '&part;'=>'^part^', '&permil;'=>'^permil^',
            '&perp;'=>'^perp^', '&phi;'=>'^phi^', '&pi;'=>'^pi^', '&piv;'=>'^piv^',
            '&prime;'=>'^prime^', '&prod;'=>'^prod^', '&prop;'=>'^prop^', '&psi;'=>'^psi^',
            '&rArr;'=>'^rArr^', '&radic;'=>'^radic^', '&rang;'=>'^rang^', '&rarr;'=>'^rarr^',
            '&rceil;'=>'^rceil^', '&rdquo;'=>'^rdquo^', '&real;'=>'^real^', '&rfloor;'=>'^rfloor^',
            '&rho;'=>'^rho^', '&rlm;'=>'^rlm^', '&rsaquo;'=>'^rsaquo^', '&rsquo;'=>'^rsquo^',
            '&sbquo;'=>'^sbquo^', '&scaron;'=>'^scaron^', '&sdot;'=>'^sdot^', '&sigma;'=>'^sigma^',
            '&sigmaf;'=>'^sigmaf^', '&sim;'=>'^sim^', '&spades;'=>'^spades^', '&sub;'=>'^sub^',
            '&sube;'=>'^sube^', '&sum;'=>'^sum^', '&sup;'=>'^sup^', '&supe;'=>'^supe^',
            '&tau;'=>'^tau^', '&there4;'=>'^there4^', '&theta;'=>'^theta^', '&thetasym;'=>'^thetasym^',
            '&thinsp;'=>'^thinsp^', '&tilde;'=>'^tilde^', '&trade;'=>'^trade^', '&uArr;'=>'^uArr^',
            '&uarr;'=>'^uarr^', '&upsih;'=>'^upsih^', '&upsilon;'=>'^upsilon^', '&weierp;'=>'^weierp^',
            '&xi;'=>'^xi^', '&yuml;'=>'^yuml^', '&zeta;'=>'^zeta^', '&zwj;'=>'^zwj^',
            '&zwnj;'=>'^zwnj^',
        );
    }
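The placeholder idea in this answer could probably also be expressed with one regular expression instead of a lookup table, so that any named or numeric entity is covered. The following is only a minimal sketch under that assumption. The function names protectEntities() and restoreEntities() are hypothetical and do not come from the answer, and the placeholder format is the **ENTITY-...-ENTITY** one mentioned in the text.

```php
<?php
// Hypothetical helpers, not part of the original answer.

// Replace every entity reference (&name;, &#123; or &#x1F;) with a
// placeholder that an HTML parser will treat as ordinary text.
function protectEntities(string $html): string
{
    return preg_replace('/&(#?[A-Za-z0-9]+);/', '**ENTITY-$1-ENTITY**', $html);
}

// Turn the placeholders back into the original entity references.
function restoreEntities(string $html): string
{
    return preg_replace('/\*\*ENTITY-(#?[A-Za-z0-9]+)-ENTITY\*\*/', '&$1;', $html);
}

$input     = '<center>&copy; Dr Jekyll &amp; Mr Hyde</center>';
$protected = protectEntities($input);
// ... here $protected would be parsed, modified and serialized with DOMDocument ...
$output    = restoreEntities($protected);
```

Because the placeholder contains no ampersand, the parser has nothing to decode. The trade-off is that literal text which already looks like a placeholder would be turned into an entity on the way back.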
An alternative solution is to use DOMDocument::saveHTMLFile() (which does not convert HTML entities) and read the contents of the saved file back into a string.
It is not super pretty, but it has the advantage that you do not have to find and replace entity codes by hand (twice), as some of the other solutions suggested here require.
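A minimal sketch of this file round trip might look like the following. The helper name saveHtmlViaFile() and the temporary-file handling are assumptions made for illustration. Whether entities actually survive is the claim of this answer and may depend on the libxml version and the declared encoding, so the sketch only shows the mechanics.

```php
<?php
// Hypothetical helper, not part of the original answer.
function saveHtmlViaFile(DOMDocument $dom): string
{
    // Create a unique temporary file to serialize into
    $path = tempnam(sys_get_temp_dir(), 'dom');
    if ($path === false) {
        throw new RuntimeException('Could not create a temporary file');
    }

    try {
        // saveHTMLFile() returns the number of bytes written, or false on failure
        if ($dom->saveHTMLFile($path) === false) {
            throw new RuntimeException('saveHTMLFile() failed');
        }
        $html = file_get_contents($path);
        if ($html === false) {
            throw new RuntimeException('Could not read the temporary file');
        }
        return $html;
    } finally {
        // Always clean up the temporary file
        unlink($path);
    }
}

$dom = new DOMDocument();
$dom->loadHTML('<html><body><center>&copy; Dr Jekyll &amp; Mr Hyde</center></body></html>');
$result = saveHtmlViaFile($dom);
```

The returned string is the whole serialized document, so the body content still has to be cut out afterwards, for example the way absoluteSrc() does it in the question.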