How do I extract HTML comments and all the HTML contained in a node?

I'm building a small web app to help me manage and analyze the content of my websites, and cURL is my favorite new toy. I've figured out how to extract information about all sorts of elements, how to find all elements with a certain class, and so on, but I'm stuck on two problems (see below). I'm hoping there's some nifty XPath answer, but if I have to resort to regular expressions, I guess that's fine. I'm not that good with regex, though, so if you think that's the way to go, I'd appreciate examples...

A fairly standard starting point:

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_URL, $target_url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html = curl_exec($ch);
    if (!$html) {
        $info .= "<br />cURL error number:" . curl_errno($ch);
        $info .= "<br />cURL error:" . curl_error($ch);
        return $info;
    }

    $dom = new DOMDocument();
    @$dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

and extracting information, for example:

    // iframes
    $iframes = $xpath->evaluate("/html/body//iframe");
    $info .= '<h3>iframes (' . $iframes->length . '):</h3>';
    for ($i = 0; $i < $iframes->length; $i++) {
        // get iframe attributes
        $iframe = $iframes->item($i);
        $framesrc = $iframe->getAttribute("src");
        $framewidth = $iframe->getAttribute("width");
        $frameheight = $iframe->getAttribute("height");
        $framealt = $iframe->getAttribute("alt");
        $frameclass = $iframe->getAttribute("class");
        $info .= $framesrc . '&nbsp;(' . $framewidth . 'x' . $frameheight . '; class="' . $frameclass . '")' . '<br />';
    }

Questions / problems:

How do I extract HTML comments?

I can't figure out how to identify comments: are they considered nodes, or something else?

How do I get the entire content of a div, including its child nodes? So if the div contains an image and a couple of hrefs, it would find those and hand it all back to me as a block of HTML.

Comment nodes should be easy to find in XPath with the comment() test, which works like the text() test:

 $comments = $xpath->query('//comment()'); // or another path, as you prefer

These are standard nodes: see the manual entry for the DOMComment class.
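As a rough sketch of how that query might be used, assuming a made-up sample document (the markup and the `box` id are mine, not from the question), you can loop over the result and read each comment's text from `nodeValue`. The second query shows how to limit the search to one element:

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<html><body><!-- first --><div id="box"><!-- second --><p>Hi</p></div></body></html>';

$dom = new DOMDocument();
// Collect libxml parse warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Collect the text of every comment node in the document.
$commentTexts = array();
foreach ($xpath->query('//comment()') as $comment) {
    // Each result is a DOMComment; nodeValue is the text between <!-- and -->.
    $commentTexts[] = trim($comment->nodeValue);
}

// Restrict the search to comments inside one element.
$boxComments = $xpath->query('//div[@id="box"]//comment()');

print_r($commentTexts);
echo $boxComments->length, "\n";
```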

Your other question is a little trickier. The simplest way is to use saveXML() with the optional $node argument:

    // $el should be the element you want to get the HTML for
    $html = $dom->saveXML($el);
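Here is a minimal sketch of getting both the element itself and only its contents as a string. It swaps in `saveHTML()`, which also accepts a node argument as of PHP 5.3.6 and serializes as HTML rather than XML. The sample markup, the `box` id, and the variable names `$outer` and `$inner` are assumptions for illustration:

```php
<?php
// Hypothetical sample markup, used only for illustration.
$html = '<html><body><div id="box"><img src="a.png"><a href="/one">One</a></div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$el = $xpath->query('//div[@id="box"]')->item(0);

// "Outer" HTML: the div element together with everything inside it.
$outer = $dom->saveHTML($el);

// "Inner" HTML: serialize each child node and join the pieces.
$inner = '';
foreach ($el->childNodes as $child) {
    $inner .= $dom->saveHTML($child);
}

echo $outer, "\n";
echo $inner, "\n";
```

If you need XML-style output (for example self-closed tags), the same loop should work with `saveXML($child)` instead.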

For HTML comments, a quick method:

    function getComments($html)
    {
        $rcomments = array();
        // With the default PREG_PATTERN_ORDER flag, $rcomments[0] holds the full
        // matches and $rcomments[1] holds the text captured between <!-- and -->.
        if (preg_match_all('#<!--(.*?)-->#is', $html, $rcomments)) {
            return $rcomments[1];
        }
        // No comments matched
        return null;
    }

    public function parse($source)
    {
        $comments = array();

        // multiline comment /* */
        $tmp = explode("/*", $source);
        foreach ($tmp as $t) {
            if (strpos($t, "*/") !== false) {
                $comment = explode("*/", $t)[0];
                $comment = trim($comment);
                if (!empty($comment)) {
                    $comments[] = "/* " . $comment . " */";
                }
            }
        }

        // multiline comment <!-- -->
        $tmp = explode("<!--", $source);
        foreach ($tmp as $t) {
            if (strpos($t, "-->") !== false) {
                $comment = explode("-->", $t)[0];
                $comment = trim($comment);
                if (!empty($comment)) {
                    $comments[] = "<!-- " . $comment . " -->";
                }
            }
        }

        // single-line comment //
        $tmp = explode("//", $source);
        foreach ($tmp as $t) {
            if (empty($t)) {
                continue;
            }
            $pos = strpos($source, $t);
            if ($pos > 1) {
                // Keep the chunk only if it is directly preceded by "//"
                if ($source[$pos - 2] == "/" && $source[$pos - 1] == "/") {
                    $comment = trim(explode("\n", $t)[0]);
                    if (!empty($comment)) {
                        $comments[] = "// " . $comment;
                    }
                }
            }
        }

        // The snippet as posted stops above; returning the collected
        // comments is the assumed intent.
        return $comments;
    }

For comments, what you are looking for is a recursive regular expression. For example, to get rid of HTML comments:

    $yourHTML = preg_replace('/<!--(?(?=<!--)(?R)|.)*?-->/s', '', $yourHTML);

To find them:

 preg_match_all('/(<!--(?(?=<!--)(?R)|.)*?-->)/s',$yourHTML,$comments);
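A small sketch of both calls on a made-up input string (the sample string and the names `$pattern`, `$found` and `$stripped` are assumptions for illustration). Note that `preg_replace()` takes the pattern, then the replacement, then the subject, and returns the new string:

```php
<?php
// Hypothetical input, used only for illustration.
$yourHTML = '<p>a</p><!-- one --><p>b</p><!-- two -->';

$pattern = '/<!--(?(?=<!--)(?R)|.)*?-->/s';

// Find: $found[0] holds the full text of each matched comment.
preg_match_all($pattern, $yourHTML, $found);

// Strip: replace every matched comment with an empty string.
$stripped = preg_replace($pattern, '', $yourHTML);

print_r($found[0]);
echo $stripped, "\n";
```

Keep in mind that an HTML parser ends a comment at the first `-->`, so the recursive part only matters if you want to treat nested `<!--` markers as one unit.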