Remove all HTML tags + content from text

OK, as simple as this may seem, I still can't get it right. I tried RegEx, I even tried DOM parsing, but I still couldn't get it to work properly.

Based on the answer to my previous question (Trying to remove HTML tags (+ content) from String), here is what I've got:

    public static function removeHtmlTags($str)
    {
        $dom = new DOMDocument();
        // Collect libxml warnings internally instead of emitting them
        $errorState = libxml_use_internal_errors(true);
        $dom->loadHTML($str);
        $xpath = new DOMXPath($dom);
        $node = $xpath->query('//body/p/text()')->item(0);
        if (isset($node->textContent)) {
            $ret = $node->textContent;
        } else {
            $ret = "";
        }
        // Restore the previous libxml error handling setting
        libxml_use_internal_errors($errorState);
        return $ret;
    }

This seems to do the trick most of the time; however, here's the catch…

This (well, in case you can't tell what it is, it's a Wikipedia Infobox):

 |conventional_long_name = Italian Republic |native_name = {{lang|it|''Repubblica italiana<!--italiana is without uppercase; see Italian wiki-->''}} |common_name = Italy |nickname(s) = Il Belpaese |image_flag = Flag of Italy.svg |image_coat = Italy-Emblem.svg |symbol_type = Emblem |image_map = EU-Italy.svg |map_caption = {{map caption |location_color=dark green |region=Europe |region_color=dark grey |subregion=the [[European Union]] |subregion_color=green |legend=EU-Italy.svg}} |national_anthem = {{native name|it|[[Il Canto degli Italiani]]}}<br/>{{small|''The Song of the Italians''}} [[File:Inno di Mameli instrumental.ogg|center]] |official_languages = [[Italian language|Italian]]<sup>a</sup> |Religion= [[Roman Catholic]] |capital = {{Coat of arms|Rome}} |latd=41 |latm=54 |latNS=N |longd=12 |longm=29 |longEW=E |largest_city = capital |largest_metropolitan area = {{hlist |[[Milan]] |[[Naples]]}} |demonym = [[Italians|Italian]] |government_type = [[Unitary state|Unitary]] [[parliamentary system|parliamentary]] [[constitutional republic]] |leader_title1 = [[President of Italy|President]] |leader_name1 = [[Giorgio Napolitano]] |leader_title2 = [[Prime Minister of Italy|Prime Minister]] |leader_name2 = [[Enrico Letta]] |leader_title3 = [[List of Presidents of the Senate of Italy|President of the Senate]] |leader_name3 = [[Pietro Grasso]] |leader_title4 = [[List of Presidents of the Italian Chamber of Deputies|President of the Chamber of Deputies]] |leader_name4 = [[Laura Boldrini]] |legislature = [[Parliament of Italy|Parliament]] |upper_house = [[Italian Senate|Senate of the Republic]] |lower_house = [[Italian Chamber of Deputies|Chamber of Deputies]] |accessionEUdate = 25 March 1957 (founding member) |EUseats = 78 |area_rank = 72nd |area_magnitude = 1 E11 |area_km2 = 301,338 |area_sq_mi = 116,347 <!--Do not remove per [[WP:MOSNUM]]--> |percent_water = 2.4 |population_census = 59,433,744<ref name="Istat">{{cite web 
|url=http://www.istat.it/it/files/2012/12/volume_popolazione-legale_XV_censimento_popolazione.pdf|title=Census 2011 - final results |publisher=[[National Institute of Statistics (Italy)|ISTAT]] |accessdate=19 December 2012}}</ref> |population_census_year = 2011 |population_census_rank = 23rd |population_estimate = 59,685,227<ref>{{cite web |url=http://www.istat.it/en/archive/94537|title=Resident population and population change|publisher=[[National Institute of Statistics (Italy)|ISTAT]] |accessdate=25 June 2013}}</ref> |population_estimate_year = 2012 |population_estimate_rank = 23rd |population_density_rank = 63rd |population_density_km2 = 197.7 |population_density_sq_mi = 511.6 <!--Do not remove per [[WP:MOSNUM]]--> |GDP_PPP = $1.848 trillion<ref name=autogenerated1 >{{cite web |url=http://www.imf.org/external/pubs/ft/weo/2013/02/weodata/weorept.aspx?pr.x=25&pr.y=1&sy=2013&ey=2013&scsm=1&ssd=1&sort=country&ds=.&br=1&c=136&s=NGDPD%2CNGDPDPC%2CPPPGDP%2CPPPPC&grp=0&a= |title=Italy |publisher=International Monetary Fund |accessdate=17 October 2013}}</ref> |GDP_PPP_rank = 11th |GDP_PPP_year = 2014 |GDP_PPP_per_capita = $30,218<ref name=autogenerated1/> |GDP_PPP_per_capita_rank = 34th |GDP_nominal = $2.148 trillion<ref name=autogenerated1/> |GDP_nominal_rank = 9th |GDP_nominal_year = 2014 |GDP_nominal_per_capita = $35,123<ref name=autogenerated1/> |GDP_nominal_per_capita_rank = 27th |sovereignty_type = [[History of Italy|Formation]] |established_event1 = [[Italian unification|Unification]] |established_date1 = 17 March 1861 |established_event2 = [[Italian constitutional referendum, 1946|Republic]] |established_date2 = 2 June 1946 |Gini_year = 2011 |Gini_change = <!--increase/decrease/steady--> |Gini = 31.9 <!--number only--> |Gini_ref = <ref name=eurogini>{{cite web|title=Gini coefficient of equivalised disposable income (source: SILC)|url=http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=ilc_di12|publisher=Eurostat Data Explorer|accessdate=13 August 
2013}}</ref> |Gini_rank = |HDI_year = 2013 |HDI_change = increase <!--increase/decrease/steady--> |HDI = 0.881 <!--number only--> |HDI_ref = <ref name="HDI">{{cite web |url=http://hdr.undp.org/en/media/HDR_2011_EN_Table1.pdf |title=Human Development Report 2011 |year=2011 |publisher=United Nations |accessdate=5 November 2011}}</ref> |HDI_rank = 25th |currency = Euro ([[Euro sign|€]])<sup>b</sup> |currency_code = EUR |country_code = |time_zone = [[Central European Time|CET]] |utc_offset = +1 |time_zone_DST = [[Central European Summer Time|CEST]] |utc_offset_DST = +2 |drives_on = right |calling_code = [[Telephone numbers in Italy|39]]<sup>c</sup> |cctld = [[.it]]<sup>d</sup> |footnote_a = <span style="font-size:100%;">French is co-official in the [[Aosta Valley]]; [[Slovene language|Slovene]] is co-official in the [[province of Trieste]] and the [[province of Gorizia]]; German and [[Ladin language|Ladin]] are co-official in [[South Tyrol]].</span> |footnote_b = <span style="font-size:100%;">Before 2002, the [[Italian lira|Italian Lira]]. The euro is accepted in [[Campione d'Italia]], but the official currency there is the [[Swiss Franc]].<ref>{{cite web |url=http://www.comune.campione-d-italia.co.it/ |title=Comune di Campione d'Italia |publisher=Comune.campione-d-italia.co.it |date=14 July 2010 |accessdate=30 October 2010}}</ref></span> |footnote_c = <span style="font-size:100%;">To call [[Campione d'Italia]], it is necessary to use the Swiss code [[+41]].</span> |footnote_d = <span style="font-size:100%;">The [[.eu]] domain is also used, as it is shared with other [[European Union]] member states.</span> 

becomes (after also exploding on newlines):

 Array ( [conventional_long_name] => Italian Republic [native_name] => {{lang|it|''Repubblica italiana [common_name] => Italy [nickname(s)] => Il Belpaese [image_flag] => Flag of Italy.svg [image_coat] => Italy-Emblem.svg [symbol_type] => Emblem [image_map] => EU-Italy.svg [map_caption] => {{map caption |location_color=dark green |region=Europe |region_color=dark grey |subregion=the [[European Union]] |subregion_color=green |legend=EU-Italy.svg}} [national_anthem] => {{native name|it|[[Il Canto degli Italiani]]}} [official_languages] => [[Italian language|Italian]] [Religion] => [[Roman Catholic]] [capital] => {{Coat of arms|Rome}} [latd] => 41 |latm=54 |latNS=N |longd=12 |longm=29 |longEW=E [largest_city] => capital [largest_metropolitan area] => {{hlist |[[Milan]] |[[Naples]]}} [demonym] => [[Italians|Italian]] [government_type] => [[Unitary state|Unitary]] [[parliamentary system|parliamentary]] [[constitutional republic]] [leader_title1] => [[President of Italy|President]] [leader_name1] => [[Giorgio Napolitano]] [leader_title2] => [[Prime Minister of Italy|Prime Minister]] [leader_name2] => [[Enrico Letta]] [leader_title3] => [[List of Presidents of the Senate of Italy|President of the Senate]] [leader_name3] => [[Pietro Grasso]] [leader_title4] => [[List of Presidents of the Italian Chamber of Deputies|President of the Chamber of Deputies]] [leader_name4] => [[Laura Boldrini]] [legislature] => [[Parliament of Italy|Parliament]] [upper_house] => [[Italian Senate|Senate of the Republic]] [lower_house] => [[Italian Chamber of Deputies|Chamber of Deputies]] [accessionEUdate] => 25 March 1957 (founding member) [EUseats] => 78 [area_rank] => 72nd [area_magnitude] => 1 E11 [area_km2] => 301,338 [area_sq_mi] => 116,347 [percent_water] => 2.4 [population_census] => 59,433,744 [population_census_year] => 2011 [population_census_rank] => 23rd [population_estimate] => 59,685,227 [population_estimate_year] => 2012 [population_estimate_rank] => 23rd 
[population_density_rank] => 63rd [population_density_km2] => 197.7 [population_density_sq_mi] => 511.6 [GDP_PPP] => $1.848 trillion [GDP_PPP_rank] => 11th [GDP_PPP_year] => 2014 [GDP_PPP_per_capita] => $30,218 [GDP_PPP_per_capita_rank] => 34th [GDP_nominal] => $2.148 trillion [GDP_nominal_rank] => 9th [GDP_nominal_year] => 2014 [GDP_nominal_per_capita] => $35,123 [GDP_nominal_per_capita_rank] => 27th [sovereignty_type] => [[History of Italy|Formation]] [established_event1] => [[Italian unification|Unification]] [established_date1] => 17 March 1861 [established_event2] => [[Italian constitutional referendum, 1946|Republic]] [established_date2] => 2 June 1946 [Gini_year] => 2011 [Gini_change] => [Gini] => 31.9 [Gini_ref] => [HDI_year] => 2013 [HDI_change] => increase [HDI] => 0.881 [HDI_ref] => [HDI_rank] => 25th [currency] => Euro ([[Euro sign|â¬]]) [currency_code] => EUR [time_zone] => [[Central European Time|CET]] [utc_offset] => +1 [time_zone_DST] => [[Central European Summer Time|CEST]] [utc_offset_DST] => +2 [drives_on] => right [calling_code] => [[Telephone numbers in Italy|39]] [cctld] => [[.it]] [footnote_a] => [footnote_b] => [footnote_c] => [footnote_d] => ) 

And I'm wondering:

What happened to |native_name = {{lang|it|''Repubblica italiana<!--italiana is without uppercase; see Italian wiki-->''}}?

Shouldn't it be:

|native_name = {{lang|it|''Repubblica italiana''}}

Instead, it seems to drop the HTML comment and all the text that follows it.

Any ideas?

The way from hell:

    $str = substr($str, 1);          // drop the first character (the leading "|")
    $lines = explode("\n|", $str);   // one "key = value" entry per element
    $result = array();

    $pattern = '~
        # subpattern definitions
        (?(DEFINE)
            (?<c> <!--.*?--> )        # html comment
            (?<tag>                   # tag (possible nested tags with the same name)
                ( <(\w++)
                  (?>[^<]++ | \g<c> | < (?!/?\g{-1}) | (?-2) )*
                  </\g{-1}>
                )
            )
            (?<sctag> <\w++[^>]*> )   # self closing tag
        )

        # main pattern
        \g<c> | \g<tag> | \g<sctag> | \s+$
    ~x';

    foreach ($lines as $line) {
        // split each entry into key and value, then strip markup from the value
        $kv = explode(' = ', $line, 2);
        $kv[1] = (isset($kv[1])) ? preg_replace($pattern, '', $kv[1]) : null;
        $result[$kv[0]] = $kv[1];
    }
    unset($kv, $pattern, $lines, $str);

    echo '<pre>' . htmlspecialchars(print_r($result, true)) . '</pre>';
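To try the pattern on a single value rather than the whole infobox, here is a minimal sketch. The wrapper name `stripTagsWithContent` is hypothetical (it is not part of the answer), and the `sctag` subpattern is written as `<\w++[^>]*>` on the assumption that it is meant to catch a lone or self-closing tag such as `<br/>`.

```php
<?php
// Hypothetical helper: applies the pattern above to one value at a time.
function stripTagsWithContent(string $value): string
{
    $pattern = '~
        (?(DEFINE)
            (?<c> <!--.*?--> )          # html comment
            (?<tag>                     # paired tag, possibly nested
                ( <(\w++)
                  (?>[^<]++ | \g<c> | < (?!/?\g{-1}) | (?-2) )*
                  </\g{-1}>
                )
            )
            (?<sctag> <\w++[^>]*> )     # assumed form: lone or self-closing tag
        )
        \g<c> | \g<tag> | \g<sctag> | \s+$
    ~x';

    // preg_replace() returns null on a PCRE error; fall back to the input then.
    return preg_replace($pattern, '', $value) ?? $value;
}

// The comment is removed, the text after it is kept.
echo stripTagsWithContent("{{lang|it|''Repubblica italiana<!--italiana is without uppercase; see Italian wiki-->''}}"), "\n";
// prints {{lang|it|''Repubblica italiana''}}

// A paired <ref> is removed together with its content.
echo stripTagsWithContent('59,433,744<ref name="Istat">{{cite web |url=http://example.org/}}</ref>'), "\n";
// prints 59,433,744
```

Note that the pattern has to stay on multiple lines: with the `x` modifier, a `#` starts a comment that runs to the end of the line, so collapsing it onto one line would comment out everything after the first `#`.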

Note 1: since the string contains unusual tags (i.e. tags that are not HTML tags), the same tag may appear both as a paired tag and as a self-closing tag. In other words, you can find <ref>....</ref> and <ref/> (or <ref> used as a self-closing tag) in the same document. To deal with this particular case, you can change the middle line of the tag subpattern definition to:

    (?>[^<]++ | \g<c> | < (?!/?\g{-1}) | (?-2) | <\g{-1}\b[^>]*?/?> )*

Note 2: if you don't want to use a regular expression, the other way is to use the DOM. However, since the <ref> tag doesn't exist in HTML, you have to write your own DTD that describes this tag (and all the other HTML tags), add it to your string, and process the result with the DOMDocument class.
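On the DOM side, the truncated native_name in the question is consistent with how the original function reads the tree: `->item(0)` returns only the first text node, and an HTML comment splits the surrounding text into two separate text nodes. Below is a minimal sketch that concatenates all of them instead. It does not use a DTD; it relies on libxml's lenient HTML parser, which reports unknown tags such as <ref> as errors (suppressed here) but still builds nodes for them. The function name `removeHtmlTagsKeepAllText` is hypothetical, and the extra `//body/text()` branch is a precaution in case the parser does not wrap bare text in a <p>.

```php
<?php
// Hypothetical variant of the question's removeHtmlTags(): keeps every text
// node that is a direct child of <body> or of a <p> under it, in document
// order. Comments and elements (with their content) are skipped.
function removeHtmlTagsKeepAllText(string $str): string
{
    $dom = new DOMDocument();
    $errorState = libxml_use_internal_errors(true); // collect parser warnings silently
    $dom->loadHTML($str);
    libxml_clear_errors();
    libxml_use_internal_errors($errorState);        // restore the previous setting

    $xpath = new DOMXPath($dom);
    $ret = '';
    foreach ($xpath->query('//body/text() | //body/p/text()') as $node) {
        $ret .= $node->textContent;
    }
    return $ret;
}

echo removeHtmlTagsKeepAllText("{{lang|it|''Repubblica italiana<!--italiana is without uppercase; see Italian wiki-->''}}"), "\n";
```

Like the original function, this only looks at top-level text, so values that consist entirely of a <span>…</span> (the footnote_* entries) still come out empty.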