Removing non-ASCII characters from a string

When fetching data from a website, I get strange characters:

Â

How can I remove everything that is not an extended ASCII character?


The best option is a regular expression replacement. Here $str is a sample string, and it is matched against :print:, which is a POSIX character class:

    $str = 'aAÂ';
    $str = preg_replace('/[[:^print:]]/', '', $str); // should be aA

The negated class [:^print:] matches every non-printable character, so any character that is not printable in the current character set is removed.

Note: Before using this method, make sure your current character set is ASCII. POSIX character classes support both ASCII and Unicode and match only according to the current character set. As of PHP 5.6, the default charset is UTF-8.
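Whether a string is already pure ASCII can be checked before and after the replacement. The sketch below assumes a helper named is_pure_ascii (a hypothetical name, not part of the answer above) that looks for any byte outside the 7-bit range:

```php
<?php
// Hypothetical helper: true when $s contains only 7-bit ASCII bytes.
function is_pure_ascii(string $s): bool
{
    // Without the "u" modifier PCRE matches byte by byte,
    // so \x80-\xFF catches every byte of a multi-byte sequence.
    return preg_match('/[\x80-\xFF]/', $s) === 0;
}

$str = "aA\xC3\x82";            // 'aAÂ' encoded as UTF-8
var_dump(is_pure_ascii($str));  // bool(false)
var_dump(is_pure_ascii('aA'));  // bool(true)
```

Running the check on the result of preg_replace is a cheap way to confirm that nothing outside ASCII slipped through.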

Do you need only printable ASCII characters?

Use this:

    <?php
    header('Content-Type: text/html; charset=UTF-8');

    $str = "abqwrešđčžsff";
    // Keep only printable ASCII (space through tilde).
    $res = preg_replace('/[^\x20-\x7E]/', '', $str);
    echo "($str)($res)";

Or, even better, convert your input to UTF-8 and use the phputf8 library to translate "abnormal" characters into their ASCII representation:

    require_once('libs/utf8/utf8.php');
    require_once('libs/utf8/utils/bad.php');
    require_once('libs/utf8/utils/validation.php');
    require_once('libs/utf8_to_ascii/utf8_to_ascii.php');

    // Strip invalid byte sequences first, then transliterate to ASCII.
    if (!utf8_is_valid($str)) {
        $str = utf8_bad_strip($str);
    }
    $str = utf8_to_ascii($str, '');

    $clearstring = filter_var($rawstring, FILTER_SANITIZE_STRING, FILTER_FLAG_STRIP_HIGH);
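FILTER_SANITIZE_STRING is deprecated as of PHP 8.1. On newer versions a similar byte-stripping effect can be had with FILTER_UNSAFE_RAW, which accepts the same FILTER_FLAG_STRIP_HIGH flag. This is only a sketch; the helper name strip_high_bytes is made up for illustration, and unlike FILTER_SANITIZE_STRING it does not touch HTML tags or quotes:

```php
<?php
// Hypothetical helper: removes every byte with a value above 0x7F.
function strip_high_bytes(string $raw): string
{
    // FILTER_UNSAFE_RAW leaves the string as is, apart from the flags given.
    return filter_var($raw, FILTER_UNSAFE_RAW, FILTER_FLAG_STRIP_HIGH);
}

echo strip_high_bytes("aA\xC3\x82"); // aA
```

Adding FILTER_FLAG_STRIP_LOW to the flags would also drop control characters below 0x20.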

Somewhat related: we had a web application that had to send data to a legacy system that could only handle the first 128 characters of the ASCII character set.

The solution we had to use was something that would "translate" as many characters as possible into close ASCII equivalents, but leave alone anything that could not be translated.

Normally I would do something like this:

    <?php
    // transliterate
    if (function_exists('iconv')) {
        $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
    }
    ?>

... but this replaces everything that cannot be translated with a question mark (?).
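One alternative that, as far as I know, leaves untranslatable characters in place instead of turning them into question marks is ICU transliteration from the intl extension. This is a different technique from the iconv call above, shown only as a sketch: it assumes ext-intl is installed, and the function name to_ascii_or_keep is hypothetical.

```php
<?php
// Sketch only: assumes the intl extension is loaded.
function to_ascii_or_keep(string $text): string
{
    if (!function_exists('transliterator_transliterate')) {
        return $text; // intl not available: leave the text untouched
    }
    // "Any-Latin" romanizes other scripts, "Latin-ASCII" folds accents.
    $result = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);

    // The call returns false on failure; fall back to the original text.
    return $result === false ? $text : $result;
}

echo to_ascii_or_keep('Ąąśćł déjà vu');
```

The exact output depends on the ICU version bundled with PHP, so it is worth checking the result against real data before relying on it.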

So we ended up doing the following. At the end of the function there is a (commented-out) PHP regex that simply strips non-ASCII characters.

    <?php
    // Originally a class method ("public function"); written here as a
    // standalone function so the snippet runs on its own.
    function cleanNonAsciiCharactersInString($orig_text)
    {
        $text = $orig_text;

        // Single letters
        $text = preg_replace("/[∂άαáàâãªä]/u", "a", $text);
        $text = preg_replace("/[∆лДΛдАÁÀÂÃÄ]/u", "A", $text);
        $text = preg_replace("/[ЂЪЬБъь]/u", "b", $text);
        $text = preg_replace("/[βвВ]/u", "B", $text);
        $text = preg_replace("/[çς©с]/u", "c", $text);
        $text = preg_replace("/[ÇС]/u", "C", $text);
        $text = preg_replace("/[δ]/u", "d", $text);
        $text = preg_replace("/[éèêëέëèεе℮ёєэЭ]/u", "e", $text);
        $text = preg_replace("/[ÉÈÊË€ξЄ€Е∑]/u", "E", $text);
        $text = preg_replace("/[₣]/u", "F", $text);
        $text = preg_replace("/[НнЊњ]/u", "H", $text);
        $text = preg_replace("/[ђћЋ]/u", "h", $text);
        $text = preg_replace("/[ÍÌÎÏ]/u", "I", $text);
        $text = preg_replace("/[íìîïιίϊі]/u", "i", $text);
        $text = preg_replace("/[Јј]/u", "j", $text);
        $text = preg_replace("/[ΚЌК]/u", 'K', $text);
        $text = preg_replace("/[ќк]/u", 'k', $text);
        $text = preg_replace("/[ℓ∟]/u", 'l', $text);
        $text = preg_replace("/[Мм]/u", "M", $text);
        $text = preg_replace("/[ñηήηπⁿ]/u", "n", $text);
        $text = preg_replace("/[Ñ∏пПИЙийΝЛ]/u", "N", $text);
        $text = preg_replace("/[óòôõºöοФσόо]/u", "o", $text);
        $text = preg_replace("/[ÓÒÔÕÖθΩθОΩ]/u", "O", $text);
        $text = preg_replace("/[ρφрРф]/u", "p", $text);
        $text = preg_replace("/[®яЯ]/u", "R", $text);
        $text = preg_replace("/[ГЃгѓ]/u", "r", $text);
        $text = preg_replace("/[Ѕ]/u", "S", $text);
        $text = preg_replace("/[ѕ]/u", "s", $text);
        $text = preg_replace("/[Тт]/u", "T", $text);
        $text = preg_replace("/[τ†‡]/u", "t", $text);
        $text = preg_replace("/[úùûüџμΰµυϋύ]/u", "u", $text);
        $text = preg_replace("/[√]/u", "v", $text);
        $text = preg_replace("/[ÚÙÛÜЏЦц]/u", "U", $text);
        $text = preg_replace("/[Ψψωώẅẃẁщш]/u", "w", $text);
        $text = preg_replace("/[ẀẄẂШЩ]/u", "W", $text);
        $text = preg_replace("/[ΧχЖХж]/u", "x", $text);
        $text = preg_replace("/[ỲΫ¥]/u", "Y", $text);
        $text = preg_replace("/[ỳγўЎУуч]/u", "y", $text);
        $text = preg_replace("/[ζ]/u", "Z", $text);

        // Punctuation
        $text = preg_replace("/[‚‚]/u", ",", $text);
        $text = preg_replace("/[`‛′'']/u", "'", $text);
        $text = preg_replace("/[″“”«»„]/u", '"', $text);
        $text = preg_replace("/[—–―−–‾⌐─↔→←]/u", '-', $text);
        $text = preg_replace("/[ ]/u", ' ', $text);
        $text = str_replace("…", "...", $text);
        $text = str_replace("≠", "!=", $text);
        $text = str_replace("≤", "<=", $text);
        $text = str_replace("≥", ">=", $text);
        $text = preg_replace("/[‗≈≡]/u", "=", $text);

        // Exciting combinations
        $text = str_replace("ыЫ", "bl", $text);
        $text = str_replace("℅", "c/o", $text);
        $text = str_replace("₧", "Pts", $text);
        $text = str_replace("™", "tm", $text);
        $text = str_replace("№", "No", $text);
        $text = str_replace("Ч", "4", $text);
        $text = str_replace("‰", "%", $text);
        $text = preg_replace("/[∙•]/u", "*", $text);
        $text = str_replace("‹", "<", $text);
        $text = str_replace("›", ">", $text);
        $text = str_replace("‼", "!!", $text);
        $text = str_replace("⁄", "/", $text);
        $text = str_replace("∕", "/", $text);
        $text = str_replace("⅞", "7/8", $text);
        $text = str_replace("⅝", "5/8", $text);
        $text = str_replace("⅜", "3/8", $text);
        $text = str_replace("⅛", "1/8", $text);
        $text = preg_replace("/[‰]/u", "%", $text);
        $text = preg_replace("/[Љљ]/u", "Ab", $text);
        $text = preg_replace("/[Юю]/u", "IO", $text);
        $text = preg_replace("/[ﬁﬂ]/u", "fi", $text);
        $text = preg_replace("/[зЗ]/u", "3", $text);
        $text = str_replace("£", "(pounds)", $text);
        $text = str_replace("₤", "(lira)", $text);
        $text = preg_replace("/[‰]/u", "%", $text);
        $text = preg_replace("/[↨↕↓↑│]/u", "|", $text);
        $text = preg_replace("/[∞∩∫⌂⌠⌡]/u", "", $text);

        // 2) Translation CP1252.
        $trans = get_html_translation_table(HTML_ENTITIES);
        $trans['f'] = '&fnof;'; // Latin Small Letter F With Hook
        $trans['-'] = array(
            '&hellip;', // Horizontal Ellipsis
            '&tilde;',  // Small Tilde
            '&ndash;'   // Dash
        );
        $trans["+"] = '&dagger;';  // Dagger
        $trans['#'] = '&Dagger;';  // Double Dagger
        $trans['M'] = '&permil;';  // Per Mille Sign
        $trans['S'] = '&Scaron;';  // Latin Capital Letter S With Caron
        $trans['OE'] = '&OElig;';  // Latin Capital Ligature OE
        $trans["'"] = array(
            '&lsquo;',  // Left Single Quotation Mark
            '&rsquo;',  // Right Single Quotation Mark
            '&rsaquo;', // Single Right-Pointing Angle Quotation Mark
            '&sbquo;',  // Single Low-9 Quotation Mark
            '&circ;',   // Modifier Letter Circumflex Accent
            '&lsaquo;'  // Single Left-Pointing Angle Quotation Mark
        );
        $trans['"'] = array(
            '&ldquo;', // Left Double Quotation Mark
            '&rdquo;', // Right Double Quotation Mark
            '&bdquo;', // Double Low-9 Quotation Mark
        );
        $trans['*'] = '&bull;';    // Bullet
        $trans['n'] = '&ndash;';   // En Dash
        $trans['m'] = '&mdash;';   // Em Dash
        $trans['tm'] = '&trade;';  // Trade Mark Sign
        $trans['s'] = '&scaron;';  // Latin Small Letter S With Caron
        $trans['oe'] = '&oelig;';  // Latin Small Ligature OE
        $trans['Y'] = '&Yuml;';    // Latin Capital Letter Y With Diaeresis
        $trans['euro'] = '&euro;'; // euro currency symbol
        ksort($trans);

        // Replace each entity (value) with its plain replacement (key).
        foreach ($trans as $k => $v) {
            $text = str_replace($v, $k, $text);
        }

        // 3) remove <p>, <br/> ...
        $text = strip_tags($text);

        // 4) &amp; => & &quot; => '
        $text = html_entity_decode($text);

        // transliterate
        // if (function_exists('iconv')) {
        //     $text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
        // }

        // remove non ascii characters
        // $text = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $text);

        return $text;
    }
    ?>
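The same lookup-table idea can be written more compactly with strtr() and a single array. The sketch below is not the function above: the table ASCII_MAP and the name map_to_ascii are hypothetical, the mapping is deliberately tiny, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical, deliberately small mapping table; extend as needed.
const ASCII_MAP = [
    'é' => 'e', 'è' => 'e', 'ñ' => 'n', 'Ü' => 'U',
    '…' => '...', '≤' => '<=', '™' => 'tm', '“' => '"', '”' => '"',
];

function map_to_ascii(string $text, bool $stripRest = false): string
{
    // strtr() with an array tries the longest keys first and never
    // re-scans text it has already replaced.
    $text = strtr($text, ASCII_MAP);

    if ($stripRest) {
        // Optionally drop whatever is still outside 7-bit ASCII.
        $text = preg_replace('/[^\x00-\x7F]/', '', $text);
    }
    return $text;
}

echo map_to_ascii('“señor” ≤ café…'); // "senor" <= cafe...
```

With $stripRest left at false, characters missing from the table pass through unchanged, which matches the "leave it alone" requirement described earlier.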

I also think the best solution may be to use a regular expression.

Here is my suggestion:

    function convert_to_normal_text($text)
    {
        $normal_characters = "a-zA-Z0-9\s`~!@#$%^&*()_+-={}|:;<>?,.\/\"\'\\\[\]";
        $normal_text = preg_replace("/[^$normal_characters]/", '', $text);
        return $normal_text;
    }

You can then use it like this:

    $before = 'Some "normal characters": Abc123!+, some ASCII characters: ABC+ŤĎ and some non-ASCII characters: Ąąśćł.';
    $after = convert_to_normal_text($before);
    echo $after;

Output:

 Some "normal characters": Abc123!+, some ASCII characters: ABC+ and some non-ASCII characters: .

I just needed to add the header:

 header('Content-Type: text/html; charset=UTF-8');

This should be fairly straightforward and does not need the iconv function:

    // Remove all characters that are not the separator, a-z, 0-9, underscore, or whitespace
    $string = preg_replace('![^' . preg_quote('-') . 'a-z0-9_\s]+!', '', strtolower($string));
    // Replace all separator characters and whitespace with a single separator
    $string = preg_replace('![' . preg_quote('-') . '\s]+!u', '-', $string);

I think the best way to do something like this is to use the ord() function. That way you can keep characters written in any language. Just remember to check the results on your own text first. This will not work with Unicode.

    $name = "βγδεζηΘKgfgebhjrf!@#$%^&";

    // This function removes all non-Greek and non-English characters (Greek ISO charset)
    function replace_characters($string)
    {
        $new_string = ''; // initialize the result to avoid an undefined-variable warning
        $str_length = strlen($string);
        for ($x = 0; $x < $str_length; $x++) {
            $character = $string[$x];
            $code = ord($character);
            if (($code > 64 && $code < 91)
                || ($code > 96 && $code < 123)
                || ($code > 192 && $code < 210)
                || ($code > 210 && $code < 218)
                || ($code > 219 && $code < 250)
                || $code == 252
                || $code == 254) {
                $new_string .= $character;
            }
        }
        return $new_string;
    } // end function

    $name = replace_characters($name);
    echo $name;
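Since the byte-by-byte ord() loop only fits single-byte charsets, a UTF-8 counterpart can lean on PCRE's Unicode script properties instead. The following is a sketch under the assumption that the input is valid UTF-8; the function name keep_greek_and_english is made up for illustration.

```php
<?php
// Hypothetical UTF-8 counterpart; assumes $string is valid UTF-8.
function keep_greek_and_english(string $string): string
{
    // \p{Greek} matches any code point in the Greek script; the "u"
    // modifier makes PCRE treat pattern and subject as UTF-8.
    $result = preg_replace('/[^\p{Greek}A-Za-z]/u', '', $string);

    // preg_replace() returns null when the subject is not valid UTF-8.
    return $result === null ? '' : $result;
}

echo keep_greek_and_english('βγδεζηΘKgfgebhjrf!@#$%^&'); // βγδεζηΘKgfgebhjrf
```

Swapping \p{Greek} for another script name, such as \p{Cyrillic}, adapts the same pattern to other languages.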