Fuzzy searching of words in an HTML file returning different results


Posted by Anonymous » 15 Dec 2024, 22:24

My script reads files of various types (TXT, CSV, JSON, DOCX, PDF, XLSX, HTML) and searches them for specific words. It works fine for most file types, but for HTML files it returns fewer results than expected. I use BeautifulSoup to extract the text, but the search behavior for HTML files seems inconsistent.
The script uses fuzzy matching so that words such as vipus are matched to virus, and that may be the part that does not work for HTML.
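To make the matching rule concrete, here is a minimal sketch of threshold-based fuzzy matching. It assumes the standard-library difflib.SequenceMatcher as a stand-in for fuzz.ratio, so the scores may differ slightly from what fuzzywuzzy reports; similarity_percent, fuzzy_hits and the word list are hypothetical names used only for illustration.

```python
from difflib import SequenceMatcher


def similarity_percent(left, right):
    """Return a 0-100 similarity score, ignoring case (stand-in for fuzz.ratio)."""
    ratio = SequenceMatcher(None, left.lower(), right.lower()).ratio()
    return round(ratio * 100)


def fuzzy_hits(words, target, minimum_score):
    """Keep the words whose score against target reaches minimum_score."""
    return [word for word in words if similarity_percent(target, word) >= minimum_score]


sample_words = ["virus", "vipus", "ViRus", "fullvirustry", "hello"]
hits = fuzzy_hits(sample_words, "virus", 60)
print(hits)  # ['virus', 'vipus', 'ViRus']
```

With this stand-in, vipus scores 80 against virus, while fullvirustry scores 59 and falls just below a threshold of 60 because the score is taken over the whole word rather than over a substring.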
The content of each file is a list of words like this:

[code]hello
example
virus
help
try
change
world
fullvirustry
ViRus
sp AcE
hy-ph-en
vipus
[/code]
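As a quick check of what the search actually sees, the sketch below runs the sample list through the same normalize_content and word regex that appear in the script in this post. The sample string is simply the list above joined with newlines; nothing else is assumed.

```python
import re


def normalize_content(content):
    # Same two steps as in the script: drop punctuation except "@",
    # then collapse whitespace and convert to lower case.
    content = re.sub(r"[^\w\s@]", "", content)
    return re.sub(r"\s+", " ", content).strip().lower()


sample = "\n".join([
    "hello", "example", "virus", "help", "try", "change", "world",
    "fullvirustry", "ViRus", "sp AcE", "hy-ph-en", "vipus",
])

normalized = normalize_content(sample)
tokens = re.findall(r"\b\w+\b", normalized)

print(tokens)
print(len(tokens))                # 13
print(normalized.count("virus"))  # 3
```

Twelve input lines become 13 tokens: "sp AcE" splits into two words and "hy-ph-en" collapses to "hyphen". The plain substring count finds "virus" three times because it also matches inside "fullvirustry".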

The code that reads the file and searches for the string is as follows:

[code]import os  # To walk directories and files
import re  # To tokenize and clean text
from collections import defaultdict  # Missing keys default to 0

from bs4 import BeautifulSoup  # To extract text from HTML files
from fuzzywuzzy import fuzz  # To score how similar two words are


# Strip punctuation (except "@"), collapse whitespace, convert to lower case.
def normalize_content(content):
    content = re.sub(r"[^\w\s@]", "", content)
    return re.sub(r"\s+", " ", content).strip().lower()


# Handlers for different file types
def handle_txt(file_path):
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            return normalize_content(file.read())
    except (UnicodeDecodeError, FileNotFoundError) as error:
        print(f"Error reading TXT file {file_path}: {str(error)}")
        return None


def handle_html(file_path):
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            all_words = BeautifulSoup(file, "html.parser")
            return normalize_content(all_words.get_text())
    except Exception as error:
        print(f"Error reading HTML file {file_path}: {str(error)}")
        return None


# File handlers keyed by file extension
File_Handlers = {
    ".txt": handle_txt,
    ".html": handle_html,
}


# Read a file with the matching handler; report unsupported extensions.
def read_file(file_path):
    _, ext = os.path.splitext(file_path)
    handler = File_Handlers.get(ext)
    if handler:
        return handler(file_path)
    else:
        print(f"Unsupported file type: {ext}.  Cannot read file: {file_path}.")
        return None


# Count exact (substring) occurrences of the search string.
def search_file(file_content, search_string, case_insensitive):
    if not search_string.strip():
        print("Error: The search string cannot be empty.")
        return 0
    if case_insensitive:
        file_content = normalize_content(file_content)
        search_string = search_string.lower()
    return file_content.count(search_string)


# Count the words whose similarity to the search string reaches matching_score.
def fuzzy_search(file_content, search_string, matching_score):
    matches = 0
    # Find all words in the text
    words = re.findall(r"\b\w+\b", file_content)
    for word in words:
        score = fuzz.ratio(search_string.lower(), word.lower())
        if score >= matching_score:
            matches += 1
            print(f"Match: '{word}' with score {score}%")
    return matches


def search_in_directory(
    directory_path,
    search_string,
    case_insensitive,
    use_fuzzy=False,
    fuzzy_matching_score=60,
):
    found_files = defaultdict(int)
    fuzzy_found_files = defaultdict(int)
    unsupported_files = []
    supported_file_count = 0

    ignored_files = lambda file_name: file_name.startswith(".")

    for root, dirs, files in os.walk(directory_path):
        for file_name in files:
            if ignored_files(file_name):
                continue

            file_path = os.path.join(root, file_name)
            _, ext = os.path.splitext(file_path)

            # Check if file is supported
            if ext not in File_Handlers:
                unsupported_files.append(file_path)
                continue

            # Process supported files
            file_content = read_file(file_path)
            if not file_content:
                continue

            supported_file_count += 1

            # Normal search
            normal_matches = search_file(file_content, search_string, case_insensitive)
            if normal_matches > 0:
                found_files[file_path] += normal_matches

            # Fuzzy search (if enabled)
            if use_fuzzy:
                fuzzy_matches = fuzzy_search(
                    file_content, search_string, fuzzy_matching_score
                )
                if fuzzy_matches > 0:
                    fuzzy_found_files[file_path] += fuzzy_matches

    # Combine results
    all_matches = {**found_files, **fuzzy_found_files}
    if all_matches:
        print(f"\nFound the string '{search_string}' in the following files:")
        for file, count in all_matches.items():
            print(f"{file}: {count} occurrence(s)")
    else:
        print(f"\nThe string '{search_string}' was not found in any file.")

    # Display the summary
    print(f"\nTotal supported files processed: {supported_file_count}")
    print(f"Unsupported files encountered: {len(unsupported_files)}")
    if unsupported_files:
        print("Unsupported files:")
        for file in unsupported_files:
            print(file)
    print(f"Total matches found: {sum(all_matches.values())}")


# Main function: collect the user inputs and run the search.
def main():
    directory_path = input("Enter the directory path: ").strip()
    if not os.path.isdir(directory_path):
        print("Error: Directory does not exist.")
        return

    search_string = input("Enter the string to search for: ").strip()
    if not search_string:
        print("Error: The search string cannot be empty.")
        return

    case_insensitive = (
        input("Case-insensitive search? (yes/no): ").lower().strip() == "yes"
    )
    use_fuzzy = (
        input("Would you like to use fuzzy search? (yes/no): ").lower().strip() == "yes"
    )

    fuzzy_matching_score = 60
    if use_fuzzy:
        while True:
            try:
                fuzzy_matching_score = int(
                    input(
                        "Enter the fuzzy match minimum % (0-100) (default-60%): "
                    ).strip()
                )
                # The post is cut off after "if 0"; everything from this
                # point on is an assumed completion, not the original code.
                if 0 <= fuzzy_matching_score <= 100:
                    break
                print("Please enter a number between 0 and 100.")
            except ValueError:
                print("Invalid input. Please enter a whole number.")

    search_in_directory(
        directory_path,
        search_string,
        case_insensitive,
        use_fuzzy=use_fuzzy,
        fuzzy_matching_score=fuzzy_matching_score,
    )


if __name__ == "__main__":
    main()
[/code]
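The post does not show the HTML markup, so the following is only one possible explanation, sketched under assumptions. If neighbouring elements have no whitespace between them in the source, text extraction that joins the text nodes with an empty string fuses adjacent words into a single token, and the word-by-word fuzzy loop then sees fewer words. The sketch uses the standard-library html.parser as a stand-in for BeautifulSoup; TextCollector, extract_text and the markup string are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every text node of a document in order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_text(markup, separator=""):
    """Join all text nodes of markup with the given separator."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return separator.join(collector.chunks)


# Hypothetical markup: list items written on one line, no whitespace between tags.
markup = "<ul><li>virus</li><li>vipus</li><li>help</li></ul>"

fused = extract_text(markup)                  # "virusvipushelp"
spaced = extract_text(markup, separator=" ")  # "virus vipus help"

print(re.findall(r"\b\w+\b", fused))   # one fused token
print(re.findall(r"\b\w+\b", spaced))  # three separate words
```

BeautifulSoup's get_text() accepts a separator argument, so calling get_text(separator=" ") in handle_html may be worth trying if the HTML files are laid out like this.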

More details here: [url]https://stackoverflow.com/questions/79282944/fuzzy-searching-of-words-in-html-file-returning-different-results[/url]
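A side note on the "Combine results" step of the script: merging the two dictionaries with {**found_files, **fuzzy_found_files} keeps only the fuzzy count for any file that appears in both, which can also make the totals look lower than expected. The sketch below contrasts that with adding the counts per file; the file names and numbers are made up for illustration.

```python
from collections import Counter

# Hypothetical per-file counts from the exact and the fuzzy search.
found_files = {"a.html": 2, "b.txt": 1}
fuzzy_found_files = {"a.html": 3}

# Dictionary unpacking: the later mapping overwrites shared keys.
overwritten = {**found_files, **fuzzy_found_files}

# Counter addition: counts for shared keys are added together.
summed = Counter(found_files) + Counter(fuzzy_found_files)

print(overwritten)   # {'a.html': 3, 'b.txt': 1}
print(dict(summed))  # {'a.html': 5, 'b.txt': 1}
```

Adding the counts is not necessarily the right fix either, since an exact match also scores 100 in the fuzzy pass and would be counted twice; reporting the two counts separately may be clearer.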

