MinHash - Search News

Big Data Analysis for Healthcare Application using Minhash and Machine Learning in Apache Spark Framework

Abstract: Analysing data on a large scale is becoming increasingly important, prompting many researchers to adopt new platforms and tools that can handle large amounts of data. In this article, we ...

IEEE

Improved Consistent Sampling, Weighted Minhash and L1 Sketching

Abstract: We propose a new Consistent Weighted Sampling method, where the probability of drawing identical samples for a pair of inputs is equal to their Jaccard similarity. Our method takes ...

GitHub

vishalsingha/replay-dataset-pipeline

Replay (rehearsal) dataset generation for mitigating catastrophic forgetting during SFT. Instead of mixing public SFT datasets (which are distributionally mismatched), this pipeline reconstructs the ...

MinHash Near-duplicate detection at scale — without comparing every pair

Given a billion web pages, which ones are near-copies of each other? Comparing every pair is O(n²) — at a billion documents that's 5×10¹⁷ comparisons. MinHash collapses this to something tractable.
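As a rough illustration of the idea in this snippet, the sketch below checks the pair count and then estimates Jaccard similarity from fixed-length MinHash signatures. It is a minimal sketch under stated assumptions, not the method of any result listed here: the helper names (`pair_count`, `shingles`, `make_hash_params`, `minhash_signature`, `estimate_jaccard`), the 5-character shingles, and the linear hash family modulo a Mersenne prime are all illustrative choices. The banding (LSH) step that actually avoids comparing every pair is not shown.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus (an illustrative choice)

def pair_count(n):
    """Number of unordered pairs among n documents: n(n-1)/2."""
    return n * (n - 1) // 2

def shingles(text, k=5):
    """Set of overlapping k-character substrings after whitespace normalisation."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def _base_hash(item):
    """Stable 64-bit hash of a string (Python's built-in hash() is salted per run)."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def make_hash_params(num_perm=128, seed=0):
    """Random (a, b) pairs defining hash functions h -> (a*h + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_perm)]

def minhash_signature(items, params):
    """One minimum per hash function; the signature length equals len(params)."""
    hashed = [_base_hash(x) % PRIME for x in items]
    return [min((a * h + b) % PRIME for h in hashed) for a, b in params]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)

if __name__ == "__main__":
    print(pair_count(10**9))  # about 5 x 10^17 pairs for a billion documents
    params = make_hash_params(num_perm=256, seed=1)
    a = shingles("the quick brown fox jumps over the lazy dog")
    b = shingles("the quick brown fox jumped over the lazy dogs")
    sig_a = minhash_signature(a, params)
    sig_b = minhash_signature(b, params)
    print(exact_jaccard(a, b), estimate_jaccard(sig_a, sig_b))
```

Each document is reduced to a signature of fixed length regardless of its size, so comparing two documents costs a number of integer comparisons equal to the signature length. More hash functions give a tighter estimate at the cost of longer signatures.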

GitHub

4-content-minhash.py

movies = pd.read_csv(os.path.join(input_path, 'movies.csv'))
train = pd.read_csv(os.path.join(input_path, 'train_set.csv'))
test = pd.read_csv(os.path.join(input_path ...

Nature

Role of mobile genetic elements in the global dissemination of the carbapenem resistance gene

Mash, a MinHash-based genome distance estimator (ref. 73), was applied with default settings to evaluate pairwise genetic distances between contig sequences and plasmid references. Contig-reference hits with ...

Nature

A genomic surveillance framework and genotyping tool for Klebsiella pneumoniae and its related species complex

Klebsiella pneumoniae is a pathogen of increasing public health concern and antimicrobial resistance is becoming more prevalent. Here, the authors describe a K. pneumoniae genotyping tool, Kleborate, ...

The Lancet

Pan-pathogen deep sequencing of nosocomial bacterial pathogens in Italy in spring 2020: a prospective cohort study

Our study shows that a culture-based deep-sequencing approach is a possible route towards improving future pathogen surveillance and infection control at hospitals. Future studies should be designed ...

Vector Databases: Choosing the Right Index Type

When building RAG pipelines or semantic search systems, the index you pick can make or break your performance. Here's what I mapped out: ...
