ICASSP-2023-Papers

ICASSP 2023 Papers: A complete collection of influential and exciting research papers from the ICASSP 2023 conference. Explore the latest advancements in acoustics, speech and signal processing. Code included. ⭐ the repository to support the advancement of audio and signal processing!

The PDF version of the ICASSP 2023 Conference Programme lists all accepted full papers along with their presentation mode and time.

Other collections of the best AI conferences

Conference	Year
Computer Vision (CV)
CVPR	2023
ICCV	2023
Speech (SP)
INTERSPEECH	2023

Contributors

Contributions to improve the completeness of this list are greatly appreciated. If you come across any overlooked papers, please feel free to create pull requests, open issues or contact me via email. Your participation is crucial to making this repository even better.

Papers

List of sections

Audio for Multimedia and Multimodal Processing
Drone-vs-Bird Detection Grand Challenge at ICASSP23
Human Identification and Face Recognition
Self-Supervised Learning Methods
ASR with Constrained Resource
ASR: Multilingual Speech Recognition
Adaptive Signal Processing
6G Integrated Sensing and Communication (ISAC) from Theory to Practice - A Signal Processing Perspective
Applications to Physiological Signals, Audio, and Speech
Super Resolution
Denoising
Semantic Segmentation
Object Segmentation
Deep Learning for Image and Video Processing
Graph based Learning
Learning from Multimodal Data
Matrix/Tensor Factorization and Completion
ASR - Improve Latency, Efficiency, and Accuracy
ASR: Domain Adaptation and Robust Training
ASR: New Models
ASR: Noise Robustness
Audio Signal Restoration and Editing
Epilepsy Detection Grand Challenge
Deep Learning Theory
Neural Architecture Search
Expressive and Controllable TTS
Keyword Spotting
Detection and Classification
Advances in Signal Processing and Machine Learning for Non-Intrusive Load Monitoring
Machine Learning Applications
Classification
Human Posture Estimation
Human Reconstruction
Face Recognition
Source Separation, ICA, and Sparsity
Neural Sound Synthesis and Representation
Deep Learning for Audio and Music Applications
Machine Learning for Image and Video Processing
ASR: Text Adaptation
ASR: Training Methods
ASR: VAD and Other Topics
Automatic Audio Captioning and Retrieval
Auditory EEG Decoding Challenge
Image Restoration
Interpretable and Explainable Machine Learning
Language Modeling
Language Modeling and Spoken Language Understanding
Estimation Theory and Methods
AI Security and Privacy in Speech and Audio Processing
Binaural Audio; Multichannel Source Separation
Image/Video Caption Generation
Flow Estimation
Image/Video Retrieval
Transfer Learning
Learning Theory and Algorithms
Distributed and Federated Learning
Machine Learning for Telecommunications
Dialog and Multimodal Processing of Language
Discourse and Dialog
Emerging Topics in Speech Synthesis
Audio and Text Segmentation, Tagging and Parsing
Diffusion-based Generative Models for Audio and Speech
Multilingual Alzheimer's Dementia Recognition through Spontaneous Speech: a Signal Processing Grand Challenge
Model Pruning and Compression
Image Recognition and Detection
Machine Learning Methods for Language
Machine Translation and Dialog System
Radar Waveform Design: Recent Advances and New Emerging Applications
Conversational Healthcare Interfaces
Computer Vision Applications
Domain-Specific Detection
Temporal Video Analysis and Detection
Object Detection
Deep Learning for Speech and Audio Processing
Deep Learning for Speech and Language Processing
Language Modeling and Representation Learning
Lightweight TTS and TTS Analysis
Machine Translation for Spoken and Written Language
Music Audio Synthesis and Modeling
Spoken Language Understanding Grand Challenge
Image Segmentation
Multi-Speaker ASR
Multimodal Processing of Language and Language Systems
Tracking
Radar-Assisted Perception (RAP)
Data Driven and Machine Learning based Room Acoustic Modeling
Sensing Applications
Computational Imaging
Anomaly Detection
Deep Neural Network
Deep Learning
Deep and Sequential Learning
Machine Learning for Time Series Analysis
Multilingual Speech Recognition and Identification
Quantum Computing for Machine Learning and Signal Processing
Sound Event Detection
Brain Connectivity
Speech Signal Improvement Signal Processing Grand Challenge 2023
Anonymization and Data Privacy
Natural Language Processing
Pronunciation and Fluency Assessment
Edge Learning for Emerging Wireless Technologies
Acoustic Sensor Array Processing and Sound Source Localization
Representation Learning
Adversarial Machine Learning
Target Detection and Classification
Spatial Processing for Audio and Speech
Brain Computer Interfaces
Acoustic Echo Cancellation Signal Processing Grand Challenge 2023
DoA Estimation
Speaker Recognition: Scoring, Fairness, Privacy
Speaker Recognition: Verification, Diarization, Anti-Spoofing
Recent Advances in Robust Learning for Modern Computational Imaging
Signal Processing and Machine Learning for Networked Autonomous Agents
Active Noise Control, Echo Reduction and Feedback Reduction
Anomaly Detection and Representation Learning for Audio Classification
Data Processing
Perceptual Assessment
Machine Learning for Recommendation, Search and other Applications
Reinforcement Learning
Pattern Recognition and Classification
Sparsity, Compressed Sensing, and Tensor Decomposition
Adversarial Machine Learning and Information Theoretic Security
Resource Constrained ASR
Singing Voice Synthesis/Conversion and Pretrained TTS
Medical Image Reconstruction
L3DAS23: Learning 3D Audio Sources for Audio-Visual Extended Reality
Multimedia Forensics
MIMO Radars and Waveform Design
Speech Dysarthria
Speech Emotion Recognition: General Topics
Intelligent and Semantic Communications for 5G Mobile Networks and Beyond
Audio and Speech Quality Measurements
Acoustic Modeling; Auditory Modeling for Hearing Instruments
Anonymization, Data Privacy, and Biometrics
Object Recognition
Identification Detection
Tracking, Data Fusion, and Sensor Networks
Speaker Recognition: Neural Network Architecture
Speech Analysis
Speaker Recognition: Anti-Spoofing and Verification
Bayesian Signal Processing
Speaker Recognition: Verification and Diarization
Learning on Graphs for Biology and Medicine
Learning from Neuroimaging Data
Lightweight, Multi-Speaker, Multi-Lingual Indic Text-to-Speech
Quality Assessment and Anomaly Detection
Human-Centric Multimedia and Human-Machine Interaction
Speech Emotion Recognition: Transfer Learning
Multi-Antenna Communications and Sensing
Quantum Machine Learning Algorithms and Applications on NISQ Devices
Neural Speech and Audio Coding: Emerging Challenges and Opportunities
Medical and Environmental Acoustics; Audio Security
Classification of Acoustic Scenes and Events
Learning from EEG Data
Physiological Signal Processing
Speech Production, Perception, and Psychoacoustics
Watermarking, Data Hiding and Human Factors in Security
3D Point Cloud/Stereo Video
Face Processing
MIMO Radars and MIMO Communications
Speaker Recognition: Diarization
Estimation, Detection, and Classification
Model Lightweight and Video Compression
Subspace and Manifold Learning
Speech Enhancement - Diffusion and Other Generative Models
ICASSP2023 General Meeting Understanding and Generation (MUG) Challenge
Signal Processing for Smart City Applications and the Internet of Things
Symbol-Level Precoding: Recent Advance and New Applications in 6G and Beyond
Graphical Inference and Modeling in Dynamical Systems
Deep Learning-based Source Separation
Medical Image Segmentation
Bioinformatics
Cybersecurity, Hardware and Network Security
Multi-Antenna Communications and Intelligent Reflecting Surfaces
Multimedia Compression and Quality
Multimedia Analysis, Synthesis, and Learning
DoA Estimation and Beamforming
Speech Emotion Recognition: Multimodality
Speech Emotion Recognition: Neural Architectures
Optimization Methods for Signal Processing
5th DNS Challenge at IEEE ICASSP 2023
Signal Processing and Learning over Dynamic Graphs
Human Action Recognition
Deep Generative Model
Multimodal Signal Processing and Analysis
Speech Enhancement - Self-Supervised Learning
Distributed and Reliable Signal Processing and Communications
Resource-Efficient Real-time Neural Speech Separation
Multichannel Speech Enhancement, Dereverberation, and System Identification
Multilabel Acoustic Event Classification
Deep Learning for Medical Imaging
Machine/Deep Learning Methodologies for Multimedia
Human-Centric Multimedia
Source Localization and Separation
Speech Enhancement / Audio-Visual, Multi-Channel, and Other
Speech Enhancement - Separation and Target Speech Extraction
Speech Enhancement - Single Channel
Machine Learning Applications to Communications
Aspects in Image Generation/Analysis
Multi-Antenna and Multi-Carrier Communications
Signal Filtering, Restoration, Enhancement, and Reconstruction
ICASSP SP Clarity Challenge: Speech Enhancement for Hearing Aids
Image and Video Enhancement
Speech Recognition-training/adaptation
Decentralized Wireless Systems and Energy Harvesting
Robust Learning and Inference
Music Classification and Transcription
Music Information Retrieval
Deep Learning for Medical Image Segmentation
Detection and Classification in Medical Imaging
Image Coding/Compression
Audio-Visual Signal Processing and Analysis
Various Aspects in Speech and Language Processing
Speech Recognition: Modeling and Context
Speech Recognition: Self-Supervised Models
Channel State Estimation
Signal Processing over Graphs and Networks
Signal Processing over Networks
Applications to Vision, Speech, and Robotics
Person Identification and Relapse Detection from Continuous Recordings of Biosignals
Vision and Language Model
TTS: AM and Vocoder
Signal Processing Education
Signal Processing and Systems for Remote Biometrics
Signal Processing for RIS-Enabled Smart Wireless Environments
Multimodal Learning
Video Coding/Compression
Object Tracking
Image Generation
Spoken Language Understanding
Optimization and Machine Learning for Communications
Sparse/Low-Dimensional Signal Processing
Signal Processing Theory and Methods
Radar/Array Signal Processing. Networks and Communications
Applications to Communications
The First Pathloss Radio Map Prediction Challenge
Human Video Generation and Editing
Point Cloud Processing
Multimedia Databases and Information Retrieval
Voice and Style Conversion
Synergy between Human and Machine Approaches to Sound/Scene Recognition and Processing
Topological and Simplicial Data Processing
Unsupervised Deep Learning of Image Priors for Inverse Problems
Self-Supervised Learning and Data-Efficiency for Speech and Audio
Sound Event Detection and Localization; Bioacoustic Event Detection
Aspects in Machine Learning
Aspects in Image/Video Processing and Analysis
Learning Algorithms and Applications
Optimization Methods in Machine Learning
Applications of Machine Learning
Sensing, Computing, and Semantic Communications
Sparsity and Low-Rank Models
Signal Processing over Graphs
Target Source Extraction
Music Generation and Arrangement
Multimodal Information based Speech Processing (MISP) 2022 Challenge
Image Retrieval and Classification
Variational Inference and Approximate Bayesian Techniques
Spatial Audio Recording and Reproduction
Speech Modeling and Audio Coding
Audio Processing and Analysis
Image/Video Enhancement
Zero or Few-Shot Learning
Acoustic and Microphone Array Processing
Speech and Language Disorders
Various Aspects in Speech and Speaker Recognition
Sampling Theory, Compressed and Non-uniform Sampling
Show and Tell Demos: Session
Rising Stars Workshop

Audio for Multimedia and Multimodal Processing

🆔	Title	Repo
647	Diverse and Vivid Sound Generation from Text Descriptions
2248	EPIC-SOUNDS: A Large-Scale Dataset of Actions that Sound
784	I See What You Hear: A Vision-inspired Method to Localize Words	➖
6119	Incorporating Lip Features Into Audio-Visual Multi-Speaker DOA Estimation by Gated Fusion	➖
6787	UAVM: Towards Unifying Audio and Visual Models (SPS Journal Paper)

Drone-vs-Bird Detection Grand Challenge at ICASSP23

🆔	Title	Repo
6834	High-Speed Drone Detection based on Yolo-v8	➖
6863	S-Feature Pyramid Network and Attention Model for Drone Detection	➖
6881	Drone-vs-Bird: Drone Detection using Yolov7 with CSRT Tracker	➖

Human Identification and Face Recognition

🆔	Title	Repo
530	EMCLR: Expectation Maximization Contrastive Learning Representations	➖
711	Boosting Person Re-Identification with Viewpoint Contrastive Learning and Adversarial Training	➖
812	Top-K Visual Tokens Transformer: Selecting Tokens for Visible-infrared Person Re-Identification	➖
2531	Frequency-aware Attentional Feature Fusion for Deepfake Detection	➖
5309	Recursive Joint Attention for Audio-Visual Fusion in Regression based Emotion Recognition
3475	Multi-Stream Facial Adaptive Network for Expression Recognition from a Single Image

Self-Supervised Learning Methods

🆔	Title	Repo
429	PointACL: Adversarial Contrastive Learning for Robust Point Clouds Representation under Adversarial Attack
2579	Enhancing Representation Learning with Deep Classifiers in Presence of Shortcut
730	K²NN: Self-Supervised Learning with Hierarchical Nearest Neighbors for Remote Sensing	➖
4453	TriNet: Stabilizing Self-Supervised Learning from Complete or Slow Collapse
1629	On Minimal Variations for Unsupervised Representation Learning	➖
740	Adaptive Data Augmentation for Contrastive Learning	➖

ASR with Constrained Resource

🆔	Title	Repo
690	De'HuBERT: Disentangling Noise in a Self-Supervised Model for Robust Speech Recognition	➖
1948	Masked Token Similarity Transfer for Compressing Transformer-based ASR Models	➖
2888	Unsupervised Fine-Tuning Data Selection for ASR using Self-Supervised Speech Models	➖
3250	CB-Conformer: Contextual Biasing Conformer for Biased Word Recognition	➖
3712	Context-aware Fine-Tuning of Self-Supervised Speech Models	➖
6449	Data2vec-Aqc: Search for the Right Teaching Assistant in the Teacher-Student Training Setup

ASR: Multilingual Speech Recognition

🆔	Title	Repo
2417	Hierarchical Softmax for End-to-End Low-Resource Multilingual Speech Recognition
4510	Improving Massively Multilingual ASR With Auxiliary CTC Objectives
4777	Massively Multilingual Shallow Fusion with Large Language Models	➖
5465	UML: A Universal Monolingual Output Layer for Multilingual ASR	➖
5744	Investigation Into Phone-based Subword Units for Multilingual End-to-End Speech Recognition	➖
6221	Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities	➖

Adaptive Signal Processing

🆔	Title	Repo
1224	A Compensated Shrinkage Affine Projection Algorithm for Debiased Sparse Adaptive Filtering	➖
1761	Dynamic Selection of p-Norm in Linear Adaptive Filtering via Online Kernel-based Reinforcement Learning	➖
2511	Neural Network Models with Integrated Training and Adaptation for Nonlinear Acoustic System Identification	➖
3895	Neural Mode Estimation	➖
5352	Adaptive ECCM for Mitigating Smart Jammers	➖
6529	Differentiable Adaptive Short-Time Fourier Transform with Respect to the Window Length	➖

6G Integrated Sensing and Communication (ISAC) from Theory to Practice - A Signal Processing Perspective

🆔	Title	Repo
3049	6G Integrated Sensing and Communication - Sensing Assisted Environmental Reconstruction and Communication	➖
3325	Neurally Augmented State Space Model for Simultaneous Communication and Tracking with Low Complexity Receivers	➖
3456	Multi-View Millimeter-Wave Imaging Over Wireless Cellular Network	➖
3803	Joint Data Association, NLOS Mitigation, and Clutter Suppression for Networked Device-Free Sensing in 6G Cellular Network	➖
4255	Integrating the Sensing and Radio Communications Channel Modelling from Radar Mutual Interference	➖
5326	Active Beam Tracking with Reconfigurable Intelligent Surface	➖

Applications to Physiological Signals, Audio, and Speech

🆔	Title	Repo
5872	ClassA Entropy for the Analysis of Structural Complexity of Physiological Signals	➖
1034	Unobtrusive Respiratory Monitoring System for Intensive Care	➖
4381	Improved WiFi-based Respiration Tracking via Contrast Enhancement	➖
4851	Joint Angle and Respiration Estimation for Passive and Device-Free Respiration Monitoring	➖
3418	Implementing Continuous HRTF Measurement in Near-Field	➖
5094	SeliNet: A Lightweight Model for Single Channel Speech Separation	➖
5196	Adaptive Time-Scale Modification for Improving Speech Intelligibility based on Phoneme Clustering for Streaming Services	➖
3109	Cutting through the Noise: An Empirical Comparison of Psychoacoustic and Envelope-based Features for Machinery Fault Detection	➖
4835	Cochlear Decomposition: A Novel Bio-Inspired Multiscale Analysis Framework	➖
2458	Design and Performance of the Low-Power Noise Reduction Algorithm of the Med-EL Sonnet 2™ Cochlear Implant Audio Processor	➖
6491	Modulo EEG Signal Recovery using Transformers	➖
454	Knowledge-Graph Augmented Music Representation for Genre Classification	➖

Super Resolution

🆔	Title	Repo
275	PFT-SSR: Parallax Fusion Transformer for Stereo Image Super-Resolution
326	Raising the Limit of Image Rescaling using Auxiliary Encoding	➖
1431	Kernel Estimation and Deconvolution for Blind Image Super-Resolution	➖
1555	A Comprehensive Comparison of Projections in Omnidirectional Super-Resolution	➖
1900	Long-Short Attention Network for the Spectral Super-Resolution of Multispectral Images
2363	Multi-Level Fusion for Burst Super-Resolution with Deep Permutation-Invariant Conditioning	➖
2684	Frequency Reciprocal Action and Fusion for Single Image Super-Resolution	➖
2777	FCIR: Rethink Aerial Image Super Resolution with Fourier Analysis
2962	A Content-based Multi-Scale Network for Single Image Super-Resolution	➖
3053	Learning to Explain: A Gradient-based Attribution Method for Interpreting Super-Resolution Networks	➖
3140	CNN Filter for RPR-based SR in VVC with Wavelet Decomposition	➖
3555	Local to Global Prior Learning for Blind Unsupervised Image Super-Resolution	➖

Denoising

🆔	Title	Repo
5974	Rain2Avoid: Self-Supervised Single Image Deraining	➖
5479	A Progressive Image Dehazing Framework with Inter and Intra Contrastive Learning	➖
5267	Graph-based Point Cloud Color Denoising with 3-Dimensional Patch-based Similarity	➖
2310	CAENet: using Collaborative Attention Transformer and Add-Boost Strategy for Single Image Deraining	➖
1791	SFEMGN: Image Denoising with Shallow Feature Enhancement Network and Multi-Scale ConvGRU	➖
1554	Affinity Learning with Blind-Spot Self-Supervision for Image Denoising	➖
1473	SAR Image Despeckling with Residual-in-Residual Dense Generative Adversarial Network	➖
1211	Uncer2Natural: Uncertainty-aware Unsupervised Image Denoising	➖
553	HPFTN: Hierarchical Progressive Fusion Transformer Network for Video Denoising	➖
398	Subspace Modeling enabled High-Sensitivity X-Ray Chemical Imaging	➖
274	MSP-Former: Multi-Scale Projection Transformer for Single Image Desnowing	➖
117	Hyperspectral Image Denoising via Nonlocal Rank Residual Modeling

Semantic Segmentation

🆔	Title	Repo
190	LoG-CAN: Local-Global Class-aware Network for Semantic Segmentation of Remote Sensing Images
406	WUDA: Unsupervised Domain Adaptation based on Weak Source Domain Labels
555	Class-aware Contextual Information for Semantic Segmentation	➖
1132	Semi-Supervised Semantic Segmentation with Structured Output Space Adaption	➖
1170	PRRD: Pixel-Region Relation Distillation for Efficient Semantic Segmentation	➖
2521	Spatial Correlation Fusion Network for Few-Shot Segmentation	➖
3306	Exploring Vision Transformer Layer Choosing for Semantic Segmentation	➖
3941	Joint Training of Hierarchical GANs and Semantic Segmentation for Expression Translation	➖
6357	Progressive Refinement Learning based on Feature Cross Perception for Residential Areas Semantic Segmentation	➖
1599	Lightweight Portrait Segmentation via Edge-optimized Attention
3857	A Dynamic Cross-Scale Transformer with Dual-Compound Representation for 3D Medical Image Segmentation	➖
3793	LABANet: Lead-Assisting Backbone Attention Network for Oral Multi-Pathology Segmentation	➖

Object Segmentation

🆔	Title	Repo
3473	Robust Video Object Segmentation with Restricted Attention	➖
3501	Stacking-based Attention Temporal Convolutional Network for Action Segmentation	➖
2436	VLKP: Video Instance Segmentation with Visual-Linguistic Knowledge Prompts	➖
4867	Automatic Error Detection in Integrated Circuits Image Segmentation: A Data-Driven Approach	➖
3745	TransWnet: Integrating Transformers Into CNNs via Row and Column Attention for Abdominal Multi-Organ Segmentation	➖
5844	Active Perception System for Enhanced Visual Signal Recovery using Deep Reinforcement Learning	➖
302	OAFormer: Learning Occlusion Distinguishable Feature for Amodal Instance Segmentation	➖
698	Encoder-Decoder Graph Convolutional Network for Automatic Timed-Up-and-Go and Sit-to-Stand Segmentation	➖
758	Meta++ Network for Few-Shot Aerospace Crack Segmentation	➖
1764	IAST: Instance Association Relying on Spatio-Temporal Features for Video Instance Segmentation
2469	Continual Cell Instance Segmentation of Microscopy Images	➖

Deep Learning for Image and Video Processing

🆔	Title	Repo
5397	Spammer Detection on Short Video Applications: A New Challenge and Baselines	➖
814	Weakly- and Semi-Supervised Object Localization	➖
2503	Balanced Mixup Loss for Long-Tailed Visual Recognition	➖
4130	On Cross-Layer Alignment for Model Fusion of Heterogeneous Neural Networks	➖
2813	Invariant Adversarial Imitation Learning from Visual Inputs	➖
6423	SPECTRANET-SO(3): Learning Satellite Orientation from Optical Spectra by Implicitly Modeling Mutually Exclusive Probability Distributions on the Rotation Manifold	➖
3097	Structured-Anchor Projected Clustering for Hyperspectral Images	➖
140	Learning Sparse Auto-Encoders for Green AI Image Coding	➖
643	Learning to Generate 3D Representations of Building Roofs using Single-View Aerial Imagery	➖
4843	Robust Monocular Localization of Drones by Adapting Domain Maps to Depth Prediction Inaccuracies	➖
5940	Large Dimensional Analysis of LS-SVM Transfer Learning: Application to PolSAR Classification	➖
5062	SMUG: Towards Robust MRI Reconstruction by Smoothed Unrolling

Graph based Learning

🆔	Title	Repo
715	Graph-Graph Context Dependency Attention for Graph Edit Distance	➖
3882	Topology Uncertainty Modeling for Imbalanced Node Classification on Graphs	➖
589	CPD-GAN: Cascaded Pyramid Deformation GAN for Pose Transfer	➖
5321	Space-Time Graph Neural Networks with Stochastic Graph Perturbations	➖
6793	Untrained Graph Neural Networks for Denoising	➖
5846	Learning on Graphs under Label Noise	➖
2906	Select the Best: Enhancing Graph Representation with Adaptive Negative Sample Selection	➖
2586	Learning with Multigraph Convolutional Filters	➖
2164	Self-Supervised Guided Hypergraph Feature Propagation for Semi-Supervised Classification with Missing Node Features	➖
3752	Incorporating Reliability in Graph Information Propagation by Fluid Dynamics Diffusion: a Case of Multimodal Semi-Supervised Deep Learning	➖
5159	GraphMAD: Graph Mixup for Data Augmentation using Data-Driven Convex Clustering
3724	Time-Varying Signals Recovery via Graph Neural Networks	➖

Learning from Multimodal Data

🆔	Title	Repo
3546	Multimodal Knowledge Distillation for Arbitrary-Oriented Object Detection in Aerial Images	➖
1234	Hierarchical Spatial-Temporal Transformer with Motion Trajectory for Individual Action and Group Activity Recognition	➖
693	Autonomous Soundscape Augmentation with Multimodal Fusion of Visual and Participant-Linked Inputs	➖
1571	Towards Robust Audio-based Vehicle Detection via Importance-Aware Audio-Visual Learning	➖
841	Hierarchical Multi-Task Learning for Fabric Component Analysis Based on NIR Spectral Signals	➖
1706	Cross Modality Knowledge Distillation for Robust Pedestrian Detection in Low Light and Adverse Weather Conditions	➖
6375	Data Leakage in Cross-Modal Retrieval Training: A Case Study	➖
5825	Difficulty-Aware Data Augmentor for Scene Text Recognition	➖
461	TinyOOD: Effective Out-of-Distribution Detection for TinyML	➖
4211	A Principled Approach to Model Validation in Domain Generalization
4220	Scale-Adaptive Tiny Object Detection Enhanced by Across-Scale and Shape-Preserved Semantic Location	➖
3735	Audio-Visual Inpainting: Reconstructing Missing Visual Information with Sound	➖

Matrix/Tensor Factorization and Completion

🆔	Title	Repo
507	Learn Topological Representation with Flexible Manifold Layer
1438	Tensorized LSSVMs for Multitask Regression	➖
3571	A Bayesian Perspective for Determinant Minimization based Robust Structured Matrix Factorization	➖
5045	Volume-Regularized Nonnegative Tucker Decomposition with Identifiability Guarantees	➖
687	Transductive Matrix Completion with Calibration for Multi-Task Learning	➖
1668	Projected Hierarchical ALS for Generalized Boolean Matrix Factorization	➖
2934	Robust Binary Component Decompositions	➖
3897	Multi-Resolution Convolutional Dictionary Learning for Riverbed Dynamics Modeling	➖
2388	PARAFAC2-based Coupled Matrix and Tensor Factorizations
6088	Deep Plug-and-Play for Tensor Robust Principal Component Analysis	➖
6125	Geometric Matrix Completion with Collaborative Routing between Capsules	➖
3256	Enrollment Rate Prediction in Clinical Trials based on CDF Sketching and Tensor Factorization Tools	➖

ASR - Improve Latency, Efficiency, and Accuracy

🆔	Title	Repo
900	Multi-blank Transducers for Speech Recognition
1642	Diagonal State Space Augmented Transformers for Speech Recognition	➖
1661	TrimTail: Low-Latency Streaming ASR with Simple but Effective Spectrogram-Level Length Penalty	➖
3385	Towards Accurate and Real-Time End-of-Speech Estimation	➖
3999	Peak-First CTC: Reducing the Peak Latency of CTC Models by Applying Peak-First Regularization	➖
4330	Evaluating Parameter-Efficient Transfer Learning Approaches on SURE Benchmark for Speech Understanding
5058	Powerful and Extensible WFST Framework for RNN-Transducer Losses	➖
5337	Predicting Multi-Codebook Vector Quantization Indexes for Knowledge Distillation	➖
5434	Improving Non-Autoregressive Speech Recognition with Autoregressive Pretraining	➖
5558	Conversation-Oriented ASR with Multi-Look-Ahead CBS Architecture	➖
5607	Using Adapters to Overcome Catastrophic Forgetting in End-to-End Automatic Speech Recognition	➖
5824	Fast and Parallel Decoding for Transducer

ASR: Domain Adaptation and Robust Training

🆔	Title	Repo
505	SAN: A Robust End-to-End ASR Model Architecture	➖
1604	Explanations for Automatic Speech Recognition	➖
1674	On-the-Fly Text Retrieval for End-to-End ASR Adaptation	➖
2397	Unsupervised Model-based Speaker Adaptation of End-To-End Lattice-Free MMI Model for Speech Recognition	➖
3258	Domain Adaptation with External Off-Policy Acoustic Catalogs for Scalable Contextual End-To-End Automated Speech Recognition	➖
3600	Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR	➖
3973	WeavSpeech: Data Augmentation Strategy for Automatic Speech Recognition via Semantic-aware Weaving	➖
4139	Joint Discriminator and Transfer based Fast Domain Adaptation for End-to-End Speech Recognition	➖
5424	Improving Fairness and Robustness in End-to-End Speech Recognition Through Unsupervised Clustering	➖
5491	Improving Fast-Slow Encoder based Transducer with Streaming Deliberation	➖
5496	Dynamic Alignment Mask CTC: Improved Mask-CTC with Aligned Cross Entropy	➖
5902	Improving Accented Speech Recognition with Multi-Domain Training	➖

ASR: New Models

🆔	Title	Repo
179	UCONV-Conformer: High Reduction of Input Sequence Length for End-to-End Speech Recognition
876	A Comparison of Semi-Supervised Learning Techniques for Streaming ASR at Scale	➖
1356	Improving Contextual Biasing with Text Injection	➖
1655	Structured State Space Decoder for Speech Recognition and Synthesis	➖
3365	JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition	➖
3368	Variable Attention Masking for Configurable Transformer Transducer Speech Recognition	➖
3499	Factorized Blank Thresholding for Improved Runtime Efficiency of Neural Transducers	➖
3926	Fast-U2++: Fast and Accurate End-to-End Speech Recognition in Joint CTC/Attention Frames	➖
4365	Understanding Shared Speech-Text Representations	➖
4534	Front-End Adapter: Adapting Front-End Input of Speech based Self-Supervised Learning for Speech Recognition	➖
2237	Lego-Features: Exporting Modular Encoder Features for Streaming and Deliberation ASR	➖
5384	Modular Conformer Training for Flexible End-to-End ASR	➖

ASR: Noise Robustness

🆔	Title	Repo
1897	On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker Speech Recognition Systems
1919	Gradient Remedy for Multi-Task Learning in End-to-End Noise-Robust Speech Recognition
1929	MADI: Inter-Domain Matching and Intra-Domain Discrimination for Cross-Domain Speech Recognition	➖
1971	Robust Data2vec: Noise-Robust Speech Representation Learning for ASR by Combining Regression and Improved Contrastive Learning	➖
2040	Robust Audio-Visual ASR with Unified Cross-Modal Attention	➖
3292	HuBERT-AGG: Aggregated Representation Distillation of Hidden-Unit BERT for Robust Speech Recognition	➖
4124	Speech and Noise Dual-Stream Spectrogram Refine Network with Speech Distortion Loss for Robust Speech Recognition
4680	RobustDistiller: Compressing Universal Speech Representations for Enhanced Environment Robustness	➖
5455	Cleanformer: A Multichannel Array Configuration-Invariant Neural Enhancement Frontend for ASR in Smart Speakers	➖
5504	On the Effectiveness of Monoaural Target Source Extraction for Distant End-to-End Automatic Speech Recognition	➖
6389	Noise-aware Target Extension with Self-Distillation for Robust Speech Recognition	➖

Audio Signal Restoration and Editing

🆔	Title	Repo
5003	AERO: Audio Super Resolution in the Spectral Domain
1768	UPGLADE: Unplugged Plug-and-Play Audio Declipper based on Consensus Equilibrium of DNN and Sparse Optimization	➖
2121	Improving Performance of Real-Time Full-Band Blind Packet-Loss Concealment with Predictive Network
4388	Faster than Fast: Accelerating the Griffin-Lim Algorithm	➖
3726	Improving Phase-Vocoder-based Time Stretching by Time-Directional Spectrogram Squeezing
6288	Extreme Audio Time Stretching using Neural Synthesis	➖

Epilepsy Detection Grand Challenge

🆔	Title	Repo
7015	Lightweight Machine Learning for Seizure Detection on Wearable Devices	➖
7021	Pretrained Transformers for Seizure Detection	➖
7022	Towards Interpretable Seizure Detection using Wearables	➖
7033	Optimization of the Deep Neural Networks for Seizure Detection	➖

Deep Learning Theory

🆔	Title	Repo
2465	MSFormer: Multi-Scale Transformer with Neighborhood Consensus for Feature Matching	➖
3498	Decoupled Visual Causality for Robust Detection	➖
2500	Semantics-Disentangled Contrastive Embedding for Generalized Zero-Shot Learning	➖
4730	Dynamic Scalable Self-Attention Ensemble for Task-Free Continual Learning	➖
2125	Ultimate Negative Sampling for Contrastive Learning	➖
3936	An Application of Quantum Mechanics to Attention Methods in Computer Vision	➖

Neural Architecture Search

🆔	Title	Repo
3492	Search for Efficient Deep Visual-Inertial Odometry Through Neural Architecture Search
4072	Receptive Field Reliant Zero-Cost Proxies for Neural Architecture Search	➖
4346	ZO-DARTS: Differentiable Architecture Search with Zeroth-Order Approximation	➖
2675	Performing Neural Architecture Search without Gradients
796	Neural Architecture of Speech	➖
1461	BHE-DARTS: Bilevel Optimization based on Hypergradient Estimation for Differentiable Architecture Search	➖

Expressive and Controllable TTS

🆔	Title	Repo
2625	Improving Speech Prosody of Audiobook Text-to-Speech Synthesis with Acoustic and Textual Contexts
4768	Context-Aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis
4776	Ensemble Prosody Prediction for Expressive Speech Synthesis
5782	Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features
5970	High-Acoustic Fidelity Text to Speech Synthesis with Fine-Grained Control of Speech Attributes	➖
6203	Embedding a Differentiable Mel-Cepstral Synthesis Filter to a Neural Speech Synthesis System

Keyword Spotting

🆔	Title	Repo
1848	Disentangled Training with Adversarial Examples for Robust Small-Footprint Keyword Spotting	➖
3578	Dual-Attention Neural Transducers for Efficient Wake Word Spotting in Speech Recognition	➖
5025	Fixed-Point Quantization Aware Training for On-Device Keyword-Spotting	➖
5106	To Wake-Up or Not to Wake-Up: Reducing Keyword False Alarm by Successive Refinement	➖
5584	Transcription Free Filler Word Detection with Neural Semi-CRFs
6078	The DKU Post-Challenge Audio-Visual Wake Word Spotting System for the 2021 MISP Challenge: Deep Analysis

Detection and Classification

🆔	Title	Repo
657	Passive Detection of Rank-One Gaussian Signals for Known Channel Subspaces and Arbitrary Noise	➖
2389	False Alarm Regulation for Off-Grid Target Detection with the Matched Filter	➖
2536	Data-Driven Quickest Change Detection in Markov Models	➖
3510	Quickest Change Detection with Leave-one-Out Density Estimation	➖
4778	Identifying Coordination in a Cognitive Radar Network - A Multi-Objective Inverse Reinforcement Learning Approach
4815	Improved Small Sample Hypothesis Testing using the Uncertain Likelihood Ratio	➖

Advances in Signal Processing and Machine Learning for Non-Intrusive Load Monitoring

🆔	Title	Repo
2170	A Wavelet Scattering Approach for Load Identification with Limited Amount of Training Data	➖
2653	Applying Symmetrical Component Transform for Industrial Appliance Classification in Non-Intrusive Load Monitoring	➖
3326	ContiNILM: A Continual Learning Scheme for Non-Intrusive Load Monitoring	➖
5853	Improving Knowledge Distillation for Non-Intrusive Load Monitoring through Explainability Guided Learning	➖
6414	Improved Appliance Transient Feature Extraction via Template Matching	➖

Machine Learning Applications

🆔	Title	Repo
6355	Causal Discovery and Causal Inference based Counterfactual Fairness in Machine Learning	➖
4965	Benchmarking Convolutional Neural Network Inference on Low-Power Edge Devices	➖
1115	Code-Enhanced Fine-Grained Semantic Matching for Tag Recommendation in Software Information Sites	➖
394	Robust Dominant Periodicity Detection for Time Series with Missing Data	➖
3994	Dynamic Split Computing for Efficient Deep Edge Intelligence
5723	Dense Adversarial Transfer Learning based on Class-Invariance	➖
4620	VAN-ICP: GPU-Accelerated Approximate Nearest Neighbor Search for ICP Registration via Voxel Dilation
5776	Clustering-based Supervised Contrastive Learning for Identifying Risk Items on Heterogeneous Graph	➖
4052	Multiresolution Signal Processing of Financial Market Objects	➖
1752	Hierarchical Multi-Agent Reinforcement Learning with Intrinsic Reward Rectification	➖
3493	An Antispoofing Approach in Biometric Authentication System for a Smartcard	➖
3576	Unsupervised Domain Adaptation via Subspace Interpolating Deep Dictionary Learning: A Case Study in Machine Inspection	➖

Classification

🆔	Title	Repo
283	Multi-Modal Domain Generalization for Cross-Scene Hyperspectral Image Classification	➖
1056	Hierarchical Transformer for Multi-Label Trailer Genre Classification	➖
1236	S3I-PointHop: SO(3)-Invariant PointHop for 3D Point Cloud Classification	➖
1302	Sample-Aware Knowledge Distillation for Long-Tailed Learning	➖
1562	Laryngeal Leukoplakia Classification via Dense Multiscale Feature Extraction in White Light Endoscopy Images	➖
1904	Long-Tailed Recognition with Causal Invariant Transformation	➖
2199	STACKMAPS: A Visualization Technique for Diabetic Retinopathy Grading	➖
2904	Gender-Cartoon: Image Cartoonization Method based on Gender Classification	➖
3167	Extracting the Brain-Like Representation by an Improved Self-Organizing Map for Image Classification
3888	DDN: Dynamic Aggregation Enhanced Dual-Stream Network for Medical Image Classification	➖
4696	LGViT: Local-Global Vision Transformer for Breast Cancer Histopathological Image Classification	➖
5583	Learning a Weight Map for Weakly-Supervised Localization	➖

Human Posture Estimation

🆔	Title	Repo
301	Interweaved Graph and Attention Network for 3D Human Pose Estimation
3696	Learning 3D Human Pose and Shape Estimation using Uncertainty-Aware Body Part Segmentation	➖
3841	Monocular 3D Human Pose Estimation based on Global Temporal-Attentive and Joints-Attention in Video
4380	EVOPOSE: A Recursive Transformer for 3D Human Pose Estimation with Kinematic Structure Priors	➖
142	HTNet: Human Topology Aware Network for 3D Human Pose Estimation
1107	Improving Occluded Human Pose Estimation via Linked Joints	➖
5121	Efficient and Effective Multi-Camera Pose Estimation with Weighted M-Estimate Sample Consensus	➖
5668	AMPose: Alternately Mixed Global-Local Attention Model for 3D Human Pose Estimation
5750	FlowPose: Conditional Normalizing Flows for 3D Human Pose and Shape Estimation from Monocular Videos	➖
6050	Animal Re-Identification Algorithm for Posture Diversity
6322	Retrieval-based Natural 3D Human Motion Generation	➖
2453	Human Pose Estimation from Ambiguous Pressure Recordings with Spatio-Temporal Masked Transformers	➖

Human Reconstruction

🆔	Title	Repo
4237	Time-Frequency Awareness Network for Human Mesh Recovery from Videos
2028	Diffusion Motion: Generate Text-Guided 3D Human Motion by Diffusion Model	➖
4667	GATOR: Graph-Aware Transformer with Motion-Disentangled Regression for Human Mesh Recovery from a 2D Pose
5538	Real-Time Human Reconstruction based on Human Pose Prior and Epipolar Refinement	➖
642	Efficient Feature Fusion for Learning-based Photometric Stereo	➖
2442	Volumetric 3D Reconstruction with Window-Wise Global Feature Aggregation	➖
4008	Stereoscopic Video Retargeting based on Camera Motion Classification	➖
4893	Detail-Aware Uncalibrated Photometric Stereo	➖
5712	SDRNet: Shape Decoupled Regression Network for 3D Face Reconstruction	➖
1119	Binary Image Fast Perfect Recovery from Sparse 2D-DFT Coefficients	➖
1175	HQP-MVS: High-Quality Plane Priors Assisted Multi-View Stereo for Low-Textured Areas	➖
3183	Dynamic Multi-View Scene Reconstruction using Neural Implicit Surface	➖

Face Recognition

🆔	Title	Repo
3959	LOGO-Former: Local-Global Spatio-Temporal Transformer for Dynamic Facial Expression Recognition	➖
4254	Quaternion Orthogonal Transformer for Facial Expression Recognition in the Wild
3490	Privacy Preserving Face Recognition with Lensless Camera	➖
3649	MaskDUL: Data Uncertainty Learning in Masked Face Recognition
4814	Cov Loss: Covariance-based Loss for Deep Face Recognition	➖
5674	Boosting Face Recognition Performance with Synthetic Data and Limited Real Data	➖
2762	A Dual-Branch Adaptive Distribution Fusion Framework for Real-World Facial Expression Recognition
4199	Efficient Practices for Profile-to-Frontal Face Synthesis and Recognition	➖
4208	Learning Causal Representations for Generalizable Face Anti-Spoofing	➖
2767	Self-Paced Partial Domain-aware Learning for Face Anti-Spoofing	➖
746	Context-aware Face Clustering with Graph Convolutional Networks	➖

Source Separation, ICA, and Sparsity

🆔	Title	Repo
193	A Multi-Stage Triple-Path Method for Speech Separation in Noisy and Reverberant Environments	➖
524	On the Minimum Perimeter Criterion for Bounded Component Analysis	➖
4129	Joint Unmixing and Demosaicing Methods for Snapshot Spectral Images	➖
5036	Identifiable Bounded Component Analysis via Minimum Volume Enclosing Parallelotope	➖
5587	Balanced Deep CCA for Bird Vocalization Detection
1692	Independent Vector Analysis with Multivariate Gaussian Model: A Scalable Method by Multilinear Regression	➖
3184	Activity-Informed Industrial Audio Anomaly Detection via Source Separation	➖
6717	Double Nonstationarity: Blind Extraction of Independent Nonstationary Vector/Component from Nonstationary Mixtures - Algorithms	➖
6798	Towards Flexible Sparsity-Aware Modeling: Automatic Tensor Rank Learning using the Generalized Hyperbolic Prior	➖
5426	MedleyVox: An Evaluation Dataset for Multiple Singing Voices Separation
674	Hybrid Transformers for Music Source Separation
5141	Dictionary Learning on Graph Data with Weisfeiler-Lehman Sub-Tree Kernel and KSVD	➖

Neural Sound Synthesis and Representation

🆔	Title	Repo
2678	GANStrument: Adversarial Instrument Sound Synthesis with Pitch-Invariant Instance Conditioning
2555	I Hear Your True Colors: Image Guided Audio Generation
1261	Grad-StyleSpeech: Any-Speaker Adaptive Text-to-Speech Synthesis with Diffusion Models
3085	Voice Conversion using Feature Specific Loss Function based Self-Attentive Generative Adversarial Network
1268	TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion
6748	Decorrelating Feature Spaces for Learning General-Purpose Audio Representations
4904	Continuous Descriptor-based Control for Deep Audio Synthesis
5786	Rigid-Body Sound Synthesis with Differentiable Modal Resonators
5349	Exploring Approaches to Multi-Task Automatic Synthesizer Programming	➖
6710	Speech Time-Scale Modification with GANs	➖
4339	Full-Band General Audio Synthesis with Score-based Diffusion
4443	Is Quality Enough? Integrating Energy Consumption in a Large-Scale Evaluation of Neural Audio Synthesis Models	➖

Deep Learning for Audio and Music Applications

🆔	Title	Repo
896	Controllable Music Inpainting with Mixed-Level and Disentangled Representation
1991	HIPI: A Hierarchical Performer Identification Model based on Symbolic Representation of Music	➖
207	Chord-Conditioned Melody Harmonization with Controllable Harmonicity
1878	Jazznet: A Dataset of Fundamental Piano Patterns for Music Audio Machine Learning Research
5273	Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects
1442	An Improved Optimal Transport Kernel Embedding Method with Gating Mechanism for Singing Voice Separation and Speaker Identification	➖
3448	Tempo vs. Pitch: Understanding Self-Supervised Tempo Estimation
1995	Adversarial Permutation Invariant Training for Universal Sound Separation
1379	Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining	➖
4727	Low-Resource Music Genre Classification with Cross-Modal Neural Model Reprogramming
1375	SPADE: Self-Supervised Pretraining for Acoustic Disentanglement	➖
1615	On Out-of-Distribution Detection for Audio with Deep Nearest Neighbors

Machine Learning for Image and Video Processing

🆔	Title	Repo
1011	IoU-Aware Multi-Expert Cascade Network via Dynamic Ensemble for Long-Tailed Object Detection	➖
1622	Efficient Compressed Video Action Recognition via Late Fusion with a Single Network	➖
1649	Amicable Aid: Perturbing Images to Improve Classification Performance	➖
3861	Spatial Cross-Attention for Transformer-based Image Captioning	➖
3879	Towards Hyperbolic Regularizers for Point Cloud Part Segmentation	➖
5265	Clip4VideoCap: Rethinking CLIP for Video Captioning with Multiscale Temporal Fusion and Commonsense Knowledge	➖
6356	Learning Silhouettes with Group Sparse Autoencoders
5042	Deep Learning for Lagrangian Drift Simulation at The Sea Surface
2382	Difference Guided VHR Remote Sensing Image Change Detection	➖
2696	Adaptive Submanifold-Preserving Sparse Regression for Feature Selection and Multiclass Classification	➖
6814	Learning Multiscale Convolutional Dictionaries for Image Reconstruction
7162	Impact of PolSAR Pre-Processing and Balancing Methods on Complex-Valued Neural Networks Segmentation Tasks	➖

ASR: Text Adaptation

🆔	Title	Repo
209	Improving Contextual Spelling Correction by External Acoustics Attention and Semantic Aware Data Augmentation	➖
1007	AdapITN: A Fast, Reliable, and Dynamic Adaptive Inverse Text Normalization
1373	Fast and Accurate Factorized Neural Transducer for Text Adaption of End-to-End Speech Recognition Models	➖
1628	Effective Training of RNN Transducer Models on Diverse Sources of Speech and Text Data	➖
1672	Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis	➖
2409	Slot-triggered Contextual Biasing for Personalized Speech Recognition using Neural Transducers	➖
3355	Fine-grained Textual Knowledge Transfer to Improve RNN Transducers for Speech Recognition and Understanding	➖
4612	Gated Contextual Adapters for Selective Contextual Biasing in Neural Transducers	➖
4830	Internal Language Model Estimation based Adaptive Language Model Fusion for Domain Adaptation	➖
4970	Adaptable End-to-End ASR Models using Replaceable Internal LMs and Residual Softmax	➖
5596	Mask The Bias: Improving Domain-Adaptive Generalization of CTC-based ASR with Internal Language Model Estimation	➖
6116	Factorized AED: Factorized Attention-based Encoder-Decoder for Text-Only Domain Adaptive ASR	➖

ASR: Training Methods

🆔	Title	Repo
3731	Weight Averaging: A Simple Yet Effective Method to Overcome Catastrophic Forgetting in Automatic Speech Recognition	➖
112	Reducing the Gap Between Streaming and Non-Streaming Transducer-based ASR by Adaptive Two-Stage Knowledge Distillation	➖
164	Alignment Entropy Regularization	➖
392	From English to more Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition	➖
1499	Neural Transducer Training: Reduced Memory Consumption with Sample-Wise Computation	➖
2433	Towards Domain Generalisation in ASR with Elitist Sampling and Ensemble Knowledge Distillation	➖
2677	Accelerating RNN-T Training and Inference using CTC Guidance	➖
3382	Resource-Efficient Transfer Learning from Speech Foundation Model using Hierarchical Feature Fusion	➖
3917	Robust Knowledge Distillation from RNN-T Models with Noisy Training Labels using Full-Sum Loss	➖
5520	More Speaking or more Speakers?	➖
5845	Federated Learning for ASR based on Wav2Vec 2.0	➖
6343	Estimating Shapley Values of Training Utterances for Automatic Speech Recognition Models	➖

ASR: VAD and Other Topics

🆔	Title	Repo
691	Real-Time Speech Interruption Analysis: from Cloud to Client Deployment	➖
2005	Audio-to-Intent using Acoustic-Textual Subword Representations from End-to-End ASR	➖
2615	Adaptive Endpointing with Deep Contextual Multi-Armed Bandits	➖
2616	Dynamic Speech Endpoint Detection with Regression Targets	➖
2665	Speaker Change Detection for Transformer Transducer ASR	➖
4769	Less is more: A Unified Architecture for Device-Directed Speech Detection with Multiple Invocation Types	➖
4865	SG-VAD: Stochastic Gates based Speech Activity Detection
5523	Filler Word Detection with Hard Category Mining and Inter-Category Focal Loss	➖
5787	Unsupervised Voice Type Discrimination Score Adaptation using X-Vector Clusters	➖
6269	Multilingual Word Error Rate Estimation: e-WER3	➖
5792	Multilingual Query-by-Example Keyword Spotting with Metric Learning and Phoneme-to-Embedding Mapping	➖
7177	Leveraging Domain Features for Detecting Adversarial Attacks Against Deep Speech Recognition in Noise	➖
836	Keyword-Specific Acoustic Model Pruning for Open Vocabulary Keyword Spotting	➖
5030	Self-Supervised Speech Representation Learning for Keyword-Spotting with Light-Weight Transformers	➖
5579	Lightweight Feature Encoder for Wake-Up Word Detection based on Self-Supervised Speech Representation	➖
5649	VE-KWS: Visual Modality Enhanced End-to-End Keyword Spotting	➖
1378	Unified Keyword Spotting and Audio Tagging on Mobile Devices with Transformers
1518	Continual Learning for On-Device Speech Recognition using Disentangled Conformers	➖
1986	Filterbank Learning for Noise-Robust Small-Footprint Keyword Spotting	➖
3390	Locale Encoding for Scalable Multilingual Keyword Spotting Models	➖
3531	Small-Footprint Slimmable Networks for Keyword Spotting	➖
3615	Metric Learning for User-Defined Keyword Spotting
3928	WeKws: A Production First Small-Footprint End-to-End Keyword Spotting Toolkit
4822	Exploring Sequence-to-Sequence Transformer-Transducer Models for Keyword Spotting	➖

Automatic Audio Captioning and Retrieval

Auditory EEG Decoding Challenge

Image Restoration

Interpretable and Explainable Machine Learning

Language Modeling

Language Modeling and Spoken Language Understanding

Estimation Theory and Methods

AI Security and Privacy in Speech and Audio Processing

🆔	Title	Repo
673	Privacy-Enhanced Federated Learning Against Attribute Inference Attack for Speech Emotion Recognition	➖
2009	Privacy-Preserving Occupancy Estimation	➖
3761	Federated Intelligent Terminals Facilitate Stuttering Monitoring
4942	Beyond Neural-on-Neural Approaches to Speaker Gender Protection
6129	Distinguishable Speaker Anonymization Based on Formant and Fundamental Frequency Scaling	➖

Binaural Audio; Multichannel Source Separation

🆔	Title	Repo
1755	Spatially Informed Independent Vector Analysis for Source Extraction based on the Convolutive Transfer Function Model	➖
2514	Fast Online Source Steering Algorithm for Tracking Single Moving Source using Online Independent Vector Analysis	➖
4589	Online Binaural Speech Separation of Moving Speakers with a Wavesplit Network	➖
5759	Convolutive NTF for Ambisonic Source Separation under Reverberant Conditions	➖
4677	On the Relevance of the Differences between HRTF Measurement Setups for Machine Learning	➖
6362	Neural Fourier Shift for Binaural Speech Rendering
1620	Global HRTF Interpolation via Learned Affine Transformation of Hyper-Conditioned Features
4790	HRTF Field: Unifying Measured HRTF Magnitude Representation with Neural Fields
5041	Learning to Personalize Equalization for High-Fidelity Spatial Audio Reproduction	➖
6719	A Data-Driven Approach to Audio Decorrelation	➖
6777	Switching Independent Vector Analysis and Its Extension to Blind and Spatially Guided Convolutional Beamforming Algorithms	➖

Image/Video Caption Generation

🆔	Title	Repo
6029	End-to-End Non-Autoregressive Image Captioning
337	Enhancing Multimodal Alignment with Momentum Augmentation for Dense Video Captioning	➖
450	I-Tuning: Tuning Frozen Language Models with Image for Lightweight Image Captioning	➖
972	Video Captioning via Relation-Aware Graph Learning
1192	Improving Image Captioning with Control Signal of Sentence Quality	➖
5827	Background Disturbance Mitigation for Video Captioning via Entity-Action Relocation	➖
5304	Motion-Aware Video Paragraph Captioning via Exploring Object-Centered Internal Knowledge	➖
2203	Associative Learning Network for Coherent Visual Storytelling	➖
6772	Shot Noise Analysis for Differential Sampling in Indirect Time of Flight Cameras	➖

Flow Estimation

Image/Video Retrieval

Transfer Learning

Learning Theory and Algorithms

Distributed and Federated Learning

Machine Learning for Telecommunications

Dialog and Multimodal Processing of Language

Discourse and Dialog

Emerging Topics in Speech Synthesis

Audio and Text Segmentation, Tagging and Parsing

Diffusion-based Generative Models for Audio and Speech

🆔	Title	Repo
5245	Cold Diffusion for Speech Enhancement	➖
5709	Analysing Diffusion-based Generative Approaches versus Discriminative Approaches for Speech Restoration
2264	Unsupervised Vocal Dereverberation with Diffusion-based Generative Models
5637	Solving Audio Inverse Problems with a Diffusion Model
5778	DiffPhase: Generative Diffusion-based STFT Phase Retrieval
3196	Optimal Transport in Diffusion Modeling for Conversion Tasks in Audio Domain	➖

Multilingual Alzheimer's Dementia Recognition through Spontaneous Speech: a Signal Processing Grand Challenge

Model Pruning and Compression

Image Recognition and Detection

🆔	Title	Repo
907	Data-Aware Zero-Shot Neural Architecture Search for Image Recognition	➖
3890	CFFMixer: Multi-Dimensional Feature Fusion for Object Detection	➖
1242	SANet: Spatial Attention Network with Global Average Contrast Learning for Infrared Small Target
736	Logovit: Local-Global Vision Transformer for Object Re-Identification
319	ProContEXT: Exploring Progressive Context Transformer for Tracking
3268	Pair DETR: Toward Faster Convergent DETR	➖

Machine Learning Methods for Language

Machine Translation and Dialog System

Radar Waveform Design: Recent Advances and New Emerging Applications

Conversational Healthcare Interfaces

Computer Vision Applications

Domain-Specific Detection

Temporal Video Analysis and Detection

Object Detection

Deep Learning for Speech and Audio Processing

Deep Learning for Speech and Language Processing

Language Modeling and Representation Learning

Lightweight TTS and TTS Analysis

Machine Translation for Spoken and Written Language

🆔	Title	Repo
683	Improving Speech-to-Speech Translation through Unlabeled Text	➖
1867	A Holistic Cascade System, Benchmark, and Human Evaluation Protocol for Expressive Speech-to-Speech Translation	➖
3026	Decoupled Non-Parametric Knowledge Distillation for End-to-End Speech Translation	➖
3135	Joint Pre-training with Speech and Bilingual Text for Direct Speech-to-Speech Translation
3822	LEAPT: Learning Adaptive Prefix-to-Prefix Translation for Simultaneous Machine Translation	➖
3889	Enhancing Speech-To-Speech Translation with Multiple TTS Targets	➖
4196	Rethinking the Reasonability of the Test Set for Simultaneous Machine Translation
4387	Align, Write, Re-order: Explainable End-to-End Speech Translation via Operation Sequence Generation	➖
4983	Efficient Speech Translation with Dynamic Latent Perceivers
5169	Joint Training and Decoding for Multilingual End-to-End Simultaneous Speech Translation
5381	Enhancing Ontology Translation through Cross-Lingual Agreement	➖
6523	M³ST: Mix at Three Levels for Speech Translation	➖

Music Audio Synthesis and Modeling

Spoken Language Understanding Grand Challenge

Image Segmentation

Multi-Speaker ASR

Multimodal Processing of Language and Language Systems

🆔	Title	Repo
1158	Prefix Tuning for Automated Audio Captioning
1648	C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval
2096	The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR
2768	Adaptive Knowledge Distillation between Text and Speech Pre-trained Models	➖
6140	A Processing Framework to Access Large Quantities of Whispered Speech Found in ASMR
567	Cross-Modal Mutual Learning for Cued Speech Recognition	➖
1886	SLBERT: A Novel Pre-Training Framework for Joint Speech and Language Modeling	➖
2190	Cross-Modal Adversarial Contrastive Learning for Multi-Modal Rumor Detection	➖
2884	Multiple Contrastive Learning for Multimodal Sentiment Analysis	➖
3666	Token2vec: A Joint Self-Supervised Pre-Training Framework using Unpaired Speech and Text	➖
3714	DAIS: The Delft Database of EEG Recordings of Dutch Articulated and Imagined Speech	➖
4409	A Token-Level Contrastive Framework for Sign Language Translation
4801	Sign Language Recognition via Deformable 3D Convolutions and Modulated Graph Convolutional Networks	➖
4837	LAST: Scalable Lattice-based Speech Modelling in JAX
4989	M-SpeechCLIP: Leveraging Large-Scale, Pre-trained Models for Multilingual Speech to Image Retrieval	➖
5014	Using Emotion Embeddings to Transfer Knowledge between Emotions, Languages, and Annotation Formats
5146	Speech-Text based Multi-Modal Training with Bidirectional Attention for Improved Speech Recognition	➖

Tracking

Radar-Assisted Perception (RAP)

Data Driven and Machine Learning based Room Acoustic Modeling

Sensing Applications

Computational Imaging

Anomaly Detection

Deep Neural Network

Deep Learning

Deep and Sequential Learning

Machine Learning for Time Series Analysis

Multilingual Speech Recognition and Identification

Quantum Computing for Machine Learning and Signal Processing

Sound Event Detection

Brain Connectivity

Speech Signal Improvement Signal Processing Grand Challenge 2023

Anonymization and Data Privacy

Natural Language Processing

Pronunciation and Fluency Assessment

Edge Learning for Emerging Wireless Technologies

Acoustic Sensor Array Processing and Sound Source Localization

Representation Learning

Adversarial Machine Learning

🆔	Title	Repo	Paper
987	Backdoor Defense via Suppressing Model Shortcuts

Target Detection and Classification

Spatial Processing for Audio and Speech

Brain Computer Interfaces

Acoustic Echo Cancellation Signal Processing Grand Challenge 2023

DoA Estimation

Speaker Recognition: Scoring, Fairness, Privacy

Speaker Recognition: Verification, Diarization, Anti-Spoofing

🆔	Title	Repo	Paper
3059	Pushing the Limits of Self-Supervised Speaker Verification using Regularized Distillation Framework

Recent Advances in Robust Learning for Modern Computational Imaging

Signal Processing and Machine Learning for Networked Autonomous Agents

Active Noise Control, Echo Reduction and Feedback Reduction

Anomaly Detection and Representation Learning for Audio Classification

Data Processing

Perceptual Assessment

Machine Learning for Recommendation, Search and other Applications

Reinforcement Learning

Pattern Recognition and Classification

Sparsity, Compressed Sensing, and Tensor Decomposition

Adversarial Machine Learning and Information Theoretic Security

Resource Constrained ASR

Singing Voice Synthesis/Conversion and Pretrained TTS

Medical Image Reconstruction

L3DAS23: Learning 3D Audio Sources for Audio-Visual Extended Reality

Multimedia Forensics

MIMO Radars and Waveform Design

Speech Dysarthria

Speech Emotion Recognition: General Topics

🆔	Title	Repo
2490	Multi-Scale Receptive Field Graph Model for Emotion Recognition in Conversations
3918	MGAT: Multi-Granularity Attention based Transformers for Multi-Modal Emotion Recognition	➖
4523	Achieving Fair Speech Emotion Recognition via Perceptual Fairness	➖
5023	Personalized Task Load Prediction in Speech Communication
5075	DWFormer: Dynamic Window Transformer for Speech Emotion Recognition
5730	Multi-View Learning for Speech Emotion Recognition with Categorical Emotion, Categorical Sentiment, and Dimensional Scores	➖
540	Mingling or Misalignment? Temporal Shift for Speech Emotion Recognition with Pre-trained Representations
563	Emotion Recognition in Conversation from Variable-Length Context	➖
1423	Knowledge-Aware Graph Convolutional Network with Utterance-Specific Window Search for Emotion Recognition in Conversations	➖
1611	Masking Speech Contents by Random Splicing: is Emotional Expression Preserved?	➖
3129	Multi-Local Attention for Speech-based Depression Detection	➖
3130	Daily Mental Health Monitoring from Speech: A Real-World Japanese Dataset and Multitask Learning Analysis	➖
3830	SDTN: Speaker Dynamics Tracking Network for Emotion Recognition in Conversation	➖
4065	Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach for Speech Emotion Recognition
5683	Designing and Evaluating Speech Emotion Recognition Systems: A Reality Check Case Study with IEMOCAP	➖
5711	EMix: A Data Augmentation Method for Speech Emotion Recognition	➖
6131	A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition	➖
6316	Automatic Classification of Vocal Intensity Category from Speech	➖

Intelligent and Semantic Communications for 5G Mobile Networks and Beyond

Audio and Speech Quality Measurements

Acoustic Modeling; Auditory Modeling for Hearing Instruments

Anonymization, Data Privacy, and Biometrics

Object Recognition

Identification Detection

Tracking, Data Fusion, and Sensor Networks

🆔	Title	Repo
268	Deep Fusion of Multi-Object Densities using Transformer
6240	Nonnegative Block-Term Decomposition with the β-Divergence: Joint Data Fusion and Blind Spectral Unmixing
2238	Robust Subspace Tracking with Contamination via α-Divergence
2321	Wireless Location Tracking via Complex-Domain Super MDS with Time Series Self-Localization Information	➖
2463	Angle-of-Arrival Target Tracking using a Mobile UAV in External Signal-Denied Environment	➖
2821	A Distributed Adaptive Algorithm for Non-Smooth Spatial Filtering Problems	➖
2937	A Computationally Efficient Algorithm for Distributed Adaptive Signal Fusion based on Fractional Programs	➖
3217	Data Driven Joint Sensor Fusion and Regression based on Geometric Mean Squared Error	➖
4043	Sensor Selection for Angle of Arrival Estimation based on the Two-Target Cramér-Rao Bound
4149	Clustered Greedy Algorithm for Large-Scale Sensor Selection	➖

Speaker Recognition: Neural Network Architecture

Speech Analysis

Speaker Recognition: Anti-Spoofing and Verification

🆔	Title	Repo	Paper
5447	SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing

Bayesian Signal Processing

Speaker Recognition: Verification and Diarization

Learning on Graphs for Biology and Medicine

🆔	Title	Repo
2914	Deep Spatio-Temporal Multiplex Graph Learning for Cardiac Imaging Classification	➖
4165	Graph Signal Processing for Neuroimaging to Reveal Dynamics of Brain Structure-Function Coupling	➖
4375	Multiple Signed Graph Learning for Gene Regulatory Network Inference	➖
4599	Predicting Brain Age using Transferable Covariance Neural Networks	➖
6456	Spatial Graph Signal Interpolation with an Application for Merging BCI Datasets with Various Dimensionalities

Learning from Neuroimaging Data

Lightweight, Multi-Speaker, Multi-Lingual Indic Text-to-Speech

Quality Assessment and Anomaly Detection

Human-Centric Multimedia and Human-Machine Interaction

Speech Emotion Recognition: Transfer Learning

🆔	Title	Repo
457	A Generalized Subspace Distribution Adaptation Framework for Cross-Corpus Speech Emotion Recognition	➖
3755	Fast Yet Effective Speech Emotion Recognition with Self-Distillation
3954	Domain Adaptation without Catastrophic Forgetting on a Small-Scale Partially-Labeled Corpus for Speech Emotion Recognition	➖
4547	Phonetic Anchor-based Transfer Learning to Facilitate Unsupervised Cross-Lingual Speech Emotion Recognition	➖
4559	Zero-Shot Speech Emotion Recognition using Generative Learning with Reconstructed Prototypes	➖
4858	Unsupervised Domain Adaptation for Preference Learning based Speech Emotion Recognition	➖

Multi-Antenna Communications and Sensing

Quantum Machine Learning Algorithms and Applications on NISQ Devices

Neural Speech and Audio Coding: Emerging Challenges and Opportunities

Medical and Environmental Acoustics; Audio Security

Classification of Acoustic Scenes and Events

Learning from EEG Data

Physiological Signal Processing

Speech Production, Perception, and Psychoacoustics

Watermarking, Data Hiding and Human Factors in Security

3D Point Cloud/Stereo Video

Face Processing

MIMO Radars and MIMO Communications

Speaker Recognition: Diarization

Estimation, Detection, and Classification

Model Lightweight and Video Compression

Subspace and Manifold Learning

🆔	Title	Repo
2651	Generative Modeling based Manifold Learning for Adaptive Filtering Guidance	➖
684	Tensor Completion for Efficient and Accurate Hyperparameter Optimisation in Large-Scale Statistical Learning	➖
903	CO-NET: Classification-Oriented Point Cloud Sampling via Informative Feature Learning and Non-Overlapped Local Adjustment	➖
2091	Deep Survival Analysis and Counterfactual Inference using Balanced Representations	➖
3045	Feature Space Recovery for Incomplete Multi-View Clustering	➖
4602	Study of Manifold Geometry using Multiscale Non-Negative Kernel Graphs	➖

Speech Enhancement - Diffusion and Other Generative Models

🆔	Title	Repo
2594	Cross-domain Diffusion based Speech Enhancement for Very Noisy Speech
3643	SRTNet: Time Domain Speech Enhancement via Stochastic Refinement
4671	Diffusion-based Generative Speech Source Separation
4716	SEPDIFF: Speech Separation based on Denoising Diffusion Model	➖
5798	Fast and Efficient Speech Enhancement with Variational Autoencoders	➖
6105	Metric-oriented Speech Enhancement using Diffusion Probabilistic Model	➖

ICASSP2023 General Meeting Understanding and Generation (MUG) Challenge

Signal Processing for Smart City Applications and the Internet of Things

Symbol-Level Precoding: Recent Advance and New Applications in 6G and Beyond

Graphical Inference and Modeling in Dynamical Systems

Deep Learning-based Source Separation

Medical Image Segmentation

Bioinformatics

Cybersecurity, Hardware and Network Security

Multi-Antenna Communications and Intelligent Reflecting Surfaces

Multimedia Compression and Quality

Multimedia Analysis, Synthesis, and Learning

DoA Estimation and Beamforming

Speech Emotion Recognition: Multimodality

Speech Emotion Recognition: Neural Architectures

Optimization Methods for Signal Processing

5th DNS Challenge at IEEE ICASSP 2023

Signal Processing and Learning over Dynamic Graphs

Human Action Recognition

Deep Generative Model

🆔	Title	Repo
1565	String-based Molecule Generation via Multi-Decoder VAE	➖
4161	Graph Contrastive Learning with Learnable Graph Augmentation	➖
3180	Conditioning and Sampling in Variational Diffusion Models for Speech Super-Resolution
5068	Evaluation of Categorical Generative Models - Bridging the Gap Between Real and Synthetic Data	➖
6053	Diffusion Probabilistic Modeling for Fine-Grained Urban Traffic Flow Inference with Relaxed Structural Constraint	➖
4977	Single-Shot Domain Adaptation via Target-aware Generative Augmentations

Multimodal Signal Processing and Analysis

Speech Enhancement - Self-Supervised Learning

🆔	Title	Repo
915	Perceive and Predict: Self-Supervised Speech Representation based Loss Functions for Speech Enhancement	➖
2006	DATA2VEC-SG: Improving Self-Supervised Learning Representations for Speech Generation Tasks	➖
3343	Speech Separation with Large-Scale Self-Supervised Learning	➖
3511	Self-Supervised Learning-based Source Separation for Meeting Data
4456	An Adapter based Multi-Label Pre-training for Speech Separation and Enhancement	➖
5785	Self-Supervised Learning for Speech Enhancement Through Synthesis

Distributed and Reliable Signal Processing and Communications

Resource-Efficient Real-time Neural Speech Separation

Multichannel Speech Enhancement, Dereverberation, and System Identification

Multilabel Acoustic Event Classification

Deep Learning for Medical Imaging

🆔	Title	Repo	Paper
1384	Coarse-to-Fine Covid-19 Segmentation via Vision-Language Alignment

Machine/Deep Learning Methodologies for Multimedia

Human-Centric Multimedia

Source Localization and Separation

Speech Enhancement - Audio-Visual, Multi-Channel, and Other

Speech Enhancement - Separation and Target Speech Extraction

🆔	Title	Repo	Paper
3175	Unifying Speech Enhancement and Separation with Gradient Modulation for End-to-End Noise-Robust Speech Separation

Speech Enhancement - Single Channel

Machine Learning Applications to Communications

Aspects in Image Generation/Analysis

Multi-Antenna and Multi-Carrier Communications

Signal Filtering, Restoration, Enhancement, and Reconstruction

ICASSP SP Clarity Challenge: Speech Enhancement for Hearing Aids

Image and Video Enhancement

Speech Recognition-training/adaptation

Decentralized Wireless Systems and Energy Harvesting

Robust Learning and Inference

Music Classification and Transcription

Music Information Retrieval

Deep Learning for Medical Image Segmentation

Detection and Classification in Medical Imaging

Image Coding/Compression

Audio-Visual Signal Processing and Analysis

Various Aspects in Speech and Language Processing

Speech Recognition: Modeling and Context

Speech Recognition: Self-Supervised Models

Channel State Estimation

Signal Processing over Graphs and Networks

Signal Processing over Networks

Applications to Vision, Speech, and Robotics

🆔	Title	Repo
6443	LMBAO: A Landmark Map for Bundle Adjustment Odometry in LiDAR SLAM	➖
1069	Residual Squeeze-and-Excitation U-Shaped Network for Minutia Extraction in Contactless Fingerprint Images	➖
1603	TSPTQ-ViT: Two-Scaled Post-Training Quantization for Vision Transformer	➖
3925	Low-Complexity Low-Rank Approximation SVD for Massive Matrix in Tensor Train Format	➖
2043	DailyTalk: Spoken Dialogue Dataset for Conversational Text-to-Speech
3040	Cooperative Five Degrees of Freedom Motion Estimation for a Swarm of Autonomous Vehicles	➖

Person Identification and Relapse Detection from Continuous Recordings of Biosignals

Vision and Language Model

TTS: AM and Vocoder

Signal Processing Education

Signal Processing and Systems for Remote Biometrics

Signal Processing for RIS-Enabled Smart Wireless Environments

Multimodal Learning

Video Coding/Compression

Object Tracking

Image Generation

Spoken Language Understanding

Optimization and Machine Learning for Communications

Sparse/Low-Dimensional Signal Processing

Signal Processing Theory and Methods

Radar/Array Signal Processing. Networks and Communications

Applications to Communications

The First Pathloss Radio Map Prediction Challenge

Human Video Generation and Editing

Point Cloud Processing

Multimedia Databases and Information Retrieval

Voice and Style Conversion

Synergy between Human and Machine Approaches to Sound/Scene Recognition and Processing

Topological and Simplicial Data Processing

Unsupervised Deep Learning of Image Priors for Inverse Problems

Self-Supervised Learning and Data-Efficiency for Speech and Audio

🆔	Title	Repo	Paper
5842	Audio Signal Enhancement with Learning from Positive and Unlabelled Data

Sound Event Detection and Localization; Bioacoustic Event Detection

Aspects in Machine Learning

Aspects in Image/Video Processing and Analysis

🆔	Title	Repo	Paper
2133	ShaDocNet: Learning Spatial-Aware Tokens in Transformer for Document Shadow Removal

Learning Algorithms and Applications

Optimization Methods in Machine Learning

Applications of Machine Learning

Sensing, Computing, and Semantic Communications

Sparsity and Low-Rank Models

Signal Processing over Graphs

Target Source Extraction

Music Generation and Arrangement

Multimodal Information based Speech Processing (MISP) 2022 Challenge

Image Retrieval and Classification

Variational Inference and Approximate Bayesian Techniques

Spatial Audio Recording and Reproduction

Speech Modeling and Audio Coding

Audio Processing and Analysis

Image/Video Enhancement

Zero or Few-Shot Learning

Acoustic and Microphone Array Processing

Speech and Language Disorders

Various Aspects in Speech and Speaker Recognition

Sampling Theory, Compressed and Non-uniform Sampling

Show and Tell Demos: Session

Rising Stars Workshop

