NLP
For Prospective Students and Research Fellows
The Data Mining and Information Systems (DMIS) Lab is looking for MS/Ph.D. students and postdoctoral fellows with a strong interest in Natural Language Processing (NLP) and Biomedical NLP (BioNLP).
DMIS Lab aims to utilize AI and Machine Learning to tackle a variety of research topics such as large-scale Language Modeling, Question Answering, and Named Entity Recognition. Our research has been presented at top-tier conferences and journals such as ACL, NAACL, EMNLP, and Bioinformatics. In particular, our pre-trained language model BioBERT has been cited over 2,400 times and was selected as one of the Best Papers for the Natural Language Processing Section of the 2020 IMIA (International Medical Informatics Association) Yearbook. BioBERT is by far the most cited pre-trained language model in the biomedical domain.
Aside from research publications, DMIS Lab has participated in many international BioNLP challenges and achieved top performance, including BioASQ Phase B (2019, 2020) and BioCreative VII NER (NLM-Chem) (2021). We are also currently collaborating with global technology and pharmaceutical companies including Amazon, Adobe (USA), AstraZeneca (UK), and other valuable partners. Our alumni continue to produce outstanding achievements in both academia and industry, as tenure-track assistant professors at Korea University and Sejong University (Korea) and as research scientists at Google, Amgen (USA), and NAVER (Korea), among others.
Recent News
2024
Sep. 1 paper accepted to EMNLP 2024.
Jul. 1 paper accepted to CIKM 2024.
Apr. 1 paper accepted to ISMB 2024.
Mar. Our new medical LM, Meerkat-7B, became the first 7B-parameter model to pass the United States Medical Licensing Examination (USMLE). (News)
Mar. BioBERT, our biomedical domain-specific transformer-based language model, has reached a remarkable milestone: surpassing 5,000 citations!
Feb. 1 paper accepted to LREC-COLING 2024.
Jan. 1 paper accepted to EACL 2024.
2023
Oct. 2 papers accepted to EMNLP 2023.
Aug. BioBERT, the first and most cited biomedical domain-specific transformer-based large language model, has surpassed 4,000 citations.
Aug. Invited talk "Natural Language Processing and Social Media: Challenges, Applications and TweetNLP" by Jose Camacho Collados (Cardiff University).
Jun. Top performance at BioASQ 11b-phase B.
May. Top performance at BioNLP 2023 Shared Task on Radiology Report Summarization (RadSum@BioNLP23).
May. 2 papers accepted to ACL 2023.
Apr. Invited talk by Jinhyuk Lee (Google Research).
Mar. Invited talk by Fangyu Liu (University of Cambridge).
2022
Nov. 1 paper accepted to AAAI 2023.
Oct. Invited talk by Timothy Miller (Harvard Medical School).
Oct. 4 papers accepted to EMNLP 2022 (3 main conference papers and 1 industry track paper).
Sep. Wonjin Yoon received "Standigm Paper Award 2022" from the Korean Society for Bioinformatics.
Sep. 1 paper accepted to Bioinformatics 2022.
Jun. 1 paper accepted to Bioinformatics 2022.
Apr. Invited talk "Vision-and-Language Learning" by Jaemin Cho (UNC Chapel Hill).
Apr. 1 paper accepted to NAACL 2022.
Feb. 1 paper accepted to ACL 2022.
Feb. Joint seminar with Language & Knowledge Lab, KAIST (advised by Minjoon Seo).
Jan. Invited talk by Byeongchang Kim (SNU) and Hyunwoo Kim (SNU).
2021
Dec. Invited talk "Multimodal Evaluation Metric and Image Captioning Model" by Seunghyun Yoon (Adobe Research).
Nov. Top performance at 2 challenge tracks held by the BioCreative VII workshop (3rd at Track 1 & 1st at Track 2-NER).
Aug. 3 papers accepted to EMNLP 2021.
May. 2 papers accepted to ACL 2021.
Jan. 1 paper accepted to EACL 2021.
~2020
Nov.2020 Invited talk by Gyuwan Kim (NAVER) and Sundong Kim (NAVER).
Oct.2020 Our team KU has won the eighth BioASQ challenge in 8b Phase B (Results - KU Team, News).
Sep.2020 BioBERT was included in the Best Papers for the NLP Section of the IMIA 2020 Yearbook (link).
Sep.2020 2 papers accepted to EMNLP 2020.
Apr.2020 2 papers accepted to ACL 2020.
Jan.2020 Wonjin Yoon received the NAVER Ph.D. Fellowship Award. Congratulations!
Sep.2019 Our team KU has won the seventh BioASQ challenge in 7b Phase B (Results - KU Team).
Aug.2019 1 paper accepted to Bioinformatics.
Aug.2019 Invited talk by Dr. Tim Miller (Harvard University).
May.2019 1 paper accepted to ACL 2019.
Mar.2019 Invited talk by Minjoon Seo (University of Washington & NAVER).
Aug.2018 1 paper accepted to EMNLP 2018.
Publications
2024
CompAct: Compressing Retrieved Documents Actively for Question Answering
Chanwoong Yoon, Taewhoo Lee, Hyeon Hwang, Minbyul Jeong, Jaewoo Kang*
EMNLP 2024 (Long)
[Paper] [Code]
LAPIS: Language Model-Augmented Police Investigation System
Heedou Kim, Dain Kim, Jiwoo Lee, Chanwoong Yoon, Donghee Choi, Mogan Gim*, Jaewoo Kang*
CIKM 2024
[Paper] [Code]
Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models
Minbyul Jeong, Jiwoong Sohn, Mujeen Sung*, Jaewoo Kang*
ISMB 2024
[Paper] [Code]
CookingSense: A Culinary Knowledgebase with Multidisciplinary Assertions
Donghee Choi, Mogan Gim, Donghyeon Park, Mujeen Sung, Hyunjae Kim, Jaewoo Kang*, Jihun Choi*
LREC-COLING 2024
[TBA]
Fine-tuning CLIP Text Encoders with Two-step Paraphrasing
Hyunjae Kim, Seunghyun Yoon, Trung Bui, Handong Zhao, Quan Tran, Franck Dernoncourt, Jaewoo Kang*
Findings of EACL 2024
[Paper]
2023
Tree of Clarifications: Answering Ambiguous Questions with Retrieval-Augmented Large Language Models
Gangwoo Kim, Sungdong Kim, Byeongguk Jeon, Joonsuk Park, Jaewoo Kang*
EMNLP 2023 (Short)
[Paper] [Code]
Pre-training Intent-Aware Encoders for Zero- and Few-Shot Intent Classification
Mujeen Sung, James Gung, Elman Mansimov, Nikolaos Pappas, Raphael Shu, Salvatore Romeo, Yi Zhang, Vittorio Castelli
EMNLP 2023 (Long)
[Paper] [Code]
KU-DMIS-MSRA at RadSum23: Pre-trained Vision-Language Model for Radiology Report Summarization
Gangwoo Kim, Hajung Kim, Lei Ji, Seongsu Bae, Chanhwi Kim, Mujeen Sung, Hyunjae Kim, Kun Yan, Eric Chang, Jaewoo Kang*
BioNLP Workshop @ ACL 2023
[Paper]
Automatic Creation of Named Entity Recognition Datasets by Querying Phrase Representations
Hyunjae Kim, Jaehyo Yoo, Seunghyun Yoon, Jaewoo Kang*
ACL 2023 (Long)
[Paper]
Optimizing Test-Time Query Representations for Dense Retrieval
Mujeen Sung, Jungsoo Park, Jaewoo Kang, Danqi Chen, Jinhyuk Lee
Findings of ACL 2023 (Long)
[Paper] [Code]
LIQUID: A Framework for List Question Answering Dataset Generation
Seongyun Lee†, Hyunjae Kim†, Jaewoo Kang
AAAI 2023
[Paper] [Code]
2022
Biomedical NER for the Enterprise with Distillated BERN2 and the Kazu Framework
Wonjin Yoon†, Richard Jackson†, Elliot Ford, Vladimir Poroshin, Jaewoo Kang
EMNLP 2022 (Industry Track)
[Paper] [Code]
Saving Dense Retriever from Shortcut Dependency in Conversational Search
Sungdong Kim, Gangwoo Kim
EMNLP 2022 (Long)
[Paper] [Code]
Generating Information-Seeking Conversations from Unlabeled Documents
Gangwoo Kim†, Sungdong Kim†, Kang Min Yoo, Jaewoo Kang*
EMNLP 2022 (Long)
[Paper] [Code]
Simple Questions Generate Named Entity Recognition Datasets
Hyunjae Kim, Jaehyo Yoo, Seunghyun Yoon, Jinhyuk Lee, Jaewoo Kang*
EMNLP 2022 (Long)
[Paper] [Code]
Consistency Training with Virtual Adversarial Discrete Perturbation
Jungsoo Park†, Gyuwan Kim†, Jaewoo Kang
NAACL 2022 (Short)
[Paper] [Code]
FAVIQ: FAct Verification from Information-seeking Questions
Jungsoo Park†, Sewon Min†, Jaewoo Kang, Luke Zettlemoyer, Hannaneh Hajishirzi
ACL 2022 (Long)
[Paper] [Code] [Web Service]
2021
Phrase Retrieval Learns Passage Retrieval, Too
Jinhyuk Lee, Alexander Wettig, Danqi Chen
EMNLP 2021 (Long)
[TBA]
Can Language Models be Biomedical Knowledge Bases?
Mujeen Sung, Jinhyuk Lee, Sean Yi, Minji Jeon, Sungdong Kim, Jaewoo Kang*
EMNLP 2021 (Short)
[TBA]
Simple Entity-Centric Questions Challenge Dense Retrievers
Christopher Sciavolino, Zexuan Zhong, Jinhyuk Lee, Danqi Chen
EMNLP 2021 (Short)
[TBA]
Learning Dense Representations of Phrases at Scale
Jinhyuk Lee, Mujeen Sung, Jaewoo Kang, Danqi Chen
ACL-IJCNLP 2021 (Long)
[Paper] [Code] [Web Service]
Learn to Resolve Conversational Dependency: A Consistency Training Framework for Conversational Question Answering
Gangwoo Kim, Hyunjae Kim, Jungsoo Park, Jaewoo Kang*
ACL-IJCNLP 2021 (Long)
[Paper] [Code]
'Killing Me' Is Not a Spoiler: Spoiler Detection Model using Graph Neural Networks with Dependency Relation-Aware Attention Mechanism
Buru Chang, Inggeol Lee, Hyunjae Kim, Jaewoo Kang*
EACL 2021 (Short)
[Paper]
2020
Answering Questions on COVID-19 in Real-Time
Jinhyuk Lee, Sean S. Yi, Minbyul Jeong, Mujeen Sung, Wonjin Yoon, Yonghwa Choi, Miyoung Ko, Jaewoo Kang*
NLP-COVID Workshop (EMNLP 2020) (Long)
[Paper] [Code] [Web Service]
Look at the First Sentence: Position Bias in Question Answering
Miyoung Ko, Jinhyuk Lee*, Hyunjae Kim, Gangwoo Kim, Jaewoo Kang*
Conference on Empirical Methods in Natural Language Processing (EMNLP 2020) (Long)
[Paper] [Code]
Adversarial Subword Regularization for Robust Neural Machine Translation
Jungsoo Park, Mujeen Sung, Jinhyuk Lee*, Jaewoo Kang*
Conference on Empirical Methods in Natural Language Processing (EMNLP 2020/Findings) (Short)
[Paper] [Code]
Transferability of Natural Language Inference to Biomedical Question Answering
Minbyul Jeong†, Mujeen Sung†, Gangwoo Kim, Donghyeon Kim, Wonjin Yoon, Jaehyo Yoo, Jaewoo Kang*
BioASQ Workshop (CLEF 2020)
[Paper] [Code]
Biomedical Entity Representations with Synonym Marginalization
Mujeen Sung, Hwisang Jeon, Jinhyuk Lee*, Jaewoo Kang*
Annual Conference of the Association for Computational Linguistics (ACL 2020) (Long)
[Paper] [Code]
Contextualized Sparse Representations for Real-Time Open-Domain Question Answering
Jinhyuk Lee, Minjoon Seo, Hannaneh Hajishirzi, Jaewoo Kang
Annual Conference of the Association for Computational Linguistics (ACL 2020) (Short)
[Paper] [Code]
Building a PubMed knowledge graph
Jian Xu, Sunkyu Kim, Min Song, Minbyul Jeong, Donghyeon Kim, Jaewoo Kang*, Justin F. Rousseau, Xin Li, Weijia Xu, Vetle I. Torvik, Yi Bu, Chongyan Chen, Islam Akef Ebeid, Daifeng Li & Ying Ding
Scientific Data 2020
[Paper]
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Jinhyuk Lee†, Wonjin Yoon†, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, Jaewoo Kang*
Bioinformatics 2020
[Paper] [Code]
2019
Pre-trained Language Models for Biomedical Question Answering
Wonjin Yoon, Jinhyuk Lee, Donghyeon Kim, Minbyul Jeong, Jaewoo Kang*
BioASQ Workshop (ECML PKDD 2019)
[Paper] [Code]
CollaboNet: collaboration of deep neural networks for biomedical named entity recognition
Wonjin Yoon†, Chan Ho So†, Jinhyuk Lee, and Jaewoo Kang*
Bioinformatics 2019
[Paper] [Code]
A Neural Named Entity Recognition and Multi-Type Normalization Tool for Biomedical Text Mining
Donghyeon Kim, Jinhyuk Lee, Chan Ho So, Hwisang Jeon, Minbyul Jeong, Yonghwa Choi, Wonjin Yoon, Mujeen Sung, Jaewoo Kang*
IEEE Access 2019
[Paper] [Code] [Demo]
Can Machines Learn to Comprehend Scientific Literature?
Donghyeon Park, Yonghwa Choi, Daehan Kim, Minhwan Yu, Seongsoon Kim and Jaewoo Kang*
IEEE Access 2019
[Paper]
Real-Time Open-Domain Question Answering on Wikipedia with Dense-Sparse Phrase Index
Minjoon Seo†, Jinhyuk Lee†, Tom Kwiatkowski, Ankur Parikh, Ali Farhadi, Hannaneh Hajishirzi
Annual Conference of the Association for Computational Linguistics (ACL 2019) (Long)
[Paper] [Code]
ChimerDB 4.0: an updated and expanded database of fusion genes
Ye Eun Jang†, Insu Jang†, Sunkyu Kim†, Subin Cho†, Daehan Kim, Keonwoo Kim, Jaewon Kim, Jimin Hwang, Sangok Kim, Jaesang Kim, Jaewoo Kang, Byungwook Lee*, Sanghyuk Lee*
Nucleic Acids Research 2019
[Paper] [Demo]
2018
Ranking Paragraphs for Improving Answer Recall in Open-Domain Question Answering
Jinhyuk Lee, Seongjun Yun, Hyunjae Kim, Miyoung Ko, and Jaewoo Kang*
Conference on Empirical Methods in Natural Language Processing (EMNLP 2018) (Short)
[Paper] [Code]
Learning User Preferences and Understanding Calendar Contexts for Event Scheduling
Donghyeon Kim†, Jinhyuk Lee†, Donghee Choi, Jaehoon Choi and Jaewoo Kang*
International Conference on Information and Knowledge Management (CIKM 2018)
[Paper] [Code]
A Deep Neural Spoiler Detection Model using a Genre-Aware Attention Mechanism
Buru Chang, Hyunjae Kim, Raehyun Kim, Daehan Kim, and Jaewoo Kang*
The 22nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2018)
[Paper] [Code]
A Pilot Study of Biomedical Text Comprehension using an Attention-Based Deep Neural Reader: Design and Experimental Analysis
Seongsoon Kim, Donghyeon Park, Yonghwa Choi, Kyubum Lee, Byounggun Kim, Minji Jeon, Jihye Kim, Aik Choon Tan, Jaewoo Kang*
JMIR Medical Informatics 2018
[Paper]
Deep learning of mutation-gene-drug relations from the literature
Kyubum Lee, Byounggun Kim, Yonghwa Choi, Sunkyu Kim, Wonho Shin, Sunwon Lee, Sungjoon Park, Seongsoon Kim, Aik Choon Tan, Jaewoo Kang*
Bioinformatics 2018
[Paper]
Drug drug interaction extraction from the literature using a recursive neural network
Sangrak Lim, Kyubum Lee, Jaewoo Kang*
PLOS ONE 2018
[Paper] [Code]
Chemical-gene relation extraction using recursive neural network
Sangrak Lim, Jaewoo Kang*
Database 2018
[Paper] [Code]
~2017
Name Nationality Classification with Recurrent Neural Networks
Jinhyuk Lee, Hyunjae Kim, Miyoung Ko, Donghee Choi, Jaehoon Choi, and Jaewoo Kang*
International Joint Conference on Artificial Intelligence (IJCAI 2017)
[Paper] [Code]
Constructing and Evaluating a Novel Crowdsourcing-based Paraphrased Opinion Spam Dataset
Seongsoon Kim†, Seongwoon Lee†, Donghyeon Park, and Jaewoo Kang*
International World Wide Web Conference (WWW 2017)
[Paper] [Code]
ChimerDB 3.0: an enhanced database for fusion genes from cancer transcriptome and literature data mining
Myunggyo Lee†, Kyubum Lee†, Namhee Yu†, Insu Jang†, Ikjung Choi, Pora Kim, Ye Eun Jang, Byounggun Kim, Sunkyu Kim, Byungwook Lee, Jaewoo Kang*, and Sanghyuk Lee*
Nucleic Acids Research 2017
[Paper] [Code]
SEMO: Searching Majority Opinions on Movies using SNS and QA Threads
Jukyong Lee, Yonghwa Choi, Suhkyung Kim, Seongsoon Kim, Jaewoo Kang*
The 25th International World Wide Web Conference (WWW 2016) (Demo)
[Paper]
BEST: Next-Generation Biomedical Entity Search Tool for Knowledge Discovery from Biomedical Literature
Sunwon Lee†, Donghyeon Kim†, Kyubum Lee, Jaehoon Choi, Seongsoon Kim, Minji Jeon, Sangrak Lim, Donghee Choi, Sunkyu Kim, Aik-Choon Tan, Jaewoo Kang*
PLOS ONE 2016
[Paper] [Code] [Demo]
HiPub: translating PubMed and PMC texts to networks for knowledge discovery
Kyubum Lee, Wonho Shin, Byounggun Kim, Sunwon Lee, Yonghwa Choi, Sunkyu Kim, Minji Jeon, Aik Choon Tan, Jaewoo Kang*
Bioinformatics 2016 (Applications Note)
[Paper]
BRONCO: Biomedical entity Relation ONcology COrpus for extracting gene-variant-disease-drug relations
Kyubum Lee, Sunwon Lee, Sungjoon Park, Sunkyu Kim, Suhkyung Kim, Kwanghun Choi, Aik Choon Tan*, Jaewoo Kang*
Database 2016
[Paper] [Demo]
Deep Semantic Frame-based Deceptive Opinion Spam Analysis
Seongsoon Kim, Hyeokyoon Chang, Seongwoon Lee, Minhwan Yu, Jaewoo Kang*
In Proceedings of ACM International Conference on Information and Knowledge Management (CIKM 2015)
[Paper]
Smith Search: Opinion-Based Restaurant Search Engine
Jaehoon Choi, Donghyeon Kim, Donghee Choi, Seongsoon Kim, Sangrak Lim, Youngjae Choi and Jaewoo Kang*
Proceedings of the 24th International Conference on World Wide Web (WWW 2015) (Demo)
[Paper]
BOSS: context-enhanced search for biomedical objects
Jaehoon Choi, Donghyeon Kim, Seongsoon Kim, Sunwon Lee, Kyubum Lee and Jaewoo Kang*
BMC Medical Informatics and Decision Making 2012
[Paper]
Consento: A New Framework for Opinion Based Entity Search and Summarization
Jaehoon Choi, Donghyeon Kim, Seongsoon Kim, Junkyu Lee, Sunwon Lee and Jaewoo Kang*
21st ACM International Conference on Information and Knowledge Management (CIKM 2012) (Short)
[Paper]
Consento: A Consensus Search Engine for Answering Subjective Queries
Jaehoon Choi, Donghyeon Kim, Seongsoon Kim, Junkyu Lee, Sangrak Lim, Sunwon Lee and Jaewoo Kang*
Proceedings of the 21st International Conference on World Wide Web (WWW 2012) (Poster)
[Paper]
BOSS: A Biomedical Object Search System
Jaehoon Choi, Donghyeon Kim, Seongsoon Kim, Sunwon Lee, Kyubum Lee and Jaewoo Kang*
ACM Fifth International Workshop on Data and Text Mining in Biomedical Informatics (DTMBIO 2011)
[Paper]
A Scalable Method for Detecting Multiple Loci Associated with Traits using TF-IDF Weighting and Association Rule Mining
Sunwon Lee, Jaewoo Kang and Junho Oh
IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW 2010)
[Paper]