We tested whether popular AI tools that claim to detect deepfakes actually work on real viral videos. To do this, we collected 20 clips that had spread online — 10 confirmed deepfakes and 10 real videos of politicians and celebrities. Then, we ran all of them through two different publicly available deepfake detectors (Deepware AI and UB’s DeepFake-O-Meter), compared their results, and measured how often they got it right or wrong.
Why it matters:
Deepfakes of politicians and celebrities spread rapidly on social media, and journalists, platforms, and ordinary viewers increasingly rely on free online detectors to decide what to trust. If those tools miss obvious fakes or wrongly flag genuine clips, they can amplify misinformation instead of containing it. We therefore checked whether two widely used public detectors could reliably separate authentic videos from deepfakes that had already fooled millions of people online, comparing their results side by side and measuring how consistent they were.
📂 Key finding:
The detectors often got it wrong. Sometimes they flagged real videos as fake, and other times they missed obvious deepfakes that had already gone viral. In short, the tools weren’t reliable enough to trust as a safety net against misinformation.
20-60% Detection Rate
2 Platforms
20 Viral Videos
Impact & Applications
🗳️
Election Security
Protecting democratic processes by identifying fake political content before it can influence voters and spread misinformation during critical election periods.
📻
Media & Journalism
Helping news organizations and social media platforms quickly verify the authenticity of viral content to prevent the spread of fabricated news.
⚖
Legal Evidence
Supporting legal proceedings by providing systematic evaluation methods for video evidence authenticity in courts and investigations.
Digital Safety
Protecting individuals from deepfake harassment and identity theft by improving detection systems on social platforms and messaging apps.
Current Challenges
High False Negative Rates: Current tools missed 40-80% of confirmed deepfakes in our sample
Platform Inconsistency: Different tools give conflicting results on same content
Cultural Bias: Performance varies across different demographic groups
Technical Limitations: Resolution and compression affect detection accuracy
🎥 Research Demo
TODO: Brief explanation of what viewers will see in the demo video and its key highlights.
📊 Research Infographic
🎨
Visual Summary
Figure 1A. Confusion matrix for Mapping A (treating “Suspicious” as positive). Heatmap showing true positives (6), false negatives (4), true negatives (10), and false positives (0) for Deepware AI outputs on 20 viral videos. Accuracy = 80%, sensitivity = 60%, specificity = 100%, precision = 100%.
Figure 1B. Confusion matrix for Mapping B (treating “Suspicious” as negative). Heatmap showing true positives (2), false negatives (8), true negatives (10), and false positives (0) for Deepware AI outputs. Accuracy = 60%, sensitivity = 20%, specificity = 100%, precision = 100%.
Figure 2. Per-video detection scores for 10 deepfake samples. Grouped bar graph comparing Deepware AI ensemble scores (blue) and UB DeepFake-O-Meter mean detector scores (red). Highlights large variance across videos such as Biden, Trump, and Rashmika Mandanna.
Figure 3. Cross-platform agreement on deepfake detection. Scatter plot comparing Deepware AI ensemble scores (x-axis) to UB DeepFake-O-Meter mean scores (y-axis). Each point represents a video sample; labels identify subjects. Wide scatter indicates inconsistent cross-platform agreement.
Figure 4. Score distribution by manipulation type. Boxplot comparing detection scores for face-swap/identity-replacement deepfakes vs. lip-synthesis/dubbing manipulations. Face-swap examples (e.g., Obama, Morgan Freeman) show higher median detection likelihoods.
📄 Abstract
Deepfakes pose serious risks to public trust and information integrity; we tested whether publicly available detection tools reliably identify viral real-world deepfakes. We hypothesized that off-the-shelf detectors would show inconsistent accuracy and produce both false positives and false negatives when applied to in-the-wild videos. To test this, we evaluated 20 viral clips (10 confirmed deepfakes, 10 authentic controls) using two public detection platforms and recorded ensemble and per-model likelihoods across more than ten detectors. Results revealed substantial cross-platform disagreement: one platform's ensemble flagged only a minority of confirmed deepfakes while the research platform produced extreme per-model score variance, so that sensitivity depended strongly on how an intermediate "Suspicious" label was treated. Depending on the binary mapping used, measured sensitivity varied widely while specificity remained high for this sample. We conclude that current public detectors provide useful signals but are not yet reliable as sole arbiters of authenticity for viral content; we recommend publishing full per-video numeric outputs, versioned model identifiers, and pairing automated screening with human expert review.
Methodology
Video Sample Selection
We compiled a convenience sample of 20 viral videos: 10 confirmed deepfakes and 10 authentic control videos featuring prominent political and entertainment figures. Deepfake items were drawn from documented repositories, academic demonstrations, BBC News segments, and viral media that had been publicly debunked by fact-checking organizations.
For full transparency: The exact titles and descriptions of all Deepfake and Authentic video samples used in this study are listed in the Video Sample Collection & Dataset section below. Each sample is available for download, and the lists match the dataset files used for all analyses. Please refer to these lists for precise sample documentation and replication.
👥
Notable Test Cases Include:
Political Figures: Barack Obama BBC News demonstration, Joe Biden's "pistachio story", Donald Trump LipSynthesis, Amit Shah reservation video
Celebrities: Morgan Freeman Singularity video, Anderson Cooper LipSynthesis, Bill Gates deepfake examples
Indian Entertainment: Aamir Khan & Ranveer Singh political endorsements, Rashmika Mandanna viral video
Since Deepware returns a three-level categorical output, we evaluated two operational mappings (a minimal conversion sketch follows the list below):
Mapping A (Lenient): Treat "Suspicious" as positive detection
Mapping B (Conservative): Only "Deepfake Detected" counts as positive
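For concreteness, the following sketch (in Python, chosen here for illustration; the study itself only reports categorical counts, so this is not the authors' analysis code) converts Deepware's three categorical labels into binary predictions under each mapping:

```python
# Deepware returns one of three categorical labels per video.
LABELS = {"DEEPFAKE DETECTED", "SUSPICIOUS", "NO DEEPFAKE DETECTED"}

def to_binary(label: str, mapping: str = "A") -> bool:
    """True if the label counts as a positive (deepfake) detection."""
    label = label.upper()
    if label not in LABELS:
        raise ValueError(f"Unexpected Deepware label: {label!r}")
    if mapping == "A":                       # Mapping A (lenient)
        return label in {"DEEPFAKE DETECTED", "SUSPICIOUS"}
    return label == "DEEPFAKE DETECTED"      # Mapping B (conservative)

# Example: a clip labeled SUSPICIOUS is a true positive under Mapping A
# but a false negative under Mapping B.
print(to_binary("SUSPICIOUS", "A"), to_binary("SUSPICIOUS", "B"))  # True False
```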
📏 Performance Metrics
We computed standard binary classification metrics: accuracy, sensitivity (recall), specificity, precision, and F1 score using confusion matrices derived from categorical counts.
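As a worked illustration (again in Python, not the original analysis code), the helper below computes these metrics from confusion-matrix counts; plugging in the Mapping A counts reported in Figure 1A (TP = 6, FN = 4, TN = 10, FP = 0) reproduces the 80% accuracy, 60% sensitivity, 100% specificity, and 100% precision values.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "f1": f1}

# Mapping A (Figure 1A): TP=6, FN=4, TN=10, FP=0
print(classification_metrics(tp=6, fp=0, tn=10, fn=4))
# Mapping B (Figure 1B): TP=2, FN=8, TN=10, FP=0
print(classification_metrics(tp=2, fp=0, tn=10, fn=8))
```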
Results
20% Deepfake Detection Rate (Deepware)
40% Suspicious Classifications
98.6% Max Score Variance (Cross-Platform)
8/10 AVSRDD Detection Success
0% False Positives (Authentic Videos)
11 AI Models Tested
Key Findings
Our analysis revealed significant limitations and inconsistencies in current public deepfake detection tools when applied to viral real-world content:
Raw Data Results
Sample Deepfake Clips used for testing:
Viral Deepfake Videos Thrive Of Aamir Khan & Ranveer Singh Endorsing Political Parties – Business Today
Amit Shah Fake Video: Debunking the Fake Video of Amit Shah On Reservations – The Indian Express
AI Deepfake Video Of Actress Rashmika Mandanna Going Viral – Business Today
Anderson Cooper, 4K Original/(Deep) Fake Example - LipSynthesis
Fake Obama created using AI video tool - BBC News
This is not Morgan Freeman - A Deepfake Singularity
President Joe Biden's Magical Pistachio Story (Deepfake AI)
Deepfake example. Original/Deepfake close shot Bill Gates.
2024 Deepfake Example in 4k - ORIGINAL/DEEPFAKE - Bill Gates
Trump 4k Deepfake example – LipSynthesis
Sample Authentic (Real) Clips used for testing:
Aamir Khan-Reena की लव स्टोरी कैसे शुरू हुई, घरवालों ने क्या हंगामा किया, छुपकर शादी क्यों करनी पड़ी.mp4 (Hindi title; roughly: “How Aamir Khan and Reena’s love story began, the uproar in their families, and why they had to marry in secret”)
Anderson Cooper’s tribute to his friend Anthony Bourdain.mp4
Highlights from Obama's farewell address.mp4
HM Amit Shah’s Fiery Remark in Parliament_ ‘Hindu Terrorist Nahi Ho Sakta’ _ Amit Shah _ Rajya Sabha.mp4
Morgan Freeman Re-Enacts The Shawshank Redemption _ The Graham Norton Show.mp4
President Joe Biden Takes the Oath of Office _ Biden-Harris Inauguration 2021.mp4
President Trump's Inaugural Address.mp4
Ranveer Singh On Playing Khilji In Padmaavat _ India Today Exclusive Interview.mp4
Rashmika Mandanna Interview with Anupama Chopra _ Mission Majnu _ Goodbye _ Film Companion.mp4
The next outbreak_ We’re not ready _ Bill Gates _ TED.mp4
Tool 1: Deepware A.I.
All Fake Video Results:
The video titled “Viral Deepfake Videos Thrive Of Aamir Khan & Ranveer Singh Endorsing Political Parties – Business Today”, was scanned on 2025-09-20 at 06:19:29 UTC. The scan result indicates SUSPICIOUS. Model results: Avatarify 39% (no deepfake), Deepware 25% (no deepfake), Seferbekov 75% (suspicious), Ensemble 55% (suspicious). Duration: 181s, 1280x720, 29.97fps, h264; audio: 181s, stereo, 48kHz, AAC.
The video titled “Amit Shah Fake Video: Debunking the Fake Video of Amit Shah On Reservations – The Indian Express”, scanned 2025-09-20 at 06:25:41 UTC. SUSPICIOUS. Model results: Avatarify 0%, Deepware 20%, Seferbekov 97% (deepfake), Ensemble 67% (suspicious). Duration: 162s, 1280x720, 30fps, h264; audio: 162s, stereo, 48kHz, AAC.
The video titled “AI Deepfake Video Of Actress Rashmika Mandanna Going Viral – Business Today”, scanned 2025-09-20 at 06:28:12 UTC. NO DEEPFAKE DETECTED. Model results: Avatarify 24%, Deepware 16%, Seferbekov 46%, Ensemble 28%. Duration: 172s, 1280x720, 29.97fps, h264; audio: 172s, stereo, 48kHz, AAC.
The video titled “Anderson Cooper, 4K Original/(Deep) Fake Example - LipSynthesis”, scanned 2025-09-20 at 06:31:05 UTC. SUSPICIOUS. Model results: Avatarify 72% (suspicious), Deepware 0%, Seferbekov 3%, Ensemble 0%. Duration: 210s, 3840x2160, 30fps, h264; audio: 210s, stereo, 48kHz, AAC.
The video titled “Fake Obama created using AI video tool - BBC News”, scanned 2025-09-20 at 06:34:42 UTC. DEEPFAKE DETECTED. Model results: Analyst confirmed deepfake, Avatarify 19%, Deepware 0%, Seferbekov 49%, Ensemble 12%. Duration: 200s, 1920x1080, 29.97fps, h264; audio: 200s, stereo, 48kHz, AAC.
The video titled “This is not Morgan Freeman - A Deepfake Singularity”, scanned 2025-09-20 at 06:37:10 UTC. DEEPFAKE DETECTED. Model results: Analyst deepfake detected, Avatarify 18%, Deepware 0%, Seferbekov 0%, Ensemble 0%. Duration: 198s, 1920x1080, 29.97fps, h264; audio: 198s, stereo, 48kHz, AAC.
The video titled “President Joe Biden's Magical Pistachio Story (Deepfake AI)”, scanned 2025-09-20 at 06:39:55 UTC. SUSPICIOUS. Model results: Avatarify 29%, Deepware 34%, Seferbekov 71% (suspicious), Ensemble 58% (suspicious). Duration: 185s, 1280x720, 29.97fps, h264; audio: 185s, stereo, 48kHz, AAC.
The video titled “Deepfake example. Original/Deepfake close shot Bill Gates.”, scanned 2025-09-20 at 06:42:12 UTC. NO DEEPFAKE DETECTED. Model results: Avatarify 20%, Deepware 0%, Seferbekov 2%, Ensemble 0%. Duration: 180s, 1280x720, 29.97fps, h264; audio: 180s, stereo, 48kHz, AAC.
The video titled “2024 Deepfake Example in 4k - ORIGINAL/DEEPFAKE - Bill Gates”, scanned 2025-09-20 at 06:45:00 UTC. NO DEEPFAKE DETECTED. Model results: Avatarify 20%, Deepware 0%, Seferbekov 2%, Ensemble 0%. Duration: 182s, 3840x2160, 30fps, h264; audio: 182s, stereo, 48kHz, AAC.
The video titled “Trump 4k Deepfake example – LipSynthesis”, scanned 2025-09-20 at 06:48:33 UTC. NO DEEPFAKE DETECTED. Model results: Avatarify 40%, Deepware 2%, Seferbekov 1%, Ensemble 1%. Duration: 190s, 3840x2160, 30fps, h264; audio: 190s, stereo, 48kHz, AAC.
All Real Video Results:
All real videos scanned in September 2025. Deepware A.I. is in Beta; results are advisory only.
All real videos (see list above) were marked as NO DEEPFAKE DETECTED by all models (Avatarify, Deepware, Seferbekov, Ensemble), with low probabilities (0-35%). Video/audio specs varied; see supplementary for details.
Tool 2: DeepFake-O-Meter (UB Media Forensics Lab)
This project is supported by the University at Buffalo and the National Science Foundation (SaTC-2153112). The tool aggregates many advanced models for deepfake detection. See below for model list and results.
Amit Shah Fake Video: DSP-FWA 96.1%, FTCN 0.4%, WAV2LIP-STA 48.3%, SBI 22.4%, XCLIP 71.4%, AltFreezing 16.7%, TALL 98.8%, LIPINC No Lip Movement, LSDA 74.2%, AVSRDD 99.7%, CFM 38.2%. Second run: similar scores. High likelihood of being fake by most advanced models.
The next outbreak_ We’re not ready _ Bill Gates _ TED.mp4: AVSRDD 100%, TALL 98.8%, XCLIP 98.9%, LIPINC 95.3%. WAV2LIP-STA 50.7%, CFM 42.1%, AVAD 40.0%. LSDA 32.8%, DSP-FWA 21.4%, SBI 15.6%, AltFreezing 9.4%, FTCN 0.4%. Advanced models indicate fake, older models do not.
Note: These percentages reflect statistical correlations with real and fake samples in training datasets and should not be interpreted as definitive evidence of authenticity or fabrication. See supplementary for full per-model breakdowns and references.
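Figures 2 and 3 compare each clip's Deepware ensemble score against a mean of the DeepFake-O-Meter per-model scores. The sketch below (Python, illustrative only) shows one plausible way to form that summary for the Amit Shah entry listed above; treating LIPINC's non-numeric "No Lip Movement" output as missing is our assumption, since the report does not say how such outputs were handled.

```python
from statistics import mean

# DeepFake-O-Meter per-model scores (%) for the Amit Shah clip, as listed above.
# None marks LIPINC's non-numeric "No Lip Movement" output (our handling choice).
ub_scores = {
    "DSP-FWA": 96.1, "FTCN": 0.4, "WAV2LIP-STA": 48.3, "SBI": 22.4,
    "XCLIP": 71.4, "AltFreezing": 16.7, "TALL": 98.8, "LIPINC": None,
    "LSDA": 74.2, "AVSRDD": 99.7, "CFM": 38.2,
}
deepware_ensemble = 67.0  # Deepware ensemble score (%) for the same clip

numeric = [s for s in ub_scores.values() if s is not None]
ub_mean = mean(numeric)                    # mean detector score (Figures 2-3)
score_range = max(numeric) - min(numeric)  # spread across UB models
platform_gap = abs(ub_mean - deepware_ensemble)

print(f"UB mean: {ub_mean:.1f}%  UB model range: {score_range:.1f}%  "
      f"gap vs Deepware ensemble: {platform_gap:.1f}%")
```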
📈 Performance by Content Type
Face-swap deepfakes (Obama, Morgan Freeman) produced higher detection signals than lip-synthesis manipulations
Political content showed mixed results with occasional false positives on authentic speeches
Cultural factors affected detection - Indian entertainment deepfakes showed inconsistent patterns
Technical factors like resolution and compression affected detection consistency
⚠️ Critical Implications
Bottom Line: Current public detectors provide useful signals but are not reliable enough to serve as sole arbiters of authenticity for viral content. The high specificity (few false alarms) comes at the cost of poor sensitivity (missing many real deepfakes).
🔬 Technical Factors Affecting Detection
Content Type Impact:
Face-swap/identity-replacement deepfakes performed better than lip-synthesis
Political content showed mixed results with occasional false positives on authentic speeches
Cultural context affected performance (Indian entertainment deepfakes showed inconsistent patterns)
Technical Specifications:
Higher resolution content (3840×2160) sometimes reduced detection consistency
Lower resolution clips (480×360) elicited more consistent flags among advanced models
Video compression and multi-stage postprocessing masked detector cues
Platform-specific compression from viral sharing affected artifacts
⚖️ Operational Implications
High-Stakes Contexts: Should NOT be used as sole arbiters for legal evidence, election monitoring, or content takedown decisions
Recommended Use: As triage tools flagging material for human expert review
Transparency Requirements: Publishers must document thresholds, model versions, analyst interventions, and scan timestamps (a sketch of such a scan record follows this list)
Threshold Sensitivity: Performance claims meaningless without specifying binary mapping strategy
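To make the transparency requirement concrete, here is a hypothetical per-scan record layout (a sketch, not a format prescribed by either platform), populated with the BBC Obama entry from the Deepware results above; the model_versions value is a placeholder, since the public scans did not expose versioned identifiers.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class ScanRecord:
    """Per-video scan log covering the transparency items listed above."""
    video_title: str
    platform: str                      # e.g. "Deepware AI" or "DeepFake-O-Meter"
    scan_timestamp_utc: str            # when the scan was run
    model_versions: Dict[str, str]     # versioned identifier per detector model
    raw_scores: Dict[str, float]       # raw per-model numeric outputs (%)
    categorical_result: str            # e.g. "DEEPFAKE DETECTED"
    binary_mapping: str                # "A (lenient)" or "B (conservative)"
    threshold: Optional[float] = None  # decision threshold, if one was applied
    analyst_intervention: str = ""     # any manual override or review note

record = ScanRecord(
    video_title="Fake Obama created using AI video tool - BBC News",
    platform="Deepware AI",
    scan_timestamp_utc="2025-09-20T06:34:42Z",
    model_versions={"Seferbekov": "unversioned (public beta)"},  # placeholder
    raw_scores={"Avatarify": 19, "Deepware": 0, "Seferbekov": 49, "Ensemble": 12},
    categorical_result="DEEPFAKE DETECTED",
    binary_mapping="B (conservative)",
    analyst_intervention="Analyst confirmed deepfake",
)
print(record.video_title, "->", record.categorical_result)
```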
🎯 Conclusion
Our analysis demonstrates that currently accessible detection tools offer useful signals but remain insufficiently reliable for fully automated judgments on viral real-world videos. While these tools showed excellent specificity (correctly identifying authentic content), their sensitivity varied dramatically depending on operational thresholds and content characteristics.
The substantial disagreement both within and across platforms points to deeper methodological issues. Current detectors respond to different artifact signatures rather than converging on robust indicators of synthetic origin, leading to situations where identical content produces near-certain and near-zero likelihood scores depending on the model consulted.
Key Recommendations:
Automated detectors should be used as triage tools paired with human expert review, not as sole arbiters
Transparency about thresholds, model versions, and analyst interventions is essential
Research should focus on principled ensemble weighting, model calibration, and domain-adaptive training
Evaluation protocols must reflect the messy, compressed, and culturally varied media found in real-world circulation
Until significant improvements are realized, automated detection should be used cautiously and as part of a broader, human-supervised verification workflow to protect against misinformation while avoiding false accusations.
Study Limitations & Future Directions
Current Study Limitations
Sample Size: Limited to 20 high-profile viral clips, which constrains statistical power and may not represent the full diversity of real-world manipulations
Ground Truth Verification: Established via public debunking reports and media documentation rather than direct access to generation artifacts
Temporal Snapshot: Results reflect detection model capabilities at specific time points (April 2024 - September 2025) since models evolve rapidly
Cultural/Language Scope: Focused primarily on English-language and Western/Indian content
Statistical Approach: Emphasized descriptive metrics rather than inferential testing due to modest sample size
Future Research Directions
Enhanced Evaluation
Expand testing to more diverse content (different languages, cultures, generation methods)
Create standardized benchmarks for real-world deepfake evaluation
Develop cross-dataset evaluation protocols that capture distributional diversity
Focus on principled ensemble weighting and model calibration
Implement domain-adaptive training for robustness across content types
Improve model explainability to help experts interpret detection decisions
Transparency & Standards
Establish requirements for publishing raw per-video numeric outputs
Mandate versioned model identifiers and scan timestamps
Document all analyst interventions and threshold choices
Create industry standards for detection platform transparency
Real-World Application
Integrate human expert review workflows with automated detection
Develop policies for high-stakes contexts (legal evidence, election monitoring)
Address compression artifacts and multi-stage sharing effects
Study demographic and cultural biases in detection systems
Frequently Asked Questions
Why does this research matter?
With AI-generated deepfakes spreading rapidly on social media, people need to know whether they can trust the online detection tools that claim to identify fake videos. Our research tests whether these popular tools actually work on real viral content, revealing serious limitations that could affect how we combat misinformation.
What is new about this work?
Previous studies have mostly tested detection models on controlled, laboratory-created datasets. We are the first to systematically evaluate public detection tools on actual viral videos that circulated on social media - including deepfakes of Obama, Biden, Trump, and Bollywood celebrities. This real-world testing reveals problems that lab studies missed.
What are the limitations of this study?
Our sample was limited to 20 high-profile viral videos, which may not represent all types of deepfakes. Ground truth was established through public debunking reports rather than direct access to generation tools. Results reflect a snapshot in time, since detection models evolve rapidly. We also focused on English-language and primarily Western/Indian content.
Can others reproduce your results?
We provide complete video lists, platform URLs (Deepware.ai and DeepFake-O-Meter), and our evaluation methodology in the paper. Since these are web-based tools, anyone can test the same videos we used. However, results may vary over time as the platforms update their models.
What are the next steps for this research?
We plan to expand testing to more diverse content (different languages, cultures, and generation methods), develop better consensus algorithms that combine multiple detectors, and work with platform developers to improve transparency about how their tools work. We also want to create standardized benchmarks for real-world deepfake evaluation.
Should people rely on these detection tools?
Use them as helpful hints, not definitive answers. Our research shows they miss many real deepfakes (low sensitivity) but rarely call real videos fake (high specificity). For important decisions - like news verification or legal evidence - always combine automated tools with human expert review and multiple sources of verification.
🔗 Related Work
Our research builds upon extensive prior work in deepfake detection while addressing a critical gap: most studies evaluate models on controlled laboratory datasets rather than real-world viral content. This work bridges that gap by testing public tools on authentic viral media.
DSP-FWA (2019) & FTCN (2021) - Early detection algorithms that showed promise on controlled datasets but limited real-world performance
Deepfake-Eval-2024 - Recent benchmark showing 50% AUC drops when models face in-the-wild content, confirming our hypothesis
Tolosana et al. (2020) - Comprehensive survey of face manipulation techniques that informed our understanding of deepfake generation methods
UB Media Forensics Lab - Developers of DeepFake-O-Meter platform that enabled our multi-model evaluation approach
Novel Contribution: Unlike previous studies that focus on algorithmic improvements, we provide the first systematic evaluation of publicly accessible detection tools on viral real-world content, revealing critical limitations for practical deployment.
Technical Resources
📈
Extended Results
Additional experiments and detailed analysis not included in the main paper.
To reproduce our benchmarking study, researchers need access to the same detection platforms and video samples we used. Since our study evaluates publicly available tools on viral content, the main requirements are platform access and careful documentation of scan parameters.
Step-by-Step Instructions
1
Platform Access
Obtain access to detection platforms:
Deepware AI Scanner: https://deepware.ai (free beta access)
DeepFake-O-Meter (UB Media Forensics Lab): https://zinc.cse.buffalo.edu/ubmdfl/deep-o-meter/landing_page
Our study used 20 carefully selected viral videos (10 deepfakes + 10 authentic) featuring prominent public figures. All samples are available for research replication:
Complete Dataset Download
Download all 20 video samples (deepfakes + authentic) in a single compressed archive
We thank Mrs. Sirisha Vadigineni (Narayana E-Techno and Olympiad School, Whitefield, Bengaluru) and Mrs. Ramya Shujith (Glentree Academy, Whitefield, Bengaluru) for their guidance and support during this research.
This work utilized the Deepware Scanner https://deepware.ai and the DeepFake-O-Meter platform https://zinc.cse.buffalo.edu/ubmdfl/deep-o-meter/landing_page, which provided accessible and effective tools for detecting synthetic media. We acknowledge the contributions of the teams behind these tools, including the UB Media Forensics Lab, supported by the University at Buffalo and the National Science Foundation under Grant SaTC-2153112.
Contact & Collaborate
Interested in our deepfake detection research? Have questions about our methodology or want to collaborate on misinformation detection? We'd love to hear from you!