Episodes

  • Deep Research by OpenAI - The Ups and Downs vs DeepSeek R1 Search + Gemini Deep Research
    Feb 3 2025

    12 hours ago Deep Research was unveiled, and I’ve tested it thoroughly, including vs Deepseek R1 with search, Gemini Deep Research and even R1 in Perplexity. It’s a notable step forward, with one big caveat. I’ll go through all the benchmark figures, my initial impression of the o3 model within, and much more.

    Deep Research:
    https://openai.com/index/introducing-deep-research/

    https://www.youtube.com/watch?v=YkCDVn3_wiw


    GAIA Bench: https://openreview.net/forum?id=fibxvahvs3

    https://openreview.net/pdf?id=fibxvahvs3

    CodeELO: https://arxiv.org/pdf/2501.01257

    CamelCamel: https://uk.camelcamelcamel.com/

    Deepseek R1 with search: https://chat.deepseek.com/

    https://arxiv.org/pdf/2501.12948

    HaluBench: https://arxiv.org/pdf/2407.08488


    Chapters:

    00:00 - Introduction

    01:06 - Powered by o3, Humanity’s Last Exam, GAIA

    03:55 - Simple Tests

    06:00 - Good News vs Deepseek R1 and Gemini Deep Research

    09:32 - Bad News on Hallucinations

    14:14 - What Can’t it Browse?

    14:42 - For Shopping?

    16:40 - Final thoughts



    19 min
  • o3-mini and the “AI War”
    Jan 31 2025

    o3-mini is here, and yes, I’ve read the paper in full - 2 hours after release, and even the post-launch Reddit AMA. Some epic details like a FrontierMath score that made me double-take, a likely new Cursor favorite, bio risk expertise and a cost-comparison with Deepseek R1. But does it perform on basic reasoning? Let’s find out. Plus, arguably the bigger story - the increasingly frenetic rhetoric coming out of the West - and Dario Amodei and Alexandr Wang (CEOs of Anthropic and Scale AI respectively) in particular. The last thing we need is an “AI War”.


    https://wandb.me/simple-bench


    (Colab): https://colab.research.google.com/drive/1AVijcPnEkl8Gy_754XbRdG5m7Q5-9slg?usp=sharing


    Chapters:

    00:00 - Introduction

    00:45 - o3 mini

    05:11 - First impressions vs Deepseek R1

    07:21 - 10x Scale, o3-mini System Card, Amodei Essay, bitcoin wallets…

    12:40 - Simple Competition Finale

    13:03 - Clips and Final Thoughts on the “AI War”



    O3-mini: https://openai.com/index/openai-o3-mini/

    Paper: https://cdn.openai.com/o3-mini-system-card.pdf

    Amodei Essay: https://darioamodei.com/on-deepseek-and-export-controls?s=09

    FrontierMath wild stat: https://arxiv.org/pdf/2411.04872

    Sam Altman Channels Napoleon: https://x.com/sama/status/1883185690508488934

    Altman ‘pulls up releases’: https://x.com/sama/status/1884066337103962416

    “AI War” by Wang: https://scale.com/blog/win-the-ai-war

    Anthropic Original Views on Capabilities: https://www.anthropic.com/news/core-views-on-ai-safety

    AI Insider Cost Comparison: https://x.com/arankomatsuzaki/status/1884676245922934788

    Deepseek R1 Paper: https://arxiv.org/pdf/2501.12948

    R1, o3-mini Price Comparison: https://techcrunch.com/2025/01/31/openai-launches-o3-mini-its-latest-reasoning-model/

    Semianalysis on $1.3M Deepseek salaries, and them falling behind as ‘the time gap to match US capabilities increases’: https://semianalysis.com/2025/01/31/deepseek-debates/

    OpenAI Valuation: https://www.bloomberg.com/news/articles/2025-01-30/openai-in-talks-to-raise-funding-at-340-billion-value-wsj-says?srnd=phx-ai

    Wang Clip: https://x.com/tsarnick/status/1867700453494206883

    Amodei Clip: https://x.com/ai_ctrl/status/1884951111771001188

    https://simple-bench.com/



    15 min
  • Nothing Much Happens in AI, Then Everything Does All At Once
    Jan 24 2025

    When it rains, it pours. OpenAI Operator tested and reviewed, with full paper analysis. Perplexity Assistant is useful. Then Stargate: is it all smoke and mirrors? Strong rumours of an o3+ model from Anthropic. Then a full breakdown of Deepseek R1, and what its training method says about the state of AI. It’s not open source BTW. Plus Humanity’s Last Exam, and Hassabis accelerates his AGI timeline.

    00:00 - Introduction

    00:54 - OpenAI Operator

    04:53 - Perplexity Assistant

    05:15 - StarGate

    07:51 - Better than o3?

    08:25 - DeepSeek R1 Analysis

    12:12 - Training Secrets

    15:19 - No More Process Rewarding?

    19:01 - Hassabis Timeline Accelerates

    21:22 - Humanity’s Last Exam


    https://app.grayswan.ai/arena/chat/harmful-ai-assistant

    https://app.grayswan.ai/arena

    https://openai.com/index/computer-using-agent/

    System Prompt: https://github.com/wunderwuzzi23/scratch/blob/master/system_prompts/operator_system_prompt-2025-01-23.txt


    OpenAI Operator: https://operator.chatgpt.com/

    System Card: https://cdn.openai.com/operator_system_card.pdf


    There is No Plan: https://x.com/jeffclune/status/1882120726339318007


    Perplexity Assistant: https://x.com/perplexity_ai/status/1882466239123255686


    Stargate: https://openai.com/index/announcing-the-stargate-project/

    Labour goes to 0: https://moores.samaltman.com/

    Larry Ellison AI Surveillance: https://x.com/TheChiefNerd/status/1882042989184430332

    Amodei 1984: https://www.bloomberg.com/news/articles/2025-01-22/anthropic-ceo-says-openai-s-stargate-venture-seems-chaotic

    Microsoft Hesitate: https://www.theinformation.com/articles/why-sam-altman-joined-forces-with-larry-ellison-and-took-a-step-back-from-microsoft?rc=sy0ihq


    Dylan Patel o3+ for Anthropic: https://www.youtube.com/watch?v=7EH0VjM3dTk


    Deepseek R1: https://arxiv.org/pdf/2501.12948

    https://arxiv.org/pdf/2412.19437

    Diagram: https://pbs.twimg.com/media/GhyQsM6WQAE7W52?format=jpg&name=large

    https://simple-bench.com/

    Process: https://x.com/sama/status/1664018190840614912

    https://x.com/karpathy/status/1835561952258723930

    https://openai.com/index/trading-inference-time-compute-for-adversarial-robustness/?s=09

    Demis Interview: https://www.youtube.com/watch?v=yr0GiSgUvPU

    Humanity’s Last Exam:

    https://agi.safe.ai/

    https://x.com/DanHendrycks/status/1882481730671857815

    https://www.nytimes.com/2025/01/23/technology/ai-test-humanitys-last-exam.html?s=09



    23 min
  • Altman Expects a ‘Fast Take-off’, ‘Super-Agent’ Debuting Soon and DeepSeek R1 Out
    Jan 20 2025

    OpenAI looks set to debut their Operator system, and some leaks are out. At the same time Deepseek R1 releases some numbers, and Sam Altman says he might have been wrong before, and now anticipates a 'fast take-off'. Plus two papers to give you an idea of what a super-agent might be decent at doing, some more exclusive article analysis and much more. Who said anything else is happening today...

    80,000 Hours Channel: https://www.youtube.com/channel/UCafjal1QYJ3rb0Y9xZk1Ezg
    Spotify: https://open.spotify.com/show/2WzJwXWBDnn4iZ7odKwDib

    AI Insiders ($9!): https://www.patreon.com/AIExplained

    Chapters:
    00:00 - Introduction
    01:13 - Pro Cost and OpenAI Operator
    04:00 - Agent Benchmarks Being Targeted
    07:48 - Fast Take-off, Altman
    08:48 - Altman flip-flops
    10:02 - Deepseek R1 First Reaction

    Altman ‘100x expectations out of control’: https://x.com/sama/status/1881258443669172470
    OpenAI Operator Table: https://x.com/btibor91/status/1881285255266750564
    WebVoyager: https://arxiv.org/pdf/2401.13919
    OSWorld: https://arxiv.org/pdf/2404.07972
    Axios Exclusive 1 (SuperAgent): https://www.axios.com/2025/01/19/ai-superagent-openai-meta?s=09
    Axios Exclusive 2: https://www.axios.com/2025/01/18/biden-sullivan-ai-race-trump-china
    Deepseek R1 Numbers: https://x.com/deepseek_ai/status/1881318130334814301
    Does 1.5B outperform 3.5 Sonnet on Math?: https://x.com/reach_vb/status/1881319500089634954
    Deepseek R1 (deepseek-reasoner) Pricing: https://api-docs.deepseek.com/quick_start/pricing/
    Altman Fast Takeoff: https://x.com/tsarnick/status/1879100390840697191
    OpenAI Economic Blueprint: https://cdn.openai.com/global-affairs/ai-in-america-oai-economic-blueprint-20250113.pdf
    Target is Long-horizon Tasks: https://x.com/karinanguyen_/status/1879576037249667520
    Support Regulations: https://www.techemails.com/p/elon-musk-and-openai
    https://www.nytimes.com/2023/05/16/technology/openai-altman-artificial-intelligence-regulation.html
    Donation: https://qz.com/sam-altman-donate-million-zuckerberg-bezos-donald-trump-1851721035
    Amodei on Regulations by 2025: https://www.youtube.com/watch?v=ugvHCXCOmm4
    ‘Feel the AGI’: https://x.com/polynoamial?lang=en
    GPT-5 and o-series merger: https://x.com/sama/status/1880358749187240274
    o1 Thinks in Chinese: https://techcrunch.com/2025/01/14/openais-ai-reasoning-model-thinks-in-chinese-sometimes-and-no-one-really-knows-why/



    Non-hype Newsletter: https://signaltonoise.beehiiv.com/

    13 min
  • OpenAI Backtracks on Superintelligence + Altman Brings His Timeline Forward
    Jan 8 2025

    Sam Altman unexpectedly brings his timelines to AGI forward, while OpenAI backtrack on superintelligence. None of these changes were heralded, but they are significant. Plus the new year brings new assessments of the true capability of models to automate 'large swathes of the economy'. I'll give my prediction on that front for 2025, announce a new Simple Bench competition, and showcase Kling 1.6 vs Veo 2 vs Sora, and much more.

    wandb.me/simple-bench

    (Colab): https://colab.research.google.com/drive/1AVijcPnEkl8Gy_754XbRdG5m7Q5-9slg?usp=sharing


    TheAgentCompany Paper: https://arxiv.org/pdf/2412.14161v1

    Sam Altman Major Interview: https://www.bloomberg.com/features/2025-sam-altman-interview/?srnd=phx-ai

    OpenAI Agent Coming Jan 2025: https://www.theinformation.com/articles/why-openai-is-taking-so-long-to-launch-agents?rc=sy0ihq

    Altman Singularity: https://x.com/sama/status/1875603249472139576

    Altman Original Timeline: https://www.youtube.com/watch?v=7dCPytNTnjk&t=621s

    https://www.ft.com/content/34a7a082-e685-4e02-bca7-61ff89d99ed2

    OpenAI Original Emails: https://www.lesswrong.com/posts/5jjk4CDnj9tA7ugxr/openai-email-archives-from-musk-v-altman-and-openai-blog

    DeepMind Sky News 2014 Article: https://news.sky.com/story/google-buys-uk-intelligence-firm-deepmind-10419783

    Altman Blog Reflections: https://blog.samaltman.com/reflections

    OpenAI Changes Who Gets AGI: https://openai.com/index/why-our-structure-must-evolve-to-advance-our-mission/?s=09

    OpenAI 5 Levels: https://www.bloomberg.com/news/articles/2024-07-11/openai-sets-levels-to-track-progress-toward-superintelligent-ai

    Altman 2015: https://blog.samaltman.com/machine-intelligence-part-1

    OpenAI React to Anthropic: https://www.theinformation.com/articles/how-anthropic-got-inside-openais-head?rc=sy0ihq

    Microsoft $100B Definition: https://www.theinformation.com/articles/microsoft-and-openai-wrangle-over-terms-of-their-blockbuster-partnership?rc=sy0ihq
    Epoch Scramble for Task Benchmark: https://x.com/tamaybes/status/1876692639363612919

    GPQA Progress: https://epoch.ai/data/ai-benchmarking-dashboard

    Task Length Crucial for ARC-AGI: https://anokas.substack.com/p/llms-struggle-with-perception-not-reasoning-arcagi

    RL Environment Tweet: https://x.com/vedantmisra/status/1876327518157807990

    Jason Wei Talk: https://www.youtube.com/watch?v=yhpjpNXJDco

    Miles Brunda

    24 min
  • o3 - wow
    Dec 21 2024

    o3 isn’t one of the biggest developments in AI for 2+ years because it beats a particular benchmark. It is so because it demonstrates a reusable technique through which almost any benchmark could fall, and at short notice. I’ll cover all the highlights, benchmarks broken, and what comes next. Plus, the costs OpenAI didn’t want us to know, Genesis, ARC-AGI 2, Gemini-Thinking, and much more.


    FrontierMath: https://epoch.ai/frontiermath

    https://arxiv.org/pdf/2411.04872

    Chollet Statement: https://arcprize.org/blog/oai-o3-pub-breakthrough

    MLC Paper:

    https://www.scientificamerican.com/article/new-training-method-helps-ai-generalize-like-people-do/?utm_campaign=socialflow&utm_source=twitter&utm_medium=social

    AlphaCode 2: https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf

    Human Performance on ARC-AGI: https://arxiv.org/pdf/2409.01374v1

    Wei Tweet ‘3 months’: https://x.com/_jasonwei/status/1870184982007644614

    Deliberative Alignment Paper: https://openai.com/index/deliberative-alignment/

    Brown Safety Tweet: https://x.com/polynoamial/status/1870196476908834893

    Swe-Bench Verified: https://openai.com/index/introducing-swe-bench-verified/

    Amodei Prediction: https://x.com/OfirPress/status/1858567863788769518

    David Dohan: 16 hours https://x.com/dmdohan/status/1870171404093796638

    OpenAI Personal Writing: https://openai.com/index/learning-to-reason-with-llms/

    https://simple-bench.com/

    John Hallman Tweet: https://x.com/johnohallman/status/1870233375681945725


    00:00 - Introduction

    01:19 - What is o3?

    03:18 - FrontierMath

    05:15 - o4, o5

    06:03 - GPQA

    06:24 - Coding, Codeforces + SWE-verified, AlphaCode 2

    08:13 - 1st Caveat

    09:03 - Compositionality?

    10:16 - SimpleBench?

    13:11 - ARC-AGI, Chollet



    22 min
  • Never Browse Alone? - Gemini 2 Live and ChatGPT Vision
    Dec 12 2024

    The ‘Gemini 2 Era’ begins … with screen-sharing? But really, it’s a great free tool, more for satisfying curiosity than for bleeding-edge intelligence. I give you the benchmarks, the highlights and of course, the latest from OpenAI Advanced Voice Mode with Vision.

    Plus Deep Research in Gemini Advanced, Simple Bench updates, Santa and what might be, for some of you, a deflating admission from Google.


    00:00 - Introduction

    00:38 - Live Interaction

    03:43 - Gemini 2.0 Flash Benchmarks

    05:10 - Audio and Image Output

    06:38 - Project Mariner (+ WebVoyager Bench)

    08:49 - But Progress Slowing Down?

    10:43 - OpenAI Announcements + Games



    https://aistudio.google.com/live

    Gemini 2.0 Flash Benchmarks: https://deepmind.google/technologies/gemini/

    Project mariner: https://deepmind.google/technologies/project-mariner/

    WebVoyager: https://x.com/laurentsifre/status/1858918588683296875/photo/1

    Gemini Game play: https://www.youtube.com/watch?v=IKuGNHJBGsc

    Advanced Voice Mode OpenAI: https://www.youtube.com/watch?v=NIQDnWlwYyQ

    https://simple-bench.com/

    Claude Computer Use: https://docs.anthropic.com/en/docs/build-with-claude/computer-use

    Oriol Vinyals Interview: https://www.youtube.com/watch?v=78mEYaztGaw&t=687s



    14 min
  • Sora is Out, But is it a Distraction?
    Dec 10 2024

    After a 10-month wait, OpenAI have released Sora to paying users. With just a prompt it can generate videos of up to 20 seconds in lower resolutions, and 10 seconds at 1080p if you can fork out $200/month. I’ve tested it and read the system card. The user interface is quite beautiful, even if the videos themselves operate under entirely new rules of physics. But I can’t help wondering if OpenAI want us to focus on releases like this, rather than some quietly broken promises.



    80,000 hours Website, Podcast + Channel:

    https://80000hours.org/

    https://open.spotify.com/show/2WzJwXWBDnn4iZ7odKwDib

    https://www.youtube.com/@eightythousandhours/videos


    https://openai.com/sora/


    Sora Countries: https://help.openai.com/en/articles/10250692-sora-supported-countries

    Sora Credits: https://help.openai.com/en/articles/10245774-sora-billing-credits-faq

    https://runwayml.com/ and https://pika.art/home


    DeepMind Veo: https://deepmind.google/technologies/veo/


    Sam Altman Ads as Last Resort: https://www.windowscentral.com/software-apps/openai-could-chase-intrusive-ads-as-last-resort


    But OpenAI Considering Ads: https://www.inc.com/ben-sherry/is-openai-getting-into-the-advertising-business-the-company-is-sending-mixed-messages/91033533


    OpenAI Backtracks on Microsoft AGI Clause: https://www.ft.com/content/2c14b89c-f363-4c2a-9dfc-13023b6bce65


    As Microsoft Boast of Labor Savings: https://www.theinformation.com/articles/microsofts-new-sales-pitch-for-ai-spend-less-money-on-humans?rc=sy0ihq


    OpenAI Military Pivot: https://www.technologyreview.com/2024/12/04/1107897/openais-new-defense-contract-completes-its-military-pivot/


    Employees Have Doubts: https://www.washingtonpost.com/technology/2024/12/06/openai-anduril-employee-military-ai/?nid=top_pb_signin&arcId=KZIV7PLRHBCVNPAIAAAVUNRHIM&account_location=ONSITE_HEADER_ARTICLE



    16 min