Machine Learning

machinelearning
Machine Learning nsa 11 months ago 100%
What's In My Big Data? https://arxiv.org/abs/2310.20707

Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we propose What's In My Big Data? (WIMBD), a platform and a set of sixteen analyses that allow us to reveal and compare the contents of large text corpora. WIMBD builds on two basic capabilities -- count and search -- at scale, which allows us to analyze more than 35 terabytes on a standard compute node. We apply WIMBD to ten different corpora used to train popular language models, including C4, The Pile, and RedPajama. Our analysis uncovers several surprising and previously undocumented findings about these corpora, including the high prevalence of duplicate, synthetic, and low-quality content, personally identifiable information, toxic language, and benchmark contamination. For instance, we find that about 50% of the documents in RedPajama and LAION-2B-en are duplicates. In addition, several datasets used for benchmarking models trained on such corpora are contaminated with respect to important benchmarks, including the Winograd Schema Challenge and parts of GLUE and SuperGLUE. We open-source WIMBD's code and artifacts to provide a standard set of evaluations for new text-based corpora and to encourage more analyses and transparency around them: github.com/allenai/wimbd.
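
WIMBD's analyses are built from two primitives, counting and searching over a corpus; a minimal sketch of the counting side in plain Python (the function names and the map-reduce framing here are illustrative, not WIMBD's actual API):

```python
import hashlib
from collections import Counter
from typing import Iterable

def ngram_counts(documents: Iterable[str], n: int = 2) -> Counter:
    """Count word n-grams across a corpus, one document at a time.

    Illustrative stand-in for a map-reduce "count" primitive: each document
    is processed independently (map) and the Counters are merged (reduce),
    so the same logic can be sharded across workers for large corpora.
    """
    totals = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        totals.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return totals

def duplicate_documents(documents: Iterable[str]) -> Counter:
    """Count exact-duplicate documents by hashing their full text."""
    hashes = Counter(hashlib.sha256(d.encode("utf-8")).hexdigest() for d in documents)
    return Counter({h: c for h, c in hashes.items() if c > 1})

if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "the cat sat on the mat", "dogs bark"]
    print(ngram_counts(corpus, n=2).most_common(3))
    print(duplicate_documents(corpus))  # one document appears twice
```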

2
0
machinelearning
Machine Learning nsa 11 months ago 100%
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI https://arxiv.org/abs/2310.16787

> The race to train language models on vast, diverse, and inconsistently documented datasets has raised pressing concerns about the legal and ethical risks for practitioners. To remedy these practices threatening data transparency and understanding, we convene a multi-disciplinary effort between legal and machine learning experts to systematically audit and trace 1800+ text datasets. We develop tools and standards to trace the lineage of these datasets, from their source, creators, series of license conditions, properties, and subsequent use. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets, with closed datasets monopolizing important categories: lower resource languages, more creative tasks, richer topic variety, newer and more synthetic training data. This points to a deepening divide in the types of data that are made available under different license conditions, and heightened implications for jurisdictional legal interpretations of copyright and fair use. We also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 72%+ and error rates of 50%+. This points to a crisis in misattribution and informed use of the most popular datasets driving many recent breakthroughs. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire audit, with an interactive UI, the Data Provenance Explorer, which allows practitioners to trace and filter on data provenance for the most popular open source finetuning data collections: [www.dataprovenance.org](http://www.dataprovenance.org).

2
0
machinelearning
Machine Learning KingsmanVince 11 months ago 100%
MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks aclanthology.org

Vision and language models (VL) are known to exploit unrobust indicators in individual modalities (e.g., introduced by distributional biases) instead of focusing on relevant information in each modality. When a unimodal model achieves accuracy on a VL task similar to that of a multimodal one, this indicates that so-called unimodal collapse has occurred. However, accuracy-based tests fail to detect cases where, e.g., the model prediction is wrong even though the model used relevant information from a modality. Instead, we propose MM-SHAP, a performance-agnostic multimodality score based on Shapley values that reliably quantifies in which proportions a multimodal model uses individual modalities. We apply MM-SHAP in two ways: (1) to compare models for their average degree of multimodality, and (2) to measure for individual models the contribution of individual modalities for different tasks and datasets. Experiments with six VL models – LXMERT, CLIP and four ALBEF variants – on four VL tasks highlight that unimodal collapse can occur to different degrees and in different directions, contradicting the widespread assumption that unimodal collapse is one-sided. Based on our results, we recommend MM-SHAP for analysing multimodal tasks, to diagnose and guide progress towards multimodal integration. Code available at [https://github.com/Heidelberg-NLP/MM-SHAP](https://github.com/Heidelberg-NLP/MM-SHAP).

2
0
machinelearning
Machine Learning KingsmanVince 11 months ago 100%
Demystifying CLIP Data https://arxiv.org/abs/2309.16671

Contrastive Language-Image Pre-training (CLIP) is an approach that has advanced research and applications in computer vision, fueling modern recognition systems and generative models. We believe that the main ingredient to the success of CLIP is its data and not the model architecture or pre-training objective. However, CLIP only provides very limited information about its data and how it has been collected, leading to works that aim to reproduce CLIP's data by filtering with its model parameters. In this work, we intend to reveal CLIP's data curation approach and in our pursuit of making it open to the community introduce Metadata-Curated Language-Image Pre-training (MetaCLIP). MetaCLIP takes a raw data pool and metadata (derived from CLIP's concepts) and yields a balanced subset over the metadata distribution. Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to CommonCrawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. Scaling to 1B data, while maintaining the same training budget, attains 72.4%. Our observations hold across various model sizes, exemplified by ViT-H achieving 80.5%, without any bells-and-whistles. Curation code and training data distribution on metadata is made available at [https://github.com/facebookresearch/MetaCLIP](https://github.com/facebookresearch/MetaCLIP) .

2
0
machinelearning
Machine Learning nsa 11 months ago 100%
GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems https://arxiv.org/abs/2310.12397

> There has been considerable divergence of opinion on the reasoning abilities of Large Language Models (LLMs). While the initial optimism that reasoning might emerge automatically with scale has been tempered thanks to a slew of counterexamples, a widespread belief in their iterative self-critique capabilities persists. In this paper, we set out to systematically investigate the effectiveness of iterative prompting of LLMs in the context of Graph Coloring, a canonical NP-complete reasoning problem that is related to propositional satisfiability as well as practical problems like scheduling and allocation. We present a principled empirical study of the performance of GPT-4 in solving graph coloring instances or verifying the correctness of candidate colorings. In iterative modes, we experiment with the model critiquing its own answers and an external correct reasoner verifying proposed solutions. In both cases, we analyze whether the content of the criticisms actually affects bottom line performance. The study seems to indicate that (i) LLMs are bad at solving graph coloring instances (ii) they are no better at verifying a solution--and thus are not effective in iterative modes with LLMs critiquing LLM-generated solutions (iii) the correctness and content of the criticisms--whether by LLMs or external solvers--seems largely irrelevant to the performance of iterative prompting. We show that the observed increase in effectiveness is largely due to the correct solution being fortuitously present in the top-k completions of the prompt (and being recognized as such by an external verifier). Our results thus call into question claims about the self-critiquing capabilities of state of the art LLMs.
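
For intuition, here is a minimal sketch of the two iterative-prompting modes the paper compares, with a hypothetical `llm()` call and `parse_coloring()` helper standing in for GPT-4 and its output parsing; only the external graph-coloring verifier is fully concrete:

```python
from typing import Dict, List, Tuple

def verify_coloring(edges: List[Tuple[int, int]],
                    coloring: Dict[int, int]) -> List[str]:
    """Sound external verifier: list every violated edge (empty list = valid)."""
    return [f"vertices {u} and {v} both have color {coloring[u]}"
            for u, v in edges
            if coloring.get(u) is not None and coloring.get(u) == coloring.get(v)]

def llm(prompt: str) -> str:
    """Hypothetical stand-in for a GPT-4 call; not part of the paper's code."""
    raise NotImplementedError

def parse_coloring(answer: str) -> Dict[int, int]:
    """Hypothetical parser from the model's text answer to {vertex: color}."""
    raise NotImplementedError

def iterative_prompting(edges, n_colors, rounds=15, self_critique=False):
    prompt = f"Color this graph with {n_colors} colors. Edges: {edges}"
    for _ in range(rounds):
        coloring = parse_coloring(llm(prompt))
        if self_critique:                       # the LLM critiques its own answer
            critique = llm(f"List any mistakes in this coloring: {coloring}")
            ok = critique.strip().lower() == "none"
        else:                                   # a sound external verifier critiques it
            violations = verify_coloring(edges, coloring)
            ok, critique = not violations, "; ".join(violations)
        if ok:
            return coloring                     # correct only if the check was sound
        prompt += f"\nYour previous coloring was wrong: {critique}. Try again."
    return None
```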

2
0
machinelearning
Machine Learning KingsmanVince 11 months ago 100%
PaLI-3 Vision Language Models: Smaller, Faster, Stronger https://arxiv.org/abs/2310.09199

This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters, and achieve a new state-of-the-art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models.

2
3
machinelearning
Machine Learning KingsmanVince 11 months ago 100%
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning https://arxiv.org/abs/2310.09478

Large language models have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks including image description, visual question answering, and visual grounding, among others. The challenge is to use a single model for performing diverse vision-language tasks effectively with simple multi-modal instructions. Towards this objective, we introduce MiniGPT-v2, a model that can be treated as a unified interface for better handling various vision-language tasks. We propose using unique identifiers for different tasks when training the model. These identifiers enable our model to better distinguish each task instruction effortlessly and also improve the model's learning efficiency for each task. After the three-stage training, the experimental results show that MiniGPT-v2 achieves strong performance on many visual question-answering and visual grounding benchmarks compared to other vision-language generalist models. Our model and codes are available at [https://minigpt-v2.github.io/](https://minigpt-v2.github.io/)

3
0
machinelearning
Machine Learning KingsmanVince 12 months ago 100%
Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models https://openaccess.thecvf.com/content/CVPR2023/html/Goyal_Finetune_Like_You_Pretrain_Improved_Finetuning_of_Zero-Shot_Vision_Models_CVPR_2023_paper.html

Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works (Kumar et al., 2022; Wortsman et al., 2021) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for in-distribution (ID) and out-of-distribution (OOD) data. In this work, we show that a natural and simple approach of mimicking contrastive pretraining consistently outperforms alternative finetuning approaches. Specifically, we cast downstream class labels as text prompts and continue optimizing the contrastive loss between image embeddings and class-descriptive prompt embeddings (contrastive finetuning). Our method consistently outperforms baselines across 7 distribution shift, 6 transfer learning, and 3 few-shot learning benchmarks. On WILDS-iWILDCam, our proposed approach FLYP outperforms the top of the leaderboard by 2.3% ID and 2.7% OOD, giving the highest reported accuracy. Averaged across 7 OOD datasets (2 WILDS and 5 ImageNet associated shifts), FLYP gives gains of 4.2% OOD over standard finetuning and outperforms current state-of-the-art (LP-FT) by more than 1% both ID and OOD. Similarly, on 3 few-shot learning benchmarks, FLYP gives gains up to 4.6% over standard finetuning and 4.4% over the state-of-the-art. Thus we establish our proposed method of contrastive finetuning as a simple and intuitive state-of-the-art for supervised finetuning of image-text models like CLIP. Code is available at [https://github.com/locuslab/FLYP](https://github.com/locuslab/FLYP).
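
The core recipe, keep optimizing the CLIP-style contrastive loss but treat class-descriptive prompts as the captions, can be sketched in a few lines of PyTorch. This is a simplified sketch, not the authors' released code; `image_encoder` and `text_encoder` are assumed CLIP-like modules and `class_prompts` an assumed `[num_classes, seq_len]` tensor of tokenized prompts:

```python
import torch
import torch.nn.functional as F

def contrastive_finetune_loss(image_encoder, text_encoder, images, labels,
                              class_prompts, temperature: float = 0.07):
    """FLYP-style contrastive finetuning sketch: use "a photo of a {class}"
    prompts as the captions for labeled images and keep the symmetric
    CLIP objective. Assumes distinct labels within a batch, so the
    one-hot in-batch targets stay well defined."""
    img = F.normalize(image_encoder(images), dim=-1)                 # [B, D]
    txt = F.normalize(text_encoder(class_prompts[labels]), dim=-1)   # [B, D]
    logits = img @ txt.t() / temperature                             # [B, B]
    targets = torch.arange(images.size(0), device=images.device)
    # Symmetric InfoNCE: image -> text and text -> image.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```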

2
0
machinelearning
Machine Learning nsa 12 months ago 100%
A Long Way to Go: Investigating Length Correlations in RLHF https://arxiv.org/abs/2310.03716

> Great successes have been reported using Reinforcement Learning from Human Feedback (RLHF) to align large language models. Open-source preference datasets and reward models have enabled wider experimentation beyond generic chat settings, particularly to make systems more "helpful" for tasks like web question answering, summarization, and multi-turn dialogue. When optimizing for helpfulness, RLHF has been consistently observed to drive models to produce longer outputs. This paper demonstrates that optimizing for response length is a significant factor behind RLHF's reported improvements in these settings. First, we study the relationship between reward and length for reward models trained on three open-source preference datasets for helpfulness. Here, length correlates strongly with reward, and improvements in reward score are driven in large part by shifting the distribution over output lengths. We then explore interventions during both RL and reward model learning to see if we can achieve the same downstream improvements as RLHF without increasing length. While our interventions mitigate length increases, they aren't uniformly effective across settings. Furthermore, we find that even running RLHF with a reward based solely on length can reproduce most of the downstream improvements over the initial policy model, showing that reward models in these settings have a long way to go.
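
The paper's basic diagnostic, how strongly a reward model's scores track output length, is easy to reproduce on your own samples. A minimal sketch, assuming you already have responses and scalar reward scores (whitespace tokenization here is a simplification):

```python
import numpy as np

def length_reward_correlation(responses, rewards) -> float:
    """Pearson correlation between response length (whitespace tokens) and
    reward-model score; a value near 1 suggests the reward model is largely
    rewarding length rather than content."""
    lengths = np.array([len(r.split()) for r in responses], dtype=float)
    scores = np.array(rewards, dtype=float)
    return float(np.corrcoef(lengths, scores)[0, 1])

if __name__ == "__main__":
    # Toy illustration: rewards that grow with length correlate strongly.
    resp = ["short answer",
            "a somewhat longer answer with more words",
            "an even longer answer that keeps going and adds detail and caveats"]
    print(length_reward_correlation(resp, [0.1, 0.5, 0.9]))  # close to 1.0
```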

2
0
machinelearning
Machine Learning nsa 12 months ago 100%
Think before you speak: Training Language Models With Pause Tokens https://arxiv.org/abs/2310.02226

> Language models generate responses by producing a series of tokens in immediate succession: the $(K+1)^{th}$ token is an outcome of manipulating $K$ hidden vectors per layer, one vector per preceding token. What if instead we were to let the model manipulate say, $K+10$ hidden vectors, before it outputs the $(K+1)^{th}$ token? We operationalize this idea by performing training and inference on language models with a (learnable) *pause* token, a sequence of which is appended to the input prefix. We then delay extracting the model's outputs until the last pause token is seen, thereby allowing the model to process extra computation before committing to an answer. We empirically evaluate *pause-training* on decoder-only models of 1B and 130M parameters with causal pretraining on C4, and on downstream tasks covering reasoning, question-answering, general understanding and fact recall. Our main finding is that inference-time delays show gains when the model is both pre-trained and finetuned with delays. For the 1B model, we witness gains on 8 of 9 tasks, most prominently, a gain of 18% EM score on the QA task of SQuAD, 8% on CommonSenseQA and 1% accuracy on the reasoning task of GSM8k. Our work raises a range of conceptual and practical future research questions on making delayed next-token prediction a widely applicable new paradigm.

An interesting new paper exploring the addition of "pause" tokens for LMs. It seems to imply that the computational capacity of LMs is bounded by the number of tokens in their context, so simply adding "meaningless" tokens gives the model increased computational capacity, leading to increased performance on downstream tasks.
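
A minimal sketch of the inference-time half of the idea: append M learnable pause embeddings to the prompt and only read next-token logits after the last pause position. This is illustrative rather than the paper's implementation, and assumes a HuggingFace-style causal LM that accepts `inputs_embeds` and returns `.logits`:

```python
import torch
import torch.nn as nn

class PauseWrapper(nn.Module):
    """Append M learnable <pause> embeddings to the prompt and take the
    next-token logits from the position of the last pause token."""

    def __init__(self, model: nn.Module, embed: nn.Embedding, num_pause: int = 10):
        super().__init__()
        self.model = model    # decoder-only LM (HF-style, returns .logits)
        self.embed = embed    # the LM's own token embedding table
        self.pause = nn.Parameter(torch.randn(num_pause, embed.embedding_dim) * 0.02)

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        tok = self.embed(input_ids)                                   # [B, K, D]
        pause = self.pause.unsqueeze(0).expand(tok.size(0), -1, -1)   # [B, M, D]
        hidden = torch.cat([tok, pause], dim=1)                       # [B, K+M, D]
        logits = self.model(inputs_embeds=hidden).logits              # [B, K+M, V]
        # Predict the (K+1)-th token only after the extra pause positions.
        return logits[:, -1, :]
```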

3
1
machinelearning
Machine Learning KingsmanVince 12 months ago 100%
CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No https://arxiv.org/abs/2308.12213

Out-of-distribution (OOD) detection refers to training the model on an in-distribution (ID) dataset to classify whether the input images come from unknown classes. Considerable effort has been invested in designing various OOD detection methods based on either convolutional neural networks or transformers. However, zero-shot OOD detection methods driven by CLIP, which only require class names for ID, have received less attention. This paper presents a novel method, namely CLIP saying no (CLIPN), which empowers the logic of saying no within CLIP. Our key motivation is to equip CLIP with the capability of distinguishing OOD and ID samples using positive-semantic prompts and negation-semantic prompts. Specifically, we design a novel learnable no prompt and a no text encoder to capture negation semantics within images. Subsequently, we introduce two loss functions: the image-text binary-opposite loss and the text semantic-opposite loss, which we use to teach CLIPN to associate images with no prompts, thereby enabling it to identify unknown samples. Furthermore, we propose two threshold-free inference algorithms to perform OOD detection by utilizing negation semantics from no prompts and the text encoder. Experimental results on 9 benchmark datasets (3 ID datasets and 6 OOD datasets) for the OOD detection task demonstrate that CLIPN, based on ViT-B-16, outperforms 7 well-used algorithms by at least 2.34% and 11.64% in terms of AUROC and FPR95 for zero-shot OOD detection on ImageNet-1K. Our CLIPN can serve as a solid foundation for effectively leveraging CLIP in downstream OOD tasks. The code is available on [https://github.com/xmed-lab/CLIPN](https://github.com/xmed-lab/CLIPN) .

3
0
machinelearning
Machine Learning nsa 1 year ago 100%
Language Modeling Is Compression https://arxiv.org/abs/2309.10668

> It has long been established that predictive models can be transformed into lossless compressors and vice versa. Incidentally, in recent years, the machine learning community has focused on training increasingly large and powerful self-supervised (language) models. Since these large language models exhibit impressive predictive capabilities, they are well-positioned to be strong compressors. In this work, we advocate for viewing the prediction problem through the lens of compression and evaluate the compression capabilities of large (foundation) models. We show that large language models are powerful general-purpose predictors and that the compression viewpoint provides novel insights into scaling laws, tokenization, and in-context learning. For example, Chinchilla 70B, while trained primarily on text, compresses ImageNet patches to 43.4% and LibriSpeech samples to 16.4% of their raw size, beating domain-specific compressors like PNG (58.5%) or FLAC (30.3%), respectively. Finally, we show that the prediction-compression equivalence allows us to use any compressor (like gzip) to build a conditional generative model.

I wonder what a paper like this, especially given the title, does for the legal case regarding copyright and generative AI. Haven't had a chance to read the paper yet, so don't know if the findings are relevant to copyright.
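
The compression ratios quoted above are just compressed size divided by raw size; here is that bookkeeping with gzip as the compressor (the paper's LM-based compressor instead uses arithmetic coding driven by the model's token probabilities, which is omitted here):

```python
import gzip

def compression_ratio(data: bytes, compressor=gzip.compress) -> float:
    """Compressed size as a fraction of raw size (lower is better).
    The Chinchilla numbers in the abstract are this ratio, in percent."""
    return len(compressor(data)) / len(data)

if __name__ == "__main__":
    sample = ("Language modeling is compression. " * 200).encode("utf-8")
    print(f"gzip ratio on repetitive text: {compression_ratio(sample):.1%}")
```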

2
0
machinelearning
Machine Learning KingsmanVince 1 year ago 100%
Scaling Vision-Language Models with Sparse Mixture of Experts https://arxiv.org/abs/2303.07226

The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs). These models aim to bridge the gap between text and visual information, enabling a more comprehensive understanding of multimedia data. However, as these models become larger and more complex, they also become more challenging to train and deploy. One approach to addressing this challenge is the use of sparsely-gated mixture-of-experts (MoE) techniques, which divide the model into smaller, specialized sub-models that can jointly solve a task. In this paper, we explore the effectiveness of MoE in scaling vision-language models, demonstrating its potential to achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost. Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute performance when scaling VLMs. We hope our work will inspire further research into the use of MoE for scaling large-scale vision-language models and other multimodal machine learning applications.
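
For readers unfamiliar with the building block, a generic top-k sparsely-gated MoE layer looks roughly like this in PyTorch; this is a sketch of the general technique, not the paper's VL-MoE implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Top-k gated mixture-of-experts over simple MLP experts."""

    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: [tokens, dim]
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)    # route each token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():                                   # run only chosen experts
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# usage: SparseMoE(dim=512)(torch.randn(16, 512)).shape -> torch.Size([16, 512])
```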

3
0
machinelearning
Machine Learning KingsmanVince 1 year ago 100%
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks https://arxiv.org/abs/2307.16395

In recent times there has been a surge of multi-modal architectures based on Large Language Models, which leverage the zero shot generation capabilities of LLMs and project image embeddings into the text space and then use the auto-regressive capacity to solve tasks such as VQA, captioning, and image retrieval. We name these architectures "bridge-architectures" as they project from the image space to the text space. These models deviate from the traditional recipe of training transformer based multi-modal models, which involves using large-scale pre-training and complex multi-modal interactions through co or cross attention. However, the capabilities of bridge architectures have not been tested on complex visual reasoning tasks which require fine-grained analysis of the image. In this project, we investigate the performance of these bridge-architectures on the NLVR2 dataset, and compare it to state-of-the-art transformer based architectures. We first extend the traditional bridge architectures for the NLVR2 dataset, by adding object level features to facilitate fine-grained object reasoning. Our analysis shows that adding object level features to bridge architectures does not help, and that pre-training on multi-modal data is key for good performance on complex reasoning tasks such as NLVR2. We also demonstrate some initial results on a recent bridge-architecture, LLaVA, in the zero shot setting and analyze its performance.

3
0
machinelearning
Machine Learning KingsmanVince 1 year ago 100%
Foundational Models Defining a New Era in Vision: A Survey and Outlook https://arxiv.org/abs/2307.13721

Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge the gap between such modalities coupled with large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as foundational models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundational models, including typical architecture designs to combine different modalities (vision, text, audio, etc), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns; textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundational models in computer vision, including difficulties in their evaluations and benchmarking, gaps in their real-world understanding, limitations of their contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively. A comprehensive list of foundational models studied in this work is available at [https://github.com/awaisrauf/Awesome-CV-Foundational-Models](https://github.com/awaisrauf/Awesome-CV-Foundational-Models) .

3
0
machinelearning
Machine Learning KingsmanVince 1 year ago 100%
Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training aclanthology.org

Multilingual Vision-Language Pre-training (VLP) is a promising but challenging topic due to the lack of large-scale multilingual image-text pairs. Existing works address the problem by translating English data into other languages, which is intuitive and the generated data is usually limited in form and scale. In this paper, we explore a more practical and scalable setting: weakly supervised multilingual VLP with only English image-text pairs and multilingual text corpora. We argue that the universal multilingual representation learned from texts allows the cross-modal interaction learned in English to be transferable to other languages. To this end, we propose a framework to effectively unify cross-lingual and cross-modal pre-training. For unified modeling on different data, we design an architecture with flexible modules to learn different interactions. Moreover, two unified tasks are introduced to efficiently guide the unified cross-lingual cross-modal learning. Extensive experiments demonstrate that our pre-trained model learns universal multilingual multimodal representations, allowing effective cross-lingual transfer on multimodal tasks. Code and models are available at [https://github.com/FudanDISC/weakly-supervised-mVLP](https://github.com/FudanDISC/weakly-supervised-mVLP).

5
1
machinelearning
Machine Learning KingsmanVince 1 year ago 100%
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks https://arxiv.org/abs/2303.16839

The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and contrastive tasks, are nontrivial to accommodate in one architecture, and further need adaptations for downstream tasks. We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning these disparate vision-language tasks. This is done with a simple model, called MaMMUT. It consists of a single vision encoder and a text decoder, and is able to accommodate contrastive and generative learning by a novel two-pass approach on the text decoder. We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks. Furthermore, the same architecture enables straightforward extensions to open-vocabulary object detection and video-language tasks. The model tackles a diverse range of tasks, while being modest in capacity. Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models. It shows very competitive results on VQA and Video Captioning, especially considering its capacity. Ablations confirm the flexibility and advantages of our approach.

4
1
machinelearning
Machine Learning KingsmanVince 1 year ago 100%
Vision Language Transformers: A Survey https://arxiv.org/abs/2307.03254

Vision language tasks, such as answering questions about or generating captions that describe an image, are difficult tasks for computers to perform. A relatively recent body of research has adapted the pretrained transformer architecture introduced in Vaswani et al. (2017) to vision language modeling. Transformer models have greatly improved performance and versatility over previous vision language models. They do so by pretraining models on large generic datasets and transferring their learning to new tasks with minor changes in architecture and parameter values. This type of transfer learning has become the standard modeling practice in both natural language processing and computer vision. Vision language transformers offer the promise of producing similar advancements in tasks which require both vision and language. In this paper, we provide a broad synthesis of the currently available research on vision language transformer models and offer some analysis of their strengths, limitations and some open questions that remain.

3
0
machinelearning
Machine Learning AsAnAILanguageModel 1 year ago 100%
SeamlessM4T: Multimodal Model for Speech Translation

Meta releases SeamlessM4T, a general multilingual speech/text model claimed to surpass OpenAI's Whisper. It's available on [github](https://github.com/facebookresearch/seamless_communication) and everything can be used for free in a non-commercial setting.

#### Model Features:

- Automatic speech recognition for ~100 languages.
- Speech-to-text translation for ~100 input/output languages.
- Speech-to-speech translation for ~100 input languages and 35 output languages.
- Text-to-text and text-to-speech translation for nearly 100 languages.

#### Dataset:

- SeamlessAlign: Open multimodal translation dataset with 270,000 hours of speech and text alignments.

#### Technical Insights:

- Utilizes a multilingual and multimodal text embedding space for 200 languages.
- Applied a teacher-student approach to extend this embedding space to the speech modality, covering 36 languages.
- Mining performed on publicly available repositories resulted in 443,000 hours of speech aligned with texts and 29,000 hours of speech-to-speech alignments.

#### Toxicity Filter:

- The model identifies toxic words from speech inputs/outputs and filters unbalanced toxicity in training data.
- The demo detects toxicity in both input and output. If toxicity is only detected in the output, a warning is included and the output is not shown.
- Given how impaired llama2-chat has been due to these kinds of filters, it's unclear how useful these models are in a general setting.

2
0
machinelearning
Machine Learning AsAnAILanguageModel 1 year ago 100%
Hugging Face Releases IDEFICS: An Open-Access 80B Visual Language Model Replicating DeepMind's Flamingo

Hugging Face released IDEFICS, an 80B open-access visual language model replicating DeepMind's unreleased [Flamingo](https://www.deepmind.com/blog/tackling-multiple-tasks-with-a-single-visual-language-model). Built entirely on public data, it's the first of its size available openly. Part of its training utilized OBELICS, a dataset with 141M web pages, 353M images, and 115B text tokens from Common Crawl.

- [Announcement](https://huggingface.co/blog/idefics)
- [Demo](https://huggingface.co/spaces/HuggingFaceM4/idefics_playground)
- [Models](https://huggingface.co/HuggingFaceM4/idefics-80b-instruct)
- [OBELICS Dataset](https://huggingface.co/datasets/HuggingFaceM4/OBELICS)
- [OBELICS Paper](https://arxiv.org/abs/2306.16527)
- [Lessons](https://github.com/huggingface/m4-logs/blob/master/memos/README.md)

3
0
machinelearning
Machine Learning KingsmanVince 1 year ago 100%
VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use https://arxiv.org/abs/2308.06595

We introduce VisIT-Bench (Visual InsTruction Benchmark), a benchmark for evaluation of instruction-following vision-language models for real-world use. Our starting point is curating 70 'instruction families' that we envision instruction tuned vision-language models should be able to address. Extending beyond evaluations like VQAv2 and COCO, tasks range from basic recognition to game playing and creative generation. Following curation, our dataset comprises 592 test queries, each with a human-authored instruction-conditioned caption. These descriptions surface instruction-specific factors, e.g., for an instruction asking about the accessibility of a storefront for wheelchair users, the instruction-conditioned caption describes ramps/potential obstacles. These descriptions enable 1) collecting human-verified reference outputs for each instance; and 2) automatic evaluation of candidate multimodal generations using a text-only LLM, aligning with human judgment. We quantify quality gaps between models and references using both human and automatic evaluations; e.g., the top-performing instruction-following model wins against the GPT-4 reference in just 27% of the comparisons. VisIT-Bench is dynamic: to participate, practitioners simply submit their model's response on the project website. Data, code and the leaderboard are available at visit-bench.github.io.

3
0
machinelearning
Machine Learning KingsmanVince 1 year ago 100%
Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback https://arxiv.org/abs/2307.16039

A key technology for the development of large language models (LLMs) involves instruction tuning that helps align the models' responses with human expectations to realize impressive learning abilities. Two major approaches for instruction tuning are supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), which are currently applied to produce the best commercial LLMs (e.g., ChatGPT). To improve the accessibility of LLMs for research and development efforts, various instruction-tuned open-source LLMs have also been introduced recently, e.g., Alpaca, Vicuna, to name a few. However, existing open-source LLMs have only been instruction-tuned for English and a few popular languages, thus hindering their impacts and accessibility to many other languages in the world. Among the few very recent works to explore instruction tuning for LLMs in multiple languages, SFT has been used as the only approach to instruction-tune LLMs for multiple languages. This has left a significant gap for fine-tuned LLMs based on RLHF in diverse languages and raised important questions on how RLHF can boost the performance of multilingual instruction tuning. To overcome this issue, we present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages. Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research. We also present benchmark datasets to enable the evaluation of generative LLMs in multiple languages. Our experiments demonstrate the advantages of RLHF for multilingual instruction over SFT for different base models and datasets. Our framework and resources are released at [https://github.com/nlp-uoregon/Okapi](https://github.com/nlp-uoregon/Okapi).

2
1
machinelearning
Machine Learning Lenguador 1 year ago 66%
Real-Time Radiance Field Rendering huggingface.co

Achieves SOTA on quality AND on training time AND renders in real-time (60fps+)

1
1
machinelearning
Machine Learning KingsmanVince 1 year ago 100%
RecycleGPT: An Autoregressive Language Model with Recyclable Module https://arxiv.org/abs/2308.03421

Existing large language models have to run K times to generate a sequence of K tokens. In this paper, we present RecycleGPT, a generative language model with fast decoding speed by recycling pre-generated model states without running the whole model in multiple steps. Our approach relies on the observation that adjacent tokens in a sequence usually have strong correlations and the next token in a sequence can be reasonably guessed or inferred based on the preceding ones. Experiments and analysis demonstrate the effectiveness of our approach in lowering inference latency, achieving up to 1.4x speedup while preserving high performance.

4
0
machinelearning
Machine Learning KingsmanVince 1 year ago 100%
Multitask Pretraining with Structured Knowledge for Text-to-SQL Generation aclanthology.org

Many machine learning-based low-code or no-code applications involve generating code that interacts with structured knowledge. For example, one of the most studied tasks in this area is generating SQL code from a natural language statement. Prior work shows that incorporating context information from the database schema, such as table and column names, is beneficial to model performance on this task. In this work we present a large pretraining dataset and strategy for learning representations of text, tables, and SQL code that leverages the entire context of the problem. Specifically, we build on existing encoder-decoder architecture by introducing a multitask pretraining framework that complements the unique attributes of our diverse pretraining data. Our work represents the first study on large-scale pretraining of encoder-decoder models for interacting with structured knowledge, and offers a new state-of-the-art foundation model in text-to-SQL generation. We validate our approach with experiments on two SQL tasks, showing improvement over existing methods, including a 1.7 and 2.2 percentage point improvement over prior state-of-the-arts on Spider and CoSQL.

2
0
machinelearning
Machine Learning nsa 1 year ago 100%
Retentive Network: A Successor to Transformer for Large Language Models https://arxiv.org/abs/2307.08621

This is an exciting new paper that replaces attention in the Transformer architecture with a set of decomposable matrix operations that retain the modeling capacity of Transformer models, while allowing parallel training and efficient RNN-like inference without the use of attention (it doesn't use a softmax). It achieves lower perplexity than Transformer models with more than 2B parameters and requires much lower GPU memory and FLOPs compared to Transformers for inference.

Abstract:

> In this work, we propose Retentive Network (RetNet) as a foundation architecture for large language models, simultaneously achieving training parallelism, low-cost inference, and good performance. We theoretically derive the connection between recurrence and attention. Then we propose the retention mechanism for sequence modeling, which supports three computation paradigms, i.e., parallel, recurrent, and chunkwise recurrent. Specifically, the parallel representation allows for training parallelism. The recurrent representation enables low-cost $O(1)$ inference, which improves decoding throughput, latency, and GPU memory without sacrificing performance. The chunkwise recurrent representation facilitates efficient long-sequence modeling with linear complexity, where each chunk is encoded parallelly while recurrently summarizing the chunks. Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference. The intriguing properties make RetNet a strong successor to Transformer for large language models. Code will be available at [https://aka.ms/retnet](https://aka.ms/retnet).
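
The recurrent form of retention that enables the O(1)-per-step inference is compact enough to sketch directly; a simplified single-head version of the paper's recurrence, ignoring the rotation/decay parameterization details and normalization:

```python
import torch

def recurrent_retention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                        gamma: float = 0.96875) -> torch.Tensor:
    """Recurrent (constant-state) form of retention for one head:
        S_n = gamma * S_{n-1} + k_n^T v_n,   o_n = q_n S_n
    q, k, v: [seq_len, d]; returns outputs of shape [seq_len, d].
    gamma is a fixed per-head decay (a typical value is shown)."""
    seq_len, d = q.shape
    state = torch.zeros(d, d)
    outputs = []
    for n in range(seq_len):
        state = gamma * state + torch.outer(k[n], v[n])  # update the running state
        outputs.append(q[n] @ state)                     # read it out with the query
    return torch.stack(outputs)

# The parallel (training) form computes the same thing as
#   out = ((q @ k.T) * D) @ v,  with  D[n, m] = gamma**(n - m) for m <= n, else 0.
```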

5
4
machinelearning
Machine Learning KingsmanVince 1 year ago 100%
mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs https://arxiv.org/abs/2307.06930

Modular vision-language models (Vision-LLMs) align pretrained image encoders with (pretrained) large language models (LLMs), representing a computationally much more efficient alternative to end-to-end training of large vision-language models from scratch, which is prohibitively expensive for most. Vision-LLMs instead post-hoc condition LLMs to 'understand' the output of an image encoder. With the abundance of readily available high-quality English image-text data as well as monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models, trained on limited multilingual image data supplemented with text-only multilingual corpora. In this work, we present mBLIP, the first multilingual Vision-LLM, which we obtain in a computationally efficient manner -- on consumer hardware using only a few million training examples -- by leveraging a pretrained multilingual LLM. To this end, we *re-align* an image encoder previously tuned to an English LLM to a new, multilingual LLM -- for this, we leverage multilingual data from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark, mBLIP yields results competitive with state-of-the-art models. Moreover, in image captioning on XM3600, mBLIP (zero-shot) even outperforms PaLI-X (a model with 55B parameters). Compared to these very large multilingual vision-language models trained from scratch, we obtain mBLIP by training orders of magnitude fewer parameters on magnitudes less data. We release our model and code at [https://github.com/gregor-ge/mBLIP](https://github.com/gregor-ge/mBLIP).

5
0
machinelearning
Machine Learning KingsmanVince 1 year ago 100%
“Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors aclanthology.org

Deep neural networks (DNNs) are often used for text classification due to their high accuracy. However, DNNs can be computationally intensive, requiring millions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize, and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that’s easy, lightweight, and universal in text classification: a combination of a simple compressor like gzip with a k-nearest-neighbor classifier. Without any training parameters, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also excels in the few-shot setting, where labeled data are too scarce to train DNNs effectively.
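
The method is simple enough to restate in full: compute a gzip-based normalized compression distance between a test document and every training document, then take a majority vote over the k nearest neighbors. A self-contained sketch (toy data; the paper evaluates on standard benchmarks):

```python
import gzip
from collections import Counter

def clen(s: str) -> int:
    """Length of the gzip-compressed UTF-8 bytes of s."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings."""
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(test_doc: str, train, k: int = 3) -> str:
    """k-NN over NCD: majority label among the k closest training documents.
    `train` is a list of (text, label) pairs."""
    nearest = sorted(train, key=lambda pair: ncd(test_doc, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

if __name__ == "__main__":
    train = [("the match ended with a late winning goal", "sports"),
             ("the striker scored twice in the final", "sports"),
             ("the central bank raised interest rates", "finance"),
             ("markets rallied after the earnings report", "finance")]
    print(classify("the keeper saved a penalty in the final", train))
```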

3
0
machinelearning
Machine Learning Dienervent 1 year ago 100%
Introducing Keras Core: Keras for TensorFlow, JAX, and PyTorch. https://keras.io/keras_core/announcement/

Keras 3.0 now works with TensorFlow, JAX, and PyTorch. It also introduces a bunch of new features. Check it out.
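
A minimal taste of the multi-backend workflow from the announcement: choose a backend via an environment variable before importing, then write ordinary Keras code. This sketch follows the Keras Core announcement; the package name and details may have shifted as Keras Core became Keras 3:

```python
import os
os.environ["KERAS_BACKEND"] = "jax"   # or "tensorflow" / "torch"; set before import

import numpy as np
import keras_core as keras            # shipped as plain `keras` from Keras 3 onward

# The same model definition runs unchanged on any of the three backends.
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Random data just to show the fit/evaluate loop working end to end.
x = np.random.rand(256, 20).astype("float32")
y = np.random.randint(0, 10, size=(256,))
model.fit(x, y, epochs=1, batch_size=32)
```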

1
0
machinelearning
Machine Learning KingsmanVince 1 year ago 100%
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time https://proceedings.mlr.press/v162/wortsman22a.html

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs—we call the results “model soups.” When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90.94% top-1 accuracy on ImageNet, achieved a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at [https://github.com/mlfoundations/model-soups](https://github.com/mlfoundations/model-soups).
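
The "uniform soup" itself is just an element-wise average of fine-tuned checkpoints that share an architecture; a PyTorch sketch (the paper's "greedy soup" additionally keeps a checkpoint only if held-out accuracy improves):

```python
import torch

def uniform_soup(checkpoint_paths) -> dict:
    """Average the parameters of several fine-tuned models of the same
    architecture into a single state_dict (the "uniform soup").
    Assumes each path holds a plain state_dict of tensors."""
    soup = None
    for path in checkpoint_paths:
        state = torch.load(path, map_location="cpu")
        if soup is None:
            soup = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in soup:
                soup[k] += state[k].float()
    return {k: v / len(checkpoint_paths) for k, v in soup.items()}

# usage: model.load_state_dict(uniform_soup(["run1.pt", "run2.pt", "run3.pt"]))
```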

1
0
machinelearning
Machine Learning KingsmanVince 1 year ago 100%
Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages https://arxiv.org/abs/2306.16774

Vision-Language Pre-training (VLP) has advanced the performance of many vision-language tasks, such as image-text retrieval, visual entailment, and visual reasoning. The pre-training mostly utilizes lexical databases and image queries in English. Previous work has demonstrated that the pre-training in English does not transfer well to other languages in a zero-shot setting. However, multilingual pre-trained language models (MPLM) have excelled at a variety of single-modal language tasks. In this paper, we propose a simple yet efficient approach to adapt VLP to unseen languages using MPLM. We utilize a cross-lingual contextualized token embeddings alignment approach to train text encoders for non-English languages. Our approach does not require image input and primarily uses machine translation, eliminating the need for target language data. Our evaluation across three distinct tasks (image-text retrieval, visual entailment, and natural language visual reasoning) demonstrates that this approach outperforms the state-of-the-art multilingual vision-language models without requiring large parallel corpora. Our code is available at [https://github.com/Yasminekaroui/CliCoTea](https://github.com/Yasminekaroui/CliCoTea).

1
0
machinelearning
Machine Learning rf5 1 year ago 100%
Voice Conversion With Just Nearest Neighbors https://arxiv.org/abs/2305.18975

TL;DR: want to convert your voice to another person's voice? Or even to a whisper? Or a dog barking? Or to any other random speech clip? Give our new method a try: [https://bshall.github.io/knn-vc](https://bshall.github.io/knn-vc)

Longer version: our research team kept seeing new voice conversion methods getting more complex and becoming harder to reproduce. So, we tried to see if we could make a top-tier voice conversion model that was extremely simple. So, we made kNN-VC, where our entire conversion model is just k-nearest neighbors on WavLM features. And, it turns out, this does as well if not better than very complex any-to-any voice conversion methods. What's more, since k-nearest neighbors has no parameters, we can use anything as the reference, even clips of dogs barking, music, or references from other languages.

I hope you enjoy our research! We provide a quick-start notebook, code, audio samples, and vocoder checkpoints at [https://bshall.github.io/knn-vc/](https://bshall.github.io/knn-vc/)
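
The conversion step really is just nearest neighbors: replace each source frame's WavLM feature with the mean of its k closest frames from the reference speaker, then vocode the result. A simplified sketch of the matching step, with feature extraction and the vocoder assumed to come from the linked repo:

```python
import torch
import torch.nn.functional as F

def knn_convert(source_feats: torch.Tensor,
                reference_feats: torch.Tensor,
                k: int = 4) -> torch.Tensor:
    """Replace each source frame with the average of its k nearest
    reference frames (cosine distance), as in kNN-VC's matching step.

    source_feats:    [T_src, D]  WavLM features of the utterance to convert
    reference_feats: [T_ref, D]  WavLM features of the target voice
    returns:         [T_src, D]  converted features, ready for a vocoder
    """
    src = F.normalize(source_feats, dim=-1)
    ref = F.normalize(reference_feats, dim=-1)
    dists = 1 - src @ ref.t()                          # cosine distance [T_src, T_ref]
    idx = dists.topk(k, dim=-1, largest=False).indices # k nearest reference frames
    return reference_feats[idx].mean(dim=1)            # average the k matches
```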

1
0
machinelearning
Machine Learning KingsmanVince 1 year ago 100%
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks https://arxiv.org/abs/2305.11175

Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. It's noteworthy that, with a generalist LLM-based framework, our model can achieve over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The demo shall be released based on this https URL. The code shall be released at this https URL.

1
0
machinelearning
Machine Learning nsa 1 year ago 100%
Inverse Scaling: When Bigger Isn't Better https://arxiv.org/abs/2306.09479

Abstract:

> Work on scaling laws has found that large language models (LMs) show predictable improvements to overall loss with increased scale (model size, training data, and compute). Here, we present evidence for the claim that LMs may show inverse scaling, or worse task performance with increased scale, e.g., due to flaws in the training objective and data. We present empirical evidence of inverse scaling on 11 datasets collected by running a public contest, the Inverse Scaling Prize, with a substantial prize pool. Through analysis of the datasets, along with other examples found in the literature, we identify four potential causes of inverse scaling: (i) preference to repeat memorized sequences over following in-context instructions, (ii) imitation of undesirable patterns in the training data, (iii) tasks containing an easy distractor task which LMs could focus on, rather than the harder real task, and (iv) correct but misleading few-shot demonstrations of the task. We release the winning datasets at [https://inversescaling.com/data](https://inversescaling.com/data) to allow for further investigation of inverse scaling. Our tasks have helped drive the discovery of U-shaped and inverted-U scaling trends, where an initial trend reverses, suggesting that scaling trends are less reliable at predicting the behavior of larger-scale models than previously understood. Overall, our results suggest that there are tasks for which increased model scale alone may not lead to progress, and that more careful thought needs to go into the data and objectives for training language models.

1
0
machinelearning
Machine Learning Deliverator 1 year ago 100%
Machine Learning Beginner Info/Resources

### MOOCs ###

Nowadays, there are a couple of really excellent online lectures to get you started. The list is too long to include them all. Every one of the major MOOC sites offers not only one but several good Machine Learning classes, so please check Coursera, edX, and Udacity yourself to see which ones are interesting to you. However, there are a few that stand out, either because they're very popular or are done by people who are famous for their work in ML. Roughly in order from easiest to hardest, those are:

* Andrew Ng's ML-Class at Coursera: Focused on application of techniques. Easy to understand, but mathematically very shallow. Good for beginners! [https://www.coursera.org/course/ml](https://www.coursera.org/course/ml)
* Hastie/Tibshirani's Statistical Learning [https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about](https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about)
* Yaser Abu-Mostafa's Learning From Data: Focuses a lot more on theory, but is also doable for beginners [https://work.caltech.edu/telecourse.html](https://work.caltech.edu/telecourse.html)
* Geoff Hinton's Neural Nets for Machine Learning: As the title says, this is almost exclusively about Neural Networks. [https://www.coursera.org/course/neuralnets](https://www.coursera.org/course/neuralnets)
* Hugo Larochelle's Neural Net lectures: Again mostly on Neural Nets, with a focus on Deep Learning [http://www.youtube.com/playlist?list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH](http://www.youtube.com/playlist?list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH)
* Daphne Koller's Probabilistic Graphical Models: A very challenging class, but it has a lot of good material that few of the other MOOCs here will cover [https://www.coursera.org/course/pgm](https://www.coursera.org/course/pgm)

### Books ###

The most often recommended textbooks on general Machine Learning are (in no particular order):

* Hastie/Tibshirani/Friedman's Elements of Statistical Learning FREE [http://statweb.stanford.edu/%7Etibs/ElemStatLearn/](http://statweb.stanford.edu/%7Etibs/ElemStatLearn/)
* Barber's Bayesian Reasoning and Machine Learning FREE [http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.HomePage](http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.HomePage)
* MacKay's Information Theory, Inference and Learning Algorithms FREE [http://www.inference.phy.cam.ac.uk/itila/book.html](http://www.inference.phy.cam.ac.uk/itila/book.html)
* Goodfellow/Bengio/Courville's Deep Learning FREE [http://www.deeplearningbook.org/](http://www.deeplearningbook.org/)
* Nielsen's Neural Networks and Deep Learning FREE [http://neuralnetworksanddeeplearning.com/](http://neuralnetworksanddeeplearning.com/)
* Graves' Supervised Sequence Labelling with Recurrent Neural Networks FREE [http://www.cs.toronto.edu/%7Egraves/preprint.pdf](http://www.cs.toronto.edu/%7Egraves/preprint.pdf)
* Sutton/Barto's Reinforcement Learning: An Introduction; 2nd Edition FREE [https://www.dropbox.com/s/7jl597kllvtm50r/book2015april.pdf](https://www.dropbox.com/s/7jl597kllvtm50r/book2015april.pdf)

Note that these books delve deep into math, and might be a bit heavy for complete beginners. If you don't care so much about derivations or how exactly the methods work but would rather just apply them, then the following are good practical intros:

* An Introduction to Statistical Learning FREE [http://www-bcf.usc.edu/%7Egareth/ISL/](http://www-bcf.usc.edu/%7Egareth/ISL/)
* Probabilistic Programming and Bayesian Methods for Hackers FREE [http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Prologue/Prologue.ipynb](http://nbviewer.ipython.org/github/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/blob/master/Prologue/Prologue.ipynb)

There is of course a whole plethora of books that only cover specific subjects, as well as many books about surrounding fields in Math. A very good list has been collected by /u/ilsunil here.

### Deep Learning Resources ###

* Karpathy's Stanford CS231n: Convolutional Neural Networks for Visual Recognition (Lecture Notes) [http://cs231n.github.io/](http://cs231n.github.io/)
* Video lectures for Stanford CS231n: Convolutional Neural Networks for Visual Recognition [https://www.youtube.com/playlist?list=PLlJy-eBtNFt6EuMxFYRiNRS07MCWN5UIA](https://www.youtube.com/playlist?list=PLlJy-eBtNFt6EuMxFYRiNRS07MCWN5UIA)
* Silver's Reinforcement Learning Lectures [https://www.youtube.com/watch?v=2pWv7GOvuf0](https://www.youtube.com/watch?v=2pWv7GOvuf0)
* Colah's Informational Blog [http://colah.github.io/](http://colah.github.io/)
* Bruna's UC Berkeley Stat212b: Topics Course on Deep Learning [https://joanbruna.github.io/stat212b/](https://joanbruna.github.io/stat212b/)
* Overview of Neural Network Architectures [http://www.asimovinstitute.org/neural-network-zoo/](http://www.asimovinstitute.org/neural-network-zoo/)

### Math Resources ###

* Strang's Linear Algebra Lectures [https://www.youtube.com/watch?v=ZK3O402wf1c](https://www.youtube.com/watch?v=ZK3O402wf1c)
* Kolter/Do's Linear Algebra Review and Reference Notes [http://cs229.stanford.edu/section/cs229-linalg.pdf](http://cs229.stanford.edu/section/cs229-linalg.pdf)
* Calculus 1 [https://www.edx.org/course/calculus-1a-differentiation-mitx-18-01-1x](https://www.edx.org/course/calculus-1a-differentiation-mitx-18-01-1x)
* Introduction to Probability [https://www.edx.org/course/introduction-probability-science-mitx-6-041x-1](https://www.edx.org/course/introduction-probability-science-mitx-6-041x-1)

### Programming Languages and Software ###

In general, the most used languages in ML are probably Python, R and Matlab (with the latter losing more and more ground to the former two). Which one suits you better depends wholly on your personal taste. For R, a lot of functionality is either already in the standard library or can be found through various packages in CRAN. For Python, NumPy/SciPy are a must. From there, Scikit-Learn covers a broad range of ML methods (a minimal scikit-learn example is included at the end of this post).

If you just want to play around a bit and don't do much programming yourself, then things like Visions of Chaos, WEKA, KNIME or RapidMiner might be to your liking. Word of caution: a lot of people in this subreddit are very critical of WEKA, so even though it's listed here, it is probably not a good tool for doing anything more than just playing around a bit. A more detailed discussion can be found here.

### Deep Learning Software, GPUs and Examples ###

There are a number of modern deep learning toolkits you can utilize to implement your models. Below, you will find some of the more popular toolkits. This is by no means an exhaustive list. Generally speaking, you should utilize whatever GPU has the most memory, highest clock speed, and most CUDA cores available to you. This was the NVIDIA Titan X from the previous generation. These frameworks are all very close in computation speed, so you should choose the one you prefer in terms of syntax.

**Theano** is a Python-based deep learning toolkit developed by the Montreal Institute for Learning Algorithms, a cutting-edge deep learning academic research center and home of many users of this forum. It has a large number of tutorials ranging from beginner to cutting-edge research.

**Torch** is a LuaJIT-based scientific computing framework developed by Facebook Artificial Intelligence Research (FAIR) and is also in use at Twitter Cortex. There is the Torch blog, which contains examples of the Torch framework in action.

**TensorFlow** is a Python deep learning framework developed by Google Brain and in use at Google Brain and DeepMind. It is the newest framework around. Some TensorFlow examples may be found here. Do not ask questions on the Google Groups; ask them on StackOverflow.

**Neon** is a Python-based deep learning framework built around a custom and highly performant CUDA compiler, Maxas, by NervanaSys.

**Caffe** is an easy-to-use, beginner-friendly deep learning framework. It provides many pretrained models and is built around a protobuf format for implementing neural networks.

**Keras** can be used to wrap Theano or TensorFlow for ease of use.

### Datasets and Challenges for Beginners ###

There are a lot of good datasets here to try out your new Machine Learning skills.

* Kaggle has a lot of challenges to sink your teeth into. Some even offer prize money! [http://www.kaggle.com/](http://www.kaggle.com/)
* The UCI Machine Learning Repository is a collection of a lot of good datasets [http://archive.ics.uci.edu/ml/](http://archive.ics.uci.edu/ml/)
* [http://blog.mortardata.com/post/67652898761/6-dataset-lists-curated-by-data-scientists](http://blog.mortardata.com/post/67652898761/6-dataset-lists-curated-by-data-scientists) lists some more datasets
* Here is a very extensive list of large-scale datasets of all kinds. [http://www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-public](http://www.quora.com/Data/Where-can-I-find-large-datasets-open-to-the-public)
* Another dataset list [http://www.datawrangling.com/some-datasets-available-on-the-web](http://www.datawrangling.com/some-datasets-available-on-the-web)

### Research Oriented Datasets ###

In many papers, you will find that a few datasets are the most common. Below, you can find the links to some of them.

* MNIST: A small handwritten-digit dataset that is often used as a sanity check in modern research [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/)
* SVHN: Similar to MNIST, but with color images of house numbers. A sanity check in most cases. [http://ufldl.stanford.edu/housenumbers/](http://ufldl.stanford.edu/housenumbers/)
* CIFAR-10/100: CIFAR-10 and CIFAR-100 are two natural color image datasets that are often used with convolutional neural networks for image classification. [https://www.cs.toronto.edu/%7Ekriz/cifar.html](https://www.cs.toronto.edu/%7Ekriz/cifar.html)

### Communities ###

* [http://www.datatau.com/](http://www.datatau.com/) is a data-science-centric Hacker News
* [http://metaoptimize.com/qa/](http://metaoptimize.com/qa/) and [http://stats.stackexchange.com/](http://stats.stackexchange.com/) are Stack Overflow-like discussion forums

### ML Research ###

Machine Learning is a very active field of research. The two most prominent conferences are without a doubt NIPS and ICML. Both sites contain the PDF versions of the papers accepted there; they're a great way to catch up on the most up-to-date research in the field. Other very good conferences include UAI (general AI), COLT (covers theoretical aspects) and AISTATS. Good journals for ML papers are the Journal of Machine Learning Research, the Journal of Machine Learning and arXiv.

### Other sites and Tutorials ###

* [http://datasciencemasters.org/](http://datasciencemasters.org/) is an extensive list of lectures and textbooks for a whole Data Science curriculum
* [http://deeplearning.net/](http://deeplearning.net/)
* [http://en.wikipedia.org/wiki/Machine\_learning](http://en.wikipedia.org/wiki/Machine_learning)
* [http://videolectures.net/Top/Computer\_Science/Machine\_Learning/](http://videolectures.net/Top/Computer_Science/Machine_Learning/)

### FAQ ###

* How much Math/Stats should I know? That depends on how deep you want to go. For a first exposure (e.g. Ng's Coursera class) you won't need much math, but in order to understand how the methods really work, having at least an undergrad level of Statistics, Linear Algebra and Optimization won't hurt.
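
As mentioned in the Programming Languages section above, scikit-learn is the easiest way to get hands-on; here is about the smallest end-to-end example of the train/evaluate workflow you can run (independent of any particular course or book listed above):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and split it into train and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fit a classifier and evaluate it on held-out data.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```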

3
2