TUTORIAL PROGRAM

TUTORIAL 1: TUESDAY, 20 AUGUST 2002, 11:00-13:00
Automation in Information Extraction and Integration
(PDF Presentation Slides - 1.2MB)
OBJECTIVES
Data integration
has always been a problem of acute importance in
applications like data warehousing. The problem is
gaining added momentum with the growing popularity
of web portals like Citeseer that lend structure to
data gleaned from multiple different web pages. In
this tutorial we will discuss how the novel use of
techniques from machine learning and data mining
can automate the previously manual processes of
information extraction, duplicate elimination,
schema mapping and missing value
substitution.
CONTENTS
The tutorial will
focus on two core operations: information
extraction and duplicate elimination. We will show
how to automate these via the application of
classification methods like rule learning and
decision trees, and sequence modeling methods like
hidden Markov models. Issues like feature design,
choice of models, and order of extraction present
interesting design alternatives in such cases. Most
automated methods require labeled data, which itself
takes manual effort to produce. We will review recent
research on exploiting available structured
databases and the techniques of semi-supervised and
active learning to address the problem of sparse
training data.
WHO SHOULD ATTEND
Researchers and
professionals involved in data warehouse cleaning,
data mining preprocessing tools and Internet
portals that integrate web information sources.
ABOUT THE INSTRUCTOR
Sunita Sarawagi
does research in the fields of databases, data
mining, machine learning and data warehousing. She
is a member of the faculty at IIT Bombay. Prior to
that she was a research staff member at IBM Almaden
Research Center. She received her Ph.D. in databases
from the University of California at Berkeley.

TUTORIAL 2: TUESDAY, 20 AUGUST 2002, 14:30-16:00 & 16:30-18:00
Database Tuning: Principles, Experiments and Troubleshooting Techniques
(PDF Presentation Slides - 716K)
OBJECTIVES
- To show that database tuning can be distilled into a set of principles that apply across systems.
- To explain some of those principles and give evidence from experiments performed on the major vendor systems.
- To show the scope of the database tuning problem: from hardware to transaction design to application considerations to schema design.
CONTENTS
- Principles: eliminating start-up costs, partitioning, thinking globally.
- Locking/logging: locking granularity, checkpoint tuning.
- Hardware: RAID, buffer size, controller cache.
- Communication: ODBC vs. native, user-defined functions.
- Electronic commerce: indexes and communication.
- Data warehousing: aggregate targeting and indexes.
- Troubleshooting tools: query plan, performance monitors, event monitors.
WHO SHOULD ATTEND
Designers of
DBMSs, consultants, advanced application
developers, professors.
ABOUT THE INSTRUCTORS
Dennis Shasha is a
professor at NYU's Courant Institute where he does
research on biological pattern discovery for
microarrays, combinatorial pattern matching on
trees and graphs, database tuning, and database
design and algorithms for time series. He has
written or co-written seven books, including some
fun ones. He has a monthly column of mathematical
puzzles in Scientific American and in Dr. Dobb's
Journal.
Philippe Bonnet is an
assistant professor in the computer science
department of the University of Copenhagen (DIKU),
where he does research on database tuning, query
processing and data management over sensor
networks.

TUTORIAL 3: WEDNESDAY, 21 AUGUST 2002, 11:00-13:00
Text Search for Fine-grained Semi-structured Data
(PDF Presentation Slides - 1MB)
OBJECTIVES
Unlike Web search
engines, relational query languages do not
facilitate schema-less keyword searches. This
tutorial will expose the attendees to recent
research results which bridge the gap between these
extremes. Attendees will learn about indexing,
searching, and ranking techniques for
graph-structured data with free-form text in
columns or nodes of the data.
CONTENTS
Inverted indices,
keyword search, vector space model, relevance
ranking; social network analysis, prestige ranking;
graph models for relational and semi-structured
textual data; query models; responding to a keyword
query using a subgraph; ranking nodes, formulations
based on Steiner trees and biased random walks;
relevance feedback in the graph model; search
strategies and performance issues; integrating
multiple repositories and metadata; user interface
issues; research directions.
WHO SHOULD ATTEND
Researchers and
builders of systems for searching relational and
semi-structured databases using
keywords.
ABOUT THE INSTRUCTOR
Soumen Chakrabarti
holds a Ph.D. from U.C. Berkeley. Prior to joining
IIT Bombay he worked at IBM Research on crawling,
searching and mining the Web using its hyperlink
graph structure. He has served as a deputy-chair
for WWW 2002 and ICDE 2003, and as a program committee
member for many conferences, including VLDB, ICDE,
SIGKDD, SIGIR, WWW, and SODA.

TUTORIAL 4: WEDNESDAY, 21 AUGUST 2002, 14:30-16:00 & 16:30-18:00
Application Servers and Associated Technologies
(PDF Presentation Slides - 4.5MB)
C. Mohan
IBM Almaden Research Center, U.S.A.
OBJECTIVES
Application Servers
(ASs), which have become very popular in the last
few years, provide the platforms for the execution
of transactional, server-side applications in the
online world. While transaction processing monitors
(TPMs) have been providing similar functionality
for over 3 decades, ASs are their modern
equivalents. ASs play a central role in enabling
electronic commerce in the web context. The
objective of this tutorial is to provide an
introduction to different ASs and their underlying
technologies for the novice as well as the
experienced person. The intent is to broaden the
background of database people so that they can
better appreciate application requirements and
scenarios.
CONTENTS
- Introduction:
TP Monitors, Evolution of Application
Environments and Requirements, Distributed
Computing Models, Dynamic Web, Business and
Presentation Logic Encapsulation
- Underlying
Technologies: Java 2 Enterprise Edition (J2EE),
Common Object Request Broker Architecture
(CORBA), Enterprise JavaBeans (EJBs), JavaServer
Pages (JSPs), Java Transaction API &
Service (JTA & JTS), Java Message Service
(JMS), Java Database Connectivity (JDBC),
Internet Inter-ORB Protocol (IIOP), Simple
Object Access Protocol (SOAP), Web
Services
- Application
Servers: IBM WebSphere, BEA WebLogic, Oracle9i
Application Server, Sun ONE Application Server
(iPlanet), Microsoft .NET
- Functionality
Attributes: Availability, Scalability, High
Performance, Load Balancing, Embeddability,
Portability, Cloning, Failover
- Benchmarks:
Nile, Trade, ECPerf
- Complementary
Functionality Areas: Commerce, Business to
Business Collaboration, Personalization,
Transcoding, Internationalization, Caching,
Directory Services, Visual/Integrated Software
Development Environments, Transactional
Messaging and Queuing, Edge Servers
- Application
Case Studies: eBay, Schwab
WHO SHOULD ATTEND
This tutorial is
targeted at academic/industrial researchers,
systems designers/implementers and practitioners
who wish to obtain a good understanding of the
state-of-the-art in application servers, especially
those based on J2EE, and their associated
technologies.
ABOUT THE INSTRUCTOR
C. Mohan (Ph.D.
1981, UT-Austin) was named an IBM Fellow in 1997
for being recognized worldwide as a leading
innovator in transaction management. He is the
primary inventor of the ARIES family of recovery
and locking methods, and the industry-standard
Presumed Abort commit protocol. He received the
1996 ACM SIGMOD Innovations Award. At the 1999 VLDB
Conference, he was honored with the 10 Year Best
Paper Award for the widespread commercial and
research impact of his work on ARIES. Mohan, who is
an inventor on 33 patents, works very closely with
numerous IBM product groups. His research results
are implemented in numerous IBM and non-IBM
prototypes and products like DB2, MQSeries, Lotus
Domino, S/390 Parallel Sysplex and SQL Server.
Currently, Mohan is a member of IBM's Application
Integration Middleware Architecture Board and is
working on next generation messaging technologies
and database caching in the context of WebSphere
and DB2.

TUTORIAL 5: THURSDAY, 22 AUGUST 2002, 11:00-12:30 & 14:00-15:30
Querying and Mining Data Streams: You Only Get One Look
(PDF Presentation Slides - 456K)
OBJECTIVES
Continuous data
streams arise naturally, for example, in the
network installations of large Telecom and Internet
service providers where detailed usage information
from different parts of the network needs to be
continuously collected and analyzed for interesting
trends. This tutorial will provide a comprehensive
and clear overview of the key research results
surrounding data stream processing at this point in
time.
CONTENTS
Our discussion will
be structured as follows.
- Introduction:
Basic stream-processing models and
architectures; motivating
applications.
- Basic Stream
Summarization Algorithms: Samples,
quantiles/histograms, sketches, wavelets over
streaming data.
- Processing
Queries on Streams: Using sketches for
self-joins, binary joins, and complex joins over
data streams; estimating correlated aggregates;
using histogram and wavelet synopses for
approximate-query processing.
- Mining
High-speed Data Streams: Single-pass algorithms
for rule discovery, clustering, and
decision-tree construction over
streams.
- Advanced Topics
and Future Research Directions: Hot-list
maintenance; distinct-value estimation;
multi-dimensional synopses; content-based
filtering of streaming XML
documents.
WHO SHOULD ATTEND
This tutorial is
targeted at researchers and practitioners who want
to obtain a solid understanding of the
state-of-the-art in stream query processing and
analysis.
ABOUT THE INSTRUCTORS
Minos Garofalakis
(Ph.D. 1998, UW-Madison) is a Member of Technical
Staff at Bell Labs. His research interests include
data reduction and mining, data streaming,
approximate queries, and XML.
Johannes Gehrke
(Ph.D. 1999, UW-Madison) is an Assistant Professor
at Cornell University. His research interests
include data mining, database systems, and
ubiquitous computing.
Rajeev Rastogi
(Ph.D. 1993, UT-Austin) is a Department Director at
Bell Labs. His research interests include network
management, database systems, and knowledge
discovery.

TUTORIAL 6: THURSDAY, 22 AUGUST 2002, 14:00-15:30 & 16:00-17:30
eBusiness Architectures and Standards
(PDF Presentation Slides - 1.6MB)
OBJECTIVES
eBusiness systems
and solutions enable enterprises to implement their
business processes and turn the enterprises into
real-time enterprises, where customers, partners,
suppliers and employees share data and processes in
real time. The deployment of eBusiness is made real
with the advances in the Internet and eBusiness
technologies and standards. However, there are
numerous standards, sometimes overlapping, often
causing confusion amongst developers and users. In
this tutorial we will study different eBusiness
technologies, standards and their applicability.
CONTENTS
1. History of eBusiness
2. eBusiness Requirements
3. eBusiness Applications
4. eBusiness Architectures
   a. Services Oriented, Process Based architectures
   b. Integration framework
   c. Business objects (e.g., XML schema)
   d. Business Process Management
   e. Access Management
   f. Portal and Presentation
   g. Web Services
5. eBusiness Standards
   a. Business protocols: ebXML, RosettaNet, etc.
   b. XML Schema
   c. Messaging, Web Services: WSDL, SOAP, UDDI
   d. Workflow: BPML, XLANG, WSFL, etc.
   e. Implementation Standards: J2EE vs. .NET
   f. Business Process Design
   g. Security standards (e.g. SAML)
   h. Solutions verticals (e.g. Chemicals, Financials)
6. Case Studies
7. Open issues
WHO SHOULD ATTEND
This tutorial is
very practical and systems oriented. The tutorial
is intended for database/middleware researchers,
implementers, application developers and end users
who want to gain a comprehensive understanding of
eBusiness architectures, technologies (e.g.
workflow, rules) and standards and gain
appreciation for their applicability in building
enterprise solutions.
ABOUT THE INSTRUCTOR
Anil Nori has
considerable experience in building complex
database and eBusiness systems. He is co-founder
and CTO of Asera Inc., which provides eBusiness
solutions supported by a platform and tools for
development, deployment and management of
enterprise business processes. Prior to Asera, Anil
worked as a key database server architect at Oracle
and Digital Equipment Corporation. He also worked
at Computer Corporation of America. Anil has
published papers and presented tutorials at SIGMOD,
VLDB, ICDE and other industry
conferences.

TUTORIAL 7: FRIDAY, 23 AUGUST 2002, 09:00-10:30 & 11:00-13:00
Sensor Data Mining: Similarity Search and Pattern Analysis
(PDF Presentation Slides - 2.5MB)
OBJECTIVES
How can we find
patterns in a sequence of sensor measurements
(e.g., a sequence of temperatures or
water-pollutant measurements)? How can we compress
it? What are the major tools for forecasting and
outlier detection? The objective of this tutorial
is to provide a concise and intuitive overview of
the most important tools that can help us find
patterns in sensor sequences. Sensor data analysis
is becoming increasingly important, due to the
decreasing cost of hardware and the increasing
on-sensor processing abilities. We review the state
of the art in three related fields: (a) fast
similarity search for time sequences, (b) linear
forecasting with the traditional AR
(autoregressive) and ARIMA methodologies and (c)
non-linear forecasting, for chaotic/self-similar
time sequences, using lag-plots and fractals. The
emphasis of the tutorial is to give the intuition
behind these powerful tools, which is usually lost
in the technical literature, as well as to give
case studies that illustrate their practical
use.
CONTENTS
- Similarity Search
  - why we need similarity search
  - distance functions (Euclidean, LP norms, time-warping)
  - fast searching (R-trees, M-trees)
  - feature extraction (DFT, Wavelets, SVD, FastMap)
- Linear Forecasting
  - main idea behind linear forecasting
  - AR methodology
  - multivariate regression
  - Recursive Least Squares
  - de-trending; periodicities
- Non-linear/chaotic forecasting
  - main idea: lag-plots
  - 'fractals' and 'fractal dimensions'
  - definition and intuition
  - algorithms for fast computation
  - case studies
WHO SHOULD ATTEND
Researchers who
want to get up to speed with the major tools in
time sequence analysis. Also, practitioners who
want a concise, intuitive overview of the state of
the art.
ABOUT THE INSTRUCTOR
Christos Faloutsos
is a Professor at Carnegie Mellon University. He
has received the Presidential Young Investigator
Award from the National Science Foundation (1989),
three "best paper" awards (SIGMOD 94, VLDB 97,
KDD 01 runner-up), and four teaching awards. He is a
member of the executive committee of SIGKDD; he has
published over 100 refereed articles and one
monograph, and holds four patents. His research
interests include data mining, fractals, indexing
methods for multimedia and text data bases, and
data base performance.